Instability of iSCSI?

Status
Not open for further replies.

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
Hi,

I'm using FN9.3 (latest updates applied), connected to 5 ESXi hosts via iSCSI.
I regularly observe the following messages in the log (the log with all messages [some duplicates removed] is attached below):

> WARNING: 1.0.1.15 (iqn.0000-00.xxx.PROD-H1:0): connection error; dropping connection
> ctl_datamove: tag 0x179ce5 on (6:4:0:1) aborted

Yesterday I found something new:

> (17:11:5/7): WRITE(10). CDB: 2a 00 1d e9 8d 28 00 00 02 00
> (6:4:1/0): Tag: 0x179cea, type 1
> (17:11:5/7): Tag: 0x11e1d8, type 1
> (12:4:1/0): WRITE(10). CDB: 2a 00 0d c5 b0 18 00 01 00 00
> (17:11:5/7): ctl_process_done: 150 seconds
> (12:4:1/0): Tag: 0x178ef7, type 1
> (12:4:1/0): ctl_datamove: 137 seconds

From time to time (let's say once a week) the UI freezes for several minutes (all UI statistics just show a gap), while iSCSI, the console and SSH keep working correctly during this event. After a few minutes the UI works fine again.

The solution I found -- if we can call it a solution -- is to restart iSCSI.
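For reference, toggling the service off and on in the GUI (Services -> iSCSI) is what recovers it for me. From the shell I believe the FN9.3 equivalent is something like the line below, but the service name is my assumption, so verify it on your own box first:

service ctld restart    # assumed name of the kernel iSCSI target service on FN9.3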

Previously FN9.2.1.5 was used and everything worked fine, but of course it had neither CTL nor the performance that FN9.3 has.

FN is installed on the following hardware:
i7 4280K / 64 GB RAM / 14x WD Red 2 TB HDD as RAID10 + 120 GB SSD L2ARC / LSI 9260-16i / 4x NIC as multipath with jumbo frames (MTU 9000)

Please, can you tell me what these messages indicate?
How can I analyse their cause in more depth?
 

Attachments

  • log.txt
    28.9 KB

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, aside from the fact that you seem to have done all the right things with your hardware except the choice of CPU (non-ECC), jumbo frames aren't recommended. They can cause all sorts of nastiness that looks like random problems caused by hitting edge cases, and those are damn near impossible to pinpoint.

I don't know what ESXi version you are using, but I will tell you that ESXi 5.1 supports a max of 9000 MTU, and if you've set MTU to 9000 on FreeNAS you definitely have mismatched MTUs. So I'd try setting everything to defaults. ;)
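If you want to sanity-check for a mismatch before changing anything, a don't-fragment ping at full jumbo payload size usually exposes it quickly. This is only a sketch -- the interface name igb0 and the addresses are placeholders for whatever your setup uses, and 8972 is 9000 minus the 28 bytes of IP+ICMP headers:

# on each ESXi host, against the FreeNAS portal IP:
vmkping -d -s 8972 <FreeNAS portal IP>
esxcli network ip interface list      # shows the MTU configured on each vmkernel port

# on FreeNAS, against the ESXi vmkernel IP:
ping -D -s 8972 <ESXi vmkernel IP>
ifconfig igb0                         # look for "mtu 9000" in the output

If any hop in between (including the physical switch) isn't really passing 9000-byte frames, the don't-fragment pings will fail while normal pings still work.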
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
To complete the information:
- ESXi 5.5 (latest updates)
- the same hardware configuration worked well with FN9.2, without these kinds of problems

Of course I can try to turn off jumbo, but as this is a production installation I would rather understand the problem first before I start "trying" things.

Can you tell me what the messages above mean?
How can I find out more about their cause?
If you suggest turning off jumbo, is that because FN9.3 treats it differently than FN9.2, or just because you know from experience that it is a bad idea in general?
And at the end... jumbo set to 9000 on FN and to 9000 in ESXi still means an incompatibility? Why?
 

Dave Genton

Contributor
Joined
Feb 27, 2014
Messages
133
I JUST posted about having the same issue! I've been running iSCSI for 2 years on this FreeNAS box with these clients. After an update to 9.3 the week before last, this started for all globalSAN OS X clients, but VMware is working fine. Yes, it does it both ways, jumbo MTU or not. Nobody here likes jumbo MTUs, and being a network engineer I know that issue better than anyone, but on an isolated VLAN for storage, using jumbos with VMware is best practice. Nonetheless I have it running both ways, on 2 servers, both logging the same as yours, and the OS X client won't stay up for 30 seconds before ejecting and then returning, non-stop until I stop it. Running VMs on ESXi 5.5 via iSCSI on the same FreeNAS box works without issue.

dave
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Have you filed a detailed bug report? The best chance of helping developers duplicate an issue and all that...
 

Jano

Dabbler
Joined
Jan 7, 2014
Messages
31
That's really crazy... turning jumbo off really did solve all these problems, or at least I no longer see them in the logs.

I also don't see any performance degradation and no more UI freezes (but that may be the result of recent FN updates), so finally I will stay with MTU 1500.

Maybe one observation at the end... the L2ARC is warming up more slowly than with MTU 9000... but with my workload it sits at about a 31% hit rate anyway (while the ARC is at a minimum of 66%), so I think there is no reason to worry.
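In case it is useful, the hit rates above come from the ZFS ARC counters. A minimal way to pull the raw numbers, assuming the stock FreeBSD sysctl names (I assume the FN reporting graphs read the same kstats):

sysctl kstat.zfs.misc.arcstats.hits kstat.zfs.misc.arcstats.misses
sysctl kstat.zfs.misc.arcstats.l2_hits kstat.zfs.misc.arcstats.l2_misses

# ARC hit rate   = hits / (hits + misses)
# L2ARC hit rate = l2_hits / (l2_hits + l2_misses)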
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Cuz jumbo sux.

No, seriously, I spent years trying to maintain a jumbo-capable infrastructure, and it always comes back to the problem that jumbo is just much less robust than standard Ethernet. With drivers using different codepaths and strategies for dealing with jumbo, many problems appear only when the entire system is put under heavy load with large traffic, and the specific thing misbehaving becomes very difficult to locate and fix.

You're probably wrong that you don't see performance degradation, but it is probably only a few percent, which is small enough to be almost unnoticeable. If you're like me, turning off a feature that OUGHT to work and OUGHT to be awesome is a repugnant solution to the problem, but after spending too much time using your head as a wrecking ball against a brick wall, the answer becomes obvious.

I've said it before: Screw jumbo.
 