ESXi to FreeNAS iSCSI connection lost

Status
Not open for further replies.

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Having a strange issue; maybe somebody can shed some light on what to look for.

FreeNAS Server: 9.3-STABLE
HP DL380 G5, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
32GB RAM
iSCSI to 6x 600GB SAS disks in RAID 10
NFS to 4x 500GB SSDs in RAID 10
Only basic services enabled and no plugins installed.

3 ESXi hosts are connected to both datastores via separate gigabit NICs. iSCSI had been working for a few days without issue, but then it experienced a disconnect where it appears the service stopped, and I had to disable and re-enable iSCSI from the FreeNAS GUI to get it to reconnect. There were no issues with NFS when this occurred. I couldn't see any unusual spikes in load from the VMware or FreeNAS side. Any idea what to look for to troubleshoot this?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Aside from looking in the logs for some message to indicate what happened, no.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There is a memory leak in FreeNAS 9.3 and a fix is due out later this week. It is possible that this hit you. :(
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks guys for the replies. Hopefully it's just a bug in 9.3; at least it's not a critical issue at the moment. Where would the iSCSI logs be in the system? Checked /var/log but couldn't find anything.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Since the iSCSI service is kernel-based, I'm guessing it'd be in the kernel log, but you'd probably need to check syslog.conf to know for sure.
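
Something along these lines should turn up anything relevant (off the top of my head and untested; paths assume the FreeBSD defaults, adjust per your syslog.conf):

# search the main system log for CTL/iSCSI-related entries
grep -iE 'ctl|iscsi' /var/log/messages
# and check the kernel message buffer directly
dmesg | grep -iE 'ctl|iscsi'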
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Just reporting back on the issue. The logs didn't show anything to indicate what happened, and I didn't experience the issue for the rest of the week. However, this morning the NFS service stopped, then the server rebooted itself when I attempted to log in to the FreeNAS server. And of course, the logs have no information after the reboot. Both issues occurred on a Monday around the same time (11 AM). I don't have any services or snapshots scheduled at that time. Seems like there are some issues with 9.3; going to try another OS.
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Looks like scrubs are enabled by default on Sunday of every week. Hmm, wondering if that's causing the issue. I have a P400 (iSCSI) and an LSI 9240-8i (SSDs).
 
L

Guest
I found this a while back: the iSCSI log level is very low by default. You can set it higher with # sysctl kern.cam.ctl.debug=7

Make sure you turn it back to 0 or 1 when you are done debugging; it is set to 0 by default.
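
Roughly, the whole cycle looks like this (just a sketch; run as root, and I believe you can also add it as a tunable in the GUI if you want it to survive a reboot):

# raise the CTL (iSCSI target) debug level while reproducing the problem
sysctl kern.cam.ctl.debug=7
# ...reproduce the disconnect and watch the logs...
# then drop it back to the default so the log doesn't flood
sysctl kern.cam.ctl.debug=0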
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks for the suggestion Linda, but it doesn't look like it's isolated to iSCSI, as the entire server was affected this time. HB may be on to something regarding the scrubbing of the disks. Going to back up and try a manual scrub when I get the chance to see if it reproduces the issue.
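
If I understand it right, kicking one off manually from the shell is just something like this (the pool name below is a placeholder for whatever my pool is actually called):

# start a manual scrub on the pool
zpool scrub tank
# check progress and any errors it turns up
zpool status -v tank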
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Looks like scrubs are enabled by default on Sunday of every week. Hmm, wondering if that's causing the issue. I have a P400 (iSCSI) and an LSI 9240-8i (SSDs).

Very likely the P400 is at fault here. It has known issues with hanging entire systems under heavy I/O, which a scrub would definitely trigger.

In addition, it's not an HBA, so unless you've made a hardware RAID 10 and passed the raw device as an iSCSI extent, you've probably got ZFS on top of hardware RAID as well, which is bad juju.
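
If you want to confirm what FreeNAS actually sees behind the P400, something like this from the shell should show it (stock FreeBSD commands):

# list the devices the OS sees; one big logical volume instead of
# six individual SAS disks means hardware RAID is in the path
camcontrol devlist
# and show what the pool is actually built on
zpool status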
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks HB, good to know. Going to rebuild the RAID and avoid software RAID on the P400 altogether, and also going to run memtest on the server just in case of bad RAM (I saw a post for OmniOS and NAS4Free regarding scrubs causing reboots, but no cause or solution was provided).
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Check the HP IML (Integrated Management Log) in the BIOS as well; if you've got failing RAM it should have an error there (and should point out the bad DIMM by proc/slot), but my inclination is that it's the P400.

By "avoid software RAID" do you mean "remove it from the system," or do hardware RAID 10 with the device mapped as an extent directly? In the latter case you won't get any of the ZFS benefits (ARC, cached writes, self-healing).
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
I'll check the HP logs as well, but you're right, it's most likely not a RAM issue.

Will do hardware RAID 10 on the P400 and map it with NFS rather than iSCSI, as everything ran without issue with that setup.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Yeah, it's true; maybe without the need for ZFS to do the RAID calculations, that's why it was more stable? Device mapping is an option as well, so I guess there are a couple of things to play around with.
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Reporting back: no issues after the changes made to the disks on the P400. Couldn't verify whether the mix of software/hardware RAID caused the issue, but it's most likely the cause, since the system has been stable after the changes. For some reason the speed of iSCSI device mapping is slower compared to the previous setup, but it's stable, which is a lot better than mystery reboots.
 