ESXi to FreeNAS iSCSI connection lost

Status
Not open for further replies.

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Having a strange issue; maybe somebody can shed some light on what to look for.

FreeNAS Server: 9.3-STABLE
HP DL380 G5, Intel(R) Xeon(R) CPU E5440 @ 2.83GHz
32GB RAM
iSCSI to 6x 600GB SAS disks in RAID 10
NFS to 4x 500GB SSDs in RAID 10
Only basic services enabled and no plugins installed.

3 ESXi hosts are connected to both datastores via separate gigabit NICs. iSCSI had been working for a few days without issue, but then it experienced a disconnect where it appears the service stopped, and I had to disable and re-enable iSCSI from the FreeNAS GUI to get it to reconnect. There were no issues with NFS when this occurred. I couldn't see any unusual spikes in load from the VMware or FreeNAS side. Any idea what to look for to troubleshoot this?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Aside from looking in the logs for some message to indicate what happened, no.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
There is a memory leak in FreeNAS 9.3 and a fix is due out later this week. It is possible that this hit you. :(
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks guys for the replies. Hopefully it's just a bug in 9.3; at least it's not a critical issue at the moment. Where would the iSCSI logs be in the system? Checked /var/log but couldn't find anything.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Since the iSCSI service is kernel-based, I'm guessing it'd be in the kernel log, but you'd probably need to check syslog.conf to know for sure.
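
Something along these lines should turn up anything relevant (off the top of my head and untested; paths assume the FreeBSD defaults, adjust per your syslog.conf):

# search the main system log for CTL/iSCSI-related entries
grep -iE 'ctl|iscsi' /var/log/messages
# and check the kernel message buffer directly
dmesg | grep -iE 'ctl|iscsi'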
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Just reporting back on the issue. The logs didn't show anything to indicate what happened, and I didn't experience the issue for the rest of the week. However, this morning the NFS service stopped, then the server rebooted itself when I attempted to log in to the FreeNAS server. And of course, the logs have no information after the reboot. Both issues occurred on a Monday around the same time (11 AM). I don't have any services or snapshots scheduled at that time. Seems like there are some issues with 9.3; going to try another OS.
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Looks like scrubs are enabled by default on Sunday of every week. Hmm, wondering if that's causing the issue. I have a P400 (iSCSI) and an LSI 9240-8i (SSDs).
 
L

Guest
I found this a while back: the iSCSI log level is very low by default. You can set it higher with # sysctl kern.cam.ctl.debug=7

Make sure you turn it back to 0 or 1 when you are done debugging; it is set to 0 by default.
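
Roughly, the whole cycle looks like this (just a sketch; run as root, and I believe you can also add it as a tunable in the GUI if you want it to survive a reboot):

# raise the CTL (iSCSI target) debug level while reproducing the problem
sysctl kern.cam.ctl.debug=7
# ...reproduce the disconnect and watch the logs...
# then drop it back to the default so the log doesn't flood
sysctl kern.cam.ctl.debug=0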
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks for the suggestion Linda, but it doesn't look like it's isolated to iSCSI, as the entire server was affected this time. HB may be on to something regarding the scrubbing of the disks. Going to back up and try a manual scrub when I get the chance to see if it reproduces the issue.
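
If I understand it right, kicking one off manually from the shell is just something like this (the pool name below is a placeholder for whatever my pool is actually called):

# start a manual scrub on the pool
zpool scrub tank
# check progress and any errors it turns up
zpool status -v tank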
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Looks like scrubs are enabled by default on Sunday of every week. Hmm, wondering if that's causing the issue. I have a P400 (iSCSI) and an LSI 9240-8i (SSDs).

Very likely the P400 is at fault here. It has known issues with hanging entire systems under heavy I/O, which a scrub would definitely trigger.

In addition, it's not an HBA, so unless you've made a hardware RAID 10 and passed the raw device as an iSCSI extent, you've probably got ZFS on top of hardware RAID as well, which is bad juju.
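
If you want to confirm what FreeNAS actually sees behind the P400, something like this from the shell should show it (stock FreeBSD commands):

# list the devices the OS sees; one big logical volume instead of
# six individual SAS disks means hardware RAID is in the path
camcontrol devlist
# and show what the pool is actually built on
zpool status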
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Thanks HB, good to know. Going to rebuild the RAID and avoid software RAID on the P400 altogether, and also going to run memtest on the server just in case of bad RAM (I saw a post for OmniOS and NAS4Free regarding scrubs causing reboots, but no cause or solution was provided).
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Check the HP IML (Integrated Management Log) in the BIOS as well; if you've got failing RAM it should have an error there (and should point out the bad DIMM by proc/slot), but my inclination is that it's the P400.

By "avoid software RAID" do you mean "remove it from the system," or do hardware RAID 10 with the device mapped as an extent directly? In the latter case you won't get any of the ZFS benefits (ARC, cached writes, self-healing).
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
I'll check the HP logs as well, but you're right, it's most likely not a RAM issue.

Will do hardware RAID 10 on the P400 and map it with NFS rather than iSCSI, as everything ran without issue with that setup.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Yeah, it's true; maybe without the need for ZFS to do the RAID calculations, that's why it was more stable? Device mapping is an option as well, so I guess there are a couple of things to play around with.
 

Zhaolo

Cadet
Joined
Apr 14, 2015
Messages
9
Reporting back: no issues after the changes made to the disks on the P400. Couldn't verify whether the mix of software/hardware RAID caused the issue, but it's most likely the cause, since the system has been stable after the changes. For some reason the speed of iSCSI device mapping is slower compared to the previous setup, but it's stable, which is a lot better than mystery reboots.
 