I have a system that had been running stably on 9.2.1 for over a year. I recently upgraded to 9.3.1 and started experiencing random reboots once or twice a week; before the upgrade this system had an average uptime of more than 100 days.
I checked the logs, but there was nothing relevant in /var/log/messages; it shows normal log messages right up until the seemingly random restart. I also checked the crash dump output in /data/crash; the contents of those dump files are below, and I have attached both files to this post.
Code:
[root@nas] /data/crash# cat info.last
Dump header from device /dev/dumpdev
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 122368B (0 MB)
  Blocksize: 512
  Dumptime: Sun Nov 22 14:40:53 2015
  Hostname: nas.network.paccc.net
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 9.3-RELEASE-p28 #0 r288272+a23e16d: Wed Nov 4 00:20:46 PST 2015
    root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys
    ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
  Panic String: ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
  Dump Parity: 3160697212
  Bounds: 2
  Dump Status: good
Code:
[root@nas] /data/crash# gzip -dc textdump.tar.last.gz | tail -n 15
panic: ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
cpuid = 5
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xfffffe0866dbe300
kdb_backtrace() at kdb_backtrace+0x37/frame 0xfffffe0866dbe3c0
panic() at panic+0x1ce/frame 0xfffffe0866dbe4c0
ctl_check_ooa() at ctl_check_ooa+0xb7/frame 0xfffffe0866dbe520
ctl_work_thread() at ctl_work_thread+0x1f70/frame 0xfffffe0866dbeaa0
fork_exit() at fork_exit+0x11f/frame 0xfffffe0866dbeaf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0866dbeaf0
--- trap 0, rip = 0, rsp = 0xfffffe0866dbebb0, rbp = 0 ---
KDB: enter: panic
panic.txt
ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
version.txt
FreeBSD 9.3-RELEASE-p28 #0 r288272+a23e16d: Wed Nov 4 00:20:46 PST 2015 root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys/FREENAS.amd64
The relevant error message here is:
ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
This error is the same for the previous 4 crash dumps as well.
I looked through the source to find the function it crashed in, at /sys/cam/ctl/ctl.c:10986. It looks like the ctl command queue has an ooa_io command (OOA = Order of Arrival) with an invalid serialization index (14 = CTL_SERIDX_INVLD), while the next blocked I/O command has a valid index (4 = CTL_SERIDX_SYNC), roughly as sketched below.
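For reference, here is my reading of that code path as a stripped-down sketch. This is paraphrased, not the literal source: the variable names are simplified and the exact row/column order of the table lookup may differ; only the CTL_SERIDX_*/CTL_SER_* constants and the panic string come from the source and the dump.
Code:
/*
 * Paraphrased sketch of the serialization check in ctl_check_for_blockage()
 * (around /sys/cam/ctl/ctl.c:10986). NOT the literal source; names and the
 * lookup order are simplified for illustration.
 */
int value;

/* each command's seridx selects a row/column in the 2-D serialization table */
value = ctl_serialize_table[ooa_seridx][pending_seridx];

switch (value) {
case CTL_SER_BLOCK:
	return (CTL_ACTION_BLOCK);
case CTL_SER_PASS:
	return (CTL_ACTION_PASS);
/* ... other valid CTL_SER_* results ... */
default:
	panic("ctl_check_for_blockage: Invalid serialization value %d for %d => %d",
	    value, pending_seridx, ooa_seridx);
}
If I'm reading it right, CTL_SERIDX_INVLD is outside the range of valid table indices, so the lookup reads past the table and picks up whatever happens to be in adjacent memory (1634427759 here), which matches none of the CTL_SER_* cases and falls through to the panic. The underlying question seems to be how an I/O with seridx == CTL_SERIDX_INVLD ended up on the OOA queue in the first place.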
All the LUNs that are managed by ctl are iSCSI devices, but I'm not sure what sequence of iSCSI commands is causing this crash. Here's a list of the devices reported by ctladm:
Code:
[root@nas] /root# ctladm devlist
LUN Backend       Size (Blocks)   BS Serial Number    Device ID
  0 block             524288000  512 0025908524dc060  iSCSI Disk      0025908524dc060
  1 block               4194304  512 0025908524dc080  iSCSI Disk      0025908524dc080
  2 block            4294967296  512 0025908524dc140  iSCSI Disk      0025908524dc140
  3 block            2147483648  512 0025908524dc070  iSCSI Disk      0025908524dc070
  4 block             268435456  512 0025908524dc120  iSCSI Disk      0025908524dc120
  5 block            4294967296  512 0025908524dc040  iSCSI Disk      0025908524dc040
  6 block             268435456  512 0025908524dc130  iSCSI Disk      0025908524dc130
The iSCSI targets are used by VMware v6.0 using the VAAI plugin to enable VAAI primitives. My VMware configuration enables the UNMAP command by enabling the setting /VMFS3/EnableBlockDelete:
esxcfg-advcfg -s 1 /VMFS3/EnableBlockDelete
This setting is disabled by default in VMware because the host can sometimes end up waiting a long time for the iSCSI target to return UNMAP command status. I have it enabled because it frees up unused space on the VMFS filesystem and gives more accurate ZVOL used-space reporting. I'm not sure whether this setting has anything to do with the bug, since UNMAP looks to be supported judging by the source code, but it's a possibility.
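If it would help narrow this down, the setting can be read back or switched off again with the standard esxcfg-advcfg get/set invocations, so it would be easy to run for a while with automatic UNMAP disabled and see whether the panics stop:
Code:
# check the current value of the setting
esxcfg-advcfg -g /VMFS3/EnableBlockDelete
# temporarily disable automatic UNMAP again for testing
esxcfg-advcfg -s 0 /VMFS3/EnableBlockDelete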
If you want me to provide any other information that might be helpful, let me know. I have attached the relevant files from the most recent crash dump to this post. I considered submitting a bug report, but I wanted to check here first in case there is an obvious solution.
Thanks!
Jason