I have a system that had been running stably on 9.2.1 for over a year. I recently upgraded to 9.3.1 and started experiencing random reboots once or twice a week; before the upgrade this system had an average uptime of more than 100 days.
I checked the logs, but there was nothing relevant in /var/log/messages; it shows normal log messages right up until the seemingly random restart. I also checked the crash dump output in /data/crash; the contents of those dump files are below, and I have attached both files to this post.
Code:
[root@nas] /data/crash# cat info.last
Dump header from device /dev/dumpdev
  Architecture: amd64
  Architecture Version: 1
  Dump Length: 122368B (0 MB)
  Blocksize: 512
  Dumptime: Sun Nov 22 14:40:53 2015
  Hostname: nas.network.paccc.net
  Magic: FreeBSD Text Dump
  Version String: FreeBSD 9.3-RELEASE-p28 #0 r288272+a23e16d: Wed Nov 4 00:20:46 PST 2015
    root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys
    ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
  Panic String: ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
  Dump Parity: 3160697212
  Bounds: 2
  Dump Status: good
Code:
[root@nas] /data/crash# gzip -dc textdump.tar.last.gz | tail -n 15
panic: ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
cpuid = 5
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a/frame 0xfffffe0866dbe300
kdb_backtrace() at kdb_backtrace+0x37/frame 0xfffffe0866dbe3c0
panic() at panic+0x1ce/frame 0xfffffe0866dbe4c0
ctl_check_ooa() at ctl_check_ooa+0xb7/frame 0xfffffe0866dbe520
ctl_work_thread() at ctl_work_thread+0x1f70/frame 0xfffffe0866dbeaa0
fork_exit() at fork_exit+0x11f/frame 0xfffffe0866dbeaf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0866dbeaf0
--- trap 0, rip = 0, rsp = 0xfffffe0866dbebb0, rbp = 0 ---
KDB: enter: panic
panic.txt
ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
version.txt
FreeBSD 9.3-RELEASE-p28 #0 r288272+a23e16d: Wed Nov 4 00:20:46 PST 2015 root@build3.ixsystems.com:/tank/home/stable-builds/FN/objs/os-base/amd64/tank/home/stable-builds/FN/FreeBSD/src/sys/FREENAS.amd64
The relevant error message here is:
ctl_check_for_blockage: Invalid serialization value 1634427759 for 4 => 14
This error is the same for the previous 4 crash dumps as well.
I looked through the source to find the function it crashed in, at /sys/cam/ctl/ctl.c:10986. It looks like the ctl command queue has an ooa_io command (OOA = Order of Arrival) with an invalid serialization index (14 = CTL_SERIDX_INVLD), while the next blocked I/O command has a valid index (4 = CTL_SERIDX_SYNC), roughly as sketched below.
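For reference, here is my reading of that code path as a stripped-down sketch. This is paraphrased, not the literal source: the variable names are simplified and the exact row/column order of the table lookup may differ; only the CTL_SERIDX_*/CTL_SER_* constants and the panic string come from the source and the dump.
Code:
/*
 * Paraphrased sketch of the serialization check in ctl_check_for_blockage()
 * (around /sys/cam/ctl/ctl.c:10986). NOT the literal source; names and the
 * lookup order are simplified for illustration.
 */
int value;

/* each command's seridx selects a row/column in the 2-D serialization table */
value = ctl_serialize_table[ooa_seridx][pending_seridx];

switch (value) {
case CTL_SER_BLOCK:
	return (CTL_ACTION_BLOCK);
case CTL_SER_PASS:
	return (CTL_ACTION_PASS);
/* ... other valid CTL_SER_* results ... */
default:
	panic("ctl_check_for_blockage: Invalid serialization value %d for %d => %d",
	    value, pending_seridx, ooa_seridx);
}
If I'm reading it right, CTL_SERIDX_INVLD is outside the range of valid table indices, so the lookup reads past the table and picks up whatever happens to be in adjacent memory (1634427759 here), which matches none of the CTL_SER_* cases and falls through to the panic. The underlying question seems to be how an I/O with seridx == CTL_SERIDX_INVLD ended up on the OOA queue in the first place.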
All the LUNs that are managed by ctl are iSCSI devices, but I'm not sure what sequence of iSCSI commands is causing this crash. Here's a list of the devices reported by ctladm:
Code:
[root@nas] /root# ctladm devlist
LUN Backend       Size (Blocks)   BS Serial Number    Device ID
  0 block             524288000  512 0025908524dc060  iSCSI Disk      0025908524dc060
  1 block               4194304  512 0025908524dc080  iSCSI Disk      0025908524dc080
  2 block            4294967296  512 0025908524dc140  iSCSI Disk      0025908524dc140
  3 block            2147483648  512 0025908524dc070  iSCSI Disk      0025908524dc070
  4 block             268435456  512 0025908524dc120  iSCSI Disk      0025908524dc120
  5 block            4294967296  512 0025908524dc040  iSCSI Disk      0025908524dc040
  6 block             268435456  512 0025908524dc130  iSCSI Disk      0025908524dc130
The iSCSI targets are used by VMware v6.0 using the VAAI plugin to enable VAAI primitives. My VMware configuration enables the UNMAP command by enabling the setting /VMFS3/EnableBlockDelete:
esxcfg-advcfg -s 1 /VMFS3/EnableBlockDelete
This setting is disabled by default in VMware because the host can sometimes end up waiting a long time for the iSCSI target to return UNMAP command status. I have it enabled because it frees up unused space on the VMFS filesystem and gives more accurate ZVOL used-space reporting. I'm not sure whether this setting has anything to do with the bug, since UNMAP looks to be supported judging by the source code, but it's a possibility.
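If it would help narrow this down, the setting can be read back or switched off again with the standard esxcfg-advcfg get/set invocations, so it would be easy to run for a while with automatic UNMAP disabled and see whether the panics stop:
Code:
# check the current value of the setting
esxcfg-advcfg -g /VMFS3/EnableBlockDelete
# temporarily disable automatic UNMAP again for testing
esxcfg-advcfg -s 0 /VMFS3/EnableBlockDelete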
If you want me to provide any other information that might be helpful, let me know. I have attached the relevant files from the most recent crash dump to this post. I considered submitting a bug report, but I wanted to check here first in case there is an obvious solution.
Thanks!
Jason