zfs & zpool commands, web ui deadlock upon single drive failure

Status
Not open for further replies.

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
Hi folks

I recently had a single drive fail:

Apr 4 19:45:02 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 80 60 a4 08 00 00 58 00
Apr 4 19:45:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:45:02 v1 (da8:mps0:0:17:0): Error 5, Retries exhausted
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): CAM status: SCSI Status Error
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): SCSI status: Check Condition
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:15,1 (Mechanical positioning error)
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Info: 0x9cb2b84e
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Field Replaceable Unit: 131
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Actual Retry Count: 24
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Retrying command (per sense data)
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 be 87 00 00 00 01 00 00 length 512 SMID 586 command timeout cm 0xfffffe0000f2c120 ccb 0xfffff81f6e4dc800
Apr 4 19:46:02 v1 (noperiph:mps0:0:4294967295:0): SMID 68 Aborting command 0xfffffe0000f2c120
Apr 4 19:46:02 v1 mps0: Sending reset from mpssas_send_abort for target ID 17
Apr 4 19:46:02 v1 mps0: (da8:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 be 87 00 00 00 01 00 00
Apr 4 19:46:02 v1 Unfreezing devq for target ID 17
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): Retrying command
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00 length 131072 SMID 871 command timeout cm 0xfffffe0000f43730 ccb 0xfffff8083964e800
Apr 4 19:46:26 v1 (noperiph:mps0:0:4294967295:0): SMID 69 Aborting command 0xfffffe0000f43730
Apr 4 19:46:26 v1 mps0: Sending reset from mpssas_send_abort for target ID 17
Apr 4 19:46:26 v1 mps0: (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00
Apr 4 19:46:26 v1 Unfreezing devq for target ID 17
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): Retrying command
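
For anyone wanting to tally errors like these per device, here is a minimal Python sketch. The regular expression and the function name `summarize_cam_status` are my own assumptions for illustration, not part of any FreeNAS tooling. It only looks at the "CAM status:" lines in the format shown above.

```python
import re
from collections import Counter

# Matches the device name and status text in lines such as
# "Apr 4 19:45:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout".
# The pattern is an assumption based only on the log excerpt in this post.
CAM_STATUS = re.compile(
    r"\((?P<dev>[a-z]+\d+):[^)]*\): CAM status: (?P<status>.+)$"
)

def summarize_cam_status(lines):
    """Count CAM status messages per (device, status) pair."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line.strip())
        if match:
            counts[(match.group("dev"), match.group("status"))] += 1
    return counts

# The four "CAM status" lines from the excerpt above.
sample = [
    "Apr 4 19:45:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout",
    "Apr 4 19:45:26 v1 (da8:mps0:0:17:0): CAM status: SCSI Status Error",
    "Apr 4 19:46:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout",
    "Apr 4 19:46:26 v1 (da8:mps0:0:17:0): CAM status: Command timeout",
]

for (dev, status), n in sorted(summarize_cam_status(sample).items()):
    print(f"{dev}: {status} x{n}")
```

Feeding it the full contents of the system log instead of `sample` should work the same way, as long as the lines keep this format.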


Since this started, the system deadlocks whenever I try to run the zfs or zpool commands. Clicking the red "Alerts" icon in the top right corner of the FreeNAS web UI deadlocks the web UI as well (presumably it calls one of the deadlocking z* commands under the hood). The system otherwise seems to be working OK: it's serving NFS and CIFS fine, and I can make new ssh connections.
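
One way to check whether a command hangs without tying up the shell is to run it with a timeout. Below is a rough Python sketch. The helper name `probe` and the 10-second default are assumptions of mine. Note also that a process blocked inside the kernel may not respond to the kill signal, so this only keeps the caller from waiting forever and does not guarantee the stuck process goes away.

```python
import subprocess

def probe(cmd, timeout_s=10.0):
    """Run cmd and report whether it finished within timeout_s seconds.

    Returns (True, returncode) if the command exited in time,
    or (False, None) if it was still running when the timeout expired.
    """
    proc = subprocess.Popen(
        cmd,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # keep the child out of our process group
    )
    try:
        return True, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: we send SIGKILL but deliberately do not wait,
        # because a process stuck in the kernel may never exit.
        proc.kill()
        return False, None

# Example use on the affected box (not run here):
#   finished, rc = probe(["zpool", "status", "-x"])
#   print("returned" if finished else "still hanging after timeout", rc)
```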

I had not seen this behavior before upgrading to FreeNAS 9.10. I had a drive fail under 9.3, and it handled it like a champ. Is there some issue with 9.10 that keeps it from gracefully handling drive failures even when there is ample redundancy?
 

dlavigne

Guest
I wonder if there's a difference in the mps driver... It's worth creating a bug report at bugs.freenas.org and posting the issue number here. When creating the bug report, include a debug for the devs (you can make that from System -> Advanced -> Save Debug).
 

Will Dormann

Explorer
Joined
Feb 10, 2015
Messages
61
I've created https://bugs.freenas.org/issues/14451
It's outlined in the above bug, but the zfs/zpool commands did eventually come back to life. I left the system running overnight, and when I tried again this morning, everything was fine (in that the "zfs" and "zpool" commands return immediately). I suspect the commands hang during the 90-minute window between the drive going unresponsive and the zpool being declared degraded. With my last drive failure under 9.3, I wasn't so quick to get in and test things, so I'm not 100% sure that this is a 9.10 regression.
 