Will Dormann (Explorer · Joined Feb 10, 2015 · Messages: 61)
Hi folks
I recently had a single drive fail:
Apr 4 19:45:02 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 80 60 a4 08 00 00 58 00
Apr 4 19:45:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:45:02 v1 (da8:mps0:0:17:0): Error 5, Retries exhausted
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): CAM status: SCSI Status Error
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): SCSI status: Check Condition
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): SCSI sense: Deferred error: HARDWARE FAILURE asc:15,1 (Mechanical positioning error)
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Info: 0x9cb2b84e
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Field Replaceable Unit: 131
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Actual Retry Count: 24
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Descriptor 0x80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Apr 4 19:45:26 v1 (da8:mps0:0:17:0): Retrying command (per sense data)
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 be 87 00 00 00 01 00 00 length 512 SMID 586 command timeout cm 0xfffffe0000f2c120 ccb 0xfffff81f6e4dc800
Apr 4 19:46:02 v1 (noperiph:mps0:0:4294967295:0): SMID 68 Aborting command 0xfffffe0000f2c120
Apr 4 19:46:02 v1 mps0: Sending reset from mpssas_send_abort for target ID 17
Apr 4 19:46:02 v1 mps0: (da8:mps0:0:17:0): READ(16). CDB: 88 00 00 00 00 01 d1 c0 be 87 00 00 00 01 00 00
Apr 4 19:46:02 v1 Unfreezing devq for target ID 17
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:46:02 v1 (da8:mps0:0:17:0): Retrying command
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00 length 131072 SMID 871 command timeout cm 0xfffffe0000f43730 ccb 0xfffff8083964e800
Apr 4 19:46:26 v1 (noperiph:mps0:0:4294967295:0): SMID 69 Aborting command 0xfffffe0000f43730
Apr 4 19:46:26 v1 mps0: Sending reset from mpssas_send_abort for target ID 17
Apr 4 19:46:26 v1 mps0: (da8:mps0:0:17:0): READ(10). CDB: 28 00 00 40 00 80 00 01 00 00
Apr 4 19:46:26 v1 Unfreezing devq for target ID 17
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): CAM status: Command timeout
Apr 4 19:46:26 v1 (da8:mps0:0:17:0): Retrying command
Since this started, the system deadlocks whenever I try to run the zfs or zpool commands. The FreeNAS web UI has also deadlocked after I clicked the red "Alerts" icon in the top-right corner (presumably it calls one of the hanging z* commands under the hood). The system otherwise seems to be working OK: it's still serving NFS and CIFS fine, and I can make new ssh connections.
I never saw this behavior before upgrading to FreeNAS 9.10. I had a drive fail under 9.3, and it handled it like a champ. Is there some issue in 9.10 that causes it to not handle drive failures gracefully, even when there is ample redundancy?
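For anyone hitting the same symptom, here is one way to confirm whether the z* commands are truly wedged in the kernel rather than just slow. This is a diagnostic sketch assuming a FreeBSD-based system like FreeNAS; the `ps` keywords are standard, but the exact wait-channel names in the output will vary.

```shell
# List any zfs/zpool processes with their state and kernel wait channel.
# A state containing "D" (uninterruptible disk wait) together with a
# ZFS-related wchan suggests the command is blocked inside the kernel
# behind the failing disk, not merely busy.
ps axo pid,state,wchan,comm | awk '$4 == "zfs" || $4 == "zpool"'

# On FreeBSD, procstat -kk <pid> dumps the kernel stack of a stuck
# process, showing exactly where in ZFS it is blocked. Substitute the
# PID reported by the command above:
# procstat -kk 1234
```

If every zfs/zpool invocation shows the same wait channel, that points at a single kernel lock held by a thread stuck on I/O to the failed drive, which would explain why the web UI hangs too.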