Hi guys. I just had a drive failure that caused the system to become unresponsive. I received an email alert and checked the volume status, which showed several write errors on one disk. By the time I got to replacing the disk a few minutes later, the system was completely unresponsive and required a reset, coming back online without an issue. I'm running 11.1-U5 and a stripe of two 6 disk raidz2. It seems pretty unreasonable for the system to go down for one failed drive. Is there anything I can do? I can provide any info you need.
Edit:
I checked /var/log/messages and got this. The problem starts here:
The same sequence of errors continue for some time until it reaches:
As you can see, the system reboots. After reboot, the same loop of errors due the bad drive occur until I pull the drive and reset the machine. This is the second time I've had a drive failure in this system, and both times I've had the same unresponsive system with an unintended reboot.
Edit:
I checked /var/log/messages and got this. The problem starts here:
Code:
Jul 29 19:03:53 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 length 8192 SMID 161 Aborting command 0xfffffe0000eee350 Jul 29 19:03:53 freenas mps0: Sending reset from mpssas_send_abort for target ID 15 Jul 29 19:03:57 freenas mps0: Unfreezing devq for target ID 15 Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: Command timeout Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable) Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data) Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable) Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data) Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable) Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data)
The same sequence of errors continue for some time until it reaches:
Code:
Jul 29 19:04:36 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 15 Jul 29 19:04:36 freenas da7 at mps0 bus 0 scbus0 target 15 lun 0 Jul 29 19:04:36 freenas da7: <ATA ST33000651NS G008> s/n Z293JRCRmps0: detached Jul 29 19:04:36 freenas Unfreezing devq for target ID 15 Jul 29 19:04:36 freenas GEOM_MIRROR: Device swap2: provider da7p1 disconnected. Jul 29 19:04:36 freenas (da7:mps0:0:15:0): Periph destroyed Jul 29 19:14:09 freenas mps0: mpssas_ata_id_timeout checking ATA ID command 0xfffffe0000f039e0 sc 0xfffffe0000eb3000 Jul 29 19:14:09 freenas mps0: ATA ID command timeout cm 0xfffffe0000f039e0 Jul 29 19:14:09 freenas mpssas_get_sata_identify: request for page completed with error 0mps0: Sleeping 3 seconds after SATA ID error to wait for spinup Jul 29 19:14:13 freenas mps0: SAS Address for SATA device = 2a030a0a72786579 Jul 29 19:14:13 freenas mps0: SAS Address from SATA device = 2a030a0a72786579 Jul 29 19:14:13 freenas mps0: mpssas_add_device: sending Target Reset for stuck SATA identify command (cm = 0xfffffe0000f039e0) Jul 29 19:14:13 freenas (noperiph:mps0:0:15:0): SMID 3 sending target reset Jul 29 19:14:13 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 15 Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): INQUIRY. CDB: 12 00 00 00 24 00 Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): CAM status: CAM subsystem is busy Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): Retrying command Jul 29 19:14:13 freenas (xpt0:mps0:0:15:ffffffff): SMID 3 recovery finished after target reset Jul 29 19:14:13 freenas mps0: Unfreezing devq for target ID 15 Jul 29 19:14:19 freenas da7 at mps0 bus 0 scbus0 target 15 lun 0 Jul 29 19:14:19 freenas da7: <ATA ST33000651NS G008> Fixed Direct Access SPC-4 SCSI device Jul 29 19:14:19 freenas da7: Serial Number Z293JRCR Jul 29 19:14:19 freenas da7: 600.000MB/s transfers Jul 29 19:14:19 freenas da7: Command Queueing enabled Jul 29 19:14:19 freenas da7: 2861588MB (5860533168 512 byte sectors) Jul 29 19:40:11 freenas syslog-ng[1826]: syslog-ng starting up; version='3.7.3' Jul 29 19:40:11 freenas Copyright (c) 1992-2017 The FreeBSD Project. Jul 29 19:40:11 freenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994 Jul 29 19:40:11 freenas The Regents of the University of California. All rights reserved. Jul 29 19:40:11 freenas FreeBSD is a registered trademark of The FreeBSD Foundation. Jul 29 19:40:11 freenas FreeBSD 11.1-STABLE #0 r321665+e0c4ca60dfc(freenas/11.1-stable): Wed May 30 14:18:20 EDT 2018
As you can see, the system reboots. After reboot, the same loop of errors due the bad drive occur until I pull the drive and reset the machine. This is the second time I've had a drive failure in this system, and both times I've had the same unresponsive system with an unintended reboot.
Last edited: