System unresponsive after drive failure

u238 · Jul 29, 2018

Hi guys. I just had a drive failure that caused the system to become unresponsive. I received an email alert and checked the volume status, which showed several write errors on one disk. By the time I got to replacing the disk a few minutes later, the system was completely unresponsive and required a reset, coming back online without an issue. I'm running 11.1-U5 and a stripe of two 6 disk raidz2. It seems pretty unreasonable for the system to go down for one failed drive. Is there anything I can do? I can provide any info you need.

Edit:

I checked /var/log/messages and got this. The problem starts here:

Code:

Jul 29 19:03:53 freenas	 (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00 length 8192 SMID 161 Aborting command 0xfffffe0000eee350
Jul 29 19:03:53 freenas mps0: Sending reset from mpssas_send_abort for target ID 15
Jul 29 19:03:57 freenas mps0: Unfreezing devq for target ID 15
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: Command timeout
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data)
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data)
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): WRITE(10). CDB: 2a 00 39 a7 a0 d0 00 00 10 00
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): CAM status: SCSI Status Error
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI status: Check Condition
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Jul 29 19:03:57 freenas (da7:mps0:0:15:0): Retrying command (per sense data)

The same sequence of errors continue for some time until it reaches:

Code:

Jul 29 19:04:36 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 15
Jul 29 19:04:36 freenas da7 at mps0 bus 0 scbus0 target 15 lun 0
Jul 29 19:04:36 freenas da7: <ATA ST33000651NS G008> s/n Z293JRCRmps0:  detached
Jul 29 19:04:36 freenas Unfreezing devq for target ID 15
Jul 29 19:04:36 freenas GEOM_MIRROR: Device swap2: provider da7p1 disconnected.
Jul 29 19:04:36 freenas (da7:mps0:0:15:0): Periph destroyed
Jul 29 19:14:09 freenas mps0: mpssas_ata_id_timeout checking ATA ID command 0xfffffe0000f039e0 sc 0xfffffe0000eb3000
Jul 29 19:14:09 freenas mps0: ATA ID command timeout cm 0xfffffe0000f039e0
Jul 29 19:14:09 freenas mpssas_get_sata_identify: request for page completed with error 0mps0: Sleeping 3 seconds after SATA ID error to wait for spinup
Jul 29 19:14:13 freenas mps0: SAS Address for SATA device = 2a030a0a72786579
Jul 29 19:14:13 freenas mps0: SAS Address from SATA device = 2a030a0a72786579
Jul 29 19:14:13 freenas mps0: mpssas_add_device: sending Target Reset for stuck SATA identify command (cm = 0xfffffe0000f039e0)
Jul 29 19:14:13 freenas	 (noperiph:mps0:0:15:0): SMID 3 sending target reset
Jul 29 19:14:13 freenas mps0: mpssas_action_scsiio: Freezing devq for target ID 15
Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): INQUIRY. CDB: 12 00 00 00 24 00
Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): CAM status: CAM subsystem is busy
Jul 29 19:14:13 freenas (probe0:mps0:0:15:0): Retrying command
Jul 29 19:14:13 freenas	 (xpt0:mps0:0:15:ffffffff): SMID 3 recovery finished after target reset
Jul 29 19:14:13 freenas mps0: Unfreezing devq for target ID 15
Jul 29 19:14:19 freenas da7 at mps0 bus 0 scbus0 target 15 lun 0
Jul 29 19:14:19 freenas da7: <ATA ST33000651NS G008> Fixed Direct Access SPC-4 SCSI device
Jul 29 19:14:19 freenas da7: Serial Number Z293JRCR
Jul 29 19:14:19 freenas da7: 600.000MB/s transfers
Jul 29 19:14:19 freenas da7: Command Queueing enabled
Jul 29 19:14:19 freenas da7: 2861588MB (5860533168 512 byte sectors)
Jul 29 19:40:11 freenas syslog-ng[1826]: syslog-ng starting up; version='3.7.3'
Jul 29 19:40:11 freenas Copyright (c) 1992-2017 The FreeBSD Project.
Jul 29 19:40:11 freenas Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
Jul 29 19:40:11 freenas	 The Regents of the University of California. All rights reserved.
Jul 29 19:40:11 freenas FreeBSD is a registered trademark of The FreeBSD Foundation.
Jul 29 19:40:11 freenas FreeBSD 11.1-STABLE #0 r321665+e0c4ca60dfc(freenas/11.1-stable): Wed May 30 14:18:20 EDT 2018

As you can see, the system reboots. After reboot, the same loop of errors due the bad drive occur until I pull the drive and reset the machine. This is the second time I've had a drive failure in this system, and both times I've had the same unresponsive system with an unintended reboot.

kdragon75 · Jul 29, 2018

We need the FULL DETAILS of ALL of your hardware INCLUDING MODEL number please! Also include the output of the following commands; camcontrol devlist, zpool list -v, and lspci!

u238 · Jul 29, 2018

Let me know if this hardware description is ok:

Supermicro X9SCL
Xeon E3-1220
4X4GB PC3-10600E memory
Dell H310 flashed to IT mode
Mellanox ConnectX-2

8 disks are connected to the H310 while the other 4 are on the intel SATA ports.

camcontrol devlist output:

Code:

root@freenas:~ # camcontrol devlist
<ATA Hitachi HUS72403 A5F0>		at scbus0 target 7 lun 0 (pass0,da0)
<ATA Hitachi HDS72303 AAM0>		at scbus0 target 8 lun 0 (da9,pass13)
<ATA Hitachi HUA72303 A800>		at scbus0 target 9 lun 0 (pass1,da1)
<ATA ST33000651NS G009>			at scbus0 target 10 lun 0 (pass2,da2)
<ATA Hitachi HDS72303 AAM0>		at scbus0 target 11 lun 0 (pass3,da3)
<ATA Hitachi HDS72303 AAM0>		at scbus0 target 12 lun 0 (pass4,da4)
<ATA ST33000651NS G008>			at scbus0 target 13 lun 0 (pass5,da5)
<ATA Hitachi HDS72303 AAM0>		at scbus0 target 14 lun 0 (pass6,da6)
<ST33000651NS G008>				at scbus3 target 0 lun 0 (pass7,ada0)
<ST33000651NS G008>				at scbus4 target 0 lun 0 (pass8,ada1)
<ST33000651NS G008>				at scbus5 target 0 lun 0 (pass9,ada2)
<ST33000651NS G009>				at scbus6 target 0 lun 0 (pass10,ada3)
<Kingston DataTraveler SE9 PMAP>   at scbus8 target 0 lun 0 (pass11,da7)
<Kingston DataTraveler 2.0 PMAP>   at scbus9 target 0 lun 0 (pass12,da8)

zpool list -v output:

Code:

root@freenas:~ # zpool list -v
NAME									 SIZE  ALLOC   FREE  EXPANDSZ   FRAG	CAP  DEDUP  HEALTH  ALTROOT
freenas-boot							14.4G   854M  13.5G		 -	  -	 5%  1.00x  ONLINE  -
  mirror								14.4G   854M  13.5G		 -	  -	 5%
	da8p2								   -	  -	  -		 -	  -	  -
	da7p2								   -	  -	  -		 -	  -	  -
volume1								 32.5T  21.3T  11.2T		 -	 5%	65%  1.00x  ONLINE  /mnt
  raidz2								16.2T  11.3T  4.96T		 -	 8%	69%
	gptid/cb7ae720-69a4-11e7-8b26-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/c3b7158a-9321-11e8-8469-002590d3aef1	  -	  -	  -		 -	  -	  -
	gptid/d062280a-69a4-11e7-8b26-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/d2e660a9-69a4-11e7-8b26-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/d636b75e-69a4-11e7-8b26-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/3393037d-e969-11e7-bef0-000c2984fb43	  -	  -	  -		 -	  -	  -
  raidz2								16.2T  10.0T  6.21T		 -	 3%	61%
	gptid/a83c83fa-eed9-11e7-9b86-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/a912267d-eed9-11e7-9b86-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/a9ea29b8-eed9-11e7-9b86-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/468516cd-035a-11e8-9015-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/ab9e93be-eed9-11e7-9b86-000c2984fb43	  -	  -	  -		 -	  -	  -
	gptid/ac6f0256-eed9-11e7-9b86-000c2984fb43	  -	  -	  -		 -	  -	  -
root@freenas:~ #

lspci output:

Code:

root@freenas:~ # lspci
00:00.0 Host bridge: Intel Corporation Xeon E3-1200 Processor Family DRAM Controller (rev 09)
00:01.0 PCI bridge: Intel Corporation Xeon E3-1200/2nd Generation Core Processor Family PCI Express Root Port (rev 09)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (Lewisville) (rev 05)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b5)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a5)
00:1f.0 ISA bridge: Intel Corporation C202 Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port Desktop SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
01:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
02:00.0 Ethernet controller: Mellanox Technologies MT26448 [ConnectX EN 10GigE, PCIe 2.0 5GT/s] (rev b0)
03:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection
04:03.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

Thanks for your help!

Chris Moore · Jul 29, 2018

u238 said:
I checked /var/log/messages and got this. The problem starts here:

One of the most important details you need to provide for troubleshooting is missing as far as I can see, and it isn't because I am trying to be difficult, but we really do need to know, as it affects the answer.

Please review the Forum Rules:
https://forums.freenas.org/index.php?threads/updated-forum-rules-8-3-16.45124/

Chris Moore · Jul 29, 2018

u238 said:
Let me know if this hardware description is ok:

SuperMicro X9SCL
Xeon E3-1220
4X4GB PC3-10600E memory
Dell H310 flashed to IT mode
Mellanox ConnectX-2

This is fine. I actually had a system very similar to this, not so long ago.

u238 · Jul 29, 2018

Chris Moore said:
One of the most important details you need to provide for troubleshooting is missing as far as I can see, and it isn't because I am trying to be difficult, but we really do need to know, as it affects the answer.

Please review the Forum Rules:
https://forums.freenas.org/index.php?threads/updated-forum-rules-8-3-16.45124/

No worries. What important details did I miss?

I should also add, I'm 100% sure the disk failed. It was clicking horribly so I disassembled it and found deep scratches on the platters.

kdragon75 · Jul 29, 2018

u238 said:
Let me know if this hardware description is ok

Looks perfect! Thank you.

kdragon75 · Jul 29, 2018

u238 said:
No worries. What important details did I miss?

I should also add, I'm 100% sure the disk failed. It was clicking horribly so I disassembled it and found deep scratches on the platters.

Chris Moore · Jul 29, 2018

u238 said:
No worries. What important details did I miss?

I should also add, I'm 100% sure the disk failed. It was clicking horribly so I disassembled it and found deep scratches on the platters.

I am guessing it was out of warranty?

I know that your text in the first post says FreeBSD 11.1-STABLE but that doesn't tell the FreeNAS version.

You can find the FreeNAS version in the GUI under "System" in the "Information" tab.
Or, you can type tail /etc/version at the command line.
There is only one line in the file, but it will tell you, quick and easy, what the version of FreeNAS is.

Chris Moore · Jul 29, 2018

u238 said:
I'm running 11.1-U5 and a stripe of two 6 disk raidz2.

Sorry, either you added that or I missed it when I read your post the first time.

u238 said:
It seems pretty unreasonable for the system to go down for one failed drive. Is there anything I can do? I can provide any info you need.

You are correct, with the U5 version of FreeNAS a single disk having a catastrophic failure should not cause the system to hang. You might want to submit a bug report so the developers can look more deeply into the system logs and try to determine the cause and find a fix for the future.

toadman · Jul 29, 2018

Definitely weird. Swap related maybe? I see in there is at least mirroring of some swap going on ("swap2" is mentioned with da7p1 being the partition).

Do you know if the disk in question had a swap partition on it?
If yes, was it part of a GEOM swap mirror?
Do you know if the system was swapping at all (regardless of the swap config)?

Chris Moore · Jul 29, 2018

toadman said:
Swap related maybe?

With the new mirrored swap system, I would not have expected this to be a problem. Some drives are not even used for swap, but they should all have partitions for it. To get a swap crash, it would need to lose both drives from a swap mirror. Theoretically...

toadman · Jul 29, 2018

Agreed in general as that was the point of going to mirrors I suppose.

I was curious as to what the swap structure is on that system though given I don't know how the new system works exactly. For example, what if only three drives have a swap partition? I would assume two of them go into a mirror. What happens to the third partition? Is it left unused? If not, seems like a potential point of failure. So my guess is any odd partition is not used. (What if there is only one partition, is it used?)

I just don't know enough about the new system to know how it is supposed to allocate all the available swap partitions.

u238 · Jul 29, 2018

Chris Moore said:
I am guessing it was out of warranty?

I know that your text in the first post says FreeBSD 11.1-STABLE but that doesn't tell the FreeNAS version.

You can find the FreeNAS version in the GUI under "System" in the "Information" tab.
Or, you can type tail /etc/version at the command line.
There is only one line in the file, but it will tell you, quick and easy, what the version of FreeNAS is.

Yeah, definitely well out of warranty.

The version is 11.1-U5.

u238 · Jul 29, 2018

toadman said:
Definitely weird. Swap related maybe? I see in there is at least mirroring of some swap going on ("swap2" is mentioned with da7p1 being the partition).

Do you know if the disk in question had a swap partition on it?
If yes, was it part of a GEOM swap mirror?
Do you know if the system was swapping at all (regardless of the swap config)?

At the time of the reboot, the system was using 113.9MB of swap, according to the swap utilization graph. Over the past month, swap has been at an average of 247MB with a maximum of 734MB. The system is being used only for storage. There are no jails or plugins.

Chris Moore · Jul 29, 2018

You can't really find out if the swap partition was in use because every time freenas reboots it allocates the first ten swap partitions in 5 mirrored pairs. Or, as many swap partitions as you have for people who have less than 10 drives.

Sent from my SAMSUNG-SGH-I537 using Tapatalk

Chris Moore · Jul 29, 2018

u238 said:
Over the past month, swap has been at an average of 247MB with a maximum of 734MB. The system is being used only for storage.

Do you have autotune enabled? If you do, you might want to disable it and reduce the value in the Tunable named vfs.zfs.arc_max to about 1GB less than it is now. That should keep your system from using swap.

u238 · Jul 29, 2018

Chris Moore said:
Do you have autotune enabled? If you do, you might want to disable it and reduce the value in the Tunable named vfs.zfs.arc_max to about 1GB less than it is now. That should keep your system from using swap.

Autotune has always been disabled. I just input the tunable you suggested and confirmed that the maximum arc size is reduced by 1GB.

Chris Moore · Jul 29, 2018

u238 said:
Autotune has always been disabled. I just input the tunable you suggested and confirmed that the maximum arc size is reduced by 1GB.

What number did you go with? Do you know the system must reboot for the new value to be active?

Chris Moore · Jul 29, 2018

toadman said:
For example, what if only three drives have a swap partition?

I tested this on a system at work. It only uses an even number of partitions and only creates as many as 5 swap mirrors.

toadman said:
I would assume two of them go into a mirror.

If you have an odd number of drives, that odd drive is not used. If you later add another drive, even in a separate pool, it will grab the new partition and add it to the unused partition to create another mirror for system swap.

toadman said:
Is it left unused? If not, seems like a potential point of failure. So my guess is any odd partition is not used.

You are guessing pretty good.

toadman said:
What if there is only one partition, is it used?

I didn't test that scenario. I will give it a shot as soon as I get a chance.

toadman said:
I just don't know enough about the new system to know how it is supposed to allocate all the available swap partitions.

I have a system at work that has 80 drives. Only 10 of the swap partitions are used. All the rest, just sit idle. If you have the default 2GB per drive, it gives you a max of 10GB of swap space. In most situations, that is probably plenty. I set my arc_max to 59055800320 bytes and it allows me to not use swap unless I one of my VMs is started. I may need to make the arc_max even smaller.

Important Announcement for the TrueNAS Community.

System unresponsive after drive failure

Dabbler

Wizard

Dabbler

Hall of Famer

Hall of Famer

Dabbler

Wizard

Wizard

Hall of Famer

Hall of Famer

Guru

Hall of Famer

Guru

Dabbler

Dabbler

Hall of Famer

Hall of Famer

Dabbler

Hall of Famer

Hall of Famer

Similar threads