Pool devices disconnect after reboot

Status
Not open for further replies.

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Just recently, I started having issues with one of my pools detaching devices for no apparent reason. I have 3 pools in addition to the mirrored boot pool. All issues are confined to my backup pool. Here is what is happening:

Code:
Oct  3 08:44:08 freenas     (pass16:mps0:0:24:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 00 00 00 00 01 00 00 00 00 00 00 40 e3 00 length 0 SMID 509 command timeout cm 0xfffffe0000b24c28 ccb 0xfffff800477e6800
Oct  3 08:44:08 freenas     (noperiph:mps0:0:4294967295:0): SMID 1 Aborting command 0xfffffe0000b24c28
Oct  3 08:44:08 freenas mps0: Sending reset from mpssas_send_abort for target ID 24
Oct  3 08:44:08 freenas     (pass14:mps0:0:22:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 00 00 00 00 01 00 00 00 00 00 00 40 e3 00 length 0 SMID 524 command timeout cm 0xfffffe0000b25f60 ccb 0xfffff8000cc93800
Oct  3 08:44:08 freenas     (noperiph:mps0:0:4294967295:0): SMID 2 Aborting command 0xfffffe0000b25f60
Oct  3 08:44:08 freenas mps0: Sending reset from mpssas_send_abort for target ID 22
Oct  3 08:44:08 freenas     (pass17:mps0:0:25:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 00 00 00 00 01 00 00 00 00 00 00 40 e3 00 length 0 SMID 553 command timeout cm 0xfffffe0000b28488 ccb 0xfffff8000cc84000
Oct  3 08:44:08 freenas     (noperiph:mps0:0:4294967295:0): SMID 3 Aborting command 0xfffffe0000b28488
Oct  3 08:44:08 freenas mps0: Sending reset from mpssas_send_abort for target ID 25
Oct  3 08:44:08 freenas     (pass13:mps0:0:21:0): ATA COMMAND PASS THROUGH(16). CDB: 85 06 00 00 00 00 01 00 00 00 00 00 00 40 e3 00 length 0 SMID 533 command timeout cm 0xfffffe0000b26ae8 ccb 0xfffff8000cccd000
Oct  3 08:44:08 freenas     (noperiph:mps0:0:4294967295:0): SMID 4 Aborting command 0xfffffe0000b26ae8
Oct  3 08:44:08 freenas mps0: Sending reset from mpssas_send_abort for target ID 21
Oct  3 08:44:09 freenas     (da17:mps0:0:25:0): READ(16). CDB: 88 00 00 00 00 01 43 ca 52 60 00 00 00 08 00 00 length 4096 SMID 370 terminated ioc 804b scsi 0 state c xfer 0
Oct  3 08:44:09 freenas mps0: Unfreezing devq for target ID 25
Oct  3 08:44:10 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 21
Oct  3 08:44:10 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 22
Oct  3 08:44:10 freenas mps0: mpssas_prepare_remove: Sending reset for target ID 24
Oct  3 08:44:10 freenas da13 at mps0 bus 0 scbus0 target 21 lun 0
Oct  3 08:44:10 freenas da13: <ATA ST4000DM000-1F21 CC54> s/n             S300YC75 detached
Oct  3 08:44:10 freenas da14 at mps0 bus 0 scbus0 target 22 lun 0
Oct  3 08:44:10 freenas da14: <ATA ST4000DM000-1F21 CC54> s/n             S300YBTS detached
Oct  3 08:44:10 freenas da16 at mps0 bus 0 scbus0 target 24 lun 0
Oct  3 08:44:10 freenas da16: <ATA ST4000DM000-1F21 CC54> s/n             S300YBLN detached
Oct  3 08:44:10 freenas GEOM_ELI: Device da13p1.eli destroyed.
Oct  3 08:44:10 freenas GEOM_ELI: Detached da13p1.eli on last close.
Oct  3 08:44:10 freenas GEOM_ELI: Device da14p1.eli destroyed.
Oct  3 08:44:10 freenas GEOM_ELI: Detached da14p1.eli on last close.
Oct  3 08:44:10 freenas GEOM_ELI: Device da16p1.eli destroyed.
Oct  3 08:44:10 freenas GEOM_ELI: Detached da16p1.eli on last close.
Oct  3 08:44:10 freenas zfsd: Replace vdev(backuptank/17892785827659776471) by physical path: Unable to allocate spare target data.
Oct  3 08:44:10 freenas zfsd: Replace vdev(backuptank/12561127196141559507) by physical path: Unable to allocate spare target data.
Oct  3 08:44:10 freenas zfsd: Replace vdev(backuptank/15713028563994597996) by physical path: Unable to allocate spare target data.
Oct  3 08:44:10 freenas (da17:mps0:0:25:0): READ(16). CDB: 88 00 00 00 00 01 43 ca 52 60 00 00 00 08 00 00
Oct  3 08:44:10 freenas (da17:mps0:0:25:0): CAM status: SCSI Status Error
Oct  3 08:44:10 freenas (da17:mps0:0:25:0): SCSI status: Check Condition
Oct  3 08:44:10 freenas (da17:mps0:0:25:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Oct  3 08:44:10 freenas (da17:mps0:0:25:0): Retrying command (per sense data)
Oct  3 08:44:10 freenas mps0: IOCStatus = 0x4b while resetting device 0x1a
Oct  3 08:44:10 freenas mps0: Unfreezing devq for target ID 24
Oct  3 08:44:10 freenas mps0: Unfreezing devq for target ID 24
Oct  3 08:44:11 freenas mps0: IOCStatus = 0x4b while resetting device 0x18
Oct  3 08:44:11 freenas mps0: Unfreezing devq for target ID 22
Oct  3 08:44:11 freenas mps0: Unfreezing devq for target ID 22
Oct  3 08:44:11 freenas mps0: IOCStatus = 0x4b while resetting device 0x17
Oct  3 08:44:11 freenas mps0: Unfreezing devq for target ID 21
Oct  3 08:44:11 freenas mps0: Unfreezing devq for target ID 21
Oct  3 08:44:11 freenas (da16:mps0:0:24:0): Periph destroyed
Oct  3 08:44:11 freenas (da14:mps0:0:22:0): Periph destroyed
Oct  3 08:44:11 freenas (da13:mps0:0:21:0): Periph destroyed


I am currently running FreeNAS-9.3-STABLE-201509220011 and I started to run into issues after updating via the GUI to the Sept 28th release. Not all drives disconnect consistently from the backup pool. On one occasion all drives were disconnected; on other occasions only a few would disconnect. Once a drive has disconnected, it will not reconnect, even if I pull the drive tray and re-insert it while the system is online. The system requires a reset to see the drives again.

The really strange thing is, when I reboot, the backup pool is fine and online for a few minutes, and then the devices start to be detached. If I instead pull the drive trays, reboot the computer, and then insert the drives and import the pool, everything is fine. I am lucky that this is my backup pool, but it is a bit frustrating.

I have changed which of my 2 V20 LSI cards is used for my pools, and the backup pool devices get detached regardless of the controller. Both controllers run the same V20 IT firmware, and everything is connected through a single Intel RES2SV240. Only 1 LSI card is connected to the Intel expander at any time. The Intel expander has the latest firmware available.

The difference between my production pool and my backup pool is that I had those HDDs configured to spin down to save power. I have since removed that from my config now that I have things running again. Could this be the issue, given that the latest update seems to address something to do with LSI cards and power savings?
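For anyone wanting to check the same thing, the drive power-management state can be inspected from the FreeBSD shell. This is only a sketch: the device name da13 is borrowed from the log above for illustration, so substitute your own, and the commands are guarded so they are harmless on systems without camcontrol.

```shell
# Read the ATA identify data for one of the backup-pool drives and pull
# out the power-management fields (device name da13 is an assumption):
if command -v camcontrol >/dev/null 2>&1; then
    camcontrol identify da13 | grep -i -E 'power|standby'

    # If APM is enabled at an aggressive level, level 254 (0xFE) sets it
    # to maximum performance with no standby on drives that honour it:
    camcontrol apm da13 -l 254
fi
```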

My server has, up to this point, been pretty much bulletproof. I am not sure if there is a bad drive that is causing the system to panic or if there is a hardware issue. I checked all of my cables just in case, but the server does not move. System temperatures are normal, all fans are functioning, and the system has been stable. My UPS reports ~220W with all drives in the system. I have a 750W power supply.

If anyone has a suggestion, please let me know.

Cheers,
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I thought that too, but it is strange that if I pull the drives, then re-insert and import, it will stay stable.

I wonder if I can get the same PSU again so I do not need to re-cable if that is the issue. :)
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
I thought that too, but it is strange that if I pull the drives, then re-insert and import, it will stay stable.
The fault (if it's indeed the PSU) could be somewhere other than the hard disk power.
I wonder if I can get the same PSU again so I do not need to re-cable if that is the issue. :)
That would be sweet!
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
The thing that gets me though is that only 1 pool is affected. If it were a PSU issue, I would expect one of my other pools to be affected at some point...
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Does your Corsair PSU have multiple rails?
wonder if I can get the same PSU again so I do not need to re-cable if that is the issue
Sounds like your PSU is modular. If it's a single rail, then it can't be at fault. If it has multiple rails, then...
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Good point. I will see how the power cables are set up in the case. Because of my OCD, all the backup pool drives are in the top 2 rows, while the production pool drives are in the bottom 3 rows.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Sounds like your PSU is modular. If it's a single rail, then it can't be at fault. If it has multiple rails, then...
That is sometimes a tricky thing to find out. Many PSUs list on the side sticker that they have, say, three +12VDC rails, but then tie them together inside the PSU, which still results in a single rail. The only way to know for certain is to find the model of the PSU and a reputable review site that has torn the PSU apart and tells you point blank whether there are in fact separate rails. Sorry, but that is just the way it is these days.

There are a few ways to check whether it's your PSU if you do not have a load tester... Run Prime95 or some other CPU stress test; if it fails, the PSU is a possibility. A RAM test can also point to a PSU with unstable power. Honestly, you should run both tests to rule out a hardware issue with the PSU, CPU, RAM, or MB.

However, for the problem at hand, because it only affects a specific pool each and every time, I'm inclined to think it has something to do with the controller. But you said you swapped all the cables, or was it the controllers? If you left the cables in place and only swapped the controllers, well, you know I'm going to ask you to swap the cables out and see if the problem follows. Also, what is the output of "zpool status" when the problem occurs?
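Capturing the pool state and the relevant log lines in one go makes the report easier to read. Something like this would do it; the pool name backuptank and the grep pattern are taken from the log quoted earlier in the thread, and the commands are guarded so nothing breaks on a box without ZFS installed.

```shell
# Snapshot the pool state the next time devices detach:
if command -v zpool >/dev/null 2>&1; then
    zpool status -v backuptank
fi

# Filter the system log for the controller, GELI, and zfsd events
# (pattern based on the messages quoted above):
if [ -r /var/log/messages ]; then
    tail -n 200 /var/log/messages | grep -E 'mps0|GEOM_ELI|zfsd'
fi
```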

-Mark
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
So one other thing: sometimes when I reboot the system, it takes forever. It seems to spend a bunch of time interrogating the 6 drives in the backup pool (the green light on each HDD tray stays on for some time, drive by drive). I am not sure how to capture log information for that sort of thing.

I will replace the SAS cables to the upper 2 rows as well as verify the power cable connections. Probably will get to that today and will let you know what happens.

Cheers,
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Hopefully it's the SAS cable.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
So I changed the 2 SAS cables that fed the top 2 rows, wiggled all the power connections, and separated the SAS controllers (8 drives connected to the SuperMicro SAS controller, 12 drives attached to the onboard SAS controller). Powered up and all drives were attached, boot was at normal speed, and now the backup tank is scrubbing happily at ~625MB/s.

I have never had a SAS cable go bad before. I will keep an eye on the system and see how it behaves. Thanks for the suggestions!!
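For anyone following along, the scrub result can be checked from the shell once it finishes; a clean pass should end with no known data errors. This is a sketch using the pool name from this thread, guarded so it is a no-op on systems without ZFS.

```shell
# Check scrub progress and the pool-wide error summary:
if command -v zpool >/dev/null 2>&1; then
    zpool status backuptank | grep -E 'state:|scan:|errors:'
fi

# A healthy finish looks something like:
#   state: ONLINE
#   scan: scrub repaired 0 in ...
#   errors: No known data errors
```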

Cheers,
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Cross your fingers it lasts.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
System has been stable, rebooted this morning to install the latest update and all seemed to work great.

Thanks for the suggestion that it may have been the SAS cables.

Cheers,
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Glad to hear it. Sometimes it's hard to believe it's a failed cable when they rarely get touched, but it does happen.
 