I believe my system specs are in my sig, but in addition I have an iStarUSA D406 case with two of their BPN-DE340SS hot swap cages installed.
Each cage has two SATA power connectors and four SATA data connectors on the back. The cages are connected to the SAS connectors on the motherboard using cables like this one.
For the last two months I've had drives sporadically dropping out of my pool with the following:
Code:
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: Command timeout
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Retrying command
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: SCSI Status Error
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SCSI status: Check Condition
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Error 6, Retries exhausted
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Invalidating pack
As you can see, it happened again last night. It does not seem to be correlated with drive activity; in the past it has happened when the drives were mostly idle. I also ran a long SMART test, which turned up nothing. I once tried re-attaching a drive to the pool and running 4 scrubs one after another, which took well over 12 hours - nothing.
I've had drives in both of the cages drop out, which leads me to think it's something other than a specific cage, a specific cable, or that particular drive.
4 of the 8 drives have a command timeout count greater than 0 in their SMART attributes, including one that previously dropped out:
188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2 2 4
The one from last night however is listed as 0 timeouts.
No drives report any pending sectors, offline uncorrectable sectors, or UDMA CRC errors. Drive temps range between 40 and 46C.
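In case it helps anyone compare these counters across several drives, here is a minimal Python sketch that splits one row of a SMART attribute table into named fields. It assumes the usual ten-column layout printed by smartctl (ID, name, flag, value, worst, thresh, type, updated, when_failed, raw value); the function name `parse_smart_attribute` is made up for illustration, and the only input used is the Command_Timeout row quoted above.

```python
def parse_smart_attribute(line):
    """Split one row of a SMART attribute table into a dict.

    Assumes ten whitespace-separated columns, where the last column
    (the raw value) may itself contain spaces, as in "2 2 4".
    Returns None for lines that do not look like an attribute row.
    """
    fields = line.split(None, 9)
    if len(fields) < 10 or not fields[0].isdigit():
        return None
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": fields[9].strip(),
    }


# The row quoted in this post.
row = "188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2 2 4"
attr = parse_smart_attribute(row)
print(attr["name"], attr["raw"])
```

The raw value is kept as a string on purpose, since how to interpret a multi-number raw value depends on the drive vendor.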
As I'm writing this, I got the following:
Code:
Dec 29 11:18:25 nas (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Dec 29 11:18:25 nas (da2:mpr0:0:4:0): Retrying command
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): READ(16). CDB: 88 00 00 00 00 02 99 fd 6a 68 00 00 01 00 00 00
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): CAM status: SCSI Status Error
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): SCSI status: Check Condition
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): Retrying command (per sense data)
The drive, however, did not drop out - apparently it recovered on its own. I should note that this drive did previously drop out, and I moved it to another drive bay within that same cage to see if the problem followed the drive. In this case it has. I am also currently running a very large file copy job that has been running since last night, so it's possible that the additional activity exposed some problems. However, this has happened on 3 separate drives, and I find it hard to believe that all 3 are faulty (although that is of course possible).
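To check whether the errors cluster on particular drives, the log entries could be tallied per device. Below is a rough Python sketch of that idea; the regular expression is an assumption based only on the log excerpts in this post (it is not a general syslog parser), and `tally_cam_status` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the "(da5:mpr0:0:7:0): CAM status: <text>" part of a log line.
# Pattern inferred from the excerpts in this post; other formats may differ.
CAM_STATUS = re.compile(r"\((da\d+):[^)]*\): CAM status: (.+)")


def tally_cam_status(lines):
    """Count (device, CAM status) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


# Lines copied from the two log excerpts in this post.
sample = [
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: Command timeout",
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Retrying command",
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: SCSI Status Error",
    "Dec 29 11:18:25 nas (da2:mpr0:0:4:0): CAM status: CCB request completed with an error",
    "Dec 29 11:18:26 nas (da2:mpr0:0:4:0): CAM status: SCSI Status Error",
]

for (device, status), count in sorted(tally_cam_status(sample).items()):
    print(device, status, count)
```

Run against a full log file instead of the sample list, this would show at a glance which devices report timeouts and how often.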
At the moment, I'm running badblocks on the first drive that dropped out (da5) to see if that turns anything up.
I'd welcome any suggestions. Could it be the design of the cages? I have not started swapping out the SAS cables since it seems to affect both cages, but I can do that as a cheap solution. Could it be the motherboard?
Looking for suggestions and apologies if I'm missing something obvious or have not supplied enough information.