Drives dropping out of pool. CAM Status Error

MikeyG

Patron
Joined
Dec 8, 2017
Messages
442
I believe my system specs are in my sig, but in addition I have an iStarUSA D406 case with two of their BPN-DE340SS hot-swap cages installed.

Each cage has two SATA power connectors and four SATA data connectors on the back. The cages are connected to the SAS connectors on the motherboard using cables like this one.

For the last two months I've had drives sporadically dropping out of my pool with the following:

Code:
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: Command timeout
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Retrying command
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: SCSI Status Error
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SCSI status: Check Condition
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Error 6, Retries exhausted
Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Invalidating pack


As you can see, it happened again last night. It does not seem to be correlated with drive activity; in the past it has happened when the drives were mostly idle. I also ran a long SMART test, which turned up nothing. I once re-attached a dropped drive to the pool and ran four scrubs back to back, which took well over 12 hours, and still nothing.

I've had drives in both of the cages drop out, which leads me to think it's something other than a specific cage, a specific cable, or that particular drive.
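To check whether the errors really are spread across both cages, the log lines can be tallied per device. This is only a minimal sketch: it assumes the messages keep the `(device:controller:bus:target:lun)` prefix shown in the excerpts in this thread, and the function name `tally_cam_status` is made up for illustration. Which target number sits in which cage bay is not something the log says, so that mapping has to come from the physical wiring.

```python
import re
from collections import Counter

# Matches the CAM device tuple, e.g. "(da5:mpr0:0:7:0): CAM status: ..."
DEVICE_RE = re.compile(
    r"\((?P<dev>[a-z]+\d+):(?P<ctrl>[a-z]+\d+):\d+:(?P<target>\d+):\d+\): (?P<msg>.*)"
)

def tally_cam_status(lines):
    """Count 'CAM status:' lines per (device, target) pair."""
    counts = Counter()
    for line in lines:
        m = DEVICE_RE.search(line)
        if m and m.group("msg").startswith("CAM status:"):
            counts[(m.group("dev"), int(m.group("target")))] += 1
    return counts

# Sample lines copied from the log excerpts in this thread.
sample = [
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: Command timeout",
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): Retrying command",
    "Dec 29 02:40:49 nas (da5:mpr0:0:7:0): CAM status: SCSI Status Error",
    "Dec 29 11:18:25 nas (da2:mpr0:0:4:0): CAM status: CCB request completed with an error",
    "Dec 29 11:18:26 nas (da2:mpr0:0:4:0): CAM status: SCSI Status Error",
]

for (dev, target), n in sorted(tally_cam_status(sample).items()):
    print(f"{dev} (target {target}): {n} CAM status lines")
```

Fed with the full system log instead of the sample list, a tally like this makes it easier to see whether one target keeps showing up or whether the errors move around.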

Four of the eight drives report a Command_Timeout raw value greater than 0 in their SMART attributes, including one that previously dropped out:

188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2 2 4

The one from last night, however, is listed with 0 timeouts.

No drives show any pending sectors, offline uncorrectable sectors, or UDMA CRC errors. Drive temps range between 40 and 46C.
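For comparing attributes across all eight drives, the attribute row quoted above can be split into named fields. This is a rough sketch that assumes the usual ten-column layout of the `smartctl -A` attribute table; `parse_smart_attribute` is a hypothetical helper name. The raw value is kept as a string because a multi-number raw value such as `2 2 4` is vendor-specific and its exact meaning is not confirmed anywhere in this thread.

```python
# Column names follow the header printed by `smartctl -A`.
FIELDS = (
    "id", "name", "flag", "value", "worst",
    "thresh", "type", "updated", "when_failed", "raw",
)

def parse_smart_attribute(row):
    """Split one SMART attribute row into a dict of named fields."""
    parts = row.split(None, 9)  # the raw value may itself contain spaces
    if len(parts) != len(FIELDS):
        raise ValueError(f"unexpected attribute row: {row!r}")
    attr = dict(zip(FIELDS, parts))
    attr["id"] = int(attr["id"])
    # True if any numeric token of the raw value is non-zero.
    attr["raw_nonzero"] = any(
        int(tok) != 0 for tok in attr["raw"].split() if tok.isdigit()
    )
    return attr

row = "188 Command_Timeout 0x0032 100 099 000 Old_age Always - 2 2 4"
attr = parse_smart_attribute(row)
print(attr["name"], "raw:", attr["raw"], "non-zero:", attr["raw_nonzero"])
```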

As I'm writing this, I got the following:

Code:
Dec 29 11:18:25 nas (da2:mpr0:0:4:0): CAM status: CCB request completed with an error
Dec 29 11:18:25 nas (da2:mpr0:0:4:0): Retrying command
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): READ(16). CDB: 88 00 00 00 00 02 99 fd 6a 68 00 00 01 00 00 00
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): CAM status: SCSI Status Error
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): SCSI status: Check Condition
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Dec 29 11:18:26 nas (da2:mpr0:0:4:0): Retrying command (per sense data)


The drive did not drop out this time, however; apparently it recovered on its own. I should note that this drive did drop out previously, and I moved it to another bay within the same cage to see if the problem followed the drive. In this case it has. I also have a very large file copy job that has been running since last night, so it's possible the additional activity exposed some problems. However, this has now happened on 3 separate drives, and I find it hard to believe that all 3 are faulty (although it is of course possible).

At the moment, I'm running badblocks on the first drive that dropped out (da5) to see if that turns anything up.

I'd welcome any suggestions. Could it be the design of the cages? I have not started swapping out the SAS cables since the problem seems to affect both cages, but I can do that as a cheap solution. Could it be the motherboard?
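For reference, the READ(16) CDB in the log above can be decoded to see which part of the disk the failed read was aimed at. This is a small sketch based on the standard READ(16) layout (opcode 0x88, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13); `decode_read16` is just an illustrative name, and the logical block size of the drive is not stated in the thread, so the result is left in blocks.

```python
def decode_read16(cdb_hex):
    """Return (lba, block_count) from a READ(16) CDB given as a hex string."""
    cdb = bytes.fromhex(cdb_hex)  # whitespace between bytes is ignored
    if len(cdb) != 16 or cdb[0] != 0x88:
        raise ValueError("not a 16-byte READ(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # starting logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # transfer length in blocks
    return lba, blocks

# CDB copied from the da2 log line above.
lba, blocks = decode_read16("88 00 00 00 00 02 99 fd 6a 68 00 00 01 00 00 00")
print(f"READ(16) of {blocks} blocks starting at LBA {lba:#x}")
```

If repeated failures kept landing on the same LBA range, that would point more towards the disk surface; failures scattered across unrelated LBAs fit better with a connection or power problem.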

Looking for suggestions and apologies if I'm missing something obvious or have not supplied enough information.
 

MikeyG

Thank you for asking! I ended up buying a new drive cage and installing it last night. This one is straight SAS, so it has one connector instead of the breakout cables like the other one. I also moved the drive that was acting up the most into that cage. I will run it for a while and see what happens. I wouldn't call it resolved yet, but I couldn't think of anything else to do. I'm operating on the premise that it's unlikely for multiple cables, drives, and cages to go bad. I'm hoping it's not the motherboard/controller - so that leaves the cage type or design.
 

MikeyG

Nearly a month later and I've encountered no further issues. It would appear to have been connection related. Either the connections on one of the cages were faulty, the cables were, or simply re-seating the cables as part of mucking about in the system did it. Glad my guess seems to have been right and I didn't start replacing drives instead.
 

IceBoosteR

Guru
Joined
Sep 27, 2016
Messages
503
Hi,

Normally the CAM errors show up when you have a faulty or flaky connection to your drive. It can be the port, or the cable, or in my case, it was the PSU. CAM is just telling you that the connection to that drive has errors. Glad you have resolved that by buying a new cage, but I suggest you keep an eye on it.
As I said, I had an issue with my PSU. For some time, only one drive kept dropping out of the array, so I thought the SATA controller on that particular drive was faulty. After changing the drive it was quiet for three weeks, then the drive dropped out again. I changed the drive, the controller, the cables - a whole mess of testing, with drives dropping out of the array and so on. I could not find the root cause, until three drives dropped out at the same time - pool dead, data lost. To this day I still believe it was the PSU. I RMA'd it, but the tests on the manufacturer's side showed nothing (the PSU had also worked fine for me for days at a time, without issues....). I got a replacement PSU after kindly asking for it in some emails (thanks to the kind person from support) - and the error was gone. Hopefully forever, and hopefully the replacement unit is not going to do the same....
What I want to tell you is that it is sometimes hard to figure out why the CAM errors occur. Keep an eye on it and schedule backups often.

Cheers,
Ice
 