CAM Status Medium Error, still/again...

Status
Not open for further replies.

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
This time you got a Drive Ready Error message, hopefully different than the original message.

Let's assume this is a new problem, as it will help during the troubleshooting effort.

Let me lay out some steps for you to take, and please take good notes on what you do. You will need to track the drives by serial number; never assume ada4 is always the same physical drive. During this troubleshooting, do not upgrade the FreeNAS software without writing down what you are doing. If the problem magically goes away, it's nice to be able to point to what changed.

1) Record the drive serial number for ada4.
2) Wait for the next failure message. If it is ada4 again (the same drive by serial number), perform step 3; if not, jump to step 4.
3) If the failure is ada4 again then:
3a) Record the serial numbers of ada3 and ada4, then power down.
3b) At either the motherboard end or the hard drive end (not both), swap the SATA cables of the ada4 and ada3 drives (use the serial numbers). This places the suspect drive on the ada3 connection.
3c) Power on and, once FreeNAS has booted, verify by serial number that ada3 is now the suspect drive.
4) If the problem occurs on a different drive (based on serial number), then you may have a power supply issue. Run MemTest86 (one full pass minimum) and a CPU stress test (~2 hours); these can generally expose a power supply issue. Make sure your hard drives are at least connected to power during the tests so there is a valid power load.
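
The serial-number bookkeeping in the steps above could be sketched roughly like this in Python. It is only an illustration: it assumes the text comes from something like `smartctl -i /dev/ada4`, which prints a `Serial Number:` line, and the serial numbers and helper names (`parse_serial`, `moved_drives`) are made up for the example.

```python
from __future__ import annotations

import re


def parse_serial(smartctl_info: str) -> str | None:
    """Return the value of the 'Serial Number:' line, or None if it is absent."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_info, re.MULTILINE)
    return match.group(1) if match else None


def moved_drives(before: dict[str, str], after: dict[str, str]) -> dict[str, tuple[str, str]]:
    """Map each serial to (old device, new device) when its device name changed."""
    old_by_serial = {serial: dev for dev, serial in before.items()}
    return {
        serial: (old_by_serial[serial], dev)
        for dev, serial in after.items()
        if serial in old_by_serial and old_by_serial[serial] != dev
    }


# Hypothetical sample text and serial numbers, for illustration only.
SAMPLE = "Device Model:     EXAMPLE DRIVE\nSerial Number:    EXAMPLE-0004\n"

before_swap = {"ada3": "EXAMPLE-0003", "ada4": parse_serial(SAMPLE)}
after_swap = {"ada3": "EXAMPLE-0004", "ada4": "EXAMPLE-0003"}

# After the cable swap in step 3b, both serials should show up under the other name.
print(moved_drives(before_swap, after_swap))
```

Keeping the notes keyed by serial number rather than by adaN is what makes step 3c checkable after the swap.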

This is all I've got for now. I hope you track your problem down quickly.
 

BlueMagician

Explorer
Joined
Apr 24, 2015
Messages
56
@joeschmuck: Thank you very much for your reply - your time is appreciated.

The steps you describe are pretty much what I've been through for the last few months whilst finally deciding to try and nail down the 'other' issue whilst using the H200 HBA.

The last couple of errors that occurred whilst using the Dell controller affected two different drives (identified by serial number) - one of which is the same serial which has now flagged this Drive Ready ATA error.

The voltages from the Seasonic PSU reported by the Intel BMC during full array access, and during a scrub, look good and stable.

I really wish this issue was more prevalent - it would be easier to diagnose if there were hundreds of errors per day, instead of one or two a month.

I've not tried a MemTest or stress cycle since I built the machine I must admit - but the CPU and board/case temps are exceptional even under load - for example when doing hard multi-stream transcoding in Plex. The system has never rebooted or hung unexpectedly.

In fact, other than this message appearing in the logs every once in a while, along with a repair of less than 128 KB to the pool during a scrub, it has been great.

There's always something to worry about, right?


S.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Well, I can see that it will continue to take a while for you to isolate the problem. It's possible you have two drives that are having some problems. I don't think a SMART Long test will help you find this problem, but make sure you run SMART Short tests at least daily, and Long tests as often as you like.
 

BlueMagician

Explorer
Joined
Apr 24, 2015
Messages
56
@joeschmuck: Again, your time is very much appreciated.

I perform short tests every 5 days, and long tests every 2 weeks, dovetailed with a scrub every 2 weeks so that no process overlaps.
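
A quick way to sanity-check that kind of dovetailing might look like the Python sketch below. The days of the month are purely hypothetical (they are not my actual schedule), and `overlapping_days` is just a name I picked for the example.

```python
from itertools import combinations


def overlapping_days(schedule):
    """Return {(task_a, task_b): shared days} for every pair of tasks that collide."""
    clashes = {}
    for (name_a, days_a), (name_b, days_b) in combinations(schedule.items(), 2):
        shared = set(days_a) & set(days_b)
        if shared:
            clashes[(name_a, name_b)] = sorted(shared)
    return clashes


# Hypothetical days of the month, chosen only to illustrate the check.
schedule = {
    "smart_short": [5, 10, 15, 20, 25],
    "smart_long": [7, 21],
    "scrub": [14, 28],
}

# An empty result means no two tasks start on the same day.
print(overlapping_days(schedule))
```

This only compares start days; a long test or scrub that runs past midnight would need a check on durations as well.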

The SMART indicators/counters on all drives are fine. No pending sectors, no remapped sectors, no CRC errors - and drive temps reach a mere 32 degrees centigrade during a scrub.
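
For anyone wanting to automate that check, here is a rough Python sketch. It assumes the usual `smartctl -A` table layout, where the attribute name is the second column and the raw value is the tenth, and it only looks at the three counters mentioned above. The sample rows are invented for illustration, not copied from my drives.

```python
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "UDMA_CRC_Error_Count",
}


def nonzero_counters(smartctl_attrs: str) -> dict:
    """Return {attribute name: raw value} for watched attributes with a non-zero raw value."""
    found = {}
    for line in smartctl_attrs.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            raw = int(fields[9])
            if raw != 0:
                found[fields[1]] = raw
    return found


# Invented sample rows in the assumed `smartctl -A` layout.
SAMPLE_ATTRS = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# An empty dict means none of the watched counters have moved.
print(nonzero_counters(SAMPLE_ATTRS))
```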

The SATA cables I bought last week when migrating to the onboard ports are all new and were not bargain bin price - so it would be just my luck if one out of the pack was faulty.

Two of my WD Reds have generated (albeit different) ATA errors in the last few weeks, and both will come to the end of their 3 year warranty in April - so if they are going to fail - I'd rather they did it now!

I could possibly convince WD to accept an RMA for one of them without specific SMART errors, but getting them to accept two without good reason would probably be near impossible - and I also don't want to give up on drives that may be absolutely fine - better the devils I know, rather than a reconditioned unit lottery!

I have no choice really but to start the troubleshooting from scratch, as you suggest, now that I've removed the HBA from the equation.

Incredibly frustrating but there you go.

Incidentally, since my last error in the DMESG log, the system has scrubbed twice (forced out of schedule) without issue - in good time and with no repairs.

I guess it stands to reason that the pool only needs a (tiny) repair _if_ one of these ATA errors occurs during a Write command and the data doesn't get committed correctly? I may be showing my ignorance there.

Googling last night turned up a few FreeBSD users from back in the version 9 days, blaming "ATA status: 41 (DRDY ERR), error: 40 (UNC )" errors on bad ACPI / AHCI drivers .. but I started to glaze over at that point and decided I was following red herrings...
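
For reference, the two hex bytes in that message can be decoded bit by bit. The Python sketch below uses the bit names of the classic ATA status and error registers as I understand them; only DRDY, ERR and UNC are actually confirmed by the message quoted above, so treat the rest of the tables as an assumption.

```python
import re

# Bit masks, highest bit first. Names other than DRDY, ERR and UNC are assumed.
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x10: "SERV",
               0x08: "DRQ", 0x04: "CORR", 0x02: "IDX", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x20: "MC", 0x10: "IDNF",
              0x08: "MCR", 0x04: "ABRT", 0x02: "NM", 0x01: "ILI"}


def decode(value, names):
    """List the names of all bits that are set in value."""
    return [name for bit, name in names.items() if value & bit]


def decode_message(msg):
    """Pull the hex status and error bytes out of a log line and decode both."""
    m = re.search(r"status:\s*([0-9a-fA-F]{2}).*?error:\s*([0-9a-fA-F]{2})", msg)
    if not m:
        return None
    status, error = (int(group, 16) for group in m.groups())
    return decode(status, STATUS_BITS), decode(error, ERROR_BITS)


print(decode_message('ATA status: 41 (DRDY ERR), error: 40 (UNC )'))
# -> (['DRDY', 'ERR'], ['UNC'])
```

So 0x41 is simply "drive ready, error flagged", and 0x40 in the error register is the uncorrectable data flag.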


S.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
BlueMagician said:
"I guess it stands to reason that the pool only needs a (tiny) repair _if_ one of these ATA errors occurs during a Write command and the data doesn't get committed correctly? I may be showing my ignorance there."

It doesn't matter how the data corruption occurs; it could be that, or possibly the media is flaking off. Typically, if the media is flaking off you will see a lot of failures and a SMART Long test will show it.

I wish I knew the exact cause of your issue but I suspect you wish you knew as well. Keep plugging along.

About your fan speed: you could plug your fans into a modified 4 pin Molex to 3 pin fan adapter in which the fan's ground wire (black) is moved from the Molex ground pin to the +5VDC pin (red), while the fan's positive wire stays on +12VDC (yellow). This gives a 7VDC differential and runs your Aux fans at a nice constant medium speed. Your CPU fan would remain where it is. If you are unsure how to do something like this, I could provide a photo. This would eliminate the need for the fan script.

I do not have your motherboard but can't you select a constant fan speed in the BIOS?

Basically I'd like to remove the script from the equation.
 

BlueMagician

Explorer
Joined
Apr 24, 2015
Messages
56
@joeschmuck:

Indeed, I wish I knew!

6 of my Noctua 120mm fans are already running directly from a PSU rail, at 7 V through the supplied Noctua 'low noise' adaptors - which is basically a voltage differential configuration. Thus, they are already uncontrolled and constant.

The remaining two fans - the CPU fan and the single side-mounted case fan - are the only ones connected to the motherboard headers. They are configured in BIOS to increase when necessary...

Please forgive my ignorance, but the scripts you speak of - are these something that's configured in FreeNAS by default - or something you assume I have added as a quality of life improvement?

Thank you again, sincerely,
S.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Arg, I apologize. I got your problem confused with a different user's problem. Please ignore me.
 