When to replace drives?

hexadecagram

Dabbler
Joined
Jul 15, 2016
Messages
32
Should I do this when I start receiving emails with CAM timeouts + retries, when S.M.A.R.T. error counts start increasing, or when I start seeing alerts in the Web UI?

I would think it would be one of the latter two situations, but I've worked with iXsystems support in the past, and they wanted to replace drives as soon as I started seeing CAM errors that re-seating the drives didn't fix.
 
Joined
Oct 18, 2018
Messages
969
hexadecagram said: "Should I do this when I start receiving emails with CAM timeouts + retries, when S.M.A.R.T. error counts start increasing, or when I start seeing alerts in the Web UI?"
When you're getting errors such as CAM timeouts, SMART errors, or bad sectors that are best explained by a failing disk, I would suggest that you immediately make a plan to replace the disk. How urgently the replacement should take place depends on your situation.

In my view the variables affecting the urgency are as follows.
  1. How important is data availability? Note that this isn't the same thing as data integrity. If you have frequent, good-quality backups, you may not lose any data if the pool goes down, but availability still suffers while you restore from backup.
  2. How solid are your backups? If you keep no on-site or off-site backups, a replacement is more urgent than if you have both on-site and off-site backups that maintain frequent snapshots.
  3. How irreplaceable is the data? This one may depend somewhat on your backup strategy. Even if your data is absolutely irreplaceable, if you have 4 backups of it in reliable storage and data availability is not paramount, perhaps the situation is less dire.
So I would argue that once you have errors you should replace the disk, but consider your own situation when deciding how urgent the replacement is. Personally, I want to minimize downtime and reduce the risk of data loss, so I keep spare drives already burned in and ready to go, which lets me perform the replacement immediately.
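
As a rough illustration of how those three variables might combine, here is a minimal Python sketch. The names (`Situation`, `replacement_urgency`) and the cut-offs are hypothetical, made up for this example; they are not part of any TrueNAS tool, and the rules are only one possible reading of the points above.

```python
from dataclasses import dataclass


@dataclass
class Situation:
    """Hypothetical inputs mirroring the three variables above."""
    availability_critical: bool  # 1. would downtime during a restore hurt?
    backup_copies: int           # 2. number of reliable, current backups
    irreplaceable: bool          # 3. could the data never be recreated?


def replacement_urgency(s: Situation) -> str:
    """Return a rough urgency label for a disk that already shows errors.

    Every branch still means "replace the disk"; only the timing differs.
    The rules are illustrative assumptions, not an official policy.
    """
    if s.backup_copies == 0 and s.irreplaceable:
        return "immediate"
    if s.availability_critical or s.backup_copies == 0:
        return "soon"
    return "planned"


# The "4 backups, availability not paramount" case from point 3.
print(replacement_urgency(Situation(False, 4, True)))
```

Whatever the label, the disk still gets replaced; the sketch only shifts how quickly.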
 

Jessep

Patron
Joined
Aug 19, 2018
Messages
379
TL;DR: Depends on your risk tolerance.

If your data is critical to you (business or production storage), replace drives when they start showing any errors and basics like rebooting, re-inserting, and checking cabling don't resolve them.
Also replace drives before they are out of warranty, and buy drives with a 5-year warranty (enterprise drives).
In addition, stay within the supported number of drives per chassis. For WD Red that's 8 drives; beyond that you should be looking at enterprise drives (WD Gold or HGST Ultrastar DC) that are rated for more drives per chassis.
Keep tested, burned-in cold spares on hand so you are not at the mercy of vendors and varying stock levels, shipping, or lead times.

If this is home data and you have good backups you can violate as many of those suggestions as your risk tolerance allows.
 

hexadecagram

Dabbler
Joined
Jul 15, 2016
Messages
32
Thanks for your responses. However, what I am getting at is: how do these three alerts rank in terms of severity?

I would think that it is something like: CAM timeouts < S.M.A.R.T. < WebUI alerts, with a WebUI alert being "take immediate action" and CAM timeouts being "you may have a problem in a month or so" (the reason being that just about every CAM timeout says "retrying" and I never see too many of them in succession).
 

hexadecagram

Dabbler
Joined
Jul 15, 2016
Messages
32
Would anyone care to comment? Do I have the severity of these alerts ranked in the correct order?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
A drive can fail and never show SMART errors. The WebUI has middleware that reports from ZFS, so you’d see stuff like “ZFS thinks this drive is now bad and has taken it out of the pool”, and yeah, that’s severe.

The advice above seems really good to me. I don’t know that trying to attach a severity rating to these alerts is taking you down a helpful path. Assessing your risk tolerance will.
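
To make the ZFS-level reporting mentioned above a bit more concrete, here is a minimal Python sketch that scans the device table printed by `zpool status` and lists anything that is not `ONLINE` or that has non-zero READ/WRITE/CKSUM counters. The function name `troubled_devices` and the `SAMPLE` text are hypothetical; the sample is hand-written for illustration, and real output varies between ZFS versions, so treat the parsing as an assumption rather than a robust tool.

```python
# Hand-written sample resembling `zpool status` output (not from a real system).
SAMPLE = """\
  pool: tank
 state: DEGRADED
config:

\tNAME        STATE     READ WRITE CKSUM
\ttank        DEGRADED     0     0     0
\t  mirror-0  DEGRADED     0     0     0
\t    ada0    ONLINE       0     0     0
\t    ada1    FAULTED      3     0     0  too many errors

errors: No known data errors
"""


def troubled_devices(status_text: str) -> list[tuple[str, str]]:
    """Return (name, state) for rows that are not ONLINE or show error counts."""
    flagged = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # header row: the device table starts here
            continue
        if not in_table:
            continue
        if not fields:
            in_table = False         # a blank line ends the device table
            continue
        if len(fields) < 5:
            continue                 # section labels or rows without counters
        name, state, *counters = fields[:5]
        # Counters are compared as strings because some versions abbreviate
        # large values (an assumption about the format, hence no int()).
        if state != "ONLINE" or any(c != "0" for c in counters):
            flagged.append((name, state))
    return flagged


print(troubled_devices(SAMPLE))
```

On a live system the text would come from running `zpool status` yourself; the sketch deliberately takes a string so it can be tried without a pool.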

Jessep said: "Also replace drives before they are out of warranty"

Real question: Why?

Is failure rate after 5 years that much higher, so that replacing before failure reduces the risk of data unavailability?

Is it about budgeting and getting to predictable expenditure?

Is it about surviving an audit?

All of the above?
 

johnnyspo

Dabbler
Joined
Nov 30, 2012
Messages
13
If it was me, I would replace the drive. Unlike, say, your shoulder, which may feel better in a day or two, electromechanical devices generally don't get better on their own. Just do it and save yourself some near-future aggravation.
 

hexadecagram

Dabbler
Joined
Jul 15, 2016
Messages
32
Yorick said: "A drive can fail and never show SMART errors. The WebUI has middleware that reports from ZFS, so you’d see stuff like “ZFS thinks this drive is now bad and has taken it out of the pool”, and yeah, that’s severe."
Thank you for responding. This is very helpful information.

Yorick said: "The advice above seems really good to me. I don’t know that trying to attach a severity rating to these alerts is getting you down a helpful path. Assessing your risk tolerance will."
I agree. But I would think that it's impossible to assess risk properly without understanding the alerts I am receiving. I mean yes, all errors could be perceived as equally bad, but I would think that an actuator carving its initials into a platter would be more severe than a failed parity check.

CAM is software-level, whereas S.M.A.R.T. is hardware-level, yes? Where can I find technical information about CAM? IIRC, I first started seeing these messages in FreeBSD 5.0 and what they have been really trying to tell me has been a mystery ever since.

johnnyspo said: "If it was me, I would replace the drive. Unlike, say, your shoulder, which may feel better in a day or two, electromechanical devices generally don't get better on their own. Just do it and save yourself some near-future aggravation."
Yeah, this isn't an "if I should" question but a "when I should" question.
 