Drives getting old, what would be your upgrade process?

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
So I have some systems with external JBODs that are getting pretty long in the tooth. I have the funds to replace a lot of these drives, but I’m wondering how other people would do it.

For example, I have a JBOD with fourteen 7200 RPM SAS drives in it. The pool has 7 vdevs, each a two-drive mirror.

The drives were manufactured in 2012 and have a powered-on time of about 8.8 years. Maybe I don’t need to worry, but I have some budget now, and the age of these drives is starting to concern me.

Some Facts:
I DO have some other empty JBODs and SAS cables on hand.
I don’t need to expand the storage amount.
I’m happy with the performance.

My Options:
1) Purchase a set of replacement drives to have on hand (obviously I already have some spares), wait for the old drives to die, and replace them as needed.

2) Proactively replace the drives, swapping them out one at a time and letting the system resilver until eventually all the drives are replaced.

3) Put all the new drives in a secondary JBOD, do a ZFS send/receive to a new pool, then make that pool my primary and take the old pool offline (see the sketch below).
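
If I went with option 3, I imagine the mechanics would look roughly like this (just a sketch; the pool names tank / tank2 and the snapshot names are made up):

    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs recv -F tank2
    # stop writes to the old pool, then catch up with a final incremental:
    zfs snapshot -r tank@final
    zfs send -R -i tank@migrate tank@final | zfs recv -F tank2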

I’m currently running FreeNAS-11.3-U2.1. Historically, resilver tasks have taken a LONG time. But I thought I read somewhere (can’t find it now) that ZFS resilvering has been vastly improved. I could upgrade the system to TrueNAS-12.0-U3.1 if that would help.

So what would be your strategy?
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Do you want to stay with mirrors for the I/O, or would RAIDZ2 also be an option? If the latter, you could go for bigger disks and drastically reduce their number, and thus power consumption.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Do you want to stay with mirrors for the I/O, or would RAIDZ2 also be an option? If the latter, you could go for bigger disks and drastically reduce their number, and thus power consumption.
We need the mirrors for I/O. When we put this system together years ago, we started out with RAIDZ2 and weren't happy with the performance. I thought about getting bigger disks, and the power savings would be nice, but we really need the spindles for performance.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
Depending on how much data we are talking about, SSDs might be an option too.
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi,

One thing that I would not do is buy all of the replacement drives at once. If you do, you will end up with all your drives from the same manufacturing date, entering service at the same time, with the same average time to failure, so they will basically all fail at the same time. Not on the very same day, but in the same time frame.

As such, buy a disk or two from two different sellers. That way, you increase your chance of getting drives from different lots. You then deploy them in, say, mirrors 1 and 2. In 2 months or so, buy two more and put them in mirrors 3 and 4. Keep doing that, so you spread your drives' ages and reduce your overall risk.
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
So what would be your strategy?
My strategy is to replace drives as needed and to keep enough spares on hand to cover several simultaneous failures amongst all my systems at once. If you're concerned about losing data should that pool die, then you've identified a primary design flaw of your storage infrastructure, namely that no level of redundancy at the disk level is a backup.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Hi,

One thing that I would not do is buy all of the replacement drives at once. If you do, you will end up with all your drives from the same manufacturing date, entering service at the same time, with the same average time to failure, so they will basically all fail at the same time. Not on the very same day, but in the same time frame.

As such, buy a disk or two from two different sellers. That way, you increase your chance of getting drives from different lots. You then deploy them in, say, mirrors 1 and 2. In 2 months or so, buy two more and put them in mirrors 3 and 4. Keep doing that, so you spread your drives' ages and reduce your overall risk.
That's a great idea! Thanks!
So for my initial problem... I wonder if I should add another JBOD and add the new drives as additional mirrors, then once the data is copied over, remove mirrors 1 and 2? Once all the old mirrors are deprecated and removed, pull the old JBOD out of service.
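
Something like this, maybe (a rough sketch; assumes the pool is called tank, the new disks are da14/da15, and a ZFS version new enough to support top-level device removal):

    zpool add tank mirror da14 da15    # new mirror vdev on the new JBOD
    zpool remove tank mirror-0         # evacuates that vdev's data onto the others, then drops it
    zpool status tank                  # watch the evacuation progress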
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
My strategy is to replace drives as needed and to keep enough spares on hand to cover several simultaneous failures amongst all my systems at once. If you're concerned about losing data should that pool die, then you've identified a primary design flaw of your storage infrastructure, namely that no level of redundancy at the disk level is a backup.
Thanks for the tip. Data is backed up to another pool on site, and a secondary pool offsite. If I lose the pool I'm not worried about the data... but it would take a bit of time to restore. Which is fine... but not that fun :)
 

Heracles

Wizard
Joined
Feb 2, 2018
Messages
1,401
Hi,

You said that your pool's design was good for you, so no need to change it. Say mirrors are 1 to 7, mirror 1 is made of disks A and B, mirror 2 is C and D, and so forth. I would buy 2 drives, N1 and N2. I would add N1 to mirror 1 and N2 to mirror 2 (you said that you have extra space available). Once they are in place and resilvered, I would remove and decommission disks A and C. 2 months later, I would do one drive in mirror 3 and one in mirror 4. That way, you end up with a new and solid drive in each of your mirrors in a short time, protecting you against the failure of any one of your old drives. When you are done replacing one half of each mirror, do the second half. In about a year, you will be done replacing all your old drives.
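
For one swap, that would look something like this (a sketch; the pool name tank and the device names are made up):

    zpool attach tank da0 da14    # da14 joins the mirror containing da0 as a third copy
    zpool status tank             # wait for the resilver to finish
    zpool detach tank da0         # drop the old disk, leaving a fresh two-way mirror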
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
Hi,

You said that your pool's design was good for you, so no need to change it. Say mirrors are 1 to 7, mirror 1 is made of disks A and B, mirror 2 is C and D, and so forth. I would buy 2 drives, N1 and N2. I would add N1 to mirror 1 and N2 to mirror 2 (you said that you have extra space available). Once they are in place and resilvered, I would remove and decommission disks A and C. 2 months later, I would do one drive in mirror 3 and one in mirror 4. That way, you end up with a new and solid drive in each of your mirrors in a short time, protecting you against the failure of any one of your old drives. When you are done replacing one half of each mirror, do the second half. In about a year, you will be done replacing all your old drives.
These are some great ideas! Thank you!
So you would replace two mirrors at a time, correct?

Side question... how would you go about replacing the drives in a RAIDZ2? Or would you even bother, since you have lots of redundancy?
 

rvassar

Guru
Joined
May 2, 2018
Messages
972
The neat thing about ZFS... it identifies the drives by GUID, not by slot or controller location. You don't have to take the current drives offline to resilver in a new drive. Add a drive as a solo device in any free bay, attach it as a third side of the mirror vdev, let it resilver, and swap after the fact. You can then move the new drives to the old drive locations at a later date, put the old drive in the new drive's former location, and the pool will simply re-assemble itself. Then detach the old drive and recycle it. Lather... rinse... repeat... at a rate your risk aversion tolerates. Keeping the avoid-uniform-age suggestion above in mind...
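
The bay shuffle works because pool members are found by their on-disk labels, not device paths. Something like (pool name assumed):

    zpool export tank    # quiesce the pool before physically moving disks between bays
    # ...swap the drives around...
    zpool import tank    # ZFS locates every member by GUID, wherever it shows up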
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
how would you go about replacing the drives in a RAIDZ2

As long as I have the space, put in one drive at a time, without removing the old drive, and let it resilver. Remove old drive after resilver, rinse repeat.

I will likely wait for the first failure, however, rather than replacing proactively.
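
In command terms, something like this (pool and device names are placeholders):

    zpool replace tank da3 da15    # resilvers onto da15 while da3 stays online in the raidz2
    zpool status -v tank           # da3 detaches automatically once the replace completes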
 
Joined
Jan 27, 2020
Messages
577
As long as I have the space, put in one drive at a time, without removing the old drive, and let it resilver. Remove old drive after resilver, rinse repeat.

I will likely wait for the first failure, however, rather than replacing proactively.
Do you keep them active/spinning until one drive fails (hot spare), OR do you have them on a shelf ready to install (cold spare), OR do you have them in the Amazon order basket, ready for same-day delivery (Amazon spare)?
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Do you keep them active/spinning until one drive fails (hot spare), OR do you have them on a shelf ready to install (cold spare), OR do you have them in the Amazon order basket, ready for same-day delivery (Amazon spare)?

Amazon in my case, because the data I have on there isn’t terribly critical. If I cared more, I’d do shelf, with a drive that’s already burned in.
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
I have now looked at the whole thread again, but cannot see how much data we are talking about in total. That would be quite important to know.

As to replacement drives, I have one spare here for my RAIDZ2 pool of 8 drives, which are all about 9 months old. This drive has been burned in and is therefore relatively safe from failing right after being used as a replacement. Ordering only after failure might be ok, if a) the supplier always has the drive you want in stock, b) the prices only go down, and c) the supplier is known to deliver good drives with proper packaging. Amazon has a mixed track record for point c) and I have stopped buying HDDs from Amazon here in Germany, unless it is retail with good packaging.

In the past I have also only ordered after a drive failure, but in hindsight don't think this was such a great idea. Yes, I have all my data backed up (incl. off-site). But for various reasons those backups are somewhat scattered. And pulling everything together would be a serious effort. So having an extra drive for 350 Euros lying around is worth it for me.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
I agree that pre-qualifying replacement drives is a good idea. I have multiple pre-qualified drives here that can be used for both the backup RAID enclosures and the NAS (by design).

Replacing drives before SMART errors start to indicate potential issues seems premature to me, but to each his/her/their own. I'm running a Z3 here and have a backup system, so I have more of a tolerance for failure than others.

Generally, all my drives were purchased used and hence naturally come from multiple production lots. I expect them to fail randomly, and every time a qualified spare is pulled into duty, a replacement will be purchased and qualified. The bigger issue going forward will be finding replacement 10TB helium drives; more and more, OEMs will likely only make He drives at higher capacities for data-center use.
 

jamiejunk

Contributor
Joined
Jan 13, 2013
Messages
134
I agree that pre-qualifying replacement drives is a good idea. I have multiple pre-qualified drives here that can be used for both the backup RAID enclosures and the NAS (by design).

Replacing drives before SMART errors start to indicate potential issues seems premature to me, but to each his/her/their own. I'm running a Z3 here and have a backup system, so I have more of a tolerance for failure than others.

Generally, all my drives were purchased used and hence naturally come from multiple production lots. I expect them to fail randomly, and every time a qualified spare is pulled into duty, a replacement will be purchased and qualified. The bigger issue going forward will be finding replacement 10TB helium drives; more and more, OEMs will likely only make He drives at higher capacities for data-center use.
How do you go about qualifying drives? Just filling all the blocks and making sure it doesn't crater?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
There is a community resource page that goes into qualifying new drives. A combination of tmux, badblocks, and SMART tests allows every relevant bit of an HDD platter to be tested and, depending on how long you run it, gives you reasonable confidence that the drive will work in a production environment. I do this testing with all new drives, pull them, then sleeve them back up in their space blankets and await the next drive failure.
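
The core of it looks something like this (a sketch; the device name is a placeholder, and note that badblocks -w destroys everything on the disk):

    smartctl -t long /dev/da14        # extended SMART self-test first
    badblocks -b 4096 -ws /dev/da14   # destructive write/verify pass over every block
    smartctl -a /dev/da14             # recheck reallocated/pending sector counts after

Run it inside tmux so a dropped SSH session doesn't kill a multi-day test.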
 