How at risk is my data if I didn't burn in the hard drives?

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
I recently upgraded my server from eight 3TB drives to eight 10TB drives (shucked WD whites). For almost a year my server was at 93%+ capacity and I was getting constant warnings, so rather than taking the time to properly burn in the drives, I got to work resilvering them one by one until I was done and my pool capacity expanded.

The pool is RAID-Z2; short and long SMART tests run twice a month automatically, and the volume is scrubbed once a month. The drives have been in use for a month and I haven't noticed any issues yet.

Should I try burning in the drives now that they're in use? What will errors look like (during normal use) if one or more of the drives are bad?

Thanks
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You're not at a real high risk, but you'll be better off once you get past 1,000 hours, which you almost are. Hard drives are most at risk for "infant mortality".

You can safely run solnet-array-test, which is a nondestructive read burn-in tester, on an active array -- as long as you don't happen to lose three drives in the process. You ideally want to be able to return drives to a retailer for exchange, because an exchange will usually get you a brand-new drive. If you have to RMA a drive to the manufacturer, you are very likely to get a refurbished drive back. So stress-testing now is a good idea.
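
If you just want to see what a nondestructive read pass looks like at the device level, here is a minimal sketch of the same idea using plain dd (this is not solnet-array-test itself, and the da0..da7 names are examples, so adjust for your system):
Code:
# Read-only sequential pass over every pool member in parallel.
# dd never writes here, so it is safe on a live pool, but it will
# compete with normal pool I/O while it runs.
for d in da0 da1 da2 da3 da4 da5 da6 da7; do
    dd if=/dev/${d} of=/dev/null bs=1m &
done
wait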

You can run short tests every four or six hours and long tests once or twice a week for much more thorough early detection of problems.
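
The schedules themselves are configured in the FreeNAS GUI, but for a quick manual spot check from the shell, smartctl does the job (da0 is just an example device):
Code:
smartctl -t short /dev/da0      # short self-test, a couple of minutes
smartctl -t long /dev/da0       # long self-test, many hours on a 10TB drive
smartctl -l selftest /dev/da0   # self-test log; look for "Completed without error"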
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Should I try burning in the drives now that they're in use?
The burn-in process is a data-destructive process. Now that you have data on the drives, you can't do a burn-in unless you are prepared to move the data elsewhere for the duration of the test, which would probably run six or more days on 10 TB drives.
What will errors look like (during normal use) if one or more of the drives are bad?
I have (at work) a new server with 10 TB Seagate Exos drives, Exos being another naming convention that Seagate has developed to confuse consumers. In that server, with a population of sixty (yes, 60) drives, only one drive had a single bad sector show up during testing. The point being that the failure statistics are, in a way, on your side. Historically, you might see around 5% of drives (5 out of 100) fail in the 'infant mortality' period, but the population of your system is only eight drives. Statistically, you probably won't see a failure, but you might; there is no way to know. That is the reason for doing the burn-in: to stress the drives and see if any of them break before the data is on them.

What I got on that batch of drives in the new server at work was "just a bad sector," and a bad sector isn't the end of the world, but I would generally suggest replacing the drive for it, because one bad sector usually leads to more.
That is the most likely error, and FreeNAS should alert you when it happens, even if it is five years from now. What to do at that point will depend on your intentions: if you have filled the array again by then, you might want to replace all the drives with larger ones; by that point it might be common to see 20TB SSDs, who knows.
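
For reference, the destructive burn-in usually meant here is a write/read pass over the raw disk with badblocks, run before the drive ever joins a pool. A minimal sketch, assuming the drive under test is da0 and holds nothing you care about, because this wipes it completely:
Code:
# Four write-and-verify pattern passes over the whole disk; DESTROYS all data on da0.
# -b 4096 sets a 4K block size (needed on large drives), -w enables the write test,
# -s shows progress.
badblocks -b 4096 -ws /dev/da0
With four patterns, each written and then read back across the whole drive, that lines up with the six-plus days mentioned above.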
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
PS. Here is an informative video that talks about that initial burn-in of the drives:
https://www.youtube.com/watch?v=9bh5ZK8z4ZA

He even points out a script that will automate the process of testing all the drives at once.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
Thanks guys, sounds like I should be ok... Apparently I have 45 days to return the drives though, so if I decide to test them I do have a couple weeks left to exchange them if necessary. I had seen the solnet-array-test jgreco mentioned as a nondestructive tester. Something to think about. Although I'm wondering how long solnet-array-test will take on 8x10TB drives.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
It will take a while... something that works best when you have multiple old servers around to do burn-in testing on while you use your current server to... serve things. The closest I come to a burn-in rig is an external plug-in dock with an eSATA connection, allowing me to capture SMART errors as I torture a drive... but it takes a while and you only get to torture one drive at a time. The tools that jgreco and Chris Moore mention are clearly superior to this approach... if you have the bays to slap these drives into and/or a spare server.

I suppose this is another reason to acquire a used 4U, 48-bay unit like Chris. :)
 
Last edited:

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
The burn-in process is a data-destructive process. Now that you have data on the drives, you can't do a burn-in,

No, this isn't true. You do not need to use a data-destructive process for burn-in. It may be *better* to test with a data-destructive test, but many of the most common failures will reveal themselves with a read-only test process.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Thanks guys, sounds like I should be ok... Apparently I have 45 days to return the drives though, so if I decide to test them I do have a couple weeks left to exchange them if necessary. I had seen the solnet-array-test jgreco mentioned as a nondestructive tester. Something to think about. Although I'm wondering how long solnet-array-test will take on 8x10TB drives.

It'll be a while. It doesn't hurt to let it go until it finishes. The linear read time of a modern 8TB disk is, well, if we assume 200 MBytes/sec, 40,000 seconds, or half a day, and once you start doing seeks that throughput plummets.
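
Scaling the same back-of-the-envelope arithmetic to 10TB drives (200 MBytes/sec is an assumption, and sustained throughput drops toward the inner tracks, so treat this as a lower bound per pass):
Code:
# ~10 TB at ~200 MB/s sustained:
echo $(( 10 * 1000 * 1000 / 200 ))   # 50000 seconds, roughly 14 hours per read pass
If the reads run in parallel across all eight drives, one pass is bounded by the slowest disk rather than by the drive count.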
 

Holt Andrei Tiberiu

Contributor
Joined
Jan 13, 2016
Messages
129
Nice case, but it burns 820 W per PSU, and they aren't gold plated.
Chris, what is the total power consumption of yours?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Nice case, but it burns 820 W per PSU, and they aren't gold plated.

That's not how power supplies work. They'll only consume what's needed plus a small amount of overhead. It's definitely worth noting that you should probably feed it from two different 20A circuits though, just to be safe.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
That's not how power supplies work. They'll only consume what's needed plus a small amount of overhead. It's definitely worth noting that you should probably feed it from two different 20A circuits though, just to be safe.
My system has three of the four power supply bays populated, and it could probably run from two of the three with ease, but that causes the power supply failure alarm to go off. I also have a separate UPS for each supply to ensure that a UPS failure does not bring the system down.
Nice case, but it burns 820 W per PSU,
My actual consumption is between 95 watts and 180 watts (measured at the UPS), depending on the activity of the system. Mostly it is on the low side of that; it only peaks during a scrub or some other disk-intensive task. It is mostly the drives that burn the power. I could go with fewer drives, but there is a trade-off between the number of drives and the performance of the system. I actually want more drives, not for the capacity but for the speed. I just can't rationalize the cost in power and heat generation for a home system that is just for my family. The whole thing is a bit overkill for home.
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
On the topic of "my drives have been scrubbing and SMART testing fine for the past month," I'm not sure they actually have been tested properly :/ ... My pool is 30% full:
[attached screenshot: pool usage]

According to zpool status, a scrub of volume1 supposedly only takes 5 hours 38 minutes to complete. Does that sound right? I feel like when I had 8x6TB drives, it would take a day or two to complete.

Also, I thought long and short SMART tests had been running two and four times a month respectively, but when I manually checked the results via the CLI, it said they had not been completed. I checked the GUI, and according to the GUI sidebar, the SMART tests were set to be conducted on "(ada0 da0 da0 da0 da0 da0 da0 da0 da0)" (ada0 is my boot drive). It wasn't until I edited the tests and re-selected all the drives that it rectified itself and listed them as "(ada0 da0 da1 da2 da3 da4 da5 da6 da7)".
[attached screenshot: SMART test settings]
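
For anyone wanting to double-check the same things from the CLI, this is roughly what it looks like (pool and device names as in my setup):
Code:
zpool status volume1             # the "scan:" line shows when the last scrub ran and how long it took
smartctl -l selftest /dev/da1    # per-disk self-test history; repeat for da0 through da7 and ada0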


Thoughts?
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
According to zpool status, a scrub of volume1 supposedly only takes 5 hours 38 minutes to complete. Does that sound right? I feel like when I had 8x6TB drives, it would take a day or two to complete.
Pool size isn't a factor at all in scrub times; they depend (among other things) on how much data is actually on the pool. But if your pool said it completed a scrub in 5:38, then it did.
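
A quick way to see what a scrub actually has to read (assuming the pool is still named volume1):
Code:
zpool list volume1   # the ALLOC column, not SIZE, is what a scrub has to traverse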
 

EvanVanVan

Patron
Joined
Feb 1, 2014
Messages
211
Pool size isn't a factor at all in scrub times; they depend (among other things) on how much data is actually on the pool. But if your pool said it completed a scrub in 5:38, then it did.
OK, I think I basically knew that... I was more just wondering, in combination with the second issue (where the drives didn't appear to be selected properly for the SMART tests), whether it was possible the pool wasn't fully selected or something and it wasn't scrubbing everything it should. But I guess not... unlike the 8 individual drives, there is only 1 pool.
 
Last edited: