Hard Drive Burn-In Testing - Discussion Thread

Stux · Aug 20, 2017

Would depend on your device being an active GEOM device or not.

Journer · Sep 4, 2017

For future folks reading this thread, I just finished testing six WD 8TB drives (WD80EFZX).
I ran the tests executed by this script: https://github.com/Spearfoot/disk-burnin-and-testing, but I skipped the first extended SMART test. I ran it with kern.geom.debugflags=0x10 set.

The entire test took 6 days; ~5d of which was running badblocks.

Took quite a while, but the raw counts for the relevant smart outputs are all zero, so I've got confidence the drives are ok - at least for now.

Precise times:
Tue Aug 29 09:24:25 Start badblocks
Sun Sep 3 12:21:19 End badblocks, start smart short + extended
Mon Sep 4 12:01:20 Totally finished

joeschmuck · Sep 5, 2017

5 days of non-stop hard drive testing for badblocks, yup, large drives take quite a bit longer. This is the future I have to look forward to, Yikes!

Stux · Sep 5, 2017

joeschmuck said:
5 days of non-stop hard drive testing for badblocks, yup, large drives take quite a bit longer. This is the future I have to look forward to, Yikes!

Yeppers. If you have another FreeNAS with lots of bays... test the drives in that while it's doing its NAS thing, and test the mem/cpu in the other build ;)

...

actually my thermals went too high, so I ended up doing it in the new build...

And I did the long test before/after too...

Used @Spearfoot 's burn-in script at least, so it was stupid-easy and generated nice logs to be filed away on the pool when it was created.

wblock · Sep 5, 2017

ViciousXUSMC said:
Curious about this:
sysctl kern.geom.debugflags=0x10

This command defeats the safety that prevents writing to parts of a disk that are in use (by having been mounted, generally).

It is the computer equivalent of removing the safety equipment from power machinery. Experts will say sometimes that is justified and the only way to accomplish particular tasks. Other people routinely defeat safety equipment, by rote, and think nothing of it until they are injured.

Jason B · Sep 8, 2017

qwertymodo said:
jgreco did a nice system build/test/burn-in guide here, but I (and many others) found the details a bit lacking in the hard drive section. He mentions S.M.A.R.T. tests, but doesn't go over how to run them, or how to view the results, etc. and then just kinda throws around dd commands without a lot of explanation there either. Yes, this information is available elsewhere, but for somebody (such as myself) looking for a single cohesive guide to burn-in testing, I figured it'd be nice to have all of the info in one place to just follow, with relevant commands. So, having worked my way through reading around and doing my own testing, here's a little more n00b-friendly guide, written by a n00b, so please feel free to chime in with suggestions or criticisms if you have any. I'm basing this guide more off of cyberjock's post here than jgreco's guide.

UPDATE: Thanks to cyberjock, I've updated the section on badblocks to include instructions for using tmux to test all drives in parallel. Considering that badblocks with default settings takes over 24 hours for a 2TB drive, that should significantly decrease testing times, especially for large arrays.

First of all, the S.M.A.R.T. tests. The first thing that someone unfamiliar with S.M.A.R.T. tests might find strange is the fact that no results are shown when you run the test. The way these tests work is that you initiate the test, it goes off and does its thing, then it records the results for you to check later. So, if this is an initial burn-in test for your entire system, you can initiate tests on all of the drives simultaneously by simply issuing the test command for each drive one after another.

The first test to run is a short self-test:

Code:
smartctl -t short /dev/adaX

It should indicate that the test will take about 5 minutes. You can immediately begin the same test on the next drive, but you can only run one test on each drive at a time. Once it has completed, run a conveyance test:

Code:
smartctl -t conveyance /dev/adaX

Again, wait for the test to complete (about 2 minutes this time). Finally, a long test:

Code:
smartctl -t long /dev/adaX

Now, before we can perform raw disk I/O, we need to enable the kernel GEOM debug flags (GEOM is FreeBSD's disk I/O framework). This carries some inherent risk, and should probably not be done on a production system. The setting does not survive a reboot, so when you're done, just reboot the machine to disable it:

Code:
sysctl kern.geom.debugflags=0x10

Now that we can execute raw I/O, run a badblocks r/w test.

Unlike the S.M.A.R.T. tests, badblocks runs in the foreground, so once you start it, you won't be able to use the console until the test completes. It also means that if you start it over SSH and lose your connection, the test will be canceled. The answer to this is to use a utility called tmux:

Code:
tmux

You should now see a green stripe at the bottom of the screen. Now, we can run badblocks. THIS TEST WILL DESTROY ANY DATA ON THE DISK SO ONLY RUN THIS ON A NEW DISK WITHOUT DATA ON IT OR BACK UP ANY DATA FIRST:

Code:
badblocks -ws /dev/adaX

badblocks also offers a non-destructive read-write test that (in theory) shouldn't damage any existing data, but if you do choose to run it on a production drive and suffer data loss, on your own head be it:

Code:
badblocks -ns /dev/adaX

It has been brought to my attention that badblocks has some limitations with larger drives >2TB. The easy workaround is to manually specify a larger block size for the test.

Code:
badblocks -b 4096 -ws /dev/adaX

or

Code:
badblocks -b 4096 -ns /dev/adaX
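For picking a block size, here is a rough sketch of the arithmetic. It assumes the limitation is that badblocks needs the number of blocks on the disk to fit in a 32-bit value (fewer than 2^32 blocks); the guide itself only says that drives >2TB are affected. The function name min_badblocks_bs is made up for illustration.

```shell
# Print the smallest power-of-two block size, starting from the 1024-byte
# badblocks default, for which a disk of the given size in bytes has
# fewer than 2^32 blocks.
# ASSUMPTION: the limit is a 32-bit block count; this is not stated in the thread.
min_badblocks_bs() {
    size_bytes=$1
    bs=1024
    # Double the block size until the block count drops below 2^32.
    while [ $(( size_bytes / bs )) -ge 4294967296 ]; do
        bs=$(( bs * 2 ))
    done
    echo "$bs"
}
```

For an 8 TB drive, `min_badblocks_bs 8000000000000` prints 2048, so under that assumption the 4096 used above leaves some headroom.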

Once you've started the first test, press Ctrl+B, then " (the double-quote key, not the single quote twice). You should now see a half-white, half-green line through the screen (in PuTTY, it's q's instead of a line, but same thing) with the test continuing in the top half of the screen and a new shell prompt in the bottom. Run the badblocks command again on the next disk, then press Ctrl+B, " again to create another shell. Continue until you've started a test on each disk. If you are connecting over SSH and your session gets disconnected, all of the tests will continue running. When you reconnect, to resume the session and view the test status, simply type:

Code:
tmux attach

As with the S.M.A.R.T. tests, you can only run one test at a time per drive, but you can test all of your drives simultaneously. In my experience, the tests run just as fast with all drives testing as with a single drive, so for your initial burn-in, there's really no reason not to test all of the drives at once. Also, be prepared for this test to take a very long time, as it is basically the "meat and potatoes" of your burn-in process. For reference, the default 4-pass r/w test took a little over 24 hours on my WD Red 2TB drives, YMMV.

Because S.M.A.R.T. tests only passively detect errors after you've actually attempted to read or write a bad sector, you should run the S.M.A.R.T. long test again after badblocks completes:

Code:
smartctl -t long /dev/adaX

At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:

Code:
smartctl -A /dev/adaX

This should produce something like this:

Code:
[root@freenas] ~# smartctl -A /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       357
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Some of the more important fields right now include the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable lines. All of these should have a RAW_VALUE of 0. I'm not sure why the VALUE field is listed as 200, but as long as the RAW_VALUE for each of these fields is 0, that means there are currently no bad sectors. Any result greater than 0 on a new drive should be cause for an immediate RMA.
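If you have several drives to check, the RAW_VALUE comparison can be scripted. The following is only a sketch: it assumes the attribute table has the ten-column layout shown in the sample output, with RAW_VALUE in the last column, and the function name check_bad_sector_attrs is made up for illustration.

```shell
# Read `smartctl -A` output on stdin and print any of the three
# bad-sector attributes whose RAW_VALUE (10th column) is not 0.
# Exit status is 0 only if all three attributes were found and are 0.
# ASSUMPTION: the table layout matches the sample output in this guide.
check_bad_sector_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            seen++
            if ($10 != 0) { print $2 " RAW_VALUE=" $10; bad = 1 }
        }
        # Treat a missing attribute as a failure instead of passing silently.
        END { exit (bad || seen < 3) ? 1 : 0 }
    '
}
```

Usage would look like `smartctl -A /dev/adaX | check_bad_sector_attrs && echo OK`. Failing when an attribute is missing is a deliberate choice here, so that empty or unexpected output is not mistaken for a healthy drive.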

Once all of your tests have completed, you should reboot your system to disable the kernel GEOM debug flags.

OK, so I am logged on to my FreeNAS server using the web interface and following along with the hard drive burn-in testing guide. I have opened a shell window and executed the first burn-in command.

smartctl -t short /dev/adaX

After entering this command, a few messages are displayed, one of which is "Please wait 1 minute for the test to complete". Then the command returns immediately to the shell prompt. I guess I was kind of expecting the SMART test process to run in the shell window showing output from the test. As you can see, I haven't used this OS before (I haven't used Unix in 20 years nor Linux in 10 years). So I assume the SMART test process is running in the background. Where can I see the state, or status, of this SMART test? OK, I just saw the Display System Processes and Reporting > Disk tabs in the web interface. However, it's been well over a minute since I entered the command and I don't know what process name to look for. I think the steps in this guide should talk about how to determine the state of the test or process rather than stating "Once it has completed...". Also, monitoring processes to see when they disappear and monitoring disk utilization doesn't really tell you anything about the state of the test that is running.

Stux · Sep 9, 2017

The test runs on the hard drive itself, not as a process on the system. Run

smartctl -a /dev/adaX

to see progress/results.
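While a self-test is running, the `smartctl -a` output includes a "Self-test execution status" section with a line such as "90% of test remaining." A minimal sketch for pulling out just that number, assuming that wording (the helper name selftest_remaining is hypothetical):

```shell
# Read `smartctl -a` output on stdin and print the percentage of the
# self-test still remaining, e.g. "90%". Prints "idle" when no
# "% of test remaining" line is present (no test in progress).
selftest_remaining() {
    awk '
        /% of test remaining/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+%$/) { print $i; found = 1; exit }
        }
        END { if (!found) print "idle" }
    '
}
```

For example, `smartctl -a /dev/adaX | selftest_remaining`. Once a test has finished, `smartctl -l selftest /dev/adaX` lists the logged results of completed self-tests.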

Deleted47050 · Sep 9, 2017

Jason B said:
I think the steps in this guide should talk about how to determine the state of the test or process rather than stating "Once it has completed...".

The original guide explains that already:

qwertymodo said:
At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:

Code:
smartctl -A /dev/adaX

guermantes · Oct 16, 2017

So, did I get this right?
If I ssh into FreeNAS from my (Linux) desktop, and then initiate tmux and start badblocks, can I power down the desktop overnight and tmux attach in the morning (or whenever)?

Stux · Oct 16, 2017

guermantes said:
So, did I get this right?
If I ssh into FreeNAS from my (Linux) desktop, and then initiate tmux and start badblocks, can I power down the desktop overnight and tmux attach in the morning (or whenever)?

Yes

Benc · Nov 19, 2017

I just finished burn-in of 5 new Ironwolf 4TB disks.
On four disks Raw_Read_Error_Rate was high, but this seems to be normal for Ironwolf. Seek_Error_Rate was high on all five disks.

But why is Raw_Read_Error_Rate 0 on one of them? If high numbers are normal, I wonder what 0 means.
Otherwise, all the other important parameters mentioned in the OP were 0. Should I try any other tests, or is it safe to continue with the installation?

Stux · Nov 19, 2017

Benc said:
I just finished burn-in of 5 new Ironwolf 4TB disks.
On four disks Raw_Read_Error_Rate was high, but this seems to be normal for Ironwolf. Seek_Error_Rate was high on all five disks.

But why is Raw_Read_Error_Rate 0 on one of them? If high numbers are normal, I wonder what 0 means.
Otherwise, all the other important parameters mentioned in the OP were 0. Should I try any other tests, or is it safe to continue with the installation?

Normally, on Seagate drives the raw read and seek error rate fields show you the count of raw reads/seeks and the errors, in an encoded form. Which is neat.

So, zero means no seeks or reads, which seems unlikely, or perhaps a different firmware.

joeschmuck · Nov 19, 2017

Benc said:
I just finished burn-in of 5 new Ironwolf 4TB disks.
On four disks Raw_Read_Error_Rate was high, but this seems to be normal for Ironwolf. Seek_Error_Rate was high on all five disks.

But why is Raw_Read_Error_Rate 0 on one of them? If high numbers are normal, I wonder what 0 means.
Otherwise, all the other important parameters mentioned in the OP were 0. Should I try any other tests, or is it safe to continue with the installation?

Odd, but like @Stux said, maybe a firmware difference. It's too bad they weren't all zero, but having a value in these fields does not mean anything bad if all the other results are good. So if you ran a SMART Long/Extended test at the end and all the critical parameters are good, then I'd say move forward and continue with your installation.

nojohnny101 · Nov 21, 2017

Great guide. Just received 4 new WD Reds (4TB) that I need to burn in and the only way I can do this is on a live system.

The disk is not in a pool or vdev, just attached to the system.
I know the command "sysctl kern.geom.debugflags=0x10" removes some safety features, but my question is:
Can I safely do this and then test the one drive with badblocks, or will this negatively affect the other drives that are actively running in a raidz2 vdev?

Unfortunately my options are extremely limited and this is the only hardware I can burn in on. I would like it if I could burn in the disk and keep the existing vdev up and running, but I don't want to harm my current pool by doing this. Can someone please advise?

Thanks everyone!

Deleted47050 · Nov 21, 2017

This was written just a few posts ago:

wblock said:
This command defeats the safety that prevents writing to parts of a disk that are in use (by having been mounted, generally).

It is the computer equivalent of removing the safety equipment from power machinery. Experts will say sometimes that is justified and the only way to accomplish particular tasks. Other people routinely defeat safety equipment, by rote, and think nothing of it until they are injured.

So I guess it's up to you to decide if this is worth the risk. If you don't have other hardware to do this on though, you don't have much of a choice.

joeschmuck · Nov 22, 2017

nojohnny101 said:
I know the command "sysctl kern.geom.debugflags=0x10" removes some safety features, but my question is:
Can I safely do this and then test the one drive with badblocks, or will this negatively affect the other drives that are actively running in a raidz2 vdev?

Yes, you can do this without impact to your system, meaning that FreeNAS or ZFS does not go crazy and try to write data to protected areas. So long as you are not trying to manually do anything else then you will be fine. Always ensure that you are doing the burnin to the proper drive, use the serial number. What you do not want is to start deleting your current pool.
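One way to double-check the serial number before pointing badblocks at a device is to read it from `smartctl -i` and compare it with the label on the drive. This is only a sketch that extracts the "Serial Number:" line from that output; the helper name serial_from_info is hypothetical.

```shell
# Read `smartctl -i` output on stdin and print only the serial number,
# i.e. the text after the "Serial Number:" label.
serial_from_info() {
    sed -n 's/^Serial Number:[[:space:]]*//p'
}
```

For example, `smartctl -i /dev/adaX | serial_from_info` prints the serial of that device, which can be compared against the new drive's label before any destructive test is started.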

Your system may slow down slightly while you do this but I doubt it would be perceivable.

If you wanted to take all the risk out of possible damage to your current pool, then I'd power down the box, disconnect the drives for that pool, and have only your new drives connected. This does mean that your data is not available during this process, but it does give you the peace of mind that you are not going to screw something up by accident.

nojohnny101 · Nov 22, 2017

Thanks @joeschmuck, that is what I was looking for.

Maybe I will just power everything down and keep my data safe that way. I can live without my media for a couple of days.

Thanks!

joeschmuck · Nov 22, 2017

nojohnny101 said:
Thanks @joeschmuck, that is what I was looking for.

Maybe I will just power everything down and keep my data safe that way. I can live without my media for a couple of days.

Thanks!

This is what I'll be doing myself this Friday when I get my new hard drives but only because I don't have enough SATA ports available. I can survive as well without my NAS for a few days. Of course if I find a spare computer in my basement then I think I'll use that to test my drives, I just don't know if I have one laying around and a power supply. Guess I'll look when I get home today.

Stux · Nov 22, 2017

It’s fairly safe. Just don’t go badblock testing your boot or data drives.

You can re-enable it when you’re done too.

joeschmuck · Nov 22, 2017

Stux said:
You can re-enable it when you’re done too.

I generally reboot to re-enable.
