Hard Drive Burn-In Testing - Discussion Thread

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Would depend on your device being an active GEOM device or not.
 

Journer

Dabbler
Joined
Jun 20, 2017
Messages
17
For future folks reading this thread, I just finished testing six WD 8TB drives (WD80EFZX).
I ran the tests from the Spearfoot burn-in script (https://github.com/Spearfoot/disk-burnin-and-testing), but skipped the first extended SMART test. I ran it with kern.geom.debugflags=0x10 set.

The entire test took 6 days; ~5d of which was running badblocks.

Took quite a while, but the raw counts for the relevant smart outputs are all zero, so I've got confidence the drives are ok - at least for now.

Precise times:
Tue Aug 29 09:24:25 Start badblocks
Sun Sep 3 12:21:19 End badblocks, start smart short + extended
Mon Sep 4 12:01:20 Totally finished
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
5 days of non-stop hard drive testing for badblocks, yup, large drives take quite a bit longer. This is the future I have to look forward to. Yikes!
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Yeppers. If you have another FreeNAS with lots of bays... test the drives in that while it's doing its NAS thing, and test the mem/cpu in the other build ;)

...

actually my thermals went too high, so I ended up doing it in the new build...

And I did the long test before/after too...

Used @Spearfoot 's burn-in script at least, so it was stupid-easy and generated nice logs to be filed away on the pool when it was created.
 

wblock

Documentation Engineer
Joined
Nov 14, 2014
Messages
1,506
Curious about this:
sysctl kern.geom.debugflags=0x10

This command defeats the safety that prevents writing to parts of a disk that are in use (by having been mounted, generally).

It is the computer equivalent of removing the safety equipment from power machinery. Experts will say sometimes that is justified and the only way to accomplish particular tasks. Other people routinely defeat safety equipment, by rote, and think nothing of it until they are injured.
 

Jason B

Dabbler
Joined
Dec 8, 2016
Messages
37
jgreco did a nice system build/test/burn-in guide here, but I (and many others) found the details a bit lacking in the hard drive section. He mentions S.M.A.R.T. tests, but doesn't go over how to run them, or how to view the results, etc. and then just kinda throws around dd commands without a lot of explanation there either. Yes, this information is available elsewhere, but for somebody (such as myself) looking for a single cohesive guide to burn-in testing, I figured it'd be nice to have all of the info in one place to just follow, with relevant commands. So, having worked my way through reading around and doing my own testing, here's a little more n00b-friendly guide, written by a n00b, so please feel free to chime in with suggestions or criticisms if you have any. I'm basing this guide more off of cyberjock's post here than jgreco's guide.

UPDATE: Thanks to cyberjock, I've updated the section on badblocks to include instructions for using tmux to test all drives in parallel. Considering that badblocks with default settings takes over 24 hours for a 2TB drive, that should significantly decrease testing times, especially for large arrays.

First of all, the S.M.A.R.T. tests. The first thing that someone unfamiliar with S.M.A.R.T. tests might find strange is the fact that no results are shown when you run the test. The way these tests work is that you initiate the test, it goes off and does its thing, then it records the results for you to check later. So, if this is an initial burn-in test for your entire system, you can initiate tests on all of the drives simultaneously by simply issuing the test command for each drive one after another.

The first test to run is a short self-test:
Code:
smartctl -t short /dev/adaX


It should indicate that the test will take about 5 minutes. You can immediately begin the same test on the next drive, but you can only run one test on each drive at a time. Once it has completed, run a conveyance test:
Code:
smartctl -t conveyance /dev/adaX


Again, wait for the test to complete (about 2 minutes this time). Finally, a long test:
Code:
smartctl -t long /dev/adaX
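
Since the same command is issued once per drive, the steps above can be wrapped in a small loop. This is only a minimal sketch and not part of the original guide: the DRIVES list, the DRY_RUN switch, and both function names are assumptions made for illustration. With DRY_RUN=1 (the default here) it only prints the commands it would run. `smartctl -l selftest` prints the drive's self-test log, which is one way to see whether a test has finished.

```shell
#!/bin/sh
# Sketch: start the same SMART self-test on several drives in one go.
# DRIVES and DRY_RUN are hypothetical names chosen for this example.
DRIVES="${DRIVES:-ada0 ada1 ada2}"   # adjust to your own device names
DRY_RUN="${DRY_RUN:-1}"              # 1 = only print the commands

start_selftest() {
    # $1 = test type: short, conveyance, or long
    for d in $DRIVES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "smartctl -t $1 /dev/$d"
        else
            smartctl -t "$1" "/dev/$d"
        fi
    done
}

show_selftest_log() {
    # Print (or, in dry-run mode, show the command for) each drive's self-test log.
    for d in $DRIVES; do
        if [ "$DRY_RUN" = "1" ]; then
            echo "smartctl -l selftest /dev/$d"
        else
            smartctl -l selftest "/dev/$d"
        fi
    done
}

start_selftest short
```

Once the printed commands look right, set DRY_RUN=0 and call show_selftest_log after the estimated time has passed, before starting the next test type.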


Now, before we can perform raw disk I/O, we need to set the kernel's GEOM debug flags (GEOM is FreeBSD's storage framework; the name does not stand for "geometry"). This setting removes the protection against writing to disks that are in use, so it carries real risk and should probably not be done on a production system. It does not survive a reboot, so when you're done, just reboot the machine to disable it:
Code:
sysctl kern.geom.debugflags=0x10


Now that we can execute raw I/O, run a badblocks r/w test.

Unlike the S.M.A.R.T. tests, badblocks runs in the foreground, so once you start it, you won't be able to use the console until the test completes. It also means that if you start it over SSH and lose your connection, the test will be canceled. The answer to this is to use a utility called tmux:
Code:
tmux


You should now see a green stripe at the bottom of the screen. Now, we can run badblocks. THIS TEST WILL DESTROY ANY DATA ON THE DISK SO ONLY RUN THIS ON A NEW DISK WITHOUT DATA ON IT OR BACK UP ANY DATA FIRST:
Code:
badblocks -ws /dev/adaX


badblocks also offers a non-destructive read-write test that (in theory) shouldn't damage any existing data, but if you do choose to run it on a production drive and suffer data loss, on your own head be it:
Code:
badblocks -ns /dev/adaX



It has been brought to my attention that badblocks has some limitations with larger drives >2TB. The easy workaround is to manually specify a larger block size for the test.

Code:
badblocks -b 4096 -ws /dev/adaX

or
Code:
badblocks -b 4096 -ns /dev/adaX


Once you've started the first test, press Ctrl+B, then " (the double-quote key, not the single quote twice). You should now see a half-white, half-green line through the screen (in PuTTY, it's q's instead of a line, but same thing) with the test continuing in the top half of the screen and a new shell prompt in the bottom. Run the badblocks command again on the next disk, then press Ctrl+B, " again to create another shell. Continue until you've started a test on each disk. If you are connecting over SSH and your session gets disconnected, all of the tests will continue running. When you reconnect, to resume the session and view the test status, simply type:
Code:
tmux attach


As with the S.M.A.R.T. tests, you can only run one test at a time per drive, but you can test all of your drives simultaneously. In my experience, the tests run just as fast with all drives testing as with a single drive, so for your initial burn-in, there's really no reason not to test all of the drives at once. Also, be prepared for this test to take a very long time, as it is basically the "meat and potatoes" of your burn-in process. For reference, the default 4-pass r/w test took a little over 24 hours on my WD Red 2TB drives, YMMV.
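
The pane-by-pane procedure above can also be scripted. This is a rough sketch under stated assumptions, not the method from the guide: the SESSION and DRIVES values, the DRY_RUN switch, and the helper names are all made up for illustration, and it uses one tmux window per drive rather than split panes. The badblocks command is the destructive `-ws` test with a 4096-byte block size; `-o` writes the list of bad blocks to a file. With DRY_RUN=1 (the default here) nothing is executed, the commands are only printed.

```shell
#!/bin/sh
# Sketch: one detached tmux session with one window per drive.
# SESSION, DRIVES and DRY_RUN are hypothetical names for this example.
SESSION="${SESSION:-burnin}"
DRIVES="${DRIVES:-ada0 ada1}"   # adjust to your own device names
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands

run() {
    # Print the command in dry-run mode, otherwise execute it.
    if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi
}

start_badblocks_session() {
    first=1
    for d in $DRIVES; do
        # DESTRUCTIVE write test; -o saves the bad block list to a file
        cmd="badblocks -b 4096 -ws -o badblocks-$d.log /dev/$d"
        if [ "$first" = "1" ]; then
            run tmux new-session -d -s "$SESSION" -n "$d" "$cmd"
            first=0
        else
            run tmux new-window -t "$SESSION:" -n "$d" "$cmd"
        fi
    done
}

start_badblocks_session
```

A window started this way closes when its command exits, which is why the sketch keeps the bad block list in a file. `tmux attach -t burnin` reconnects to the session while the tests are running.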

Because S.M.A.R.T. tests only passively detect errors after you've actually attempted to read or write a bad sector, you should run the S.M.A.R.T. long test again after badblocks completes:
Code:
smartctl -t long /dev/adaX


At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:
Code:
smartctl -A /dev/adaX


This should produce something like this:

Code:
[root@freenas] ~# smartctl -A /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       357
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


Some of the more important fields right now include the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable lines. All of these should have a RAW_VALUE of 0. The VALUE field (200 here) is a normalized health score that the drive compares against THRESH, not a count, so it is the RAW_VALUE that matters: as long as it is 0 for each of these fields, there are currently no bad sectors. Any result greater than 0 on a new drive should be cause for an immediate RMA.
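
With several drives, reading each table by eye gets tedious, so the check on those three attributes can be automated. This is a minimal sketch that assumes the column layout shown in the sample output (attribute name in the second column, RAW_VALUE as a plain number in the last one). The helper name check_attrs is made up for this example, and the sample lines fed to it are copied from the output above.

```shell
#!/bin/sh
# Sketch: flag non-zero raw values for the three attributes named above.
# check_attrs is a hypothetical helper; it reads `smartctl -A` output on stdin.
check_attrs() {
    awk '
        $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" || $2 == "Offline_Uncorrectable" {
            seen++
            if ($NF + 0 != 0) { bad++; print "FAIL " $2 " raw=" $NF }
            else              { print "ok   " $2 " raw=" $NF }
        }
        END {
            # Treat a missing attribute as a failure rather than a silent pass.
            if (seen < 3) print "WARN only " (seen + 0) " of 3 attributes found"
            exit ((bad > 0 || seen < 3) ? 1 : 0)
        }
    '
}

# Demo on three lines from the sample output; all raw values are 0.
printf '%s\n' \
    '  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0' \
    '197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0' \
    '198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0' \
    | check_attrs
```

On a real system the input would come from the drive itself, for example `smartctl -A /dev/ada0 | check_attrs`. The function returns a non-zero exit status if any of the three raw values is above 0 or if an attribute is missing from the output.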

Once all of your tests have completed, you should reboot your system (or set kern.geom.debugflags back to 0) to restore the GEOM write protection.

OK, so I am logged on to my FreeNAS server using the web interface and following along with the hard drive burn-in testing guide. I have opened a shell window and executed the first burn-in command.

smartctl -t short /dev/adaX

After entering this command, a few messages are displayed, one of which is "Please wait 1 minute for the test to complete". Then the command returns immediately to the shell prompt. I guess I was kind of expecting the SMART test process to run in the shell window showing output from the test. As you can see I haven't used this OS before (I haven't used Unix in 20 years nor Linux in 10 years). So I assume the SMART test process is running in the background. Where can I see the state, or status, of this SMART test? Ok, I just saw the Display System Processes and Reporting > Disk tabs in the Web Interface. However, it's been well over a minute since I entered the command and I don't know what process name to look for. I think the steps in this guide should talk about how to determine the state of the test or process rather than stating "Once it has completed...". Also, monitoring processes to see when they disappear and monitoring disk utilization doesn't really tell you anything about the state of the test that is running.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
The process is running on the hard drive.

smartctl -a /dev/adaX

To see progress/results
 
Deleted47050

Guest
I think the steps in this guide should talk about how to determine the state of the test or process rather than stating "Once it has completed...".

The original guide explains that already:

At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:
Code:
smartctl -A /dev/adaX
 

guermantes

Patron
Joined
Sep 27, 2017
Messages
213
So, did I get this right?
If I ssh into FreeNAS from my (Linux) desktop, and then initiate tmux and start badblocks, can I power down the desktop over night and tmux attach in the morning (or whenever)?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Yes
 

Benc

Dabbler
Joined
Nov 5, 2015
Messages
37
I just finished burn-in of 5 new Ironwolf 4TB disks.
On four disks Raw_Read_Error_Rate was high, but this seems to be normal for Ironwolf. Seek_Error_Rate was high on all five disks.

But why is Raw_Read_Error_Rate 0 on one of them? If high numbers are normal, I wonder what 0 means.
Otherwise, all other important parameters mentioned in the OP were 0. Should I try any other test, or is it safe to continue with installation?
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Normally, on Seagate drives the raw read and seek error rate fields show you the count of raw reads/seeks and the errors, in an encoded form. Which is neat.

So, zero means no seeks or reads, which seems unlikely, or perhaps a different firmware.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Odd, but like @Stux said, maybe a firmware difference. It's too bad they were not all zero, but having a value in these fields does not mean anything bad if all the other results are good. So if you ran a SMART Long/Extended test at the end and all the critical parameters are good, then I'd say move forward and continue with your installation.
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Great guide. Just received 4 new WD Reds (4TB) that I need to burn in and the only way I can do this is on a live system.

The disk is not in a pool or vdev, just attached to the system.
I know the command "sysctl kern.geom.debugflags=0x10" removes some safety features, but my question is:
Can I safely do this and then test the one drive with badblocks or will this negatively affect the other drives that are actively running in a raidz2 vdev?

Unfortunately my options are extremely limited and this is the only hardware I can burn in on. I would like it if I could burn in the disk and keep the existing vdev up and running, but I don't want to harm my current pool by doing this. Can someone please advise?

Thanks everyone!
 
Deleted47050

Guest
This was written just a few posts ago:

This command defeats the safety that prevents writing to parts of a disk that are in use (by having been mounted, generally).

It is the computer equivalent of removing the safety equipment from power machinery. Experts will say sometimes that is justified and the only way to accomplish particular tasks. Other people routinely defeat safety equipment, by rote, and think nothing of it until they are injured.

So I guess it's up to you to decide if this is worth the risk. If you don't have other hardware to do this on though, you don't have much of a choice.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I know the command "sysctl kern.geom.debugflags=0x10" removes some safety features, but my question is:
Can I safely do this and then test the one drive with badblocks or will this negatively affect the other drives that are actively running in a raidz2 vdev?
Yes, you can do this without impact to your system, meaning that FreeNAS or ZFS does not go crazy and try to write data to protected areas. So long as you are not trying to manually do anything else, you will be fine. Always ensure that you are running the burn-in on the proper drive; use the serial number to confirm. What you do not want is to start deleting your current pool.

Your system may slow down slightly while you do this but I doubt it would be perceivable.

If you wanted to take all the risk out of possible damage to your current pool, then I'd power down the box, disconnect the drives for that pool, and leave only your new drives connected. This does mean that your data is not available during this process, but it does give you the peace of mind that you are not going to screw something up by accident.
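
The serial number check mentioned above can be done with `smartctl -i`, which prints a "Serial Number:" line for the device. This is a small sketch of that idea, not something from the thread: the helper names serial_of and confirm_serial are made up, and the sample output and serial WD-EXAMPLE123 are invented for the demo.

```shell
#!/bin/sh
# Sketch: confirm a device's serial number before a destructive test.
# serial_of and confirm_serial are hypothetical helper names.

serial_of() {
    # Reads `smartctl -i` output on stdin and prints the serial number.
    awk -F': *' '/^Serial Number:/ { print $2; exit }'
}

confirm_serial() {
    # $1 = serial you expect (from the drive label), $2 = serial reported
    if [ -n "$2" ] && [ "$1" = "$2" ]; then
        echo "match: $2"
    else
        echo "MISMATCH: expected $1, got ${2:-nothing}"
        return 1
    fi
}

# Demo with made-up output; on a real system use something like:
#   confirm_serial WD-EXAMPLE123 "$(smartctl -i /dev/adaX | serial_of)"
sample='Device Model:     WDC WD40EFRX-68N32N0
Serial Number:    WD-EXAMPLE123'
confirm_serial WD-EXAMPLE123 "$(printf '%s\n' "$sample" | serial_of)"
```

Running badblocks only after confirm_serial succeeds is one way to avoid pointing a write test at a drive that belongs to the live pool.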
 

nojohnny101

Wizard
Joined
Dec 3, 2015
Messages
1,478
Thanks @joeschmuck, that is what I was looking for.

Maybe I will just power everything down and keep my data safe that way. I can live without my media for a couple of days.

Thanks!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
This is what I'll be doing myself this Friday when I get my new hard drives, but only because I don't have enough SATA ports available. I can survive as well without my NAS for a few days. Of course, if I find a spare computer in my basement then I think I'll use that to test my drives; I just don't know if I have one lying around and a power supply. Guess I'll look when I get home today.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
It’s fairly safe. Just don’t go badblock testing your boot or data drives.

You can re-enable the safety when you’re done, too.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994