
Hard Drive Burn-In Testing - Discussion Thread

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
Mod note: This document has been ported over to the Resources section. To get to the document itself, just use the tabs above and click the "Overview" tab.

This thread remains as the discussion thread, as before. The original version of the document follows below, inside the spoiler tags.

- Ericloewe


jgreco did a nice system build/test/burn-in guide here, but I (and many others) found the details a bit lacking in the hard drive section. He mentions S.M.A.R.T. tests, but doesn't go over how to run them, or how to view the results, etc. and then just kinda throws around dd commands without a lot of explanation there either. Yes, this information is available elsewhere, but for somebody (such as myself) looking for a single cohesive guide to burn-in testing, I figured it'd be nice to have all of the info in one place to just follow, with relevant commands. So, having worked my way through reading around and doing my own testing, here's a little more n00b-friendly guide, written by a n00b, so please feel free to chime in with suggestions or criticisms if you have any. I'm basing this guide more off of cyberjock's post here than jgreco's guide.

UPDATE: Thanks to cyberjock, I've updated the section on badblocks to include instructions for using tmux to test all drives in parallel. Considering that badblocks with default settings takes over 24 hours for a 2TB drive, that should significantly decrease testing times, especially for large arrays.

First of all, the S.M.A.R.T. tests. The first thing that someone unfamiliar with S.M.A.R.T. tests might find strange is the fact that no results are shown when you run the test. The way these tests work is that you initiate the test, it goes off and does its thing, then it records the results for you to check later. So, if this is an initial burn-in test for your entire system, you can initiate tests on all of the drives simultaneously by simply issuing the test command for each drive one after another.

The first test to run is a short self-test:
Code:
smartctl -t short /dev/adaX


It should indicate that the test will take about 5 minutes. You can immediately begin the same test on the next drive, but you can only run one test on each drive at a time. Once it has completed, run a conveyance test:
Code:
smartctl -t conveyance /dev/adaX


Again, wait for the test to complete (about 2 minutes this time). Finally, a long test:
Code:
smartctl -t long /dev/adaX
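Since the same command gets issued once per drive, a small loop can save some typing. This is only a sketch: the drive names in DRIVES are placeholders for whatever your system actually has, and start_selftests is a made-up helper name, not part of smartmontools.

```shell
#!/bin/sh
# Sketch: start the same S.M.A.R.T. self-test on several drives in a row.
# DRIVES holds placeholder device names; replace them with your own.
DRIVES="${DRIVES:-ada0 ada1 ada2}"
# Command used to talk to the drives (overridable, e.g. for a dry run).
SMARTCTL="${SMARTCTL:-smartctl}"

# start_selftests TYPE  (TYPE is short, conveyance, or long)
start_selftests() {
    test_type="$1"
    for drive in $DRIVES; do
        # Returns right away; the drive runs the test in the background.
        "$SMARTCTL" -t "$test_type" "/dev/$drive"
    done
}

# Example (commented out so nothing runs by accident):
# start_selftests short
```

Wait for one test type to finish on every drive before starting the next, since each drive can only run one test at a time.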


------
Note added by @wblock 2018-01-10: this section recommended enabling the kern.geom.debugflags sysctl. Many people still think it has something to do with allowing raw writes. It does not. Instead, it disables a safety system that is intended to prevent writes to disks that are in use (say, by having a mounted filesystem). From man 4 geom:
0x10 (allow foot shooting)
Allow writing to Rank 1 providers. This would, for example,
allow the super-user to overwrite the MBR on the root disk or
write random sectors elsewhere to a mounted disk. The
implications are obvious.
To summarize, this option should generally not be needed. It only makes it possible to harm data. Any disk you are going to overwrite with data should not be mounted or have anything you wish to keep. In fact, best practice is to not be erasing or stress-testing drives on a system that has actual data on it. Since those disks will not have mounted filesystems, this sysctl will not affect being able to write to them. In fact, it will only make it possible to blow away things that are in use.
------
Now, before we can perform raw disk I/O, we need to enable the GEOM debug flags (the kern.geom.debugflags sysctl; GEOM is FreeBSD's disk framework, not "geometry").

This carries some inherent risk, and should probably not be done on a production system. This does not survive through a reboot, so when you're done, just reboot the machine to disable it:
Code:
sysctl kern.geom.debugflags=0x10


Now that we can execute raw I/O, run a badblocks r/w test.

Unlike the S.M.A.R.T. tests, badblocks runs in the foreground, so once you start it, you won't be able to use the console until the test completes. It also means that if you start it over SSH and lose your connection, the test will be canceled. The answer to this is to use a utility called tmux:
Code:
tmux


You should now see a green stripe at the bottom of the screen. Now, we can run badblocks. THIS TEST WILL DESTROY ANY DATA ON THE DISK SO ONLY RUN THIS ON A NEW DISK WITHOUT DATA ON IT OR BACK UP ANY DATA FIRST:
Code:
badblocks -ws /dev/adaX


badblocks also offers a non-destructive read-write test that (in theory) shouldn't damage any existing data, but if you do choose to run it on a production drive and suffer data loss, on your own head be it:
Code:
badblocks -ns /dev/adaX



It has been brought to my attention that badblocks has some limitations with drives larger than 2TB. The easy workaround is to manually specify a larger block size for the test:

Code:
badblocks -b 4096 -ws /dev/adaX

or
Code:
badblocks -b 4096 -ns /dev/adaX


Once you've started the first test, press Ctrl+B, then " (the double-quote key, not the single quote twice). You should now see a half-white, half-green line through the screen (in PuTTY, it's q's instead of a line, but same thing) with the test continuing in the top half of the screen and a new shell prompt in the bottom. Run the badblocks command again on the next disk, then press Ctrl+B, " again to create another shell. Continue until you've started a test on each disk. If you are connecting over SSH and your session gets disconnected, all of the tests will continue running. When you reconnect, to resume the session and view the test status, simply type:
Code:
tmux attach
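If pressing Ctrl+B, " over and over gets tedious, the same layout can probably be scripted. The following is a rough sketch, not something from the original guide: the session name "burnin", the DRIVES list, and the start_burnin_session helper are all made up for illustration, and it assumes your tmux supports the new-session, split-window, and select-layout commands.

```shell
#!/bin/sh
# Sketch: one detached tmux session with one pane per drive, each running badblocks.
# WARNING: with -w, badblocks destroys all data on every listed drive.
DRIVES="${DRIVES:-ada0 ada1 ada2}"        # placeholder device names
SESSION="${SESSION:-burnin}"              # made-up session name
BADBLOCKS_ARGS="${BADBLOCKS_ARGS:--ws}"   # e.g. "-b 4096 -ws" for large drives
TMUX_BIN="${TMUX_BIN:-tmux}"              # overridable, e.g. for a dry run

start_burnin_session() {
    first=yes
    for drive in $DRIVES; do
        # "exec sh" keeps the pane open so the final output stays readable.
        cmd="badblocks $BADBLOCKS_ARGS /dev/$drive; exec sh"
        if [ "$first" = yes ]; then
            # Create the session in the background with the first test.
            "$TMUX_BIN" new-session -d -s "$SESSION" "$cmd"
            first=no
        else
            "$TMUX_BIN" split-window -t "$SESSION" "$cmd"
            # Re-tile after every split so there is room for the next pane.
            "$TMUX_BIN" select-layout -t "$SESSION" tiled
        fi
    done
}

# Example (commented out because it is destructive):
# start_burnin_session && tmux attach -t burnin
```

Double-check the drive list before running anything like this, because every drive in it gets wiped.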


As with the S.M.A.R.T. tests, you can only run one test at a time per drive, but you can test all of your drives simultaneously. In my experience, the tests run just as fast with all drives testing as with a single drive, so for your initial burn-in, there's really no reason not to test all of the drives at once. Also, be prepared for this test to take a very long time, as it is basically the "meat and potatoes" of your burn-in process. For reference, the default 4-pass r/w test took a little over 24 hours on my WD Red 2TB drives, YMMV.

Because S.M.A.R.T. tests only passively detect errors after you've actually attempted to read or write a bad sector, you should run the S.M.A.R.T. long test again after badblocks completes:
Code:
smartctl -t long /dev/adaX
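Because the long test runs in the background, it helps to confirm that it has actually finished before reading any results. A minimal sketch, assuming your smartmontools build supports the self-test log option (-l selftest); the DRIVES list and the show_selftest_logs name are placeholders of mine.

```shell
#!/bin/sh
# Sketch: print the self-test log of every drive to see which tests have finished.
DRIVES="${DRIVES:-ada0 ada1 ada2}"   # placeholder device names
SMARTCTL="${SMARTCTL:-smartctl}"     # overridable, e.g. for a dry run

show_selftest_logs() {
    for drive in $DRIVES; do
        echo "=== /dev/$drive ==="
        # -l selftest prints the drive's log of self-tests and their status.
        "$SMARTCTL" -l selftest "/dev/$drive"
    done
}

# Example:
# show_selftest_logs
```

Each log entry should list the test type and a status, so a test that is still running or that ended with an error should stand out.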


At this point, you have fully tested all of your drives, and now it's time to view the results of the various S.M.A.R.T. tests:
Code:
smartctl -A /dev/adaX


This should produce something like this:

Code:
[root@freenas] ~# smartctl -A /dev/ada0
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   175   174   021    Pre-fail  Always       -       4208
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       357
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       9
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       9
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


Some of the more important fields right now include the Reallocated_Sector_Ct, Current_Pending_Sector, and Offline_Uncorrectable lines. All of these should have a RAW_VALUE of 0. The VALUE column is not a count: it is a normalized health score assigned by the drive vendor (higher is better), and it only signals a problem when it falls to or below THRESH, which is why it can read 200 on a healthy drive. As long as the RAW_VALUE for each of these fields is 0, there are currently no bad sectors. Any result greater than 0 on a new drive should be cause for an immediate RMA.
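With several drives, eyeballing three lines per drive gets old quickly, so the check can be handed to awk. This is a sketch that assumes the ten-column layout shown in the smartctl -A output above, with RAW_VALUE in the tenth column; check_smart_attrs is a name I made up.

```shell
#!/bin/sh
# Sketch: read `smartctl -A` output on stdin and flag bad-sector counters.
# Prints one line per watched attribute whose RAW_VALUE is not 0 and
# returns a non-zero exit status if any were found.
check_smart_attrs() {
    awk '
        BEGIN { bad = 0 }
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            # Column 10 is RAW_VALUE in the layout assumed here.
            if ($10 + 0 != 0) {
                print $2 " raw value is " $10
                bad = 1
            }
        }
        END { exit bad }
    '
}

# Example:
# smartctl -A /dev/ada0 | check_smart_attrs && echo "ada0 looks clean"
```

If a drive reports these attributes under different names or in a different column, the filter would need adjusting.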

Once all of your tests have completed, you should reboot your system to disable the GEOM debug flags (kern.geom.debugflags).
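If a reboot is inconvenient, the flag can most likely be cleared by hand instead. This is a sketch under the assumption that the normal value of kern.geom.debugflags on your system is 0; the reset_geom_debugflags name is made up.

```shell
#!/bin/sh
# Sketch: show the current GEOM debug flags, then clear them without a reboot.
SYSCTL="${SYSCTL:-sysctl}"   # overridable, e.g. for a dry run

reset_geom_debugflags() {
    # Print the current value first so you can see what is being changed.
    "$SYSCTL" kern.geom.debugflags
    # Assumption: 0 is the default, i.e. the "foot shooting" protection is on.
    "$SYSCTL" kern.geom.debugflags=0
}

# Example (needs root on FreeBSD/FreeNAS):
# reset_geom_debugflags
```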
 

trionic

Explorer
Joined
May 1, 2014
Messages
98
Would you run iozone in addition to badblocks?

What happens if someone has a case-load of hard disks to test all at once?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
iozone is for benchmarking. It's not useful for diagnostics at all unless you are trying to stress the disk hard into breaking. You can run multiple badblocks and SMART tests simultaneously (but not more than 1 on any disk at any time) using something like tmux or screen.
 

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
Thanks for the tip on tmux, I had started a test on my first disk earlier today, and now I just started simultaneous tests on the remaining 5 drives, so I'll run it for a bit and compare it to the speed on my initial test to see if there's any performance hit for parallel tests. S.M.A.R.T. tests are asynchronous, so you don't need it for them, but for badblocks, you do.

Edit: only 5% into the first pass, but the speed seems to be right on par with the first test, so that's nice. Glad I was able to get this started now, because my initial estimates of the test duration didn't take into account the fact that the readback pass occurs separately, or that the status bar only displays the status of the current pass, so it's looking like a little over 8 hours *per pass* on a 2TB WD Red, which is going to end up taking more than 24 hours for a complete 4-pass test. That's fine, since I'll be gone for the weekend, but again, REALLY glad you gave me the tip on tmux when you did, since that means I'll be able to have all 6 drives tested by the time I get back :)
 

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
Thanks to whoever moved this to the How-To Guides section. I originally tried posting here, but I still don't have permission to post in this section.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
The How-To section is where people were asking for guides, so I have locked out permissions to create threads here. This is where guides are supposed to go, not where people ask for guides. ;)

PS: I moved it. Very well written and worthy of our How-To section!
 

tromba

Dabbler
Joined
Jul 13, 2014
Messages
15
Now I don't claim to have read all the posts on this forum, but I have read this one and the ones it points to and several other related ones and have used the search box, and while they are a great help at explaining things and I feel that I follow everything in this guide (Thumbs UP!), it's the very simple beginning that I do not understand. For example, to run these tests, do I install FreeNAS and then ssh into the machine? Do I run my box off a USB stick with a Linux or FreeDOS distro? How do I get to the point where I can type in these commands? There seems to be some kind of common understanding that is just beyond me. Or maybe I'm the wrong kind of noob. (20+ years of experience, exclusively with embedded systems and DIY PCs under Windows though).

I've been running memtest86 (&+) as well as Mprime23 and a couple of other CPU stressers for a couple of days now off Ultimate Boot CD and would really like to move on to hard drive testing, before the window my store allows for quick hardware returns closes. I don't want to wait for a couple of weeks while Western Digital processes my RMA (been there, done that).

Sorry if the question is silly.
 

qwertymodo

Contributor
Joined
Apr 7, 2014
Messages
144
I ran all of this from the FreeNAS shell, so yes, you'll need to set up FreeNAS on a flash drive and boot into it to run these tests. You could probably run them from some other BSD or *NIX environment, but for the sake of this forum, I'll just suggest using FreeNAS. If you're not sure how to install FreeNAS to the USB stick, that's a bit out of scope for this guide, but thankfully, the process is already documented here
 

tromba

Dabbler
Joined
Jul 13, 2014
Messages
15
Thanks for the quick reply qwertymodo.
I already have FreeNAS on a USB drive and will probably move to that tomorrow after the stress tests have run for a bit.
 

Ed Clarke

Dabbler
Joined
Jul 22, 2014
Messages
11
I'm certainly glad that I found this "Howto" and decided to follow it instead of thinking "It'll probably be ok...". 88% through the first pass:

badblocks -ws /dev/da4
badblocks: No such file or directory while trying to determine device size

This was after a slew of errors on that drive. The other six seem to be working ok so far. Now I'll find out how good the RMA process is on Amazon.
 

michaelkoehler

Dabbler
Joined
Jul 8, 2014
Messages
17
Great guide, qwertymodo.

One thing didn't work for me, though - "tmux reattach". After a little Googling, I found that "tmux attach" worked to reattach my session.

Also, I had a hell of a time opening a sixth panel for running badblocks on my last drive. Somehow I got it, but I'm not quite sure what it was.

Thanks for sharing your work.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
tmux is very unintuitive at first. My recommendation to get 6 nicely distributed screens is to first carelessly open 6 of them. Then toggle between display options until you reach tiled (you absolutely need the man page for tmux).
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
I had zero problems with tmux even when I first started using it. Then again, I read the man page extensively before I actually started using it though.

On a very high level, I find it rather simple actually. You really only ever need to know two commands: tmux (to start), tmux attach (to existing session).
Once you're attached to a session, it's all ctrl-b (or whatever you rebind this to) and some_key. That's all there is to it, really.

Not sure where the OP got reattach from. You can find "tmux attach" barely one page into the man page without much reading at all.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I do all the stuff in that link in a tmux session in FreeNAS. I always have, thanks to tmux. I just tested 10 new disks in parallel less than a month ago.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,449
I just finished running badblocks on two WD40EFRX WD RED 4TB drives and it took slightly over 72 hours to run 4 passes per disk.
Each pass is composed of a write sequence (9 hours write to the disk) and followed by a read and compare sequence (another 9 hours to read from the disk). Total per pass is about 18 hrs.
It is unfortunate there is no estimated time for the test completion.
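Since badblocks gives no overall estimate, a rough one can be worked out by hand from the drive size and an average transfer rate. The sketch below is only back-of-the-envelope arithmetic: the 123 MB/s figure is an assumption derived from the numbers above (4 TB written in about 9 hours), and estimate_badblocks_hours is a made-up helper name.

```shell
#!/bin/sh
# Sketch: rough badblocks run time in whole hours.
# Usage: estimate_badblocks_hours SIZE_TB MB_PER_SEC PASSES
# Each pass is one full write sweep plus one full read-and-compare sweep.
estimate_badblocks_hours() {
    size_tb="$1"
    mb_per_sec="$2"
    passes="$3"
    # 1 TB = 1,000,000 MB, so this is the time for one sweep of the whole disk.
    sweep_seconds=$(( size_tb * 1000000 / mb_per_sec ))
    # Two sweeps per pass; integer division rounds down to whole hours.
    echo $(( sweep_seconds * 2 * passes / 3600 ))
}

estimate_badblocks_hours 4 123 4   # prints 72
estimate_badblocks_hours 3 123 4   # prints 54
```

Real drives slow down toward the inner tracks, so treat the result as a ballpark figure and allow some margin.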
 

BrooklynMatty

Dabbler
Joined
Sep 25, 2014
Messages
14
Every time I try the SMART commands, I get this:

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Short self-test routine immediately in off-line mode".
Drive command "Execute SMART Short self-test routine immediately in off-line mode" successful.

Is it OK that it keeps showing as OFFLINE?

I don't have any volumes set up; I am just starting the install process for my FreeNAS setup.

Also, I can't get tmux to open a new window. I get the green bar at the bottom of the shell, but it doesn't open a new terminal.
 

Robert Smith

Patron
Joined
May 4, 2014
Messages
270
Hello, BrooklynMatty,

What exactly are you worried about? Which commands are you executing and why?
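Which utility are you using? If it is smartctl, then "off-line" refers to how the drive runs the test (in the background, while it stays available for normal use), not to the drive itself being offline.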

Related to this thread, we usually run self-test commands, and commands to show status...
 

Techanimal

Dabbler
Joined
Oct 1, 2014
Messages
13
So I started doing the badblocks tests and had to log out of the gui view. Now when I log back in, I can't access the shell anymore. I can see that the tests are running since the drive activity light is lit and if I go to reporting, I can see solid drive activity over the last 40 hrs.

How do I get back in to view progress and how long should it take to do 6 X 3 tb WD Red hard drives?

 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,449
Have you tried running 6 concurrent badblocks tests using tmux? If so, you can get back to them by running the "tmux attach" command in the shell.

Based on my setup, it took 72 hours to run 4 passes on my 4TB drives. A 3TB drive has 75% of that capacity, so I would expect about 54 hours. Give it a few extra hours to be safe. I had 1 drive finish sooner than the other, maybe a 10% speed gain, so your mileage may vary.

If you can't attach back to tmux, I would suggest running "top" to see which processes are running.
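As an alternative to scanning top by eye, something along these lines might answer both questions at once. It is a sketch that assumes pgrep is available on your system; the burnin_status name is made up, and tmux sessions only show up for the user who started them.

```shell
#!/bin/sh
# Sketch: list tmux sessions and any badblocks processes that are still running.
TMUX_BIN="${TMUX_BIN:-tmux}"   # overridable, e.g. for a dry run
PGREP="${PGREP:-pgrep}"

burnin_status() {
    echo "tmux sessions:"
    "$TMUX_BIN" list-sessions 2>/dev/null || echo "  (no tmux sessions for this user)"
    echo "badblocks processes:"
    # -f matches against the full command line, -l also prints what matched.
    "$PGREP" -lf badblocks || echo "  (no badblocks processes found)"
}

# Example:
# burnin_status
```

If badblocks shows up but no tmux session does, the tests are probably still running in whatever shell started them, and drive activity is the only progress indicator left.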
 