Hard Drive Burn-In Testing - Discussion Thread

Fraoch · Jan 11, 2015

I've seen SMART extended tests do that if the drive is otherwise busy - they will go to 90% completion then stay there until the other drive activity ceases, then complete the test.

solarisguy · Jan 12, 2015

@Matdif

Matdif said:
I have a problem with the badblocks test and was wonder if someone could interpret it.

A month ago I had a problem with a motherboard I was using to set up freenas. Specifically the motherboard is the
C2750D4I found in the recommended hardware thread. The specific problem I had was that the bottom 4 sata ports on the marvell controller could become unreliable when loaded with more then 1 hard drive. I found another person who had the problem and after he got an rma he says he still had it. Meanwhile I had sent my board in for replacement and told Asrock over the phone my specific problem. Meanwhile it took a month to get a board to me with ups losing it finding it sending it back to them plus the holidays.

Now I have a board back but it looks like an updated model. At least they changed the packaging from a black box to blue and such. The board itself looks much the same but I was hopefull they upgraded something or fixed the problem. Now I hooked everything back up made sure to fill every Marvell sata port as a test and started a badblocks test again on all slots.

The problem is I am getting an error I did not get before.

I just started the tests and every hard drive badblocks test immediately gave me back "set_o_direct: Inappropriate ioctl for device". The first board I had did not give this error I still have a screenshot from before.

You can see it in the pic. The tests seem to be continuing so I will wait and see if I get any error emails for the marvell controller becoming unreliable unstable like before. In the meantime I was wondering what everyone thinks of this error. Is the test still good? Or has something changed in the board that will make the test unreliable? Is there something else I should do now to test and make sure this thing will handle a zpool well?

Did you perform this step?
http://www.asrockrack.com/support/ipmi.asp#Marvell9230
(Disabling Marvell SE 9230 HW Raid)

P.S.
If you have time can you run for me the ECC test I have described here
https://forums.freenas.org/index.ph...formance-degradation.21327/page-2#post-124525

Since gcc is not ordinarily supplied with FreeNAS, you would need to compile using FreeBSD or just boot into FreeNAS 9.2.1.5 where gcc works by mistake, and do nothing else but compile and execute...

Thank you!

horse_porcupine · Jan 13, 2015

The badblocks run finally finished (~77hrs total) without any additional errors found in the final two passes, and I ran long SMART tests on all three drives the results of which you can find here. It says they've all passed and I don't see anything to be concerned with but some extra pairs of eyes are always helpful.

And Fraoch, you were correct - the drives did indeed finish the SMART tests I started once the badblocks were complete. Good to know for future worriers such as myself...

Ericloewe · Jan 13, 2015

Well, those three drives show no abnormal SMART data.

Matdif · Jan 16, 2015

solarisguy said:
@Matdif Did you perform this step?
http://www.asrockrack.com/support/ipmi.asp#Marvell9230
(Disabling Marvell SE 9230 HW Raid)

P.S.
If you have time can you run for me the ECC test I have described here
https://forums.freenas.org/index.ph...formance-degradation.21327/page-2#post-124525

Since gcc is not ordinarily supplied with FreeNAS, you would need to compile using FreeBSD or just boot into FreeNAS 9.2.1.5 where gcc works by mistake, and do nothing else but compile and execute...

Thank you!

Havent done either yet. If I can figure out how ill run your test. Badblocks has started doing something new and unusual see the screenshot.

solarisguy · Jan 17, 2015

Matdif said:
Haven't done either yet. If I can figure out how I'll run your test. Badblocks has started doing something new and unusual see the screenshot.

You have to disable Marvell SE 9230 HW Raid ! ! !

Please just forget about the ECC test.

hyram · Feb 5, 2015

Newbie alert... first post :)

How do you do a burn in for a drive that will go into an existing system? I've got two thought:

1. Do the burn in on a separate FreeBSD or Linux system and then swap in the drive.
2. Take the entire array offline. Run the burn in on the new drive. Bring the array back online and insert the new disk.

Issue with 1. would be what to do if you don't have a separate system to do the burn in on?
Issue with 2. is the length of time to burn in means you NAS is down for a couple of days.

A separate question... Does anyone keep a burned in drive sitting on the shelf ready to be placed into service? Other than in an enterprise environment that is.

Thanks!

solarisguy · Feb 5, 2015

@hyram, assuming home environment only. You can go the third way. Replace the hard drive and monitor the just replaced hard drive using S.M.A.R.T. during resilvering. Resilvering would be your burn in.

Otherwise, yes also at home, you would need a fully tested hard drive as a spare on a shelf.

You might wonder what enterprises do... They replace their hard drives immediately, that is within minutes or hours of a failure. Details would depend on importance of data, budget, IT organization, know-how within IT etc.

The biggest difference with home users is that enterprises tend to be instantly notified about any failure.

hyram · Feb 6, 2015

Thanks solarisguy... but resilvering would not be as complete as a badblock run would it? Qwertymodo's burn in read/writes every block (and every bit), while a resilvering would only read/write blocks that would potentially have data in them. It would leave the empty blocks untouched. Or did I misinterpret how resilvering works?

titan_rw · Feb 6, 2015

hyram said:
Newbie alert... first post :)

How do you do a burn in for a drive that will go into an existing system? I've got two thought:

1. Do the burn in on a separate FreeBSD or Linux system and then swap in the drive.
2. Take the entire array offline. Run the burn in on the new drive. Bring the array back online and insert the new disk.

Issue with 1. would be what to do if you don't have a separate system to do the burn in on?
Issue with 2. is the length of time to burn in means you NAS is down for a couple of days.

A separate question... Does anyone keep a burned in drive sitting on the shelf ready to be placed into service? Other than in an enterprise environment that is.

Thanks!

I do #1. I've got several spare 'bench' boards I can use to run badblocks / etc on drives before putting them in my main nas's. I've got an old FreeNAS 9.2 installation that I use exclusively for this, badblocks, dd, smartctl and tmux.

And I also keep a cold spare that's already passed 4-5 days of testing incase I need to swap a drive. With 24 drives in total, I should probably have more than one cold spare at this point.

Glorious1 · Feb 8, 2015

Assuming you've got a spare SATA port and bay in the box, why can't you just hook up the new drive but not add it to any array, then do the burn-in on it while your array hums along, none the wiser? I've done that.

solarisguy · Feb 9, 2015

hyram said:
Thanks solarisguy... but resilvering would not be as complete as a badblock run would it? Qwertymodo's burn in read/writes every block (and every bit), while a resilvering would only read/write blocks that would potentially have data in them. It would leave the empty blocks untouched. Or did I misinterpret how resilvering works?

You did not miss anything. I have experience with both enterprise and home servers, and in my opinion disadvantages of not running badblocks can be offset in home environment by paying closer attention to disk monitoring and notifications.

Also in terms of logistics, if a hard drive in a volume goes down, how fast are you going to get a new one inserted into your FreeNAS system? On a Friday before a long weekend? While vacationing in Las Vegas? :)

Although you can keep a hard drive on a shelf, there is never a 100% guarantee that it is going to work when its time comes. Even if it was tested by badblocks.

People replaced healthy hard drives due to unrecognized problems with a backplane, cabling, power or SATA controller and stopped replacing only after their spare part pool was exhausted :) Not running bad blocks is, in my opinion, a much lesser evil than not using S.M.A.R.T. to monitor disk health and not immediately replacing bad or suspected hard drives upon a failure.

hyram · Feb 11, 2015

Not running bad blocks is, in my opinion, a much lesser evil than not using S.M.A.R.T. to monitor disk health and not immediately replacing bad or suspected hard drives upon a failure.

Totally agree... I'm working on getting my email set up to send me the SMART status every night. There's a thread here somewhere I'm following. Just haven't had enough time to complete it.

Also in terms of logistics, if a hard drive in a volume goes down, how fast are you going to get a new one inserted into your FreeNAS system? On a Friday before a long weekend? While vacationing in Las Vegas?

If I've got one on the shelf, and SMART tells me something is amiss, I'd replace it ASAP.

Assuming you've got a spare SATA port and bay in the box, why can't you just hook up the new drive but not add it to any array, then do the burn-in on it while your array hums along, none the wiser? I've done that.

So obvious... no wonder I missed it :). Thanks for the idea.

chupathingee · Feb 25, 2015

Awesome guide! I started running badblocks yesterday after work and this morning it was at 11% after 15 hours on 4 3TB drives. I did some research and badblocks by default uses a block size of 1024 bytes. The block size in badblocks should correspond to the block size on your drives, so that you aren't checking fractions of a block or multiple blocks. Using a block size too small will increase the amount of time spent checking the drive without increasing the quality of the scan, whereas using a block size that is too large could compromise the integrity of the scan by reporting false negatives. The block size on my drives (and I believe most modern drives) is 4096. I cancelled badblocks and restarted with the flag "-b 4096" included in the command. When I left for work the drives were all 3% done after only an hour (as opposed to ~.73% an hour previously). That's a little over 4 times faster. I suggest you update the badblocks section in the guide to include instructions for setting block size.

iKersh · Feb 26, 2015

"I suggest you update the badblocks section in the guide to include instructions for setting block size."

I just purchased and installed 2 x 3TB drives into my shiny new microserver I got last Friday (20th Feb.) and have been following the burn-in procedure, currently at the badblocks test. I set the block size to what chupathingee has recommended (cheers for posting that by the way) - badblocks -b 4096 -ws /dev/adaX - before leaving for work today; hopefully the test on both drives will have progressed nicely by the time I get home tonight (not expecting it to be complete by any means). Also, for newbies to the NAS world (myself included), a handy tmux keystroke to know if you are running simultaneous tests at the console itself (not via SSH) which allows you to switch between screens is CTRL+b then press o (a list of useful keystrokes in tmux can be found here).

EDIT - after 57 hours approx. the badblocks testing completed; after running the long smart test, no errors found :)

GrumpyBear · Mar 2, 2015

chupathingee said:
... restarted with the flag "-b 4096" included in the command...

How about also setting "-c number of blocks" from it's default of 16 to 32 or 64 as well? Anyone tried that? The manpage indicates that increasing this

" ... Will increase the efficiency of badblocks but also will increase its memory usage. Badblocks needs memory proportional to the number of blocks tested at
once, in read-only mode, proportional to twice that number in read-write mode, and proportional to three times that number in non-destructive read-write mode."

Presumably increasing the efficiency will also decrease the amount of time it takes to complete? I'll have to give this a try after the long SMART tests are finished tomorrow.

EDIT - further research led to trying huge values for -c (badblocks was written back in the days of 200GB IDE drives in systems with limited memory); specifically "-c 196608" and "-c 98304" which consumed 384MB RAM and 768MB RAM for writes and 768MB RAM and 1536MB RAM for reads). The net effect was no perceivable increase in speed outside of what might be attributed to variances in performance between disks. More than 50% through the 4 passes here are the estimates for completion times:

Seagate 3TB NAS Disks

da0 48:52 -c 16
da2 47:28 -c 16
da4 48:28 -c 196608
da6 49:56 -c 98304

WD 3TB Red Disks

da1 55:36 -c 64
da3 59:12 -c 16
da5 59:00 -c 32
da7 57:16 -c 128

The Seagates are faster than the Western Digitals as expected as their raw disk access is rated higher. Both have the same rotational speed(s), capacity and amount of buffer but the Seagates start out writing at 170-180MB/s and the WDs at 140-150MB/s and the Seagates finish at 70-80MB/s and the WDs at 65-70MB/s.

Note that I did use a blocksize of 4096 ("-b 4096") on all drives.

Gilley7997 · Mar 14, 2015

GrumpyBear said:
The Seagates are faster than the Western Digitals as expected as their raw disk access is rated higher. Both have the same rotational speed(s), capacity and amount of buffer but the Seagates start out writing at 170-180MB/s and the WDs at 140-150MB/s and the Seagates finish at 70-80MB/s and the WDs at 65-70MB/s.

It's early (I have no idea why I am awake) and I was just looking at the progress of my badblocks test and noticed this steady write performance decrease on the Red's from the 140-150 to 70-80 as well at the end of the write cycle. I was just trying to reason my way through this as to why this occurs. I like to ask questions, so I will. What is the technical reason for this? I have an idea, but I wasn't able to confirm with some quick googling, so I figured I would just ask the experts. :)

This was also asked earlier and I didn't see it addressed so just figured I would bring it up again, upon initializing the badblocks test I get this message reported:

Code:

Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device

As a new user, and shame on me for not reading all the way to the bottom of the thread before running the test, it would be nice to move some of these improvements such as the -b 4096 option into the main guide.

GrumpyBear · Mar 16, 2015

Gilley7997 said:
It's early (I have no idea why I am awake) and I was just looking at the progress of my badblocks test and noticed this steady write performance decrease on the Red's from the 140-150 to 70-80 as well at the end of the write cycle. I was just trying to reason my way through this as to why this occurs. I like to ask questions, so I will. What is the technical reason for this? I have an idea, but I wasn't able to confirm with some quick googling, so I figured I would just ask the experts. :)

The Seagates are rated at a Max Sustained Data Rate of 180MBps whereas the Western Digitals are 147MBps so that the Segates are faster is within the specifications. I suspect the variation in speed is due to the tracks on the HDD getting smaller as the heads move into the platter and having to do more seeking for the same datasize. I believe Eric Lowe mentioned this in a previous reply in this post.
Also I noted that the Seagate disks really sucked at some tests (5 concurrent reads) taking 5 times longer that the Western Digitals.

Gilley7997 said:
This was also asked earlier and I didn't see it addressed so just figured I would bring it up again, upon initializing the badblocks test I get this message reported:

Code:
Testing with pattern 0xaa: set_o_direct: Inappropriate ioctl for device

Normal behavior see this post[/QUOTE]

Gilley7997 · Mar 16, 2015

Just to make sure I understand. I had been monitoring the tests as they ran. I saw no errors reported. At the end of the test I would have expected something saying "Passed no errors reported." This is what I got though. Does it truly not give a summary of the test?

GrumpyBear · Mar 16, 2015

No news is good news. If there were errors it would list them to standard output (and possibly standard error).

After running Badblocks you might want to rerun the long smart tests now that the disks have had a bit of a work-out

Important Announcement for the TrueNAS Community.

Hard Drive Burn-In Testing - Discussion Thread

Patron

Guru

Dabbler

Server Wrangler

Explorer

Attachments

Guru

Dabbler

Guru

Dabbler

Guru

Guru

Guru

Dabbler

Cadet

Dabbler

Contributor

Dabbler

Contributor

Dabbler

Attachments

Contributor

Similar threads