Newbie questions on Burn-In testing

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Hello!

I have been building my own NAS (and an identical one for my sister) for quite a
while now, because I almost never have time to get into reading as much as
I should.

I've plugged everything together (mainly mainboard and PSU into the chassis,
no drives yet) and updated/configured the IPMI. The next step on my list (I am
following the "Building, Burn-In, and Testing your FreeNAS system" suggestions)
would be the Burn-In tests, but I do not understand how to perform them.

In this forum, there are some threads (like this or that) where someone asks
exactly that question, but I apparently am missing the critical parts of any
of those threads. :(

1) I am a bit confused about the thermals during CPU stressing -- some places
on the Internet say "run the tests AND WATCH OUT THAT IT DOESNT OVERHEAT!!!11",
Stux says "Watch the thermals.", but then on the other hand the tests are supposed
to run continuously for days (u6f6o, Stux as well). For me, that's a bit contradictory! :D
And even more: What temperatures are okay?

2) I've now come up with the following process, maybe you could tell me if
it is any good:

- Run CPU stressing: Intel Processor Diagnostics, Intel Linpack Tests, Prime95 (small), Prime95 (blend) one after another, check that nothing crashed, stinks or catches fire.
- Run memtest86+ for at least a week.
- Insert all HDDs (mine and those of my sister)
- Run HDD burn-in (not specified how, yet)
- Continue with FreeNAS installation, experimenting etc.

If the mainboard is important: Supermicro X10SDV-4C+-TLN4F
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
I would run OCCTP https://www.ocbase.com/ for a few hours until everything settles in, just to see if there are any errors. It is extremely reliable, and if you see an error then something is 100% wrong.
Thermals should not be too critical, but if you see anything above 80 °C at full load you need to rethink your cooling.

A modern CPU will thermal throttle before it takes any damage at all, which is commonly used in laptops to "overprovision" performance. Throttling commonly kicks in at 100 °C, and the CPU should not let you go above that no matter how much you scream at it.

Watch out for a quick rise in temperature, as that can mean the thermal compound doesn't make good contact or something is not seated correctly.

For testing RAM you can run the above program or memtest for a few hours to make sure it's not bad, but I believe any "burn in" of RAM is pretty much made up.

I am going to say the same for HDDs, but with an asterisk. There might be some merit in thrashing a drive to catch an early catastrophic failure caused by some defect, but I would rather build the system to "not care" if such a drive happens to sneak in.

I would, however, graph each HDD with HDTune, for example, and check that they all look similar and don't do anything funny.

My opinion ofc.
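
To make the three checks above concrete (above 80 °C at full load, the roughly 100 °C throttle point, and a quick rise), here is a minimal Python sketch that classifies a series of temperature readings. The 80 and 100 values come from this post; the 15 °C jump between two consecutive samples is purely an assumption for illustration, as are the function and constant names.

```python
from typing import List

WARN_C = 80.0        # "rethink thermals" level mentioned in the post
THROTTLE_C = 100.0   # commonly cited throttle point mentioned in the post
RAPID_RISE_C = 15.0  # ASSUMPTION: jump between two samples treated as suspicious


def assess(samples_c: List[float]) -> List[str]:
    """Return findings for a series of CPU temperatures in degrees Celsius."""
    findings: List[str] = []
    if not samples_c:
        return findings

    peak = max(samples_c)
    if peak >= THROTTLE_C:
        findings.append("throttle")   # CPU is at or above the throttle point
    elif peak > WARN_C:
        findings.append("hot")        # cooling should be reconsidered

    # A large jump between consecutive samples hints at bad cooler contact.
    for prev, cur in zip(samples_c, samples_c[1:]):
        if cur - prev >= RAPID_RISE_C:
            findings.append("rapid-rise")
            break
    return findings
```

An empty result means none of the three conditions was hit; how often to sample is left open here, since the meaning of "quick" depends on the sampling interval.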
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Generally the first hour of CPU stress testing will reveal any heat related problems. Spot checks after that are still a good idea, but the way thermal paste works is that it will cure or "break in" after around 150-250 hours. Your temperatures before this point will be a bit warmer, perhaps 5 to 10 degrees. After that, you should hit a point at which it is just doing its thing.
 
Joined
Jan 7, 2015
Messages
1,150
Test your RAM for a day for sure (3 passes), then maybe a good CPU heat stressing, but I think a good drive burn-in is the most important and takes the longest. Badblocks has always been my go-to for this and has been discussed in this forum. Several times I found new drives with bad blocks out of the box. Note that badblocks destroys data, so make sure you only use it on new drives. There also used to be a script called solnet array test; several times it exposed drives for me that weren't reading/writing at advertised specs, which were promptly returned for different ones. Not sure how relevant this remains, as this was many years ago.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Three passes of memtest is not enough to reliably find problems. solnet-array-test is still around as well, at the same location as always.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
First of all, thank you very much for your answers! :cool:

I think, now I understand it a bit better: I can run the stress testing for
longer durations and should watch the thermal values closely in the first hour,
loosely in the next hour and drop by now and then later on, but I am allowed
to sleep at night. That sounds good. :)

So can I assume this process is a reasonably good way:

- Boot up a Windows-based live medium (like Hiren's Boot CD PE)
- Run the OCCT tests, which support nice live thermal charts; in the first hour check that the thermals do not easily exceed 80 C and simply do not go over 100 C. Make sure nothing stinks, crashes or catches fire. Let the CPU:OCCT and CPU:LINPACK run for at least three hours each. Afterwards, run the PSU test for a few hours. Check if the PSU fan has to kick in and keep an eye on the power meter to see how much power the system uses.
- Boot up Ultimate Boot CD
- Run Prime95 (small) test (should need about 6 h), then Prime95 (blend) test (should need about 12 h), in the first hour regularly switch to another terminal and call the sensors command to check the thermals like above.
- Run memtest86+ for at least a week (in the meantime do research about the HDD burn-in tests)
- Insert all HDDs (mine and those of my sister)
- Run HDD burn-in (not specified how, yet)
- Continue with FreeNAS installation, experimenting etc.

Or is this wrong/too few/too much in any way?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I don't think I have actually ever accused someone of doing too MUCH. The common issue is too LITTLE.

The normal way things work is someone's real excited because they finally have all the parts and they just want to get it all built and TRY IT OUT!!!

Okay so that's totally understandable IMHO. Professional system builders, including companies that build machines, generally do not do extensive burn-in testing because it is space intensive and customers don't like hearing "you'll get it next month." For personal computers, with a few gig of memory and minimal requirements, a less thorough burn-in is the usual. And people who build their own systems are also used to that.

But your NAS will have larger memory and lots of drives. It takes longer to do a thorough memory test, and drives are fickle about early death.

There is no one way that this stuff MUST be done; you are looking for single bit memory errors, CPU heat issues, hard drive failures, case cooling issues, and absolute stability doing heavy I/O under FreeNAS.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
:smile: So you would say my process (at least the first 5 steps) is at least not complete nonsense? And there's nothing essential missing? I am a software guy, that hardware stuff is a bit of black magic for me.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Ermmm.... What does it mean if the CPU:OCCT test crashes after 5 to 10 seconds because of an "Illegal Instruction"?
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
In the test settings, I can choose from {Automatic, AVX2, AVX, NoAVX} and it only works in NoAVX mode. My CPU is listed as AVX2 at Intel, so is it broken or is the test broken?
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Microsoft (exactly my instruction code) says that "Windows PE does not support AVX state saving". Hiren's Boot CD is based on Windows PE. How am I supposed to run those OCCT tests then? :-(
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
You can download and install Win10 to a temp HDD. If you go to the download page from Linux, you get an .ISO directly. You don't even need internet to activate.
I never tried OCCTP and Hiren but probably should have done.
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
What would you think about this process?

- Boot up Hiren's Boot CD
- Run OCCT tests (CPU:OCCT small/medium/large for 3 hours each; CPU:LINPACK for 3 hours; POWER for 3 hours)
- Boot up Ubuntu live CD, "install"/download s-tui, stress, firestarter and prime95
- Run the tests, use s-tui to watch the thermals, start a second script that kills the stressors when the cpu gets too hot (get the values from s-tui)
--> stress for 6 hours
--> FIRESTARTER for 6 hours
--> mprime smallest/small/large FFT for 6 hours each
--> mprime blend for 24 hours
- Run memtest86+ for at least a week (in the meantime do research about the HDD burn-in tests)
- Insert all HDDs (mine and those of my sister)
- Run HDD burn-in (not specified how, yet)
- Continue with FreeNAS installation, experimenting etc.

Additional question: Would it be helpful if I wrote a tutorial about the tests to spare other newbies the research?
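
The "second script that kills the stressors" could look roughly like the Python sketch below. It is only a sketch under assumptions: it reads the temperature from the Linux sysfs file `/sys/class/thermal/thermal_zone0/temp` instead of from s-tui, and which thermal zone actually belongs to the CPU differs between boards, so that path has to be checked first. The function names, the polling interval and the limit are made up for illustration.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Sequence


def read_sysfs_temp_c(zone: str = "/sys/class/thermal/thermal_zone0/temp") -> float:
    """Read a temperature from Linux sysfs (reported in millidegrees Celsius).

    ASSUMPTION: thermal_zone0 is the CPU sensor; verify this on your board.
    """
    return int(Path(zone).read_text().strip()) / 1000.0


def run_with_watchdog(
    cmd: Sequence[str],
    limit_c: float,
    read_temp: Callable[[], float] = read_sysfs_temp_c,
    poll_s: float = 5.0,
) -> str:
    """Run a stress command and stop it once the temperature reaches limit_c.

    Returns "killed" if the watchdog stopped the process, else "finished".
    """
    proc = subprocess.Popen(list(cmd))
    try:
        while proc.poll() is None:
            if read_temp() >= limit_c:
                proc.terminate()              # ask politely first
                try:
                    proc.wait(timeout=10)
                except subprocess.TimeoutExpired:
                    proc.kill()               # then insist
                    proc.wait()
                return "killed"
            time.sleep(poll_s)
        return "finished"
    finally:
        # Never leave a stressor running if the watchdog itself fails.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
```

The command for the stressor is passed in as a list, so the same watchdog can wrap each of the tools in turn. Passing `read_temp` as a parameter keeps the sensor source swappable, for example if another sysfs path or another tool turns out to report the right sensor.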
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,464
Do I understand it correctly, I am pretty much on the safe side if I do the Smartctl--Badblocks--Smartctl procedure and then the Solnet script?
I would think so. I ordinarily do the SMART test and a full run of badblocks (which on larger drives takes a number of days). If you're concerned about heat, some spot checks about a day into the badblocks run should let you know how they're going to do.
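
As a rough sketch of the Smartctl--Badblocks--Smartctl order, the Python snippet below only builds and prints the commands; it deliberately does not run them. `smartctl -t long`, `smartctl -A` and `badblocks -w -s` are existing options of those tools; the `-b 4096` block size and the `/dev/sdX` placeholder are assumptions that need adjusting per drive, and the function name is made up.

```python
import shlex
from typing import List


def burn_in_plan(device: str) -> List[List[str]]:
    """Return the per-drive command sequence, in order, without running it."""
    if not device.startswith("/dev/"):
        raise ValueError("expected a device node such as /dev/sdX")
    return [
        # Starts the long self-test and returns at once; the test itself runs
        # inside the drive, so wait for it to complete before the next step.
        ["smartctl", "-t", "long", device],
        # DESTRUCTIVE write-mode test (-w) with progress output (-s).
        # ASSUMPTION: 4096-byte blocks suit the drive.
        ["badblocks", "-b", "4096", "-ws", device],
        # Print the SMART attribute table for comparison with earlier values.
        ["smartctl", "-A", device],
    ]


if __name__ == "__main__":
    for argv in burn_in_plan("/dev/sdX"):   # placeholder device name
        print(shlex.join(argv))
```

Printing instead of executing is intentional: badblocks in write mode wipes the drive, so the device name is worth checking by eye before anything is run.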
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
Hello again!

My current process draft consists of

(Ubuntu live cd)
- SMART tests (short/conveyance/long)
- 1 run of badblocks (4 passes, I think), check if it reports any bad blocks --> drive BAD
- SMART again, check if the reallocated sectors count went up --> drive BAD

(FreeBSD live cd)
- solnet
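
The "SMART again, check if the reallocated sectors count went up" step of this draft could be sketched in Python as below. It assumes the attribute table printed by `smartctl -A` was saved to text before and after badblocks, and that the raw value is the tenth column of each attribute row. Watching Current_Pending_Sector and Offline_Uncorrectable as well is my own assumption, not something from this thread, and the two sample rows are invented for illustration.

```python
from typing import Dict, List

# Reallocated_Sector_Ct is the counter named in the draft; the other two
# are an ASSUMPTION about what else is worth watching.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_raw_values(smartctl_a_output: str) -> Dict[str, int]:
    """Map attribute name to raw value from `smartctl -A` text output."""
    values: Dict[str, int] = {}
    for line in smartctl_a_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID; RAW_VALUE is column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def grown_counters(before: str, after: str) -> List[str]:
    """Return the watched attributes whose raw value increased."""
    old, new = parse_raw_values(before), parse_raw_values(after)
    return [name for name in WATCHED if new.get(name, 0) > old.get(name, 0)]


# Invented sample rows, shaped like smartctl's attribute table.
SAMPLE_BEFORE = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0"
SAMPLE_AFTER = "  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8"

if __name__ == "__main__":
    print(grown_counters(SAMPLE_BEFORE, SAMPLE_AFTER))
```

A non-empty result would correspond to the "drive BAD" verdict in the draft; an empty one only means these particular counters did not move.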

I do not understand how I get "results" (being "drive BAD" or "drive not BAD yet") from the solnet script.

NO AMOUNT OF STRESS should result in the kernel reporting timeouts or other issues communicating with your drives.

Can I deduce from that that I just have to leave it running with all the HDDs for a few days and check whether any error messages are printed to the screen (--> drive BAD)? Am I supposed to do anything with the numerical output the script gives me (apart from very clear things like "Wow, this number is ten times higher for this drive!")?
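
One way to make the "this number is very different for this drive" check less of a gut feeling is to compare every drive against the median of all drives, as in the Python sketch below. It does not parse the solnet script's output; the per-drive numbers would have to be copied in by hand. The 25 % tolerance, the function name and the device names in the example are assumptions for illustration only.

```python
from statistics import median
from typing import Dict, List


def find_outliers(speeds_mb_s: Dict[str, float], tolerance: float = 0.25) -> List[str]:
    """Return the drives whose speed deviates from the median by more than
    the given fraction (ASSUMPTION: 25 % is a reasonable first guess)."""
    if not speeds_mb_s:
        return []
    mid = median(speeds_mb_s.values())
    return sorted(
        name
        for name, speed in speeds_mb_s.items()
        if abs(speed - mid) > tolerance * mid
    )


if __name__ == "__main__":
    # Hypothetical readings for four identical drives, in MB/s.
    print(find_outliers({"da0": 180.0, "da1": 176.0, "da2": 182.0, "da3": 95.0}))
```

Using the median rather than the mean keeps one very slow drive from dragging the reference value down with it. The comparison only makes sense between drives of the same model.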

Edit, question 2: I have 8 slots and 4 drives (4 slots spare for future). Should I do the whole SMART/badblocks/SMART/solnet test for all 8 slots? Or is it enough just to run the solnet script after changing the already tested drives into the other 4 slots?
 

troudee

Explorer
Joined
Mar 26, 2020
Messages
69
I am almost done with the S.M.A.R.T./badblocks/S.M.A.R.T. part of the burn-in.

Could someone please tell me about the solnet script (#18 in this thread)? I want to understand the things I am doing!
If my questions are too dumb, please nudge me in the direction of the FAQ that I seem to have missed.
 