
Building, Burn-In, and Testing your FreeNAS system

jgreco · Resident Grinch · Moderator · #1
I've been meaning to post some guidance here for a while now. We frequently see people come to the forums with hardware problems that should have washed out during the system build process, but since many of the users here are DIY'ers without professional experience building servers, their systems go from parts-in-box to in-use in a few hours.

This process also needs to be repeated when you are upgrading your system by adding new parts or changing out existing parts. There's a little bit of "use your brain" in how strict you need to be, but doing stuff like just dumping more RAM into your box and then booting it with a production pool attached can lead to great sadness. Don't do that. Your hardware needs to be solid and validated, and if it isn't, you can scramble your bits, possibly irretrievably.

This isn't a complete or comprehensive guide, more like a work-in-progress.

The Build Process

It is tempting to just rip open all those boxes you got, put it together, and pray. Bad idea.

1) Set up a suitable workspace.

1a) If you are in a low humidity environment, pay extra special attention to environmental static controls, including making sure that the clothes you're wearing have been conditioned with fabric softener. Fabric softener can be diluted with water in a sprayer and applied to carpets, which also reduces the nasty winter zaps!

1b) All computer assembly work should be done on top of an anti-static mat. Fry's, NewEgg, and Amazon all have these for around $20. This is not optional! Static discharge can subtly damage your silicon.

1c) Assemble while wearing an anti-static wrist strap. These are available for $5-$20 at those retailers and elsewhere. Ideally you should also wear ESD gloves. Even the cheap ones will not only help with ESD but will also keep skin oils off your components and help reduce contaminants.

1d) Make sure you hook up the wires from your anti-static gear to a proper ground.

1e) Do not wear static-prone clothing, particularly many synthetics. Short sleeves and shorts can help reduce static!

Handle all components only in your anti-static environment, preferably by their edges, never by exposed contacts. A lot of people like to install their RAM and CPU on a mainboard prior to installing in the case, but this increases the number of components in play at a time. It is ideal to install the mainboard in the chassis first, then ground the chassis, and then install components one at a time. In a small chassis build, this may be impractical, and even in a large chassis it could make installation of the CPU tricky. Only remove components from the packages as they are actually required. Resist the urge to unpack everything and spread it out. Extra handling is extra risk. Keep your mind on grounding and careful handling.

Make sure that when mounting your mainboard, it is securely supported at all points where the chassis offers screw holes. Make sure that the chassis doesn't have any screw standoffs in places where the motherboard does not have a hole; these can short out a motherboard.

Tighten all screws until you feel moderate resistance, then give just a bit more of a twist. You want a solidly seated screw, not loose, not stripped.

Smoke Test

Our name for the initial power-on test. Computers run on smoke. Once the smoke comes out, they stop working.

You should smoke test on the bench with the chassis open so that you can visually inspect everything.

All cards should be fully seated, meaning that among other things you should see an even amount of the board's copper fingers exposed along the length of the socket.

All DIMM modules should be fully seated. A properly seated DIMM will include the clips on the side being fully engaged, which is a fairly quick visual test. Also verify that the module's copper fingers appear to be uniform.

Power it on and make sure that the hardware manifest agrees with what the BIOS reports. Watch to make sure that all your fans are spinning, and take some time to address any buzzy or unpleasant vibration noise.
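
Once you can boot any FreeBSD-based environment (even the FreeNAS installer or a live USB stick), a few shell commands make the cross-check against your parts list easy. A minimal sketch, assuming a FreeBSD/FreeNAS shell:

Code:
# Cross-check what the OS sees against your hardware manifest
sysctl hw.model hw.ncpu hw.physmem    # CPU model, logical core count, installed RAM
dmidecode -t memory | grep -i size    # per-DIMM sizes, if dmidecode is available
camcontrol devlist                    # disks attached via CAM (SATA/SAS)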

Configuration

In the BIOS:

Reset the BIOS settings to defaults.

Configure the server to power on after power loss (if this is desired).

Configure the power button to NOT shut off the host immediately upon press. A momentary press of the power button is supposed to send a signal to the OS to shut down cleanly and power off. It's a good idea to test that once the OS is installed.

Configure the boot priority to prioritize the USB or SATA DOM, and if possible, disable booting from the data disks.

Burn-In and Testing

The burn-in and testing may be separated out or done as kind of a commingled whole. It may disappoint you to discover that proper burn-in and testing actually requires more than a thousand hours of time - a month and a half. This is the time during which you want infant mortality to strike.

Be sure to run an exhaustive test of memtest86 on all the system memory. If you have chosen to play with fire by not buying an ECC system, this is really the only significant opportunity you have to discover problems without your pool being at risk. If you have ECC, try to identify the way in which your mainboard manufacturer provides failure logging (BIOS, BMC/IPMI, etc). Any memory failures should be addressed. Do not rely on ECC to repair problems in a problematic module. The memtest86 tests can be run several times throughout the burn-in period to validate your memory. Don't just run one pass. Run it for a week or two.
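
If your board has a BMC, the IPMI system event log is usually the easiest place to check for ECC events. A minimal sketch, assuming ipmitool is installed and your BMC records memory errors in the SEL:

Code:
# Check the BMC's event log for corrected/uncorrected memory errors
ipmitool sel elist | grep -i -E 'ecc|memory'
# Clearing the log before burn-in makes anything new easy to spot
ipmitool sel clear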

Find and run one of the CPU-stressing utilities such as "CPUstress" (http://www.ultimatebootcd.com/) and run this for a day, and monitor the temperature and fan behaviour in your chassis. Especially if you are attempting to build a "quiet" NAS, now is the opportunity to make sure that your cooling strategy is going to work.
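
While the stress utility runs, keep an eye on temperatures. A minimal sketch of a monitoring loop, assuming a FreeBSD/FreeNAS shell with the coretemp(4) (Intel) or amdtemp(4) (AMD) driver available:

Code:
# Log CPU temperatures once a minute while the stress test runs
kldload coretemp 2>/dev/null   # or amdtemp; ignore the error if it's already loaded
while true; do
    date
    sysctl -a | grep -i temperature
    sleep 60
done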

Run SMART tests on all your drives, including the manufacturer's conveyance test and the long test. For this purpose it might be convenient to set up an instance of FreeNAS.
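
For example, with smartmontools (included with FreeNAS) the tests can be started from the shell. A sketch, assuming your first drive is /dev/da0:

Code:
# Start SMART self-tests on one drive; repeat for each /dev/daN (or /dev/adaN)
smartctl -t conveyance /dev/da0   # conveyance test (shipping damage), if the drive supports it
smartctl -t long /dev/da0         # extended self-test; takes hours on large drives
smartctl -a /dev/da0              # afterwards, review the attributes and self-test log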

SMART is not sufficient to weed out bad drives, though it is a good thing to run. It is unable to identify subtle problems in the SATA/SAS channels to your drives, for example. In order to validate both the drives and their ability to successfully shuffle data, be sure to run plenty of tests on them. I suggest:

1) Individual sequential read and write tests. This is basically just using "dd if=/dev/da${n} of=/dev/null bs=1048576" to do a read test, and "dd if=/dev/zero of=/dev/da${n} bs=1048576" to do a write test. Note that the write test destroys any data on the drive, so only run it before a pool exists. Note the reported read/write speeds at the end and compare them to both the other drives in the pool and also any benchmark stats you can find for that particular drive on the Internet. They should be very close.

2) Simultaneous sequential read and write tests. This is using those same tests in parallel (a hand-rolled sketch follows after this list).

3) I am kind of lazy, so I will then do seek testing by starting multiple read tests per drive, staggered by a minute or two each. The drives will do lots of seeking within a relatively small locality. Other people like to use different tools to do this; I have absolutely no objection, but as I noted, I'm kind of lazy.

4) I've recently provided a tool to assist: https://forums.freenas.org/index.php?resources/solnet-array-test.1/
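
As an illustration of item 2, here's a minimal sketch of running the parallel read pass by hand. It is read-only and therefore non-destructive, but adjust the device names to match your own drives:

Code:
# Run one sequential read per drive, all at once (read-only)
for d in da0 da1 da2 da3; do
    dd if=/dev/${d} of=/dev/null bs=1048576 &
done
wait   # when each dd finishes, compare the MB/sec figures it reports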

NO AMOUNT OF STRESS should result in the kernel reporting timeouts or other issues communicating with your drives. If it does - update drivers, reseat cables, Google for known problems with your drives or controller, upgrade drive firmware, and/or generally fix your busted hardware. You cannot use a system throwing spurious storage subsystem errors for ZFS.
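
A quick way to look for such complaints after a test run is to scan the kernel log. A sketch, assuming a FreeBSD/FreeNAS shell:

Code:
# Look for storage-channel complaints (CAM errors, timeouts, retries)
dmesg | grep -i -E 'cam status|timeout|retrying|error'
grep -i -E 'cam status|timeout' /var/log/messages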

Once you've hit a system with tests like these for a few weeks, confidence in the platform should be established. Then you move on to installing FreeNAS, setting up ZFS, and beating on that for awhile... some comments on that and networking testing hopefully coming soon...
 

jgreco · Resident Grinch · Moderator · #3
Well keeeeerap. That doesn't work. XenForo for the FAIL.

Reposted here.

Back in the late '90's, I was managing a bunch of large whitebox storage servers. For the largest of these, I had the pleasure of building and deploying a massive storage server, 8 shelves of 9 drives each, Seagate ST173404LW 73GB drives, a whopping 5TB ... (*grin*)

Part of the problem was burning in these systems, and so I devised some shell scripty stuff that the hardware techs could use. I've become convinced that a variation on this would be helpful in the FreeNAS community, so I'm playing with a stripped-down version that does some basic disk read testing (at the time of this writing). It is suitable for testing and burn-in use.

I've included just two main passes: a parallel read pass, and a parallel read pass with multiple accesses per disk. The script will do some rudimentary performance analysis and point out possible issues. It needs more work but here it is anyways. This script is expected to be safe to run on a live pool, even though that's not a good idea for performance testing purposes. As with anything you download onto your machine, you are expected to verify the safety to your own satisfaction. Note that the only things that touch the disks are "dd" and they're all structured as "dd if=/dev/${disk}".

Link to the current version of the script

To run it, download it onto a FreeNAS box and execute it as root.

Code:
cd /tmp
fetch ftp://ftp.sol.net/incoming/solnet-array-test-v2.sh
chmod +x solnet-array-test-v2.sh
./solnet-array-test-v2.sh


It will give you a simple menu

Code:
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option:


You probably want to look at the disk list (option 4), then pick your target disks with option 2 and an appropriate pattern. For a Seagate ST4000DM000, you could select "ST4000", for example.

The test will run a variety of things and report status. It takes a while. Be patient. It will never terminate as it is intended as a burn-in aid, but you do want to let it do its thing for at least a pass or two to get an idea of how your system performs.

It is best to do this while the system is not busy and preferably before a pool is up and running. That said, it should be safe to use even on a busy filer. I've picked on a busy filer here to give an example of how this looks. Note that da14 is a spare drive, and you'll notice that all the other drives are testing much slower (because they're in use). Also note that my numbers here are from a testing mode that doesn't have the script actually doing the entire disk; real results would look a bit different and take far longer.

Code:
sol.net disk array test v2

1) Use all disks (from camcontrol)
2) Use selected disks (from camcontrol|grep)
3) Specify disks
4) Show camcontrol list

Option: 2

Enter grep match pattern (e.g. ST150176): ST4

Selected disks: da3 da4 da5 da6 da7 da8 da9 da10 da11 da12 da13 da14
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 44 lun 0 (da3,pass5)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 45 lun 0 (da4,pass6)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 46 lun 0 (da5,pass7)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 47 lun 0 (da6,pass8)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 48 lun 0 (da7,pass9)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 49 lun 0 (da8,pass10)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 50 lun 0 (da9,pass11)
<ATA ST4000DM000-1F21 CC51>  at scbus3 target 51 lun 0 (da10,pass12)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 52 lun 0 (da11,pass13)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 53 lun 0 (da12,pass14)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 54 lun 0 (da13,pass15)
<ATA ST4000DM000-1F21 CC52>  at scbus3 target 55 lun 0 (da14,pass16)
Is this correct? (y/N): y
Performing initial serial array read (baseline speeds)
Tue Oct 21 08:21:23 CDT 2014
Tue Oct 21 08:26:47 CDT 2014
Completed: initial serial array read (baseline speeds)

Array's average speed is 97.6883 MB/sec per disk

Disk    Disk Size  MB/sec %ofAvg
------- ---------- ------ ------
da3      3815447MB     98    100
da4      3815447MB     90     92
da5      3815447MB     98    100
da6      3815447MB     97     99
da7      3815447MB     95     97
da8      3815447MB     82     84 --SLOW--
da9      3815447MB     87     89 --SLOW--
da10     3815447MB     84     86 --SLOW--
da11     3815447MB     97     99
da12     3815447MB     92     94
da13     3815447MB    102    104
da14     3815447MB    151    155 ++FAST++

Performing initial parallel array read
Tue Oct 21 08:26:47 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 74 MB/sec
This suggests that this pass may take around 860 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3      3815447MB     98     86     88 --SLOW--
da4      3815447MB     90     74     82 --SLOW--
da5      3815447MB     98     82     84 --SLOW--
da6      3815447MB     97     91     95
da7      3815447MB     95     72     76 --SLOW--
da8      3815447MB     82     80     97
da9      3815447MB     87     84     96
da10     3815447MB     84    111    133 ++FAST++
da11     3815447MB     97    120    124 ++FAST++
da12     3815447MB     92    116    126 ++FAST++
da13     3815447MB    102    123    121 ++FAST++
da14     3815447MB    151    144     95

Awaiting completion: initial parallel array read
Tue Oct 21 08:39:32 CDT 2014
Completed: initial parallel array read

Disk's average time is 741 seconds per disk

Disk    Bytes Transferred Seconds %ofAvg
------- ----------------- ------- ------
da3          104857600000     743    100
da4          104857600000     764    103
da5          104857600000     752    101
da6          104857600000     737     99
da7          104857600000     748    101
da8          104857600000     754    102
da9          104857600000     738    100
da10         104857600000     762    103
da11         104857600000     748    101
da12         104857600000     756    102
da13         104857600000     740    100
da14         104857600000     653     88 ++FAST++

Performing initial parallel seek-stress array read
Tue Oct 21 08:39:32 CDT 2014
The disk da3 appears to be 3815447 MB.
Disk is reading at about 58 MB/sec
This suggests that this pass may take around 1093 minutes

                   Serial Parall % of
Disk    Disk Size  MB/sec MB/sec Serial
------- ---------- ------ ------ ------
da3      3815447MB     98     52     53
da4      3815447MB     90     48     53
da5      3815447MB     98     50     51
da6      3815447MB     97     50     52
da7      3815447MB     95     48     50
da8      3815447MB     82     48     59
da9      3815447MB     87     54     62
da10     3815447MB     84     47     56
da11     3815447MB     97     49     50
da12     3815447MB     92     50     55
da13     3815447MB    102     49     48
da14     3815447MB    151     52     34

Awaiting completion: initial parallel seek-stress array read
 

pro lamer · FreeNAS Guru · #5
I haven't seen anything in our forums that reminds people about a UPS burn-in test (or rather a basic UPS-FreeNAS setup test). Is it too simple or too obvious?

Disclaimer: I just searched our forums for "ups burn in", and the first posts/threads I found weren't promising in that regard.

Anyway, I wondered when to test the UPS setup: before adding the hard disks to the rig, after, or both... I guess both. (Concern: I'm worried that a UPS failure during burn-in may damage the drives. On the other hand, the drives consume some power, so testing a UPS without them doesn't prove much.)

Sent from my phone
 

jgreco · Resident Grinch · Moderator · #6
You cannot burn in a UPS, not really. Every deep cycle that a UPS goes through diminishes the remaining battery life. You are best off just doing whatever the manufacturer's recommended tests are.

Testing whether your UPS and FreeNAS are communicating is probably best done by tweaking the settings on the FreeNAS side to make it hyper-sensitive to power loss, and shutting down quickly after a power loss, thereby not drawing down the UPS battery much. This is perfectly fine to do.
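
To sanity-check the communication side, the FreeNAS UPS service is based on NUT, so a couple of its client commands will confirm the link. A minimal sketch, assuming the UPS service is enabled and the UPS entry is named "ups" (adjust to your configuration):

Code:
# Confirm FreeNAS (NUT) is talking to the UPS
upsc ups@localhost ups.status       # should report OL (on line) or OB (on battery)
upsc ups@localhost battery.charge   # battery charge, if the UPS reports it
# To rehearse the shutdown path without draining the battery, upsmon can
# simulate a forced-shutdown event -- note this WILL shut the box down:
upsmon -c fsd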
 