Hard drive topology for a small system: Limiting risk with a limited budget

Status
Not open for further replies.

nick0

Cadet
Joined
Oct 15, 2014
Messages
5
The purpose of the system is to serve as a budget NAS for recording music and as a backup store for various workstations. It will have a UPS and ECC memory.

Currently, I need ~2.5TB, and I should be able to keep my requirements under 4TB by archiving old projects to Blu-ray discs. But that 2.5-to-4TB needs to be near bullet-proof, near silent, and as affordable as possible. I currently have one 2TB WD Red that can go into this system, which also helps stagger drive age/failure.

Assuming 2TB WD Reds cost $100, 4TB cost $200, and 3TB cost $130, what is the best route to cheap and near bullet-proof? 4TB WD Red Pros cost $269. I would be buying half the drives from one store and half from another. Let's assume a constant 10% chance of a drive suddenly dying. Please help me with the math, along with estimated rebuild times and the risk of losing everything during a rebuild (my own rough attempt is below the option list).

I ruled out HGST because of the yowling-cat sound I remember from back in the day, and Seagate because the seek/write clicking is much more apparent than on the WD Reds. From what I can tell, these are my options:

a) 3x2TB RAIDZ1 = 4TB, $200, 3 spindles ~6 platters. Risky?
b) 4x2TB RAIDZ2 = 4TB, $300, 4 spindles ~8 platters. Noisiest? Safe?
c) 2x3TB mirror = 3TB, $260, 2 spindles ~6 platters. More risky than (a)? Quieter than (a)?
d) 3x3TB dual mirror = 3TB, $390, 3 spindles ~9 platters. Safest?
e) 2x4TB mirror = 4TB, $400, 2 spindles ~8 platters. Most dangerous? Quietest?
f) 3x4TB dual mirror = 4TB, $600 -- Over budget.
g) 2x4TB Pros mirror = 4TB, $540, 2 spindles, ~8 platters. Expensive. Noisiest? As safe as (b)?
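
For reference, here's my own rough back-of-envelope under that flat 10% per-drive assumption (treating failures as independent over the same period, with no replacement in between), which is part of the math I'd like checked:

2-drive mirror (c, e, g): lost only if both die: 0.1 x 0.1 = 1%
3-drive RAIDZ1 (a): lost if 2 or more of 3 die: 3 x 0.1^2 x 0.9 + 0.1^3 = 2.8%
4-drive RAIDZ2 (b): lost if 3 or more of 4 die: 4 x 0.1^3 x 0.9 + 0.1^4 ≈ 0.37%
3-drive mirror (d, f): lost only if all 3 die: 0.1^3 = 0.1%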

[edit] I had two ds, so I corrected the lettering. Oops!
 
Last edited:

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
RAIDZ2 is inherently more reliable than striped mirrors, for four drives at least. d) is probably safest, but a bit on the crazy side. RAIDZ1 is to be avoided. b) would probably be my choice.
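
As a rough illustration (my numbers, not gospel): with four drives arranged as two striped mirrors, 2 of the 6 possible two-drive failure combinations (both halves of either mirror) destroy the pool, whereas a four-drive RAIDZ2 survives any two failures.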
 

nick0

Cadet
Joined
Oct 15, 2014
Messages
5
In the 4x2TB RAIDZ2, two drives would be from the same batch. In the 2x4TB mirror, each would be from a different batch... Suppose there's 2.5TB of data in the pool, the factory had a bad run, and half the drives outright die. Then I lose all my parity in the RAIDZ2, and 100% of the redundancy in the mirror scenario. The RAIDZ2 is going to take longer to resilver than the mirror, so wouldn't I be more at risk? But if the drives from the bad batch only partially fail, then the RAIDZ2 is more resilient, yes?

So I guess the question is: For the WD Red drives, do they die suddenly, or do recoverable errors slowly creep in? I know the WD RE4 drives fail slowly with recoverable errors, but I'm not sure if the Red series fails in the same way...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You won't have two failures in quick succession with four drives if they've been properly tested. If they haven't been, no amount of redundancy will solve the problem, so test them. Search the forums; there are a few methods for doing so from within FreeNAS.
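
For what it's worth, the burn-in sequence that usually gets posted looks something like the following (the badblocks write pass is destructive, so only run it on drives with no data; adaX is a placeholder for each disk):

Code:
smartctl -t short /dev/adaX
smartctl -t conveyance /dev/adaX
smartctl -t long /dev/adaX
badblocks -ws /dev/adaX
smartctl -a /dev/adaX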

Besides, most drives fail slowly, giving you time to replace them early. Keep a spare handy (applies to all scenarios).

Worrying about batches is the domain of large servers, not four disk servers.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Worrying about batches is the domain of large servers, not four disk servers.
Possibly a n00b question, but doesn't picking drives from different batches mitigate the risk of RAIDZ1? With a 4TB drive, you're stressing the pool for ~8 hours to resilver, but how likely are you to lose a second drive in that 8-hour window?

Or am I misunderstanding the reason people recommend against Z1?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Possibly a n00b question, but doesn't picking drives from different batches mitigate the risk of RAIDZ1? With a 4TB drive, you're stressing the pool for ~8 hours to resilver, but how likely are you to lose a second drive in that 8-hour window?

Or am I misunderstanding the reason people recommend against Z1?

Your argument is invalid. Even with RAIDZ1, you only need one disk to fail and another disk to have a handful of bad sectors in the right place, and your pool is gone for good.

Also, buying from multiple batches, when dealing with small servers, increases the chances you'll get one of those bad batches.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Interesting. How likely are bad sectors in practice? In my pre-ZFS experience, I usually saw some reallocated sectors before actual failure, though admittedly without checksums it's hard to know whether there was data loss.

What kind of bad sectors are normally repaired by a healthy array (as opposed to using copies=2)?
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Interesting. How likely are bad sectors in practice? In my pre-ZFS experience, I usually saw some reallocated sectors before actual failure, though admittedly without checksums it's hard to know whether there was data loss.

What kind of bad sectors are normally repaired by a healthy array (as opposed to using copies=2)?

Data is repaired if there is sufficient redundancy. During a disk replacement of a RAIDZ1 you have zero redundancy.

If you want to know more, read the link in my signature.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Thanks for the link to your ppt. A follow-up question:

You say "statistically you can also expect some ZFS corruption and data loss." Why is this likely? I would have expected that with regular scrubbing, everything not on the failed drive is good, no? How does URE (unrecoverable read error) rate interact with scrubbing?

Also, does the admonition against RAIDZ1 also argue against a 2x1 mirror? Or is there a reason that a mirror is safer?
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
I have yet to see a single error on my NAS, even though statistically something should have happened. Problems come when things start to go bad. (Which is what we prepare for; redundancy would be rather silly if nothing ever failed.)

You are correct that the risk applies to mirrors as well, the reason being that they have no redundancy while rebuilding.
I'm not sure whether they are less likely to lose the pool from the same errors a Z1 would.

If you build your infrastructure so that you can lose your pool without a problem, then it won't matter that much what you do.
So build your system to fail; don't rely on it not failing.

Point being, if you need to worry about mismatching batches to spread out potential failures or UREs, then the system is designed wrong. Yes, redundancy costs money and wastes space, but there is good reason for it. Hopefully that's not something you only realize after something bad has happened.
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
Please help me with the math, and also estimated rebuild times, and the risk of losing everything during the rebuild.

b) 4x2TB RAIDZ2 = 4TB, $300, 4 spindles ~8 platters. Noisiest? Safe?

I'm testing situation (b). It should be safe - safer than (a), RAID-Z1, for sure. It's a little noisy on read/write but that could be due to the hotswap bay I'm using.

I currently have about 300 GB written to it. A scrub takes 1 hour 16 minutes. I've simulated a rebuild with a spare drive but that was without any data in the zpool. Your post prompted me to try one out with this 300 GB of data. zpool status -v says:

Code:
  pool: volume1
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Oct 16 13:27:02 2014
        120G scanned out of 1.25T at 284M/s, 1h9m to go
        29.2G resilvered, 9.43% done
config:

        NAME                                              STATE     READ WRITE CKSUM
        volume1                                           DEGRADED     0     0     0
          raidz2-0                                        DEGRADED     0     0     0
            gptid/e8865199-53dd-11e4-a741-0cc47a0984f3    ONLINE       0     0     0
            gptid/bcb83505-53d5-11e4-a741-0cc47a0984f3    ONLINE       0     0     0
            gptid/bd191bf4-53d5-11e4-a741-0cc47a0984f3    ONLINE       0     0     0
            replacing-3                                   OFFLINE      0     0     0
              4896928397775504369                         OFFLINE      0     0     0  was /dev/gptid/bdd2d702-53d5-11e4-a741-0cc47a0984f3
              gptid/a3bcd842-5559-11e4-a741-0cc47a0984f3  ONLINE       0     0     0  (resilvering)

errors: No known data errors
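
In case anyone wants to reproduce the test: the simulated failure boils down to offlining one member and replacing it, roughly like this (the FreeNAS GUI does the same thing; the gptid values below are placeholders, not my actual disks):

Code:
zpool offline volume1 gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
zpool replace volume1 gptid/xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx gptid/yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
zpool status -v volume1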


Edit: resilvering is finished:

Code:
scan: resilvered 309G in 1h29m
 
Last edited:

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Also, does the admonition against RAIDZ1 also argue against a 2x1 mirror? Or is there a reason that a mirror is safer?

Mirrors are safer only in that there is less chance of URE statistically. Keep in mind that "mirrors" are better than RAID1 in that you can have 3 disks in a mirror so you have 2 copies if a disk fails. I've helped a couple of companies go that route because they wanted more protection. ;)
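
For what it's worth, a third disk can be added to an existing two-way mirror with a simple attach; pool and device names below are just examples:

Code:
# create a three-way mirror from scratch
zpool create tank mirror ada0 ada1 ada2
# or attach a third disk to an existing two-way mirror
zpool attach tank ada0 ada2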
 

nick0

Cadet
Joined
Oct 15, 2014
Messages
5
most drives fail slowly, giving you time to replace them early...Keep a spare handy...Worrying about batches is the domain of large servers, not four disk servers.

It's clear that RAIDZ2 is safer than a striped pair of mirrors, but I wonder how much more dangerous a 2x4TB mirror is vs. a 4x2TB RAIDZ2... if most drives fail slowly, you have a spare drive handy, and the pool has 100% redundancy.

Point being, if you need to worry about mismatching batches to spread out potential failures or UREs, then the system is designed wrong.

Also, buying from multiple batches, when dealing with small servers, increases the chances you'll get one of those bad batches.

OK, you have all made me rethink buying from multiple batches. It sounds like I'd just be increasing the potential risk of failure and wasting money on extra shipping costs. If the box with the hard drives is somehow molested, man-handled, or mysteriously degaussed during shipping, will the damaged drive probably fail in the first ~200 hours of use, after thorough testing?

Mirrors are safer only in that there is less chance of URE statistically. Keep in mind that "mirrors" are better than RAID1 in that you can have 3 disks in a mirror so you have 2 copies if a disk fails. I've helped a couple of companies go that route because they wanted more protection. ;)

I've experimented with virtualised hard drive images, purposefully corrupting them by using dd to seek to certain regions and zero them out, but my results were inconclusive. With the mirror, if the blocks associated with a file are corrupted on both drives, the file is irreparably gone from the live copy and all snapshots, assuming the file has not been modified, right? But what about with the RAIDZ2? With four drives it seems less likely that the same block will be affected, but then there are four rather than two drives that can fail.
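
For the record, the corruption step on the throwaway images was roughly the following (file names, offsets, and the pool name are just examples; never point this at a disk you care about):

Code:
# overwrite an 8 MB region in the middle of a file-backed test vdev
dd if=/dev/zero of=disk1.img bs=1M seek=512 count=8 conv=notrunc
# then scrub and see what ZFS reports
zpool scrub testpool
zpool status -v testpool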

I'm testing situation (b)...It's a little noisy on read/write but that could be due to the hotswap bay I'm using...Your post prompted me to try [a rebuild] with this 300 GB of data. zpool status -v says:
...
120G scanned out of 1.25T at 284M/s, 1h9m to go
...
Code:
scan: resilvered 309G in 1h29m

Fraoch, thanks for the numbers! Wow, 284M/s. That's much faster than I expected. What kind of drives are you using? I wonder how much slower it will be once the pool is 80% full. As for noise: since ZFS isn't designed to minimize seeks, I worry that the seek noise of 4x2TB will be substantially worse than 2x4TB.

Honestly, I'm now mostly persuaded that 4x2TB RAIDZ2 is probably the way to go.
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
Fraoch, thanks for the numbers! Wow, 284M/s. That's much faster than I expected. What kind of drives are you using?

Bog-standard WD Red 2 TB drives, WD20EFRX. Obviously 284 MB/s is not their raw speed; dd tests them at a maximum of about 116 MB/s, which is very good for a 5400 RPM drive.
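
By "dd tests" I just mean a plain sequential read, something along these lines (ada0 being whichever disk is under test):

Code:
dd if=/dev/ada0 of=/dev/null bs=1M count=8192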
 

Fraoch

Patron
Joined
Aug 14, 2014
Messages
395
If the box with the hard drives is somehow molested, man-handled, or mysteriously degaussed during shipping, will the damaged drive probably fail in the first ~200 hours of use, after thorough testing?

(hopefully) it will show up in the SMART conveyance test, which is specifically designed to test for damage during shipping:

Code:
smartctl -t conveyance /dev/adaX
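smartctl -l selftest /dev/adaX  # (optional follow-up) check the self-test log once the test completes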
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Conveyance is supposed to find that kind of thing, but 200 hours is a bit short. The general rule is 1000 hours (or about 40 days of continuous use).
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Mirrors are safer only in that there is less chance of URE statistically.
Got it: a single 4TB drive has a lower risk of URE (roughly 1 in 3) than 3x4TB does (roughly 100% assuming full usage).

But what is the cause of UREs?

Is it "unrecoverable" at the drive level because it times out or a cosmic ray flips a bit (and thus a checksum failure with subsequent re-read at a higher layer might be happy with it)?

Or is it essentially a write failure that didn't manifest until you read the data later? If so, wouldn't a scrub likely catch it in some instances, especially write-once repositories?

I'm just trying to get a better handle on what a 1-error-in-12TB-read means in practice: am I guaranteed a read error if I read a 4TB drive 3 times? If so, why? If I only wrote it once, why wouldn't it have failed the first two times?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Got it: a single 4TB drive has a lower risk of URE (roughly 1 in 3) than 3x4TB does (roughly 100% assuming full usage).

But what is the cause of UREs?

Is it "unrecoverable" at the drive level because it times out or a cosmic ray flips a bit (and thus a checksum failure with subsequent re-read at a higher layer might be happy with it)?

Or is it essentially a write failure that didn't manifest until you read the data later? If so, wouldn't a scrub likely catch it in some instances, especially write-once repositories?

Could be anything. Drives operate at the edge of what is physically possible - the only thing that catches them if they fall is internal error-correction.

In the end, it does not matter. If the corruption happens while there is no redundancy, buh-bye, pool.

I'm just trying to get a better handle on what a 1-error-in-12TB-read means in practice: am I guaranteed a read error if I read a 4TB drive 3 times? If so, why? If I only wrote it once, why wouldn't it have failed the first two times?

There are no guarantees here or anywhere in statistics - only significance and probabilities.

First of all, if you read the same drive three times in quick succession, you do not have three independent trials - the latter two's probability of success depends on the first one. An error is highly unlikely to pop out of nowhere in a short period.
If you simultaneously read three different drives, then you can apply that thought process (which still has a flaw, see below).

Next, while the probability of the drives having an error will tend towards 1, it will not equal 1. You may have a very large probability of success (or, in this case, failure), but it won't be 1 except at infinity.

Manufacturers rate their consumer drives at less than one error per 10^14 read bits, which is about an 8% chance of an error in each read Terabyte - this is of course only an estimate, which I'd expect to be optimistic.
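
(Spelling out the arithmetic: 10^14 bits is 1.25 x 10^13 bytes, or 12.5 TB, so each terabyte read carries an expected 1/12.5 ≈ 0.08 errors, hence roughly 8%.)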
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
I'm trying to get at the underlying assumptions that the manufacturer made (if possible) that yield the specified error rate. I have to imagine that some kinds of workloads would increase that rate (or approach it) and some kinds of workloads would decrease that rate. It's not just a magical event without explanation.

If it's really a 100% likelihood of an error when reading 12TB, then presumably people would encounter checksum errors (on average) every 4 times they scrub a 75%-full 4TB drive (3TB read per scrub, so 4 scrubs ≈ 12TB). Are you really seeing error rates like that?

If so, is there a way to configure ZFS to keep more than 2 copies of metadata?
 