Hard drive topology for a small system: Limiting risk with a limited budget

Status
Not open for further replies.

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
If it's really a 100% likelihood of an error when reading 12TB, then presumably people would encounter checksum errors (on average) every 4 times they scrub a 75%-full 4TB drive. Are you really seeing error rates like that?

If so, is there a way to configure ZFS to keep more than 2 copies of metadata?

According to the statistics, yes. However, real-world results don't seem to be that bad. Silent data corruption still happens, though, and hard drives do (silently) develop bad sectors.
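For anyone checking the arithmetic behind those figures (this assumes the commonly quoted consumer-drive spec of one unrecoverable read error per 10^14 bits read, which is where the "100% per 12TB" number comes from):

12 TB = 12 x 10^12 bytes x 8 = 9.6 x 10^13 bits, so roughly 1 expected error per full read.
75% of 4 TB = 3 TB = 2.4 x 10^13 bits, so about 0.24 expected errors per scrub, i.e. roughly one every 4 scrubs.

Those are expected values off the spec sheet, not observed rates, which is part of why real-world results look better.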

ZFS already stores multiple copies of all metadata. I think it's 3 copies of metadata and 6 copies of uberblocks, but I could be wrong.
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
My recollection was that ZFS stores 2 copies of metadata, 3 if you set copies=2 for file data. But I haven't found the incantation to boost metadata up to 3 without increasing the copies of data.
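The only knob I'm aware of is the per-dataset copies property, and as far as I know it raises data and metadata redundancy together rather than metadata alone (the pool/dataset name below is just an example):

zfs set copies=2 tank/important
zfs get copies tank/important

My understanding is that with copies=2 ZFS keeps two copies of the file data plus an extra ditto copy of that dataset's metadata, so there doesn't seem to be a supported way to bump only the metadata count.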

Are hard drives developing bad sectors upon writing, or does write-once-read-4x also produce errors by the 4th read?
 

nick0

Cadet
Joined
Oct 15, 2014
Messages
5
Considering the following scenario, two years after the NAS goes live, is it true that starting with a 2x4TB mirror would actually be safer than a 4x2TB raidz2? (/ will be on a SATA DOM, backed up nightly.)

1. ZFS shows increasing checksum errors for a drive
2. You validate your replacement 4TB drive as non-defective with a SMART short and conveyance test, then run badblocks, then a long SMART test (see the command sketch after this list)
3. A few days later, the checksum errors are getting serious; errors appear in the SMART output
4. You finally attach the 4TB drive to the existing mirror, making it a 3-way mirror.
5. It resilvers very quickly (someone please provide some hard numbers)
5q. How safe is your data during the resilver?
6. After 40 days of no problems
7. You RMA the drive, get the new one, validate it, and add it as the second mirror
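For step 2, the validation pass would look roughly like this from the shell (the device name is made up, and the badblocks -w write test is destructive, so it only goes on a disk with nothing on it):

smartctl -t short /dev/ada3      # quick electrical/mechanical self-test
smartctl -t conveyance /dev/ada3 # check for shipping damage
badblocks -ws /dev/ada3          # destructive full-surface write/read pass
smartctl -t long /dev/ada3       # full surface read self-test
smartctl -a /dev/ada3            # review results and attributes afterwards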

It would seem that drive failure becomes an opportunity for adding redundancy. Chances are you'll get the RMA drive back before the upper limit for early hard drive mortality, and can continue with a 3-way mirror. Finally, in a case with room for only four hard drives, this approach leaves room for future expansion without replacing existing drives.

Compare this to the raidz2 scenario:

1. ZFS shows increasing checksum errors for a drive
2. You validate your replacement 2TB drive as non-defective
3. A few days later, the checksum errors are getting serious; errors appear in the SMART output
4. Finally, you physically replace the failing 2TB drive (see the command sketch after this list)
5. It resilvers slowly in comparison to a mirror (please provide some hard numbers)
5a. While resilvering, a drive completely dies
5b. Now you have raidz1-level parity with 2TB drives under the stress of a resilver
5c. The dying drive slows down the resilver
5d. You can't add parity to a raidz2
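For step 4, the replacement itself would be something like the following (hypothetical device names; the GUI's volume status "Replace" action is the supported way to do the same thing):

zpool offline tank ada2p2          # take the failing member out of service
# physically swap in the validated replacement, then:
zpool replace tank ada2p2 ada3p2   # start the resilver onto the new disk
zpool status -v tank               # watch resilver progress and per-disk error counts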

In effect, doesn't this mean that when a drive is slowly dying and a second drive then dies during the resilver, a 4x2TB raidz2 is more dangerous than a 3-way 4TB mirror with a resilver in progress?

I've tried to isolate these scenarios from my Linux experience, but my tendency is to believe that a resilver is harder on the drives than a scrub, and that a raidz2 resilver places the drives in a state of greater stress/potential failure than a mirror resilver.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Saying mirrors resilver quickly while RAIDZ2 resilvers slowly is inaccurate. Not to mention that the speed at which a pool resilvers is unique to your system, so providing numbers from some other system means nothing.

The two comparisons are not the same at all. You had a single failure in the first scenario and two failures in the second. How about you compare 2 failures to 2 failures? Then tell me your data is safe with mirrors. Also, in the process of replacing a defective disk you remove the bad one, so some "dying drive" shouldn't be causing problems unless you failed to do your job as a ZFS administrator. ;)

Resilvers literally *are* scrubs with regards to that vdev. The only difference is that when you do a resilver it's fully expecting to write data to the new device. That's all.
 

Oko

Contributor
Joined
Nov 30, 2013
Messages
132
I am sure my advice is going to ruffle many folks' feathers on this forum. ZFS is just not the file system for limited budgets. Full stop. For what is described above I would get a small SSD drive and 2x3TB industrial-grade HDDs (I think you can find those for under $200 a pop). Install DragonFly BSD on the SSD and mirror a HAMMER file system across those two drives. If the master or slave dies, make sure you have another drive ready to replace it. You will not need much beyond an Atom processor and 2 GB of RAM. Make sure you tune HAMMER history so you don't get a nasty surprise in the form of a completely filled file system.
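If anyone wants to explore that route, the rough shape of it would be something like the following (these commands are from memory and the paths are made up, so treat this as a starting point rather than a recipe):

hammer pfs-master /data/pfs/master                                  # master PFS on the first drive's HAMMER filesystem
hammer pfs-slave /backup/pfs/slave shared-uuid=<uuid-from-master>   # matching slave PFS on the second drive
hammer mirror-stream /data/pfs/master /backup/pfs/slave             # continuous local mirroring
hammer viconfig /data/pfs/master                                    # edit the snapshot/prune schedule so history doesn't fill the disk
hammer cleanup                                                      # run the prune/reblock pass (normally scheduled via periodic/cron)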
 

pjc

Contributor
Joined
Aug 26, 2014
Messages
187
Saying mirrors resilver quickly while RAIDZ2 resilvers slowly is inaccurate.
But wouldn't a 4x2TB array with 3 good drives still have to read 6TB to reconstruct the missing data, as opposed to a 2x4TB array only having to read 4TB in order to resilver? (I'm pretending they're 100% full for simplicity.)

And in the latter case, couldn't you leave the failing drive in while you add the extra mirror, so that (a) you have possible redundancy for some/most of your blocks and (b) you only need to read 2TB from each of the original drives? (You're still limited by speed of the target drive, of course.)

Resilvers literally *are* scrubs with regards to that vdev. The only difference is that when you do a resilver it's fully expecting to write data to the new device. That's all.
That makes sense. I hadn't thought of it that way... Seems like that might be a useful way to get a lower bound on the time required to resilver, no?
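Something along these lines would give that estimate (pool name is just an example):

zpool scrub tank
zpool status tank    # the "scan:" line shows progress, throughput, and estimated time to go

Since a resilver walks the same allocated data and additionally has to write it out to the new disk, the scrub time on the same hardware should be a reasonable floor for the resilver time.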
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
But wouldn't a 4x2TB array with 3 good drives still have to read 6TB to reconstruct the missing data, as opposed to a 2x4TB array only having to read 4TB in order to resilver? (I'm pretending they're 100% full for simplicity.)

Yes, but each disk is reading its part at the same time. It's not like it's reading from disk 1, then disk 2, then disk 3. All 3 disks are being hit simultaneously. So whether you have 2 or 4 disks the result is, for the most part, the same. I say for the most part because there are other factors beyond just the number of disks that actually do affect the outcome, and more disks indirectly affect it. The reality of it though is that you don't really need to consider resilvering to be a major factor in your pool layout, because you shouldn't be having to do it particularly often. If you've done your job and created a viable pool layout that will cover you during periods of bad disks, then it shouldn't matter much whether it takes 12 hours or 18. My pool scrub takes 36 hours and I don't sweat it at all. I just let it do its thing and life goes on.
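To put illustrative numbers on the parallelism (assuming ~130 MB/s sustained per disk, which is purely an assumption for the arithmetic): the 6TB in the 4x2TB case is 2TB per surviving disk, read simultaneously, so each disk spends roughly 2 x 10^12 / 1.3 x 10^8 ≈ 15,000 seconds (a bit over 4 hours) handing over its share; the reads don't queue up one disk after another. The actual wall-clock time is then dominated by pool fullness, fragmentation and seek behaviour, and the write speed of the target disk, which is exactly why numbers from someone else's system don't transfer.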

And in the latter case, couldn't you leave the failing drive in while you add the extra mirror, so that (a) you have possible redundancy for some/most of your blocks and (b) you only need to read 2TB from each of the original drives? (You're still limited by speed of the target drive, of course.)

Technically, yes. But for FreeNAS, no. FreeNAS' WebGUI doesn't let you add another disk to a vdev that is already a mirror. So you'd *have* to offline the old disk before adding the new disk. (Yes, you could do it from the CLI, but as I tell people regularly, if you are having to do things from the CLI you've probably screwed up.) But you don't want bad disks involved with resilvering, because ZFS will try to correct the failing disk as it goes. If there's only 1 or 2 bad sectors you might be okay, but for an egregiously bad disk it could make the resilvering process take longer than your expected lifespan. If you are in a position where you need to keep a bad disk in a pool during a resilver operation, you are failing to do your job as a ZFS admin. You are very likely to not have a working pool afterwards anyway.
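For the record, the CLI route being referred to is roughly this (hypothetical device names, and per the above, doing it from the shell on FreeNAS is at your own risk):

zpool attach tank ada1p2 ada3p2   # attach the new disk alongside an existing member, making a 3-way mirror
# wait for the resilver to complete, then remove the failing member:
zpool detach tank ada2p2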
 