As hard disk capacities increase, at what point do two disk mirrored vdevs stop making sense?

Davvo · Sep 22, 2023

The resource linked above goes into some detail about a few of those issues by grouping quotes from experienced users and a few external articles.

ChrisRJ · Sep 23, 2023

As a practical example, here is my experience with Seagate Exos X16 16 TB drives over the last 3 years. I am running 8 of them in a RAIDZ2 since September 2020. Over these 3 years I have lost 4 of those 8 drives. Never saw anything even remotely like that in over 30 years.

My point is that probability and statistics are non-trivial matters. Worse so, since they often go against your gut feeling. An expectation value is not a guaranteed thing, but some kind of average. To meet it in reality is therefore highly unlikely.

Richard Kellogg · Sep 23, 2023

Etorix said:
Err… actually from your own post #10.
20 TB Exos drives have a specified rate of 1e-15 (datasheet), which then works out to p(12TB) = 0.908. So a 2-way mirror vdev of 20 TB Exos drives which loses one drive while 60% full would have a 91% chance of resilvering without issue. Better… but there's still a less-than-fully-comfortable 9% chance of having to go for the backup to restore a damaged file.

20 TB Ironwolf and WD Red Pro drives are also rated at u=1e-15. WD Red Plus are rated at u=1-14, but this line tops at 14 TB "only". (Still, no-one is going to like the calculation for resilvering out of a 14 TB Red Plus drive with 8-10 TB of actual, valuable, data on it.)

Take home points: Do not entrust large amounts of valuable data to vdevs/pools with only one degree of reduandancy. And ALWAYS HAVE BACKUPS!

Yes, you are correct. I understand now, and I was incorrect.

AlexGG · Sep 23, 2023

Etorix said:
WD Red Plus are rated at u=1-14, but this line tops at 14 TB "only"

If I'm not mistaken in my math, that translates to the following: if you fill the 14TB drive with data, there is only about 1 in 3 chance of reading it back without an error and less than 1 in 10 chance to read it back twice without an error. This is quite contrary to all experience.

Resilvering a mirror is the same as reading the data back from the disk.

Etorix said:
So a 2-way mirror vdev of 20 TB Exos drives which loses one drive while 60% full would have a 91% chance of resilvering without issue.

So, given the parameters, about one in ten attempts to read all data from a 60% full drive should result in something being broken. No, the calculation is way off.

Etorix · Sep 24, 2023

The math is detailed above. Anyone is welcome to expose flaws in the reasoning.

I trust that the math is right. The only parameter which may be questionable is the URE rate. Manufacturers typically quote it as "less than 1 in 1E<number> bits read", which I then interpret as the worst case equal to 1 in 1E<number> (u=1E-<number>). Surely, if manufacturers were confident that the rate were 1E-<number+1> or better they would plainly state so…
Maybe this is a raw rate off the platters, and the drive's firmware then tries really hard to recover valid data before passing it off.
Maybe the "experience" with reading back tens of terabytes of data from drives and validating the result is not what we think it is…

I trust the math, and I take the results at face value. Accordingly, I accept that 2-way mirrors of 10+ TB hard drives are NOT suitable as redundant storage which could reliably recover from a single drive failure without having to resort to a backup. (I do have backups, but I prefer to not have to resort to them for less than a major disaster.)

Anyone is free to take a different stance, dismiss UREs and hope for the best.
Just like anyone if free to ignore other guidances and run a 10+ drive ZFS array on a 450W PSU and 8 GB RAM ("it will be fine, ZFS does not need that much RAM and all drives will not take up 30W each all at the same time").

Davvo · Sep 24, 2023

AlexGG said:
Resilvering a mirror is the same as reading the data back from the disk.

Not really.

jgreco said:
A resilver and a scrub are very similar in that they walk the pool in a virtually identical manner. [...] However, when resilvering, or even when just repairing checksum errors during normal read operations, you are also doing an additional write operation and some other stuff. [...] For a single disk, writing a single sector shouldn't be terribly hard. [...] However, for a single SMR disk, writing a single sector involves rewriting the entire shingle, and we already know that this can get very hard on pools, even outside of a scrub operation, if more than a small number of rewrites are involved. This is what led to the original kerfuffle about SMR disks: people had pools that were failing to resilver, even if they had RAIDZ2 or RAIDZ3 protection. Worse, for even a CMR disk, the sustained write activity increases stress particularly on the target (drive being replaced), increasing temperatures. It is not just a function of reading the existing data sectors and verifying the parity sectors. It is reading the block's sectors, back-calculating the missing data or parity sectors, and then writing that out to the replaced disk. This is more work than just reading all the disks. Reading is relatively trite and some of it is mitigated by drive and host caching. Writing semirandom sectors to rebuild ZFS blocks typically requires a seek for each ZFS block, which may be harder on the drive being written to. More work equals more heat. Finally, resilvers on mirrors are somewhat easier than RAIDZ because you might only be involving two or three disks (meaning only two or three disks are running warm). RAIDZ, on the other hand, involves each disk in the vdev, and because the process is slower due to the nature of RAIDZ, all the component drives run busier, for longer, get warmer, and it just isn't really a great thing for them. [...]

Davvo · Sep 24, 2023

Richard Kellogg said:
I found this presentation from cern which goes a long way into understanding modern drive error mechanisms. https://indico.cern.ch/event/247864...ts/426734/592321/HEPIX_October_2013_ver_6.pdf

I had found it as well while doing research for the resource, and the issue is that's from 10 years ago, which is a long time: HDD drives have greatly increased space, new technology has been developed, things changed.
It's a valid start point, but not likely to represent modern drives.

Whattteva · Sep 24, 2023

Yeah, some HDD's from Seagate even has multi-actuators now and Seagate has a roadmap of HDD's reaching over 100TB.

Seagate's Roadmap: The Path to 120 TB Hard Drives

www.anandtech.com

Important Announcement for the TrueNAS Community.

As hard disk capacities increase, at what point do two disk mirrored vdevs stop making sense?

Davvo

MVP

ChrisRJ

Wizard

Richard Kellogg

Dabbler

AlexGG

Contributor

Etorix

Wizard

Davvo

MVP

Davvo

MVP

Whattteva

Wizard

Seagate's Roadmap: The Path to 120 TB Hard Drives

Similar threads