Storage Expansion - RAIDZ Question

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
I'm adding a bunch of 10TB drives to my system next week; I've got six of them.

I'm currently torn between a striped mirror (3 vdevs of 2 drives each) and a 5-disk RAIDZ2 with a hot spare. They work out to roughly the same usable storage (about 30TB either way).

The striped mirror here is only guaranteed to tolerate the loss of one drive. (In theory it could tolerate the loss of three, but only if they were the 'right' drives.)

So the RAIDZ2 setup is twice as reliable, since it could tolerate the loss of any two drives, plus I'll have a spare?

On the surface the RAIDZ2 is a clear winner. However, I've always had a weird affinity for mirrors because (probably just in my head) the data is always in one place, rather than being spread across many drives, so from a disaster-recovery standpoint this seems... safer. Plus I read this article https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/ which suggests RAIDZ is the worst possible choice. It seems to suggest that rebuilding a RAIDZ2 member is far riskier in terms of further drive failure than accepting the slightly reduced drive-failure tolerance of the mirrored setup.

So then I went down the rabbit hole of reading up on rebuild times for large (i.e. 10TB) drives, and that seems to be a world of rumor and conjecture without a great deal of empirical evidence: some people assume a rebuild rate of 10MB/sec, others 150MB/sec, so it's very hard to draw a conclusion!
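
The spread matters a lot, though. A quick back-of-envelope (a rough sketch in Python, assuming the rebuild has to touch the entire drive; ZFS resilvers only copy allocated data, so a part-full pool finishes sooner):

    # Back-of-envelope resilver times for a 10TB drive at the rates
    # people throw around. Assumes the rebuild touches the full drive.
    drive_bytes = 10 * 10**12  # vendors use decimal terabytes

    for rate_mb_s in (10, 50, 100, 150):
        hours = drive_bytes / (rate_mb_s * 10**6) / 3600
        print(f"{rate_mb_s:>3} MB/s -> {hours:6.1f} hours ({hours / 24:.1f} days)")

At 10MB/sec that's nearly twelve days; at 150MB/sec it's under a day. No wonder the anecdotes disagree.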

Help?!
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Still trying to get my head around this...

If a disk fails in the RAIDZ2 vdev and the pool starts automatically rebuilding onto the hot spare, and another drive fails in that time, will the rebuild onto the hot spare carry on as normal? Or would the system have to stop the rebuild and start again?
 
Joined
Jun 15, 2022
Messages
674
The article is quite factual. However, it ignores that drives of the same age tend to fail at the same time, so in a RAID-5 system there's around a 10% chance a second drive will fail during the rebuild, and that's quite large: if there were a 10% chance you'd be in a fatal car crash next Monday, would you even get in a car?
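
To show how assumption-driven that sort of figure is, here's a toy model in Python (the AFR, the rebuild window, and especially the same-batch multiplier are numbers I'm picking purely for illustration):

    # Chance that at least one surviving drive fails during the rebuild
    # window, under an independent-failure model plus a crude
    # "same-batch drives fail together" multiplier (pure guesswork).
    afr = 0.05          # assumed 5% annual failure rate for aged drives
    window_days = 3     # assumed rebuild window
    survivors = 4       # a 5-disk RAID-5 minus the failed drive

    def p_second_failure(hazard_multiplier):
        p_one = 1 - (1 - afr) ** (window_days / 365 * hazard_multiplier)
        return 1 - (1 - p_one) ** survivors

    print(f"independent drives:  {p_second_failure(1):.2%}")   # ~0.17%
    print(f"same-batch, say 50x: {p_second_failure(50):.1%}")  # ~8%

With truly independent drives the number is tiny; it's the correlated aging of same-batch drives that pushes the real-world risk toward the often-quoted ~10%.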

This doesn't directly translate to ZFS mirrors because the rebuild depends on the amount of data on the drive, and with a hot spare the risk is less.

The same goes for a failed Z2 array: only the allocated data is rebuilt. However, consider a Z2 failure sequence:
  • One drive fails; you're now effectively at RAID-Z1.
  • A second drive fails during the rebuild. Given that the array has accumulated years' worth of data by the time drives start failing (when they're old and doing a lot of seeks), this is exactly when it tends to happen. Now there's no ECC.
  • One of the remaining drives hits an uncorrectable read error, which is not uncommon for high-capacity drives. That data is lost.
I'd say the above scenario is rare, but how important is your data? Do you have a 3-2-1 backup strategy? Is this an office NAS with 15 users or home? How much data is retrieved and written daily during a rebuild?

Also, is your NAS storing large or small files, and how often are they modified? ZFS is copy-on-write, so some datasets fragment naturally, and array speed at 25% usage can be 10% of ext4's. So there are considerations.

I run 8-disk SAS RAID-Z3 arrays on HGST drives because that works well for me; for you, probably not. In fact most members run Z2 or mirror sets; it depends on your needs.
 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Thanks, that's useful.

This doesn't directly translate to ZFS mirrors because the rebuild depends on the amount of data on the drive, and with a hot spare the risk is less.
Surely the point about aged drives also applies to mirrors: if all the disks in a mirror are from the same batch, then there's a higher chance the second disk in a single mirror vdev will fail while it's resilvering (or, in my case, while the vdev sits degraded, since the mirror layout uses all six drives and leaves no hot spare, meaning I'd be purchasing and waiting for another 10TB drive to arrive). That's mitigated by the shorter mirror resilver vs. a RAIDZ rebuild, obviously.

A second drive fails during the rebuild. Given that the array has accumulated years' worth of data by the time drives start failing (when they're old and doing a lot of seeks), this is exactly when it tends to happen. Now there's no ECC.
I thought you still had some ECC available, as ZFS still stores a checksum on the block pointer, even in RAIDZ? So you'd have to have a multi-bit read error after two failed disks for it to be unrecoverable. How common are uncorrectable read errors realistically? Does ZFS log them? I've had some drives running in my current NAS for about six years; it would be interesting to see if they've had any read errors in that time.

But, again, surely the same logic applies to the mirrors: if a single disk in a 2-disk mirror vdev fails, then an uncorrectable error while reading the only remaining disk in that vdev will mean data that can't be reconstructed, and at worst could hose the vdev and hence the pool.
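
Trying to put rough numbers on that trade-off (a sketch assuming the pessimistic spec-sheet URE rate of 1e-14 per bit, and that the surviving disks get read in full; both are assumptions):

    import math

    # P(at least one unrecoverable read error) while reading data_tb
    # terabytes, for an assumed per-bit URE rate.
    def p_ure(data_tb, ure_per_bit=1e-14):
        bits = data_tb * 1e12 * 8
        # 1 - (1 - p)^n, computed stably for tiny p
        return -math.expm1(bits * math.log1p(-ure_per_bit))

    print(f"mirror resilver, 1 survivor (10TB):      {p_ure(10):.0%}")  # ~55%
    print(f"5-disk Z2, 2 failed, 3 survivors (30TB): {p_ure(30):.0%}")  # ~91%

Though as I understand it, a URE during a ZFS resilver typically means losing the affected blocks or files rather than the whole pool outright, and drives rated at 1e-15 cut these numbers substantially.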

I'd say the above scenario is rare, but how important is your data? Do you have a 3-2-1 backup strategy? Is this an office NAS with 15 users or home? How much data is retrieved and written daily during a rebuild?

Also, is your NAS storing large or small files, and how often are they modified? ZFS is copy-on-write, so some datasets fragment naturally, and array speed at 25% usage can be 10% of ext4's. So there are considerations.
Well, herein lies the dilemma. The data is not mission-critical but would be very time-consuming to replace, so I'm looking for the most reliable NAS setup without the expense/time of external/offsite backups. Hence trying to decide between the mirror and RAIDZ2+spare for the same overall capacity.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Plus I read this article https://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-not-raidz/ which suggests RAIDZ is the worst possible choice.
Yeah, no. It's a very alarmist, self-centered conclusion that the author draws from fairly accurate observations.

I thought you still had some ECC available, as ZFS still stores a checksum on the block pointer, even in RAIDZ?
No, ZFS can detect errors at that point, but not correct them. Caveat: some data is automatically stored in duplicate or triplicate (metadata and critical metadata, respectively), so if you're lucky you can avoid data loss - best not to count on luck though.

Hence trying to decide between the mirror and RAIDZ2+spare for the same overall capacity.
Well, the spare is pointless at that point. Go with RAIDZ3 rather than RAIDZ2.
As for RAIDZ3 vs mirrors, the reliability is going to be insanely better with RAIDZ3 (all else being equal).

Will performance be worse? Yes, for many workloads. Does it matter? Probably not. Will rebuilds take longer? Yes, but with three disks' redundancy, you could probably tolerate a week-long rebuild (which is not going to happen).
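
To make the combinatorics concrete, here's a quick enumeration (a sketch assuming six identical disks, counting only whole-drive failures and ignoring rebuild-window effects):

    from itertools import combinations

    disks = range(6)
    mirror_vdevs = [(0, 1), (2, 3), (4, 5)]  # 3 x 2-way mirror

    def mirrors_survive(failed):
        # The pool dies if any mirror vdev loses both of its disks.
        return not any(set(v) <= set(failed) for v in mirror_vdevs)

    for k in range(1, 5):
        sets = list(combinations(disks, k))
        m = sum(mirrors_survive(f) for f in sets) / len(sets)
        z3 = 1.0 if k <= 3 else 0.0  # 6-disk RAIDZ3 tolerates any 3 losses
        print(f"{k} failed: mirrors survive {m:4.0%}, RAIDZ3 survives {z3:.0%}")

Lose two random disks and the mirror layout is already dead 20% of the time; lose three and it's 60%. The 6-disk RAIDZ3 shrugs off any three.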
 
Joined
Jun 15, 2022
Messages
674
You know, maybe this would answer your question better, since what you really want to do is evaluate your needs and then design a system to meet them.


Somewhere I just read a test of data size vs. block size on ZFS volumes, but I cannot find it (which sucks, because it was excellent). This is somewhat the same information:

 

seb101

Contributor
Joined
Jun 29, 2019
Messages
142
Thanks!
You know, maybe this would answer your question better, since what you really want to do is evaluate your needs and then design a system to meet them.
That's actually very helpful. My use case is larger, never-modified files, which seemingly suits RAIDZ better than mirrors. My mind is made up!

Well, the spare is pointless at that point. Go with RAIDZ3 rather than RAIDZ2.
My (possibly flawed) logic is that having a spare sitting idle means it's less likely to fail on the same timeline as the other disks in the vdev, whereas a third parity disk will wear out and potentially fail at the same rate as the other disks.
 
Joined
Jun 15, 2022
Messages
674
Thanks!

That's actually very helpful. My use case is larger, never-modified files, which seemingly suits RAIDZ better than mirrors. My mind is made up!
I have a "cold-storage" RAID vdev for large data, and a "hot-storage" vdev as well. When data cools off it migrates from hot to cold.

Some people set up an SSD RAID for "hot-storage," which is great if the data flow outruns spinners.

My (possibly flawed) logic is that having a spare sitting idle means it's less likely to fail on the same timeline as the other disks in the vdev, whereas a third parity disk will wear out and potentially fail at the same rate as the other disks.
Very true; the system uses less power and the drives tend to last longer.

One other thing to look into is mirroring the boot drive; I hear that if it fails it can take the RAID arrays down with it, and in select cases they're unrecoverable, though I cannot speak to that.

 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
My (possibly flawed) logic is that having a spare sitting idle means it's less likely to fail on the same timeline as the other disks in the vdev, whereas a third parity disk will wear out and potentially fail at the same rate as the other disks.
Maybe, slightly? I don't expect a disk that's spinning to last meaningfully longer just because it's idling (aside from edge cases like aggressive head parking timers, which would not be the case on any NAS/Server drive). But I know for sure that it will represent an additional disk's worth of parity ready at all times.
Less power? Yeah, but if that's a concern, keep it as a cold spare instead.
One other thing to look into is mirroring the boot drive; I hear that if it fails it can take the RAID arrays down with it, and in select cases they're unrecoverable, though I cannot speak to that.
Woah woah woah woah, where did you get that idea? Let's take a step back or two, because that's significant misinformation.

I'll be naturally focusing on ZFS, despite the RAID nomenclature, but much of this applies to anything. No sane RAID solution stores its data outside the constituent disks.

So, the boot pool: it holds the OS and the live configuration database. It can optionally hold the system dataset (which must be on reliable storage, as Samba depends on some stuff that, while not critical, is painful to lose). The disks can also have swap partitions created on them by the installer. It may also hold encryption keys if you're using encryption, and that's really the only potentially critical point of failure: if you're using encryption, make damned sure you have your keys backed up.
Obviously, bad things happen if the OS' storage disappears, so let's examine what the state of your stuff is upon a clean install:
  • Config: Gone (aside from backups)
  • Keys: Gone (aside from backups)
  • OS: Same as before, assuming you installed the same version.
  • Data pools: Basically untouched, in whatever state they were.
That last one is the key: Any data pools are still there, same way they always were. If you have a config backup, great, upload it and be on your merry way. If not, just reconfigure everything.

So, how reliable do you need your boot device to be? It comes down to uptime, mostly. Any decentish SSD will last essentially forever in this role, so failure rates should be low. Business setting? By all means use two decent SSDs. Home user? If you're okay with somewhat inconvenient downtime, who cares, save the money now.
 
Joined
Jun 15, 2022
Messages
674
Woah woah woah woah, where did you get that idea? Let's take a step back or two, because that's significant misinformation.
A long-time member here said that if the boot drive dies it could cause a stability issue that corrupts the dataset (and thereby loses it). So, not wanting to endure a lost dataset, I've got a triple-mirror boot setup.

With that said, I'll stay at a triple-mirror boot pool instead of quadruple, and feel better about moving it to SSDs (my well-used 2.5" spinners are dying off). :grin:
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Just to address one point, the "OMG resilvering is so stressful!" argument is bunk. It ignores the fact that scrubs--which are a regular part of pool maintenance--put exactly the same stress on the pool (because they do exactly the same thing, with the obvious exception of writing the data out to the replacement disk). The article makes a number of good points, and I'd even agree that the conclusions make sense for small installations--but the absolute conclusion of the title isn't warranted.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Just to address one point, the "OMG resilvering is so stressful!" argument is bunk.
Yeah, along those lines. I think the message got distorted from something like "if you have failing disks [plural], you'll probably find out during a resilver, after the first one fully craps out, and that's when you first figure out that something is amiss". I've seen the same argument here on the forums, too.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
I've seen the same argument here on the forums, too.
It factors heavily into the "RAID5/RAIDZ1 is dead" argument, which I also consider to be greatly overstated. It's not that the risk doesn't exist, but it's stated as, if not a certainty, then at least as more likely than not, which just isn't true.
 