ZFS Parity-only write-only emergency disk

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Just give me RAIDZ4.
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
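To make the parity math a bit more concrete, here is a toy sketch of dual parity over GF(2^8), in the spirit of the RAID-6/RAIDZ2 style of construction: P is a plain XOR of the data columns, and Q is a Reed-Solomon parity accumulated with the generator 2. This is not the actual vdev_raidz.c code; the field polynomial, layout, and names are my own assumptions for illustration.

```python
# Toy dual-parity (P/Q) computation over GF(2^8). Illustrative only, not ZFS code.

GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a reduction polynomial commonly used for RAID-6-style parity


def gf_mul2(x: int) -> int:
    """Multiply a GF(2^8) element by the generator 2."""
    x <<= 1
    if x & 0x100:
        x ^= GF_POLY
    return x & 0xFF


def pq_parity(data_columns: list[bytes]) -> tuple[bytes, bytes]:
    """Compute P (plain XOR) and Q (Reed-Solomon) parity over equal-length data columns."""
    length = len(data_columns[0])
    p = bytearray(length)
    q = bytearray(length)
    # Horner's rule: walk the columns from last to first, multiplying Q by 2 each step,
    # so Q ends up as sum over i of 2^i * data[i] in GF(2^8).
    for col in reversed(data_columns):
        for i, byte in enumerate(col):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


# With P and Q, any two lost data columns can be recovered by solving a small
# linear system in GF(2^8).
data = [b"hello world!", b"zfs parity..", b"raidz2 toy.."]
P, Q = pq_parity(data)
print(P.hex(), Q.hex())
```

A third parity (RAIDZ3) adds one more independent equation built with a different generator; per the errata mentioned above, the trouble only starts once a construction like this is pushed to four or more parity elements.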
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
Now I am interested in how RaidZ is actually implemented. It's hard to imagine this sudden spike in complexity without knowing the details.
Are there good resources that explain the algorithms (for dummies)?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
You mean the Berlekamp–Massey algorithm?

EDIT: found the following.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Now I am interested in how RaidZ is actually implemented. It's hard to imagine this sudden spike in complexity without knowing the details.
Are there good resources that explain the algorithms (for dummies)?
For dummies, perhaps not. The code for RAIDZ vdevs has the best technical explanation I've seen; look for the first block comment in particular.
I can mostly follow along, but don't ask me to explain all of the details.
You mean the Berlekamp–Massey algorithm?

EDIT: found the following.
I have not checked out the paper yet, but here's a link: https://web.eecs.utk.edu/~jplank/plank/papers/CS-96-332.pdf
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
"This specification assumes no prior knowledge of algebra or coding theory. The
goal of this paper is for a systems programmer to be able to implement Reed-Solomon coding for reliability in RAID-like
systems without needing to consult any external references."

That actually turned out to be the "for dummies" version :D
Thanks!
 

SAK

Dabbler
Joined
Dec 9, 2022
Messages
20
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
Write later would endanger your data integrity.
So sadly no.
Also, the main overhead comes from finding the dedup data in the table on the HDDs, not from writing it to the new SSDs in the dedup vdev.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
Yes and no. Part of the issue is that during a write, only some of the stripe's data is in memory. To update the RAID-3 parity drive, that in-memory data plus whatever must be read from the other vDevs in that stripe is combined to produce the RAID-3 parity; only then can the parity for that stripe be written.

But if you wanted to delay the RAID-3 parity updates, you would need to keep a clean/dirty table of each RAID-3 parity stripe's status, or a transaction log of some sort, so you can figure out which RAID-3 parity is out of date and then read the entire set of vDevs for that RAID-3 stripe.
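To illustrate what that bookkeeping might look like, here is a minimal sketch of a deferred ("write-later") parity scheme, purely hypothetical, nothing like this exists in ZFS: data writes land immediately and mark their stripe dirty, and a low-priority task later re-reads the full stripe from all data vDevs and rewrites its parity.

```python
# Hypothetical sketch of deferred single-parity (RAID-3 style) bookkeeping. Not ZFS code.
from functools import reduce


class DeferredParity:
    def __init__(self, data_disks: list[bytearray], parity_disk: bytearray, stripe_size: int):
        self.data_disks = data_disks      # one bytearray standing in for each data vDev
        self.parity_disk = parity_disk    # the dedicated parity vDev
        self.stripe_size = stripe_size
        self.dirty = set()                # stripes whose parity is stale (the clean/dirty table)

    def write(self, disk: int, stripe: int, block: bytes):
        """Write data immediately; defer the parity update."""
        off = stripe * self.stripe_size
        self.data_disks[disk][off:off + len(block)] = block
        self.dirty.add(stripe)            # remember that this stripe's parity is now out of date

    def flush_one(self):
        """Background/low-priority step: re-read the whole stripe, recompute and rewrite parity."""
        if not self.dirty:
            return
        stripe = self.dirty.pop()
        off = stripe * self.stripe_size
        cols = [bytes(d[off:off + self.stripe_size]) for d in self.data_disks]
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), cols)
        self.parity_disk[off:off + self.stripe_size] = parity


# Example: two small "data disks", one parity disk, 4-byte stripes.
disks = [bytearray(8), bytearray(8)]
parity = bytearray(8)
dp = DeferredParity(disks, parity, stripe_size=4)
dp.write(0, 0, b"abcd")
dp.write(1, 0, b"wxyz")
dp.flush_one()  # parity for stripe 0 is now up to date

# The weakness called out in this thread: while a stripe sits in `dirty`, losing a data
# disk means that stripe cannot be reconstructed, and after a crash the dirty set itself
# is gone unless it was also logged somewhere durable.
```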


As @asap2go said, any vDev failure in the meantime would result in data loss if that stripe's RAID-3 parity was not up to date. And in the case of a crash, instead of ZFS being known to be consistent, a scrub and RAID-3 parity check would be needed.

A design goal of ZFS is to avoid ANY fsck at boot or mount, because as file systems get larger, file system checks take much longer. I have personally experienced fsck runs lasting multiple hours on Red Hat (with a Linux file system). Sun Microsystems wanted an always-consistent file system so that no long file system check would be needed at boot.


Of course, Sun Microsystems missed one boot/mount condition. If you had started a Dataset, zVol, or Snapshot destroy and it did not finish before a reboot (or crash), it would start over from the beginning after boot. Larger ZFS Dataset destroys took so many resources that some Solaris SysAdmins rebooted, only to make the problem worse (because the destroy started over).

When OpenZFS forked from Sun ZFS, one of the first new features OpenZFS added was Async Destroy. This slowed down ZFS dataset destroys, but also made them less impactful, and on reboot they resume where they left off. A neat feature that was later copied by Sun (or Oracle).
 