ZFS Parity-only write-only emergency disk

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Just give me RAIDZ4.
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
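To make the parity math a bit more concrete, here is a toy sketch of dual parity over GF(2^8), in the spirit of the RAID-6/RAIDZ2 style of construction: P is a plain XOR of the data columns, and Q is a Reed-Solomon parity accumulated with the generator 2. This is not the actual vdev_raidz.c code; the field polynomial, layout, and names are my own assumptions for illustration.

```python
# Toy dual-parity (P/Q) computation over GF(2^8). Illustrative only, not ZFS code.

GF_POLY = 0x11D  # x^8 + x^4 + x^3 + x^2 + 1, a reduction polynomial commonly used for RAID-6-style parity


def gf_mul2(x: int) -> int:
    """Multiply a GF(2^8) element by the generator 2."""
    x <<= 1
    if x & 0x100:
        x ^= GF_POLY
    return x & 0xFF


def pq_parity(data_columns: list[bytes]) -> tuple[bytes, bytes]:
    """Compute P (plain XOR) and Q (Reed-Solomon) parity over equal-length data columns."""
    length = len(data_columns[0])
    p = bytearray(length)
    q = bytearray(length)
    # Horner's rule: walk the columns from last to first, multiplying Q by 2 each step,
    # so Q ends up as sum over i of 2^i * data[i] in GF(2^8).
    for col in reversed(data_columns):
        for i, byte in enumerate(col):
            p[i] ^= byte
            q[i] = gf_mul2(q[i]) ^ byte
    return bytes(p), bytes(q)


# With P and Q, any two lost data columns can be recovered by solving a small
# linear system in GF(2^8).
data = [b"hello world!", b"zfs parity..", b"raidz2 toy.."]
P, Q = pq_parity(data)
print(P.hex(), Q.hex())
```

A third parity (RAIDZ3) adds one more independent equation built with a different generator; per the errata mentioned above, the trouble only starts once a construction like this is pushed to four or more parity elements.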
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
Now I am interested in how RaidZ is actually implemented. It's hard to imagine this sudden spike in complexity without knowing the details.
Are there good resources that explain the algorithms (for dummies)?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Fun fact: the specific Reed-Solomon scheme used for RAIDZ2 and Z3 comes from a paper, for which an errata was later issued to the effect of there being a critical flaw that allowed for data loss. Luckily, this flaw only applies from four parity elements and up, so RAIDZ2 and RAIDZ3 are fine. There is a workaround for higher levels, but apparently it's computationally expensive enough to not be particularly viable.
You mean the Berlekamp–Massey algorithm?

EDIT: found the following.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Now I am interested in how RaidZ is actually implemented. It's hard to imagine this sudden spike in complexity without knowing the details.
Are there good resources that explain the algorithms (for dummies)?
For dummies, perhaps not. The code for RAIDZ vdevs has the best technical explanation I've seen; look for the first block comment in particular.
I can mostly follow along, but don't ask me to explain all of the details.
You mean the Berlekamp–Massey algorithm?

EDIT: found the following.
I have not checked out the paper yet, but here's a link: https://web.eecs.utk.edu/~jplank/plank/papers/CS-96-332.pdf
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
"This specification assumes no prior knowledge of algebra or coding theory. The
goal of this paper is for a systems programmer to be able to implement Reed-Solomon coding for reliability in RAID-like
systems without needing to consult any external references."

That actually turned out to be the "for dummies" version :D
Thanks!
 

SAK

Dabbler
Joined
Dec 9, 2022
Messages
20
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
 

asap2go

Patron
Joined
Jun 11, 2023
Messages
228
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
Write later would endanger your data integrity.
So sadly no.
Also, the main overhead comes from finding the dedup data in the table on the HDDs, not from writing it to the new SSDs in the dedup vdev.
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
Interesting discussion.

In theory, for this example of a non-existent ZFS layout, since it was mentioned that write speeds would suffer... I wonder if some kind of write-later scheme could be hatched: a lower-priority continuous background process that catches up during slower disk-access periods.

I realize all of this talk is stuff that will most probably never happen, but it's fun to wonder possibilities.
Yes and no. Part of the issue is that during a write, only some of the stripe's data is in memory. To update the RAID-3 parity drive, that in-memory data plus whatever must be read from the other vDevs in that stripe is combined to produce the RAID-3 parity; only then can the parity for that stripe be written.

But if you wanted to delay the RAID-3 parity updates, you would need to keep a clean/dirty table of each RAID-3 parity stripe's status, or a transaction log of some sort, so you can figure out which RAID-3 parity is out of date and then read the entire set of vDevs for that RAID-3 stripe.
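To illustrate what that bookkeeping might look like, here is a minimal sketch of a deferred ("write-later") parity scheme, purely hypothetical, nothing like this exists in ZFS: data writes land immediately and mark their stripe dirty, and a low-priority task later re-reads the full stripe from all data vDevs and rewrites its parity.

```python
# Hypothetical sketch of deferred single-parity (RAID-3 style) bookkeeping. Not ZFS code.
from functools import reduce


class DeferredParity:
    def __init__(self, data_disks: list[bytearray], parity_disk: bytearray, stripe_size: int):
        self.data_disks = data_disks      # one bytearray standing in for each data vDev
        self.parity_disk = parity_disk    # the dedicated parity vDev
        self.stripe_size = stripe_size
        self.dirty = set()                # stripes whose parity is stale (the clean/dirty table)

    def write(self, disk: int, stripe: int, block: bytes):
        """Write data immediately; defer the parity update."""
        off = stripe * self.stripe_size
        self.data_disks[disk][off:off + len(block)] = block
        self.dirty.add(stripe)            # remember that this stripe's parity is now out of date

    def flush_one(self):
        """Background/low-priority step: re-read the whole stripe, recompute and rewrite parity."""
        if not self.dirty:
            return
        stripe = self.dirty.pop()
        off = stripe * self.stripe_size
        cols = [bytes(d[off:off + self.stripe_size]) for d in self.data_disks]
        parity = reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), cols)
        self.parity_disk[off:off + self.stripe_size] = parity


# Example: two small "data disks", one parity disk, 4-byte stripes.
disks = [bytearray(8), bytearray(8)]
parity = bytearray(8)
dp = DeferredParity(disks, parity, stripe_size=4)
dp.write(0, 0, b"abcd")
dp.write(1, 0, b"wxyz")
dp.flush_one()  # parity for stripe 0 is now up to date

# The weakness called out in this thread: while a stripe sits in `dirty`, losing a data
# disk means that stripe cannot be reconstructed, and after a crash the dirty set itself
# is gone unless it was also logged somewhere durable.
```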


As @asap2go said, any vDev failure in the meantime would result in data loss if that stripe's RAID-3 parity was not up to date. And in the case of a crash, instead of ZFS being known to be consistent, a scrub and RAID-3 parity check would be needed.

A design goal of ZFS is to avoid ANY fsck at boot or mount, because as file systems get larger, file system checks take much longer. I have personally experienced fsck runs lasting multiple hours on Red Hat (with a Linux file system). Sun Microsystems wanted an always-consistent file system so that no long file system check would be needed at boot.


Of course, Sun Microsystems missed one boot/mount condition. If you had started a Dataset, zVol, or Snapshot destroy and it did not finish before a reboot (or crash), it would start over from the beginning after boot. Larger ZFS Dataset destroys took so many resources that some Solaris SysAdmins rebooted, only to make the problem worse (because the destroy started over).

When OpenZFS forked from Sun ZFS, one of the first new features OpenZFS added was Async Destroy. This slowed down ZFS dataset destroys, but also made them less impactful, and on reboot they resume where they left off. A neat feature that was later copied by Sun (or Oracle).
 