bitrot data securtity with ZFS but without a RAIDZ

Maxxon · Sep 14, 2014

Hello,

my question is not necessary bound to FreeNAS/ZFS, more to filesystems in general. I ask it here because i hope that here a lot of ppl with filesystem expertise around.

short story:
Is there a way to store redundancy/parity information (like PAR2 generates) on the same disk as the data automatically and have it used automatically if necessary (like the "copies" parameter of ZFS, but more space efficient)?

long story:
I recently decided that i want to protect my data (99% media, 1% documents, etc.) against bitrot and that is how i discovered ZFS/FreeNAS. For a secure bitrot protection (detection and automatic correction) i would need at least a RAIDZ2. To my understanding only Z2 mode has that extra redundancy to restore a rotting bit that might be detetced during a raid rebuild.
Because power usage is a concern and i don't need the extra performance of a raid, i would (without bitrot protection) be fine with a simple mirroring raid. Just adding 3 disks to have bitrot automatically corrected seems overkill.
ATM i am using 2 big HDDs for 2 offline backups in a weekly rotation, To have bitrot protection i am securing the data with PAR2. Theoretically only 1 recovery block of a few MB is needed per TB size HDD (so very space efficient), but because generaring that PAR2 data is a time consuming task and 99% of the data does not chance from one week to another, i use more in practise.

So in the end i am looking for anything that could do the tedious PAR2 work (or anything similiar) on filesystem level automatically.

thanks for reading

cyberjock · Sep 14, 2014

Umm, your post confuses me. Let me clear up a few things and hopefully you'll be able to answer your question for yourself. ;)

1. ZFS uses checksums to validate that data *is* good.
2. In the event that data isn't good, then redundancy is used. This redundancy can be a mirror disk, a copy of the blocks in question (copies=2), or parity from something like RAIDZ1, RAIDZ2, or RAIDZ3.

Bitrot would naturally result in checksums that don't match the data and would result in redundancy being used.

You don't "need RAIDZ2", you just need enough redundancy to correct the error. Now you can go with mirrors (which in terms of disk space used is roughly equivalent to copies=2) but you can also choose to go with RAIDZ(x). Of course the easy answer is to go with RAIDZ1. But not so fast. Now you've added complexity because you have more than 1 disk and such so you're kind of forced to consider RAIDZ2 to ensure you can rebuild in the event of a loss of disk.

Now there's this law from long ago called the law of conservation of data. It states that for x-bytes of parity data you can only repair x-bytes of real data (compression notwithstanding). So if you wanted to have "1 recovery block" you'd need to ensure that all of the other blocks exist (kind of like how PAR2 archives work). Recovery of 1 block with 1TB of data protected is also very time consuming and CPU intensive.

In short, you're asking for something that really doesn't exist because most people are not looking to protect themselves from just bitrot. They're looking for bitrot as well as failing sectors, loss of disks, etc. To be honest, I'm not sure why you're concerned about bitrot but not disk failures. It seems like a disk failure would be particularly nasty as copies=2 doesn't really protect your data from that scenario.

Maxxon · Sep 14, 2014

cyberjock said:
Now there's this law from long ago called the law of conservation of data. It states that for x-bytes of parity data you can only repair x-bytes of real data (compression notwithstanding). So if you wanted to have "1 recovery block" you'd need to ensure that all of the other blocks exist (kind of like how PAR2 archives work). Recovery of 1 block with 1TB of data protected is also very time consuming and CPU intensive.

I was completely aware that it is not possible to get away with "just one recovery block" for a whole drive. On filesystem level, i had something like "one block per sector" in mind. Which could translate e.g. in a few bytes for each 4k sector, which should be still a lot less than just having copies=2.

In short, you're asking for something that really doesn't exist because most people are not looking to protect themselves from just bitrot. They're looking for bitrot as well as failing sectors, loss of disks, etc. To be honest, I'm not sure why you're concerned about bitrot but not disk failures. It seems like a disk failure would be particularly nasty as copies=2 doesn't really protect your data from that scenario.

You misunderstood. I need protection against disk failures as well. I just did the math about how many disks i would need for a bullet proof backup and tried to find a better solution.
Consider that i have 1TB drive for my online storage and i want to protect it against data loss. Against disk failure, i need to add a 2nd drive. In case of failure of the first drive i need to copy the whole data from the 2nd to the replacement drive. The operation will suffer from bitrot in a fair chance. To be protected against that i need to add a 3rd drive to the online storage. Now i am backing this online storage up weekly to a 4th drive. In case i really need the backup (e.g. fire, flood, etc.) this has the same fair chance to see bitrot. A 5th drive is needed for the weekly offline backup. Because a drive from the offline backup could also fail, it is recommended to have 2 offline backups. So add drive 6 and 7. Because i am lazy i use every week the other offline backup, so in worst case i have a 2 week old backup at hand.

Which leaves me with 6 additional drives just to have a decent backup which protects against drive failure AND bitrot. Currently i am using just 1 drive (and not 2) per offline backup. I can do this because i only copy the new media to the backup drives and add a PAR2 recovery block per media file. Documents, etc. are TARed before putting them to offline backup, to get an archive of reasonable size to protect with PAR2. If bitrot occurs, i have the recovery information on the same drive. It is unlikely that bitrot will occur twice on the same file(considering the chance of bitrot to occur and the archive size), so PAR2 is enough protection.
In the end it bugs me to have to add another drive to online, backup1 and backup2 just to have bitrot protection. Not only the cost, but the handling as well (plug 2 drives each week in, etc.). I am doing this with PAR2 without the needs of extra drives, but at the moment lots of manual work is involved. That is why i am asking if there is a filesystem option (not necessarily ZFS/Freenas) for me to do it automatically.

Thanks.

edit: because i mentioned raidz2: i was considering this because it allows with e.g. 6 drives a better ratio of usable/wasted space for parity info than the mirroring solution.

edit2: corrected math, had raidz2 in mind which did not match the mirror example

mjws00 · Sep 14, 2014

Take a look at Snapraid. It does parity and checksums at a file level. Very similar to what you are doing manually. Command line only... but for static datasets might work. I've looked into it for media datasets, to try and get my critical data to a size that is manageable.

I think your math is probably right. The smallest number of disks, I can see working that protects against device failure and bitrot is 6. 3 mirrors, 1 Live, 1 Local Sync, 1 Offsite. You are swapping local sync and offsite pairs. All that for 1 measely drives space. BLECH. But man is it safe ;) You are literally forced to maintain 3 full copies under your scenario. A triple mirror (swap out the 3rd disk) allows for 4 drives total. At the cost of the extra mirror offsite. Or just bring in the offsite mirror weekly to replicate.

Ratios don't get any better until you hit 5 disk z2 datasets. At that point you need to replicate to offsite, or physically bring the offsite data set in at an acceptable interval. Many smaller disks, can keep your cost per TB in the same ballpark. Many of us don't have z2 level protection offsite. Multiple copies, and the cloud are the compromise.

Consider your scenario with 10 disk pools in Z2. ;) The parity ratios are better, but shear volume of disks is still nasty. I have yet to see a great answer for large multidisk offsite backups. Truth is you either pay, sacrifice management time, recovery time, or size. No magic exists.

Good luck. If you find an awesome solution; make sure you share. We're all in the same boat. :)

cyberjock · Sep 14, 2014

Well, if you are trying to protect yourself from failing disks then you shouldn't have even mentioned copies=2. That fixes nothing for you.

Honestly, I'm not sure why you don't just use the offsite for all of your backups. ZFS send/receive is easy if it's offsite but online. Only the changed data blocks are sent. I mean, two *full* copies of your data with snapshots and replication to an offsite is what most people here do. Some add an offline monthly backup of sorts. But yeah, disk space is expensive. Backups are expensive (which is why so many people don't do it). And keeping all this stuff working is itself not necessarily trivial.

solarisguy · Sep 14, 2014

@Maxxon, you are discounting that PAR files could get corrupted. Otherwise, you are perfectly right: one needs RAID-Z2 or three way mirrors at two locations at the minimum.

I was recently made aware that some personal DNA results, of the type one orders on Internet, can be gigabytes in size. There could be a new crop of FreeNAS users :) who would keep their data just like that ↑

Maxxon · Sep 15, 2014

mjws00 said:
Take a look at Snapraid. It does parity and checksums at a file level. Very similar to what you are doing manually. Command line only... but for static datasets might work. I've looked into it for media datasets, to try and get my critical data to a size that is manageable.

As far as i understand it still needs a dedicated disk for the parity info. So its no advantage to the FreeNAS solution.

Good luck. If you find an awesome solution; make sure you share. We're all in the same boat. :)

The best idea i have come up with so far is to have a RAIDZ2 and have a daemon/script running every night that generates PAR2 files. FreeNAS ensures that the RAIDZ2 is not corrupted while calculating the PAR2 blocks. I could then backup the RAIDZ2 (including that PAR2 files) to single disks with ZFS filesystems and have my 2 offline backups.
In the case i need the backup i could just restore from them normally. If bitrot happens while restoring i would be able to see that on failed ZFS checksums. Later i could repair the damaged files manually with PAR2. I think that is doable because i should not see more than 1 or 2 damaged files by the current chance of bitrot.

cyberjock said:
Well, if you are trying to protect yourself from failing disks then you shouldn't have even mentioned copies=2. That fixes nothing for you.

I was considering copies=2 as protection against bitrot on the offline backup. Because it bugs me to have 1 extra drive in each offline backup only to be protected against a single rotting bit. But since copies=2 means the same amount of wasted space as with 2 drives it makes no difference at the end of the day.

Honestly, I'm not sure why you don't just use the offsite for all of your backups.

I have not decided what to do exactly, atm i have just a few backups on external drives. I want to be sure that i considered the failure cases before i spend $$$ on a ZFS server. So that i don't discover that my backups are useless because of an oversight.
I am not considering online offsite backups because in the region of germany i live in we have amazing 384kbit/sec upstream.

solarisguy said:
@Maxxon, you are discounting that PAR files could get corrupted. Otherwise, you are perfectly right: one needs RAID-Z2 or three way mirrors at two locations at the minimum.

I was thinking that e.g. in a PAR2 protected archive of 5GB it is very unlikely to see bitrot in BOTH the archive and the par2 file. In the case it happens on the 5GB archive i would have the par2 file for correction and in the case it happens in the par2 file the par2 checksum simply fails - no harm done here.

cyberjock · Sep 15, 2014

You could generate par2s from a script, but gosh that's alot of work on your drives every night. You will have to do this inside a jail since you can't install software on the base OS.

What you are saying is doable, it just seems like a *lot* of extra work. If you setup ZFS replication and you aren't adding GB of data every night the amount transferred should be very small. Two friends replicate to each other some of their "more important data" and they often send less than 1MB in 24 hours.

solarisguy · Sep 15, 2014

How would you distinguish between a real PAR2 checksum failure and one due to bitrot inside PAR2 file ?

solarisguy · Sep 15, 2014

Maxxon said:
[----] In the case i need the backup i could just restore from them normally. If bitrot happens while restoring i would be able to see that on failed ZFS checksums. Later i could repair the damaged files manually with PAR2. I think that is doable because i should not see more than 1 or 2 damaged files by the current chance of bitrot.

The issue you are overlooking is that you are not guaranteed to have bitrot inside a regular file. It could be flip in a directory or some critical filesystem bit. That is why copies on single disks are not as reliable, as the second system that is fully protected.

And of course a backup, in case you delete your files... And a clone of the backup... ;)

Just do not do like one (large) ISP did... They had backups of their critical data with a different large company (let's say X) for disaster recovery purposes. It just happened that company X decided to locate that particular backup set on the floor below the ISP datacenter. So when the ISP got flooded and declared a disaster...

Luckily for both the ISP and X, there were tapes in existence...

Maxxon · Sep 16, 2014

cyberjock said:
You could generate par2s from a script, but gosh that's alot of work on your drives every night. You will have to do this inside a jail since you can't install software on the base OS.

I don't see lots of work? 99% is media, so a PAR2 for a media file will be generated the 1st night after the new file has been copied onto the NAS, there is no need to (re-)generate all PAR2 files every night. Changes on existing files can be detected by simply comparing a files last changed timestamp with the corresponding par2 creation timestamp.
Only documents would be TARed each night and have a new PAR2 assigned. But that's only little data.

And, as a benefit, i would put the night time to good use ;-)

solarisguy said:
How would you distinguish between a real PAR2 checksum failure and one due to bitrot inside PAR2 file ?

I would only try a recovery with PAR2 if ZFS tells me that there is a checksum failure. I would then try to recover the damaged files with PAR2. If PAR2 does not check the integry of a PAR2 file on it's on and prevents from it's usage, it would be easly possible to store the PAR2's MD5sum as part of the filename.

solarisguy said:
The issue you are overlooking is that you are not guaranteed to have bitrot inside a regular file. It could be flip in a directory or some critical filesystem bit. That is why copies on single disks are not as reliable, as the second system that is fully protected.

As far as i understand ZFS does store copies=2 for filesystem critical data structures anyway. So if bitrot occurs in them, they would be healed automatically.

solarisguy · Sep 16, 2014

An excerpt from man zpool

Device Failure and Recovery

ZFS supports a rich set of mechanisms for handling device failure and data corruption. All metadata and data is checksummed, and ZFS automatically repairs bad data from a good copy when corruption is detected.

In order to take advantage of these features, a pool must make use of some form of redundancy, using either mirrored or raidz groups. While ZFS supports running in a non-redundant configuration, where each root vdev is simply a disk or file, this is strongly discouraged. A single case of bit corruption can render some or all of your data unavailable.

Maxxon · Sep 17, 2014

@solarisguy
I had read the following 2 articles, which clearly state that ZFS stores metadata twice by default. Max 1 filesystem block can be lost due to corrupted metadata. You would have to tune it if you wanted anything less?
https://www.illumos.org/issues/3835
https://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection

Did i understand anything wrong? Or is the ZFS implementation in FreeNAS different?

solarisguy · Sep 17, 2014

Both links you have provided relate to zfs filesystem layer. The information, I have provided, is about zpool, that sits one layer above ( or below..., depending how one looks at it ;) ), needing protection...

solarisguy · Sep 17, 2014

You can also see man zpool at http://illumos.org/man/1m/zpool

The wording is identical to the one given in FreeBSD and quoted by me above.

Maxxon · Sep 19, 2014

Hm, i thought the zpool does not store any data on the drives itself. I had understood it as a configuration (which would probably go onto the flash drive of FreeNAS). I'll investigate this case.

jgreco · Sep 20, 2014

Maxxon said:
Hm, i thought the zpool does not store any data on the drives itself. I had understood it as a configuration (which would probably go onto the flash drive of FreeNAS). I'll investigate this case.

A ZFS pool is basically standalone and does not require external configuration (think: fstab) to import.

Maxxon · Sep 20, 2014

solarisguy, tbh you have confused me totally now:

I have gone through some basic tests with FreeNAS on a VM to learn about the difference between the ZFS filesystem itself and a zpool. But because of my prior readings i am not able to get the things together so that they make any sense. Maybe anyone can shed some light?

I see now a really big misconception in the whole system around ZFS and i cant believe that this is for real: Why store important ZFS metadata twice if damage on the underlying zpool can still wipe the data? If vdevs with redundancy are used to protect against the zpool damage, this should protect the zfs metadata alongside. So no need to store the metadata twice?

Additionally i don't see a clear distinction between zpool and zfs metadata. So maybe the documentation is simply not clear enough in the manpage solarisguy quoted. e.g. in oracle tutorials about recovering damaged zpools no difference is made between recovering whole pools and corrupted files/directories on them.

jgreco · Sep 20, 2014

Maxxon said:
ZFS filesystem itself and a zpool.

Think: Kinda the difference between a car and a motor vehicle.

A ZFS dataset is a "filesystem" built on top of a pool. You can have more than one, or zvol style block storage devices instead.

Why store important ZFS metadata twice if damage on the underlying zpool can still wipe the data? If vdevs with redundancy are used to protect against the zpool damage, this should protect the zfs metadata alongside. So no need to store the metadata twice?

Metadata is a trivial amount of storage. Failure of the underlying storage, especially on a single disk system without storage redundancy, could catastrophically corrupt metadata - a situation ZFS can trivially identify in many cases, but without somewhere to pull correct data from, a correctable catastrophe becomes uncorrectable.

The simple fix is to store it twice. It is virtually free and if/when it saves your pool, you'll be very happy.

solarisguy · Sep 20, 2014

@ Maxxon, information is at your fingertips, read man zpool at http://illumos.org/man/1m/zpool:

The zpool command configures ZFS storage pools. A storage pool is a collection of devices that provides physical storage and data replication for ZFS datasets.

Please read some technical introduction to ZFS concept. May be Oracle still has some white papers available...

Important Announcement for the TrueNAS Community.

bitrot data securtity with ZFS but without a RAIDZ

Cadet

Inactive Account

Cadet

Guru

Inactive Account

Guru

Cadet

Inactive Account

Guru

Guru

Cadet

Guru

Cadet

Guru

Guru

Cadet

Resident Grinch

Cadet

Resident Grinch

Guru

Similar threads