Replicated pool: zeroed files - corruption undetected

f00b4r00 · Aug 30, 2020

Hi,

I've just noticed something very strange and very scary: I have a primary FreeNAS setup that is replicated to an offsite machine (see signature for details).

Both source and target pools are healthy and have passed scrubbing with no error, however I just noticed that on the replicated pool, several files appear to be corrupted, as evidenced by incorrect md5 hashes (the hashes are verified against a "known correct" public third party: they are good on the source pool, and bad on the target).

And yet the source pool still happily replicates onto the target pool as if to nothing, and the corruption seems undetected.

In case it matters I haven't upgraded the pool on the target (after upgrading from 11.2 to 11.3).

How is that even possible? Isn't that exactly what ZFS should protect me against? I'm very confused.

Thanks

f00b4r00 · Aug 31, 2020

A bit more data: out of c.12k files on the replicated pool, it appears that at least a few hundred are not matching their source.

Now, onto the scary part: I took a look at a few of these: they're completely zero-ed out. The size match the source, but they only contain zeroes.

This is mind boggling: this replicated pool is read-only, for extra precaution I made sure the current state matches the last replicated snapshot (which it does), and yet the files are corrupt. I've just rerun a scrub and it passed with no error.

How could this happen? Is there a way to recover, short of destroying the replicated pool and starting all over again? (and how to ensure that this won't happen again?). Did I hit some obscure (and scary) bug?

Thanks

f00b4r00 · Aug 31, 2020

Well, I think I can answer that last question. It seems what I've hit is https://github.com/openzfs/zfs/issues/6224

Jurgen Segaert · Aug 31, 2020

This is scary indeed. I would recommend to log a issue on https://jira.ixsystems.com/ so that the developers can look at this & post the issue number here.

For future reference, please consider adding the exact versions of your FreeNAS systems to the first post as your signature does not show on all browsers and those values may change over time. Maybe also post the output of zpool get all <pool> and zfs get all <pool> as in the GitHub issue you were referencing.

f00b4r00 · Sep 1, 2020

@Jurgen Segaert noted for the versions. They are 11.2-U8 on source and 11.3-U4.1 on target.

That aside I sure hope that ixsystems devs do keep a close eye on openzfs' tracker and are fully aware of all its issues.

But the key takeaway for me here is simply: "Replication cannot be trusted".

It boggles the mind that the most basic expectation one might have for a robust replication system, which is "it must ensure that data committed to target storage matches data that was sent" is completely ignored by the zfs send/recv mechanism.

As evidenced by this bug, zfs recv can happily write complete garbage to disk without ever reporting any wrongdoing (and without triggering any alarm on the target filesystem), simply because it doesn't check written data correctness.

Which to me means that it'll just take another bug to silently corrupt my replicated pool. That is completely unacceptable in any conceivable scenario, and at that point I feel safer reverting to plain old and trusty rsync :-/

freqlabs · Sep 1, 2020

@f00b4r00 Definitely report the bug at https://jira.ixsystems.com/ and provide details about how you have performed the replication as well as a link to the OpenZFS issue. Thanks!

Johnnie Black · Sep 1, 2020

f00b4r00 said:
But the key takeaway for me here is simply: "Replication cannot be trusted".

It is indeed a very worrying bug, and one that's been present for years, I also use send/receive to backup my pool, just checked and I'm using the default record size so not affect by this, but still scary.

f00b4r00 · Sep 4, 2020

@Johnnie Black indeed. The seriousness of this bug and the fact that it's been there for years really puts the credibility of the replication mechanism (and possibly the openzfs code at large) into perspective: does upstream even care?

The "fix" that's just been committed fixes nothing. It's an ugly workaround the trigger, it does nothing to prevent other similar bugs from provoking the same kind of disaster. It does nothing to address what currently is a fatally broken design. As far as I'm concerned, nobody who value their data should use it. I'll run away from it moving forward.

@freqlabs I'm baffled. This is a major bug that's been reported in 2017 and it's not _already_ tracked in your own tracker??

freqlabs · Sep 7, 2020

@f00b4r00 the bug you linked was for ZFS on Linux, which is fairly different from the ZFS implementation in FreeNAS. It's not clear if it's even the same thing you encountered, based on the limited information given in this thread. Creating a ticket would help us get more information to debug the issue.

And to be clear, this isn't really a disaster bug if it is the one described in the issue on github. That bug described an incremental send with incorrect options would populate files with zeros on the receiver. Since it's an incremental send, that means you still have the previous snapshot to roll back to, and you have the data fully in tact on the sender still. Nothing actually gets overwritten on disk in ZFS, your data is still safe. It is inconvenient that it gave the appearance of working when it wasn't instead of refusing to run with the incorrect flags. Roll back to the previous snapshot and do the incremental send again with the correct flags.

f00b4r00 · Sep 8, 2020

freqlabs said:
@f00b4r00 the bug you linked was for ZFS on Linux, which is fairly different from the ZFS implementation in FreeNAS.

The bug I linked is in the common openzfs code, as evidenced by being in the "openzfs/zfs" repository. It affects all implementations - also as evidenced in the bug thread, which mentions e.g. illumos, openzsfsonosx - and the bug includes a cross-platform reproducer. Have you read the bug thread?

freqlabs said:
It's not clear if it's even the same thing you encountered, based on the limited information given in this thread. Creating a ticket would help us get more information to debug the issue.

It is the same bug. Unless you mean to say there is another bug that has the exact same symptoms as this one (completely silent zeroing of replicated files) at which point I'd say it'd be safe to call this replication system a practical joke.

I certainly have zero interest in trying to reproduce the bug (i.e. help debug it) so I don't see what my input would be beyond opening a ticket, which you can do yourself. I don't even remember when I flicked the LARGEBLOCK switch (possibly when I started pulling from the 11.3 target instead of pushing from the 11.2 source), which leads to other dramatic consequences, see below.

Anyway, you have all the available information in the github issue, along with a testcase and a "fix". What else could you possibly need?

freqlabs said:
And to be clear, this isn't really a disaster bug if it is the one described in the issue on github. That bug described an incremental send with incorrect options would populate files with zeros on the receiver. Since it's an incremental send, that means you still have the previous snapshot to roll back to, and you have the data fully in tact on the sender still. Nothing actually gets overwritten on disk in ZFS, your data is still safe. It is inconvenient that it gave the appearance of working when it wasn't instead of refusing to run with the incorrect flags. Roll back to the previous snapshot and do the incremental send again with the correct flags.

Wow. You completely failed to grasp the severity of this bug. Silent data corruption in the replication system, can it get any worse?

Because the corruption is totally silent and the scrub doesn't report anything wrong with the pool and the subsequent incrementals keep happening happily ever after, if you have snapshot expiration enabled, there comes a point where data CANNOT BE RECOVERED. Once the last good snapshot is gone, you're fscked. Guess what: I am, because I noticed too late.

Furthermore, and that's possibly the most important point: it is impossible to quickly identify the last good snapshot. Corruption can be spread between multiple snapshot as random parts of the underlying filesystem are updated. ahrens himself confirms there is NO RECOVERY PATH.

The fact that the data is intact on the sender is completely besides the point. One of the main reasons to use replication is to increase redundancy: if my sender dies, is destroyed in a fire or is stolen, if I thought my data was safe then I'm in for a world of pain! And herein lies the catch: how often do you check data stored on a backup system? Usually when you need it. That's exactly why this bug is a total disaster! Because the time you might eventually notice something went wrong is the time you least want to!

Besides it may not be practical to replicate the complete dataset from scratch (otherwise incremental backups themselves may not be even used in the first place). In my case I have to shlep back 6TB of data between two sites several hundreds km apart which are only capable of doing 20Mbps between themselves...

But maybe it'll take one of your enterprise-grade customers to lose hundreds of TB of data for sh*t to really hit the fan?

anodos · Sep 8, 2020

f00b4r00 said:
The bug I linked is in the common openzfs code, as evidenced by being in the "openzfs/zfs" repository. It affects all implementations - also as evidenced in the bug thread, which mentions e.g. illumos, openzsfsonosx - and the bug includes a cross-platform reproducer. Have you read the bug thread?

It is the same bug. Unless you mean to say there is another bug that has the exact same symptoms as this one (completely silent zeroing of replicated files) at which point I'd say it'd be safe to call this replication system a practical joke.

I certainly have zero interest in trying to reproduce the bug (i.e. help debug it) so I don't see what my input would be beyond opening a ticket, which you can do yourself. I don't even remember when I flicked the LARGEBLOCK switch (possibly when I started pulling from the 11.3 target instead of pushing from the 11.2 source), which leads to other dramatic consequences, see below.

Anyway, you have all the available information in the github issue, along with a testcase and a "fix". What else could you possibly need?

This is a very normal ask for our users. Especially if the issue looks like it might be serious. We have processes internally about how things get done, and the engineering team works on tickets that are filed in our bug ticketing system. They generally don't work based on forum posts. Among other things, we often time-constrained on what features / fixes we can implement in a given release. Belittling one of our ZFS developers, who asked you for specific information, is generally counter-productive.

Anything that you as an end-user can do to help with providing a useful test-case for your specific issue is much appreciated. So please create a bug ticket and attach a debug at https://jira.ixsystems.com

Provide as much detail as you can, and help us to verify that this was the issue you encountered (for instance whether this was through a replication job or a one-off manual send/receive may be useful, etc, etc).

f00b4r00 · Sep 8, 2020

anodos said:
Provide as much detail as you can, and help us to verify that this was the issue you encountered (for instance whether this was through a replication job or a one-off manual send/receive may be useful, etc, etc).

You probably missed the part where I said I don't know exactly at what point I triggered that bug. I speculate that this is when I started using 11.3 to replicate, since 11.3 has the knob and 11.2 doesn't (hint: this means this bug is becoming a lot more likely to be triggered even by users using only the web interface) but it could be something else (I did run send/recv over the command line on a few occasions).

Still, I merely faced its consequences: my replicated pool is completely hosed. Consequently I have zero useful input to provide. In particular I have no debug to provide. I certainly don't want to participate in any experiment that would try to reproduce this problem with my data.

Being a developer myself I know what's a useful bug report. Here, whatever info I can provide (which is very little) will pale in comparison with the info provided IN THE UPSTREAM BUG TRACKER (keyword: upstream), where a forensics explanation of what happens is provided, as well as a cross-platform test case and a fix.

If you still need me (or anyone else) to take any form of action before you can acknowledge the reality of this bug and make sure it's fixed ASAP, then I'm afraid something is very wrong on your end.

kthxbye

freqlabs · Sep 8, 2020

A ticket would be particularly useful if, as previously requested, you could provide "details about how you have performed the replication" in FreeNAS to help us understand why you may have come to encounter this issue in the first place. Even if you can't remember what you may have done in detail, it's likely preserved in the pool histories at the very least, so uploading debug archives from both systems with the ticket would be a simple and likely effective way to contribute towards a solution.

Just to clarify since it can be a little confusing, the openzfs/zfs github repo is not an upstream for any released versions of FreeNAS. Starting in TrueNAS 12 it will be, but it is only recently that FreeBSD support landed there and the organization was renamed from "ZFS on Linux" to "OpenZFS". FreeNAS is based on FreeBSD not Linux, and the ZFS upstream up until a few weeks ago was illumos. Code being platform independent in that repo doesn't necessarily mean it's the same code that we had in FreeBSD for any released FreeNAS version, and even being reproducible on illumos doesn't necessarily mean it affects FreeBSD. In this case the bug likely does affect FreeBSD since you have presumably encountered this yourself, but there have been a lot of assumptions made which really need to be verified to be certain.

Everyone can see you're passionate enough about the issue, so again I'll urge you to open a ticket in Jira, please. This is a more effective way to help than continuing to discuss it on the forum.

Important Announcement for The TrueNAS Community.

Replicated pool: zeroed files - corruption undetected

f00b4r00

Dabbler

f00b4r00

Dabbler

f00b4r00

Dabbler

Jurgen Segaert

Guru

f00b4r00

Dabbler

freqlabs

iXsystems

Johnnie Black

Guru

f00b4r00

Dabbler

freqlabs

iXsystems

f00b4r00

Dabbler

anodos

Sambassador

f00b4r00

Dabbler

freqlabs

iXsystems

Similar threads