Replication Between Pools Causes Corruption

yottabit · Feb 3, 2022

winnielinnie said:
This is something that stands out. Very old zpool from the FreeNAS 9 era, replicating into TrueNAS SCALE with ZFS 2.1, and then back again to a different dataset on the old pool.

I don't believe ZSTD was a supported compression property back then? Did you upgrade the old pool's features prior to these replications?

Yes, I generally upgrade the pool features O(months) after deploying.

I just followed my steps in comment #37 with the single-file dataset and was not able to reproduce the problem.

So it may be something to do with the large amount of data in the intermediate dataset and how it's being read and pipelined to the destination dataset. It's unfortunately I wasn't able to reproduce this with a small test dataset.

The md5sum check is still running on the overnight replication that went old_pool -> old_pool, but so far there is no corruption.

I am starting to run out of free space to keep some of these experiments around. I think next I will do as follows:

On the intermediate dataset, delete everything except the known-to-corrupt file types `.cshrc` and `*.properties`
Snapshot
Replicate that snapshot to the original pool
Check for corruption
If corruption is repro, replicate the snapshot into a new dataset on the same pool
Replicate that snapshot to the original pool
Check for corruption
If corruption is repro, I will pipe the source snapshot into a file and share

AlexGG said:
The corrupted file is in fact LZ4-compressed data as is.

During replication, it does change the compression mode in the block pointer as requested but does not actually recompress the data. This way, when you copy to the LZ4-compressed dataset, there is no problem as no recompression is needed. I wonder if the problem persists if the target dataset uses LZJB compression.

I will start this now.

AlexGG · Feb 3, 2022

winnielinnie said:
or trying to upgrade a pool's features, which spans many ZFS iterations, can unearth these quirks

Probably this. Which means there won't be any repro, because we are looking at something rare, one of a kind. The solution is to copy data without send-receive, verify the copy, and then destroy the source. The copy, never subjected to any upgrades, should not be quirky. And make a note not to upgrade pools in the future.

yottabit · Feb 3, 2022

AlexGG said:
Probably this. Which means there won't be any repro, because we are looking at something rare, one of a kind. The solution is to copy data without send-receive, verify the copy, and then destroy the source. The copy, never subjected to any upgrades, should not be quirky. And make a note not to upgrade pools in the future.

What? I thought it was best practice to upgrade pools. Either way, this is a bug for sure.

I am attempting to pair down the intermediate dataset to only failed files that contain no PII, in an attempt to pipe into a file for further debug. When that's done, I'll probably pursue the copy into a new dataset to unblock this migration. But I do worry about older zvols, for example, for which I can't simply copy without a bunch of acrobatics in a VM.

Removing all of the files from the intermediate dataset that are *not* corrupted on replication:

Code:

# Found another UI bug that wouldn't set this dataset to readonly=off; I had to do it from the shell.
[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ sudo zfs set readonly=off big_scratch/migration/jacob.mcdonald
[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ grep -e ': OK$' /mnt/ssd_scratch/md5sum_check_encrypt_from_intermediate.txt | sed -e 's/: OK$//' | xargs -d '\n' rm -frv

Further reduction of dataset size by removing all the big media:

Code:

[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ find . \( -iname "*.mp4" -o -iname "*.mov" -o -iname "*.avi" -o -iname "*.jpg" -o -iname "*.png" \) -exec rm -frv '{}' \;

AlexGG · Feb 3, 2022

yottabit said:
I am attempting to pair down the intermediate dataset to only failed files that contain no PII, in an attempt to pipe into a file for further debug.

That will be very good if you can do it and it still reproduces the problem. Especially if you can trim it down to something you can make publicly available. However, that does not necessarily preserve the problem. See, the send-recv process is like this

1. There is something on the disk at the source.
2. ZFS takes it and converts it to a different format for send/recv.
3. On the other side, ZFS takes send/recv format and converts it back into the on-disk format, transforming data as needed.
4. Finally, there is something on the disk at the target.

If the problem is at step 3, that's fine. If the problem is at step 2, we may end up with the send-recv stream which is somehow malformed but we have no idea how we arrived at it, and the source is gone. So if possible, you want to keep your source as long as we are at it.

yottabit · Feb 3, 2022

Success, I was able to prune the intermediate dataset from all sensitive PII, replicate to a new dataset, replicate that to the wanted destination pool, and still corrupt data during that final replication.

I have piped both the source dataset (no corruption) and the destination dataset (corruption) into separate files and uploaded to Google Drive to share. Each file is 2.1 GiB.

The md5sums.txt file in the dataset root contains the checksums as generated on the intermediate dataset before replicating/corrupting.

Please see if you can repro from the source dataset.

Edit: fwiw, encryption was a red herring. The destination dataset included here was not encrypted and still exhibits the corruption.

yottabit · Feb 3, 2022

It also seems that replicating from and to the same pool, without the intermediate dataset on the extra pool, did not corrupt the data. This was lz4 source to lz4+enc destination.

I will now try to replicate, on the same pool, to zstd w/o enc, and then from there to zstd w/ enc.

It seems suspect that this corruption is only happening from new pool to old pool. There are no pool errors, and that drive is being hosted on one of the same HBAs used by the old pool. If the data was corrupted at rest by the disk, ZFS would detect this on read, but it doesn't. So it really seems to be that it's a bug in interaction between new pool to old pool.

winnielinnie · Feb 3, 2022

Receiving either of those into a pool results in the "corrupted" files.

Even if using LZ4 pool-wide.

Trying to decompress the .cshrc file through lz4 I get:

Code:

Error 44 : Unrecognized header : file cannot be decoded

The file tool yields, the following:

Code:

.cshrc: data

However, when I tested on a legitimate LZ4-compressed file, I get:

Code:

file.lz4: LZ4 compressed data (v1.4+)

The header for the real LZ4 differs from the "corrupt" file.

AlexGG · Feb 3, 2022

winnielinnie said:
Unrecognized header : file cannot be decoded

Inside ZFS, there is no LZ4 header. The ZFS variant stores a big-endian 32-bit integer number of compressed bytes N, and then the N compressed bytes. You need a decoder able to read a headerless stream.

yottabit said:
I have piped both the source dataset (no corruption) and the destination dataset (corruption) into separate files and uploaded to Google Drive to share. Each file is 2.1 GiB.

Great, thanks. I will be looking into them later today.

yottabit · Feb 3, 2022

winnielinnie said:
Receiving either of those into a pool results in the "corrupted" files.

Ah, so maybe reading the dataset is where the corruption is coming from. I will package up the dataset before that one, and see if it's the same.

AlexGG said:
Great, thanks. I will be looking into them later today.

Thanks!

yottabit · Feb 3, 2022

Since the zfs read may have sourced the corruption, I've shared one more dataset. Just to be clear, here's where these all came from:

vol1/jacob.mcdonald@migration is the original source (checksums created)
big_scratch/migration/jacob.mcdonald@migration is the first replicated snapshot from #1 above (checksums ok)
big_scratch/migration/jacob.mcdonald@small_test is the PII-pruned snapshot of #2 above (checksums ok), `zfs send -w` piped into pre-source.raw_zfs.xz (new file)
big_scratch/migration/jacob.mcdonald_small_test@small_test is the replicated snapshot from #3 above (checksums ok), `zfs send -w` piped into source.raw_zfs.xz
vol1/jacob.mcdonald_small_test@small_test is the replicated snapshot from #4 above (checksums fail), `zfs send -w` piped into destination.raw_zfs.xz

Each file is ~2.1 GiB.

It's possible any read from #2 is already going to result in corruption, but I can't provide the #2 or #1 datasets because they contain sensitive PII. I would be happy to provide any zdb output that might be helpful, though.

winnielinnie · Feb 4, 2022

yottabit said:
Ah, so maybe reading the dataset is where the corruption is coming from. I will package up the dataset before that one, and see if it's the same.

I tried 4 different combination to see if I could get .cshrc to match your supplied hash.

source_zfs.raw --> into LZ4 pool-wide
source_zfs.raw --> into ZSTD pool-wide
detination_zfs.raw --> into LZ4 pool-wide
destination_zfs.raw --> into ZSTD pool-wide

Same exact result each time. The .cshrc file was "corrupt" and yielded the same incorrect hash (compared to your supplied list.) Its contents also looked exactly the same as your previous copy + paste.

However, I'm not well versed in using raw files in place of ZFS-to-ZFS replications.

AlexGG said:
Inside ZFS, there is no LZ4 header. The ZFS variant stores a big-endian 32-bit integer number of compressed bytes N, and then the N compressed bytes. You need a decoder able to read a headerless stream.

Thank you, makes sense now. Hope you have better luck with the raw files. My one relief out of all of this is that it's hopefully VERY RARE, and unlikely to ever occur with modern versions of ZFS and recently-created pools.

yottabit · Feb 4, 2022

At this point I'm unsure how best to proceed. I'm concerned that sourcing anything from this old pool may trigger the problem, even though it seems to be readable fine on my intermediate pool.

The replication of the dataset from and to the original pool was a success. I will try replicating that to the new pool, then back to the original pool, to see if this triggers the problem.

I'm concerned that I cannot trust sending away any datasets from this system now. This presents several problems for me:

I do not have enough spare bays to create a new resilient pool on this server
I have capacity available on a remote Z2 pool, but that is not under my direct control
My plan was to encrypt the PII dataset, and then send *all* datasets to the remote server, refactor my pool, and receive them back
I have 30 containers running, and I really don't want to set them up again
I have offsite object-storage backups of the important data, but not all of the "nice to have" data

My biggest fear is that I send away all these datasets to a remote server where I cannot verify them (do not want to decrypt), pull them back, and find that 0.9% of the files are corrupted. This almost happened, because I verified the intermediate pool datasets (before encrypting) with md5 successfully. It's only by chance that I noticed the problem when sending into the original pool and verifying the md5 again.

I am confused as to why sending the dataset from and to the same pool does not corrupt, but sending through a new intermediate pool does corrupt, even while the data on that pool passes all md5 checks.

I am wondering if I should boot an older distro with OpenZFS 2.0, or earlier, and create a new pool on the intermediate drive with that version. This is not a guarantee that I will not experience the same problem though, as only the new pool version will be older, and I will not be able to access the data on my old pool from that distro as the pool is upgraded to OpenZFS 2.1.

Ideally I would have open bays and new drives on the same server where I can perform the migration and refactor locally, but that's prohibitively expensive (would need a new JBOD chassis, new HBA, and new disks), and is no guarantee I wouldn't run into the same problem in the future as this would push my datasets to a new pool that acts the same as the intermediate pool I have now, from which I cannot push to another pool without corruption.

yottabit · Feb 5, 2022

Last night I did the shuffling in another way:

Send the same dataset snapshot from the original pool to the intermediate pool into an encrypted dataset[^1], without supplying the passphrase, inherit zstd compression on the new dataset
Send the intermediate dataset back to the original pool, without supplying the passphrase, overriding to zstd compression at the destination
Unlock #2 above

Preliminary verification of md5 checksums at the destination, after unlocked, is successful.

This give me a path forward. It seems that the corruption/compression-miss has something to do with sending the data into an unencrypted dataset on the intermediate pool, and then trying to send elsewhere after that stage. Keeping the data encrypted at the intermediate pool seems to work around this.

[^1]: originally I had always sent the source dataset into an unencrypted dataset on the intermediate pool; this is the only difference.

AlexGG · Feb 9, 2022

So, I can't figure it out. It happens, but I don't know how. Somewhere along the line, it fails to recompress the data, while accepting/inheriting the new compression mode. This seems to happen only when encryption is involved and also the change of compression mode. Not much else I can say.

yottabit · Feb 9, 2022

I was able to repro without receiving into an encrypted root, though. I think encryption is a red herring.

Are you saying you do not have the problem when receiving into a non-encrypted root?

AlexGG · Feb 10, 2022

yottabit said:
I was able to repro without receiving into an encrypted root, though. I think encryption is a red herring.

Yes. I think I mixed up my samples and test runs. It is possible to reproduce without encryption.

yottabit · Feb 10, 2022

Great, thanks for confirming. I'll open a ticket with ix hopefully soon. I might have time this weekend. I'll wait for them to triage before pestering the OpenZFS team.

yottabit · Feb 26, 2022

Bug is created in Jira.

Important Announcement for the TrueNAS Community.

Replication Between Pools Causes Corruption

yottabit

Contributor

AlexGG

Contributor

yottabit

Contributor

AlexGG

Contributor

yottabit

Contributor

yottabit

Contributor

winnielinnie

MVP

AlexGG

Contributor

yottabit

Contributor

yottabit

Contributor

winnielinnie

MVP

yottabit

Contributor

yottabit

Contributor

AlexGG

Contributor

yottabit

Contributor

AlexGG

Contributor

yottabit

Contributor

yottabit

Contributor

Similar threads

Important Announcement for the TrueNAS Community.

Replication Between Pools Causes Corruption

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

MVP

Contributor

Contributor

Contributor

MVP

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

Contributor

Important Announcement for the TrueNAS Community.

Related topics on forums.truenas.com for thread: "Replication Between Pools Causes Corruption"

Similar threads