Replication Between Pools Causes Corruption

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
This is something that stands out. Very old zpool from the FreeNAS 9 era, replicating into TrueNAS SCALE with ZFS 2.1, and then back again to a different dataset on the old pool.


I don't believe ZSTD was a supported compression property back then? Did you upgrade the old pool's features prior to these replications?
Yes, I generally upgrade the pool features O(months) after deploying.
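For reference, this is roughly how I check for and enable pending feature flags (pool name is just an example):
Code:
# Roughly how I check for and enable pending feature flags (pool name is an example)
zpool upgrade                        # lists pools that do not have all supported features enabled
zpool get all vol1 | grep feature@   # shows the state of each feature flag on the pool
sudo zpool upgrade vol1              # enables all supported features on the pool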

I just followed my steps in comment #37 with the single-file dataset and was not able to reproduce the problem.

So it may have something to do with the large amount of data in the intermediate dataset and how it's being read and pipelined to the destination dataset. It's unfortunate that I wasn't able to reproduce this with a small test dataset.

The md5sum check is still running on the overnight replication that went old_pool -> old_pool, but so far there is no corruption.

I am starting to run out of free space to keep some of these experiments around. I think next I will do as follows:
  1. On the intermediate dataset, delete everything except the known-to-corrupt file types `.cshrc` and `*.properties`
  2. Snapshot
  3. Replicate that snapshot to the original pool
  4. Check for corruption
  5. If the corruption reproduces, replicate the snapshot into a new dataset on the same pool
  6. Replicate that snapshot to the original pool
  7. Check for corruption
  8. If the corruption reproduces, I will pipe the source snapshot into a file and share it (see the sketch below)
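For step 8, I have roughly the following in mind (dataset and file names are placeholders):
Code:
# Rough sketch for step 8 (dataset and file names are placeholders)
sudo zfs send -w big_scratch/migration/test_subset@snap | xz -T0 > /mnt/ssd_scratch/stream.raw_zfs.xz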

The corrupted file is in fact LZ4-compressed data, left as-is.

During replication, it changes the compression mode in the block pointer as requested but does not actually recompress the data. That is why copying to an LZ4-compressed dataset shows no problem: no recompression is needed. I wonder whether the problem persists if the target dataset uses LZJB compression.

I will start this now.
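Separately, for anyone who wants to check the compression recorded in the block pointers on their own copy, zdb can show it; roughly like this, with the file path, dataset name, and object number as examples only:
Code:
# Example only: look up the file's object number (its inode number), then dump its block pointers.
# At this verbosity the L0 lines show the compression algorithm and the lsize/psize of each block.
ls -i /mnt/vol1/jacob.mcdonald/.cshrc
sudo zdb -ddddd vol1/jacob.mcdonald <object-number-from-ls>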
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
or trying to upgrade a pool's features, which spans many ZFS iterations, can unearth these quirks

Probably this. Which means there won't be any repro, because we are looking at something rare, one of a kind. The solution is to copy data without send-receive, verify the copy, and then destroy the source. The copy, never subjected to any upgrades, should not be quirky. And make a note not to upgrade pools in the future.
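Roughly like this, with paths as examples only: copy with ordinary file tools, checksum both sides, and only destroy the source after the comparison comes back clean.
Code:
# Example only: copy without send/recv, then verify before destroying the source
rsync -aHAX /mnt/vol1/jacob.mcdonald/ /mnt/big_scratch/jacob.mcdonald_copy/
(cd /mnt/vol1/jacob.mcdonald && find . -type f -print0 | xargs -0 md5sum) > /tmp/src.md5
(cd /mnt/big_scratch/jacob.mcdonald_copy && md5sum -c --quiet /tmp/src.md5)   # prints only mismatches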
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Probably this. Which means there won't be any repro, because we are looking at something rare, one of a kind. The solution is to copy data without send-receive, verify the copy, and then destroy the source. The copy, never subjected to any upgrades, should not be quirky. And make a note not to upgrade pools in the future.
What? I thought it was best practice to upgrade pools. Either way, this is a bug for sure.

I am attempting to pare down the intermediate dataset to only the failed files that contain no PII, so I can pipe it into a file for further debugging. When that's done, I'll probably pursue the copy into a new dataset to unblock this migration. But I do worry about older zvols, for example, which I can't simply copy without a bunch of acrobatics in a VM.

Removing all of the files from the intermediate dataset that are *not* corrupted on replication:
Code:
# Found another UI bug that wouldn't set this dataset to readonly=off; I had to do it from the shell.
[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ sudo zfs set readonly=off big_scratch/migration/jacob.mcdonald
[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ grep -e ': OK$' /mnt/ssd_scratch/md5sum_check_encrypt_from_intermediate.txt | sed -e 's/: OK$//' | xargs -d '\n' rm -frv


Further reduction of dataset size by removing all the big media:
Code:
[yottabit@nas1 /mnt/big_scratch/migration/jacob.mcdonald]$ find . \( -iname "*.mp4" -o -iname "*.mov" -o -iname "*.avi" -o -iname "*.jpg" -o -iname "*.png" \) -exec rm -frv '{}' \;
 
Last edited:

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
I am attempting to pare down the intermediate dataset to only the failed files that contain no PII, so I can pipe it into a file for further debugging.

That will be very good if you can do it and it still reproduces the problem, especially if you can trim it down to something you can make publicly available. However, that does not necessarily preserve the problem. See, the send-recv process goes like this:

1. There is something on the disk at the source.
2. ZFS takes it and converts it to a different format for send/recv.
3. On the other side, ZFS takes send/recv format and converts it back into the on-disk format, transforming data as needed.
4. Finally, there is something on the disk at the target.

If the problem is at step 3, that's fine. If the problem is at step 2, we may end up with a send-recv stream that is somehow malformed but no idea how we arrived at it, and the source is gone. So if possible, keep your source around while we are at it.
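If you want a rough look at whether the stream itself (step 2) is already off, the record summary of a send stream can be dumped without receiving it; a rough example, with names as placeholders:
Code:
# Example only (names are placeholders): summarize the records of a send stream without receiving it
sudo zfs send -w pool/dataset@snap | zstream dump | less
# the same works on a stream that was captured to a file
zstream dump /path/to/captured_stream | less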
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Success: I was able to prune all sensitive PII from the intermediate dataset, replicate it to a new dataset, replicate that to the intended destination pool, and still get corrupted data during that final replication.

I have piped both the source dataset (no corruption) and the destination dataset (corruption) into separate files and uploaded to Google Drive to share. Each file is 2.1 GiB.

The md5sums.txt file in the dataset root contains the checksums as generated on the intermediate dataset before replicating/corrupting.

Please see if you can repro from the source dataset.

Edit: fwiw, encryption was a red herring. The destination dataset included here was not encrypted and still exhibits the corruption.
 
Last edited:

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
It also seems that replicating from and to the same pool, without the intermediate dataset on the extra pool, did not corrupt the data. This was lz4 source to lz4+enc destination.

I will now try to replicate, on the same pool, to zstd w/o enc, and then from there to zstd w/ enc.
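Roughly like this (dataset names are examples, and vol1/encrypted_parent is assumed to already exist as an encrypted dataset so the second hop inherits encryption):
Code:
# Example only: same-pool replication to zstd without encryption, then a second hop under an encrypted parent
sudo zfs send vol1/jacob.mcdonald@migration | sudo zfs recv -u -o compression=zstd vol1/zstd_test
sudo zfs snapshot vol1/zstd_test@hop
sudo zfs send vol1/zstd_test@hop | sudo zfs recv -u -o compression=zstd vol1/encrypted_parent/zstd_enc_test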

It seems suspect that this corruption only happens going from the new pool to the old pool. There are no pool errors, and that drive hangs off one of the same HBAs used by the old pool. If the data were corrupted at rest on the disk, ZFS would detect it on read, but it doesn't. So it really does look like a bug in the interaction between the new pool and the old pool.
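For reference, this is what I'm going by when I say there are no pool errors:
Code:
# For reference: checking for pool-level errors on both pools
zpool status -v vol1 big_scratch
# a scrub would force a full re-read and checksum verification of all data
sudo zpool scrub big_scratch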
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Receiving either of those into a pool results in the "corrupted" files.

Even if using LZ4 pool-wide.

Trying to decompress the .cshrc file with lz4, I get:
Code:
Error 44 : Unrecognized header : file cannot be decoded


The file tool yields the following:
Code:
.cshrc: data


However, when I tested a legitimate LZ4-compressed file, I got:
Code:
file.lz4: LZ4 compressed data (v1.4+)


The header for the real LZ4 differs from the "corrupt" file.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
Unrecognized header : file cannot be decoded

Inside ZFS, there is no LZ4 header. The ZFS variant stores a big-endian 32-bit integer number of compressed bytes N, and then the N compressed bytes. You need a decoder able to read a headerless stream.
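A quick way to see the difference at the start of the two files:
Code:
# A real LZ4 frame starts with the magic bytes 04 22 4d 18; the ZFS buffer starts with a big-endian length
xxd -l 8 .cshrc      # "corrupt" file: 4-byte big-endian compressed length, then compressed bytes
xxd -l 8 file.lz4    # legitimate lz4 output: frame magic 04 22 4d 18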

I have piped both the source dataset (no corruption) and the destination dataset (corruption) into separate files and uploaded to Google Drive to share. Each file is 2.1 GiB.

Great, thanks. I will be looking into them later today.
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Since the ZFS read may be the source of the corruption, I've shared one more dataset. Just to be clear, here's where these all came from:
  1. vol1/jacob.mcdonald@migration is the original source (checksums created)
  2. big_scratch/migration/jacob.mcdonald@migration is the first replicated snapshot from #1 above (checksums ok)
  3. big_scratch/migration/jacob.mcdonald@small_test is the PII-pruned snapshot of #2 above (checksums ok), `zfs send -w` piped into pre-source.raw_zfs.xz (new file)
  4. big_scratch/migration/jacob.mcdonald_small_test@small_test is the replicated snapshot from #3 above (checksums ok), `zfs send -w` piped into source.raw_zfs.xz
  5. vol1/jacob.mcdonald_small_test@small_test is the replicated snapshot from #4 above (checksums fail), `zfs send -w` piped into destination.raw_zfs.xz
Each file is ~2.1 GiB.
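If it helps, I'd expect them to be consumed roughly like this (the receiving pool name is a placeholder):
Code:
# Rough sketch (receiving pool name is a placeholder)
xz -dc source.raw_zfs.xz | sudo zfs recv -u testpool/repro_source
xz -dc destination.raw_zfs.xz | sudo zfs recv -u testpool/repro_destination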

It's possible that any read from #2 already results in corruption, but I can't provide the #1 or #2 datasets because they contain sensitive PII. I would be happy to provide any zdb output that might be helpful, though.
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
Ah, so maybe reading the dataset is where the corruption is coming from. I will package up the dataset before that one, and see if it's the same.
I tried 4 different combinations to see if I could get .cshrc to match your supplied hash.

source_zfs.raw --> into LZ4 pool-wide
source_zfs.raw --> into ZSTD pool-wide
destination_zfs.raw --> into LZ4 pool-wide
destination_zfs.raw --> into ZSTD pool-wide

Same exact result each time. The .cshrc file was "corrupt" and yielded the same incorrect hash (compared to your supplied list). Its contents also looked exactly the same as your previous copy + paste.

However, I'm not well versed in using raw files in place of ZFS-to-ZFS replications. :confused:

Inside ZFS, there is no LZ4 header. The ZFS variant stores a big-endian 32-bit integer number of compressed bytes N, and then the N compressed bytes. You need a decoder able to read a headerless stream.
Thank you, makes sense now. Hope you have better luck with the raw files. My one relief out of all of this is that it's hopefully VERY RARE, and unlikely to ever occur with modern versions of ZFS and recently-created pools.
 
Last edited:

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
At this point I'm unsure how best to proceed. I'm concerned that sourcing anything from this old pool may trigger the problem, even though the data seems to read back fine on my intermediate pool.

The replication of the dataset from and to the original pool was a success. I will try replicating that to the new pool, then back to the original pool, to see if this triggers the problem.

I'm concerned that I cannot trust sending away any datasets from this system now. This presents several problems for me:
  • I do not have enough spare bays to create a new resilient pool on this server
  • I have capacity available on a remote Z2 pool, but that is not under my direct control
  • My plan was to encrypt the PII dataset, and then send *all* datasets to the remote server, refactor my pool, and receive them back
  • I have 30 containers running, and I really don't want to set them up again
  • I have offsite object-storage backups of the important data, but not all of the "nice to have" data
My biggest fear is that I send away all these datasets to a remote server where I cannot verify them (do not want to decrypt), pull them back, and find that 0.9% of the files are corrupted. This almost happened, because I verified the intermediate pool datasets (before encrypting) with md5 successfully. It's only by chance that I noticed the problem when sending into the original pool and verifying the md5 again.

I am confused as to why sending the dataset from and to the same pool does not corrupt, but sending through a new intermediate pool does corrupt, even while the data on that pool passes all md5 checks.

I am wondering if I should boot an older distro with OpenZFS 2.0 or earlier and create a new pool on the intermediate drive with that version. That's no guarantee I won't hit the same problem, though, since only the new pool would be at the older version, and I wouldn't be able to access the data on my old pool from that distro because that pool has already been upgraded to OpenZFS 2.1 features.

Ideally I would have open bays and new drives in the same server so I could perform the migration and refactor locally, but that's prohibitively expensive (it would need a new JBOD chassis, a new HBA, and new disks). And it's no guarantee I wouldn't run into the same problem in the future, since it would push my datasets onto a new pool that behaves the same as the intermediate pool I have now, from which I cannot push to another pool without corruption.

o_O
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Last night I did the shuffling in another way:
  1. Send the same dataset snapshot from the original pool to the intermediate pool into an encrypted dataset[^1], without supplying the passphrase, inheriting zstd compression on the new dataset
  2. Send the intermediate dataset back to the original pool, without supplying the passphrase, overriding to zstd compression at the destination
  3. Unlock #2 above
Preliminary verification of md5 checksums at the destination, after unlocking, is successful.

This gives me a path forward. It seems that the corruption/compression-miss has something to do with sending the data into an unencrypted dataset on the intermediate pool and then trying to send it elsewhere after that stage. Keeping the data encrypted at the intermediate pool seems to work around this.
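In terms of commands, the shuffle was roughly as follows (dataset names are placeholders; this assumes the source dataset is already encrypted, so a raw -w send keeps it locked end to end):
Code:
# Rough sketch (names are placeholders); raw (-w) sends move the blocks as stored,
# so nothing is decrypted, decompressed, or recompressed in flight
sudo zfs send -w vol1/encrypted_src@migration | sudo zfs recv -u big_scratch/migration/enc_copy
sudo zfs send -w big_scratch/migration/enc_copy@migration | sudo zfs recv -u -o compression=zstd vol1/enc_back
sudo zfs load-key vol1/enc_back && sudo zfs mount vol1/enc_back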

[^1]: originally I had always sent the source dataset into an unencrypted dataset on the intermediate pool; this is the only difference.
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
So, I can't figure it out. It happens, but I don't know how. Somewhere along the line, it fails to recompress the data while accepting/inheriting the new compression mode. This seems to happen only when encryption is involved along with a change of compression mode. Not much else I can say.
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
I was able to repro without receiving into an encrypted root, though. I think encryption is a red herring.

Are you saying you do not have the problem when receiving into a non-encrypted root?
 

AlexGG

Contributor
Joined
Dec 13, 2018
Messages
171
I was able to repro without receiving into an encrypted root, though. I think encryption is a red herring.

Yes. I think I mixed up my samples and test runs. It is possible to reproduce without encryption.
 

yottabit

Contributor
Joined
Apr 15, 2012
Messages
192
Great, thanks for confirming. I'll open a ticket with iX hopefully soon; I might have time this weekend. I'll wait for them to triage before pestering the OpenZFS team.
 