SOLVED Right approach for zfs send with snapshot corruption

Status
Not open for further replies.

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
Through my own negligence and, presumably, a bit of bad luck, I managed to create data corruption in a snapshot during a zfs send | zfs recv operation. The good news is that `zpool scrub` only finds a problem with a single file, and I really don't care about that file. I also don't need to keep my historical snapshots. So I deleted the snapshot and re-ran the scrub, but all that ended up doing was telling me the next snapshot is corrupt.

This is what I'm currently looking at:

Code:
  pool: data0-backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1h45m with 1 errors on Fri Aug  4 05:12:46 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        data0-backup                                    ONLINE       0     0     1
          mirror-0                                      ONLINE       0     0     2
            gptid/89af00a4-78bb-11e7-9400-5404a617fd4f  ONLINE       0     0     2
            gptid/8a5b5dd9-78bb-11e7-9400-5404a617fd4f  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

		data0-backup@auto-20170724.0300-2w:/path/to/file


I got here because the error was previously detected in data0-backup@auto-20170723.2100-2w:/path/to/file. I destroyed that snapshot with `zfs destroy data0-backup@auto-20170723.2100-2w` and then ran `zpool scrub data0-backup` again. These snapshots are taken every 6 hours. I'm assuming (perhaps incorrectly) that if I keep up this process, I will just end up destroying all of the snapshots after 20170723.2100.
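
(For reference, I believe something like this lists the snapshots oldest-first so I can see exactly what would have to go — an untested sketch based on the auto-* snapshots shown above:)

Code:
# List snapshots on the pool, oldest first
zfs list -t snapshot -o name,creation -s creation -r data0-backup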

My goal is to replicate data0-backup over to another pool (data0). I had been intending to do that with:

Code:
zfs snapshot -r data0-backup@xfr
zfs send -R data0-backup@xfr | zfs recv -vF data0
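
(If it helps, my understanding is that `zfs recv` accepts -n for a dry run, so the stream could be sanity-checked before anything is written — untested on my end:)

Code:
# Dry run: report what would be received without writing to data0
zfs send -R data0-backup@xfr | zfs recv -vnF data0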


Is the correct approach here to just destroy all of the snapshots after 20170723.2100-2w (and then presumably the final file as well)? That seems like a lot of commands to run, and potentially error-prone, since I consider `zfs destroy` a dangerous command. Or is there a better way to do this correction and/or transfer, given that I am OK with losing the data in this particular file and all of my snapshots?
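
(One option I'm wondering about: the zfs destroy man page describes an inclusive percent-sign range of snapshots plus a -n dry-run flag, which would collapse this into a single command. FIRST and LAST below are placeholders for the oldest and newest snapshots I'd want gone — untested:)

Code:
# Dry run (-n -v) of destroying an inclusive range of snapshots in one command;
# FIRST and LAST are placeholder snapshot names
zfs destroy -nv data0-backup@FIRST%LAST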
 

PhilipS

Contributor
Joined
May 10, 2016
Messages
179
I'm probably not understanding something, but since you don't care about /path/to/file, couldn't you delete that file from your dataset BEFORE creating the snapshot you want to send?
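
Something like this is what I had in mind (the mountpoint and path below are just placeholders):

Code:
# Remove the unwanted file, then take the snapshot that will be sent
rm /mnt/data0-backup/path/to/file
zfs snapshot -r data0-backup@xfr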
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Think you will have to delete snapshots until you get back to the one that is the base of the corruption.

You have multiple checksum errors on your mirror. How'd you manage that?
 

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
I'm probably not understanding something, but since you don't care about /path/to/file, couldn't you delete that file from your dataset BEFORE creating the snapshot you want to send?
The snapshots had already been created. I could delete the file, but the complaint remains about the bad snapshot.

Thanks, Stux. I destroyed all the snapshots in the chain and the pool is back to being healthy (minus the one file, of course). As for the checksum errors - I'm not entirely sure where they popped up, but this pool was created as a replication from another pool. The original pool gets regularly scrubbed, so I'm confident it doesn't have errors; it looks like there may have been an error during the replication itself. I made the stupid mistake of not scrubbing this duplicate pool before destroying the original, so I don't really have a way of retrying the replication. Lesson learned :)

Anyway, thanks for the confirmation. Deleting the various snapshots put me back in a good state and now I'm back to confirming everything is where I expect it to be.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Right. So the checksum error was mirrored to each disk.

Didn't realise that could happen as part of replication.

How did you replicate? Direct in chassis? With ECC? Over ssh?
 

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
This was direct in chassis via the console with a traditional keyboard and monitor (i.e., not remote), just piping zfs send to zfs recv. The simplest explanation is that the system does not have ECC RAM, so the data may have been corrupted in memory. I've since run a memory test and it came back with nothing, though, so it would have had to be a transient error. But it's the simplest explanation I can think of.

(FWIW: Part of the reason I'm doing this move is so that I can upgrade the box to one with an ECC-capable board, so hopefully that risk will be gone soon.)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Right. It seems like a poster-child non-ECC case to me.

Good point tho. People should scrub the destination of the replication before destroying the original.
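
Something along these lines, before touching the source pool (pool name is just this thread's example):

Code:
# Scrub the replication target and confirm it comes back clean
zpool scrub data0-backup
zpool status -v data0-backup   # re-check once the scrub finishes; want "errors: No known data errors"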
 