SOLVED Right approach for zfs send with snapshot corruption

Status
Not open for further replies.

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
Through my own negligence and, presumably, a bit of bad luck, I managed to create data corruption in a snapshot during a zfs send | zfs recv operation. The good news is that `zpool scrub` only finds a problem with a single file, and I really don't care about that file. I also don't need to keep my historical snapshots. So I deleted the snapshot and re-ran the scrub, but all that ended up doing was telling me the next snapshot is corrupt.

This is what I'm currently looking at:

Code:
  pool: data0-backup
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 1h45m with 1 errors on Fri Aug  4 05:12:46 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        data0-backup                                    ONLINE       0     0     1
          mirror-0                                      ONLINE       0     0     2
            gptid/89af00a4-78bb-11e7-9400-5404a617fd4f  ONLINE       0     0     2
            gptid/8a5b5dd9-78bb-11e7-9400-5404a617fd4f  ONLINE       0     0     2

errors: Permanent errors have been detected in the following files:

		data0-backup@auto-20170724.0300-2w:/path/to/file


I got here because the error was previously detected in data0-backup@auto-20170723.2100-2w:/path/to/file. I destroyed that snapshot with `zfs destroy data0-backup@auto-20170723.2100-2w` and then ran `zpool scrub data0-backup` again. These snapshots are taken every 6 hours. I'm assuming (perhaps incorrectly) that if I keep up this process, I will just end up destroying all of the snapshots after 20170723.2100.
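
(For reference, I believe something like this lists the snapshots oldest-first so I can see exactly what would have to go — an untested sketch based on the auto-* snapshots shown above:)

Code:
# List snapshots on the pool, oldest first
zfs list -t snapshot -o name,creation -s creation -r data0-backup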

My goal is to replicate data0-backup over to another pool (data0). I had been intending to do that with:

Code:
zfs snapshot -r data0-backup@xfr
zfs send -R data0-backup@xfr | zfs recv -vF data0
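
(If it helps, my understanding is that `zfs recv` accepts -n for a dry run, so the stream could be sanity-checked before anything is written — untested on my end:)

Code:
# Dry run: report what would be received without writing to data0
zfs send -R data0-backup@xfr | zfs recv -vnF data0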


Is the correct approach here to just destroy all of the snapshots after 20170723.2100-2w (and then presumably the final file as well)? That seems like a lot of commands to run, and potentially error-prone, since I consider `zfs destroy` a dangerous command. Or is there a better way to do this correction and/or transfer, given that I am OK with losing the data in this particular file and all of my snapshots?
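
(One option I'm wondering about: the zfs destroy man page describes an inclusive percent-sign range of snapshots plus a -n dry-run flag, which would collapse this into a single command. FIRST and LAST below are placeholders for the oldest and newest snapshots I'd want gone — untested:)

Code:
# Dry run (-n -v) of destroying an inclusive range of snapshots in one command;
# FIRST and LAST are placeholder snapshot names
zfs destroy -nv data0-backup@FIRST%LAST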
 

PhilipS

Contributor
Joined
May 10, 2016
Messages
179
I'm probably not understanding something, but since you don't care about /path/to/file, couldn't you delete that file from your dataset BEFORE creating the snapshot you want to send?
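
Something like this is what I had in mind (the mountpoint and path below are just placeholders):

Code:
# Remove the unwanted file, then take the snapshot that will be sent
rm /mnt/data0-backup/path/to/file
zfs snapshot -r data0-backup@xfr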
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Think you will have to delete snapshots until you get back to the one that is the base of the corruption.

You have multiple checksum errors on your mirror. How'd you manage that?
 

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
I'm probably not understanding something, but since you don't care about /path/to/file, couldn't you delete that file from your dataset BEFORE creating the snapshot you want to send?
The snapshots had already been created. I could delete the file, but the complaint remains about the bad snapshot.

Thanks, Stux. I destroyed all the snapshots in the chain and the pool is back to being healthy (minus the one file, of course). As for the checksum errors - I'm not entirely sure where they popped up, but this pool was created as a replication from another pool. The original pool gets regularly scrubbed, so I'm confident it doesn't have errors; it looks like there may have been an error during the replication itself. I made the stupid mistake of not scrubbing this duplicate pool before destroying the original, so I don't really have a way of retrying the replication. Lesson learned :)

Anyway, thanks for the confirmation. Deleting the various snapshots put me back in a good state and now I'm back to confirming everything is where I expect it to be.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Right. So the checksum error was mirrored to each disk.

Didn't realise that could happen as part of replication.

How did you replicate? Direct in chassis? With ECC? Over ssh?
 

mpdelbuono

Cadet
Joined
Aug 3, 2017
Messages
3
This was direct in chassis via the console with a traditional keyboard and monitor (i.e., not remote), just piping zfs send to zfs recv. The simplest explanation is that the system does not have ECC RAM, so the data may have been corrupted in memory. I've since run a memory test and it came back with nothing, though, so it would have had to be a transient error. But it's the simplest explanation I can think of.

(FWIW: Part of the reason I'm doing this move is so that I can upgrade the box to one with an ECC-capable board, so hopefully that risk will be gone soon.)
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
Right. It seems like a poster-child non-ECC case to me.

Good point tho. People should scrub the destination of the replication before destroying the original.
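
Something along these lines, before touching the source pool (pool name is just this thread's example):

Code:
# Scrub the replication target and confirm it comes back clean
zpool scrub data0-backup
zpool status -v data0-backup   # re-check once the scrub finishes; want "errors: No known data errors"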
 