ZFS Send Receive Interrupted

Status
Not open for further replies.

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Background: I have two pools, Vol0 (RAIDZ1) and Vol2 (RAIDZ2). My primary pool is Vol0, which will eventually be replaced by Vol2. As a first step, I took a snapshot of Vol0 and did a send/receive of that snapshot to Vol2. Once that was completed this past weekend, I took a new snapshot of Vol0 and did a send/receive of the difference between the first snapshot and the latest one to bring the new pool up to date.

During this process, a faulty (loose internal) cable caused a power loss that made Vol0 unavailable, interrupting the send/receive. After checking all connections and drives, I finally got Vol0 back online and discovered that Vol2 is now reporting a warning that there may be some corrupt data. I'm assuming this is due to an incomplete snapshot (of the difference) being sent to Vol2.

Below are the last commands I ran before the interruption occurred:
zfs snapshot -r Vol0@migrate_20180924-0040
zfs send -Rv -i Vol0@migrate_20180916-1430 Vol0@migrate_20180924-0040 | zfs receive -F Vol2


My question is: can I redo that send/receive command to correct the corrupted data?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm assuming this is due to an incomplete snapshot (of the difference) being sent to Vol2.
Nope. Replication is atomic, it either works completely or fails completely and there is nothing in between.

So, to examine this further you'll need to provide us with the output of zpool status -v Vol2. Don't forget the [CODE][/CODE] tags.
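
In the meantime, a quick way to confirm the atomicity point for yourself (a minimal sketch, using the snapshot name from your post) is to list the snapshots on the receiving pool; if the interrupted incremental was rolled back, the new snapshot simply won't exist on Vol2:

Code:
# if the interrupted receive was rolled back, this should return nothing
zfs list -t snapshot -r Vol2 | grep migrate_20180924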
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Nope. Replication is atomic, it either works completely or fails completely and there is nothing in between.

So, to examine this further you'll need to provide us with the output of zpool status -v Vol2. Don't forget the [CODE][/CODE] tags.

Code:
root@freenas:~ # zpool status -v Vol2

  pool: Vol2
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 148G in 0 days 04:52:58 with 0 errors on Mon Sep 24 20:47:57 2018
config:

	NAME                                            STATE     READ WRITE CKSUM
	Vol2                                            ONLINE       0     0     0
	  raidz2-0                                      ONLINE       0     0     0
	    gptid/957833a7-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/99f55722-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/9e70f554-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/a2dd4f28-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/a748a0ec-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/aba4c6e0-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/b0173de6-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/b48c368e-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	  raidz2-1                                      ONLINE       0     0     0
	    gptid/b94b6c12-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/fb6b937a-b96d-11e8-9f6f-001f295ccd89  ONLINE       0     0     0
	    gptid/c23ffc1d-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    gptid/c6a25fdc-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0
	    da27p2                                      ONLINE       0     0     0
	    da26p2                                      ONLINE       0     0     0
	    da25p2                                      ONLINE       0     0     0
	    gptid/d6fc5125-b55b-11e8-9f49-001f295ccd89  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

		Vol2/Media@migrate_20180910-1950:<0x0>


The list of files was several pages long and was not included in the code block above.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Can you upload it as a text file? It's important to know where the corruption is. The first entry you included is in metadata for snapshot migrate_20180910-1950 of dataset Vol2/Media, so that snapshot has to go.
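
If the output is too long to paste, one option (the file path here is just an example) is to redirect the full status output to a file and attach that:

Code:
zpool status -v Vol2 > /tmp/vol2_status.txt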

Did you run a scrub after the incident?
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Can you upload it as a text file?

Uploaded.

It's important to know where the corruption is. The first entry you included is in metadata for snapshot migrate_20180910-1950 of dataset Vol2/Media, so that snapshot has to go.

Did you run a scrub after the incident?

No, I have not done so. I'm a newbie with ZFS and did not want to do something that would make matters worse without guidance on the right course of action.

Below is the sequence of snapshot sends that were done over the course of the last couple of weeks, in case it helps.
This is being done to eventually move over to Vol2 (which is RAIDZ2).

Code:
zfs snapshot -r Vol0@migrate_20180910-1950

zfs list -t snapshot

zfs send -Rv Vol0@migrate_20180910-1950 | zfs receive -Fd Vol2

----------------------------------------------------------------------------
#2 COPY

zfs snapshot -r Vol0@migrate_20180913-1830

zfs send -Rv -i Vol0@migrate_20180910-1950 Vol0@migrate_20180913-1830 | zfs receive -F Vol2

----------------------------------------------------------------------------
#3 COPY

zfs snapshot -r Vol0@migrate_20180916-1430

zfs send -Rv -i Vol0@migrate_20180913-1830 Vol0@migrate_20180916-1430 | zfs receive -F Vol2

------------------------------------------------------------------------------
#4 COPY	Done 00:40 am 9/24/2018

zfs snapshot -r Vol0@migrate_20180924-0040

zfs send -Rv -i Vol0@migrate_20180916-1430 Vol0@migrate_20180924-0040 | zfs receive -F Vol2

------------------------------------------------------------------------------
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Seems to be confined to a single snapshot, so just delete Vol2/Media@migrate_20180910-1950. Best to run a scrub when you're done.
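
In rough terms, that would look like this (a sketch; double-check the snapshot name before destroying anything):

Code:
# remove only the damaged snapshot; the live Vol2/Media dataset is not touched
zfs destroy Vol2/Media@migrate_20180910-1950
# then check the rest of the pool
zpool scrub Vol2
zpool status -v Vol2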
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Seems to be confined to a single snapshot, so just delete Vol2/Media@migrate_20180910-1950. Best to run a scrub when you're done.

The @migrate_20180910-1950 snapshot was part of the original first snapshot that was sent to Vol2 when the new pool was created. If I were to delete it, wouldn't I lose everything from that dataset?

Since this dataset holds the bulk of the data, would I be better off destroying the Vol2 pool and starting over to recreate a copy of Vol0?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Deleting any snapshot will not affect the "current" version. The errors are specifically in the snapshot (at least the errors detected so far), so they do not affect live data.
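
If you want to see what would be removed versus what stays (just an illustrative check), compare the snapshot list with the live datasets:

Code:
# snapshots only; destroying one of these does not touch the live data
zfs list -t snapshot -r Vol2/Media
# live datasets, which remain after the snapshot is destroyed
zfs list -r Vol2/Media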
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Deleting any snapshot will not affect the "current" version. The errors are specifically in the snapshot (at least the errors detected so far), so they do not affect live data.


Thanks.
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Deleting any snapshot will not affect the "current" version. The errors are specifically in the snapshot (at least the errors detected so far), so they do not affect live data.


Thanks. That cleared the error issue.

Since the problem with the faulty cable started during the last send/receive of snapshots, would it cause problems to rerun that last send/receive to make sure all the changes between snapshots made it over to Vol2?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Before you waste time with further operations, definitely run a scrub and let it finish. If it's clear, you'll want to run your replication again, since none of it made it in.
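
In outline, that would be something like the following (a sketch reusing the exact send/receive from your #4 copy above):

Code:
zpool scrub Vol2
# wait for the scrub to finish with 0 errors before re-sending
zpool status Vol2
zfs send -Rv -i Vol0@migrate_20180916-1430 Vol0@migrate_20180924-0040 | zfs receive -F Vol2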
 

Jerzy Sobski

Explorer
Joined
Mar 6, 2015
Messages
50
Before you waste time with further operations, definitely run a scrub and let it finish. If it's clear, you'll want to run your replication again, since none of it made it in.


Thanks for your help. The scrub came back with no issues and I was able to rerun the replication to bring Vol2 up to date. Everything seems to be working fine at this point.
 