zpool Encrypted Pool auto Mount / Import Issues (due to missing SLOG)

Status
Not open for further replies.

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
I have an encrypted pool to which I attached a SLOG, which I have since removed physically (it's no longer in my hands), and zpool remove doesn't work. FreeNAS won't auto-import the pool due to the missing device. See detail below.

I tried a zpool replace but get an error message about the replacement device being too small (the removed SLOG was 800GB - I didn't carve out a smaller chunk because I was in a hurry, figured it didn't matter, and it was just a test). The replacement SLOG available is 280 GB. The 12 x 3.5" bays are full, but I suppose I could attach a 3.5" drive via SATA and pass it through via RDM (temporarily) to replace with a large enough device - that is, if I'm interpreting the "too small" message correctly.

The hardware in my signature is correct (except for an unused Optane 900p, which I want to pass through to FreeNAS for use as a SLOG).

Please help - I don't want to have to blow up my pool, copy 30 TB, blah blah (end game = replicate to server B daily, but I'm still working on getting server A provisioned / stable).

Attempting to unlock the pool via the GUI ... results in this:
import volume.jpg
Code:
node-A-FreeNAS# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:01 with 0 errors on Tue Sep 18 03:45:01 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors

OK, so no pool imported ... Let's try ...
Code:
node-A-FreeNAS# zpool import -m Tank1

Code:
node-A-FreeNAS# zpool status
  pool: Tank1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 96K in 0 days 00:00:01 with 0 errors on Tue Sep 18 15:12:36 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		Tank1											   DEGRADED	 0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/b2a4631f-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b32f3d84-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b3ac7c81-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b42f2d96-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b4afbc60-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b52dfe57-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
		  raidz2-1										  ONLINE	   0	 0	 0
			gptid/b63c8107-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b6cf4e29-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b760d773-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b7f232c5-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b8782967-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/caef1f42-a56d-11e8-a14d-000c29cd5f5a.eli  ONLINE	   0	 0	 0
		logs
		  2240279982886824239							   UNAVAIL	  0	 0	 0  was /dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:01 with 0 errors on Tue Sep 18 03:45:01 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors

OK, so that worked, but I can't see my pool in the GUI, nor does it show up under /mnt - only under /Tank1 ...
Attempts to zpool remove Tank1 2240279982886824239 fail.
I've already run a scrub.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Attempts to zpool remove Tank1 2240279982886824239 fail.

Try a zpool clear Tank1 now that it's imported with the missing log device, then remove it again.
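Something along these lines, using the GUID of the missing log device from your zpool status output:

Code:
zpool clear Tank1
zpool remove Tank1 2240279982886824239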

What's the error you get?
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Tried that, no error ... but device doesn't "remove" ...

MAN I LOVE THAT HoneyBadger! :)
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Code:
node-A-FreeNAS# zpool status Tank1
  pool: Tank1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 96K in 0 days 00:00:01 with 0 errors on Tue Sep 18 15:12:36 2018
config:

		NAME                                                STATE     READ WRITE CKSUM
		Tank1                                               DEGRADED     0     0     0
		  raidz2-0                                          ONLINE       0     0     0
			gptid/b2a4631f-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b32f3d84-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b3ac7c81-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b42f2d96-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b4afbc60-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b52dfe57-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
		  raidz2-1                                          ONLINE       0     0     0
			gptid/b63c8107-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b6cf4e29-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b760d773-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b7f232c5-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/b8782967-9431-11e8-974d-000c296707ef.eli  ONLINE       0     0     0
			gptid/caef1f42-a56d-11e8-a14d-000c29cd5f5a.eli  ONLINE       0     0     0
		logs
		  2240279982886824239                               UNAVAIL      0     0     0  was /dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli

errors: No known data errors
node-A-FreeNAS# zpool clear Tank1
node-A-FreeNAS# zpool remove Tank1 2240279982886824239
node-A-FreeNAS# zpool status
  pool: Tank1
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
		the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://illumos.org/msg/ZFS-8000-2Q
  scan: resilvered 96K in 0 days 00:00:01 with 0 errors on Tue Sep 18 15:12:36 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		Tank1											   DEGRADED	 0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/b2a4631f-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b32f3d84-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b3ac7c81-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b42f2d96-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b4afbc60-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b52dfe57-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
		  raidz2-1										  ONLINE	   0	 0	 0
			gptid/b63c8107-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b6cf4e29-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b760d773-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b7f232c5-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/b8782967-9431-11e8-974d-000c296707ef.eli  ONLINE	   0	 0	 0
			gptid/caef1f42-a56d-11e8-a14d-000c29cd5f5a.eli  ONLINE	   0	 0	 0
		logs
		  2240279982886824239							   UNAVAIL	  0	 0	 0  was /dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:01 with 0 errors on Tue Sep 18 03:45:01 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da0p2	 ONLINE	   0	 0	 0

errors: No known data errors

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Tried that, no error ... but device doesn't "remove" ...

Run zdb -U /data/zfs/zpool.cache and code-tag it up. I bet your log vdev is pending removal but ZFS can't execute on it.

MAN I LOVE THAT HoneyBadger! :)

HoneyBadger don't give a (bleep) if you love it or not. ;)
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
The HB comment made my day - seriously :) :)

Code:
node-A-FreeNAS# zdb -U /data/zfs/zpool.cache
Tank1:
	version: 5000
	name: 'Tank1'
	state: 0
	txg: 849076
	pool_guid: 1165592307591645323
	hostid: 1055334124
	hostname: 'node-A-FreeNAS.SULLYREALM.LAN'
	com.delphix:has_per_vdev_zaps
	vdev_children: 3
	vdev_tree:
		type: 'root'
		id: 0
		guid: 1165592307591645323
		children[0]:
			type: 'raidz'
			id: 0
			guid: 2579288238076018945
			nparity: 2
			metaslab_array: 53
			metaslab_shift: 38
			ashift: 12
			asize: 36007020331008
			is_log: 0
			create_txg: 4
			com.delphix:vdev_zap_top: 36
			children[0]:
				type: 'disk'
				id: 0
				guid: 10146101327001238949
				path: '/dev/gptid/b2a4631f-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@2/elmdesc@Slot_02'
				whole_disk: 1
				DTL: 397
				create_txg: 4
				com.delphix:vdev_zap_leaf: 37
			children[1]:
				type: 'disk'
				id: 1
				guid: 16516664522723935945
				path: '/dev/gptid/b32f3d84-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@9/elmdesc@Slot_09'
				whole_disk: 1
				DTL: 153
				create_txg: 4
				com.delphix:vdev_zap_leaf: 38
			children[2]:
				type: 'disk'
				id: 2
				guid: 7825365496761078844
				path: '/dev/gptid/b3ac7c81-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@1/elmdesc@Slot_01'
				whole_disk: 1
				DTL: 152
				create_txg: 4
				com.delphix:vdev_zap_leaf: 39
			children[3]:
				type: 'disk'
				id: 3
				guid: 8868288218045992461
				path: '/dev/gptid/b42f2d96-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@7/elmdesc@Slot_07'
				whole_disk: 1
				DTL: 145
				create_txg: 4
				com.delphix:vdev_zap_leaf: 40
			children[4]:
				type: 'disk'
				id: 4
				guid: 6499826818015338779
				path: '/dev/gptid/b4afbc60-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@b/elmdesc@Slot_11'
				whole_disk: 1
				DTL: 143
				create_txg: 4
				com.delphix:vdev_zap_leaf: 41
			children[5]:
				type: 'disk'
				id: 5
				guid: 4814616330319370776
				path: '/dev/gptid/b52dfe57-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@3/elmdesc@Slot_03'
				whole_disk: 1
				DTL: 139
				create_txg: 4
				com.delphix:vdev_zap_leaf: 42
		children[1]:
			type: 'raidz'
			id: 1
			guid: 5750916281922356097
			nparity: 2
			metaslab_array: 50
			metaslab_shift: 38
			ashift: 12
			asize: 36007020331008
			is_log: 0
			create_txg: 4
			com.delphix:vdev_zap_top: 43
			children[0]:
				type: 'disk'
				id: 0
				guid: 17301457186701692218
				path: '/dev/gptid/b63c8107-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@6/elmdesc@Slot_06'
				whole_disk: 1
				DTL: 138
				create_txg: 4
				com.delphix:vdev_zap_leaf: 44
			children[1]:
				type: 'disk'
				id: 1
				guid: 4877257517821901329
				path: '/dev/gptid/b6cf4e29-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@8/elmdesc@Slot_08'
				whole_disk: 1
				DTL: 136
				create_txg: 4
				com.delphix:vdev_zap_leaf: 45
			children[2]:
				type: 'disk'
				id: 2
				guid: 8746803776442122363
				path: '/dev/gptid/b760d773-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@4/elmdesc@Slot_04'
				whole_disk: 1
				DTL: 135
				create_txg: 4
				com.delphix:vdev_zap_leaf: 46
			children[3]:
				type: 'disk'
				id: 3
				guid: 11955749817537875686
				path: '/dev/gptid/b7f232c5-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@5/elmdesc@Slot_05'
				whole_disk: 1
				DTL: 133
				create_txg: 4
				com.delphix:vdev_zap_leaf: 47
			children[4]:
				type: 'disk'
				id: 4
				guid: 9019293143275552892
				path: '/dev/gptid/b8782967-9431-11e8-974d-000c296707ef.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@a/elmdesc@Slot_10'
				whole_disk: 1
				DTL: 131
				create_txg: 4
				com.delphix:vdev_zap_leaf: 48
			children[5]:
				type: 'disk'
				id: 5
				guid: 10583640473928649361
				path: '/dev/gptid/caef1f42-a56d-11e8-a14d-000c29cd5f5a.eli'
				phys_path: 'id1,enc@n5003048017b7a5bd/type@0/slot@c/elmdesc@Slot_12'
				whole_disk: 1
				DTL: 57
				create_txg: 4
				com.delphix:vdev_zap_leaf: 56
		children[2]:
			type: 'disk'
			id: 2
			guid: 2240279982886824239
			path: '/dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli'
			whole_disk: 1
			not_present: 1
			metaslab_array: 60
			metaslab_shift: 27
			ashift: 12
			asize: 800160743424
			is_log: 1
			removing: 1
			create_txg: 838186
			com.delphix:vdev_zap_top: 58
			offline: 1
	features_for_read:
		com.delphix:hole_birth
		com.delphix:embedded_data

 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Now we're cooking.

Code:
	   children[2]:
		   type: 'disk'
		   id: 2
		   guid: 2240279982886824239
		   path: '/dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli'
		   whole_disk: 1
		   not_present: 1
		   metaslab_array: 60
		   metaslab_shift: 27
		   ashift: 12
		   asize: 800160743424
		   is_log: 1
		   removing: 1 ***This line right here is the ticket***
		   create_txg: 838186
		   com.delphix:vdev_zap_top: 58
		   offline: 1


Your log drive is in a removing state which is good. Before you physically unplugged it, did you remove it from the pool from GUI/cmd? zpool history Tank1 might help here as well. (This could take a long time to run depending on your pool age.)

Since we're also talking about an encrypted pool, do you have a backup of your key and passphrase? Just in case things go really sideways here.
 
Last edited:

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Now we're cooking.
Your log drive is in a removing state which is good. Before you physically unplugged it, did you remove it from the pool from GUI/cmd? zpool history Tank1 might help here as well. (This could take a long time to run depending on your pool age.)
  • I'm nearly certain I removed it from the pool prior to shutdown and subsequent physical removal.
  • If you are asking for the extract from zpool history Tank1, see the code tags below; it doesn't show anything exciting in my opinion. You can see I replaced one SLOG with another (15:48:26), then attempted to remove it (16:16:42). I believe I then attempted to offline it because it didn't remove via the GUI.
Since we're also talking about an encrypted pool, do you have a backup of your key and passphrase? Just in case things go really sideways here.
  • I'm super diligent about keeping my key, recovery key, and passphrase (that one lives in my head) backed up. Heck, I even record each drive's disk id, gptid, etc. and back up each drive's geli metadata, but ...
  • I've noticed that if I remove a SLOG from my pool, upon reboot the pool auto-imports without prompting me to unlock it.
  • Right, wrong, or indifferent, I've always "solved" that with an encryption rekey, followed by the aforementioned backup schema.
  • But between encountering the issue and now, I've attempted to rekey several times in an attempt to remedy a "removed" SLOG that the GUI still "reports". So if you are asking whether I have the geli key on hand: yes, as I have to run geli attach -k geli.key da2p1 and then enter the passphrase for each disk, which brings my pool from an unavailable to a degraded state - but I don't have all of the items enumerated previously.
BTW - many thanks! :)

Code:
2018-09-17.15:43:35 zpool import -f -R /mnt 1165592307591645323
2018-09-17.15:43:37 zfs inherit -r mountpoint Tank1
2018-09-17.15:43:37 zpool set cachefile=/data/zfs/zpool.cache Tank1
2018-09-17.15:43:37 zfs set aclmode=passthrough Tank1
2018-09-17.15:43:42 zfs set aclinherit=passthrough Tank1
2018-09-17.15:48:26 zpool replace Tank1 gptid/e2f76680-ba9f-11e8-a40b-000c29d08afe.eli /dev/gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli
2018-09-17.16:16:42 zpool remove Tank1 gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli
2018-09-17.16:17:49 zpool offline Tank1 gptid/92a829ca-bab2-11e8-8ab9-000c29d08afe.eli
2018-09-17.16:17:54 zpool remove Tank1 2240279982886824239
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Maybe this is relevant or not, but
  • FreeNAS seems to take it upon itself to initiate a scrub upon reboot and unlocking of the disks.
  • Also, it seems to go nuts with "vdev state changed" messages while geli attaching disk by disk (ref image below).
  • Finally, when I execute zdb Tank1 I see an issue (ref code tag below); I think this is part of the problem.
footer.jpg

Code:
node-A-FreeNAS# zdb Tank1
Assertion failed: vd->vdev_ms_shift == 0 (0x1b == 0x0), file /freenas-11-releng/freenas/_BE/os/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c, line 1866.
Abort trap (core dumped)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
The vdev_ms_shift seems to be pointing towards a metaslab issue, and when zdb is giving those kinds of errors, that's usually my cue to back up data and destroy the pool, which was what got recommended in this old bug report that was closed as "cannot reproduce"

https://redmine.ixsystems.com/issues/5337

The only other hail-mary attempt would be to try the replace with a truncated file of the same size (800160743424 bytes) and then, if that succeeds, remove it right away, hopefully destroying the entire log top-level vdev successfully. Of course, if the bug is at the pool level, that could just result in you now having two log devices that are stuck pending removal.
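Roughly, that sequence would look like the following (sketch only - the sparse file path is just a placeholder):

Code:
truncate -s 800160743424 /path/to/a/sparse/file
zpool replace Tank1 2240279982886824239 /path/to/a/sparse/file
# if the replace completes, remove the log device immediately
zpool remove Tank1 /path/to/a/sparse/file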
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Thanks so much for the prompt follow up!

The vdev_ms_shift seems to be pointing towards a metaslab issue, and when zdb is giving those kinds of errors, that's usually my cue to back up data and destroy the pool, which was what got recommended in this old bug report that was closed as "cannot reproduce"

https://redmine.ixsystems.com/issues/5337

So that may make sense. I changed the recordsize at the pool level to 1024k (my data = mostly huge files), and ~24 hours prior I had set up datasets with different combinations of recordsize, sync, and compression off to facilitate testing. I noticed that zfs set / zfs get returned the proper values, but the GUI continued to show an old value for recordsize (inherit) instead of what I changed via the CLI ... seemed odd to me. Perhaps something was wonky prior to SLOG removal?



Changing Tank1's recordsize at the pool level would change the ashift, right?

This is now twice I've blown up a pool, and twice it has been caused by the "non-destructive" addition of a SLOG. On the bright side, it could always be worse!

I played OL, so I never got to throw a hail mary, and I don't think today is my day either ... Additionally, even backup via 1 Gbps sounds a lot better than 100 Mbps down from external should I throw an interception and blow the game.

Let me ask you this: why is it not really recommended? The Optane => P3700 replace succeeded; the failure was the P3700 removal. I know passing through SATA controllers falls into the "I don't care about my data" bucket, but in the absence of another disk that size, the only thing I can think of is to pass through a SATA controller, plug in a 3.5" 6 TB drive via SATA, and power it via a 4-pin Molex => SATA power adapter (the 12-bay backplane is full). Use that as the target for the replace.

Today's survey (select one):
[ ] It would take 2 days 13 hours 5 minutes 2 seconds to transfer 25 Terabytes at 119.21 MB/sec
[ ] It would take 15 hours 16 minutes 15 seconds to transfer 25 Terabytes at 476.84 MB/sec
https://downloadtimecalculator.com/Data-Transfer-Calculator.html

My point => Now I just need to figure out how to get the two servers to talk to each other point-to-point, since I don't have a 10 GbE switch yet (option 2, not limited by 1 GbE, sounds much better). Servers have 10 GbE NICs, my switch does not.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So that may make sense. I changed the recordsize at the pool level to 1024k (my data = mostly huge files), and ~24 hours prior I had set up datasets with different combinations of recordsize, sync, and compression off to facilitate testing. I noticed that zfs set / zfs get returned the proper values, but the GUI continued to show an old value for recordsize (inherit) instead of what I changed via the CLI ... seemed odd to me. Perhaps something was wonky prior to SLOG removal?

Setting things in the CLI tends to make them not show at the GUI, possibly until an export/import of the pool.

Changing Tank1's recordsize at the pool level would change the ashift, right?

No. ashift is immutable once set; what I'm looking at here is the metaslab_shift. And I think I've got something from this - I'll throw together another post in a second on my suspicions.

I played OL, so I never got to throw a hail mary, and I don't think today is my day either ... Additionally, even backup via 1 Gbps sounds a lot better than 100 Mbps down from external should I throw an interception and blow the game.

Agreed. The only reason I've got it there is because you seem like you have a recent backup and are willing to take the chance on this, because one way or another there's a backup and then restore involved.

Let me ask you this: why is it not really recommended? The Optane => P3700 replace succeeded; the failure was the P3700 removal. I know passing through SATA controllers falls into the "I don't care about my data" bucket, but in the absence of another disk that size, the only thing I can think of is to pass through a SATA controller, plug in a 3.5" 6 TB drive via SATA, and power it via a 4-pin Molex => SATA power adapter (the 12-bay backplane is full). Use that as the target for the replace.

My suggestion on "don't do it" is because with a zpool in a wonky state like this, messing with it could make things worse and cause you to restore from backup. But it really looks like you might be heading down that path regardless. But if you want to give it a shot, you can use truncate -s 800160743424 /path/to/a/sparse/file and then try zpool replace using that file as a target - then if it succeeds, immediately remove the SLOG. Of course if you can't remove it, then you've got a wonky pool trying to SLOG to a sparse file. Hence the "hail mary" nature of it.

Also, "Molex to SATA, lose all your data" is a rhyme that I've heard a lot.

Today's survey (select one):
[ ] It would take 2 days 13 hours 5 minutes 2 seconds to transfer 25 Terabytes at 119.21 MB/sec
[ ] It would take 15 hours 16 minutes 15 seconds to transfer 25 Terabytes at 476.84 MB/sec
https://downloadtimecalculator.com/Data-Transfer-Calculator.html

Neither is particularly appealing.

My point => Now I just need to figure out how to get the two servers to talk to each other point-to-point, since I don't have a 10 GbE switch yet (option 2, not limited by 1 GbE, sounds much better). Servers have 10 GbE NICs, my switch does not.

If they're SFP+ just buy a cheap direct-attach (Twinax) copper cable from Monoprice.

Next post coming shortly.
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Comments for potential inclusion in the pending reply you mentioned; however, none are excruciatingly relevant.

Agreed. The only reason I've got it there is because you seem like you have a recent backup and are willing to take the chance on this, because one way or another there's a backup and then restore involved.
  • Known variables: 100% backed up offsite / restore speed = 100 Mbps / I'm ultimately going to blow that pool away.
  • Unknown variables: Worthwhile to attempt to put pool in a ready state prior to copy to 2nd server?
Also, "Molex to SATA, lose all your data" is a rhyme that I've heard a lot.Neither is particularly appealing.
  • Never heard that one, catchy though! Not sure I get it but the big 4 pins power the backplane / what is the difference between that and big 4 => female sata adapter? No reply needed, really.
  • Mine seems to be "hook up a slog, your data disappears into the fog."
Neither is particularly appealing.
  • True; however, this is why I went from 1 server to 2. Of the 3-2-1 backup strategy I was missing the "2."
  • I would have preferred to backup the data at my pleasure; not necessity, and after I was done testing, but stuff happens.
If they're SFP+ just buy a cheap direct-attach (Twinax) copper cable from Monoprice.
  • No, they are 8P8C (RJ45) 10GBASE-T.
  • Cat 6 should be good enough to get better than 1 Gbps for such a short distance (inches).
  • I've just never configured point to point before.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Okay, so here's what I think the root cause of this issue was. Please correct me if I screw up any assumptions here.

The sequence of events was:

1. You got an Optane 900p drive, it didn't want to VT-d nicely, so you created a 20GB vDisk and used that for the very first SLOG this pool got
2. You later on got an 800GB P3700 and used zpool replace to "hot swap" the SLOG devices, rather than removing the log and creating a new one
3. You've now tried to remove that P3700 and are hitting this issue

So here's what I believe broke your log vdev, and why.

When ZFS first adds a vdev to a pool, it slices it up into what are called metaslabs - simply, a smaller, more manageable chunk of the disk. Think similar to the idea of a "RAID stripe size" - but ZFS aims for less than 200 slabs per vdev. That number is an old holdover from the early days of ZFS, and probably needs some serious revisiting in modern times. Metaslabs are sized in powers of two - specifically, the size is 2^ metaslab_shift bytes.

So when you added that 20GB vDisk off the Optane 900p, ZFS said "what's the proper size to get less than 200 slabs here?" and calculated until it came up with metaslab_shift: 27 - your metaslabs are 2^27 = 134217728 bytes = 128MB each.

For a 20GB disk, that gives you a nice round 160 metaslabs, and everything's great.

Until the live-swap to the 800GB P3700.

If you'd destroyed the log vdev entirely, ZFS would have recalculated a proper metaslab_shift to keep your slab count under 200 - metaslab_shift: 32 to be exact, which would have given you 4GB slabs.

But the vdev wasn't destroyed, and metaslab_shift, like ashift, is immutable once set. So when that 800GB drive was swapped in, it got carved up into 128MB chunks.

6400 of them.

I imagine this is making ZFS throw an absolute fit trying to manage the metadata here.
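Quick back-of-the-envelope check on those numbers, using the metaslab_shift and asize values from your zdb output:

Code:
# metaslab size = 2^metaslab_shift = 2^27 bytes
echo $((128 * 1024 * 1024))                    # 134217728 bytes = 128 MiB per slab
# 20 GiB vDisk carved into 128 MiB slabs
echo $((20 * 1024 * 1024 * 1024 / 134217728))  # 160 metaslabs
# 800GB P3700 (asize 800160743424) carved into 128 MiB slabs
echo $((800160743424 / 134217728))             # thousands of metaslabs instead of ~160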

I'm almost tempted to suggest abuse of the gzero device here to create a null-log-device (since it's already in an offline/removing state, it shouldn't have any valid data, and you've already been willing to import -m discarding it) and doing a replace -f with that, then crucially destroying that log vdev.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Not sure I get it but the big 4 pins power the backplane / what is the difference between that and big 4 => female sata adapter? No reply needed, really.

It's the quality of the connector. Using a good one (crimped wires and not injection molded!) is fine, as is plugging them into PCBs like a backplane.

https://www.youtube.com/watch?v=TataDaUNEFc

Mine seems to be "hook up a slog, your data disappears into the fog."

I chuckled. At least you can laugh at this.

I've just never configured point to point before.

Setting up an IP on each end in a separate subnet from one that you're using actively, and specifying the other system's "private 10Gb IP" as the recv target should force the traffic out that port.
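Rough sketch, assuming the 10GbE ports show up as ix0 on both boxes (adjust for your driver) - and note that anything set from the CLI like this won't survive a reboot unless you also enter it in the FreeNAS network settings:

Code:
# on server A
ifconfig ix0 inet 10.10.10.1 netmask 255.255.255.0 up
# on server B
ifconfig ix0 inet 10.10.10.2 netmask 255.255.255.0 up
# then point the copy at the private IP, e.g. root@10.10.10.2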
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Okay, so here's what I think the root cause of this issue was. Please correct me if I screw up any assumptions here.

The sequence of events was:

1. You got an Optane 900p drive, it didn't want to VT-d nicely, so you created a 20GB vDisk and used that for the very first SLOG this pool got
2. You later on got an 800GB P3700 and used zpool replace to "hot swap" the SLOG devices, rather than removing the log and creating a new one
3. You've now tried to remove that P3700 and are hitting this issue

  • All correct. In the interest of factual correctness and since it is staring me (literally) in my face (your SLOG signature link which I posted to) ...
  • I seem to have been accumulating Optane drives over time and had both the Optane & P3700 connected to a now-deprecated CSE-836. As part of the pre-work to move to 2 x CSE-826, I blew away the pools those drives were attached to and created my current, limping 2 x 6-disk Z2 pool, which lived in the CSE-836 and is now in one of the CSE-826 chassis; the NVMe drives went on the shelf after the current pool was created.
So here's what I believe broke your log vdev, and why.

When ZFS first adds a vdev to a pool, it slices it up into what are called metaslabs - simply, a smaller, more manageable chunk of the disk. Think similar to the idea of a "RAID stripe size" - but ZFS aims for less than 200 slabs per vdev. That number is an old holdover from the early days of ZFS, and probably needs some serious revisiting in modern times. Metaslabs are sized in powers of two - specifically, the size is 2^ metaslab_shift bytes.

So when you added that 20GB vDisk off the Optane 900p, ZFS said "what's the proper size to get less than 200 slabs here?" and calculated until it came up with metaslab_shift: 27 - your metaslabs are 2^27 = 134217728 bytes = 128MB each.

For a 20GB disk, that gives you a nice round 160 metaslabs, and everything's great.

Until the live-swap to the 800GB P3700.

If you'd destroyed the log vdev entirely, ZFS would have recalculated a proper metaslab_shift to keep your slab count under 200 - metaslab_shift: 32 to be exact, which would have given you 4GB slabs.

But the vdev wasn't destroyed, and metaslab_shift, like ashift, is immutable once set. So when that 800GB drive was swapped in, it got carved up into 128MB chunks.

6400 of them.

I imagine this is making ZFS throw an absolute fit trying to manage the metadata here.

  • So executive summary: I'm not as smart as the average bear, being lazy and not properly slicing the slog into a manageable size = root cause and I screwed myself, and in conclusion I don't deserve NVMe SLOGs and perhaps a return to COTS appliances, with a specially ordered iteration disallowing user SSH access is in order?
  • But that is a perfectly understandable explanation; thanks for taking the time and breaking it down for me.
I'm almost tempted to suggest abuse of the gzero device here to create a null-log-device (since it's already in an offline/removing state, it shouldn't have any valid data, and you've already been willing to import -m discarding it) and doing a replace -f with that, then crucially destroying that log vdev.
  • You lost me on the gzero, but it sounds cool, so I think we do it!
It's the quality of the connector. Using a good one (crimped wires and not injection molded!) is fine, as is plugging them into PCBs like a backplane.
  • Thank goodness you stopped me. I was about to fire up the injection press to create 6400 connectors. Henceforth, I'll only be making crimped!
I chuckled. At least you can laugh at this.
  • I'm not an overly creative person but that was the best I got (and I'm glad you laughed).
  • Unfortunately, that seems to be my reality (slog fog). I've never achieved a stable SLOG config; worse yet, that quest has cost me pools, twice.
  • It's not funny for sure, but at the end of the day the cost of my mistake will be, at worst, a 25 TB download from the cloud, and at best a 15-hour rclone marathon. It could always be worse, one of my connectors could have spontaneously combusted, taking out not just the server closet, but also my condo.
  • Shamefully, I've violated a number of FreeNAS don'ts in my day (May '17 to date), but none ever bit me (you know like a cobra). Which was it here? (rhetorical)
NB: My sarcasm isn't intended to discount in any way the quality / meaningfulness of your help and amazingly thorough reply. I just have a really lame sense of humor. Let me tell you just how bad it is, a memory that I recall thanks to your avatar. Circa 2013 I was enamored with the Honey Badger video and I thought it would be funny to configure my then manager's computer to play the video at Windows startup. Well apparently he always suspended his laptop, but upon the first cold boot, he happened to be running late to a meeting, which of course happened to be with the CEO et al, and they got their very first taste of Honey Badger. They didn't watch the duration in order to be able to appreciate its full glory and thus likely took pause and wondered if their Director of FP&A had a substance abuse problem. (that's actually a true story)
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
So executive summary: I'm not as smart as the average bear, being lazy and not properly slicing the slog into a manageable size = root cause and I screwed myself, and in conclusion I don't deserve NVMe SLOGs and perhaps a return to COTS appliances, with a specially ordered iteration disallowing user SSH access is in order?

brutal.jpg


zpool replace-ing the SLOG was what got you, rather than the two separate steps of zpool remove and zpool add for the new log device. A giant SLOG is lazy and will go to waste, but provided it's done that way it shouldn't cause this issue. A better way would be to slice up a small partition or use hardware overprovisioning (secure-erase the drive and make it present only 8GB or so to the OS) to get better wear-leveling.
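If you go the small-partition route, the idea is something like this - purely a sketch with a hypothetical device name (nvd0), and since Tank1 is geli-encrypted the GUI would normally want to wrap the log device in geli, so treat it as an illustration of the sizing rather than a literal recipe:

Code:
gpart create -s gpt nvd0
gpart add -t freebsd-zfs -s 16G nvd0   # small slice; the rest stays unallocated for wear-leveling
zpool add Tank1 log nvd0p1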

You lost me on the gzero, but it sounds cool, so I think we do it!

Let's do this.

Here's your YOLO-SLOG as requested.

Code:
geom zero load
zpool replace Tank1 2240279982886824239 /dev/gzero
# wait for resilver/online
zpool remove Tank1 /dev/gzero


If this doesn't work, you can try the truncate -s 800160743424 /path/to/a/sparse/file and then zpool replace Tank1 2240279982886824239 /path/to/a/sparse/file and remove it afterwards.

It could always be worse, one of my connectors could have spontaneously combusted, taking out not just the server closet, but also my condo.

Search "Molex to SATA fire" and you can see some evidence as to why the rhyme exists. ;)

Shamefully, I've violated a number of FreeNAS don'ts in my day (May '17 to date), but none ever bit me (you know like a cobra). Which was it here? (rhetorical)

HoneyBadger smacks the (redacted) outta those cobras. In this case, the snake that bit you was a bit of old ZFS code that's not really well-documented, and a bit of an edge case in that you zpool replaced an SLOG device rather than removing and re-attaching, causing the huge number of metaslabs.

NB: My sarcasm isn't intended to discount in any way the quality / meaningfulness of your help and amazingly thorough reply. I just have a really lame sense of humor.

I'm genuinely happy that you're taking this in stride and are still able to laugh in the face of a potentially long restore process. Rest assured that I'm just as amused by your "lame" sense of humor as you are!
 

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
I'm genuinely happy that you're taking this in stride and are still able to laugh in the face of a potentially long restore process. Rest assured that I'm just as amused by your "lame" sense of humor as you are!
  • All sarcasm aside, I'm more happy that the FreeNAS community has such extremely knowledgeable individuals like yourself, who both understand ZFS so intimately, and further take the time to assist those of us needing a hand. My sincere thanks. :)
  • As to the restore process, it is what it is and well in progress (after all, this is the precise reason I added another chassis, so I could afford to "tinker" without risk of data loss).
To report back ...
  • Using the unsupported, at-your-own-risk approaches of [1] geom zero load and [2] truncate -s 800160743424 /path/to/a/sparse/file resulted in the addition of two new SLOG devices (i.e., I was able to replace the P3700 SLOG); however, I was unable to remove either SLOG created using these approaches. Note: [1] was attempted prior to [2].
  • It was worth a shot! ;)
  • The pool seems to be just as "alive" as it was prior to the hail-mary, but it may be clinging to life because it is only being read from (to copy data to the second server), not written to.
Lessons Learned ...
  • Slice SLOG appropriately (don't be lazy regardless of time constraints); use zpool remove instead of zpool replace (unless SLOG needs to be replaced).
  • Testing, even if "non-destructive", is best done on a non-live pool.
Looking Forward (listed in order of "priority") ...
  1. My feelings won't be hurt if these questions go unanswered, but figured I would toss them out there ...
  2. I'm quite used to rclone; however, I've never used it to copy from one local server to another before. Configuring the SSH remote was giving me trouble, and wanting to get on with the 25 TB+ copy, I started using this command: scp -rp /Tank1/Data1/ root@FreeNAS-02:/mnt/Tank1/Data1/, where source = /Tank1/Data1/ & destination = /mnt/Tank1/Data1/, although I'm quite sure rclone / rsync is superior. Can anyone offer the equivalent rsync command, please? (I ask as, while I can look it up, I'm more comfortable taking guidance from someone else and prefer to be risk averse until I have two local copies of my data.)
  3. Out of curiosity ... I believe I mentioned before, that prior to the issue, and working with an encrypted pool, using the GUI, removing the SLOG hasn't always worked, and for some reason the only way to remove the SLOG from view in the GUI is a reboot and then encryption rekey. I don't take it this is normal behavior, is it?
  4. Out of curiosity ... Unless a passphrase is set, post removal of a SLOG from an encrypted pool, upon reboot, the pool is auto-imported without need to enter a passphrase / key. I take it this is normal behavior?
  5. Bench-marking ... In regards to a pool that isn't destined to become shared storage (i.e. not NFS dataset / iSCSI zvol), is it possible to achieve write speeds (sync=Always) on par with those for sync=Disabled? Referencing my preliminary benchmarks below, you can see that addition of a SLOG provides a huge increase over sync=Always, but never catches up to sync=Disabled speeds.
  6. Bench-marking ... Again, in regards to a pool that isn't destined to become shared storage (i.e. not NFS dataset / iSCSI zvol), why would addition of a SLOG result in much increased read speeds as well (I wouldn't think it would), with the only explanation I can think of being testing methodology error? Note my approach = (1) compression off, (2) removed RAM from the VM, ensured bs * count exceeds attached RAM ***, (3) dd if=/dev/zero of=/mnt/Tank1/ ... + dd of=/dev/null if=/mnt/Tank1/ ... for testfileX, (4) repeat with testfile Y.
  7. *** Isn't there a Tunable that accomplishes the same / other command that should be used prior to using dd? I searched relentlessly, have seen it before, but couldn't find it again.
Thanks Again!

Tank1_bench.jpg
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
My feelings won't be hurt if these questions go unanswered, but figured I would toss them out there ...

Can anyone offer the equivalent rsync command, please?

rsync -aP /mnt/Tank1/Data1/ root@FreeNAS-02:/mnt/Tank1/Data1/ should be what you're after. a for Archive (recursive copy, preserve all attributes, etc) and P for Progress (because it's always nice to see how things are going)

I don't take it [rebooting and rekeying when removing SLOG] is normal behavior, is it?
SLOG is an encrypted device. I believe this is normal behavior.

I take it [auto import without a passphrase] is normal behavior?

Key stored locally, no passphrase = auto import. Normal as per documentation section 8.1.8.1

Is it possible to achieve write speeds (sync=Always) on par with those for sync=Disabled?
Close, but never on par. There's always going to be a slight performance disparity because you're doing the writes to a block device. Faster SLOG devices will get you closer and closer to that point but you're basically competing with a NOP instruction.
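If you're testing this, flip the property on a scratch dataset (hypothetical name Tank1/test) rather than pool-wide:

Code:
zfs set sync=always Tank1/test     # every write goes through the ZIL (SLOG or in-pool)
zfs set sync=disabled Tank1/test   # bypasses the ZIL entirely - not for data you care about
zfs get sync Tank1/test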

why would addition of a SLOG result in much increased read speeds as well (I wouldn't think it would), with the only explanation I can think of being testing methodology error?

If you're using sync=always you still have a ZIL (ZFS Intent Log) but it's just stored in-pool, so you're now asking your vdevs to effectively write everything twice (once to ZIL, then again to the vdevs) and unless they're really fast this extra load will impact their ability to deliver read IOPS/bandwidth.

Isn't there a Tunable that accomplishes the same / other command that should be used prior to using dd?
I think you're thinking of vfs.zfs.arc_max - the quickest way to ensure you're exceeding your ARC would be to limit it.
sysctl vfs.zfs.arc_max to read it (check and record your current setting first)
sysctl vfs.zfs.arc_max=valueinbytes to write it (but you'll want to monitor the actual size of it via kstat.zfs.misc.arcstats.size before you start blasting your benchmarks)
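In other words, something like this (the 4 GiB figure is just an example - pick a cap well below your VM's RAM):

Code:
sysctl vfs.zfs.arc_max                     # read and record the current value
sysctl vfs.zfs.arc_max=4294967296          # example: cap the ARC at 4 GiB
sysctl kstat.zfs.misc.arcstats.size        # watch the actual ARC size before benchmarking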
 
Last edited:

svtkobra7

Patron
Joined
Jan 12, 2017
Messages
202
Can anyone offer the equivalent rsync command, please?
rsync -aP /mnt/Tank1/Data1/ root@FreeNAS-02:/mnt/Tank1/Data1/ should be what you're after. a for Archive (recursive copy, preserve all attributes, etc) and P for Progress (because it's always nice to see how things are going)
  • Thanks ... sorry to ask the favor, but I don't want 2 days to turn into 200, i.e. DL from cloud.
I don't take it [rebooting and rekeying when removing SLOG] is normal behavior, is it?
SLOG is an encrypted device. I believe this is normal behavior.
  • Maybe with a fresh pool I won't encounter the issue. I guess what I was trying to convey here is that when you click remove in the GUI, you would expect to see the device removed (or at least I would, in a non-borked pool). I'll attempt to replicate later / log a ticket if necessary.
I take it [auto import without a passphrase] is normal behavior?
Key stored locally, no passphrase = auto import. Normal as per documentation section 8.1.8.1
  • Arghh ... search failed me. Thanks.
Is it possible to achieve write speeds (sync=Always) on par with those for sync=Disabled?
Close, but never on par. There's always going to be a slight performance disparity because you're doing the writes to a block device. Faster SLOG devices will get you closer and closer to that point but you're basically competing with a NOP instruction.
  • I agree with your assertion, that the delta can never be chased down to 0, but I would expect the impact from such a fast SLOG to have made the variance much smaller than it was.
  • pool recordsize = 1024k ... average [sync=Always w/ Optane write speed] / average [sync=disabled write speed] = 41%
  • pool recordsize = 128k ... average [sync=Always w/ Optane write speed] / average [sync=disabled write speed] = 69%
  • Let's discard the 1024k result; still, I would think that 30% left on the table is a bit high. Of note, a prior, less comprehensive go at benchmarking showed that ratio at 61%, and adding another Optane (striped) got it to 72%.
  • [but that is without any additional tweaking - perhaps more research will be rewarding]
why would addition of a SLOG result in much increased read speeds as well (I wouldn't think it would), with the only explanation I can think of being testing methodology error?
If you're using sync=always you still have a ZIL (ZFS Intent Log) but it's just stored in-pool, so you're now asking your vdevs to effectively write everything twice (once to ZIL, then again to the vdevs) and unless they're really fast this extra load will impact their ability to deliver read IOPS/bandwidth.
  • I would 100% buy that answer, and in fact it is the same I would offer (less articulately); however,
  • That second write occurs every 5 seconds, when the ZIL is flushed and blocks are written to the pool, so unless you run a dd write, and then
  • BAM, within 5 seconds, prior to the final on-disk ZIL flush,
  • hit the pool with a dd read, I would not think that your disks working harder during the write cycle (on-disk ZIL vs. SLOG) would have any impact during the beginning of the read cycle and subsequent speeds.
  • Thus my suggestion that it is likely methodology error. But I've seen this every time I've tested ... I find it curious. And there was no other disk activity other than the benchmarking.
Isn't there a Tunable that accomplishes the same / other command that should be used prior to using dd
I think you're thinking of vfs.zfs.arc_max - the quickest way to ensure you're exceeding your ARC would be to limit it.
sysctl -a vfs.zfs.arc_max to read it (check and record your current setting first)
sysctl vfs.zfs.arc_max=valueinbytes to write it (but you'll want to monitor the actual size of it via kstat.zfs.misc.arcstats.size before you start blasting your benchmarks)
  • YES ... that's the ticket - not sure why I couldn't find it!
  • [not to correct, but in case someone references later], I believe the last command (while implied) is technically, sysctl -a kstat.zfs.misc.arcstats.size. (again, not to correct you).
THANKS AGAIN! :):):)
 