Has it all gone sideways? No Gui? Encrypted Pools resilvering. I dare not reboot

Status
Not open for further replies.

afinman

Cadet
Joined
Mar 10, 2018
Messages
6
Help!

I have a FreeNAS 11 install with five 6-drive RAIDZ2 volumes.

This has been working fine for the past 100ish days.

I have two arrays of 6x 6TB drives and three arrays of 6x 3TB drives, and I wanted to upgrade the 3TB arrays to 6TB drives.
All are encrypted with GELI through the FreeNAS interface. I have the GELI key copied...

I removed one 3TB drive from each array and replaced it with a 6TB drive. All was fine. I used the GUI to replace the drives and start the resilvering process.

I added a 960GB SSD and tried to make a mirror of the two 960GB SSDs in the server, and the GUI stopped responding.

From the CLI, zpool status shows the following:
Code:
  pool: Five
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar  8 14:44:00 2018
		5.34T scanned out of 13.0T at 33.1M/s, 67h27m to go
		909G resilvered, 41.04% done
config:

		NAME												STATE	 READ WRITE CKSUM
		Five												ONLINE	   0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/6c2d53d2-c4b1-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/74a47955-c4b1-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/9493701f-c896-11e7-9711-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/10233830-22df-11e8-9057-ac1f6b09a344.eli  ONLINE	   0	 0	 0  (resilvering)
			gptid/92991055-c4b1-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/a7e5d6e6-c4b1-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0

errors: No known data errors

  pool: Four
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar  8 13:58:31 2018
		6.04T scanned out of 14.5T at 36.9M/s, 66h49m to go
		1.01T resilvered, 41.66% done
config:

		NAME												STATE	 READ WRITE CKSUM
		Four												ONLINE	   0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/eb7b35c5-c4af-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/07c08df6-c4b0-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/0b5912c5-c4b0-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/0cb21854-c4b0-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/b9adaa3b-22d8-11e8-9057-ac1f6b09a344.eli  ONLINE	   0	 0	 0  (resilvering)
			gptid/30840d26-c4b0-11e7-8089-ac1f6b09a344.eli  ONLINE	   0	 0	 0

errors: No known data errors

  pool: One
 state: ONLINE
  scan: scrub repaired 0 in 155h51m with 0 errors on Sat Feb  3 11:59:38 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		One												 ONLINE	   0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/87350245-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/8b78c87f-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/8fc2a163-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/90c95865-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/94fa9675-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/95f44bbc-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0

errors: No known data errors

  pool: Three
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
		continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Mar  8 14:29:51 2018
		6.26T scanned out of 11.8T at 38.6M/s, 41h37m to go
		1.04T resilvered, 53.14% done
config:

		NAME												STATE	 READ WRITE CKSUM
		Three											   ONLINE	   0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/cafa8c71-bf28-11e7-8a56-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/cfe225a3-bf28-11e7-8a56-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/d2e9c046-bf28-11e7-8a56-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/d509e639-bf28-11e7-8a56-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/d72d1959-bf28-11e7-8a56-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/1316f79c-22dd-11e8-9057-ac1f6b09a344.eli  ONLINE	   0	 0	 0  (resilvering)

errors: No known data errors

  pool: Two
 state: ONLINE
  scan: scrub repaired 0 in 55h15m with 0 errors on Tue Feb  6 07:15:39 2018
config:

		NAME												STATE	 READ WRITE CKSUM
		Two												 ONLINE	   0	 0	 0
		  raidz2-0										  ONLINE	   0	 0	 0
			gptid/b96de995-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/ba7dd585-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/bb925f53-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/bc8f71eb-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/bd8d5e46-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0
			gptid/be957b5d-b0cc-11e7-9b15-ac1f6b09a344.eli  ONLINE	   0	 0	 0

errors: No known data errors

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0h0m with 0 errors on Tue Mar  6 03:45:54 2018
config:

		NAME		STATE	 READ WRITE CKSUM
		freenas-boot  ONLINE	   0	 0	 0
		  da31p2	ONLINE	   0	 0	 0

errors: No known data errors


Which looks ok to me.
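
For keeping an eye on the resilver without the GUI, here is a minimal Python sketch that pulls the progress figures out of saved `zpool status` text. It assumes the "scanned out of ... at ..." line format shown above and treats the K/M/G/T suffixes as binary units; the function name `resilver_progress` and the `sample` text are only illustrative, and the time-left figure is a rough recomputation, not what ZFS itself reports.

```python
import re

# Matches e.g. "5.34T scanned out of 13.0T at 33.1M/s, 67h27m to go"
SCAN_RE = re.compile(
    r"([\d.]+)([KMGT]) scanned out of ([\d.]+)([KMGT]) at ([\d.]+)([KMGT])/s"
)
# Assumption: the suffixes are binary multiples (KiB, MiB, GiB, TiB).
UNIT = {"K": 2**10, "M": 2**20, "G": 2**30, "T": 2**40}


def resilver_progress(status_text):
    """Return {pool: (fraction_scanned, hours_left)} for pools with a scan line."""
    result = {}
    pool = None
    for line in status_text.splitlines():
        line = line.strip()
        if line.startswith("pool:"):
            pool = line.split(":", 1)[1].strip()
            continue
        match = SCAN_RE.search(line)
        if match and pool is not None:
            scanned = float(match.group(1)) * UNIT[match.group(2)]
            total = float(match.group(3)) * UNIT[match.group(4)]
            rate = float(match.group(5)) * UNIT[match.group(6)]
            hours_left = (total - scanned) / rate / 3600
            result[pool] = (scanned / total, hours_left)
    return result


# Trimmed copy of two pools from the output above.
sample = """\
  pool: Five
 state: ONLINE
  scan: resilver in progress since Thu Mar  8 14:44:00 2018
        5.34T scanned out of 13.0T at 33.1M/s, 67h27m to go
        909G resilvered, 41.04% done
  pool: One
 state: ONLINE
  scan: scrub repaired 0 in 155h51m with 0 errors on Sat Feb  3 11:59:38 2018
"""

for name, (fraction, hours) in resilver_progress(sample).items():
    print(f"{name}: {fraction:.1%} scanned, roughly {hours:.1f}h left")
```

Pools that are not resilvering (like One) have no matching line and are simply left out of the result.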

I have just read this though: http://doc.freenas.org/11/storage.html#replacing-an-encrypted-drive which advises that I need to do some special steps otherwise death will result on reboot. I can't do these steps though as the GUI is toast...

I have tried restarting nginx with service nginx restart, and also django as per some guidance, but the situation hasn't improved.

The console seems to be unhappy about network configuration even though this hasn't changed.

Please advise, as I need to get this fixed ASAP.

I'm hoping there are commands to do the rekey, or that somebody can advise how to get the GUI started again. I can't even find an error message; I don't know where to look...

HELP!
 

afinman

Cadet
Joined
Mar 10, 2018
Messages
6
I also have a lot of these in the daily kernel log email:

Code:
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6 
> uiomove_object: vm_obj 0xfffff8042de8bb58 idx 2 valid 0 pager error 4 
> swap_pager: I/O error - pagein failed; blkno 14156343,size 4096, error 6 
> swap_pager: I/O error - pagein failed; blkno 14155981,size 4096, error 6


And this in another email:
Code:
Traceback (most recent call last): 
File "/usr/local/bin/midclt", line 10, in <module> 
sys.exit(main()) 
File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 325, in main 
with Client(uri=args.uri) as c: 
File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 114, in __init__ 
self._ws.connect() 
File "/usr/local/lib/python3.6/site-packages/middlewared/client/client.py", line 51, in connect 
rv = super(WSClient, self).connect() 
File "/usr/local/lib/python3.6/site-packages/ws4py/client/__init__.py", line 216, in connect 
bytes = self.sock.recv(128) 
socket.timeout: timed out
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
You lost the GUI because the swap was somehow messed up during your resilver. At least I think that is what happened. As long as you can get in using the shell, just let it run its course.
 

afinman

Cadet
Joined
Mar 10, 2018
Messages
6
Thanks, I'll see what happens, but I'm not hopeful that the GUI will come back when the resilvering finishes.
I would need to figure out how to re-key from the GELI command line without FreeNAS getting upset by that...
I can't find any guidance.
Tips welcome.
 

afinman

Cadet
Joined
Mar 10, 2018
Messages
6
Hi
There is no "shell" though... I can SSH in, but that is just the normal stuff. Have they deprecated the FreeNAS CLI?
If I can't start the GUI, how will I re-key the rebuilt volumes?
I think I understand what has happened now...
Each drive has a swap partition on it, which I hadn't appreciated...
When I removed the drives, I basically took some of the swap out of circulation without telling the system, and this has upset it...
swapctl -l now looks like this:
Code:
Device:	   1024-blocks	 Used:
/dev/da7p1.eli   2097152	  1992
/dev/da8p1.eli   2097152	  1812
/dev/da9p1.eli   2097152	  1808
/dev/da10p1.eli   2097152	  1836
/dev/da11p1.eli   2097152	  2040
/dev/da12p1.eli   2097152	  1876
/dev/da13p1.eli   2097152	  1788
/dev/da14p1.eli   2097152	  1692
/dev/da15p1.eli   2097152	  1424
/dev/da16p1.eli   2097152	  1660
/dev/da17p1.eli   2097152	  1600
/dev/da18p1.eli   2097152	  1780
/dev/da24p1.eli   2097152	  1532
/dev/da23p1.eli   2097152	  1740
/dev/da22p1.eli   2097152	  1540
/dev/da21p1.eli   2097152	  1740
/dev/da20p1.eli   2097152	  1932
/dev/gptid/1316f79c-22dd-11e8-9057-ac1f6b09a344.eli   2097152	  1872
/dev/da27p1.eli   2097152	  2064
/dev/da28p1.eli   2097152	  1804
/dev/da29p1.eli   2097152	  1512
/dev/da30p1.eli   2097152	  1868
/dev/gptid/b9adaa3b-22d8-11e8-9057-ac1f6b09a344.eli   2097152	  1872
/dev/da26p1.eli   2097152	  1780
/dev/da0p1.eli   2097152	  1880
/dev/da1p1.eli   2097152	  1896
/dev/da2p1.eli   2097152	  1904
/dev/enc@n5003048017c7df3d/type@0/slot@4/elmdesc@Slot03/da3p2   2097152	  1756
/dev/da4p1.eli   2097152	  2052
/dev/da5p1.eli   2097152	  1752
/dev/da25p1.eli   2097152		 0
/dev/da19p1.eli   2097152		 0
/dev/da3p1.eli   2097152		 0

I'm hoping (worst case) I can just add the three drives I removed into spare bays and the system will see them again and pair them up as appropriate?! then the swap won't fail reading as it will be able to read from the drives.
This is a long shot, but...

The only other thing I'm thinking is that this all worked fine until I inserted a 960GB SSD into one of the slots and tried to create a new mirror out of it, so I don't know if that is somehow to blame too...
Currently waiting out the resilvering...
Help needed to get the rekey going. Otherwise, do I just remove the drives I've resilvered, and will it be OK on reboot since only the new drives aren't encrypted properly???
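
To pick the odd entries out of a long swapctl -l listing like the one above, here is a minimal Python sketch. It assumes the three-column "device, 1024-blocks, used" layout shown, and the helper names `parse_swapctl` and `summarize` are made up for illustration. What a zero "Used" value or a missing .eli suffix actually means for a given device is a guess that needs checking by hand.

```python
def parse_swapctl(text):
    """Parse `swapctl -l` text into a list of (device, blocks, used) tuples."""
    entries = []
    for line in text.splitlines():
        parts = line.split()
        # Skip the header and anything that is not "device size used".
        if len(parts) != 3 or not parts[0].startswith("/dev/"):
            continue
        entries.append((parts[0], int(parts[1]), int(parts[2])))
    return entries


def summarize(entries):
    """Return (devices with zero usage, devices without a .eli suffix)."""
    unused = [dev for dev, _, used in entries if used == 0]
    no_eli = [dev for dev, _, _ in entries if not dev.endswith(".eli")]
    return unused, no_eli


# A few lines copied from the listing above.
sample = """\
Device:       1024-blocks     Used:
/dev/da7p1.eli   2097152      1992
/dev/enc@n5003048017c7df3d/type@0/slot@4/elmdesc@Slot03/da3p2   2097152      1756
/dev/da25p1.eli   2097152         0
/dev/da19p1.eli   2097152         0
/dev/da3p1.eli   2097152         0
"""

unused, no_eli = summarize(parse_swapctl(sample))
print("zero usage:", unused)
print("no .eli suffix:", no_eli)
```

On the full listing this singles out the three swap devices with nothing paged to them and the one entry that shows up under an enclosure path instead of a .eli name.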
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
I'm hoping (worst case) I can just add the three drives I removed into spare bays and the system will see them again and pair them up as appropriate?! then the swap won't fail reading as it will be able to read from the drives.
This is a long shot, but...
Yup, odds are that you are going to lose all your data if you just monkey around with the file system. WAIT for someone who has experience with this kind of issue to respond. I wish I could help you, but I'd just be guessing and there is no need to put your data at risk.

Question: Is your data backed up? If yes AND no one can help you out of this problem, I'd recommend you destroy the pool and rebuild it from scratch. Do you need an encrypted pool? If not, then don't use one; it only adds further complications when things go wrong.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504

afinman

Cadet
Joined
Mar 10, 2018
Messages
6
Update:
The arrays rebuilt, but while I was waiting I started prodding around again in the terminal. I found the uwsgi(?) log with some errors in it, plus more in the nginx error log.
These pointed to the /usr/local/etc/rc.d/django script being toast (input/output error),
so I copied one from a freshly minted FreeNAS install. That sort of worked, but produced a new error in the uwsgi log, which basically said that /etc/shells was also toast (judging by line 1012 of one of the freenasUI Python files, anyway). Again, I recreated the file, re-ran, and the UI came back.
Which was nice.
Then I tried a rekey and got a nasty error, which led me to manually run a sync command I found, in guidance relating to FreeNAS 8, that syncs the UI with the encryption status. Anyway, my rekeys appear to have worked and I have now downloaded all keys for all volumes, something which isn't exactly clear from the interface!!!

Worst case, I lose 3 of the 5 pools on reboot, but I have backed up some of the data already and will look at backing up the rest before I restart, at which point presumably the thing doesn't boot...!!!

Anyway. Wow!
Not sure why on earth the system blindly makes swap space on every drive?! Seems dangerous to me!
Or why key files have holes in them.
The console moans that the network isn't configured, so I presume something else important has also been shot in the face!
Interestingly, I can't find the console messages in any of the logs, so god knows what that means!
 

garm

Wizard
Joined
Aug 19, 2017
Messages
1,556
Well, there are reasons for the recommendation to not use encryption unless it’s mandated.

Until we get native ZFS encryption, backups are going to be really important. It’s just too easy to lose a pool.
 