Step by step how I messed up my zpool - can you help me fix it?

hungarianhc · Oct 4, 2018

TL;DR: My volume is alive. I can access the files via SSH. It is, however, in a degraded state, and I'm currently running "naked," two drives down in my Raid-Z2 pool. I need to fix this...

1) A week ago, I noticed that 5 drive Raid-Z2 pool had a disk throwing all sorts of errors. I ordered a new 4TB drive to replace it.
2) I shut down the system, replaced the drive, and rebooted. I did NOT offline the drive first. Yup. Mistake.
3) Now this is the part that is odd. It actually showed the new drive as part of the pool, but it said it was unavailable. I know this may sound implausible, and maybe I'm misrepresenting information, but we're past this point, and I can't replicate this now, due to following steps.
4) I thought this was odd. I shut the system down. Plugged the old, bad drive in. Rebooted.
5) Now when I went to the replace drive option in the UI, it showed the old dead drive as the one available as a replacement drive. Weird. My logic at this point was that something on the pool was messed up, I'd re-add the "bad" drive, and then re-go through the proper steps to replace the drive.
6) I used the "replace" option to bring the bad drive back into the pool. Had to use the "force" option as it obviously saw some traces of a ZFS pool already on the failing drive.
7) Okay so now it's reslivering on the failing drive. This will likely take a LONG time, if it ever even finishes successfully.
8) Meanwhile, I do a "sanity check" (which, I know, will sound more like insane behavior, and I agree in hindsight) and remove the drive that is reslivering. My logic is that if I reboot the machine, and it shows that the reslivering drive is now unavailable, I can reattach the new drive, initiate "replace" on it, and then I'll be good to go.
9) Here's another bonehead move. I pulled out the WRONG drive. SHIT. Yup. Reboot FreeNAS. The drive that I pulled out is now showing as unavailable in the pool.

Okay so here's where I'm currently at... I've got a dying drive that is in the process of re-slivering. It's at 0.38% right now. Not sure when it will finish or if it ever will. Meanwhile, a healthy drive is showing up as unavailable in the pool, even though it is plugged in. My new drive is not even plugged into the box. I know there were bonehead moves made all along the way here so go easy on me. I'd love to see how I can move forward from here.

Here is the current zpool status:

Code:

  pool: Storage
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
	continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Wed Oct  3 22:37:08 2018
	2.57T scanned at 202M/s, 42.4G issued at 3.26M/s, 10.8T total
	8.34G resilvered, 0.38% done, no estimated completion time
config:

	NAME												STATE	 READ WRITE CKSUM
	Storage											 DEGRADED	 0	 0	 0
	  raidz2-0										  DEGRADED	 0	 0	 0
		gptid/5729e8e1-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0
		gptid/5796718e-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0
		gptid/4e6e50a9-c797-11e8-91c8-e0d55ecf5abd.eli  ONLINE	   0	 0	 0
		11091397725427720989							UNAVAIL	  0	 0	 0  was /dev/gptid/58e962e4-b247-11e3-82da-d050990a6791.eli
		gptid/594a667f-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0

errors: No known data errors

So.... What do I do?

My gut is to stop the resliver, get rid of the dying drive, somehow reassociate the unavailable drive with the pool, and then add my replacement disk to replace the dying drive. Does that make sense? Thoughts? THANKS!!!!

Edit: I took a look in the UI, and the drive that is "unavailable" shows as a replacement drive. See below.

Screenshot 2018-10-04 at 12.09.42 PM.png

I didn't click "Replace Disk" but it seems to me that if I do that, it will probably give me an error and tell me that there is already a ZFS volume on it, and then it will give me the option to force it. If I force it, it will then re-join the pool, I assume. It will then start reslivering... Assuming nothing dies between now and when it finishes reslivering, then I'll be back to my state of a non-degraded pool with a dying drive, ada0p2 (currently reslivering). Then I can go and replace it with the extra drive. Thoughts on this plan?

diskdiddler · Oct 4, 2018

Wow :(

This is not a good situation at all. honestly, I'm not certain what I would do - except wait for someone very clued up to get in to this thread and provide experienced advice. One more wrong 'guess' and it all might be gone.

Don't be hard on yourself, I probably would've done almost exactly the same.

hungarianhc · Oct 4, 2018

diskdiddler said:
Wow :(
This is not a good situation at all. honestly, I'm not certain what I would do - except wait for someone very clued up to get in to this thread and provide experienced advice. One more wrong 'guess' and it all might be gone.
Don't be hard on yourself, I probably would've done almost exactly the same.

I'm now thinking that maybe I offline the bad drive. It's going VERY slowly. 0.6% after 10 hours or so. Then I replace the drive that's listed as unavailable (I'll have to click 'force' as it has zvol data on it now). My thinking is that reslivering will be much faster then... Maybe it finishes in 24 hours. Then I'm at least 24 hours away from being back at single disk redundancy. At that point, I try replacing the failing drive. If it's in the process of reslivering, will clicking the button to offline the drive work? In any case, I'll wait to hear another opinion or two before I do anything... but it seems like getting the healthy drive that's currently plugged in back and part of the pool would be the top priority.

diskdiddler · Oct 4, 2018

That is a very slow resliver, I must admit I'd ditch ittoo.

How many disks do you have redundant?

hungarianhc · Oct 4, 2018

diskdiddler said:
That is a very slow resliver, I must admit I'd ditch ittoo.

How many disks do you have redundant?

RAID-Z2 5 disk array. Right now the dying drive is reslivering, and the healthy drive is not attached. So I'm running 3 out of 5 right now. eek.

Stux · Oct 4, 2018

And your drives are encrypted too.

FWIW, if you are careful with your next steps I think you’ll pull through, but you want to make sure you are associating drives with serials before pulling anything.

How Many SATA ports have you got?

hungarianhc · Oct 4, 2018

Stux said:
And your drives are encrypted too.

FWIW, if you are careful with your next steps I think you’ll pull through, but you want to make sure you are associating drives with serials before pulling anything.

How Many SATA ports have you got?

I have plenty of SATA ports available. My board has support for 16 via mini-SAS adapter. I think I have two open ports right now, but that could be six if I ordered another cable.

I agree that I think I'll be okay if I'm super careful with next steps. I'm just curious about a few very specific things...

- The system is re-slivering right now at this very moment. However, if you look at my zpool status output, it doesn't say WHICH drive is re-slivering. Now... I'm 99.999% sure it's the failing drive, and I know the serial number of that drive, and I have it noted as failing in the description field in FreeNAS... but still... I'd love to get some visual confirmation so I can be 100% sure that I offline the right drive.
- Will it let me offline the drive while it's re-slivering?
- Is re-adding the healthy drive back into the pool while reslivering possible? Can I do that and then offline the dying drive? If so, I think that gets me exactly where I need to be.

So yeah... if anyone can help me out with those questions, I'm just not exactly in a place where I can just "try" things.

My last resort is to hook up some other type of network attached storage, rsync the data to it, blow away the zpool, and then rsync it back, but I think I can avoid that. Would love some help from the experts with the above questions.

hungarianhc · Oct 4, 2018

Another update. I bit the bullet.

- I offlined the bad drive. Luckily, I picked the right one

. I was sweating when I clicked the button.

- Okay... at this point, I'm running with zero redundancy. I made sure I could still browse the file system, create and save a file, etc. All good.

- Now I went to go replace the other drive, and this time I selected ada4, the healthy drive. I had to "force" the replacement, as it could detect a zvol already on ada4. Okay... was a bit nervous. Clicked. Worked!

- Now... it's reslivering... It's already WAY faster than the dying drive (duh). After ten minutes, it's at 1% complete. It took the other drive twelve hours to get to 1.5% haha.

So basically... my risk now is that one of the drives dies overnight. It's completely freaking plausible, as they're all about the same age, and now they're going to be working pretty hard. That being said, I have faith! I believe I should wake up in the morning to a single drive redundancy system... At that point, I believe I should do the following (please confirm):

Re-key the pool. I'm not sure why I need to do this, but it is in the recommended steps. I'll use the same key.
Save my config. Save my encryption key.
Shut the system down.
Attach the new drive.
Use the UI to replace the currently offlined drive with the new one.
Wait another 24 hours.
Breath

I think...

hungarianhc · Oct 5, 2018

Okay... Here's an update for anyone who is following along. The resliver finished this morning. Woot! Sweet single drive redundancy has never felt so good!

Code:

pool: Storage
state: DEGRADED
status: One or more devices has been taken offline by the administrator.
	Sufficient replicas exist for the pool to continue functioning in a
	degraded state.
action: Online the device using 'zpool online' or replace the device with
	'zpool replace'.
  scan: resilvered 2.16T in 0 days 09:53:25 with 0 errors on Fri Oct  5 07:24:55 2018
config:

	NAME												STATE	 READ WRITE CKSUM
	Storage											 DEGRADED	 0	 0	 0
	  raidz2-0										  DEGRADED	 0	 0	 0
		gptid/5729e8e1-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0
		gptid/5796718e-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0
		10124516482167301404							OFFLINE	  0	 0	 0  was /dev/gptid/4e6e50a9-c797-11e8-91c8-e0d55ecf5abd.eli
		gptid/57e546ad-c857-11e8-9a22-e0d55ecf5abd.eli  ONLINE	   0	 0	 0
		gptid/594a667f-b247-11e3-82da-d050990a6791.eli  ONLINE	   0	 0	 0

errors: No known data errors

Should I re-key the encrypted pool? I don't see any harm in doing that so I probably will. I'm going to use the same passphrase.

Other than that, tonight I'll save the config, export the key, shut down the system, and add the replacement drive back in. The dying drive is already offlined so it's obvious which one needs to be replaced. Any other advice?

Stux · Oct 5, 2018

Verify serial numbers. FreeNAS tells you the serials of utilizes drives (not sure if this applies with encrypted) but you can also use smartctl to read serials. Verify the serial on the mounted device you want to replace matches the physical device you are replacing.

hungarianhc · Oct 6, 2018

Hey everyone,

I thought I'd give an update as to where I'm at. The drive reslivered completely and quickly. Single drive redundancy - good to go. Then I tested my replacement drive. Turns out it was DOA. Wouldn't even power on. Tried in multiple machines. Ordered a new drive from Amazon yesterday. It came this morning. Popped it in. Reslivering right now... Going quickly. Already at 0.5% done. I should be back at a healthy RAIDZ-2 pool tonight.

The instructions say I need to re-key the pool. I'll do that, but TBH I don't fully understand the technical reason for that. I tripped over my own two feet a fair amount in this process, but I do think that there are some things that the FreeNAS UI could do to make things a bit more clear. I'll file some feature requests and write more later.

pro lamer · Oct 6, 2018

hungarianhc said:
The instructions say I need to re-key the pool. I'll do that, but TBH I don't fully understand the technical reason for that. I tripped over my own two feet a fair amount in this process, but I do think that there are some things that the FreeNAS UI could do to make things a bit more clear. I'll file some feature requests and write more later.

You can find some explanations here however I am not sure if they are "technical". Anyway I found them "clarifying".

Sent from my mobile phone

hungarianhc · Oct 6, 2018

Well, I'm alive! Dual drive redundancy. Good to go! ZFS rules!

diskdiddler · Oct 7, 2018

Great news to hear!

Important Announcement for the TrueNAS Community.

Step by step how I messed up my zpool - can you help me fix it?

hungarianhc

Patron

diskdiddler

Wizard

hungarianhc

Patron

diskdiddler

Wizard

hungarianhc

Patron

Stux

MVP

hungarianhc

Patron

hungarianhc

Patron

hungarianhc

Patron

Stux

MVP

hungarianhc

Patron

pro lamer

Guru

hungarianhc

Patron

diskdiddler

Wizard

Similar threads