encrypted zpool gone after wrong disk replacement

Status
Not open for further replies.

aLCHEMY

Cadet
Joined
Jan 24, 2016
Messages
3
Hi,

I am running FreeNAS-9.3-STABLE-201512121950 on an AsRock motherboard with 16 GB ECC RAM and Intel(R) Atom(TM) CPU C2550 @ 2.40GHz. I had an encrypted RAID-10 zpool consisting of 4 SSDs (Samsung 256 GB). However, the disks were giving problems (one of them got the pool degraded, others gave CAM errors) so I decided to replace them one by one with traditional HDDs (WD 2TB). I did this by first matching the geli GUID with the serial number of the disk (using
Code:
glabel status
), taking one of the disks offline, shutting down the server, checking which serial number matched the geli GUID, and, once I had found the correct disk, replacing it with a fresh HDD. After bringing the server back up, I unlocked the zpool, then replaced the disk on the CLI (using
Code:
zpool replace <zpool name> <device name>
).
After this, the zpool resilvered automatically and worked fine. Over a period of a few days I did this four times, which involved lots of reboots and shutdowns of the server, without any problem. I did notice that
Code:
zpool status
no longer specified geli devices, but /dev/ada0 to /dev/ada3. But I did not realize I had made a big mistake by not following the correct procedure for disk replacement of encrypted zpools; I didn't know that there was a special procedure for this.
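A symptom like this can be checked mechanically. The following is only a minimal sketch, assuming the pool is imported and that encrypted members show up in `zpool status` with names ending in `.eli`; `non_eli_members` is a hypothetical helper name, not a FreeNAS tool. It reads `zpool status` text on stdin and prints every member that is not a geli device.

```shell
# Sketch: print pool members that are NOT geli providers (no ".eli" suffix).
# Usage: zpool status <pool> | non_eli_members <pool>
# non_eli_members is a hypothetical helper, not part of FreeNAS or ZFS.
non_eli_members() {
    awk -v pool="$1" '
        # Device rows sit between the "NAME STATE ..." header and the
        # blank line that ends the config section.
        $1 == "NAME" && $2 == "STATE" { in_cfg = 1; next }
        in_cfg && NF == 0             { in_cfg = 0 }
        # Skip the pool row, vdev grouping rows, and .eli members.
        in_cfg && $1 != pool &&
            $1 !~ /^(mirror|raidz|spare|logs|cache)/ &&
            $1 !~ /\.eli$/            { print $1 }
    '
}
```

For the pool in this thread that would be `zpool status ssds | non_eli_members ssds`. Any output on a pool that is supposed to be encrypted suggests that the listed members were attached as raw devices rather than through geli.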
However, all was fine until I decided to apply the latest updates. After the reboot, the GUI's 'Storage' tab showed "0 (error)" under 'used' and "Error getting available space" under 'Available' for my pool. The status is LOCKED. I tried to detach the volume and to import and export it in the CLI, but it no longer appears in the GUI. When I try to import it in the GUI, it cannot find any encrypted disks, so it blocks on step 2.
Code:
select * from storage_encrypteddisk;
in sqlite3 returns nothing.

Since the zpool survived lots of shutdowns before the upgrade, I tried booting into the previous FreeNAS version, but that didn't resolve the situation.

My data is not accessible now. Is there any way I can fix this issue (that I have caused myself, I know...)?

I have included the debug.log (my NAS is called "nas3" and the zpool is called "ssds"). The HDDs are recognized in FreeNAS; see CAMCONTROL output.txt.

Thanks a lot,
Edwin.
 

Attachments

  • debug-nas3-20160124102718..tgz
    134.7 KB · Views: 186
  • CAMCONTROL output.txt
    443 bytes · Views: 259

Mr_N

Patron
Joined
Aug 31, 2013
Messages
289
Given the doc says "If the following additional steps are not performed before the next reboot, you may lose access to the pool permanently."

I'd say you're probably not getting it back...
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
aLCHEMY said: "I unlocked the zpool, then replaced the disk on the CLI"
Also, why would you use the CLI to manipulate your appliance for functionalities that are built into the GUI? That, too, is an excellent way to wreck your FreeNAS.
 

aLCHEMY

Cadet
Joined
Jan 24, 2016
Messages
3
Anyhow, I rolled back to "FreeNAS-9.3-STABLE-201510290351" and SURPRISE: it recognized my zpool without any issues! I can unlock it and access my data. So the question is: why do I get this error with the two latest FreeNAS upgrades? The problem occurred with v201601181840, and rolling back to v201512121950 did not solve it.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Rolling back restored the geli key and the config file. More than likely your choosing to do a disk replacement "not in accordance with procedure" is why things are going awry. I'd recommend you back up the zpool, destroy the zpool and recreate it, then do an upgrade (which will likely succeed).

I'm running with an encrypted zpool and I was able to successfully upgrade from the Dec 12th to the Jan 18th build without problems.

This is almost entirely self-inflicted (which you seem to recognize) but as for the cause I don't know. I could review the code but the short and skinny is "do what the manual says because the middleware expects you to do things in accordance with the manual". Deviating from the manual leads to hate and discontent (as you've seen firsthand).

Not to sound like a total jerk, but it's really not worth the effort to figure out what went wrong and fix it, since you are seriously deviating from the disk replacement procedure FreeNAS expects you to use... and us experts know that ends badly because the code expects to control things and for things to be 'a certain way'.

Consider yourself lucky you had a BE to roll back to. You'd likely have lost access to your data permanently if it hadn't been for the BE. ;)
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
I don't necessarily entirely agree, Cyberjock.

I think it would be intellectually enlightening, and potentially relevant in other contexts, to understand the behavior this user experienced, notwithstanding the fact that he seriously deviated from the standard procedure. Under what conditions can a guy with a hosed, encrypted pool roll back his BE and get access again? Here, everyone is saying "your pool is hosed", "you're screwed", "you're a moron for deviating from the manual"; as it happened, we could have saved the user considerable consternation had we said: "You're a moron for doing this shit in the CLI, but you should be able to get back your geli key and pool by reverting to a previous boot environment". I think it's on us to have known that, and to have suggested it to the user, even if he was ignoring the procedures.
 