I'm having a problem on one of my encrypted zpools and I'm trying to determine how to deal with it. I'll give step-by-step instructions for replacing a disk in an encrypted zpool in a VM so it's easily reproducible. (I used VirtualBox, but I doubt it matters since I'm seeing this on a production system.) Some of these steps are unnecessary, but I performed them for troubleshooting purposes. I don't have much experience with geli (none before FreeNAS introduced it, though I've been experimenting with it for a few weeks), and I'm trying to determine how to recover from what appears to be an unrecoverable situation for FreeNAS encrypted zpools.
Here's how to reproduce the issue in a VM:
1. Install FreeNAS 8.3.1-p2 (I installed the x64 version) to a virtual hard drive (I made my disk 2.5GB).
2. After installation, remove the CD from the virtual disk list and create 3 more virtual drives. I used 10GB drives. I put my drives on a SATA controller, but again this shouldn't matter.
3. Create an encrypted RAIDZ1 array with the 3 drives created in step 2. I also enabled the footer in the GUI so I can see if any errors are output.
4. In FreeNAS, I set up my keys in this order: created the passphrase, then downloaded the key and the recovery key. I created a 10GB temp file just to prove the files would be there. Not really necessary, but I go overboard when troubleshooting just to rule out other things.
5. Rebooted FreeNAS and used the key+passphrase to restore the zpool, which worked. I get an "An error occurred!" message in the GUI header, but the zpool mounts. Both a zpool status and a zpool scrub show nothing wrong, so I'll ignore the error since it seems to be in error itself. The footer shows 3 drives mounted with AES-XTS 128 encryption.
6. Rebooted FreeNAS again and used the recovery key (no password required) to restore the zpool. Again I get an "An error occurred!" message in the GUI header, but the zpool mounts. Both a zpool status and a zpool scrub show nothing wrong, so I'll ignore the error since it seems to be in error itself. The footer shows 3 drives mounted with AES-XTS 128 encryption.
7. Now to simulate a "failed drive". All of the steps I perform are based on the manual, section 6.3.11 - Replacing a Failed Drive or ZIL Device. After all, we senior guys are constantly telling newbies to use this section for replacing a drive because it works. At this point ada0 is my boot drive and ada1 through ada3 are my RAIDZ1 member disks.
8. Next I mapped the /dev/adaX devices to their gptids using gpart list. I did this in case the info comes in handy later. I took the last disk's partition in the zpool (ada3p2) and set it to "OFFLINE" status. GELI makes some comments about the drive being detached, and a zpool status shows that the gptid starting with 76f87c66 is now OFFLINE and the zpool is now DEGRADED. The ada3 device goes offline and a Replace button appears, just as expected.
9. Now I do a VM shutdown, remove the offlined disk (the last one, ada3) from the VM, and create a new VM disk (in that order). It is important that the disk you remove and the new disk you create are on the same SATA port in VirtualBox. If I had removed ada1 instead and then booted up and tried to mount my zpool, I'd get the GUI error "Error: Volume could not be imported: 1 devices failed to decrypt." because FreeNAS doesn't seem to be smart enough to recognize a drive whose device ID moves from ada3 to ada2. Additionally, if you fail to keep them on the same SATA port in VirtualBox and attempt to mount the zpool, something goes horribly wrong and you'll have an even harder time mounting it, because for some reason only 1 disk will mount even after you fix the devices (which a RAIDZ1 can't recover from). Personally, I've been unable to determine how to survive a situation where the disks change IDs, but that's not the problem I'm trying to identify. For now, just stick with the last disk and make sure you use the same SATA port. I took a snapshot here just to have one. (I'm really trying to nail down this issue; it's been bothering me for a week.)
10. After the new disk is created (and obviously is at least the same size as the old virtual disks), boot up FreeNAS. You should be able to mount the zpool with either the key+passphrase or the recovery key, but in a degraded state. This seems to be pretty normal and expected.
11. To continue with the disk replacement per 6.3.11 of the manual, I mount the encrypted zpool in its degraded state using the key+passphrase method. (I get the GUI "An error occurred!" message, but I'm ignoring it because I don't know if it's due to the same issue I saw in steps 5 and 6 or because a disk is missing.)
12. Next I click the "REPLACE" button. It asks for a passphrase and I enter the same passphrase I used for the rest of the zpool. After a few seconds I get "Disk replacement has been initiated," and resilvering completes seconds later.
13. In accordance with the manual, I then click the "Detach" button in the GUI to remove the old drive from the zpool permanently. Now a zpool status and a zpool scrub give the typical "no problems" output, so all is well. At this point I've completed all of the steps in section 6.3.11 of the manual and everything should be okay. Time to verify all is actually well. After all, you'd hate to find out later that things aren't well.
14. I reboot the FreeNAS machine and attempt to mount the zpool with key+passphrase. It works as expected. Again I get the "An error occurred!" message, but everything seems fine. My zpool is healthy.
15. I reboot the FreeNAS machine again and attempt to mount the zpool with the recovery key. Uh-oh. It's not working quite right. My zpool is "DEGRADED". Looking at the footer I have the errors:
Code:
[middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
[middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.
So not only is it trying to attach the old gptid (76f87c66), but it can't attach the new gptid (610edb60). Now things aren't looking so well. Big picture, I appear to have full redundancy with the key+passphrase but not with the recovery key. Obviously I need to fix this. If a second disk were to fail, my recovery key would be useless since it wouldn't be capable of remounting my zpool.
16. Rebooted the machine again, mounted the zpool with the key+passphrase method, and I have a new error now. I get the following in the footer:
Code:
freenas kernel: GEOM_ELI: Device gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI: Crypto: software
freenas kernel: GEOM_ELI: Device gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI: Crypto: software
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas kernel: GEOM_ELI: Device gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.eli created.
freenas kernel: GEOM_ELI: Encryption: AES-XTS 128
freenas kernel: GEOM_ELI: Crypto: software
So now FreeNAS is trying to attach my old failed drive every time. Not exactly the end of the world (definitely something that needs to be fixed, though). The zpool does mount and is healthy. Since one disk failed to mount the last time I mounted the zpool, I chose to do a zpool scrub and wait for it to finish.
17. So, time to fix that recovery key. I click the "Add Recovery Key" button. I get the warning that it will invalidate any previous recovery key ("so what" for this situation) and click "Continue". I get a GUI error in the header that says:
Code:
Error: Unable to set recovery key: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
In the footer I get:
Code:
freenas notifier: 1+0 records in
freenas notifier: 1+0 records out
freenas notifier: 64 bytes transferred in 0.000048 secs (1335500 bytes/sec)
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Unable to set recovery key: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory. ]
Hmm. Not the most ideal situation. Judging by the "64 bytes transferred" lines, it looks like FreeNAS may have tried copying a key from somewhere to somewhere. Maybe it fixed my replaced disk and my recovery key will work again... But I wasn't given the option to download a new recovery key (remember, you get that warning that all previous recovery keys will be invalid). This may not be good news at all.
18. Rebooted FreeNAS and attempted to mount the zpool using the recovery key I do have. Well, my recovery key doesn't work at all.
I get this error in the header:
Code:
Error: Volume could not be imported: 4 devices failed to decrypt.
And in the footer I get these entries:
Code:
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Volume could not be imported: 4 devices failed to decrypt]
So to recap: after a disk replacement and an attempt to recreate the recovery key, I'm left with a system where the recovery key doesn't work at all. The key+passphrase still attaches the 3 installed member drives of the zpool, but it throws an error because it's also trying to attach the disk that was replaced.
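For anyone comparing footer output between attempts, here is a rough Python sketch that sorts the "Failed to geli attach" lines into providers that no longer exist versus providers that rejected the key. The function name `classify_failures` and the regular expression are hypothetical, written only against the log lines quoted in step 18; this is not part of FreeNAS and the message format may differ in other versions.

```python
import re

# Footer lines copied from step 18 above (recovery key attempt).
FOOTER_LOG = """\
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/7679a9bb-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd: geli: Wrong key for gptid/76b64f5c-c0cc-11e2-8195-080027d1d3cd.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: geli: Cannot open gptid/76f87c66-c0cc-11e2-8195-080027d1d3cd: No such file or directory.
freenas manage.py: [middleware.notifier:1200] Failed to geli attach gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd: geli: Wrong key for gptid/610edb60-c0d0-11e2-b27e-080027d1d3cd.
freenas manage.py: [middleware.exceptions:38] [MiddlewareError: Volume could not be imported: 4 devices failed to decrypt]
"""

# Assumed message shape, inferred only from the lines above:
#   "Failed to geli attach <provider>: geli: <reason>"
FAIL_RE = re.compile(r"Failed to geli attach (gptid/[0-9a-f-]+): geli: (.+)$")


def classify_failures(log_text):
    """Group failed providers by the reason geli reported."""
    result = {"missing": [], "wrong_key": [], "other": []}
    for line in log_text.splitlines():
        match = FAIL_RE.search(line)
        if match is None:
            continue  # not an attach failure (e.g. the MiddlewareError summary)
        provider, reason = match.groups()
        if reason.startswith("Cannot open"):
            result["missing"].append(provider)    # device node no longer exists
        elif reason.startswith("Wrong key"):
            result["wrong_key"].append(provider)  # device exists, key rejected
        else:
            result["other"].append(provider)
    return result


if __name__ == "__main__":
    for kind, providers in classify_failures(FOOTER_LOG).items():
        print(kind, providers)
```

Run against the step 18 output, this puts the old 76f87c66 gptid under "missing" and the three real member disks under "wrong_key", which matches the recap: the stale entry is one problem and the invalidated recovery key is a separate one.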
So does anyone have a clue what I did wrong, or if I even did something wrong? Should the manual be updated to reflect how to deal with encryption?
Is this a bug?
Is there a way to recover from this using the CLI and/or editing the config file?
Any other ideas how to achieve full redundancy with both the key+passphrase and recovery key method without recreating the zpool and recovering from backups?
Note: I documented this issue in support ticket 2178 and have had no response, which is why I'm asking the forum. I don't think there's any reason to read 2178, since this post walks step by step through what I did on that server, just in a VM here. The only real difference is that on that server I always used the recovery key (laziness). On that server the problem found me: after the resilvering completed and I rebooted and used the recovery key again, I just ended up with a DEGRADED zpool. I rebooted and resilvered twice before stopping and starting to question what was wrong. And naturally (thanks to Murphy's Law) I replaced one failed disk last weekend and another disk in the zpool is racking up SMART errors like nobody's business, so I'm about to have a recovery key with no redundancy. I'd destroy and recreate the zpool from backup, but now I'm questioning how "trustworthy" the encryption is with regard to disk replacement/recovery. This is a pretty big deal for people who haven't realized that their recovery key isn't 100% after a failed disk, and I don't see any easy way to recover from this with the knowledge I have (and a lot of people using FreeNAS have even less than I do).