This is more of an informational / bug-report / reference / resolution post (for me at least).
Recently upgraded to TrueNAS 12 (started on FreeNAS 9.x, went to 10, rolled back to 9, upgraded through 11.x).
Have 2 GELI pools (with passphrase) running RAIDZ2 (Supermicro server HW, lots of ECC RAM, SAS/SATA non-RAID controller, etc...).
Due to drive age, recently started getting hardware faults on the 3TB disks; resilvered, and even with 2 failing drives, no issues.
Had drive errors on the 4TB pool (2 equal RAIDZ2 VDEVs); added a hot spare. There was an issue somewhere between adding the hot spare and having a healthy pool (the web UI faulted somewhere), but through some CLI work it eventually "seemed" happy.
Downloaded the GELI key and recovery key when prompted; did not notice the "recovery" key was 0 bytes! :(
Removed the faulted 4TB disk; everything continued fine.
At some point, upgraded from 12.0-U2.1 to 12.0-U3.
On boot, the web UI attempted to unlock the pools and failed/aborted/detached them.
Trying from SSH, I saw that fewer than half of the disks would attach; this is when I noticed the 0-byte "recovery" key.
<INSERT PANIC>
Found multiple posts that basically said "you backed it up, so don't worry.. right?"
Found this (old) post https://www.truenas.com/community/threads/accidentally-marked-zfs-drives-as-unconfigured-good.69912/
Some "key" pieces of data there are about using "dd" to read the "last block" of the partition to check for the GELI header.
Eventually (10 days or so in, now), I discovered more details from playing with this command, so I will attempt to document them in hopes that they may help someone else (or maybe a dev can figure out if there is a bug?).
Code:
freenas# geli dump /dev/da3
Cannot read metadata from /dev/da3: Invalid argument.
geli: Not fully done.
freenas# gpart list /dev/da3 | grep Mediasize
   Mediasize: 2147483648 (2.0G)
   Mediasize: 3998639460352 (3.6T)
   Mediasize: 4000787030016 (3.6T)

# The following is copied / modified from the above post/comment by @Ibes
# The number of bytes per block
BLOCK_SIZE=512
# The number of blocks on the media
BLOCK_COUNT=$(( 3998639460352 / ${BLOCK_SIZE} ))
# The number of blocks to skip (count - 1)
SKIP=$(( ${BLOCK_COUNT} - 1 ))
# dump the entire last block (512 bytes), then pipe it into od to see if "GEOM::ELI" is present
dd if=/dev/da3p2 bs=${BLOCK_SIZE} skip=${SKIP} count=1 | od -a
0000000  nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul nul
*
0001000
This was true for all of the disks that would not geli attach!
Eventually expanded out the dd and found a GELI header!
Eventually found that the GELI header for these partitions was located 4096 bytes from the end instead of 512 bytes from the end!?
On the first disk, I wrote out the 512 bytes (starting 4096 bytes from the end) to a file, and then re-wrote them to the last 512 bytes of the partition (when reading with dd, use "skip="; when writing, use "seek=").
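That copy can be sketched as below. To keep it safe to run, this version operates on a scratch image file instead of the real partition; the file path, the tiny size, and the fake header are stand-ins for illustration. On a real member, the skip/seek block numbers would come from the partition's true Mediasize.

```shell
# Demonstrate the skip/seek copy on a tiny scratch "partition" (safe to run).
# On a real disk you would use the partition device and its Mediasize instead.
BS=512
IMG=/tmp/fake_part.img
PSIZE=$(( BS * 16 ))     # stand-in for the real size, e.g. 3998639460352

# Build the scratch image: all zeros, with a fake header 4096 bytes from the end
dd if=/dev/zero of=$IMG bs=$BS count=16 2>/dev/null
printf 'GEOM::ELI' | dd of=$IMG bs=1 seek=$(( PSIZE - 4096 )) conv=notrunc 2>/dev/null

# The fix itself: read the block 4096 bytes from the end (skip=) and write it
# into the last block (seek=), keeping a copy of the sector as a backup first
SKIP=$(( (PSIZE - 4096) / BS ))
SEEK=$(( (PSIZE - 512) / BS ))
dd if=$IMG bs=$BS skip=$SKIP count=1 2>/dev/null > /tmp/saved_header.bin
dd if=/tmp/saved_header.bin of=$IMG bs=$BS seek=$SEEK count=1 conv=notrunc 2>/dev/null

# Verify: the last sector now starts with the magic string
dd if=$IMG bs=$BS skip=$SEEK count=1 2>/dev/null | head -c 9   # prints GEOM::ELI
```

The `conv=notrunc` on the writes matters: without it, dd may truncate the output after the seek point.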
Once I did that, doing the dd/od now showed the correct "GEOM::ELI" string!
Code:
0000000 G E O M : : E L I nul nul nul nul nul nul nul
Using geli dump even showed the geli metadata!
However, when I attempted to attach it, I got:
Code:
freenas# geli dump /dev/da3p2
Metadata on /dev/da3p2:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 256
  provsize: 3998639456256
sectorsize: 4096
      keys: 0x03
iterations: 123456
      Salt: saltysaltysaltysalty
Master Key: 0keyd0key0keyd0key
  MD5 hash: 5a5a5a5a5a5a5a5a5a5
freenas# geli attach /dev/da3p2
geli: Provider size mismatch.
geli: There was an error with at least one provider.
Comparing it to a working geli device, I saw that the working one had "provsize: 3998639460352" instead of "3998639456256".
Doing some math showed: 3998639460352 - 3998639456256 = 4096
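As a quick shell calculation (the two sizes are copied from the geli dump outputs; note the difference is exactly one of the provider's 4K sectors):

```shell
# Difference between the healthy provsize and the stale one in the header
GOOD=3998639460352    # provsize on a working member
BAD=3998639456256     # provsize in the recovered header
echo $(( GOOD - BAD ))            # 4096
echo $(( (GOOD - BAD) / 4096 ))   # 1 -- exactly one sector at sectorsize=4096
```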
So to fix it, I had to do a geli resize, which reads the GELI header from the "old" offset and writes it to the "new" last sector:
Code:
freenas# geli resize -v -s 3998639456256 /dev/da3p2
Done.
freenas# geli dump /dev/da3p2
Metadata on /dev/da3p2:
     magic: GEOM::ELI
   version: 7
     flags: 0x0
     ealgo: AES-XTS
    keylen: 256
  provsize: 3998639460352
sectorsize: 4096
      keys: 0x03
iterations: 123456
      Salt: saltysaltysaltysalty
Master Key: 0keyd0key0keyd0key
  MD5 hash: 5a5a5a5a5a5a5a5a5a5
freenas# geli attach -C -k /data/geli/0123456-0123-0123-0123-01234567890.key /dev/da3p2
Enter passphrase:
freenas#
SUCCESS!!!!!
So I just had to repeat this for the remaining drives, which all had a "blank" final sector. For those I did not need dd at all, just geli resize using the discovered "old" size.
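That repeat can be sketched as a small loop. The device names below are hypothetical placeholders, the size is the stale provsize discovered above, and the commands are only echoed here as a dry run (remove the echo to actually execute against your members):

```shell
# Dry-run sketch: apply the same resize fix to the remaining members.
# da4p2/da5p2/da6p2 are placeholder device names; substitute your own.
OLDSIZE=3998639456256   # the stale provsize reported by geli dump
for d in da4p2 da5p2 da6p2; do
    echo geli resize -v -s $OLDSIZE /dev/$d
done
```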
So, in summary: not sure how or why, but somehow the geli size or header changed or moved during the resilver / hot-spare expansion / upgrade...
No clue which step "caused" the issue, but hopefully this may help someone else recover.
Also things I discovered:
1) Each disk has its own "master encryption key" that is stored in this 512-byte GELI header, which is *only* stored in the last sector of the partition.
2) This per-disk master key is encrypted with the "key" and "recovery key" before actually being written.
3) The key and recovery key you download are used to decrypt each disk's master key to "create" (attach) the geli/eli device.
4) This master key controls the actual encryption and never changes; it is only re-encrypted when the key/recovery key are rekeyed.
5) You can back up this encrypted master key with the geli backup command; this must be done for each disk.
6) Losing this GELI header/sector renders the rest of the disk useless.
7) You can backup/dump this info after successfully attaching the geli/eli devices. This is not done by FreeNAS/TrueNAS or as part of its backup, though there have previously been a few requests to include it...
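Building on point 5, a per-disk metadata backup could look like the sketch below. The device list and backup directory are hypothetical, and the geli commands are echoed as a dry run since backup/restore must run against real attached providers:

```shell
# Dry-run sketch: back up each member's GELI metadata sector (point 5 above).
# Device names and the destination directory are placeholders.
BACKUPDIR=/tmp/geli-backups
mkdir -p "$BACKUPDIR"
for d in da1p2 da2p2 da3p2; do
    echo geli backup /dev/$d "$BACKUPDIR/$d.eli"
done
# A lost header could later be put back with:
#   geli restore /tmp/geli-backups/da3p2.eli /dev/da3p2
```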
Hopefully these breadcrumbs help someone else at some point... I don't know much more about the geli/eli stuff beyond this and other threads, so sorry, I probably can't help you with your specific situation. I am also moving to ZFS encryption (which doesn't seem to require these silly key/recovery keys in the same way geli/eli did/does).