Kernel panic when replicating an encrypted pool

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
I'm running TrueNAS-13.0-U1.1 on a fairly recent installation (i7-4770S, 16GB RAM, 6x18Tb in RaidZ2)
I have a pool setup on that machine that is an encrypted replication of a remote encrypted pool (native ZFS encryption, aes-256-gcm, dedup off, recordsize 1M) found on a USB drive attached to my Linux machine.
Everything has been running smoothly and I was able to replicate using the zfs send raw mode over the network without any issue.
Recently I noticed that TrueNAS would reboot when I launch the replication job on my Linux machine:

Code:
zfs send -w -I MyDrive/PHOTOS@2022-07-07:21:02:46 MyDrive/PHOTOS@2022-08-15:01:22:19 | ssh truenas zfs receive truenaspool/MyDrive-PHOTOS


So I connected the USB drive directly to the TrueNAS machine and imported the pool but I get the same instant reboot by issuing:
Code:
zfs send -w -I MyDrive/PHOTOS@2022-07-07:21:02:46 MyDrive/PHOTOS@2022-08-15:01:22:19 | zfs receive truenaspool/MyDrive-PHOTOS


I was able to take a snapshot of the screen just before it reboots (less than a second). There is no error message on the terminal used to launch the command.
I performed a scrub on my TrueNAS pool as well as the foreign one and they passed without any error. I went back to the RELEASE version via the boot option but I get the same error either locally or remotely (no USB involved on the TrueNAS side).

Code:
cpuid = 4
time = 1660596042
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01150f0ce0
vpanic() at vpanic+0x17f/frame 0xfffffe01150f0d30
spl_panic() at spl_panic+0x3a/frame 0xfffffe01150f0d90
receive_object() at receive_object+0x6d0/frame 0xfffffe01150f0e80
receive_writer_thread() receive_writer_thread+0x199/frame 0xfffffe01150f0ef0
fork_exit() at fork_exit+0x7e/frame 0xfffffe01150f0f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe01150f0f30
--- trap 0xc, rip = 0x800804e7a, rsp=0x7fffffff7b48, rbp = 0x7fffffff7bb0 ---
KDB: enter: panic
[ thread pid 2213 tid 12255 ]
Stopped at        kdb_enter+0x37: movq     $0,0x141d7fe(%rip)
db:0:kdb.enter.default> write cn_mute 1
cn_mute
db:0:kdb.enter.default>   textdump dump
Textdump complete
db:0:kdb:enter.default>   reset
cpu_reset: Restarting BSP
cpu_reset_proxy: Stopped CPU 4
 

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
I did more testing by restarting a fresh new full replication and then by doing incremental replications of each snapshot. It went well except with the last snapshot that is causing the kernel panic. I destroyed that last snapshot and created a new one but when sending it over, I get the same kernel panic. There is something wrong with the current live filesystem. Could it be linked to ACL or other permissions?
 
Joined
Oct 22, 2019
Messages
3,641
I've bumped into this before when replicating an encrypted dataset. It was back on TrueNAS Core 12. There was an upstream bug with ZFS in which it would invoke a kernel panic if an encrypted dataset contained a symlink that referenced a very long path length However, this bug was fixed in the 12.0-U1.1 and U2 updates.

One thing you can do is compare the differences between your second-to-most recent snapshot against your most recent snapshot, and output it to a text file to inspect, hopefully to find a culprit or clue.

The dataset needs to be unlocked when you do this. (Change the snapshot names accordingly. I'm only using an example. I understand that your most recent snapshot will have have a different name.)

For example,
zfs diff MyDrive/PHOTOS@2022-07-07:21:02:46 MyDrive/PHOTOS@2022-08-15:01:22:19 > differences.txt

It might be better to output the text file to a dataset which you are sharing via SMB so that you can view it from a client computer at your own leisure.
 
Last edited:

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
Thanks @winnielinnie for the suggestion. I noticed the bug you mentioned but saw that it was fixed a long time ago.
I did a search for symlinks and I don't have any in that dataset.
I also used the zfs diff command and ended up with a file with 117K lines :frown: with lots of files removed (-) and new files (+), a number of moved files (R), and very few modified files (M), I'll try to identify new files with a very long path to see if they are causing any issue.
 
Joined
Oct 22, 2019
Messages
3,641
You might be able to narrow it down by filtering out only the "new" and "moved" files, and then from there narrow it down even further based on the path-length of the files.

But this is shooting in the dark. We don't even know if its an errant file (or path length) causing the kernel panic during replication of this particular incremental stream.

UPDATE: It really shouldn't matter, but did you change any dataset properties after the "second-most-recent" snapshot was created?
 
Last edited:

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
What about ACL? The pool and dataset found in the source USB drive have been created on a Linux machine but lately I imported it in my TrueNAS environment to write some files to it from another local pool. Has this messed around with POSIX vs NFSv4 implementations?
I tried to add a new dataset via TrueNAS to the "MyDrive" pool and it says this pool has ACL setup so I could use the UI to apply some changes.

Code:
# zfs get all MyDrive
NAME       PROPERTY              VALUE                  SOURCE
MyDrive  type                  filesystem             -
MyDrive  creation              Mon Jun 13  8:24 2022  -
MyDrive  used                  5.15T                  -
MyDrive  available             5.63T                  -
MyDrive  referenced            96K                    -
MyDrive  compressratio         1.04x                  -
MyDrive  mounted               yes                    -
MyDrive  quota                 none                   default
MyDrive  reservation           none                   default
MyDrive  recordsize            128K                   default
MyDrive  mountpoint            /mnt/MyDrive           default
MyDrive  sharenfs              off                    default
MyDrive  checksum              on                     default
MyDrive  compression           off                    default
MyDrive  atime                 on                     default
MyDrive  devices               on                     default
MyDrive  exec                  on                     default
MyDrive  setuid                on                     default
MyDrive  readonly              off                    default
MyDrive  jailed                off                    default
MyDrive  snapdir               hidden                 default
MyDrive  aclmode               passthrough            local
MyDrive  aclinherit            passthrough            local
MyDrive  createtxg             1                      -
MyDrive  canmount              on                     default
MyDrive  xattr                 on                     default
MyDrive  copies                1                      local
MyDrive  version               5                      -
MyDrive  utf8only              off                    -
MyDrive  normalization         none                   -
MyDrive  casesensitivity       sensitive              -
MyDrive  vscan                 off                    default
MyDrive  nbmand                off                    default
MyDrive  sharesmb              off                    default
MyDrive  refquota              none                   default
MyDrive  refreservation        none                   default
MyDrive  guid                  4027177749782563419    -
MyDrive  primarycache          all                    default
MyDrive  secondarycache        all                    default
MyDrive  usedbysnapshots       0B                     -
MyDrive  usedbydataset         96K                    -
MyDrive  usedbychildren        5.15T                  -
MyDrive  usedbyrefreservation  0B                     -
MyDrive  logbias               latency                default
MyDrive  objsetid              54                     -
MyDrive  dedup                 off                    default
MyDrive  mlslabel              none                   default
MyDrive  sync                  standard               default
MyDrive  dnodesize             legacy                 default
MyDrive  refcompressratio      1.00x                  -
MyDrive  written               96K                    -
MyDrive  logicalused           5.40T                  -
MyDrive  logicalreferenced     42K                    -
MyDrive  volmode               default                default
MyDrive  filesystem_limit      none                   default
MyDrive  snapshot_limit        none                   default
MyDrive  filesystem_count      none                   default
MyDrive  snapshot_count        none                   default
MyDrive  snapdev               hidden                 default
MyDrive  acltype               nfsv4                  default
MyDrive  context               none                   default
MyDrive  fscontext             none                   default
MyDrive  defcontext            none                   default
MyDrive  rootcontext           none                   default
MyDrive  relatime              off                    default
MyDrive  redundant_metadata    all                    default
MyDrive  overlay               on                     default
MyDrive  encryption            off                    default
MyDrive  keylocation           none                   default
MyDrive  keyformat             none                   default
MyDrive  pbkdf2iters           0                      default
MyDrive  special_small_blocks  0                      local


Code:
NAME              PROPERTY              VALUE                  SOURCE
# zfs get all MyDrive/PHOTOS
MyDrive/PHOTOS  type                  filesystem             -
MyDrive/PHOTOS  creation              Mon Jun 13  8:25 2022  -
MyDrive/PHOTOS  used                  4.44T                  -
MyDrive/PHOTOS  available             5.63T                  -
MyDrive/PHOTOS  referenced            4.01T                  -
MyDrive/PHOTOS  compressratio         1.03x                  -
MyDrive/PHOTOS  mounted               no                     -
MyDrive/PHOTOS  quota                 none                   default
MyDrive/PHOTOS  reservation           none                   default
MyDrive/PHOTOS  recordsize            1M                     local
MyDrive/PHOTOS  mountpoint            /mnt/MyDrive/PHOTOS    default
MyDrive/PHOTOS  sharenfs              off                    default
MyDrive/PHOTOS  checksum              on                     default
MyDrive/PHOTOS  compression           lz4                    local
MyDrive/PHOTOS  atime                 off                    local
MyDrive/PHOTOS  devices               on                     default
MyDrive/PHOTOS  exec                  on                     default
MyDrive/PHOTOS  setuid                on                     default
MyDrive/PHOTOS  readonly              off                    default
MyDrive/PHOTOS  jailed                off                    default
MyDrive/PHOTOS  snapdir               hidden                 default
MyDrive/PHOTOS  aclmode               passthrough            inherited from MyDrive
MyDrive/PHOTOS  aclinherit            passthrough            inherited from MyDrive
MyDrive/PHOTOS  createtxg             21                     -
MyDrive/PHOTOS  canmount              on                     default
MyDrive/PHOTOS  xattr                 on                     default
MyDrive/PHOTOS  copies                1                      inherited from MyDrive
MyDrive/PHOTOS  version               5                      -
MyDrive/PHOTOS  utf8only              off                    -
MyDrive/PHOTOS  normalization         none                   -
MyDrive/PHOTOS  casesensitivity       sensitive              -
MyDrive/PHOTOS  vscan                 off                    default
MyDrive/PHOTOS  nbmand                off                    default
MyDrive/PHOTOS  sharesmb              off                    default
MyDrive/PHOTOS  refquota              none                   default
MyDrive/PHOTOS  refreservation        none                   default
MyDrive/PHOTOS  guid                  11632244640048655189   -
MyDrive/PHOTOS  primarycache          all                    default
MyDrive/PHOTOS  secondarycache        all                    default
MyDrive/PHOTOS  usedbysnapshots       443G                   -
MyDrive/PHOTOS  usedbydataset         4.01T                  -
MyDrive/PHOTOS  usedbychildren        0B                     -
MyDrive/PHOTOS  usedbyrefreservation  0B                     -
MyDrive/PHOTOS  logbias               latency                default
MyDrive/PHOTOS  objsetid              515                    -
MyDrive/PHOTOS  dedup                 off                    default
MyDrive/PHOTOS  mlslabel              none                   default
MyDrive/PHOTOS  sync                  standard               default
MyDrive/PHOTOS  dnodesize             legacy                 default
MyDrive/PHOTOS  refcompressratio      1.02x                  -
MyDrive/PHOTOS  written               496K                   -
MyDrive/PHOTOS  logicalused           4.58T                  -
MyDrive/PHOTOS  logicalreferenced     4.12T                  -
MyDrive/PHOTOS  volmode               default                default
MyDrive/PHOTOS  filesystem_limit      none                   default
MyDrive/PHOTOS  snapshot_limit        none                   default
MyDrive/PHOTOS  filesystem_count      none                   default
MyDrive/PHOTOS  snapshot_count        none                   default
MyDrive/PHOTOS  snapdev               hidden                 default
MyDrive/PHOTOS  acltype               nfsv4                  default
MyDrive/PHOTOS  context               none                   default
MyDrive/PHOTOS  fscontext             none                   default
MyDrive/PHOTOS  defcontext            none                   default
MyDrive/PHOTOS  rootcontext           none                   default
MyDrive/PHOTOS  relatime              off                    default
MyDrive/PHOTOS  redundant_metadata    all                    default
MyDrive/PHOTOS  overlay               on                     default
MyDrive/PHOTOS  encryption            aes-256-gcm            -
MyDrive/PHOTOS  keylocation           prompt                 local
MyDrive/PHOTOS  keyformat             passphrase             -
MyDrive/PHOTOS  pbkdf2iters           350000                 -
MyDrive/PHOTOS  encryptionroot        MyDrive/PHOTOS       -
MyDrive/PHOTOS  keystatus             unavailable            -
MyDrive/PHOTOS  special_small_blocks  0                      inherited from MyDrive
 

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
I went through more testing.
  • Unlocked and mounted the source dataset found on a USB drive and then rsynced all the files found in the problematic snapshot to a temp dataset on TrueNAS.
  • Created a new encrypted dataset on the same pool (external drive).
  • Added a couple of dummy files and created an initial snapshot.
  • Sent full replication of this new dataset to TrueNAS (-w)
  • Rsynced everything back from TrueNAS' temp dataset to that external drive.
  • Snapshoted and then sent over an incremental replication (-w -I) to TrueNAS
No issue, no reboot so it seems like the problem doesn't come from a long path or filename with illegal symbols.
I could just use the newly created encrypted dataset that holds the latest snapshot and destroy the problematic dataset but I will lose the history. Furthemore, I would prefer to understand the issue before getting it back in a couple of snapshots.

Any idea on how I could investigate this issue to see why this particular snapshot is creating a kernel panic?

PS: I haven't changed any dataset properties between snapshots.
 
Last edited:
Joined
Oct 22, 2019
Messages
3,641
At this point I'm at a total loss.

It might have something to do with the pool being created on a Linux system, but that still doesn't make sense, as ZFS is meant to be rugged and reliable. A kernel panic (rather than just an error) seems ridiculous. :frown:

This might be one for upstream ZFS to fix (or report if it's a new bug indeed.)
 

alphazo

Cadet
Joined
Aug 15, 2022
Messages
8
I agree. It's getting even weirder. If I replicate the snapshot as a brand new encrypted dataset (instead of incremental) then I don't get the kernel panic.
 

ITR6

Dabbler
Joined
Jan 8, 2017
Messages
12
Were you able to resolve this issue? I am experiencing the same on my replication server.
 

StorageCurious

Explorer
Joined
Sep 28, 2022
Messages
60
I can replicate that if I replicate into an encrypted pool that is locked (either manually locked, or the server rebooted due to a power outage, say) the server receiving the replication reboots at every replication attempt.
As soon as the encrypted dataset is unlocked, the reboot stops, but other issues kill the replication task which cannot succeed from that point on.

A bug is opened, my understanding is that this is a OpenZFS issue fixed in OpenZFS 2.1.7, which is either due to be merged in 13-U4 or U5. (I am no Jira expert, so I find it hard to pinpoint that info precisely.
 
Top