Instant reboot when accessing ZFS snapshots over Samba

fschaer

Cadet
Joined
Dec 28, 2023
Messages
4
Hi,

First post ever here, hoping I'll find someone who can enlighten me...
First, my hardware: an HP MicroServer Gen8 with 12 GB of ECC memory and an E3-1260L Xeon (I added some RAM and changed the CPU).

Running TrueNAS CORE 13.0-U3.1, with a pool of two active 4 TB disks (I resilvered a disk about two weeks ago and haven't had time to remove the old one yet). I've recently seen that new ZFS feature flags are available, but haven't yet taken the time to try an upgrade (I don't really know how to say here which flags I'm running).
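(Maybe something like this would show them? Just a sketch using my pool name, pool-4TB - I haven't double-checked the output on my box:)

Code:
# list pools whose feature flags are not all enabled
zpool upgrade

# show the individual feature@ flags and their state for the pool
zpool get all pool-4TB | grep feature@
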

I've just hit a weird issue on my small home NAS, which is the following: I tried accessing the .zfs/snapshot directory of a dataset over Samba... and immediately lost contact with the NAS. It looks like I got an instant reboot (I got mails, and uptime shows this). I'm accessing from an Ubuntu laptop over WireGuard for now, so I can't connect anything to the VGA output to try to see the console...

The logs in /var/log don't mention anything, except that there are new logs at boot. I tried accessing the same dataset over Samba a second time... and got a second reboot... ???

I then connected as root through SSH and found out I can actually list the contents of the directory that causes the reboots when accessed from Dolphin - I can see 36 snapshots and cat a file, for instance.
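(For example - the path below is assumed from my pool/dataset names, and the snapshot/file names are made up:)

Code:
# listing the hidden snapshot directory locally works fine
ls /mnt/pool-4TB/documents/.zfs/snapshot

# and I can read files from inside a snapshot, e.g.:
cat /mnt/pool-4TB/documents/.zfs/snapshot/auto-20231228/some-file.txt
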

I'm wondering if there's something I could enable to get more logs, or something I could try to prevent Samba from crashing the NAS?
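(I've read that FreeBSD can save kernel crash dumps if a dump device is configured - a sketch of what I believe that looks like, untested on my box, and TrueNAS may manage rc.conf itself:)

Code:
# in /etc/rc.conf (the plain-FreeBSD way): use a swap device as the dump device
dumpdev="AUTO"

# after the next panic and reboot, savecore should have written the kernel dump here
ls /var/crash
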

Thanks && regards
 
Joined
Oct 22, 2019
Messages
3,641
Obviously, a client shouldn't be able to crash the entire server simply by accessing network shares.

With that said, how are you connecting to the SMB share with Dolphin?

Just "click and go" via Dolphin's built-in KIO to navigate/access the share?

Or actually using the kernel cifs module, via mount, fstab, systemd, autofs, and/or Smb4K?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The experience with that tends to be pretty crappy, but that's just a client sidenote (a client-side note?).

Assuming this is a kernel panic, here's a suggestion: open up the Serial over LAN functionality on your server, set the scrollback buffer on your terminal to something big, and cause the crash again. The panic output should be informative, and the serial console approach lets you log it without hacks like "filming a monitor and hoping it stays on there long enough".
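Something like this with ipmitool, once SOL is enabled on the iLO - address and credentials are placeholders, adjust to taste:

Code:
# attach to the server's serial console over the network
ipmitool -I lanplus -H <ilo-address> -U <user> -P <password> sol activate

# detach when you're done (the session otherwise stays open)
ipmitool -I lanplus -H <ilo-address> -U <user> -P <password> sol deactivate
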
 
Joined
Oct 22, 2019
Messages
3,641
yup, just using KIO and an smb://ip/ address in Dolphin...

Can you try a test using the (proper) method, via the cifs kernel module?

This can be done with a simple:
Code:
# create a mount point for the test
mkdir /mnt/smbtest

# mount the share with the kernel cifs client (note the //server/share syntax;
# it will prompt for the password)
mount -t cifs -o rw,username=tnuser,uid=1000,gid=1000,file_mode=0666,dir_mode=0777,soft,nounix,serverino //ip.add.re.ss/sharename /mnt/smbtest


Then try to navigate (with Dolphin) to /mnt/smbtest, as well as the hidden ".zfs/snapshot" directory. Make sure you're not currently using the share via KIO.
 

fschaer

Cadet
Joined
Dec 28, 2023
Messages
4
Hi *,

Back home, I tried about everything to reproduce the crash on that share/dataset, including the cifs mount, connecting from my phone over WireGuard, and trying NFS: no more crash on that dataset.

However: when I tried another dataset... I got a crash, but I wasn't quick enough to screenshot it with Spectacle (IPMI over LAN wasn't displaying a thing). I therefore made sure IPMI over LAN was working... and then got no more crashes on that other dataset...

Since I only have 3 datasets with periodic snapshots, I made sure I could capture the crash on the last one... and here it is (captured with ipmitool):

Code:
Fatal trap 12: page fault while in kernel mode
cpuid = 1; apic id = 01
fault virtual address = 0x0
fault code = supervisor read instruction, page not present
instruction pointer = 0x20:0x0
stack pointer = 0x28:0xfffffe01040ae588
frame pointer = 0x28:0xfffffe01040ae5a0
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2535 (smbd)
trap number = 12
panic: page fault
cpuid = 1
time = 1703937345
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01040ae340
vpanic() at vpanic+0x17f/frame 0xfffffe01040ae390
panic() at panic+0x43/frame 0xfffffe01040ae3f0
trap_fatal() at trap_fatal+0x385/frame 0xfffffe01040ae450
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe01040ae4b0
calltrap() at calltrap+0x8/frame 0xfffffe01040ae4b0
--- trap 0xc, rip = 0, rsp = 0xfffffe01040ae588, rbp = 0xfffffe01040ae5a0 ---
??() at 0/frame 0xfffffe01040ae5a0
vgonel() at vgonel+0x186/frame 0xfffffe01040ae610
vgone() at vgone+0x31/frame 0xfffffe01040ae630
vfs_hash_insert() at vfs_hash_insert+0x26d/frame 0xfffffe01040ae680
sfs_vgetx() at sfs_vgetx+0x149/frame 0xfffffe01040ae7f0
zfsctl_snapdir_lookup() at zfsctl_snapdir_lookup+0x1e2/frame 0xfffffe01040aea70
lookup() at lookup+0x45c/frame 0xfffffe01040aeb10
namei() at namei+0x259/frame 0xfffffe01040aebc0
kern_statat() at kern_statat+0xf3/frame 0xfffffe01040aed00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01040aee00
amd64_syscall() at amd64_syscall+0x10c/frame 0xfffffe01040aef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01040aef30
--- syscall (552, FreeBSD ELF64, sys_fstatat), rip = 0x804cc993a, rsp = 0x7fffffffd828, rbp = 0x7fffffffd930 ---
KDB: enter: panic
[ thread pid 2535 tid 101324 ]
Stopped at kdb_enter+0x37: movq $0,0x141d87e(%rip)
db:0:kdb.enter.default> write cn_mute 1
cn_mute 0 = 0x1
&. [terminated ipmitool]

Probably unrelated, but this line was displayed at TrueNAS boot; I haven't yet tried to find out more about it:

ZFS: inconsistent nvlist contents

It seems to me the kernel stack trace is related to the snapshot directories, though...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
zfsctl_snapdir_lookup()
Very curious place to panic. Out of curiosity, what's the setting for snapdir, hidden or visible?
ZFS: inconsistent nvlist contents
More to the point, what does zpool status have to say? This sure sounds like you have a corrupt pool - there are still a bunch of places where a corrupted pool can end up causing a panic, and it would be interesting to narrow this one down a bit.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Sure sounds familiar. I would suggest filing a bug ticket with iX, referencing the FreeBSD bug ticket; with some luck you can get a point release out of it.
 

fschaer

Cadet
Joined
Dec 28, 2023
Messages
4
Hi and thanks for your replies,

Not sure if that answers the snapdir question, but this is the setting for the pool/dataset:

Code:
freenas# zfs get snapdir pool-4TB/documents
NAME                PROPERTY  VALUE   SOURCE
pool-4TB/documents  snapdir   hidden  default

(I haven't yet renamed the host from freenas to truenas ;) )
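(If it would help to test with the snapshot directory visible, I guess I could flip the property - a one-liner, untested on my side:)

Code:
# make the .zfs directory show up in directory listings for this dataset
zfs set snapdir=visible pool-4TB/documents
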

zpool status isn't showing any errors for the data pool:

Code:
freenas# zpool status pool-4TB
  pool: pool-4TB
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: resilvered 1.75T in 05:23:10 with 0 errors on Sat Dec 16 19:17:46 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool-4TB                                        ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/43484653-9c12-11ee-a56d-9418823770f8  ONLINE       0     0     0
            gptid/65ff90a5-c1d8-11ea-83e9-9418823770f8  ONLINE       0     0     0

errors: No known data errors

I have opened a bug in Jira, after finding a closed one about almost the same issue where the bug was closed as not reproducible - I will report your finding about the FreeBSD bug there... thank you already for your support - TrueNAS rocks :]

The bug I opened: https://ixsystems.atlassian.net/browse/NAS-125969

Regards :]
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
(I haven't yet renamed the host from freenas to truenas ;) )
Files are still hosted at freenas.org, too, so we're in good company.
 