Surprise hard hang of smbd

rlpowell

Dabbler
Joined
Oct 6, 2022
Messages
25
So, to be clear, I now know I was holding it wrong; I'm mostly reporting this because (1) the UI allowed me to do something it shouldn't have, (2) there's clearly some sort of underlying ZFS or kernel bug, and (3) future searchers should be able to find this easily.

For the record: once I found https://www.truenas.com/docs/scale/gettingstarted/install/setupsharing/ and followed it, everything worked fine, so as I said, I was definitely holding it wrong.

Short version: if you make an SMB share on the *pool* itself, rather than on an SMB dataset within the pool, any attempt to actually write to it (at least from my Windows 10 machine) will hard-hang the smbd sub-process. You'll keep accumulating more and more such sub-processes, and any attempt to interact with the file you just tried to create will hang whatever process touches it, including things like "ls -l".

Long version:

So I got my NAS, threw some drives in (no previous ZFS experience, but *lots* of Linux experience), stuck SCALE on it, made a pair of 1-disk mirrors, and made a pool out of those named "main". Oh hey, something created a dataset named ix-applications; what's a dataset? Dunno. Hey, let's SSH in and check things out! Oh hey, there's /mnt/main with a bunch of space, neat!

OK, let's make a Samba share! Oh, it needs a mount point; OK, cool: mkdir /mnt/main/share, make the share, great!
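(In hindsight, the docs have you create a *dataset* for the share and point the share at that, not a bare directory on the pool's root dataset. Roughly, in shell terms; this is a sketch only, the UI's "Add Dataset" dialog with Share Type "SMB" is the supported route, and the exact properties it sets are my guess:

Code:
# create a real dataset for the share instead of a bare directory;
# note that casesensitivity can only be set at dataset creation time
zfs create -o casesensitivity=insensitive main/share
# then in the UI: apply the SMB share-type/ACL settings to the dataset
# and point the share at /mnt/main/share, which is now an actual dataset

Anyway, back to what I actually did.)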

And as soon as I try to create a file on the Windows side Windows Explorer hangs. -_-;

Somewhat later, the following showed up in the journal on the TrueNAS side:

Code:
Oct 09 09:41:43 truenas kernel: INFO: task smbd:66856 blocked for more than 120 seconds.
Oct 09 09:41:43 truenas kernel:       Tainted: P           OE     5.10.142+truenas #1
Oct 09 09:41:43 truenas kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 09 09:41:43 truenas kernel: task:smbd            state:D stack:    0 pid:66856 ppid:  3915 flags:0x00004005
Oct 09 09:41:43 truenas kernel: Call Trace:
Oct 09 09:41:43 truenas kernel:  __schedule+0x282/0x880
Oct 09 09:41:43 truenas kernel:  ? rwsem_spin_on_owner+0x74/0xd0
Oct 09 09:41:43 truenas kernel:  schedule+0x46/0xb0
Oct 09 09:41:43 truenas kernel:  rwsem_down_read_slowpath+0x18e/0x500
Oct 09 09:41:43 truenas kernel:  ? mutex_lock+0xe/0x30
Oct 09 09:41:43 truenas kernel:  zpl_xattr_get+0x97/0x1d0 [zfs]
Oct 09 09:41:43 truenas kernel:  zpl_get_acl_impl+0x44/0x100 [zfs]
Oct 09 09:41:43 truenas kernel:  get_acl.part.0+0x56/0x120
Oct 09 09:41:43 truenas kernel:  generic_permission+0x153/0x1d0
Oct 09 09:41:43 truenas kernel:  zfs_zaccess_common+0x2bb/0x310 [zfs]
Oct 09 09:41:43 truenas kernel:  zfs_zaccess+0xe0/0x440 [zfs]
Oct 09 09:41:43 truenas kernel:  zfs_create+0x2ef/0x990 [zfs]
Oct 09 09:41:43 truenas kernel:  zpl_xattr_set_dir.constprop.0+0x1e2/0x290 [zfs]
Oct 09 09:41:43 truenas kernel:  zpl_xattr_set+0x145/0x380 [zfs]
Oct 09 09:41:43 truenas kernel:  zpl_set_acl_impl+0x160/0x220 [zfs]
Oct 09 09:41:43 truenas kernel:  zpl_xattr_acl_set_access+0x96/0x110 [zfs]
Oct 09 09:41:43 truenas kernel:  __vfs_setxattr+0x66/0x80
Oct 09 09:41:43 truenas kernel:  __vfs_setxattr_noperm+0x70/0x200
Oct 09 09:41:43 truenas kernel:  vfs_setxattr+0x6b/0x120
Oct 09 09:41:43 truenas kernel:  setxattr+0x144/0x270
Oct 09 09:41:43 truenas kernel:  ? __schedule+0x28a/0x880
Oct 09 09:41:43 truenas kernel:  ? cgroup_leave_frozen+0x46/0xd0
Oct 09 09:41:43 truenas kernel:  ? ptrace_stop+0x1e2/0x2b0
Oct 09 09:41:43 truenas kernel:  ? ptrace_do_notify+0x82/0xb0
Oct 09 09:41:43 truenas kernel:  __x64_sys_fsetxattr+0xa1/0xc0
Oct 09 09:41:43 truenas kernel:  do_syscall_64+0x33/0x80
Oct 09 09:41:43 truenas kernel:  entry_SYSCALL_64_after_hwframe+0x61/0xc6
Oct 09 09:41:43 truenas kernel: RIP: 0033:0x7fbfd53b7b1a
Oct 09 09:41:43 truenas kernel: RSP: 002b:00007ffea4df12d8 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
Oct 09 09:41:43 truenas kernel: RAX: ffffffffffffffda RBX: 0000000000000006 RCX: 00007fbfd53b7b1a
Oct 09 09:41:43 truenas kernel: RDX: 00005611b52b73a0 RSI: 00007fbfd458b0b3 RDI: 0000000000000033
Oct 09 09:41:43 truenas kernel: RBP: 00005611b52b73a0 R08: 0000000000000000 R09: 00007fbfd547abe0
Oct 09 09:41:43 truenas kernel: R10: 0000000000000034 R11: 0000000000000246 R12: 0000000000000033
Oct 09 09:41:43 truenas kernel: R13: 00007ffea4df1328 R14: 00007ffea4df1330 R15: 0000000000000010


(Reading that kernel trace bottom-up, for what it's worth: the fsetxattr of "system.posix_acl_access" goes zpl_xattr_set -> zpl_xattr_set_dir -> zfs_create -> zfs_zaccess -> get_acl -> zpl_xattr_get -> rwsem_down_read_slowpath. In other words, *writing* the ACL xattr appears to re-enter an ACL *read*, which then blocks on a lock the write side presumably already holds, which would explain the permanent, unkillable D state.)

To cut considerable investigation short: it turns out that when Windows connected, smbd forked a sub-process (normal), which did a bunch of stuff ending in:

Code:
[pid 66856] getegid()                   = 0
[pid 66856] setresuid(0, 0, -1)         = 0
[pid 66856] geteuid()                   = 0
[pid 66856] fsetxattr(51, "system.posix_acl_access", "\2\0\0\0\1\0\7\0\377\377\377\377\2\0\7\0\376\377\0\0\4\0\7\0\377\377\377\377\10\0\7", 52, 0


51 there is the file descriptor of the new file I had just tried to create; "system.posix_acl_access" is the xattr that stores a file's POSIX ACL. After that, PID 66856 hangs completely: it can't be killed, any process that touches it (e.g. by attaching strace) becomes unkillable too, etc., etc.
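In case it helps anyone hitting something similar: a quick way to see those stuck sub-processes piling up, and what they're stuck in, is plain shell (nothing TrueNAS-specific; the stack read needs root):

Code:
# list processes in uninterruptible sleep (state D, i.e. stuck in the kernel)
ps axo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
# dump the kernel stack of one of them (66856 was my hung smbd)
cat /proc/66856/stack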

smbd, presumably as Windows comes back and re-asks it what the hell is going on, keeps forking new sub-processes, each of which does a bunch of stuff ending with:

Code:
[pid 68004] stat(".", {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0
[pid 68004] getcwd("/mnt/main/share", 4096) = 16
[pid 68004] openat(AT_FDCWD, ".", O_RDONLY|O_NOFOLLOW|O_PATH|O_DIRECTORY) = 50
[pid 68004] fstat(50, {st_mode=S_IFDIR|0755, st_size=3, ...}) = 0
[pid 68004] getxattr("/proc/self/fd/49", "user.DOSATTRIB"


49 there is the file descriptor of the new file, this time.

It will keep doing this forever. "ls -l" (but, oddly, not "stat") also hangs if it touches the new file.
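My guess at the "ls -l" vs "stat" difference, for what it's worth: ls -l queries xattrs/ACLs (among other things, to decide whether to print a "+" after the mode bits), while stat just stats. A strace comparison would show it; this is a sketch, with "stuckfile" standing in for whatever Windows created:

Code:
# ls -l hangs inside an xattr syscall on the poisoned file...
strace -e trace=getxattr,lgetxattr,listxattr,llistxattr ls -l stuckfile
# ...while stat never touches xattrs and completes fine
strace -e trace=statx,lstat stat stuckfile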

In case anybody wants more detail, the complete strace -ff of the main smbd is at https://gist.github.com/rlpowell/251fd95dfbc116d80cd6bfc41d0da87f

The only fix is a reboot, but as soon as you try to create a file again after the reboot, it happens again, 100% of the time.
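Also, if you'd rather not wait 120 seconds for the hung-task watchdog, the kernel will dump every blocked (D-state) task's stack on demand (as root, and assuming sysrq is enabled, which it was on my box):

Code:
echo w > /proc/sysrq-trigger   # 'w' = dump tasks in uninterruptible sleep
dmesg | tail -n 100            # the stacks land in the kernel log / journal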

So, yeah, for any future searchers: don't do that. :D
 

rlpowell

Dabbler
Joined
Oct 6, 2022
Messages
25
An obvious UI fix here: when you create an SMB share, it should complain if the path you're pointing it at is not an SMB dataset, or something like that.
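Even something as crude as this would have caught my setup (a shell sketch of the idea only, obviously not the actual middleware code; the path is from my box):

Code:
path=/mnt/main/share
# df reports the dataset backing a path; if that's the pool root ("main"),
# the share path is just a directory on the root dataset, not its own dataset
backing=$(df --output=source "$path" | tail -n 1)
[ "$backing" = "main" ] && echo "WARNING: $path is not its own dataset"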

But, like, also: silently hard-hanging with no error output of any kind is bad, and whatever is causing that should probably stop? :D

To be clear, "zpool status" was like "it's all good yo!"; if there is some other ZFS thing I should have checked I don't know what it is.
 