michael.samer
Hello
I'm using a FreeNAS installation (11.1-U4) on an ESXi VM with the following hardware:
12 cores (2 * 6 vCPU)
32GB RAM
1 * e1000 NIC, 192.168.0.2/24
1 * vmxnet3 NIC, 192.168.80.2/24
1 * 10GB boot disk
1 * 33GB swap disk
2 * 500TB SAN volumes
Network sharing is done via AD (2012R2) and CIFS only.
Since we upgraded (from 11.1 to 11.1-U3) we have had a severe memory leak on our hands.
After a few hours of uptime, one of the smbd processes starts consuming all available RAM and then starts to swap. The maximum I saw was 260GB, by which point all services had died and the system had become unresponsive over SSH:
last pid: 35878; load averages: 1.28, 1.43, 1.34 up 6+15:25:23 07:00:03
56 processes: 1 running, 48 sleeping, 1 zombie, 6 waiting
CPU: 0.0% user, 0.0% nice, 12.9% system, 0.0% interrupt, 87.1% idle
Mem: 25G Active, 72K Inact, 6639M Wired, 142M Free
ARC: 3842M Total, 133M MFU, 2872M MRU, 3172K Anon, 66M Header, 768M Other
2566M Compressed, 20G Uncompressed, 8.03:1 Ratio
Swap: 34G Total, 34G Used, K Free, 100% Inuse, 4K In, 4K Out
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
34284 root 16 35 0 199M 109M vmwait 5 59:52 0.14% python3.6
35878 root 1 20 0 8212K 3540K CPU2 2 0:00 0.11% top
12195 root 18 31 0 55604K 13612K uwait 7 16:58 0.04% consul
11568 root 1 20 0 16436K 2296K select 5 3:10 0.03% vmtoolsd
516 root 1 20 0 13776K 5072K select 3 0:03 0.03% sshd
9372 root 1 20 0 9176K 800K select 5 0:35 0.01% devd
12137 root 1 20 0 147M 8004K kqread 0 0:24 0.00% uwsgi
10496 root 1 20 0 10432K 10544K select 2 0:17 0.00% ntpd
13314 root 1 20 0 9004K 2436K select 2 0:08 0.00% zfsd
21414 root 1 46 0 260G 0K pfault 3 510:33 0.00% <smbd>
12269 root 12 20 0 101M 11448K nanslp 6 13:23 0.00% collectd
12126 root 1 20 0 102M 20044K select 6 5:44 0.00% python3.6
49525 root 15 20 0 232M 90656K umtxn 7 4:49 0.00% uwsgi
21280 root 1 20 0 93544K 48820K select 2 4:46 0.00% winbindd
21204 root 1 20 0 128M 65472K select 1 2:24 0.00% smbd
13560 root 18 20 0 33936K 4584K uwait 2 1:07 0.00% consul
9571 root 3 20 0 31940K 0K WAIT 1 1:00 0.00% <syslog-n
12188 root 20 20 0 46680K 7228K uwait 4 0:40 0.00% consul-al
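To see how fast it grows, I've started logging the smbd memory footprint once a minute with a small shell loop, something like this (a rough sketch; the log path is arbitrary):
# log PID, virtual size and resident size of every smbd process each minute
while true; do
    date
    ps -ax -o pid,vsz,rss,command | grep '[s]mbd'
    sleep 60
done >> /var/log/smbd-mem.log
The bracketed '[s]mbd' just keeps grep from matching its own command line.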
As I have only 2-3 users (with large amounts of data) active, I can't see any cause for this other than a memory leak. The 11.0 installation (smb=3.6.8) I'm using at home doesn't seem to show this effect, and it handles about the same amount of data.
I already tried the "fixes" from https://forums.freenas.org/index.php?threads/samba-using-up-most-of-the-ram.61039/
to no avail.
In the best case, the kernel kills the runaway smbd about once a day; in the worst case, the whole system just spews error messages (SSH/console) like:
swap_pager_getswapspace(3): failed
and no key press was accepted, nor did Ctrl+Alt+Del help in any way.
Until about three weeks ago the system did not show these high loads. When the RAM consumption starts, the CPU usage also climbs quickly (with only 1GBit of network load, so I can't grasp why):
root@DEVNETNAS:~ # swapinfo
Device 1K-blocks Used Avail Capacity
/dev/mirror/swap0.eli 2097152 2097104 48 100%
/dev/gptid/f52b45ff-3d4c-11e8-8 33554432 29001512 4552920 86%
Total 35651584 31098616 4552968 87%
PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
97258 root 1 100 0 155G 17498M CPU2 2 16.8H 177.47% smbd
57713 root 1 20 0 8212K 3532K CPU7 7 0:00 2.76% top
12195 root 18 30 0 55604K 16200K uwait 1 14:42 1.19% consul
516 root 1 20 0 13776K 5264K select 1 0:02 0.30% sshd
32729 root 1 20 0 128M 92328K select 1 2:13 0.23% smbd
11568 root 1 20 0 16436K 2444K select 4 2:39 0.08% vmtoolsd
34284 root 16 20 0 199M 112M kqread 0 50:39 0.00% python3.6
12269 root 12 20 0 99M 10460K nanslp 1 11:23 0.00% collectd
32791 root 1 20 0 92976K 53688K select 2 7:21 0.00% winbindd
12126 root 1 22 0 102M 21020K select 5 4:58 0.00% python3.6
49525 root 15 28 0 231M 102M umtxn 0 2:42 0.00% uwsgi
13560 root 18 20 0 33936K 5608K uwait 2 1:06 0.00% consul
9571 root 1 20 0 31932K 3228K kqread 0 0:51 0.00% syslog-ng
12188 root 19 20 0 46552K 6904K uwait 2 0:34 0.00% consul-al
9372 root 1 20 0 9176K 880K select 4 0:31 0.00% devd
12137 root 1 20 0 147M 23904K kqread 1 0:22 0.00% uwsgi
11724 root 1 52 0 13004K 4756K select 3 0:17 0.00% sshd
10496 root 1 20 0 10432K 10544K select 6 0:15 0.00% ntpd
32720 root 1 20 0 49908K 12504K select 4 0:15 0.00% winbindd
13314 root 1 20 0 9004K 2780K select 2 0:07 0.00% zfsd
13558 root 18 32 0 33424K 5164K uwait 1 0:07 0.00% consul
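Next time it balloons I plan to capture some state from the bloated process before the kernel kills it. From what I've read, smbd can report its internal talloc allocations via smbcontrol, which should show where the memory actually sits (untested on the FreeNAS build so far; 97258 is the PID from the top output above):
# ask smbd for a breakdown of its talloc memory pools
smbcontrol 97258 pool-usage > /tmp/smbd-pool.txt
# record the process's virtual memory map for comparison
procstat -v 97258 > /tmp/smbd-vm.txt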
If I can't solve this within the next few days, I'll be forced to migrate to a different ZFS-based system, which I'd be very unhappy to do.
Cheers
Michael