FreeNAS Mini on 11.2 freezes / locks up

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
I take that back: a few things have changed on this system, but I only started having problems when I upgraded to 11.2. I added a cache device, for example, but I removed it and the pool still froze after that.
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
I think I'm seeing very similar issues with my FreeNAS 11.2 here too. I'm using it as an SMB backup target; backups are done daily using Veeam to back up several VMware servers. The system was a fresh install of 11.2 (not sure anymore which build exactly) which worked fine for some time, and I'm pretty sure I've upgraded it twice via the web GUI since then. Now every night it locks up after the backup job has been running for a while. When this happens it's still pingable. When trying to access it via SSH you can still enter username and password, but then it gets stuck (no command prompt). On the local console it accepts the 9 key to go to the shell, but even there you don't get a prompt anymore. The web GUI and the share are dead as well. The only way to recover from this is a hard reset.

I also tried autotune, which didn't help at all, so I disabled it and removed all the tunables. Limiting the ARC to 50G using vfs.zfs.arc_max didn't help either; I had been seeing the ARC use almost all of the RAM, so I thought that was worth trying.
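For reference, the 50G cap was applied as a tunable roughly like this (the value is 50 GiB in bytes; adjust for your own RAM):

Variable: vfs.zfs.arc_max
Value: 53687091200
Type: sysctl (loader works too, but only takes effect after a reboot)

# quick non-persistent test from the shell; the ARC only shrinks back gradually
sysctl vfs.zfs.arc_max=53687091200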

Here are some details on the system and its configuration:

Mainboard: Supermicro X11DPI-NT (dual 10 Gbit/s Intel NICs onboard, but only one of them in use atm)
CPU: Dual Intel Xeon Bronze 3104
RAM: 64 GB (4x16) ECC DDR4-2666
HBA: LSI SAS 9300-8i
Boot device: GEOM mirror of 2x Samsung SSD SM863A 240GB
Storage: 16x HGST 10TB SAS
FreeNAS-Version: FreeNAS-11.2-U2.1 (Build Date: Feb 27, 2019 20:59)

The storage is configured as a single RAID-Z3 pool using all 16 10TB drives (about 90% used atm).
Besides SMB with a local user, nothing else is configured or used.

I appreciate any hints and can provide more details or logs, or run tests, if you tell me what is needed.

Regards
Holger
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
Maybe chime in on the ticket I opened. I'd keep a terminal open to the system and run the commands Alexander mentioned when this happens again.
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
Hi Meyers,

I can do so, but as mentioned before, when it gets stuck I can't do anything anymore, as there is no way to run any kind of command. I don't think my issues are related to worn-out hardware, as the components are still pretty new and worked fine for a while. I'll try to leave some SSH sessions open and stress-test it to get some kind of information, though.

Regards
Holger
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
there is no way to run any kind of command any more

So it sounds like maybe your boot pool is freezing up? My boot pool was fine as I recall. It was my storage pool that was freezing up. Could be the same underlying issue though.
 
Joined
Jan 4, 2014
Messages
1,644
The storage is configured as a single RAID-Z3 pool using all 16 10TB drives (about 90% used atm).
Are all your disks in one vdev? That and your pool being at 90% are complicating factors in your situation. Extracts from the ZFS Primer:
  • Using more than 12 disks per vdev is not recommended. The recommended number of disks per vdev is between 3 and 9. With more disks, use multiple vdevs.
  • At 90% capacity, ZFS switches from performance- to space-based optimization, which has massive performance implications. For maximum write performance and to prevent problems with drive replacement, add more capacity before a pool reaches 80%.
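A quick way to check both points from the shell (replace pool0 with your pool name):

zpool list -v pool0                        # the CAP column shows how full the pool is; the vdev layout is listed underneath
zpool get capacity,fragmentation pool0     # the same figures as pool properties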
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
Performance of the system is still OK for what we are using it for. We want to use as much space as possible; it's not recommended, but it should be supported.

Atm I'm running a bunch of backups, waiting for the system to die again, while multiple SSH sessions are open showing different information. I can see that most of the disks (besides the boot mirror) are constantly 50% to 90% busy, with few drops below 50%. The ARC filled up to 50G quite fast but stays there now. Free memory is around 7G.
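The sessions are just running stock tools, roughly along these lines:

top -S -o size                          # processes sorted by size, including system processes
gstat -p -I 1s                          # per-disk busy percentage, refreshed every second
sysctl kstat.zfs.misc.arcstats.size     # current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.c_max    # configured ARC maximum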
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
I have been pushing about 5 TB to the NAS from multiple sources running several backups, but I was not able to bring it down. Even after removing the tunable limiting the ARC I couldn't make it lock up. Without the ARC limit I never saw less than 1.5G of free memory.

I'll leave the SSH sessions running overnight, hoping I can capture a frozen picture of the moment when things go downhill.

One thing that I noticed, though, is that it's not using any swap at all. It always says Swap: 10G Total, 10G Free. Maybe it's running out of memory at some point and locking up, since it isn't touching the swap at all? Is it normal behaviour that it doesn't use swap?
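In case it matters, this is how I'm checking swap from the shell (swapinfo is part of base FreeBSD):

swapinfo -h     # per-device swap usage, human-readable; matches the 'Swap: 10G Total, 10G Free' line from top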
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
ARC is evictable cache, so I wouldn't be too worried about that. Make sure you change these settings back to what they were, otherwise you'll have an impossible mess on your hands. I would just stick with the autotune settings (which is what I do on my systems).

Wide vdevs aren't recommended, like Seymour pointed out, so I'd definitely rethink your setup. Check out this post. If you do 2x 8-disk RAID-Z2 vdevs you'll only lose about 18TB usable.

None of my production systems use swap. This appears to be normal (and in general you almost never want your systems using swap).
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
So when this happens, can you run ANY commands, even a simple ls? I can run commands on mine. It's only when I try to access my data pool that the command freezes until reboot. I'm curious which pool is freezing up for you. Can you post the output of zpool status?
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
It died again tonight, this time in a slightly different way. It looks like it rebooted at around 7:45am. This is the last output from the SSH console running top (sorted by size):

last pid: 77855; load averages: 0.11, 0.10, 0.46 up 0+20:05:10 07:45:17
86 processes: 1 running, 85 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.1% interrupt, 99.9% idle
Mem: 9736K Active, 359M Inact, 735M Laundry, 59G Wired, 2304M Free
ARC: 55G Total, 2193M MFU, 52G MRU, 569M Anon, 232M Header, 100M Other
53G Compressed, 57G Uncompressed, 1.08:1 Ratio
Swap: 10G Total, 10G Free


PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND
236 root 23 20 0 271M 219M kqread 8 10:06 0.09% python3.6
4233 root 1 20 0 186M 157M select 3 182:07 0.00% smbd
27272 root 25 20 0 184M 158M buf_ha 8 83:27 0.00% smbd
3524 root 15 24 0 183M 133M umtxn 5 0:34 0.00% uwsgi-3.6
54963 root 1 20 0 170M 151M select 6 0:00 0.00% smbd
3002 root 11 20 0 166M 125M nanslp 11 13:28 0.00% collectd
20726 root 1 20 0 165M 148M select 0 0:01 0.00% smbd
77763 root 1 22 0 163M 147M zio->i 3 0:00 0.00% smbd
77852 root 1 20 0 161M 145M select 10 0:00 0.00% smbd
77832 root 1 20 0 161M 145M select 6 0:00 0.00% smbd
77811 root 1 20 0 161M 145M select 0 0:00 0.00% smbd
77833 root 1 20 0 161M 145M select 9 0:00 0.00% smbd
77855 root 1 20 0 161M 145M select 9 0:00 0.00% smbd
2698 root 1 20 0 161M 145M select 10 0:04 0.00% smbd
2785 root 1 20 0 118M 103M select 0 0:00 0.00% smbd
2775 root 1 20 0 118M 102M select 10 0:42 0.00% smbd
2876 root 1 20 0 115M 100M kqread 0 0:07 0.01% uwsgi-3.6
2709 root 1 20 0 78372K 62868K vmpfw 2 0:00 0.00% winbindd
2703 root 1 20 0 77064K 61316K zio->i 10 0:02 0.00% winbindd
3750 root 1 20 0 65232K 59232K wait 7 0:02 0.00% python3.6
344 root 2 22 0 64740K 54404K usem 8 0:01 0.00% python3.6
2570 root 1 20 0 60008K 27672K uwait 7 0:43 0.12% dtrace
2568 root 1 20 0 60008K 27672K uwait 2 0:37 0.12% dtrace
2569 root 1 20 0 60008K 27672K uwait 7 0:36 0.06% dtrace
343 root 2 20 0 54528K 47764K piperd 11 0:01 0.00% python3.6
2439 root 5 20 0 46268K 31496K buf_ha 11 13:26 0.03% python3.6
2732 root 8 20 0 44480K 17556K select 0 2:16 0.01% rrdcached
2855 root 1 20 0 38852K 21528K select 11 0:00 0.00% winbindd
2856 root 1 20 0 37128K 21392K select 0 0:00 0.00% winbindd
2695 root 1 20 0 29412K 16448K select 1 0:07 0.03% nmbd
2008 root 2 20 0 20516K 9204K buf_ha 8 0:02 0.00% syslog-ng
2435 root 1 20 0 18864K 11468K select 11 0:11 0.00% snmpd
244 root 1 52 0 15532K 11404K piperd 5 0:00 0.00% python3.6

...


And these are the last disk statistics (gstat) from another SSH shell:

dT: 1.064s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| ada0
0 0 0 0 0.0 0 0 0.0 0.0| ada1
0 0 0 0 0.0 0 0 0.0 0.0| da0
0 0 0 0 0.0 0 0 0.0 0.0| da1
0 0 0 0 0.0 0 0 0.0 0.0| da2
0 0 0 0 0.0 0 0 0.0 0.0| da3
0 0 0 0 0.0 0 0 0.0 0.0| da4
0 0 0 0 0.0 0 0 0.0 0.0| da5
0 0 0 0 0.0 0 0 0.0 0.0| da6
0 0 0 0 0.0 0 0 0.0 0.0| da7
0 0 0 0 0.0 0 0 0.0 0.0| da8
0 0 0 0 0.0 0 0 0.0 0.0| da9
0 0 0 0 0.0 0 0 0.0 0.0| da10
0 0 0 0 0.0 0 0 0.0 0.0| da11
0 0 0 0 0.0 0 0 0.0 0.0| da12
0 0 0 0 0.0 0 0 0.0 0.0| da13
0 0 0 0 0.0 0 0 0.0 0.0| da14
0 0 0 0 0.0 0 0 0.0 0.0| da15


Previously, when it died, it never rebooted; it just got completely stuck. You were able to log in via SSH (enter username and password), but then nothing happened anymore: no prompt, no MOTD, nothing. Same on the local console: you could enter 9 for the shell in the console menu, but the prompt never appeared.

Pool status:

  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:06 with 0 errors on Thu Mar 21 03:45:06 2019
config:

        NAME            STATE     READ WRITE CKSUM
        freenas-boot    ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            ada0p2      ONLINE       0     0     0
            ada1p2      ONLINE       0     0     0

errors: No known data errors

  pool: pool0
 state: ONLINE
  scan: scrub repaired 0 in 0 days 22:26:45 with 0 errors on Sun Feb 24 22:27:07 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        pool0                                           ONLINE       0     0     0
          raidz3-0                                      ONLINE       0     0     0
            gptid/5cff0baf-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/5dd5b5f6-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/5ebad554-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/5fab3abe-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/608d7c8b-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/6173b0eb-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/625e999c-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/63597239-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/644c1e19-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/65462af9-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/66533eab-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/6760ec47-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/688715ca-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/69ab3979-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/6ae807c4-f942-11e8-918e-ac1f6b4da51e  ONLINE       0     0     0
            gptid/6f1736ea-fdd6-11e8-9145-ac1f6b4da51e  ONLINE       0     0     0

errors: No known data errors



Not sure where to go from here... :-/
 

balpay

Cadet
Joined
Jul 26, 2018
Messages
3
I've been dealing with the same behavior for almost a month now, but with no exact solution.
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
Just found out why the system got reset this time (output from the Supermicro IPMI event log):

2019/03/26 06:06:08 Watchdog 2 Timer interrupt - Interrupt type: None, Timer use at expiration: SMS/OS - Assertion
2019/03/26 06:06:09 Watchdog 2 Hard Reset - Interrupt type: None, Timer use at expiration: SMS/OS - Assertion


Also, ipmitool shows that the watchdog is enabled:

root@nas06[~]# ipmitool mc watchdog get
Watchdog Timer Use: SMS/OS (0x44)
Watchdog Timer Is: Started/Running
Watchdog Timer Actions: Hard Reset (0x01)
Pre-timeout interval: 0 seconds
Timer Expiration Flags: 0x10
Initial Countdown: 137 sec
Present Countdown: 127 sec



Which leads me to this thread:
https://www.ixsystems.com/community/threads/ipmi-watchdog-2-hard-resets.42332/

However, maybe the reset is just the result of the system becoming unresponsive. As I said, the previous times the IPMI didn't reset the system; it just got stuck somehow without rebooting.
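One way I'm trying to tell a watchdog reset apart from a kernel panic (assuming the default dump locations; stock FreeBSD writes crash dumps to /var/crash, and FreeNAS is said to use /data/crash):

ls -l /var/crash /data/crash                              # any vmcore.* or info.* files would point to a panic
grep -i -e panic -e watchdog /var/log/messages | tail     # last panic/watchdog-related log lines, if any made it to disk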
 

Meyers

Patron
Joined
Nov 16, 2016
Messages
211
I've been dealing with the same behavior for almost a month now, but with no exact solution.

Can you add any information to this ticket? You'll want to post your hardware details, what version of FreeNAS you're running, etc. If this started when you upgraded to 11.2 like it did for me, you could revert to 11.1 for now.

got reset this time

I disabled the watchdog on my production systems because I don't want them rebooting automatically. I want to investigate the console before a reboot, plus I have full remote access, so I don't need auto-reboot.

You might want to disable watchdog for now just so you can see what's going on when the system locks up. System -> Tunables -> Add:

Variable: watchdogd_enable
Value: NO
Type: rc

Then run service watchdogd stop on the system.
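A quick way to double-check that the timer is actually disarmed afterwards (assuming the BMC timer was being fed by the OS watchdog, as your ipmitool output suggests):

service watchdogd status     # should report that watchdogd is not running
ipmitool mc watchdog get     # 'Watchdog Timer Is:' should no longer read Started/Running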

Keep your consoles open and hopefully you find something new next time this happens.
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
Thanks for the hint about disabling the watchdog. Done. SSH sessions restarted, and top is running on the local console as well. I will report back with any new findings tomorrow.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
It died again last night; however, this time I can still run commands in the shell. I updated the Redmine ticket with some information.
 

balpay

Cadet
Joined
Jul 26, 2018
Messages
3
You would be best to start a new thread describing your system hardware and FreeNAS version and detailing your problem(s), so that people here have something to work from when advising you.
Please look here to understand the forum's expectations: https://www.ixsystems.com/community/threads/forum-guidelines.45124/
There might be one or two reasons for this: the boot drive and/or the ZFS pool fails to handle the Samba traffic load, especially when used with a backup tool like Veeam. But I'm still not sure which part is mainly responsible for this effect, the boot drive or the zpool. I replaced the boot drive (a Samsung EVO) with a Samsung 860 Pro, replaced a couple of storage disks (suspicious SMART tests) with newer ones, and reverted the OS back to FreeNAS 11.2-U1. Stress tests (with diskspd) so far fail to represent the real-world experience with Veeam, so I'll wait and see whether this is going to work or not.
 

hoba

Dabbler
Joined
Mar 25, 2019
Messages
14
From my experience it looks like the differential Veeam backups run through just fine (writing to the ZFS pool), but later Veeam starts to create the full backup from the previous differential backups, which seems to make it lock up (concurrent reads and writes to ZFS). I have left the system in this stuck state, so if somebody has a clue which commands to run and where exactly to look for problems, I would be happy to deliver that information.
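What I can still try to capture from the open sessions, in case it's useful (standard FreeBSD tools, nothing FreeNAS-specific; whether they return anything while the pool is hung is another question):

zpool status -v                  # which pool, if any, still answers
gstat -p -I 1s                   # whether the disks are doing anything at all
ps auxww | grep smbd             # state column of the smbd processes (e.g. stuck in zio-> or buf_ha)
procstat -kk -a | grep smbd      # kernel stack traces of the stuck smbd threads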
 

balpay

Cadet
Joined
Jul 26, 2018
Messages
3
From my experience it looks like the differential Veeam backups run through just fine (writing to the ZFS pool), but later Veeam starts to create the full backup from the previous differential backups, which seems to make it lock up (concurrent reads and writes to ZFS). I have left the system in this stuck state, so if somebody has a clue which commands to run and where exactly to look for problems, I would be happy to deliver that information.
Yes, it's problematic when writes and reads occur at the same time (while merging incrementals with full backups). But I have two FreeNAS boxes; the first one seems to be OK when concurrent jobs are squeezed down to a low number like 2-3, with problem-free boot and zpool drives. I'll wait for the other one to see whether the new drives will do the job or not.
 