SOLVED Random FN crashes on new hardware

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
I'm struggling to identify what is causing my system to randomly freeze on me. It has happened on both 11.1U7 and also on 11.2. The crash has occurred a few times during the following:

1) Twice after a replication task from another FN server and while I don't have exact time or data transferred. It seemed to freeze maybe 2 hours into the process. I have since stopped this. Rsync appears to be working fine.
2) While installing plex plugin from the webGUI also happened randomly when trying to install from jail shell (Note: it didn't happen during the actual install, once when downloading the pkg's and another when 11.2 was downloading the repository through the webGUI).
3) random other times

Before having to physically shut off the server. The console does not reveal to me any message or problems.

My build is a beefy setup that I lucked into when our office was trying this server instead of HP gear we have and didn't want it (winner winner me):

Dell R640 - (2) Intel Xeon Gold 6150 2.7G 18C/36T - (8) 32GB DDR4 SDRAM PC4-21300 ECC
-Dell PERC H200E 6Gb/s SAS PCIe HBA 2 Port Ext SFF-8088 Crossflashed to LSI firmware 20 in IT mode
-Network Cards: Intel 10G X550T (rev 01) & Intel I350 Gigabit
-Boot - Kingston 60GB SV300S37A/60G (using internal usb port and usb cable to SSD)
Lenovo 70F10000UX SA120 Direct Attach Storage
-(6) HGST Ultrastar HE6 HUS726060ALA640 6GB in Raidz2
-(2) Crucial MX500 500GB CT500MX500SSD1(Z) Mirrored for jails/VMs

I'm using an 8088 cable to the direct attached storage unit.

Is there anywhere I can look for why I am freezing? I'm also going to run a memtest to make sure I don't have any problems there. Where do I start with tackling this issue?

I should add, the system ran all night while rsync did it's thing without any issues.

Only thing that the log shows in general that I can't figure out with 11.2:

Code:
Jan 27 19:12:59 freenas uwsgi: [sentry.errors:674] Sentry responded with an API error: RateLimited(None)
Jan 27 19:12:59 freenas uwsgi: [sentry.errors.uncaught:702] ['timeout: timed out', '  File "django/core/handlers/exception.py", line 42, in inner', '  File "django/core/handlers/base.py", line 244, in _legacy_get_response', '  File "freenasUI/freeadmin/middleware.py", line 296, in process_request', '  File "freenasUI/middleware/auth.py", line 8, in authenticate', '  File "freenasUI/middleware/client.py", line 20, in __enter__', '  File "middlewared/client/client.py", line 320, in __init__', '  File "middlewared/client/client.py", line 313, in __init__', '  File "middlewared/client/client.py", line 170, in connect', '  File "ws4py/client/__init__.py", line 215, in connect']
 
Last edited:

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
-Boot - Kingston 60GB SV300S37A/60G (using internal usb port and usb cable to SSD)
This? Please explain this more. Why are you using a usb cable for your disk. You need to be using a normal sata interface.
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
The server was intended to virtualize initially and didn’t come with any storage except two sdcards. I thought about this too. I’ll switch it out for a USB thumb drive and reload the config.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
The server was intended to virtualize initially and didn’t come with any storage except two sdcards. I thought about this too. I’ll switch it out for a USB thumb drive and reload the config.
Worth a shot.
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Worth a shot.

Out of curiosity, the Dell R640 server I have has an IDSDM module with dual 32GB SD cards in it. By default they mirror each other. I have read that everyone says to absolutely avoid SDcards. Aren't these particular ones intended to run an operating system out of in virtualized environments? is FreeNAS different in this aspect?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Might work, try something that is known to work until you get a working system. With newer versions of freenas it actually runs off the disk and it's not a ram disk like previous releases. Using a SSD is the recommended solution. I still use mirrors USB drives but that's just until they die. Been 5 years so they might never die.
 

ThreeDee

Guru
Joined
Jun 13, 2013
Messages
700
for what it's worth.. I tried using a 40GB SSD connected with a USB to SSD cable and initially it seemed to work but then had an odd "lock up" where Plex and Ubiquiti Controller, Tautulli where all still accessible but I could not log into FreeNAS via web or directly on my FreeNAS box .. I went back to a thumb drive and no issues. I am not certain if I just have a bad SSD drive or if it was how I was trying to use it as the SSD drive is a bit aged
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
for what it's worth.. I tried using a 40GB SSD connected with a USB to SSD cable and initially it seemed to work but then had an odd "lock up" where Plex and Ubiquiti Controller, Tautulli where all still accessible but I could not log into FreeNAS via web or directly on my FreeNAS box .. I went back to a thumb drive and no issues. I am not certain if I just have a bad SSD drive or if it was how I was trying to use it as the SSD drive is a bit aged
Again I would definitely not use a USB cable to a SSD. Why is this even a idea people are having?
 

ThreeDee

Guru
Joined
Jun 13, 2013
Messages
700
It's because SSD's are "recommended" and a lot of us are out of SATA ports .. :confused:
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Finally got around to working on this. I grabbed some PNY turbo usb drives to try as I saw people commenting on the Cruzer Fits tend to burn themselves up ( just had one go on my on my current build).

What I am not understanding is the snail pace install took ~45 minutes. Boot up took probably 30 minutes for first time. This is way different than before, that SSD took 5 min to install and less than that to boot.

I’m trying install again on a different stick to see if it’s a bad stick, but this is strange. I know once it’s up and running that doesn’t matter, but when I’m getting setup and restart for whatever reason I can’t be waiting that long. I’m trying install on a different drive plus usb slot to make sure there isn’t something wrong with the internal one

Edit - super slow install here too...
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Ok, got some new information that hopefully helps us identify what's going on. After talking with some of the IT guys at work, they encourged me to try the SD card module that came with the server that includes (2) 32gb dell cards. Their logic is that these are not the same SD cards that you get at best buy and they are intended to be used to run an operating system like VMware, etc. I was happy with the install speed and boot up speed and went ahead down this path.

Everything seemed to be running better after loading the config up. I was running some rsync tasks, installed plex, and was using my channels dvr plugin all at the same time, where before it would crash.

Thinking I was in a good spot, the system eventually did freeze on me, but I did not have syslog or top running in putty. I rebooted, got things running and was particulary interested in how FreeNAS slowly take my memory into ARC. As I watched it get low ~4000mb, it quickly reallocated back to 13G and held there for awhile. Then it locked up again with this sitting on top:

Code:
last pid: 18814;  load averages:  7.08,  4.09,  3.69    up 0+01:06:48  20:47:08
69 processes:  2 running, 67 sleeping
CPU:  1.4% user,  0.0% nice, 11.1% system,  0.3% interrupt, 87.2% idle
Mem: 16G Active, 770M Inact, 15G Laundry, 218G Wired, 492M Free
ARC: 203G Total, 8324M MFU, 192G MRU, 6712M Anon, 393M Header, 42M Other
     196G Compressed, 199G Uncompressed, 1.02:1 Ratio
Swap: 8192M Total, 8192M Free

  PID USERNAME    THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 3709 root         12  34    0   121M 35008K nanslp 67   2:53 469.28% collectd
17765 root         22  20    0 31605M   903M kqread 58   2:42 323.30% bhyve
  238 root         25  20    0   286M   213M uwait  56   6:54  99.44% python3.6
 7239 root          1  85    0  9800K  4484K CPU23  23   6:20  44.85% rsync
 7313 root          1  36    0  9800K  4656K dp->dp 61   4:30  20.35% rsync
 8604 root          1  27    0 24136K 17020K dp->dp 55   4:37  12.67% rsync
 7233 root          1  26    0 22088K 17200K zio->i 67   6:27  11.03% rsync
 9030 root          1  28    0 12040K  7352K CPU28  28   0:20   6.09% top
 5785 root         64  20    0   313M   100M uwait  62   6:19   5.00% channels-
 3463 nobody        1  20    0  6932K  3120K select 58   0:01   0.02% mdnsd
 1910 nobody        1  20    0  6932K  3216K select 32   0:02   0.02% mdnsd
 3117 root          1  20    0 12484K 12592K select 55   0:02   0.01% ntpd
 9025 root          1  20    0 12920K  8096K select 55   0:00   0.01% sshd
 3515 root          1  20    0   119M   102M kqread 17   0:04   0.00% uwsgi-3.6


Could something be going on with the memory getting too low all of a sudden crashing me?

I got around also to getting syslog going and it doesn't appear to give us any insight into the moments leading up:

Code:
01-29-2019    20:47:32    Local7.Info    192.168.20.99    Jan 29 20:47:31 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:47:31 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:47:22    Local7.Info    192.168.20.99    Jan 29 20:47:21 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:47:21 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:47:12    Local7.Info    192.168.20.99    Jan 29 20:47:11 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:47:11 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:47:06    Local7.Info    192.168.20.99    Jan 29 20:47:05 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:47:05 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:51    Local7.Info    192.168.20.99    Jan 29 20:46:51 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:51 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:42    Local7.Info    192.168.20.99    Jan 29 20:46:41 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:41 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:31    Local7.Info    192.168.20.99    Jan 29 20:46:30 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:30 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:21    Local7.Info    192.168.20.99    Jan 29 20:46:20 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:20 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:11    Local7.Info    192.168.20.99    Jan 29 20:46:10 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:10 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:46:01    Local7.Info    192.168.20.99    Jan 29 20:46:00 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:46:00 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:51    Local7.Info    192.168.20.99    Jan 29 20:45:50 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:50 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:41    Local7.Info    192.168.20.99    Jan 29 20:45:40 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:40 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:31    Local7.Info    192.168.20.99    Jan 29 20:45:30 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:30 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:21    Local7.Info    192.168.20.99    Jan 29 20:45:20 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:20 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:11    Local7.Info    192.168.20.99    Jan 29 20:45:10 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:10 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:01    Local7.Info    192.168.20.99    Jan 29 20:45:00 freenas nginx: 192.168.20.107 - - [29/Jan/2019:20:45:00 -0600] "GET /api/v1.0/storage/replication/ HTTP/1.1" 200 12 "http://192.168.20.99/ui/vm" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
01-29-2019    20:45:00    Cron.Info    192.168.20.99    Jan 29 20:45:00 freenas /usr/sbin/cron[18067]: (root) CMD (/usr/libexec/atrun > /dev/null 2>&1)
01-29-2019    20:44:54    Kernel.Notice    192.168.20.99    Jan 29 20:44:54 freenas kernel: bridge1: link state changed to UP
01-29-2019    20:44:54    Kernel.Notice    192.168.20.99    Jan 29 20:44:54 freenas kernel: tap0: link state changed to UP
01-29-2019    20:44:54    Kernel.Notice    192.168.20.99    Jan 29 20:44:54 freenas kernel: bridge1: link state changed to DOWN
01-29-2019    20:44:54    Kernel.Info    192.168.20.99    Jan 29 20:44:54 freenas kernel: tap0: promiscuous mode enabled
01-29-2019    20:44:54    User.Notice    192.168.20.99    Jan 29 20:44:54 freenas tap0: Ethernet address: 00:bd:3e:f1:31:00


Thoughts?
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Can you try not running bhyve and see if it creates? Especially since it seems to create pretty easily now.
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Can you try not running bhyve and see if it creates? Especially since it seems to create pretty easily now.
Let me look up how to do that.

Also, out of curiosity and other posts. I enabled autotune. After restarting I am now getting a new message:

I/0 error occured while writing; fd='23', error ='Message too long (40)'

another one popped up with fd=24 as well
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Let me look up how to do that.

Also, out of curiosity and other posts. I enabled autotune. After restarting I am now getting a new message:

I/0 error occured while writing; fd='23', error ='Message too long (40)'

another one popped up with fd=24 as well
Don't enable auto tune it's not helpful for most people. You have crazy errors, this is not normal at all. You probably have hardware failures
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Don't enable auto tune it's not helpful for most people. You have crazy errors, this is not normal at all. You probably have hardware failures

Looks like that error is around my syslog server. I need to modify a setting, so that error isn't serious.. I am not sure I understand your previous comment around Bhyvee and what to do there. Appreciate any insight.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Turn it off! Removing variables until this thing doesn't crash anymore. I think I have tried enough, I'm done helping. Good luck.
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
Turn it off! Removing variables until this thing doesn't crash any more. I think I have tried enough, I'm done helping. Good luck.

Ouch... appreciate you spending the time thus far. I had to re-read what you meant about turning off the VM. Didn't understand at first. I'll keep plugging away at it.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I suspect the issue is with you LSI card. Do yo get errros related to mps0?
What is you swap usage?
You can run Netdata to see how thngs are doing.
Why 200GB of ARC? Do you have a L2ARC drive? If so you may want to remove it for now.
 

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
I suspect the issue is with you LSI card. Do yo get errros related to mps0?
What is you swap usage?
You can run Netdata to see how thngs are doing.
Why 200GB of ARC? Do you have a L2ARC drive? If so you may want to remove it for now.

Oddly, I can't get Netdata service to start running... It just says failed to start.

I do not have an L2ARC. The system has 256GB of RAM. I cannot find any mps0 errors through the logs.

I'm not 100% convinced there isn't something going revolving how it slowly reallocates memory to ARC and then something ramps up and the system can't move the memory fast enough and that is what causing a lock up. It's just interesting that both times I saw the lock up I was watching how the system would take me down to maybe 4GB left then reallocate, etc. After reading a bit, I understand why this happens, but just have this suspicion that something isn't going the right way. After turning on autotune and letting it set the settings. I also changed the max ARC to around 200GB as that would be plenty (heck even 100GB for my uses).

I left the system running overnight while streaming my TV through a plugin and everything is still running, and I don't see the system has gobbled up all the memory. In fact, it's left it at 230G free and 12G is in ARC. To clarify one thing, before I was literally able to watch TOP and see the free memory being taken slowly (maybe 2-3 gig a minute being moved to ARC). My understanding in a 'runway' memory situation it would take weeks to get it to eat it all up. Maybe I am wrong on that thought.

Granted this is with the Bhyve VM not running. My next step is to enable it to see as @SweetAndLow pointed out, do I have something going on with that. Previously, I don't think the system would go for longer than an hour or so without falling on its face.
 
Last edited:

-MG-

Explorer
Joined
Mar 13, 2017
Messages
65
One more update:

I've restarted the VM and have the system recording a TV channel and playing that out on a TV (same conditions would crash the system after an hour). It's still running without issue. After doing more reading through the forms, I can see where what I wrote above isn't 100% correct. But there is some tunable that autotune put in that seems to be keeping everything in line and I continue to focus on the memory usage. It doesn't look like the system is using the 8GB of swap as far as I can tell but strangling the free memory before.
 
Top