Hoping someone can help me figure out what is going on. My FreeNAS box has been great, when it works. Lately it has been getting into fits: whenever I access the WebUI dashboard, three processes nail the system to the wall, to the point that unless I'm already logged in via SSH, I can't even log in. The console is slow and unresponsive (slower than SSH), and even if I issue a shutdown, it never completes after it has unmounted all the pooled disks. If I'm lucky enough to already be SSH'd in, I can run top and see the output in the Code block below.
FreeNAS is the latest stable version as of this writing (updated two nights ago). The storage/boot drives and RAM are barely touched, but the CPU is cranking on something from collectd, rrdcached, and python. I see nothing in the logs, and eventually it gets to the point where the metric offloads to Graphite stop, the Web UI won't load (in fact I see socket timeout messages on the console from Django), NFS won't mount, and existing mounts are very slow or time out. I'd like to figure out what is causing this and get it to stop. Or worse yet, find whatever hardware may have gone bad. I'll admit I don't really know how to check whether my backplanes or drives are okay, beyond the fact that extended SMART checks come back clean.
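For reference, the extended SMART checks are about the only drive health test I know how to do. This is a rough sketch of how I run them in bulk; the helper functions and the device list are my own, nothing built into FreeNAS, and it just shells out to smartctl and greps for the overall verdict:

```python
#!/usr/local/bin/python3.7
# Rough sketch: shell out to smartctl and look for the overall health verdict.
# The helper names here are my own, not part of FreeNAS.
import re
import subprocess

def parse_health(smartctl_output: str) -> bool:
    """Return True if smartctl's overall-health line says PASSED."""
    m = re.search(r"overall-health self-assessment test result:\s*(\S+)",
                  smartctl_output)
    return bool(m) and m.group(1) == "PASSED"

def check_drive(dev: str) -> bool:
    """Run `smartctl -H` against a device (e.g. /dev/da0) and parse it."""
    out = subprocess.run(["smartctl", "-H", dev],
                         capture_output=True, text=True).stdout
    return parse_health(out)

# Usage (adjust the device list for your pool):
#   for i in range(12):
#       print("/dev/da%d" % i, check_drive("/dev/da%d" % i))
```

That only catches drives that have already failed their self-assessment, though; it tells me nothing about the backplanes.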
Here is the hardware FreeNAS is running on:
Processor: 2x AMD Opteron 6212 8-core 2.6GHz (16 cores total)
Memory: 64GB DDR3 (8 x 8GB REG PC3-10600R, 1333MHz)
Server Chassis/Case: CSE-847E16-R1400UB
Motherboard: H8DGU-F
Just last night it happened again while I was asleep. I woke up to find it in the problematic state. I was poking around trying to get more information while it was happening (not very successfully; I'm normally a Linux admin guy, and FreeBSD is really testing what I can do without the tools I'm used to, but that's a learning problem). Suddenly I noticed the problem had stopped. I double-checked my Graphite/Grafana server (which I stood up specifically to figure this out), and all the metrics that hadn't been getting saved off were now there. So I do have metrics covering the 8-hour period during which collectd, python3.7, and rrdcached were working hard on something, and I don't even know what.
Unfortunately, when I opened the UI after this happened, it happened yet again, and the UI has yet to return. I suspect it will take another 8 hours to come back unless I force a reboot. The UI seems fine if I get in and out quickly enough; otherwise the command line is my only way to do anything. Any help would be greatly appreciated.
Code:
last pid:  7979;  load averages: 22.66, 23.51, 22.78    up 0+06:52:39  22:03:57
59 processes:  2 running, 57 sleeping
CPU: 19.3% user,  0.0% nice, 64.9% system,  5.9% interrupt,  9.9% idle
Mem: 222M Active, 1456M Inact, 1694M Wired, 59G Free
ARC: 547M Total, 126M MFU, 390M MRU, 9709K Anon, 5774K Header, 16M Other
     132M Compressed, 816M Uncompressed, 6.16:1 Ratio
Swap: 24G Total, 24G Free

  PID USERNAME  THR PRI NICE   SIZE    RES STATE   C   TIME    WCPU COMMAND
 2929 root       11  25    0   390M   340M nanslp  9 273:01 246.04% collectd
   79 root       37  46    0   316M   238M umtxn   4 189:26 228.95% python3.7
 1427 root        8  21    0 31968K 11632K select 12  79:18  93.42% rrdcached
 7836 root        1  81    0  7892K  3920K CPU5    5  30:22  25.00% top
 3328 root        1  38    0 12924K  7912K select 14   7:50  15.79% sshd
 1359 root        1  30    0 12484K 12580K select 14  11:22  11.84% ntpd
 1504 root        1  26    0   127M   106M kqread  5   6:12   9.21% uwsgi-3.7
 1407 root        1  28    0 38072K 22540K select  6   3:30   3.95% winbindd
 1391 root        1  22    0 31052K 17860K select  2   3:03   2.63% nmbd
 1442 root        1  20    0   120M   103M select  5   0:05   1.32% smbd
 1035 root        1  22    0  9164K  5556K select 10   0:04   1.32% devd