Opening Dashboard causes Collectd, Python3.7, and rrdcached processes to spike CPU and system load through the roof

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
No, collectd in /usr/local/lib/collectd has a dynamic library to output to Graphite, write_graphite.so. The problem seems to occur when SMART is queried for your disks' temperatures and the query hits the socket timeout. This is kind of drastic, but what happens if you disable the SMART service?
 

Samuel Tai

Also, the SMART daemon could be corrupted, which may be causing this issue. What output do you get from an MD5 hash of various parts of the SMART subsystem, using openssl dgst -md5 <file>? These are the values to compare against:

File                                 MD5 hash
/usr/local/sbin/smartctl             cd03d1e1be44cd98e2925dac6899b6aa
/usr/local/sbin/smartd               2a132f6027618a41bfaa5bb7d8dbea03
/usr/local/libexec/smart_alert.py    6afb53d7fd086ad7f01d6ed2ec3da485
/usr/local/etc/smartd_warning.sh     8f0dc803ccaeeda791f08a1250f18da0
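
If running openssl by hand on each file gets tedious, the comparison can be scripted. The following is only a minimal sketch: the helper names md5_of and check are made up for illustration, and the reference hashes are the ones listed above, so a mismatch may simply mean a different release is installed rather than corruption.

```python
import hashlib
from pathlib import Path

# Reference hashes copied from the list in this post. They are only
# meaningful if the same software release is installed (an assumption).
EXPECTED = {
    "/usr/local/sbin/smartctl": "cd03d1e1be44cd98e2925dac6899b6aa",
    "/usr/local/sbin/smartd": "2a132f6027618a41bfaa5bb7d8dbea03",
    "/usr/local/libexec/smart_alert.py": "6afb53d7fd086ad7f01d6ed2ec3da485",
    "/usr/local/etc/smartd_warning.sh": "8f0dc803ccaeeda791f08a1250f18da0",
}


def md5_of(path):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check(expected):
    """Map each path to 'match', 'MISMATCH', or 'missing'."""
    results = {}
    for path, want in expected.items():
        if not Path(path).is_file():
            results[path] = "missing"
        elif md5_of(path) == want:
            results[path] = "match"
        else:
            results[path] = "MISMATCH"
    return results


if __name__ == "__main__":
    for path, status in check(EXPECTED).items():
        print(f"{status:8}  {path}")
```

Run it as root on the NAS itself; any line that does not say "match" is worth a closer look.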
 

Samuel Tai

Finally, I think what's going on is that there are too many SMART temperature messages on both attributes 190 and 194 for /usr/local/lib/collectd_pyplugins/disktemp.py to cope with. Besides not running SMART at all, you could change how often smartd checks, and the temperature difference that triggers a report. On my system, I have these at the default 30 seconds and 0 degrees.

 

Dewey

Dabbler
Joined
Dec 19, 2016
Messages
14
All MD5 hashes on my system matched the ones you provided. I disabled the SMART service. Its settings were at the system defaults, as I had never changed them, but I went ahead and configured it to match your settings just in case it gets re-enabled. For comparison, the original settings were: Check: 30, Power Mode: Never, Difference: 0, Informational: 0, Critical: 0.
So I changed the settings and let it sit all day. It never acted up, and I was even able to push some decent-sized NFS traffic. All the while I was getting drive temperature reports back at my Graphite server. The moment I opened the WebUI Dashboard, there was an immediate CPU spike again.
It feels more and more like there is something in the Web UI Dashboard code of my install that kills collectd while it tries to summarize the system metrics for the front page. It is always either the Reporting tab or the front-page Dashboard, where the system metrics and hardware are summarized, that kicks off collectd, rrdcached, and python. Best I can figure, it doesn't have a problem initially because there isn't much data after a fresh boot, even less if the system was shut down for a while. But once enough time goes by, there are enough metrics that the code between collectd and the UI backend isn't handling them properly.
This is pure speculation on my part, and where it unravels is that I don't know why I seem to be the only one with this problem. I have seen no reports elsewhere that match my symptoms of the UI backend running out of control, or at least not to the point where collectd brings the system to its knees. Aside from the socket timeouts (which I believe happen because collectd maxes out the system load), all the messages from collectd and the other processes have been reported or asked about by others, but the responses are almost always, "yeah, I have it too, it doesn't do anything, don't worry about it."
This is so confusing.
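
One way to put numbers on the spike is to log CPU usage for the three processes from a shell session while opening the Dashboard. This is a rough sketch, not a diagnostic tool: it assumes ps accepts `-axo pcpu,comm` on this system, the process names are taken from the thread title, and parse_ps and sample are hypothetical helper names.

```python
import subprocess
import time

# Process names taken from the thread title; adjust if yours differ.
WATCHED = ("collectd", "python3.7", "rrdcached")


def parse_ps(output, watched=WATCHED):
    """Sum %CPU per watched command name from `ps -axo pcpu,comm` output."""
    totals = {name: 0.0 for name in watched}
    for line in output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        cpu, command = parts
        # comm may be a bare name or a full path, depending on the platform
        name = command.strip().rsplit("/", 1)[-1]
        if name in totals:
            try:
                totals[name] += float(cpu)
            except ValueError:
                continue  # ignore rows whose CPU field is not a number
    return totals


def sample():
    """Take one reading by running ps (flags are an assumption)."""
    out = subprocess.run(
        ["ps", "-axo", "pcpu,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)


if __name__ == "__main__":
    # Print one timestamped line every 5 seconds until interrupted.
    while True:
        totals = sample()
        stamp = time.strftime("%H:%M:%S")
        print(stamp, " ".join(f"{k}={v:.1f}" for k, v in totals.items()))
        time.sleep(5)
```

Start it before opening the Dashboard and stop it with Ctrl-C afterwards; the timestamps show how quickly each process ramps up and which one moves first.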
 

Samuel Tai

By any chance, do you allow your drives to spin down?
 

Samuel Tai

Also, it's worth flushing your browser cache before you next try logging in to the web UI.
 

Dewey

I do not allow my drives to spin down.
Browser cache flushed, and even tried other browsers. Same result.
 

Samuel Tai

OK, I think we've ruled out everything on the software side. The only thing left is hardware. You may want to open up your system and blow out all the dust. Try reseating all the connectors and RAM sticks.
 