Hi there, as promised I went over the code. Don't have any *bsd system on-hand to run it so if I make an invalid assumption below please correct me.
First off, I didn't see any egregious mistakes and the logic seems fine. Besides a few minor code style issues , what I would have done differently is:
1. move config stuff to a separate cfg file
1a. perhaps add a function/different script that on first-run fills that cfg file based on user input/some system test
2.
Code:
elsif( $fan_speed > 10000 || $fan_speed < 0 )
{
dprint( 0, "$fan_name Fan speed: $fan_speed RPM, is nonsensical\n");
$fan_speed = -1;
}
Don't assume 10k is nonsensical. Get its upper non-critical from ipmi at the start of the script. You can also set a hardcoded value if you want, but a value closer to 30k (do faster fans even exist?) would make more sense I think.
3. move as many sanity checks as possible to the top (such as illegal fan-zone/duty cycle). You can still have them in functions as needed ofc.
4. Move raw command bytes to a platform specific constant (e.g. x10_raw_bytes) and then check for the platform we're running on at start-up.
5. try a bmc warm reset and if it doesn't work/isn't supported do a cold reset
6. add an option to crosscheck systemctl returned values with ipmi reported ones (a sort of sanity check)
7. add the code to github :)
On the whole these are all nitpicks though. Thanks for sharing this script with us and keep up the good work!
Thanks.
Regarding the config, although that might make sense, it would also complicate things for simple users, the benefit of course is that the main script could stay un-edited. You'd still need to be able to configure where the config file is kept ;)
And it would involve me doing a lot of work for no gain... since its working in my production environment now ;)
Where I would like to improve the script is with regards to logging critical temperature events, BMC resets etc... and emailing those events. This doesn't happen currently.
Regarding 10K, the BMC returns a value, something like 27K iirc, sporadically, sometime after a reset. The important thing is to filter out that value. If your fans can infact reach 10K, then yes, that value should be changed. And perhaps a value of 25K would be better. Do fans exist that reach that? The default upper-non-critical is 28K I believe, and that is above this spurious value. IIRC. I'd have to retest the BMC reset logic to retrigger the spurious reading which prompted the code. It doesn't happen every time :)
One of the problems is I don't actually remember the exact value that was a problem... was it 27800 RPM?
It is a good idea to move the sanity checks, and in fact one of my ideas for improvements was to derive and calculate a few of the settings which can be derived... for example, you can derive the fan_zone from the fan_header...
Re the raw bytes... as far as I know, the raw bytes are the same for X10/X11 etc... where they are applicable. What changes between revisions is the modes that are available. The change would be to add support for different platforms (ie not supermicro), and that may be handled better by changing which function is called, rather than which set of raw bytes are sent.
ie, store the function in a variable. Of course... I have no idea how to detect the platform :)
Re the BMC reset code. This script relies on previous testing/experience from others, where the cold reset was used. Warm might work... I wouldn't know... the case where its needed is very hard to trigger... I've only seen it occur... I think twice in all my testings. As such, I can't easily test if the warm reset would solve it, and then I would also need to add another layer to the reset mode to "try harder" with a cold reset. At the moment, the reset code is for when a rare bad situation happens... and the cold reset does fix it.
The latest version of this script is using systemctl values for CPU temps, not the ipmitool values. I don't think there is any value in cross-checking against the ipmi when the values are coming from the kernel. I've never seen the kernel values being invalid, and its read orders of magnitude faster than the original IPMI method.
And re github... yes... maybe one day :)
Thanks again for taking a look