SOLVED The usefulness of ECC (if we can't assess it's working)?

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Thank you, Mastakilla.
I must admit I am out of my league ;( also on your stability issue. The first time around I did not even notice the number of disks you have, and hence the need for an HBA ;(
Although it is very, very much appreciated, I am having the hardest time debugging your bash script. I am trying, though.
Btw, my script is supposed to avoid sending the same email more than once by removing all MCA-related lines from /var/log/messages. That is not implemented yet, but the code comments do mention it. It also means we get one email containing the MCA-related non-errors the first time the script runs.
I will have a better look at bash in the meantime.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Sweet, I got it to work. The cron expression was incorrect: * * * * * checks every minute.
Also, it turns out that the cron daemon emails a job's output by default, so any script / code can be quite small.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Perhaps a variable exists for which email address to email to, so that the script doesn't need to be edited to work?
In crontab -e:

MAILTO=user@somehost.tld
* * * * * /path/to/your/command

When using tasks in the GUI, you can also set the "run as" user to influence where the email is sent.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Also, after a reboot crontab -l lists no cron jobs, so I guess the 'changes via cli do not persist' warning was correct.
So I'll focus on the tasks interface in the GUI for now.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
this seems to do what I want.


Code:
#!/bin/bash

# Create the file of already-reported errors if it does not exist yet
if [ ! -f ./MCAmessages.processed ]
then
    touch ./MCAmessages.processed
fi

# Collect the MCA lines from the log (excluding lines containing "Features:") in a tmp file
grep 'MCA:' /var/log/messages | grep -v 'Features:' > ./MCAmessages.tmp

# Store the lines that have not been reported yet in a new file
comm -23 ./MCAmessages.tmp ./MCAmessages.processed > ./MCAmessages.new

if [ -s ./MCAmessages.new ] # there are new errors
then
    cat ./MCAmessages.new >> ./MCAmessages.processed
    echo $(cat ./MCAmessages.new)
fi

rm ./MCAmessages.tmp ./MCAmessages.new


can anyone shoot a hole in this please? or good to go?
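
One possible hole, offered as a hedged aside: comm expects both of its inputs to be sorted, and syslog lines are not necessarily in sort order (within the same second, the "Vendor" line comes before the "CPU" line, for example). Below is a minimal sketch of a comparison that does not depend on ordering. The helper name new_lines is made up for illustration, and the awk approach is a swapped-in technique, not part of the script above.

```shell
#!/bin/sh
# Sketch only. new_lines is a hypothetical helper: it prints the lines of
# file $2 that do not appear anywhere in file $1, regardless of line order.
new_lines() {
    # While reading the first file, remember every line; for the second
    # file, print only the lines that were not remembered.
    awk 'FILENAME == ARGV[1] { seen[$0] = 1; next } !($0 in seen)' "$1" "$2"
}
```

In the script above this could stand in for the comm line, for example: new_lines MCAmessages.processed MCAmessages.tmp > MCAmessages.new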
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
wow, it's running ;)

To test whether an email is sent, all I do is remove the MCAmessages.processed file. /var/log/messages still contains ECC errors that I triggered before I started all this.

One thing, probably among many others, that could be better: echo should not strip the newline characters. This is what I am getting now:
Code:
Dec 8 00:00:00 truenas MCA: Bank 15, Status 0xdc2040000000011b Dec 8 00:00:00 truenas MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000 Dec 8 00:00:00 truenas MCA: Vendor "AuthenticAMD", ID 0x800f82, APIC ID 0 Dec 8 00:00:00 truenas MCA: CPU 0 COR OVER GCACHE LG RD error Dec 8 00:00:00 truenas MCA: Address 0x40000000095f680 Dec 8 00:00:00 truenas MCA: Misc 0xd01b0fff01000000 Dec 8 00:06:48 truenas MCA: Bank 15, Status 0xdc2040000000011b Dec 8 00:06:48 truenas MCA: Global Cap 0x0000000000000117, Status 0x0000000000000000 De
.......


Does anyone have a suggestion on how to preserve the formatting?
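
A likely fix, sketched under the assumption that the goal is one log entry per line: the unquoted $(cat ...) is split into words by the shell, and echo joins those words with single spaces. Quoting the substitution, or simply calling cat directly, keeps the newlines. The sample lines and the sample_log name below are for illustration only.

```shell
#!/bin/sh
# Sketch only: sample_log stands in for MCAmessages.new.
sample_log=$(mktemp)
printf '%s\n' \
    'Dec 8 00:00:00 truenas MCA: Bank 15, Status 0xdc2040000000011b' \
    'Dec 8 00:00:00 truenas MCA: CPU 0 COR OVER GCACHE LG RD error' > "$sample_log"

# Unquoted substitution: the shell splits on newlines and echo rejoins
# the words with spaces, so everything ends up on one line.
flattened=$(echo $(cat "$sample_log"))

# Quoted substitution keeps one entry per line.
preserved=$(echo "$(cat "$sample_log")")

printf 'unquoted:\n%s\n\nquoted:\n%s\n' "$flattened" "$preserved"
rm -f "$sample_log"
```

In the script itself, replacing the echo line with a plain cat MCAmessages.new should have the same effect.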

I am sooo happy that I can stop searching for an alternative to truenas ;)
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I still prefer to avoid using files and do everything in memory instead. I'm close to solving the log rotation flaw btw... Will post an update soon...
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
This here should do the trick for hourly monitoring... Still not thoroughly tested though, but I expect it to work...

email_prev_hour_MCA_errors.bash
As root:
mkdir /root/bin
vi /root/bin/email_prev_hour_MCA_errors.bash
chown root:wheel /root/bin/email_prev_hour_MCA_errors.bash
chmod 700 /root/bin/email_prev_hour_MCA_errors.bash
Code:
#!/bin/bash
declare MCA_ERRORS

MCA_ERRORS="$(bzcat /var/log/messages.0.bz2 | egrep "$(date -v -1H "+%b %e %H").*$(hostname -s) MCA.*")"
[[ ! -z "${MCA_ERRORS}" ]] && MCA_ERRORS+=$'\n'
MCA_ERRORS+="$(cat /var/log/messages | egrep "$(date -v -1H "+%b %e %H").*$(hostname -s) MCA.*")"

[[ ! -z "${MCA_ERRORS}" ]] && mail -s "TrueNAS $(hostname): Alerts" youremail@youremaildomain.com <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_ERRORS}"

Cron job
[Screenshot: 1607707076437.png]


A little explanation
Code:
1 #!/bin/bash
2 declare MCA_ERRORS

3 MCA_ERRORS="$(bzcat /var/log/messages.0.bz2 | egrep "$(date -v -1H "+%b %e %H").*$(hostname -s) MCA.*")"
4 [[ ! -z "${MCA_ERRORS}" ]] && MCA_ERRORS+=$'\n'
5 MCA_ERRORS+="$(cat /var/log/messages | egrep "$(date -v -1H "+%b %e %H").*$(hostname -s) MCA.*")"

6 [[ ! -z "${MCA_ERRORS}" ]] && mail -s "TrueNAS $(hostname): Alerts" youremail@youremaildomain.com <<< "MCA Errors were found in /var/log/messages:"$'\n\n'"${MCA_ERRORS}"

  1. specifies that this is a bash script
  2. declares a variable named ${MCA_ERRORS}, in which I store all MCA messages from the previous hour
  3. stores all MCA messages from the previous hour found in /var/log/messages.0.bz2 in ${MCA_ERRORS} (this is the previous /var/log/messages file; if the script runs shortly after a log rotation, this file needs to be checked as well!)
  4. appends a newline to ${MCA_ERRORS} if it already contains something
  5. appends all MCA messages from the previous hour found in /var/log/messages to ${MCA_ERRORS}
  6. emails ${MCA_ERRORS} if it contains something
I select the messages from the previous hour by using egrep to keep only the lines whose timestamp starts with a specific date and hour.
Code:
date -v -1H "+%b %e %H"

"date -v -1H" -> subtracts 1 hour from the current date and time. This works correctly even if 1 hour ago was a different day, month or year.
For example, subtracting 19 hours crosses into the previous day:
Code:
data# date
Fri Dec 11 18:05:52 CET 2020
data# date -v -19H
Thu Dec 10 23:06:07 CET 2020
data#

"+%b %e %H" -> formats the date in the same format as /var/log/messages, but only up to the hour (ignoring minutes and seconds)
For example:
Code:
data# date "+%b %e %H"
Dec 11 18
data#
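
To see how that date prefix combines with the hostname part of the pattern, here is a minimal sketch. The helper name filter_mca is made up for illustration. It takes the prefix and the short hostname as arguments, so it can be tried on sample lines without touching the real logs.

```shell
#!/bin/sh
# Sketch only: filter_mca is a hypothetical helper name.
# Prints the lines from stdin that contain "<stamp> ... <host> MCA",
# the same kind of pattern the script builds from date and hostname.
filter_mca() {
    stamp=$1
    host=$2
    grep -E "${stamp}.*${host} MCA"
}

# Possible usage on the NAS itself (BSD date syntax, as in the script):
# filter_mca "$(date -v -1H '+%b %e %H')" "$(hostname -s)" < /var/log/messages
```

Note that %e pads single-digit days with a space, which matches the two spaces syslog writes in a date like "Feb  5".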

If you want to check more frequently than hourly, for example every 10 minutes, replace
$(date -v -1H "+%b %e %H")
with
$(date -v -10M "+%b %e %H:%M" | sed 's/.$//')

And then create the cron job with
5-59/10 * * * *
instead of
30 * * * *
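
A small sketch of why the sed step is there: dropping the last digit of the minutes turns the timestamp into a prefix that matches a whole ten-minute window. The helper name ten_minute_prefix is made up for illustration, and it takes the timestamp as an argument so it can be tried without BSD date.

```shell
#!/bin/sh
# Sketch only: ten_minute_prefix is a hypothetical helper name.
# Drops the last character of a "Mon DD HH:MM" stamp, so that one prefix
# matches every minute in the same ten-minute window.
ten_minute_prefix() {
    printf '%s\n' "$1" | sed 's/.$//'
}

ten_minute_prefix 'Dec 11 17:55'   # prints: Dec 11 17:5
```

So a run at 18:05 looks 10 minutes back to 17:55 and matches everything logged from 17:50 to 17:59. That is also why the cron job starts at minute 5: each run then covers a window that has already completely passed.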

Resource
After testing this a bit I'll probably make this into a resource.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
coool! If I get this to run I am going with your solution.

Thx bro
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I've just tested it, seems to work :smile: Made it into a resource:
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
What errors are you referring to?
 

Wolfeman0101

Patron
Joined
Jun 14, 2012
Messages
428
Code:
Feb  5 21:55:35 Packers MCA: Bank 17, Status 0x9c2040000000011b
Feb  5 21:55:35 Packers MCA: Global Cap 0x000000000000011c, Status 0x0000000000000000
Feb  5 21:55:35 Packers MCA: Vendor "AuthenticAMD", ID 0x870f10, APIC ID 0
Feb  5 21:55:35 Packers MCA: CPU 0 COR GCACHE LG RD error
Feb  5 21:55:35 Packers MCA: Address 0x4000007aaa1be80
Feb  5 21:55:35 Packers MCA: Misc 0xd01b0fff01000000

I just installed more RAM and I'm wondering if I got a bad stick.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
I read it as:
COR = you got an error, but it got corrected
CPU / GCACHE = it was detected somewhere around the CPU cache (I think)
LG = dunno
RD = read

Something is certainly wrong with your "platform" if you get these more than once. Platform means CPU, motherboard and RAM. Most likely the problem was introduced by installing the new RAM (or it is an extreme coincidence). It can be a bad stick, but it can also be bad compatibility, or RAM configured in a way it can't handle (timings / speed).

Run Memtest86 from a bootable USB stick and see if you get errors there too...
 

Wolfeman0101

Patron
Joined
Jun 14, 2012
Messages
428
I added the exact same RAM I already had installed, which is why I think the new stick may be bad. Also, the periodic script isn't working. When I try to run it manually I get this:
Code:
root@Packers:/mnt/vol1/bkwolfe # sh /root/bin/periodic_mca_log_monitoring.bash
/root/bin/periodic_mca_log_monitoring.bash: declare: not found
/root/bin/periodic_mca_log_monitoring.bash: declare: not found
/root/bin/periodic_mca_log_monitoring.bash: declare: not found
/root/bin/periodic_mca_log_monitoring.bash: [[: not found
/root/bin/periodic_mca_log_monitoring.bash: [[: not found
/root/bin/periodic_mca_log_monitoring.bash: [[: not found
/root/bin/periodic_mca_log_monitoring.bash: 19: Syntax error: redirection unexpected (expecting word)
root@Packers:/mnt/vol1/bkwolfe #
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
For questions / issues with the resource, please post in the resource discussion thread...

Are you sure you didn't just forget the line below at the top of your script?
#!/bin/bash
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Ah, good point... I've indeed configured my ssh user and root to have bash as their shell, which explains why I didn't run into this requirement... I'll update my resource to be compatible with other shell configurations.

edit:
resource = updated
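
For anyone hitting the same "declare: not found" and "[[: not found" errors: declare, [[ ]], += and <<< are bash features that a plain sh does not provide. The updated resource is not reproduced in this thread, so the following is only a rough sketch of what sh-compatible equivalents could look like. The helper names join_blocks and build_report are made up for illustration.

```shell
#!/bin/sh
# Sketch only, not the actual updated resource.
# join_blocks and build_report are hypothetical helper names.

# Join two blocks of log lines, putting a newline between them only when
# both are non-empty (stands in for 'declare', '[[ ]]' and '+=').
join_blocks() {
    if [ -n "$1" ] && [ -n "$2" ]; then
        printf '%s\n%s\n' "$1" "$2"
    elif [ -n "$1$2" ]; then
        printf '%s\n' "$1$2"
    fi
}

# Build the mail body; prints nothing when there are no errors.
build_report() {
    errors=$(join_blocks "$1" "$2")
    [ -n "$errors" ] || return 0
    printf 'MCA Errors were found in /var/log/messages:\n\n%s\n' "$errors"
}

# Possible usage (a pipe stands in for '<<<'), left commented out so the
# sketch has no side effects:
# body=$(build_report "$old_log_hits" "$current_log_hits")
# [ -n "$body" ] && printf '%s\n' "$body" | mail -s "TrueNAS $(hostname): Alerts" youremail@youremaildomain.com
```

Alternatively, keeping the bash script as it is and starting it with bash instead of sh avoids the problem altogether.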
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Still shaking the tree.
I am hoping that the devs will stop the political/commercial angle and just cater to the rest of us.
@Mastakilla has the best example of why this is important. This pro guru bought a server grade board only to get pinched in the buttocks by the little technical details.

And let's not forget that even server grade hardware can go bad, either DOA or over time.

@Mastakilla and anyone with an opinion: given the advances in 3D printing, desktop CNC milling and whatnot, would there be interest in an open source clamp of some sort that one can easily place around a memory module to trigger errors?

I think that, once we go at it, that kind of stability might also make it possible to trigger multi-bit errors, and then we're golden.

I am interested to learn whether the community at large would be interested in something like that.
 