SOLVED The usefulness of ECC (if we can't assess it's working)?

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Over the last year I have had 3 incidents of memory errors in my system. I would love to be able to get more information out of it if possible.

Code:
messages:Mar 30 23:02:05 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages-Mar 30 23:02:05 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages-Mar 30 23:02:05 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages-Mar 30 23:02:05 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages-Mar 30 23:02:05 tubby MCA: Address 0x1f17b5bbc0
messages-Mar 30 23:02:05 tubby MCA: Misc 0x1229402000201c8c
--
messages.1:Jan 22 00:53:48 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages.1-Jan 22 00:53:48 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages.1-Jan 22 00:53:48 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages.1-Jan 22 00:53:48 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages.1-Jan 22 00:53:48 tubby MCA: Address 0x1f17a9bbc0
messages.1-Jan 22 00:53:48 tubby MCA: Misc 0x1229402000201c8c
--
messages.6:Aug 23 19:32:07 tubby MCA: Bank 12, Status 0x8c00004e000800c3
messages.6-Aug 23 19:32:07 tubby MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
messages.6-Aug 23 19:32:07 tubby MCA: Vendor "GenuineIntel", ID 0x306f2, APIC ID 0
messages.6-Aug 23 19:32:07 tubby MCA: CPU 0 COR (1) MS channel 3 memory error
messages.6-Aug 23 19:32:07 tubby MCA: Address 0x200951bbc0
messages.6-Aug 23 19:32:07 tubby MCA: Misc 0x1229402000201c8c
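
For anyone wanting to squeeze a bit more out of records like these, the Status word can be unpacked by hand. Below is a minimal sketch, assuming the architectural IA32_MCi_STATUS bit layout from the Intel SDM and the memory-controller error-code pattern 0000 0000 1MMM CCCC. The function name `decode_mci_status` and the field labels are made up for this example, and model-specific bits (including the Misc register) are not decoded.

```python
# Sketch only: architectural bits of IA32_MCi_STATUS, no model-specific fields.
MEM_TXN = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its main architectural fields."""
    code = status & 0xFFFF                            # bits 15:0, MCA error code
    out = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "corrected_count": (status >> 38) & 0x7FFF,   # bits 52:38
        "mca_code": code,
    }
    # Memory-controller errors follow the pattern 0000 0000 1MMM CCCC.
    if (code & 0xFF80) == 0x0080:
        out["transaction"] = MEM_TXN.get((code >> 4) & 0x7, "reserved")
        out["channel"] = code & 0xF                   # 0xF = channel not specified
    return out


info = decode_mci_status(0x8C00004E000800C3)
print(info)
# For the status above: uncorrected False, corrected_count 1,
# transaction 'scrub', channel 3
```

That lines up with the "COR (1) MS channel 3" summary line in the log: a single corrected error of the memory-scrubbing type on channel 3, with the same status value in all three incidents.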
 

Yorick

Wizard
Joined
Nov 4, 2018
Messages
1,912
Did the middleware send you an email alert when these happened?

More information should be in the event log of your IPMI.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
Let's distinguish between the part that is wearing - RAM - and the alert system for that - ECC/BIOS/IPMI. With an automated alert system, I do not need to do manual testing of the wearing part, as long as I know that the automated alert system exists and functions.

In your brake analogy, the wearing part is the brake system, and the alert system is completely manual. At least on my car it is.

Testing the automated alert system is a fun little side project. I've got some feelers out for a stick of defective ECC. I'll post if I get a bite. Edit: Crucial can't part with defective sticks. Maybe someone on the STH forums feels generous.

If you are really really keen, you can always solder wires to an ECC stick and inject errors, see link further up in this thread.

Edit: More error injection ideas. "Two syringes", wow that sounds scary, see video at https://www.vusec.net/projects/eccploit/ . And more generally rowhammer, which will work if (big IF) the specific modules in that specific board are susceptible to it.

Edit2: If you thought sticking needles into your DIMM socket was scary, have a look at this: http://bluesmoke.sourceforge.net/heat_gun.html

Edit3: Masking a pin. Hmm. http://bluesmoke.sourceforge.net/testing.html . Though maybe Kapton tape instead.

Edit4: Heat lamps! Oh my goodness. https://www.cs.princeton.edu/~appel/papers/memerr.pdf

So, out of all of those, I think the most reasonable things to try, in order, are:
- Boot from a Linux stick and try rowhammer. Slim chance that the memory is actually susceptible to it, but it costs nothing to try.
- Mask a pin with Kapton tape. Introduces error, will allow one to verify the alerting system works.
- ..... yeah no I'm not comfortable with any of the others :). Mayyyybe the gooseneck clip-on lamp with a 50W bulb. But, yeah, not sure I am keen enough to go down that road.

So, @diversity , I can't wait to hear your test results :)

OK, I am going in head first. I'm buying an 8GB Crucial ECC unbuffered stick and will try a few things, beginning with the heat gun.

Regarding the two-syringe method, is there a possibility of damaging the mobo? Having watched it a few times, it actually looks quite easy to do.

I'll report back once I have results.

The other suggestions seemed a little tricky as I don't have a steady hand.
 

diversity

Or, if the syringe method also does not destroy the memory, I can try it using my current memory.
 

Yorick

The syringe method doesn’t destroy memory. When you look at their pictures and video, they use syringe needles because they are stiff wires. They set them to the right distance to fit into the holes of the DIMM socket, power the system down, place them into the holes so they touch the metal contacts that touch the RAM, insert the RAM, and power the system up. When they want to introduce an error, they use pliers to short the two syringes / wires. Done carefully, nothing will go up in blue smoke.

I pondered that method as well, maybe with a wine cork to keep the syringe at the right distance. I’m chasing down some defective memory though, and I’ll see whether my memory is susceptible to rowhammer before going down the syringe path.

When it comes to heating, the idea of a lamp seems less scary than the idea of that heat gun. A digital thermometer like the one in the lamp setup seems to be a must. They saw errors in the 80 to 100 degrees C range, right up to the boiling point of water.

I’d personally try Kapton tape first, then syringes, and probably never heat; that’s scary.

Well actually I am going to see whether I can rowhammer first :)
 

diversity

OK, syringe it is then. Luckily we are not (yet) in total lockdown and the pharmacy is still open.

As I have no idea of what I am doing I am trying to emulate the video as closely as possible. However my memtest86 pro 8.4 (rc2 build 1001) does not have a test #17 'Test mem Raw'

Can you please help me select the best test I do have available?

Please see the attached screen captures.

On the sys_info screen capture you'll see that ECC memory injection is disabled. Enabling it does not yet seem to do what we all hope it should do. PassMark is still analyzing my debug logs. Hopefully build 1002 will do the trick; otherwise it may never work, if it turns out the mobo is not playing along and ASRock Rack will not update their BIOS. We'll see.

Anyway, I am too impatient in the meantime, so here I go, going off-road ;)

Also should I use a single cpu thread or all of them?
 

Attachments

  • sys_info.jpeg
  • test_options.jpeg

diversity

Ohh yeah forgot to ask. In the BIOS, do I disable or enable the 'platform first error handling' in the advanced>AMD CBS>CPU common options?
 

Yorick

No idea on your memtest and BIOS questions.

To run rowhammer on FreeNAS, here are the instructions. Keep in mind rowhammer relies on susceptible memory / BIOS; chances are high these systems are not susceptible.


Code:
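# Create a base jail named "rowhammer" and start it
# ssh to FreeNAS, then open a console in the jail
iocage console rowhammer
pkg install gcc git bash
git clone https://github.com/google/rowhammer-test.git
cd rowhammer-test
bash make.sh
# make.sh builds the test binary as rowhammer_test (underscore)
./rowhammer_test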
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
203
Hi all,

Just read through the topic and a few small remarks from my experiences:

1) ASRock did NOT properly test ECC for their Ryzen platforms (like the X470D4U / X470D4U2-2T). I did this testing myself and discovered that ECC reporting is NOT working on these platforms, and they've admitted this. They said they'll remove things like the ECC Error Event logs in the IPMI, because they fool people into thinking that it does work. In the meantime I'm still trying to get them to properly fix / implement this in their BIOS. It seems like they want to, but they're not getting proper support from AMD Taiwan. I have now put them in contact with AMD US; hopefully that progresses a bit better.

2) Isn't it easier, and just as reliable, to overclock / undervolt the RAM for testing ECC? I've had pretty good results with this... Not that I was able to see an ECC error report (as noted above, it doesn't work), but I do feel confident that this is a proper way to test it...

For more details on my struggles on this topic, see my posts in
 

diversity

Thx Mastakilla for your recap so far.

What I'll do then is wait until my test X570 mobos arrive, try MemTest86 8.4 first, and if that fails, go play doctor with my memory on the X570 boards ;)

By the way, I already tried using sewing needles, but those are far too thick to fit into the tiny holes of the memory slots.

So yes, I do need to go to the pharmacy if memtest fails to inject on the coming test set.

x570 boards that are on the way are:

ASUS Prime X570-P (if it arrives first I'll start with this one first, anyway this one will have my top priority)
ASRock X570 Pro 4
Gigabyte X570 Aorus Elite
ASRock X570M Pro4
ASRock X570 Extreme4
ASROCK X570 PHANTOM
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Mastakilla said:
1) Asrock did NOT properly test ECC for their Ryzen platforms (like the X470D4U / X470D4U2-2T). I did do this testing and discovered ECC reporting is NOT working on these platforms and they've admitted this. They said they'll remove things like ECC Error Event logs in the IPMI, because it fools people into thinking that it does work.

@diversity and I have chatted a bit about this in private messaging. This is essentially the reason I am suggesting to stick with major manufacturers who make servers as their bread and butter, and also to stick with the much more thoroughly deployed and tested Intel Xeon stuff.

I used to torture some of my padawans with a criticism that "this is a detail-oriented business".

When you install FreeBSD, do you install the ISO to HDD? Of course. Do you configure resolv.conf? Almost always. Do you set a root password? Probably. Do you set up NTP? Usually. Do you install an SSL CA pack? Maybe. Do you set up smartd? Possibly. Do you set up powerd, or nscd? Don't lie to me (heh). See, there are all sorts of corners of server software configuration that you COULD touch and maybe SHOULD touch, but don't.

In that same manner, there's lots of stuff that newbie server manufacturers COULD do, but the number of people clamoring for arcane features that require significant expertise to arrange correctly through several highly specialized mainboard subsystems does not really make a compelling case for a low-yield server manufacturer to make such an engineering investment. Heck, I don't even trust Supermicro to get this stuff right on alternative platforms like Ryzen. I'm pretty sure they do have the expertise in-house, which gives them a huge advantage, but call me skeptical until someone demonstrates that it works.
 

SweetAndLow

Yorick said:
Did the middleware send you an email alert when these happened?

More information should be in the event log of your IPMI.

Huh? No, it doesn't do that. I have a Supermicro board and there was nothing more in the event log.
 

diversity

I agree with jgreco. For me this topic has been a great ride, and I have concluded it as follows:

If one needs assurance then do the research on what does work and stick with that.

This topic has drifted from its original intent to AMD Ryzen/ECC ramblings and thus I will redirect anyone interested in that particular topic to:
I for one am signing out of this thread.

Once again all thanks for contributing. Much appreciated.

respect!
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,776
Can you please share your setup details? We can then add it to the not (yet) existent list of known good configurations.

If we have but a few, then notifications might become a possibility.

Supermicro SYS-1028R-WTNRT with mainboard X10DRW-NT. FreeBSD 11.3
 

Yorick

SweetAndLow said:
Huh? No it doesn't do that. I have a supermicro board and there was nothing more in the event log.

That’s unfortunate. I was expecting that the IPMI event log would show these errors and could alert on them when set to send alerts.

At least now I know the middleware doesn’t. I’m not clear on whether mcelog on FreeBSD supports triggers; that would be one way to go about this.
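
If triggers turn out not to be available, a cruder fallback would be a periodic scan of the system log for the MCA summary lines posted at the top of this thread. Here is a minimal sketch of the matching part only. The name `summarize_mca` and the regex are illustrative and not part of FreeNAS or mcelog, and treating `UNCOR` as the tag for uncorrected errors is an assumption about the kernel's log format. Reading the log file and sending the mail are left out.

```python
import re
from collections import Counter

# Matches summary lines such as:
#   "... tubby MCA: CPU 0 COR (1) MS channel 3 memory error"
# UNCOR is assumed to be what the kernel prints for uncorrected errors.
MCA_LINE = re.compile(
    r"MCA: CPU (?P<cpu>\d+) (?P<kind>COR|UNCOR)\b.*?"
    r"channel (?P<channel>\S+) memory error"
)


def summarize_mca(lines):
    """Count memory-error records per (cpu, kind, channel)."""
    counts = Counter()
    for line in lines:
        m = MCA_LINE.search(line)
        if m:
            counts[(int(m["cpu"]), m["kind"], m["channel"])] += 1
    return counts


sample = [
    "messages:Mar 30 23:02:05 tubby MCA: Bank 12, Status 0x8c00004e000800c3",
    "messages-Mar 30 23:02:05 tubby MCA: CPU 0 COR (1) MS channel 3 memory error",
    "messages.1-Jan 22 00:53:48 tubby MCA: CPU 0 COR (1) MS channel 3 memory error",
]
for key, n in summarize_mca(sample).items():
    print(key, n)
```

Run from cron against the messages file, a non-empty result could be mailed out; whether that is preferable to fixing IPMI alerting is a separate question.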
 

Yorick

I did some testing on TrueNAS Core 12 with an X11SSH-F. I am receiving alerts from both IPMI and TrueNAS.


Code:
Alert received from FreeNAS IPMI
IP : 192.168.2.8
Hostname:
SEL_TIME: 2020/04/16 11:19:47
SENSOR_NUMBER: 53
SENSOR_TYPE: Memory
SENSOR_NAME: OEM
EVENT_DESCRIPTION: Correctable ECC @DIMMS6
EVENT_DIRECTION: Assertion
EVENT SEVERITY:"information"

TrueNAS @ freenas.wuffden.local

New alerts:
* Memory #0x53 Asserted Correctable ECC (@DIMMO6(CPU4)).

Current alerts:
* Memory #0x53 Asserted Correctable ECC (@DIMMO6(CPU4)).
 

Dan Tudora

Patron
Joined
Jul 6, 2017
Messages
276
jgreco said:
@diversity and I have chatted a bit about this in private messaging. This is essentially the reason I am suggesting to stick with major manufacturers who make servers as their bread and butter, and also to stick with the much more thoroughly deployed and tested Intel Xeon stuff.

I used to torture some of my padawans with a criticism that "this is a detail-oriented business".

When you install FreeBSD, do you install the ISO to HDD? Of course. Do you configure resolv.conf? Almost always. Do you set a root password? Probably. Do you set up NTP? Usually. Do you install an SSL CA pack? Maybe. Do you set up smartd? Possibly. Do you set up powerd, or nscd? Don't lie to me (heh). See, there are all sorts of corners of server software configuration that you COULD touch and maybe SHOULD touch, but don't.

In that same manner, there's lots of stuff that newbie server manufacturers COULD do, but the number of people clamoring for arcane features that require significant expertise to arrange correctly through several highly specialized mainboard subsystems does not really make a compelling case for a low-yield server manufacturer to make such an engineering investment. Heck, I don't even trust Supermicro to get this stuff right on alternative platforms like Ryzen. I'm pretty sure they do have the expertise in-house, which gives them a huge advantage, but call me skeptical until someone demonstrates that it works.

I agree with that.
In the IT industry you must have "some trust in hardware".
I have a Dell, another Dell, an HP MicroServer Gen8, another one and another, and one Wincor Nixdorf as a router (pfSense),
and I am happy with that setup.
I am not worried (for now) about hardware failure.
happy covid-19
succes
 

diversity

I am happy to report that it is actually really easy to generate ECC memory errors. Thanks, all, for motivating me to go in head first.

The method I used was to take the inner wires of some electrical cable, rather than syringes, and stick them into the memory slot with the 8GB ECC UDIMM in it.

Success on the first try. I saw errors being corrected using Proxmox 6.1 with a 5.4 kernel and also MemTest86 Pro 8.4 RC2 build 1001.

I don't yet have a mobo with IPMI that logs and alerts on ECC errors, so I could not test that, but that would be my solution for FreeNAS.
I have not yet tested on TrueNAS 12 to see whether OS reporting has made it in, but I'll test that as soon as I can. It might take a while though.

The memory still seems intact.

Now the icing on the cake would be if ECC error injection were available for recent setups and part of a monthly (or whatever period is reasonable) health check.
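
On the Linux side (such as the Proxmox test above), the counting half of such a health check could look roughly like the sketch below. It assumes the kernel's EDAC driver is loaded and exposes `ce_count` / `ue_count` per memory controller under `/sys/devices/system/edac/mc`. The function names are made up for this example, it does not apply to FreeNAS on FreeBSD, and the injection itself still has to happen by other means.

```python
from pathlib import Path

EDAC_ROOT = Path("/sys/devices/system/edac/mc")  # Linux EDAC sysfs tree


def read_edac_counts(root=EDAC_ROOT):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs."""
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        ce = int((mc / "ce_count").read_text())
        ue = int((mc / "ue_count").read_text())
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(before, after):
    """Per-controller increase in (corrected, uncorrected) between two samples."""
    delta = {}
    for name, (ce, ue) in after.items():
        old_ce, old_ue = before.get(name, (0, 0))
        if (ce, ue) != (old_ce, old_ue):
            delta[name] = (ce - old_ce, ue - old_ue)
    return delta
```

The idea would be to take one sample before injecting an error and one after: if `new_errors` comes back empty, the reporting path is not working on that board.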
 

Attachments

  • 20200419_134352.jpg
  • 20200419_163225.jpg