SOLVED The usefulness of ECC (if we can't assess it's working)?

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Given the advances in 3D printing technology, desktop CNC milling and whatnot,
would there be interest in an open-source clamp of some sort that one can easily place around a memory module to trigger errors?

Not clear to me what the value of this would be over just soldering some leads to a cheap low capacity test DIMM and generating errors.
 

diversity

Contributor
Joined
Dec 4, 2018
Messages
128
One thing I can think of is that one does not need soldering skills when using my suggestion, only a local 3D printing service.
Also, one does not need to sacrifice a test DIMM; the actual DIMM gets tested instead.

Does this make sense or still not hitting any marks? If no one sees it then at least I gave it a try
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
What exactly do you picture your 3D print service generating?

The fundamental problem is that determining ECC support is like trying to ascertain the octane of the gasoline in your gas tank without getting out of the driver's seat. There's actually no reliable way to do it across the PC landscape, so the referenced Jira ticket is just basically asking for something that can't be done reliably.

As nice as it would be to

Please make life better and stop pushing certain "known good configs". it is not helping anyone

from the Jira ticket, even the dedicated memtest86 products, whose whole raison d'etre is to test and report memory, do not reliably and correctly report this for every platform, and these people have made a BUSINESS out of trying to keep track of which chipsets do which things, and how the data is reported. They STILL can't get it right.

The reason we push "known good configs" here on the forum is because that's actually the only viable solution to the problem.

One reason iXsystems sells prebuilt systems is so that they can know that the components used do in fact support ECC.

So, having provided some context, and now looping back around to your reply,

One thing I can think of is that one does not need soldering skills when using my suggestion, only a local 3D printing service.
Also, one does not need to sacrifice a test DIMM; the actual DIMM gets tested instead.

Does this make sense or still not hitting any marks? If no one sees it then at least I gave it a try

What do you envision your 3D print service doing?

If we are testing for ECC functionality, then presumably we don't already know that the chipset supports ECC correctly, or reports it usefully. This means that you need to deliberately and reliably be able to inject an error into the data. That involves corruption of the signals at an electrical level. I have no idea what you believe 3D printing could do to help you here.

Those of us who are "in the biz" will often have a few DIMMs around that are known to be flaky. Using one of these, or deliberately injecting errors into a good DIMM, would seem to be the only options for end-to-end testing of ECC.

Otherwise, you would need to try to collect platform-dependent information and try to infer what you can from data that may exist, may not exist, and will definitely vary in quality from platform to platform. There is no way to do this correctly without hardware support, and the only way to validate would be to deliberately inject errors.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
<moderation note, January 2022>

At this point, diversity accused people of getting emotional, having ulterior motives, etc. while also moving goalposts, and due to some of the previous rounds of this topic, was asked to stop. Some posts in this thread have been removed now that active participation in the thread has ended, in an attempt to improve readability for future readers. I've tried to avoid damaging meaningful comments and useful content, because the underlying topic is indeed valid, even if we are unlikely to be able to solve it generally. Technical discussions are fine. Accusing people of vague ulterior motives is not. @Ericloewe's following message summarizes the remainder of what I would have said here:
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176
<Moderator hat>
So write it yourself. Or find someone with the right skills who's interested in doing so. That's the beauty of open-source. Spamming the forum about it is not going to accomplish anything.

Your suggestions are vague and not credible (perhaps they are just poorly explained, but you were asked to explain better and didn't).

Shorting DDR4 pins 4 and 5 sounds like an awesome way of destroying your memory controller, since shorting a differential input to the voltage rail is firmly outside of anything CPUs are designed for.

Since we're in "dangerous to people's property" territory here, we're done here.
</Moderator hat>
 

nasbdh9

Dabbler
Joined
Oct 23, 2020
Messages
17
If I remember correctly, TrueNAS picks up the errors reported by the BMC and then sends an email.
20211204170607.jpg
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
If I remember correctly, TrueNAS picks up the errors reported by the BMC and then sends an email.

Yes, that is correct, and iXsystems obviously supports this.

The original poster's axe to grind is that there is no standard way that such reporting is done. Non-server motherboards generally do not have a BMC, and even some servers do not usefully log or usefully alert to ECC errors in this manner. Intel introduced the Machine Check Architecture (MCA) back in Pentium days as an attempt to standardize this sort of thing, but experience suggests that MCA is not a reliable reporter of ECC issues, and more generally, there is no way to reliably detect a correct ECC implementation in software.
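To make "platform-dependent information" concrete, here is a minimal sketch of what a software-side check tends to look like. It assumes a Linux host with an EDAC driver loaded, which exposes per-controller counters under `/sys/devices/system/edac/mc`; that is an assumption for illustration and does not apply to FreeBSD-based TrueNAS. The function name `read_edac_counts` is made up for this example. Note what it cannot tell you: an empty result does not prove ECC is absent, and a counter sitting at zero does not prove that errors would be reported.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int, "ue": int}} for each EDAC memory controller.

    An empty dict only means the kernel exposes no EDAC memory controller at
    this path. It is not evidence either way about whether ECC works.
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())  # corrected errors
            ue = int((mc / "ue_count").read_text().strip())  # uncorrected errors
        except (OSError, ValueError):
            continue  # counter missing or unreadable on this platform
        counts[mc.name] = {"ce": ce, "ue": ue}
    return counts


if __name__ == "__main__":
    result = read_edac_counts()
    if not result:
        print("No EDAC memory controllers visible; ECC status unknown.")
    for name, c in result.items():
        print(f"{name}: corrected={c['ce']} uncorrected={c['ue']}")
```

The only way to turn such counters into evidence is to inject a known error and watch whether the count moves, which is the hardware problem discussed in the rest of the thread.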

The OP would like us to stop pushing "known good configs" as stated above. Well, there are MANY reasons, beyond just ECC, that we advocate certain known good configs on the forums, which include the use of server mainboards, Intel ethernet chipsets, etc. The fact of the matter is, sadly, manufacturers are not going to go and redesign the hardware that was sold five or ten years ago, that is being bought on eBay and repurposed for use as a NAS by hobbyists and SOHO users, to make ECC reporting work in some uniform and reliable manner. So it is virtually mandatory that we identify what is known to work correctly and encourage users to buy that.

The PC architecture is full of all sorts of horrible design crap and I appreciate the original poster's frustration, but at the end of the day, I'm more interested in viable solutions to solvable problems.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Oopsie, people seem to be talking past each other here lately...

Let me try to clarify things here a bit...

The "physical" bit-flip method that Diversity is talking about here is indeed about shorting specific pins. One pin is used for "data", the other pin is a ground pin. The simplified idea is that when a 1 is sent over the data pin, there is a voltage on the data pin. By "shorting" this pin to the ground pin, the 1 becomes a 0 and you get a forced "bit flip" on the memory itself.

We've gotten this trick from a research paper about those rowhammer attacks and worked with the guys who wrote the paper to get to the bottom of this. The paper was about DDR3, but we also found the correct pins for DDR4. So yes, this trick does work for all platforms that have DDR3 or DDR4.

And yes, this method can be used to "quickly" and "reliably" "validate" "unvalidated platforms" for correct handling of single-bit errors. So I do think this is a very good method to easily increase the number of "known good platforms".

However, we haven't yet found a way to trigger multiple-bit errors using physical pin-shorting. For testing multiple-bit errors, you still need to resort to the "slower" and "less reliable" underclocking methods. I was able to do that as well on my specific platform, but it is more time-consuming...
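For readers wondering why single-bit and multiple-bit errors are treated so differently, here is a toy sketch. It is not the code used by any real memory controller: it is a tiny (8,4) extended Hamming code, whereas ECC DIMMs protect 64 data bits with 8 check bits. The function names and the "force a bit to 0" demo are assumptions made up for illustration; the demo mimics pulling one data line to ground while a 1 is being sent.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit SECDED codeword (list of bits).

    Positions 1..7 form a Hamming(7,4) word with parity bits at 1, 2 and 4.
    Position 0 holds an overall parity bit covering positions 1..7.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8
    word[3], word[5], word[6], word[7] = d
    for p in (1, 2, 4):
        word[p] = sum(word[i] for i in range(1, 8) if i & p) % 2
    word[0] = sum(word[1:]) % 2
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for p in (1, 2, 4):
        if sum(word[i] for i in range(1, 8) if i & p) % 2:
            syndrome |= p
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        # Odd number of flips: assume exactly one and repair it.
        # A syndrome of 0 means the overall parity bit itself flipped.
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Syndrome set but overall parity even: two bits flipped.
        return "uncorrectable", None
    nibble = word[3] | word[5] << 1 | word[6] << 2 | word[7] << 3
    return status, nibble


if __name__ == "__main__":
    good = encode(0b1011)
    one_short = list(good)
    one_short[5] = 0            # one data line forced to 0
    two_shorts = list(good)
    two_shorts[5] = 0
    two_shorts[7] = 0           # two data lines forced to 0
    print(decode(good), decode(one_short), decode(two_shorts))
```

In this sketch a single forced bit is silently repaired and only shows up if the platform reports corrected errors, which is exactly what the pin-shorting test checks. Two forced bits can be detected but not repaired. Also note that forcing a line to 0 only causes an error when the bit being sent was a 1.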

For more details on this, you can read my thread about my journey to figure all of this out
Warning: it's a long read and the solution only starts to form near the end ;)

But back to this topic:
I think what Diversity is asking for here is to create a specific 3D-printed shape that makes it "easier" and "safer" to short the correct pins, without the risk of accidentally shorting the wrong ones. As the pins are very small, it currently requires a steady hand and quite large cojones ;)
 

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
I am at a complete loss about what you folks are trying to achieve/prove.

1. Buy server from Supermicro, Fujitsu, Lenovo, HPE, ... that has "ECC support" on the data sheet.
2. Fill the DIMM slots with "ECC memory".
3. Put the server in production and sleep well.
4. Whenever there's a memory error the system will log something like kernel: MCA: CPU 0 COR (2) OVER MS channel 3 memory error
5. When that does not happen, there's no memory error.
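As a minimal sketch of step 4, this is one way to pull such events out of a log. The pattern is derived only from the single sample line quoted above, so treat the format as an assumption; other kernels or platforms may word these messages differently. The names `MCA_MEMORY_ERROR` and `find_memory_errors` are made up for this example.

```python
import re

# Derived from one sample line only; real log formats may differ.
MCA_MEMORY_ERROR = re.compile(
    r"MCA: CPU (?P<cpu>\d+) (?P<kind>[A-Z]+)\b.*?"
    r"channel (?P<channel>\d+) memory error"
)


def find_memory_errors(lines):
    """Return one dict per log line that looks like an MCA memory error."""
    events = []
    for line in lines:
        m = MCA_MEMORY_ERROR.search(line)
        if m:
            events.append(
                {
                    "cpu": int(m["cpu"]),
                    "kind": m["kind"],
                    "channel": int(m["channel"]),
                }
            )
    return events


if __name__ == "__main__":
    sample = ["kernel: MCA: CPU 0 COR (2) OVER MS channel 3 memory error"]
    print(find_memory_errors(sample))
```

An empty result only means that nothing was logged. Step 5 rests on the platform actually reporting errors when they occur.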

I have somewhere between 50 and 100 servers (would need to check our inventory just now) and they are all humming along, and from time to time some server flags a memory error like the one cited above. That proves it's working, right?

Do I assess the quality of the fuel I pour into my car at the station? Or the voltage of the current the provider delivers to my household? Why would you doubt a piece of hardware from an established supplier regarding a documented and guaranteed feature? And why would you guess that you could do better than e.g. Supermicro with some hand-held shorting rig? They tested ECC and guarantee it's working. And they definitely can do that better than I can.

Still really puzzled by this thread.
Patrick
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I am at a complete loss about what you folks are trying to achieve/prove.

1. Buy server from Supermicro, Fujitsu, Lenovo, HPE, ... that has "ECC support" on the data sheet.
2. Fill the DIMM slots with "ECC memory".
3. Put the server in production and sleep well.
4. Whenever there's a memory error the system will log something like kernel: MCA: CPU 0 COR (2) OVER MS channel 3 memory error
5. When that does not happen, there's no memory error.

I have somewhere between 50 and 100 servers (would need to check our inventory just now) and they are all humming along, and from time to time some server flags a memory error like the one cited above. That proves it's working, right?

Do I assess the quality of the fuel I pour into my car at the station? Or the voltage of the current the provider delivers to my household? Why would you doubt a piece of hardware from an established supplier regarding a documented and guaranteed feature? And why would you guess that you could do better than e.g. Supermicro with some hand-held shorting rig? They tested ECC and guarantee it's working. And they definitely can do that better than I can.

Still really puzzled by this thread.
Patrick
Here are my personal reasons for going down this rabbit hole:

I bought an ASRock Rack motherboard, which has a well-known IPMI (Aspeed AST2500) and has ECC support on the data sheet. After testing it, however, I figured out that the IPMI is not capable of detecting ECC errors (PFEH). ECC errors can be detected by the OS, however. This makes it very clear that having ECC support on the data sheet is not always sufficient.
So then you're probably thinking "OK, but I actually meant that you need to stick to the limited list of well-working platforms". But I personally had my reasons to deviate from that too...
I wanted to have a silent, power-efficient NAS that is capable of real-time 4K HDR HEVC transcoding. When I bought my NAS a couple of years ago, Intel simply had no Xeons that could meet these requirements (either they were power hungry and impossible to silence, or they couldn't handle the transcoding), and the list of well-known working platforms contained only older Intel Xeon platforms.

Other good reasons I can think of are:
  • Some people want to avoid second-hand hardware, but don't want to pay the premium of a prebuilt TrueNAS server.
  • Some people already have hardware that might not be on the list of well-working platforms, but they want to try to put it to good use anyway.
  • Some people live in countries where there is no easy access to second-hand servers from the well-working list (the world is bigger than the US, EU and Asia).
Now I do agree that, most of the time, the reason people think they have to deviate from the well-working list is not sufficient. I understand that you try to push people to use well-working platforms as much as possible. If I had known how much time I'd spend validating ECC, I'm not sure that I would have done it...

But that is not the point here. What Diversity (I think) is trying to accomplish is that, now that we have figured out methods to reliably validate ECC functionality for unknown platforms, we can "enlarge" the list of well-working platforms.
I do see this as a welcome enhancement for this community. The larger this well-working list, the fewer people will feel the need to deviate from it...
Now if someone is financially invested in iXsystems, then I understand that this person would not welcome this idea, as they would want to push people as much as possible to buy the prebuilt servers.
But if you are in it for the community, then I do think this knowledge is very useful...
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,176

Patrick M. Hausen

Hall of Famer
Joined
Nov 25, 2013
Messages
7,740
I am not invested in iXsystems and I don't want to push anyone anywhere. It just really never occurred to me not to trust specs from reputable vendors. Whenever they do have a bad series of product X, people with a large electronics lab will find out and we will get to know.

I do have a data center with more than 50 servers, all of them with ECC memory. And the fact that one or two of them occasionally report errors was proof enough for me that it's working as intended.

Thanks for your thoughts. Again I am not criticising the endeavour per se, I am questioning the cost/effort to benefit ratio. I make a living from running servers. I don't have time to poke wires into DIMM sockets. You might call that tunnel vision :wink:
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Platform-First Error Handling... A very Orwellian way of saying "the system firmware won't report ECC errors".
I indeed depend on the OS for reporting my ECC errors. Luckily I did find a way to properly do this in TrueNAS:
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
I am not invested in iXsystems and I don't want to push anyone anywhere. It just really never occurred to me not to trust specs from reputable vendors. Whenever they do have a bad series of product X, people with a large electronics lab will find out and we will get to know.

I do have a data center with more than 50 servers, all of them with ECC memory. And the fact that one or two of them occasionally report errors was proof enough for me that it's working as intended.

Thanks for your thoughts. Again I am not criticising the endeavour per se, I am questioning the cost/effort to benefit ratio. I make a living from running servers. I don't have time to poke wires into DIMM sockets. You might call that tunnel vision :wink:
I am also aware that this non-working PFEH probably has more to do with me choosing a "server motherboard" for an AMD consumer platform. Probably ECC will work fine for most Xeon / EPYC servers. But the actual testing of this remains very useful in my opinion. For example for workstation platforms, like Threadripper, which are a bit in between consumer and server space.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,466
Now if you are financially invested in iXSystems
The only people on this forum who have a financial interest in iX are those with an iXSystems badge--and you'll notice that exactly zero such people have participated in this thread. Not only is this a red herring, it's a particularly silly and insulting red herring.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
The only people on this forum who have a financial interest in iX are those with an iXSystems badge--and you'll notice that exactly zero such people have participated in this thread. Not only is this a red herring, it's a particularly silly and insulting red herring.
I didn't mean to imply that any of you is financially invested in iX. I was simply saying that "IF you are, then I understand you could be opposed to this"...
Sorry if I insulted anyone...

edit: as everyone seems a bit "on edge" in this topic, I tried to reword my posts to be a bit more neutral...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
I didn't mean to imply that any of you is financially invested in iX. I was simply saying that "IF you are, then I understand you could be opposed to this"...
Sorry if I insulted anyone...

I am pretty sure that no one at iXsystems gives a crap either. iXsystems is a small fish in a huge IT aquarium, and even "financial investment" in iXsystems isn't going to matter. They're a systems integrator for Supermicro hardware, and they take advantage of available hardware and they wrote TrueNAS to encourage sales and enter a new market. They work with what's available to create products for enterprise and businesses, which is one of the few IT markets with deep-ish pockets able to afford a product like TrueNAS.

Their motive for providing FreeNAS was to take advantage of the userbase as beta testers, not to sell you their pricey enterprise-targeted servers. When FreeNAS got good enough that people were howling for iX to release a consumer NAS, they came out with the Mini, which is a niche product for those who had money but perhaps not the skill or time to build their own. iX is NOT getting rich off the Minis and indeed may be regretting the choice, given the Avoton debacle, etc. Making wild accusations of financial motives is basically tinfoil hat grade paranoia. iXsystems has always made it clear that you can run this on whatever you want, and they aren't really going to support it, except insofar as such support also improves the TrueNAS enterprise product -- supporting FreeNAS is what the forum is all about. Here on the forum, we're tasked with cleaning up the resulting mess of that hands-off minimal-support policy, and things were so bad here in the early days that I began a hard push towards true server grade hardware, and others such as Cyberjock enthusiastically took that up for all the obvious reasons.

The accusations of ulterior motives here are ridiculous. No one in this forum is going to make a million dollars off of crippling ECC. We're simply dealing with the state of affairs as it exists, and advocating best courses of action. Of course it all sucks. PCs have sucked since they were invented. But if you can't focus on adding value to this discussion and instead insist on spreading this sort of baseless paranoia, this thread and any others like it can and will be closed by the moderation team. I've spent some significant time in this thread trying to provide a clearer perspective on all these issues.

There is no cabal printing money based on the sale of faux-ECC-support. Nobody here is opposed to ECC support. You just need to understand that good support for ECC requires a lot of work at the CPU and mainboard level, and vendors other than HP, Dell, Supermicro, and IBM may not have put in that work.
 

Mastakilla

Patron
Joined
Jul 18, 2019
Messages
202
Thanks for the good clarification jgreco... Much appreciated. I agree with all, except for the implication that I'm trying to accuse someone of something. I tried very much to make clear that was not my intention. And I also hope you recognize that I do try to add value...
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,681
Thanks for the good clarification jgreco... Much appreciated. I agree with all, except for the implication that I'm trying to accuse someone of something. I tried very much to make clear that was not my intention. And I also hope you recognize that I do try to add value...

There have been several implications of impropriety by multiple posters in this thread. Even saying "I'm not accusing someone of X" often comes across as an accusation of X. There is no reason to think that there is some financial motivation by iX or people in this thread for or against ECC. We all basically have to live with the CPU support that Intel and AMD have "graced" us with (sarcastic airquotes most definitely intended, hopefully you can grin at that), and, for better or for worse, all the mainboard manufacturers have relatively transparent and relatively obvious reasons for their level of proficiency or deficiency supporting ECC.

Speaking as someone who does this professionally, and who is a high volume poster, and who was the driving force behind significant pushes towards server gear on these forums, I am happy to report that my garage is full of supercars, purchased with my winnings from advocating for ECC. Oh, wait, no it isn't. The "financial" benefit I've gotten from this gig is a box of leftover iX conference swag, some free mugs and shirts from the iX online store, and maybe a few other trinkets...? So I may resent these various implications, because I'm sure that the thousands of hours I've put in here may be slanted, but I try hard to make my reasoning for things clear, and spend a lot of time explaining stuff. Often repeatedly.

And I also hope you recognize that I do try to add value...

To the extent that a discussion is productive and not just repetitive, that's fantastic. I am absolutely the worst at thread hijacking in some random direction, but as long as people find it useful and interesting, it is not wasted. The ECC testing thing feels like it has been hashed out to near death, though, and basically there's no reliable software test (a bit I was hammering at @diversity awhile ago) and that leaves hardware tests. What's needed is for discussions to not be handwavey "what if" "but but" stuff. I've already put forth that the normal solutions would be to make use of an existing known bad DIMM (something some of us like to keep around when we find one), or to solder some leads onto a low density cheap DIMM and inject some errors manually. I guess if you can figure out some way to "3D print" a clip that can reliably make contact to do a similar thing, that'd be fine, but I'm a bit skeptical. Either way, that's still an electrical injection of errors. My way is guaranteed to work, and to work for every generation of mainboard/RAM.
 