SOLVED The usefulness of ECC (if we can't assess it's working)?

diversity · Apr 1, 2020

Given that ECC functionality depends on several components working well together (e.g. cpu, mobo, mem) there are many things that can go wrong resulting in a user detectable lack of ECC support.

However, it's also far too easy to get a false positive for ECC functionality. This might well mean that a large percentage of all the ECC 'enabled' systems in the industry actually are not, whether dead on arrival or failing over time.

What I am worried about is the lack of a reliable way of testing ECC functionality in the industry. And by functionality I mean ECC error correction, reporting and, most particularly, injection.

Without ECC injection, testing ECC correction is only possible for a small subset of power users who are able to deliberately cause memory errors and so see whether ECC reporting and correction are functional. However, this is not something that can be automated as part of a scheduled health check.

I consider ECC reporting (and a way to test that it is still working) a requirement for being able to preemptively replace memory that is about to go bad.

I am asking for the opinion of the community, and most notably senior technicians @ixsystems, regarding this stance, because I am quite stuck now, not daring to proceed with a mission-critical project.

diversity · Apr 2, 2020

This thread is a more expanded version of

ECC reporting important for FreeNAS and ZFS?

LS, My Setup: FreeNAS 11.3-Release; CPU: AMD Ryzen 9 3950x; Mobo: ASrock x570 Creator (latest BIOS 2.10, ECC enabled in settings); Memory: 4 x Kingston KSM26ED8/16ME (ECC) (64 GB total). According to AMD the AM4 socket does support ECC but it does not support ECC error reporting function. Is...

www.ixsystems.com

I am worried about why this subject is getting no attention.

If my stance is indeed defensible, will things get better once I buy a preconfigured setup from ixsystems?

jgreco · Apr 2, 2020

There aren't "senior technicians @ixsystems" who participate here, particularly not any who might have such insights.

Speaking as someone who has built a lot of servers professionally over many years, my take on it is that you simply need to buy server-grade components from a company that specializes in it, that short list defined as Supermicro, Dell, or HP, who have made this an enterprise hardware feature and who have a proven track record of being able to do it. My concerns are more along the lines of proper notification, because even if you detect the event, getting notified requires support to be set up and configured in the IPMI.

With the advent of virtualization, we don't build as many servers here as we used to anymore, but it still seems to be a valid observation that the vast majority of failures happen early on in a server's life. Once memory is validated and passes the thousand hour mark, failures are not impossible, but are definitely much more unusual.

diversity · Apr 2, 2020

Thank you jgreco for your response.

Please allow me to politely refuse your premise 'buy server grade and it will 100% work' if that is what you meant.

I am talking about the ability to assess whether it actually does work or not.

Even server grade hardware can have defects if I am not mistaken.

jgreco · Apr 2, 2020

Okay, then, let me be a little more blunt. What you are asking for doesn't exist in any comprehensive, guarantee-able manner outside of the engineering labs at companies that integrate these things. I already gave you the shortlist of the vendors I thought were probably doing this in a non-halfarse manner.

As someone who has built servers in the ~thousands tier, I can tell you that I've seen sticks of bad memory with low error thresholds and we keep some of them around for those times we need to do testing. That's about the closest us mere mortals can get to validating ECC systems. If I'm skeptical about a platform, I will dig them up and see what happens. But there's no real way to do ongoing testing. And there's little point, either. The usual question is "was the platform designed correctly" and you can check that with a stick of bad RAM.

diversity · Apr 3, 2020

Thank you jgreco, I hope you trust I mean no disrespect either to you personally, the community, ixsystems or FreeNAS.

I am just hoping things will get better in the future for us mere mortals thus I am shaking the tree.

I tried searching for a Memory Error Injection Module, something along the lines of a RAM stick with additional functionality. No luck yet. Can you please have a stab at this as well and ask the team to also have a look?

If it does not exist yet, then how would one feel about a Kickstarter project? I think FreeNAS and its users can benefit from the peace of mind such a module would bring.
I will admit though that I have absolutely no idea how to set that up and what it entails. But we could learn.

jgreco · Apr 3, 2020

Stuff to do this general kind of testing exists, but it is highly specialized, which is a fancy way of saying basically unobtainium -- when you want to buy something for which the product run is maybe only a few hundred pieces, the company manufacturing such a specialty device covers their costs to design and fabricate it by jacking the price up. I expect that with the reduction of memory suppliers these days a lot of this stuff is done in-house and so you might not even find this outside the engineering test labs.

Part of the problem here is that I think you have a goal of simply being able to test ECC functionality, but to be a useful engineering tool, a product to do this would need to be able to simulate a variety of errors and issues. To make a long story short, I wouldn't expect to find what you're looking for.

Yorick · Apr 3, 2020

jgreco said:
Speaking as someone who has built a lot of servers professionally over many years, my take on it is that you simply need to buy server-grade components from a company that specializes in it, that short list defined as Supermicro, Dell, or HP, who have made this an enterprise hardware feature and who have a proven track record of being able to do it.

Please do add Cisco to that list. The server arm of Cisco is very, very serious about their gear, and they've got a track record of making quality servers and quality management software to go with it.

Yorick · Apr 3, 2020

@diversity Can you explain your specific use case and desired workflow a little? While what you are asking for doesn't exist, there are a few things you can do.

1) Set up your OS so it reports ECC errors to you. Set it up so it halts on dual error, and informs on single error. "Halt on dual error" is very application-specific. Is it better to keep running, with an unrecoverable memory error, or better to shut down immediately lest some data get corrupted? That really depends. You will want some form of notification for errors. How you set that up is OS-specific.

2) You can force ECC errors on consumer-grade hardware by tightening timings. This is discussed in great detail at https://hardwarecanucks.com/forum/threads/ecc-memory-amds-ryzen-a-deep-dive.75030/. This will allow you to verify that the above steps work on your OS of choice: You get notifications on recoverable and unrecoverable errors, and the OS does / does not halt, as configured by you, on unrecoverable errors. I'd use an ASrock Ryzen setup for this testing, it's affordable and ASrock has a reputation for taking their ECC code in BIOS seriously.

2b) Edit - there is one other way to do this, on the target hardware for 3). You'll need to destroy a DIMM to make it happen. You solder wires onto your DIMM and force-inject errors, in a bid to test your OS response to those errors on the hardware you'll use in production. When you are satisfied your reporting and OS response is solid, replace the butchered DIMM with a healthy one. See https://serverfault.com/questions/762186/how-to-force-ecc-error for that.

3) Now that you know that you can trust the reporting, go to server grade hardware, and sleep soundly in the knowledge that a) you probably will never see a significant amount of errors, as you are now on "proper" hardware and b) if a stick should go bad, you know your OS will alert you, because you tested the OS configuration on consumer grade hardware.
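
The reporting policy from step 1 (inform on a single-bit error, optionally halt on a dual-bit error) can be sketched as a small decision function. This is only an illustration of the policy, not of any OS mechanism: the names `Action` and `decide` and the `halt_on_uncorrectable` flag are hypothetical, and the error counts are assumed to come from whatever your OS exposes.

```python
from enum import Enum


class Action(Enum):
    NONE = "none"
    NOTIFY = "notify"
    NOTIFY_AND_HALT = "notify_and_halt"


def decide(correctable: int, uncorrectable: int,
           halt_on_uncorrectable: bool = True) -> Action:
    """Map new error counts to an action.

    correctable   -- new single-bit (corrected) errors since the last check
    uncorrectable -- new multi-bit (uncorrected) errors since the last check
    """
    if uncorrectable > 0:
        # Whether to halt here is application-specific, hence the flag.
        if halt_on_uncorrectable:
            return Action.NOTIFY_AND_HALT
        return Action.NOTIFY
    if correctable > 0:
        # The error was corrected; the operator should still hear about it.
        return Action.NOTIFY
    return Action.NONE
```

Keeping the halt decision behind a flag mirrors the point that "halt on dual error" depends on the application.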

Redcoat · Apr 3, 2020

Yorick said:
ASrock has a reputation for taking their ECC code in BIOS seriously

Back in 2017 I was chasing a performance issue on my FreeNAS Mini with ASrock C2750 mobo. I was not seeing any response to ECC error injection and took it up with Passmark, sending them a debug dump. Their response:

"Got the log. Your hardware supports ECC injection. So no problem there.
But the DRAM Control Operation (DCO) register is showing that the ECC injection has been disabled in your BIOS firmware. You may want to check your BIOS setup to see if there is an option to enable ECC injection. Otherwise, you would need to flash a custom BIOS to prevent the ECC injection feature from being disabled.
In the next release of MemTest86 we'll decode the bits in the DCO registers to provide a clearer indication in the log that the injection feature is disabled in BIOS. That message will be,
**Warning** DRPLOCK is set to 1. DRPLOCK must be cleared by the BIOS to enable error injection".

I offer this not to dispute any previously made position, just to provide another possible data point in @diversity 's quest. BIOS firmware choices by the manufacturer can negate the test even on an ECC-capable server board.

Yorick · Apr 3, 2020

Thanks, that's interesting! Memtest86 would not help with testing the OS response and notification functions, but it's interesting nonetheless. There's a thread over yonder that talks about the injection lock on Intel: https://www.passmark.com/forum/memtest86/5984-how-do-you-verify-ecc-error-injection-working

Re ASRock, my comment was specific to Ryzen. Not all Ryzen motherboard vendors support dual-bit error detection, which makes it rather hard to test the OS response to it. That's not meant as hardware for production use, of course. Production use would be "proper" server hardware.

diversity · Apr 3, 2020

Thx All for contributing. I hope more will join to get to a consensus.

I tried shaking up the hen house because I am looking to validate my stance on ECC.
It is more of a philosophical point of view.

I for one am on board with ECC and will keep on trying to actually have it functioning.
It frustrates me that I am still not able to actually know whether it is. In essence, should that not mean that most of us, even the companies that paid top dollar for their setups, also don't know?
This is the core of what I would like one's opinion about. When you think about it, it looks really logical.

@Yorick So there is no real intent or workflow per se. It is more about opening up a discussion on the philosophy behind ECC and potentially also warning people who feel safe at the moment that they might not be. I mean, there is no humanly friendly way to tell, so it seems.
However, what I meant by mission critical project is to have 3 x (3 x 3TB mirrored) spread and synced across the globe, with data that is irreplaceable to me.

Yorick · Apr 8, 2020

There is absolutely a humanly friendly way to tell. Set up your system so it alerts you on ECC error; rest well in the knowledge that functionality has been tested, just like SMART error alerting and every other type of alert. To hammer that point home, you don't cause your hard drive to fail to verify that SMART alerting and checksum alerting works; you trust that that's been tested and you'll get an email alert if something goes wrong.

A good question would be: Does the FreeNAS alert daemon alert on ECC errors, recoverable and not? If not, that'd be a good feature to add.

Stage5-F100 · Apr 8, 2020

Yorick said:
Set up your OS so it reports ECC errors to you

Your writeup was fantastic and I learned a lot. However, searching this exact subject is giving me a lot of confusing results.
Is there any concrete way to set up my OSes on ECC machines (FreeNAS, macOS, Manjaro, and Windows 10 Professional) to report/halt on errors?

I'd imagine it'd be entirely different per OS, and assume that FreeNAS does this by default... but wanted to ask.

Edit: Just saw your followup:

Does the FreeNAS alert daemon alert on ECC errors, recoverable and not?

And I have the same question, per above! :)

Yorick · Apr 8, 2020

FreeNAS does not yet alert, though Event Log through IPMI will show it of course. FreeNAS will alert on ECC error in a future version, that's tracked here: https://jira.ixsystems.com/browse/NAS-105287

That's great, I am sure it'll give peace of mind to people to know they'll get an alert.
Edit: For ixSystems hardware only, which makes sense. So, there is a small project - a set of scripts to run mcelog and send alerts on ECC error.

On Windows, you can configure it so it predicts page failure, that's described at https://docs.microsoft.com/en-us/windows-hardware/drivers/whea/predictive-failure-analysis--pfa-

Event Log has ECC errors, so you'd need an Event Log monitoring utility to alert you. A quick google finds this for example: https://www.netwrix.com/netwrix_event_log_manager.html

MacOS I don't know; Linux likewise, something needs to monitor for the ECC errors which will show up via edac or in logs.
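
On Linux, the "something" that monitors edac can be quite small. Below is a minimal sketch, assuming an EDAC driver is loaded and exposes per-controller `ce_count` (corrected) and `ue_count` (uncorrected) files under `/sys/devices/system/edac/mc/mcN/`. The function names `read_edac_counts` and `new_errors` are made up for this example, and how you schedule the polling and deliver the alert is left open.

```python
from pathlib import Path
from typing import Dict, Tuple

# Assumed location of the EDAC memory-controller counters in sysfs.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")

Counts = Dict[str, Tuple[int, int]]  # controller -> (corrected, uncorrected)


def read_edac_counts(root: Path = EDAC_ROOT) -> Counts:
    """Read corrected/uncorrected error counters for every mcN directory."""
    counts: Counts = {}
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers without readable counters
        counts[mc.name] = (ce, ue)
    return counts


def new_errors(previous: Counts, current: Counts) -> Counts:
    """Return per-controller increases since the previous poll."""
    changed: Counts = {}
    for name, (ce, ue) in current.items():
        old_ce, old_ue = previous.get(name, (0, 0))
        # Counters restart from zero after a reboot, so clamp negative deltas.
        delta = (max(0, ce - old_ce), max(0, ue - old_ue))
        if delta != (0, 0):
            changed[name] = delta
    return changed
```

A cron job or small daemon could call `read_edac_counts()` periodically, keep the previous result, and send a notification whenever `new_errors()` returns a non-empty result. An empty result from `read_edac_counts()` on a machine that is supposed to have ECC is itself worth investigating.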

diversity · Apr 9, 2020

Yorick said:
There is absolutely a humanly friendly way to tell. Set up your system so it alerts you on ECC error; rest well in the knowledge that functionality has been tested, just like SMART error alerting and every other type of alert. To hammer that point home, you don't cause your hard drive to fail to verify that SMART alerting and checksum alerting works; you trust that that's been tested and you'll get an email alert if something goes wrong.

A good question would be: Does the FreeNAS alert daemon alert on ECC errors, recoverable and not? If not, that'd be a good feature to add.

Very interesting point indeed. I started my quest on ECC some time after FreeNAS informed me that one of my drives had failed. Back then I was not as critical as I am now, but it did make me wonder why I had never ever seen any ECC reports whatsoever, hence my quest.

Since I have a failed drive lying around, it will be easy for me to check SMART functionality. But now I think I should bite down hard and also wish there was a way to test SMART functionality. I wish you had not opened my eyes ;) my head hurts now.

I have another idea. I will loan my failed drive free of cost (excluding postage) to anyone who needs it to check SMART reporting functionality.

Is there someone out there who can loan me a failed unbuffered ECC stick (any size up to 32 GB will do)? I will pay for postage.

diversity · Apr 9, 2020

Yorick said:
FreeNAS does not yet alert, though Event Log through IPMI will show it of course. FreeNAS will alert on ECC error in a future version, that's tracked here: https://jira.ixsystems.com/browse/NAS-105287

Never mind the whole idea of loaning out faulty hardware. I will admit that I am really (and I am going through something like the stages of grief now) surprised, shocked, disappointed and now very, very sad to even hear that.
I mean, now I am coming full circle on the whole premise of this post. How is this not a standard feature from the get go like SMART reporting is?

Patrick M. Hausen · Apr 9, 2020

My FreeBSD servers do inform me of correctable ECC errors:

Code:

Nov 19 11:48:06 ph002 kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error
Nov 19 11:48:06 ph002 kernel: MCA: Address 0x1f709a48c0
Nov 19 11:48:06 ph002 kernel: MCA: Misc 0x90010000040188c
Nov 19 11:48:06 ph002 kernel: MCA: Bank 12, Status 0xcc00010c000800c3
Nov 19 11:48:06 ph002 kernel: MCA: Global Cap 0x0000000007000c16, Status 0x0000000000000000
Nov 19 11:48:06 ph002 kernel: MCA: Vendor "GenuineIntel", ID 0x406f1, APIC ID 0

How to trigger alerts from that is left as an exercise to the reader ;)
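
As one possible take on that exercise, here is a minimal sketch that picks the memory-error summary lines out of kernel log text in the format shown above. The regular expression is derived only from the sample lines in this post. The `UNCOR` keyword for uncorrected errors is an assumption, and the names `McaMemoryEvent` and `find_memory_events` are made up for this example.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Matches e.g. "... kernel: MCA: CPU 0 COR (5) OVER MS channel 3 memory error".
# "UNCOR" is assumed to be the marker for uncorrected errors.
SUMMARY_RE = re.compile(
    r"MCA: CPU (?P<cpu>\d+) (?P<kind>UNCOR|COR)"
    r"(?: \((?P<count>\d+)\))?.*memory error"
)


class McaMemoryEvent(NamedTuple):
    cpu: int
    corrected: bool
    count: Optional[int]  # the number in parentheses, if present
    raw: str              # the original log line


def find_memory_events(lines: Iterable[str]) -> List[McaMemoryEvent]:
    """Return one event per MCA memory-error summary line."""
    events: List[McaMemoryEvent] = []
    for line in lines:
        match = SUMMARY_RE.search(line)
        if match is None:
            continue  # Address/Misc/Bank detail lines and unrelated messages
        count = match.group("count")
        events.append(McaMemoryEvent(
            cpu=int(match.group("cpu")),
            corrected=match.group("kind") == "COR",
            count=int(count) if count is not None else None,
            raw=line.rstrip("\n"),
        ))
    return events
```

A periodic job could run this over new lines from the system log (its location depends on your syslog configuration) and hand any returned events to whatever notification mechanism you already use.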

jgreco · Apr 9, 2020

diversity said:
I mean, now I am coming full circle on the whole premise of this post. How is this not a standard feature from the get go like SMART reporting is?

Probably because it's extremely platform-dependent and there are many different ways to do it, if it's even available. This is like the dumb "why aren't temperatures reliably reported" thread from years ago. It sounds like it ought to be simple to do, but in practice is a bit of a nightmare. If you're willing to settle for doing it on just one or two platforms whose behaviour is known, that's a lot easier.

Yorick · Apr 9, 2020

Yeah, the reporting is there in logs. Middleware to alert on it in a meaningful way is extremely hardware-dependent. This is why, in the server world, you see this packaged up NOT by the OS, but by a server vendor app. The server vendor app hooks into the OS-provided IPMI interface and logs, knows how to make sense of what is where, and gives meaningful data back to the user: Email alert, DIMM 32GiB in Row C, CPU socket A, has gone bad.

FreeNAS is both middleware and OS (FreeBSD). The middleware layer can, reasonably, be taught how an ixSystems box works. But to expect that for every combination of motherboard and BIOS and memory out there - that is not reasonable. Difference between TrueNAS Enterprise and TrueNAS Core, right there.

Now, I do wonder: Could there be a community effort to contribute to the middleware layer for SuperMicro boards, since those are so well-liked and commonly used by those who are "a little more serious" about their setup? TrueNAS Core will never be Enterprise grade, just because it doesn't have true support behind it; but strengthening it for SMB use -- okay more S than M :) - could be a thing. Community-driven, as everything in TrueNAS Core / FreeNAS.

Edit: I asked the question, let's see what we hear back. "One, does the alerting middleware currently alert on ECC memory errors, even if with a caveat of "you will need to check IPMI to see which DIMM failed"? If not, this would be very helpful to have. Any concerns about implementing something like that?"

I think for TrueNAS Core, an email alert akin to the above would be completely reasonable. "We've had an ECC memory alert, here's the raw data. We can't meaningfully map this to a DIMM slot, but your motherboard vendor can. Please check IPMI for ECC errors and see which DIMM went bad."
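
To make that concrete, here is a rough sketch of what composing such a generic alert could look like, using only Python's standard library. It is not how the FreeNAS middleware does (or would do) it. The function name `build_ecc_alert` and its parameters are hypothetical, and the raw lines are assumed to come from some log or counter check elsewhere.

```python
from email.message import EmailMessage
from typing import Iterable


def build_ecc_alert(hostname: str, raw_lines: Iterable[str],
                    sender: str, recipient: str) -> EmailMessage:
    """Build (but do not send) a plain-text ECC alert email."""
    msg = EmailMessage()
    msg["Subject"] = f"ECC memory alert on {hostname}"
    msg["From"] = sender
    msg["To"] = recipient
    body = "\n".join([
        "We've had an ECC memory alert, here's the raw data:",
        "",
        *raw_lines,
        "",
        "We can't meaningfully map this to a DIMM slot, but your motherboard",
        "vendor can. Please check IPMI for ECC errors and see which DIMM went bad.",
    ])
    msg.set_content(body)
    return msg
```

The returned message can be handed to `smtplib.SMTP(...).send_message(msg)` or any other mail path you already trust. Building and sending are kept separate so the message text can be checked without a mail server.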

Edit 2: Pro-active shutdown of any "but I don't have an IPMI or it's not hooked up" shenanigans: Boards without IPMI or setups without IPMI hooked up are not taking server maintenance seriously, and have no room to kvetch about ECC errors. There is plenty of room to not have an IPMI and run FreeNAS, it spans the use case gamut from hobbyist to prosumer to SMB. Just be okay with the choices made. "I don't have an IPMI" means "I am okay with chasing down hardware issues the hard way", and that's an okay choice to make, as long as it's a deliberate choice.

Edit 3: There may be an even easier way. My SM IPMI can send me email, or issue an SNMP trap, when an alert is detected. I'd assume ECC error is part of that, though I am not 100% on that. Someone with an SM board who happens to know whether these alerts get sent out on ECC failure, please chime in.

Edit 4: ASRock Rack does, this is from STH: "I can't speak for SuperMicro but my ASRock Rack board does and that's how I knew to replace one of my defective modules. I imagine SuperMicro ipmi system alerts would do the same."
