Actual testing of ECC reporting / logging
As Jgreco mentioned earlier in this thread, the actual error correction is only part of ECC functionality. I now understand that reporting / logging the occurrence of corrected single-bit-errors and uncorrected multi-bit-errors is at least as important, because you need to know if a DIMM is dying, so you can replace it before it causes data corruption.
That is why I didn't stop here: I've also tried to trigger memory errors to see whether they are handled correctly.
In the beginning I had a lot of issues with achieving unstable, but bootable, memory settings. Either it would be stable or it wouldn't boot at all (and require a CMOS reset). After a lot of testing, the trick turned out to be finding the highest bootable setting and then lowering the voltage until it became unstable.
And I've done a lot more testing since then :)
To give you an idea, here's the Excel I've used to keep track of my testing:
I’ve tested from hardly bootable to slightly unstable using MemTest86, memtester (on Fedora Rawhide with kernel 5.4.0.0.rc3 and 5.4.0.2) and prime95/aida64_bench/Ryzen_Master_test (on a fully updated Windows 10 Pro, first with amd_software_1.09.27.1033.zip and later with amd_chipset_software_1.11.22.454.zip chipset drivers).
In the meantime, I’ve had millions of memory errors (in total) in very varied conditions. It seems almost impossible to me that there was not a single single-bit-error or two-bit-error among all these millions of errors.
But… Unfortunately I couldn’t find any report of a corrected or logged memory error in either the IPMI Event Log, the Linux edac-util or the Windows Event Viewer (even though all of these report ECC to be active and correctly configured - see my posts above).
Now I know that doesn’t mean that no single-bit-error corrections have happened, but correcting is only half of what ECC functionality is. Reporting / logging these corrections is at least as important as the actual correcting itself (how else can you know your RAM is dying or unstable?).
So it seems to me that ECC is not working correctly on this motherboard with a Ryzen 3000 CPU (I don't have the older Ryzen CPUs for testing).
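For anyone who wants to double-check the Linux side without edac-util: as far as I know, the kernel's EDAC driver exposes per-controller counters in sysfs. Below is a minimal sketch that reads them. It assumes the EDAC driver for your memory controller is loaded and that the counters live in /sys/devices/system/edac/mc/mcN/ce_count and ue_count. The function name read_edac_counts is just something I made up for illustration.

```python
from pathlib import Path


def read_edac_counts(edac_root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} read from EDAC sysfs.

    Assumption: each memory controller directory (mc0, mc1, ...) contains
    the files ce_count (corrected errors) and ue_count (uncorrected errors).
    """
    counts = {}
    root = Path(edac_root)
    if not root.is_dir():
        # EDAC driver not loaded, or not a Linux system
        return counts
    for mc in sorted(root.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            # Skip controllers whose counters are missing or unreadable
            continue
        counts[mc.name] = (ce, ue)
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        print("No EDAC memory controllers found (is the EDAC driver loaded?)")
    for name, (ce, ue) in counts.items():
        print(f"{name}: corrected={ce} uncorrected={ue}")
```

If both counters stay at 0 while a memory test is reporting errors, that matches what I'm seeing: the errors are not being reported to the OS.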
I've reported this to Asrock Rack and they've sent me the following response:
Dear Mastakilla,
Due to X470 belongs to desktop series
It’s not like server MB has native support of ECC report.
We are checking with RD and AMD if X470 can support ECC report.
We will reply to you ASAP
Best regards,
Kevin
Asrock Rack Incorporation
I've replied to this with:
Hi Kevin,
Thanks a lot for looking into this! That is greatly appreciated…
I understand that the X470 is indeed a desktop chipset. Also all AM4 CPUs don’t have officially validated ECC support by AMD (although AMD confirmed that it wasn’t disabled).
So you could argue that non-validated half-working (not reporting / logging) ECC support is acceptable. And I also agree with that, for consumer brands like Asrock, Asus, MSI, etc.
However, if a brand like Asrock Rack or SuperMicro creates a X470 motherboard with “Supports 4x DDR4 ECC and non-ECC UDIMM, max. 128 GB” in the specifications and if the IPMI Event Log contains sensors for “DRAM ECC Error A1/A2/B1/B2”, then people (like myself) will assume that it is actually working and validated. In that case, I don’t think that it is acceptable for it not to work 100%, as people buying these brands, actually are expecting it to fully work. I don’t think that is a reputation or name you are looking for, as a brand called “Asrock Rack”
Please let me know if there is anything else I can do to assist.
Kind regards,
Mastakilla
The response from Asrock Rack seems to admit that the board currently does not fully support ECC. However, it could also just mean that Kevin is not sure about it... So I'm hoping for a decent response from their R&D.
FreeNAS specific questions
Now I do wonder what you FreeNAS gurus consider the correct / desired way for ECC to work...
Wendell (owner of Level1Techs.com) explained to me that there are a couple of ways ECC can be implemented (in quotes below):
Platform First
"The hardware does what it can to recover from the error silently. It’s not necessary logged."
I suppose this would be the case where the errors are logged in the IPMI Event Log, which could then send an email to me to report the issue. In that case, I suppose, it isn't even required that FreeNAS or FreeBSD has support for Ryzen 3000 in its kernel, as it will work just as well without it. I do find it strange that he said "It’s not necessary logged". I've sent him a follow-up question on that...
OS First
"The error is forwarded to the os for it to decide what to do and the hardware does not halt, panic or send a non maskable interrupt."
For this to work properly, I guess you need a couple of things:
- Full Ryzen 3000 support in the kernel (I guess this could take some time)
- Proper handling of the discovered memory errors by FreeNAS, so that the user is notified of them. I haven't yet found out whether this is properly set up by default or whether specific user configuration is required to make this work.
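To give a rough idea of what the second point could look like on FreeBSD: as far as I know, the kernel reports machine check events (which is how memory errors would arrive) as lines containing "MCA:" in the system log. The sketch below filters a log for such lines. The "MCA:" marker and the /var/log/messages path are my assumptions, the function name find_mca_lines is made up, and this is not how FreeNAS actually handles it (that is exactly what I'm asking about).

```python
import re

# Assumption: FreeBSD prints machine check records with an "MCA:" marker.
MCA_MARKER = re.compile(r"\bMCA:")


def find_mca_lines(log_text):
    """Return the lines of log_text that look like machine check records."""
    return [line for line in log_text.splitlines() if MCA_MARKER.search(line)]


if __name__ == "__main__":
    log_path = "/var/log/messages"  # assumed location of the system log
    try:
        with open(log_path, errors="replace") as f:
            hits = find_mca_lines(f.read())
    except OSError as exc:
        print(f"Could not read {log_path}: {exc}")
    else:
        if hits:
            print(f"{len(hits)} possible machine check record(s):")
            for line in hits:
                print(line)
        else:
            print("No machine check records found in the log")
```

Something like this could run periodically and mail the result, but it only helps if the kernel actually receives and logs the errors in the first place.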
I saw that Asus and Gigabyte both have X570 mobos with explicit ECC support. Gigabyte even explicitly mentions Ryzen 3000; Asus is less clear on that. But as these mobos don't have IPMI, I suppose that they must be "OS First" instead of "Platform First", and as far as I could tell, that is no solution for FreeNAS, as the OS probably does not fully support the Ryzen 3000 yet...