BUILD Working AMD "prosumer" Build

cyberjock · Nov 3, 2014

Your numbers are out of order. ;)

But, assuming there is a way to inject ECC errors like you think (I didn't try to read the link... I'm at work and busy at the moment), a Python file that causes those exact errors would be appreciated by many people. ;)

DJABE · Nov 3, 2014

I don't believe there is a way to simulate a "real" ECC error, i.e. a bit flip. One would need to 'enter' the memory after sending data and change one bit... it's been discussed here already.
BUT if there is a way... really... it would be nice to try it, like cyberjock said :)

Ericloewe · Nov 3, 2014

DJABE said:
I don't believe there is a way to simulate a "real" ECC error, i.e. a bit flip. One would need to 'enter' the memory after sending data and change one bit... it's been discussed here already.
BUT if there is a way... really... it would be nice to try it, like cyberjock said :)

If the memory controller supports it, it is theoretically possible to write an "incorrect" value to RAM, which could then be read back.

cyberjock · Nov 3, 2014

Ericloewe said:
If the memory controller supports it, it is theoretically possible to write an "incorrect" value to RAM, which could then be read back.

And that's the problem. Everything I've read and heard from others investigating is that the memory controller doesn't support it. That's why I threw down the gauntlet and said "python code" or bust. Willing to bet it doesn't do what he thinks it does. ;)

DJABE · Nov 3, 2014

If the memory controller supported such a 'feature', it would be prone to cyber attacks (viruses, worms, etc.). CPU manufacturers would never allow such a huge security hole...

Ericloewe · Nov 3, 2014

DJABE said:
If the memory controller supported such a 'feature', it would be prone to cyber attacks (viruses, worms, etc.). CPU manufacturers would never allow such a huge security hole...

Not really, beyond an (easily detectable) DoS attack, if the controller were properly designed.

Helper app requests memory
Enter diagnostic mode
Write bad data
Exit diagnostic mode
Read data
Report ECC failure or lack thereof
Memory is returned to the heap
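
For illustration, here is a minimal Python sketch of that sequence against a toy, in-memory "controller". Everything in it is hypothetical (ToyController, run_injection_test, and the CRC that stands in for a real ECC code); it only mirrors the steps listed above and does not touch real hardware.

```python
import zlib


class ToyController:
    """Stand-in for an ECC memory controller. Not a model of any real hardware."""

    def __init__(self):
        self.cells = {}          # address -> (stored bytes, stored check value)
        self.inject_mask = None  # non-None only while in "diagnostic mode"

    def enter_diagnostic(self, mask):
        self.inject_mask = mask

    def exit_diagnostic(self):
        self.inject_mask = None

    def write(self, addr, data):
        # The check value is always computed from the data the caller intended.
        check = zlib.crc32(data)
        if self.inject_mask is not None and data:
            corrupted = bytearray(data)
            corrupted[0] ^= self.inject_mask & 0xFF  # flip bits in the first byte
            data = bytes(corrupted)
        self.cells[addr] = (data, check)

    def read(self, addr):
        data, check = self.cells[addr]
        return data, zlib.crc32(data) == check


def run_injection_test(ctrl, addr=0x1000, payload=b"\x00" * 16, mask=0x01):
    """Return True if the controller noticed the injected error."""
    ctrl.enter_diagnostic(mask)   # enter diagnostic mode
    ctrl.write(addr, payload)     # write bad data
    ctrl.exit_diagnostic()        # exit diagnostic mode
    _, ok = ctrl.read(addr)       # read data
    del ctrl.cells[addr]          # memory is returned
    return not ok                 # report ECC failure or lack thereof
```

With a mask of 0 nothing is corrupted, so the same routine doubles as a sanity check that the toy controller does not report errors on clean data.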

cyberjock · Nov 3, 2014

DJABE said:
If the memory controller supports such 'feature', it would be prone to cyber attacks (viruses, worms, etc..). And such a huge security hole CPU manufactures would never allow...

Precisely.

anodos · Nov 3, 2014

cyberjock said:
Precisely.

Here's some fun reading regarding ECC error injection on UltraSparc-II. http://queue.acm.org/detail.cfm?id=1839574 The beauty of having hardware and software guys working together. :)

Unfortunately, creating something to inject ECC errors would probably require cooperation from AMD hardware engineers, and the lack of cooperation is probably the root cause of the whole AMD compatibility mess that we have right now.

edit: forgot the link

DJABE · Nov 3, 2014

@anodos: OK, let's just forget about AMD for a second, what about Intel? Any proof of concept would be welcome :)

cyberjock · Nov 3, 2014

There's a command somewhere. If you search the forums, there are a few good threads with very detailed info on ECC and such. Unfortunately I don't have the time to search for it, but if you search for "dmidecode" you should be able to find the thread.

DJABE · Nov 3, 2014

Yes, I know about dmidecode and the little C program that can check ECC support, but only on newer Intel chips.

Urs · Nov 3, 2014

I see you haven't read the document that I linked...
There is even an example program, because it's really easy to do.
The example shows how to write real bad data. It has to be embedded in the exact routine Ericloewe described above.
It's late now where I live (23:19) and my wife is waiting for me to go to bed... tomorrow I will copy the interesting parts out of the document and post them here so we can all discuss happily!

Good night!

Urs · Nov 4, 2014

First, according to the manual for AMD family 15h, you can simulate ECC errors. That means no bad data is written, but the memory controller reports ECC errors.

2.15.3 Error Injection and Simulation
Error injection allows the introduction of errors into the system for test and debug purposes. See the following
sections for error injection details:
• DRAM: See 2.15.3.1 [DRAM Error Injection].
• Link:
  • D18F3x44[GenLinkSel, GenSubLinkSel, GenCrcErrByte1, GenCrcErrByte0].
Error simulation involves creating the appearance to software that an error occurred, and can be used to debug
machine check interrupt handlers. This is performed by manually setting the MCA registers with desired values,
and then driving the software via INT18. See MSRC001_0015[McStatusWrEn] for making MCA registers
writable for non-zero values. When McStatusWrEn is set, privileged software can write non-zero values to the
specified registers without generating exceptions, and then simulate a machine check using the INT18 instruction
(INTn instruction with an operand of 18). Setting a reserved bit in these registers does not generate an
exception when this mode is enabled. However, setting a reserved bit may result in undefined behavior.

But if you read further, you find out how to write real bad data:

2.15.3.1 DRAM Error Injection
This section gives details and examples on injecting errors into DRAM using D18F3xBC_x8 [DRAM ECC].
The intent of DRAM error injection is to cause a discrepancy between the stored data and the stored ECC
value. Therefore, DRAM error injection is only possible on DRAM which supports ECC, and in which
D18F2x90_dct[3:0][DimmEccEn] and D18F3x44[DramEccEn] are set.
The memory subsystem operates on 64-byte cachelines. The following fields are used to set how the cacheline
is to be corrupted in DRAM:
• D18F3xB8[ArrayAddress] selects a cacheline quadrant (16-byte section) of the cacheline. Each cacheline
quadrant is protected by an ECC word. Note that there are special requirements for which bits are
used to specify the target quadrant.
• D18F3xBC_x8[ErrInjEn] selects a 16-bit word of the cacheline quadrant selected in ArrayAddress. The
16-bit word identified as ECC[15:0] refers to the bits which store the ECC value; the other 16-bit words
address the data on which the ECC is calculated. One or more of these 16-bit words can be selected, and
the error bitmask indicated in EccVector is applied to each of the selected words.
• D18F3xBC_x8[EccVector] is a bitmask which selects the individual bits to be corrupted in the 16-bit
words selected by ErrInjEn. When selecting the bits to be corrupted for correctable or uncorrectable
errors, consider the ECC scheme being used, including symbol size; see 2.15.2 [DRAM ECC Considerations]
for more details. Note that corrupting more than two symbols may exceed the limits of the ECC
to detect the errors; for testing purposes it is recommended that no more than two symbols be corrupted
in a single cacheline quadrant.
The distinction between D18F3xBC_x8[DramErrEn] and D18F3xBC_x8[EccWrReq] is that DramErrEn is
used to continuously inject errors on every write. This bit is set and cleared by software. EccWrReq is used to
inject an error on only one write. This bit is set by software and is cleared by hardware after the error is
injected.
When performing DRAM error injection on multi-node systems, D18F3xB8 and D18F3xBC_x8 of the NB to
which the memory is attached must be programmed.
The following can be used to trigger the injection:
• The memory address is not an explicit parameter of the error injection interface. Once the error injection
registers D18F3xB8 and D18F3xBC are set, the next non-cached access of the appropriate type will trigger
the mechanism and apply it to the accessed address. The access should be non-cached so that it is
ensured to be seen by the memory controller. Possible methods to ensure a non-cached access include
using the appropriate MTRR to set the memory type to UC or turning off caches. If it is important to
know the address, then system activity must be quiesced so that the access can take place under careful
software control. Once the error injection pattern is set in D18F3xB8 and D18F3xBC_x8:
  • Set either D18F3xBC_x8[EccWrReq] or D18F3xBC_x8[DramErrEn] to enable the triggering mechanism.
  • The next non-cached access of the appropriate type will trigger the mechanism and apply it to the
    accessed address.
After the error is injected, the data must be accessed in order for the error detection to be triggered. The error
address logged in MSR0000_0412 will correspond to the cacheline quadrant that contains the error.
When using MSR0000_0411 to read MC4_STATUS after an error injection and subsequent error detection, be
aware that the setting of D18F3x44[NbMcaToMstCpuEn] can cause different cores to see different values.
Alternatively, MC4_STATUS can be read through the PCI-defined configuration space aliases D18F3x4C and
D18F3x48, which do not return different values to different cores, regardless of the setting of D18F3x44[NbMcaToMstCpuEn].

And now comes what cyberjock and, I think, many other people want to see: an EXAMPLE!

Example 1: Injecting a correctable error:
• Program error pattern:
  • D18F3xB8[ArraySelect]=1000b // select DRAM as target
  • D18F3xB8[ArrayAddress]=000000000b // select 16-byte (128-bit) section
  • D18F3xBC_x8[ErrInjEn]=000000001b // select 16-bit word in 16-byte section
  • D18F3xBC_x8[EccRdReq]=0 // not a read request
  • D18F3xBC_x8[EccVector]=0001h // set bitmask to inject error into only one symbol
• Program error trigger:
  • D18F3xBC_x8[DramErrEn]=0 // inject only a single error
  • D18F3xBC_x8[EccWrReq]=1 // a write request; enable injection on next write
• Clean up // if programmed for continuous errors
  • D18F3xBC_x8[DramErrEn]=0 // inject only a single error
That piece of code has to be inserted into the framework Ericloewe described:

Not really, beyond an (easily detectable) DoS attack, if the controller were properly designed.

Helper app requests memory
Enter diagnostic mode
Write bad data
Exit diagnostic mode
Read data
Report ECC failure or lack thereof
Memory is returned to the heap

EDIT: I have just read the other BKDGs from AMD. This applies to all of family 15h and 16h, so THE SAME CODE WORKS FOR ALL CURRENT AMD PROCESSORS!

DJABE · Nov 4, 2014

Quite interesting.
Someone should code a little .c program for this purpose :)
AMD CPU + Asus MoBo + ECC RAM = winning combination ($$$) if it is working. :)

cyberjock · Nov 4, 2014

I'm not a programmer, so some of that is gobbledygook to me. BUT, the fact that it only injects a single error is a bit disappointing. Ideally you'd want to create single-bit and multi-bit errors to see the system's response. If it is possible to inject a multi-bit error, the system's proper response would be important (and basically *is* the important bit).

DJABE said:
Quite interesting.
Someone should code a little .c program for this purpose :)
AMD CPU + Asus MoBo + ECC RAM = winning combination ($$$) if it is working. :)

Actually, even if it were proven that ECC RAM can work with a given combo, AMD is still a HORRIBLE idea (you could argue it has always been a bad idea for FreeBSD because of how small AMD is). In fact, there are internal discussions that maybe we should remove the AMD build from the hardware recommendations thread and say something like "if you use AMD don't cry to us if it suddenly won't even boot with an update". Right now 9.3 is not working for many AMD users (as in not even finishing the bootup sequence before panicking due to "incompatible hardware") and FreeNAS 10.1 (the next version after 9.3) is looking to be even worse.

In short, if you are wanting to build this system for long-term compatibility, you are barking up the wrong tree buying AMD hardware. There will be zero sympathy in the forum if/when this happens because the problems are going to be a FreeBSD thing and therefore isn't something the FreeNAS devs care about or even have any control over. Without AMD being 10,000x more involved than they have been lately you'll find their hardware more and more problematic and eventually there will be some fatal problem that will make AMD systems impossible on FreeBSD/FreeNAS. The situation is already there. It's just a matter of when FreeBSD is going to finally say "enough is enough.. we don't give a crap about AMD" and at that point you're going to be very disappointed.

Buying AMD is like playing Russian roulette. You might win, or you might lose. The question is whether it's even worth it to play. If not, then you should just go Intel so you don't have to worry about the game.

Sorry, but AMD has been, currently is, and will always be a bad idea unless/until AMD gets out of their financial problems, hires developers to truly make sure their hardware is supported in FreeBSD, and is willing to be a bigger player in development of code that is more compatible. Don't like that little dose of reality? Too bad. That's the reality. AMD, as a total platform package, has many factors going against it. You solve one and there's just a bunch more to go. Some of them are unfixable at present, many are very likely to never be fixed on today's hardware due to support for today's hardware in the future, and some might be fixable someday.

Urs · Nov 4, 2014

How to get more error bits:

D18F3xBC_x8[EccVector] is a bitmask which selects the individual bits to be corrupted in the 16-bit
words selected by ErrInjEn. When selecting the bits to be corrupted for correctable or uncorrectable
errors, consider the ECC scheme being used, including symbol size; see 2.15.2 [DRAM ECC Considerations]
for more details. Note that corrupting more than two symbols may exceed the limits of the ECC
to detect the errors; for testing purposes it is recommended that no more than two symbols be corrupted
in a single cacheline quadrant

And the example code:

Example 2, written by myself: Injecting an uncorrectable (multi-bit) error:
• Program error pattern:
  • D18F3xB8[ArraySelect]=1000b // select DRAM as target
  • D18F3xB8[ArrayAddress]=000000000b // select 16-byte (128-bit) section
  • D18F3xBC_x8[ErrInjEn]=000000001b // select 16-bit word in 16-byte section
  • D18F3xBC_x8[EccRdReq]=0 // not a read request
  • D18F3xBC_x8[EccVector]=0101h // bitmask with two bits set <-- this is the line to change to insert more error bits; it is the 16-bit error vector
• Program error trigger:
  • D18F3xBC_x8[DramErrEn]=0 // inject only a single error (meaning a single injection, not a single error bit)
  • D18F3xBC_x8[EccWrReq]=1 // a write request; enable injection on next write
• Clean up // if programmed for continuous errors
  • D18F3xBC_x8[DramErrEn]=0 // inject only a single error
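
To see which symbols a given EccVector touches, a small Python helper like the one below can help. It assumes that symbols are contiguous groups of 4 or 8 bits inside the 16-bit word, which is a guess at the layout; the actual symbol size and mapping are described in 2.15.2 [DRAM ECC Considerations], which is not quoted in this thread. The function name is made up. Whether two corrupted symbols end up reported as uncorrectable also depends on that ECC scheme.

```python
def corrupted_symbols(ecc_vector, symbol_bits=4):
    """List the symbol indices that have at least one bit set in a 16-bit vector.

    ASSUMPTION: symbol i covers bits [i*symbol_bits, (i+1)*symbol_bits).
    """
    if symbol_bits <= 0 or 16 % symbol_bits != 0:
        raise ValueError("symbol_bits must divide 16")
    mask = (1 << symbol_bits) - 1
    return [
        i
        for i in range(16 // symbol_bits)
        if (ecc_vector >> (i * symbol_bits)) & mask
    ]


for vector in (0x0001, 0x0101):
    for bits in (4, 8):
        print(f"EccVector={vector:04X}h, x{bits} symbols: {corrupted_symbols(vector, bits)}")
```

Under that assumption, 0001h touches a single symbol for either symbol size, while 0101h touches two, which stays within the "no more than two symbols" recommendation quoted above.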

Urs · Nov 4, 2014

I have been wondering: how does ECC error handling work in FreeNAS? Is there support for anything like that? On Linux the available tools are EDAC and edac_utils, which work for both current Intel and AMD. Is something like that implemented?
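
For comparison, this is roughly how the counters can be read on Linux, where the EDAC driver exposes them per memory controller in sysfs. A minimal sketch, assuming the usual /sys/devices/system/edac/mc/mcN layout with ce_count (corrected errors) and ue_count (uncorrected errors) files; the function name is made up. It says nothing about FreeNAS, which is exactly the open question.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": n, "ue_count": n}} from Linux EDAC sysfs."""
    root = Path(root)
    counts = {}
    if not root.is_dir():
        return counts  # EDAC driver not loaded, or not a Linux system
    for mc in sorted(root.glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            counter = mc / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

After a successful single-symbol injection one would expect ce_count to go up; an empty result just means no EDAC driver is bound to the memory controller.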

Urs · Nov 4, 2014

MemTest86 Pro V5 ($40) implements exactly this for the following processors:

- Intel Nehalem/Lynnfield/Westmere chipsets

- AMD 15h/16h chipsets

- Intel Xeon E3 v3 (Haswell)

So you can really prove that ECC is working, not only enabled!

The hardware recommendation section for AMD could be changed to say that you can prove ECC is working. BTW, how can you prove that ECC is really working, and not only enabled, on Intel without error injection?

cyberjock · Nov 4, 2014

uh, we *do* do error simulation with the dmidecode. We've also had real-world examples for bad RAM on the Intel builds. The problem is that we've had no real-world examples for AMD that were positive confirmation. But if you look at other forums like hardforums they've had AMD systems that were allegedly "supporting ECC" and while you should have expected ECC to work it didn't and they had bad RAM. Until your post the answer for AMD has always been "avoid it because of ECC alone".

While we could conceivably add a comment about ECC support, I think we're about 48 hours from saying "if you buy AMD do not come to this forum for support". The AMD thing is getting ugly on future builds of FreeNAS and FreeBSD. It looks like FreeBSD may be "AMD-incompatible". :/

While there's a glimmer of hope for AMD with ECC, it's not going to do a darn bit of good if you can't even boot the OS without it crashing. ;)

Urs · Nov 4, 2014

OK, I see there is no real testing mechanism implemented for Intel either...
How does FreeNAS react to ECC errors? How are they fetched from the memory controller?
