MCA: Memory Error

Status
Not open for further replies.

JJT211

Patron
Joined
Jul 4, 2014
Messages
323
So I just bought an additional 2x8GB of RAM (used, eBay, $60/stick; a gamble, I know) to add to my existing 2x8GB and bring my system (Supermicro A1SAi-2750F, Kingston 8GB SODIMMs, see sig) to 32GB. I installed it about 10 days ago, everything booted fine, and my system shows the full 32GB of memory.

A few days ago I saw an MCA error in my console log:

Code:
 MCA: Bank 5, Status 0xd40000c000900090
> MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
> MCA: CPU 0 COR OVER RD channel 0 memory error
> MCA: Address 0x12ef39498


It was just that one error. Everything seemed to run fine, and my memory usage at the time, according to the GUI, was just about maxed out.

[Image: GUI memory usage graph (mem usage_zpscfd8cqvh.png)]


I googled it and came across this thread here in the forums:

https://forums.freenas.org/index.php?threads/memory-errors.10573/

I did a memtest (2 passes) and didn't get any errors, so I figured it may have been an aberration. I've been stressing my system since then to see if I can reproduce it, and I did: I got 2 MCA errors, 2 hours apart, this morning, similar to the one above. They were in my console footer, but I stopped some jails and they have scrolled way up now, so I can't be sure they were exactly the same.

Anyway, I figured that before I do another memtest, which takes quite a while, I should do some digging. As mentioned in the thread above, memtest doesn't ID the slot; you have to pull and reinstall one stick at a time until you reproduce the failure, which sounds like a long process. Not to mention, it passed last time anyway.

So after some research I SSH'd into the system and ran
freenas-debug -h (Dump Hardware Configuration)

in the hope that I could ID the slot with the MCA error.

Here's what I found:


Code:
Handle 0x002B, DMI type 16, 23 bytes

Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Single-bit ECC
        Maximum Capacity: 64 GB
        Error Information Handle: Not Provided
        Number Of Devices: 4

Handle 0x002C, DMI type 19, 31 bytes
Memory Array Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 32 GB
        Physical Array Handle: 0x002B
        Partition Width: 1

Handle 0x002D, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: SODIMM
        Set: None
        Locator: DIMMA1
        Bank Locator: BANK 0
        Type: DDR3
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 1600 MHz
        Manufacturer: Toshiba
        Serial Number: 68467021
        Asset Tag: FBANK 0 DIMMA1 AssetTag
        Part Number: 9965527-021.A00LF
        Rank: 1
        Configured Clock Speed: 1600 MHz

Handle 0x002E, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00000000000
        Ending Address: 0x001FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x002D
        Memory Array Mapped Address Handle: 0x002C
        Partition Row Position: Unknown

Handle 0x002F, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: SODIMM
        Set: None
        Locator: DIMMA2
        Bank Locator: BANK 0
        Type: DDR3
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 1600 MHz
        Manufacturer: Toshiba
        Serial Number: 43221469
        Asset Tag: FBANK 0 DIMMA2 AssetTag
        Part Number: 9965527-021.A00LF
        Rank: 2
        Configured Clock Speed: 1600 MHz

Handle 0x0030, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00200000000
        Ending Address: 0x003FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x002F
        Memory Array Mapped Address Handle: 0x002C
        Partition Row Position: Unknown

Handle 0x0031, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: SODIMM
        Set: None
        Locator: DIMMB1
        Bank Locator: BANK 0
        Type: DDR3
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 1600 MHz
        Manufacturer: Toshiba
        Serial Number: 68462224
        Asset Tag: FBANK 0 DIMMB1 AssetTag
        Part Number: 9965527-021.A00LF
        Rank: 1
        Configured Clock Speed: 1600 MHz

Handle 0x0032, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00400000000
        Ending Address: 0x005FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x0031
        Memory Array Mapped Address Handle: 0x002C
        Partition Row Position: Unknown

Handle 0x0033, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 64 bits
        Data Width: 64 bits
        Size: 8192 MB
        Form Factor: SODIMM
        Set: None
        Locator: DIMMB2
        Bank Locator: BANK 0
        Type: DDR3
        Type Detail: Synchronous Unbuffered (Unregistered)
        Speed: 1600 MHz
        Manufacturer: Toshiba
        Serial Number: 44221169
        Asset Tag: FBANK 0 DIMMB2 AssetTag
        Part Number: 9965527-021.A00LF
        Rank: 2
        Configured Clock Speed: 1600 MHz

Handle 0x0034, DMI type 20, 35 bytes
Memory Device Mapped Address
        Starting Address: 0x00600000000
        Ending Address: 0x007FFFFFFFF
        Range Size: 8 GB
        Physical Device Handle: 0x0033
        Memory Array Mapped Address Handle: 0x002C
        Partition Row Position: Unknown

Handle 0x0035, DMI type 13, 22 bytes
BIOS Language Information
        Language Description Format: Long
        Installable Languages: 1
                en|US|iso8859-1
        Currently Installed Language: en|US|iso8859-1

Handle 0x0036, DMI type 127, 4 bytes
End Of Table


Looks like I'm close, but I can't quite match it. Anyone?
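
One way to attempt the match, as a rough sketch only: the DMI type 20 ("Memory Device Mapped Address") entries in the dump give each DIMM a physical address range, so the MCA address can be checked against those ranges. The ranges and locators below are copied from the dump; the name `find_dimm` is made up for illustration. A big caveat: this assumes the BIOS table reflects where the memory controller really places each address. With channel interleaving that may not hold, so treat the result as a hint rather than proof.

```python
# Address ranges copied from the DMI type 20 entries in the dump above,
# paired with the Locator of the DMI type 17 device each one points to.
DIMM_RANGES = [
    ("DIMMA1", 0x000000000, 0x1FFFFFFFF),
    ("DIMMA2", 0x200000000, 0x3FFFFFFFF),
    ("DIMMB1", 0x400000000, 0x5FFFFFFFF),
    ("DIMMB2", 0x600000000, 0x7FFFFFFFF),
]


def find_dimm(address, ranges=DIMM_RANGES):
    """Return the locator whose mapped range contains address, or None."""
    for locator, start, end in ranges:
        if start <= address <= end:
            return locator
    return None


mca_address = 0x12EF39498  # from "MCA: Address 0x12ef39498"
print(hex(mca_address), "->", find_dimm(mca_address))
```

By this naive lookup the address falls inside the DIMMA1 range. Whether that really is the failing stick depends on the interleaving caveat above, so swapping sticks between slots is still the more reliable test.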
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Serial Number: 68467021
Asset Tag: FBANK 0 DIMMA1 AssetTag
Part Number: 9965527-021.A00LF
Rank: 1

Serial Number: 43221469
Asset Tag: FBANK 0 DIMMA2 AssetTag
Part Number: 9965527-021.A00LF
Rank: 2

Serial Number: 68462224
Asset Tag: FBANK 0 DIMMB1 AssetTag
Part Number: 9965527-021.A00LF
Rank: 1

Serial Number: 44221169
Asset Tag: FBANK 0 DIMMB2 AssetTag
Part Number: 9965527-021.A00LF
Rank: 2

I'm no expert, but I saw a difference between the two pairs of RAM: the Rank is not the same.
I have absolutely no idea whether this is the problem or not. That's all I got...
 

JJT211

nm
 

JJT211

Yeah, I got that. I meant matching the MCA error address to my hardware configuration dump.

MCA: Bank 5, Status 0xd40000c000900090
> MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
> MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
> MCA: CPU 0 COR OVER RD channel 0 memory error
> MCA: Address 0x12ef39498

If I'm able to do that, I can find which DIMM it is. Thanks though, I figured it was a long shot anyway.

I think I'm just going to remove my 2 original RAM sticks and memtest the 2 questionable ones. I'll let it run for a while, as I'll have a pretty busy day tomorrow anyway.
 

JJT211

2 passes thus far, no errors.

Are there any other memory-testing utilities out there?
 

JJT211

Wow, well, this post is quite discouraging, from Jordan Hubbard himself:

On Aug 24, 2014, at 7:57 AM, Cary Babcock <babcock.cary at gmail.com> wrote:

> I recently ran into data corruption that was caused by my memory going bad.
> Have you considered the possibility of having a memory tester built into FreeNAS to
> help avoid a situation like this ?

We have. Sadly, almost all easily-obtained / easily-hosted memory testers are pretty worthless. They are trivially defeated by processor caches (which lie and report success even for bad memory locations) and are not thorough enough to catch all of the insidious single-bit errors. A good memory tester is very exhaustive, runs as close to the “bare metal” as possible (which is why most of them still run under DOS) and takes a long time to run.

This is the kind of situation that instead further reinforces the fact that it’s a much better idea to qualify your hardware separately from FreeNAS, before ever installing it, since a high-level OS is actually a poor hardware test platform.

http://lists.freenas.org/pipermail/freenas-devel/2014-August/000787.html

Lesson learnt... don't buy used memory.
 

JJT211

Did a bit more research, and I've seen a few people say they didn't start seeing memory errors until 6+ passes, so I just have to be a bit more patient. I've got exams coming up anyway, so I'll let the memtest be the buffer between me and FreeNAS... lol

I'm doing 1 stick at a time and will do at least 10 passes per module. I'll report back.
 

JJT211

OK, I did 3 passes per RAM stick (12 hours each) and didn't get any errors. I put my memory back in, but this time made sure the sticks were in different slots. It's been about 24 hours since I booted my system back up, and it appears I got the same exact error.

Code:
Apr 20 16:25:57 freenas MCA: Bank 5, Status 0xd40000c000900090
Apr 20 16:25:57 freenas MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Apr 20 16:25:57 freenas MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Apr 20 16:25:57 freenas MCA: CPU 0 COR OVER RD channel 0 memory error
Apr 20 16:25:57 freenas MCA: Address 0x126f39498


Same Bank but different address. Could Bank be the memory slot and Address be the physical stick? Meaning maybe the slot on my MB might be bad???

That would suck.

EDIT: Oh wait, never mind, the address is nearly the same (0x126f39498 now vs. 0x12ef39498 before) and the status is identical, so it's essentially the same error. Also, in my hardware dump above, all my RAM is reported as Bank 0.

Hmmmm....

I think I'm going to go back to a 16GB config with just the new RAM and see if I get anything. I definitely wasn't getting any errors with the old RAM.
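
On the "Bank" question: in these messages, "Bank" refers to a machine-check register bank inside the CPU (one per error-reporting hardware unit), not a DIMM bank or slot, which is why it doesn't line up with "Bank Locator: BANK 0" in the hardware dump. The status word itself can be unpacked. Below is a small sketch based on my reading of the IA32_MCi_STATUS layout and the memory-controller error code format in Intel's Software Developer's Manual. The name `decode_mca_status` is made up for illustration, and the model-specific bits are ignored.

```python
# Architectural flag bits of IA32_MCi_STATUS (model-specific fields ignored).
FLAGS = {63: "VAL", 62: "OVER", 61: "UC", 60: "EN",
         59: "MISCV", 58: "ADDRV", 57: "PCC"}

# Memory-controller error codes have the form 000F 0000 1MMM CCCC,
# where MMM is the operation and CCCC is the channel.
MEM_OPS = {0: "generic", 1: "read", 2: "write",
           3: "address/command", 4: "scrub"}


def decode_mca_status(status):
    """Decode the flag bits and, if present, the memory-controller error code."""
    flags = [name for bit, name in FLAGS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"flags": flags, "error_code": code}
    # Mask out the filter bit (bit 12) before matching the pattern.
    if code & 0xEF80 == 0x0080:
        op = (code >> 4) & 0x7
        channel = code & 0xF
        result["memory_op"] = MEM_OPS.get(op, "reserved")
        result["channel"] = None if channel == 0xF else channel
    return result


# Status values copied from the MCA log lines posted in this thread.
for status in (0xD40000C000900090, 0xD400008000910091, 0x9400004000910091):
    print(hex(status), decode_mca_status(status))
```

For the status values posted here, UC is clear in every case, which agrees with the "COR" (corrected) in the log lines, and the low 16 bits decode to a memory read on channel 0 or 1, matching "RD channel 0" and "RD channel 1". None of this identifies a slot by itself.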
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Well, you're doing generally the right things to see if it is a problem with one of the modules. One of the reasons that we burn in gear here for months.
 

JJT211

Apr 22 02:20:16 freenas MCA: Bank 5, Status 0xd400008000910091
Apr 22 02:20:16 freenas MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Apr 22 02:20:16 freenas MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Apr 22 02:20:16 freenas MCA: CPU 0 COR OVER RD channel 1 memory error
Apr 22 02:20:16 freenas MCA: Address 0x56f39598
Apr 22 03:01:53 freenas upsmon[2789]: UPS ups on battery
Apr 22 03:01:58 freenas upsmon[2789]: UPS ups on line power
Apr 22 03:20:16 freenas MCA: Bank 5, Status 0x9400004000910091
Apr 22 03:20:16 freenas MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Apr 22 03:20:16 freenas MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Apr 22 03:20:16 freenas MCA: CPU 0 COR RD channel 1 memory error
Apr 22 03:20:16 freenas MCA: Address 0x56f39580

And there it is: bad memory! Good news, actually, as I was a tad worried it might be my memory controller or board.
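
As a side check on how close the logged addresses are to each other, XOR-ing them shows which bits differ. This is only arithmetic on the four addresses posted in this thread (the helper name `differing_bits` is made up). It says nothing certain about which chip or row is involved, since that mapping is specific to the memory controller.

```python
# Addresses copied from the "MCA: Address" lines in this thread.
first_pair = (0x12EF39498, 0x126F39498)   # original report, report after reseating
second_pair = (0x56F39598, 0x56F39580)    # the two Apr 22 log entries


def differing_bits(a, b):
    """Return the bit positions where a and b differ, lowest first."""
    diff = a ^ b
    return [bit for bit in range(diff.bit_length()) if (diff >> bit) & 1]


for a, b in (first_pair, second_pair):
    print(hex(a), hex(b), "differ in bits", differing_bits(a, b))
```

The first pair differs in a single bit (bit 27), and the second pair differs only in bits 3 and 4, so those two addresses sit within the same 64-byte block.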

jgreco said: "Well, you're doing generally the right things to see if it is a problem with one of the modules. One of the reasons that we burn in gear here for months."

Cool, thanks. Yeah, I isolated it to at least one of the two sticks of new RAM, but I don't think it would be wise to troubleshoot any further, as leaving a bad stick in my system might put my data at risk. It appears I can get my money back from eBay anyway.

How do y'all normally burn in new RAM? Is there an actual RAM tester, or is it put into a dummy system?
 

jgreco

My *strong* preference is to burn it in as part of the assembly it'll ultimately have to work in. This is part of why, when we get a new server, it is likely to spend a lot of time in testing and burn-in before being put into production. As we've moved towards virtualization here, the amount of spare gear we have lying around for testing has plummeted, so it is also sometimes the only way to do testing.
 

JJT211

The only logs I was able to check were the ones I posted. If there's something else I missed in the IPMI console, please let me know how.
 