Memory Error on FreeNAS Mini

Status
Not open for further replies.

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
In the last couple of weeks my 11.0-U4, 32GB, FreeNAS Mini (mainboard replaced a few months ago under warranty by iXsystems after failing) has reported 4 instances of memory error:

Code:
MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x818e45508

MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x51bd011c0

MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x80ae56130

MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x4ec722100


The last two were yesterday and today - in between them I reseated the 4 memory modules, generally checked the inside of the box, and mentally prepared for the next phase if there was a repeat event. So, here we are.

Dmidecode results today:

Code:
root@freenasmini:~ # dmidecode
# dmidecode 3.0
Scanning /dev/mem for entry point.
SMBIOS 2.8 present.
25 structures occupying 1495 bytes.
Table at 0xCF527000.

Handle 0x0000, DMI type 0, 24 bytes
BIOS Information
		Vendor: American Megatrends Inc.
		Version: P2.90
		Release Date: 01/26/2016
		Address: 0xF0000
		Runtime Size: 64 kB
		ROM Size: 8192 kB
		Characteristics:
				PCI is supported
				BIOS is upgradeable
				BIOS shadowing is allowed
				Boot from CD is supported
				Selectable boot is supported
				BIOS ROM is socketed
				EDD is supported
				5.25"/1.2 MB floppy services are supported (int 13h)
				3.5"/720 kB floppy services are supported (int 13h)
				3.5"/2.88 MB floppy services are supported (int 13h)
				Print screen service is supported (int 5h)
				8042 keyboard services are supported (int 9h)
				Serial services are supported (int 14h)
				Printer services are supported (int 17h)
				ACPI is supported
				USB legacy is supported
				BIOS boot specification is supported
				Targeted content distribution is supported
				UEFI is supported
		BIOS Revision: 5.6

Handle 0x0001, DMI type 1, 27 bytes
System Information
		Manufacturer: iXsystems
		Product Name: FREENAS-MINI-2.0
		Version: To Be Filled By O.E.M.
		Serial Number: A1-37201
		UUID: 03000200-0400-0500-0006-000700080009
		Wake-up Type: Power Switch
		SKU Number: To Be Filled By O.E.M.
		Family: To Be Filled By O.E.M.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
		Manufacturer: ASRock
		Product Name: C2750D4I
		Version: 1.02
		Serial Number: 73S0X39I0073
		Asset Tag:
		Features:
				Board is a hosting board
				Board is replaceable
		Location In Chassis:
		Chassis Handle: 0x0003
		Type: Motherboard
		Contained Object Handles: 0

Handle 0x0003, DMI type 3, 25 bytes
Chassis Information
		Manufacturer: To Be Filled By O.E.M.
		Type: Desktop
		Lock: Not Present
		Version: To Be Filled By O.E.M.
		Serial Number: To Be Filled By O.E.M.
		Asset Tag: To Be Filled By O.E.M.
		Boot-up State: Safe
		Power Supply State: Safe
		Thermal State: Safe
		Security Status: None
		OEM Information: 0x00000000
		Height: Unspecified
		Number Of Power Cords: 1
		Contained Elements: 1
				Power Supply (1)
		SKU Number: To Be Filled By O.E.M.

Handle 0x0008, DMI type 9, 17 bytes
System Slot Information
		Designation: PCIE1
		Type: x8 PCI Express
		Current Usage: In Use
		Length: Long
		ID: 17
		Characteristics:
				3.3 V is provided
				Opening is shared
				PME signal is supported

Handle 0x0009, DMI type 11, 5 bytes
OEM Strings
		String 1: To Be Filled By O.E.M.

Handle 0x0014, DMI type 32, 20 bytes
System Boot Information
		Status: No errors detected

Handle 0x0015, DMI type 41, 11 bytes
Onboard Device
		Reference Designation:  Onboard IGD
		Type: Video
		Status: Enabled
		Type Instance: 1
		Bus Address: 0000:00:02.0

Handle 0x0016, DMI type 41, 11 bytes
Onboard Device
		Reference Designation:  Onboard LAN
		Type: Ethernet
		Status: Enabled
		Type Instance: 1
		Bus Address: 0000:00:19.0

Handle 0x0017, DMI type 41, 11 bytes
Onboard Device
		Reference Designation:  Onboard 1394
		Type: Other
		Status: Enabled
		Type Instance: 1
		Bus Address: 0000:03:1c.2

Handle 0x0018, DMI type 7, 19 bytes
Cache Information
		Socket Designation: L1-Cache
		Configuration: Enabled, Not Socketed, Level 1
		Operational Mode: Write Back
		Location: Internal
		Installed Size: 448 kB
		Maximum Size: 448 kB
		Supported SRAM Types:
				Synchronous
		Installed SRAM Type: Synchronous
		Speed: Unknown
		Error Correction Type: Single-bit ECC
		System Type: Instruction
		Associativity: 8-way Set-associative

Handle 0x0019, DMI type 7, 19 bytes
Cache Information
		Socket Designation: L2-Cache
		Configuration: Enabled, Not Socketed, Level 2
		Operational Mode: Write Back
		Location: Internal
		Installed Size: 4096 kB
		Maximum Size: 4096 kB
		Supported SRAM Types:
				Synchronous
		Installed SRAM Type: Synchronous
		Speed: Unknown
		Error Correction Type: Single-bit ECC
		System Type: Unified
		Associativity: 16-way Set-associative

Handle 0x001A, DMI type 4, 42 bytes
Processor Information
		Socket Designation: CPUSocket
		Type: Central Processor
		Family: Atom
		Manufacturer: Intel(R) Corporation
		ID: D8 06 04 00 FF FB EB BF
		Signature: Type 0, Family 6, Model 77, Stepping 8
		Flags:
				FPU (Floating-point unit on-chip)
				VME (Virtual mode extension)
				DE (Debugging extension)
				PSE (Page size extension)
				TSC (Time stamp counter)
				MSR (Model specific registers)
				PAE (Physical address extension)
				MCE (Machine check exception)
				CX8 (CMPXCHG8 instruction supported)
				APIC (On-chip APIC hardware supported)
				SEP (Fast system call)
				MTRR (Memory type range registers)
				PGE (Page global enable)
				MCA (Machine check architecture)
				CMOV (Conditional move instruction supported)
				PAT (Page attribute table)
				PSE-36 (36-bit page size extension)
				CLFSH (CLFLUSH instruction supported)
				DS (Debug store)
				ACPI (ACPI supported)
				MMX (MMX technology supported)
				FXSR (FXSAVE and FXSTOR instructions supported)
				SSE (Streaming SIMD extensions)
				SSE2 (Streaming SIMD extensions 2)
				SS (Self-snoop)
				HTT (Multi-threading)
				TM (Thermal monitor supported)
				PBE (Pending break enabled)
		Version: Intel(R) Atom(TM) CPU  C2750  @ 2.40GHz
		Voltage: 1.6 V
		External Clock: 100 MHz
		Max Speed: 2600 MHz
		Current Speed: 2400 MHz
		Status: Populated, Enabled
		Upgrade: Other
		L1 Cache Handle: 0x0018
		L2 Cache Handle: 0x0019
		L3 Cache Handle: Not Provided
		Serial Number: Not Specified
		Asset Tag: ProcessorInfo_ASSET_TAG
		Part Number: Not Specified
		Core Count: 8
		Core Enabled: 8
		Thread Count: 8
		Characteristics:
				64-bit capable

Handle 0x001D, DMI type 15, 73 bytes
System Event Log
		Area Length: 65535 bytes
		Header Start Offset: 0x0000
		Header Length: 16 bytes
		Data Start Offset: 0x0010
		Access Method: Memory-mapped physical 32-bit address
		Access Address: 0xFFA21000
		Status: Valid, Not Full
		Change Token: 0x000000DC
		Header Format: Type 1
		Supported Log Type Descriptors: 25
		Descriptor 1: Single-bit ECC memory error
		Data Format 1: Multiple-event handle
		Descriptor 2: Multi-bit ECC memory error
		Data Format 2: Multiple-event handle
		Descriptor 3: Parity memory error
		Data Format 3: None
		Descriptor 4: Bus timeout
		Data Format 4: None
		Descriptor 5: I/O channel block
		Data Format 5: None
		Descriptor 6: Software NMI
		Data Format 6: None
		Descriptor 7: POST memory resize
		Data Format 7: None
		Descriptor 8: POST error
		Data Format 8: POST results bitmap
		Descriptor 9: PCI parity error
		Data Format 9: Multiple-event handle
		Descriptor 10: PCI system error
		Data Format 10: Multiple-event handle
		Descriptor 11: CPU failure
		Data Format 11: None
		Descriptor 12: EISA failsafe timer timeout
		Data Format 12: None
		Descriptor 13: Correctable memory log disabled
		Data Format 13: None
		Descriptor 14: Logging disabled
		Data Format 14: None
		Descriptor 15: System limit exceeded
		Data Format 15: None
		Descriptor 16: Asynchronous hardware timer expired
		Data Format 16: None
		Descriptor 17: System configuration information
		Data Format 17: None
		Descriptor 18: Hard disk information
		Data Format 18: None
		Descriptor 19: System reconfigured
		Data Format 19: None
		Descriptor 20: Uncorrectable CPU-complex error
		Data Format 20: None
		Descriptor 21: Log area reset/cleared
		Data Format 21: None
		Descriptor 22: System boot
		Data Format 22: None
		Descriptor 23: End of log
		Data Format 23: None
		Descriptor 24: OEM-specific
		Data Format 24: OEM-specific
		Descriptor 25: OEM-specific
		Data Format 25: OEM-specific

Handle 0x001E, DMI type 16, 23 bytes
Physical Memory Array
		Location: System Board Or Motherboard
		Use: System Memory
		Error Correction Type: Single-bit ECC
		Maximum Capacity: 64 GB
		Error Information Handle: Not Provided
		Number Of Devices: 4

Handle 0x001F, DMI type 19, 31 bytes
Memory Array Mapped Address
		Starting Address: 0x00000000000
		Ending Address: 0x007FFFFFFFF
		Range Size: 32 GB
		Physical Array Handle: 0x001E
		Partition Width: 1

Handle 0x0020, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM0
		Bank Locator: BANK 0
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20716568
		Asset Tag:  BANK 0 DIMM0 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz

Handle 0x0021, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00000000000
		Ending Address: 0x001FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0020
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1

Handle 0x0022, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM0
		Bank Locator: BANK 1
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20805821
		Asset Tag:  BANK 1 DIMM0 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz

Handle 0x0023, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00200000000
		Ending Address: 0x003FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0022
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1

Handle 0x0024, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM1
		Bank Locator: BANK 0
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20716567
		Asset Tag:  BANK 0 DIMM1 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz

Handle 0x0025, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00400000000
		Ending Address: 0x005FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0024
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1

Handle 0x0026, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM1
		Bank Locator: BANK 1
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20805822
		Asset Tag:  BANK 1 DIMM1 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz

Handle 0x0027, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00600000000
		Ending Address: 0x007FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0026
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1

Handle 0x0028, DMI type 127, 4 bytes
End Of Table


The ASRock manual doesn't have a memory map and my searching so far has not turned one up, so I'm not encouraged that the information above is very helpful (if at all) in physically locating the problem: assuming it is a memory module, which of the four slots is it in? A general Google search didn't give me much hope either. I'd be delighted if anyone can tell me how I can use that information, or anything else that we can get from a functional (if flaky) FreeNAS install, to nail down the ailing component(s).

I'm gearing myself up to pull the four drives and install them in spare bays in my Dell backup box - more on that in my other post in Storage - then run Memtest86+ on the box to see if I can find a bad module. I'm uncertain right now about the testing protocol - if I do all four simultaneously will Memtest allow me to identify a culprit or must I test each one alone? If anyone has any pointers on that topic I'd be delighted to have them (I only ever tested "good" memory before - no errors found).

In fact, any comments will be welcome - this is new territory for me.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
The IPMI log will probably tell you which DIMM it is.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Thanks, @Ericloewe but unfortunately not with this ASRock/AMI IPMI implementation unless I missed setting something up within it or in the BIOS, and I looked at everything again yesterday after upgrading the IPMI firmware (which only fixed the Java issue).
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You mean there isn't a log? I find that rather unusual.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
You mean there isn't a log? I find that rather unusual.
Yes, there’s a log, but no identifiable record of the memory event in it either by description or time stamp.


 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Okay, that's slightly less weird.

I guess you're down to trying them individually until you identify the culprit.
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Thanks very much for this - very interesting. I have found new 8GB sticks for ~$70 and have been considering that option. I'm less certain about used memory. In any case, read on ...
I realized that I needed to check the board's slots as well as the sticks so I think I have to plan to run each stick in each slot and begin an elimination process based on what I find. I'm guessing this'll be a long job - but I really don't have any experience of Memtest finding bad sticks - all my burn-in tests to date have shown up no errors. I did not, however, do any testing when I got my warranty-replaced board back from iXsystems so there's a minor unknown element in the system ...
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
These sticks currently cost about $250 a piece new - IF you can find them.

I jumped at the opportunity - 75% discount!

This RAM should buy me plenty of time before I need to upgrade my server further.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Understood. Did you test them before putting them in to service?
 

Constantin

Vampire Pig
Joined
May 19, 2017
Messages
1,829
Not yet. Per the seller, they were functional when pulled from a system (and tested). The seller has a 99.6% approval rating...

No issues came up when the system booted. I'm not sure that FreeNAS does any special tests, but the longer 'System Initializing' message interval suggests that the hardware detected the RAM change.

I still have the old sticks to go back to. I will educate myself on how to test them when Turkey Day is over. But no errors reported by FreeNAS thus far.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
You are less risk-averse than I am. If you do decide to test, you might want to take a look at the recommendations here (about 3/4 way down first page).
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
In the last couple of weeks my 11.0-U4, 32GB, FreeNAS Mini (mainboard replaced a few months ago under warranty by iXsystems after failing) has reported 4 instances of memory error:

Code:
MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x818e45508
 
MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x51bd011c0
 
MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x80ae56130
 
MCA: Bank 5, Status 0x9400004000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR RD channel 1 memory error
MCA: Address 0x4ec722100


The last two were yesterday and today - in between them I reseated the 4 memory modules, generally checked the inside of the box, and mentally prepared for the next phase if there was a repeat event. So, here we are.

Dmidecode results today:

Code:
Handle 0x0020, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM0
		Bank Locator: BANK 0
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20716568
		Asset Tag:  BANK 0 DIMM0 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz
 
Handle 0x0021, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00000000000
		Ending Address: 0x001FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0020
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1
 
Handle 0x0022, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM0
		Bank Locator: BANK 1
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20805821
		Asset Tag:  BANK 1 DIMM0 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz
 
Handle 0x0023, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00200000000
		Ending Address: 0x003FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0022
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1
 
Handle 0x0024, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM1
		Bank Locator: BANK 0
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20716567
		Asset Tag:  BANK 0 DIMM1 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz
 
Handle 0x0025, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00400000000
		Ending Address: 0x005FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0024
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1
 
Handle 0x0026, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM1
		Bank Locator: BANK 1
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20805822
		Asset Tag:  BANK 1 DIMM1 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz
 
Handle 0x0027, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00600000000
		Ending Address: 0x007FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0026
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1
 

So let's look at the data you provided a bit more closely; maybe it does tell you something. I cut a lot out of the dmidecode section and left what I needed.

You have error messages for address: 0x818e45508, 0x51bd011c0, 0x80ae56130, and 0x4ec722100.
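DMIDecode maps 0x51bd011c0 and 0x4ec722100 to the RAM with serial number 20716567; the nearest mapped range to the other two addresses belongs to serial number 20805822.

What bugs me is the two upper addresses, 0x818e45508 and 0x80ae56130, as they lie above the highest physical RAM address that DMIDecode reports (0x007FFFFFFFF).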

The manual does lead you almost to a type of memory map; it sort of tells you which is channel 0 and which is channel 1. Keep in mind that you are running in interleave mode when you have all of your RAM installed (four slots filled or both blue slots filled). I suspect that your problem is in one of the white slots.

First I'd run a CPU stress test for about an hour to make sure it's not a CPU or power supply issue.

Next I'd run Memtest86 on your RAM for several days, maybe even a solid week.

If this isn't fruitful then I'd swap the RAM around, but I'd swap the RAM in the blue slots to the white and the RAM in the white slots to the blue. If the problem returns then it should be below the 0x3ffffffff range. If the problem returns above that range then I would suspect something other than the RAM. If the problem returns and it's below 0x3ffffffff then suspect one of the RAM modules in the blue slots. By the way, this is all an educated guess on my part; for all I know the white slots are channel 0. However, I know that slot DDR3_A1 is the first slot to be populated.
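
The decision rule above boils down to comparing the next error address with 0x3ffffffff, which is the top of the first 16 GB in the dmidecode map. A tiny Python sketch of that rule, with the same caveat that it rests on an educated guess about the slot layout; `triage_after_swap` and `BOUNDARY` are hypothetical names used only for illustration:

```python
# Top of the second 8 GB range in the dmidecode map (first 16 GB of RAM).
BOUNDARY = 0x3ffffffff


def triage_after_swap(address):
    """Apply the post-swap rule of thumb to a newly reported error address."""
    if address <= BOUNDARY:
        # The error followed the sticks that were moved into the blue slots.
        return "suspect one of the RAM modules now in the blue slots"
    # The error stayed in the upper range even though the sticks moved.
    return "suspect something other than the RAM"


if __name__ == "__main__":
    # Both addresses were reported in this thread (before any swap);
    # they are used here only to exercise the two branches.
    for addr in (0x2fd059900, 0x51bd011c0):
        print(hex(addr), "->", triage_after_swap(addr))
```
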

You could run the system on a single stick, but that would be my last resort. Also, it would be nice to know if there was something going on with the system when a failure occurred.

Good luck, you will need it.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
@joeschmuck thank you for this analysis - it's past my skill range so I'm going to work to understand it while I follow along, so I'll likely have questions on the memory mapping aspect.
This morning I was looking for a CPU stress test utility as I figured I would start there. I have the Ultimate Boot collection which I plan to use. Do you have an alternate suggestion?

BTW, I have a couple more memory error reports today - do they help any?
Code:
Nov 23 08:14:47 freenasmini MCA: Bank 5, Status 0x9400004000910091
Nov 23 08:14:47 freenasmini MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Nov 23 08:14:47 freenasmini MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Nov 23 08:14:47 freenasmini MCA: CPU 0 COR RD channel 1 memory error
Nov 23 08:14:47 freenasmini MCA: Address 0x37e1f2b00


Nov 23 11:14:47 freenasmini MCA: Bank 5, Status 0x9400004000910091
Nov 23 11:14:47 freenasmini MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
Nov 23 11:14:47 freenasmini MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
Nov 23 11:14:47 freenasmini MCA: CPU 0 COR RD channel 1 memory error
Nov 23 11:14:47 freenasmini MCA: Address 0x2fd059900


Thanks once again for your help.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
This morning I was looking for a CPU stress test utility as I figured I would start there. I have the Ultimate Boot collection which I plan to use. Do you have an alternate suggestion?
This is what I'd use myself so I think you are doing well.

BTW, I have a couple more memory error reports today - do they help any?
Well, it's unfortunate that you had one at address 0x2fd059900 because now it's impacting a different memory module. The only thing I see as being consistent is that it's always CPU 0 on channel 1.
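
For what it's worth, the status word itself can be unpacked to see where "COR RD channel 1" comes from. Below is a rough Python sketch, assuming the architectural IA32_MCi_STATUS layout described in Intel's Software Developer's Manual (flag bits 63 down to 57, MCA error code in bits 15:0, memory controller errors encoded as 0000 0000 1MMM CCCC). The name `decode_mca_status` is made up for illustration, and the model-specific bits are deliberately ignored:

```python
# Memory-controller request types (the MMM field), per the assumed SDM encoding.
REQUEST_TYPES = {
    0b000: "generic undefined request",
    0b001: "memory read",
    0b010: "memory write",
    0b011: "address/command",
    0b100: "memory scrubbing",
}


def decode_mca_status(status):
    """Decode the architectural flags and, if present, a memory-controller error code."""
    result = {
        "valid": bool((status >> 63) & 1),            # VAL
        "overflow": bool((status >> 62) & 1),         # OVER
        "uncorrected": bool((status >> 61) & 1),      # UC (0 means corrected)
        "enabled": bool((status >> 60) & 1),          # EN
        "misc_valid": bool((status >> 59) & 1),       # MISCV
        "addr_valid": bool((status >> 58) & 1),       # ADDRV
        "context_corrupt": bool((status >> 57) & 1),  # PCC
        "mca_error_code": status & 0xFFFF,
    }
    code = result["mca_error_code"]
    # Memory controller errors match the bit pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        result["request"] = REQUEST_TYPES.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 1111 = not specified
    return result


if __name__ == "__main__":
    for key, value in decode_mca_status(0x9400004000910091).items():
        print(f"{key}: {value}")
```

For the status in this thread, the sketch reports a valid, corrected (UC clear) memory read error on channel 1 with a valid address, which lines up with the "COR RD channel 1 memory error" line that FreeNAS prints.
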

I've done a quick internet search on "MCA: Bank 5, Status 0x9400004000910091" and found that this looks like a RAM/CPU voltage/timing issue. Running Memtest86 without the ability to inject ECC errors may not find an error. And I'm not saying your RAM is bad, based on the results almost all of it would have to magically be bad. There are a few things you can try and I'll list them. Don't do these if you feel uncomfortable doing it. If your FreeNAS computer is still under warranty then you should return it for a replacement.

Things you can try to make the CPU and RAM communicate better:
1) Slow down the RAM Speed slightly.
2) Slow down the internal bus clock speed.
WARNING: Do not do step 3 unless you are sure you know what you are doing, this is a great way to destroy something and even .001 VDC is a lot of change when dealing with these high speed low voltage circuits. If you can send the computer back, I would!
3) Increase the CPU voltage by .005 or .01 VDC (depends on the granularity of the BIOS voltage regulators).

One last thing, is the BIOS up to date? That of course would be the first thing I check!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
So let's look at the data you provided to identify a stick of RAM...
The error message provides the address where the error occurred; for this example let's assume the error occurred at 0x0004c6f2e06.

DMIDecode Output
Handle 0x0020, DMI type 17, 34 bytes
Memory Device
		Array Handle: 0x001E
		Error Information Handle: Not Provided
		Total Width: 64 bits
		Data Width: 64 bits
		Size: 8192 MB
		Form Factor: DIMM
		Set: None
		Locator: DIMM0
		Bank Locator: BANK 0
		Type: DDR3
		Type Detail: Synchronous Unbuffered (Unregistered)
		Speed: 1600 MHz
		Manufacturer: Samsung
		Serial Number: 20716568
		Asset Tag: BANK 0 DIMM0 AssetTag
		Part Number: M391B1G73QH0-YK0
		Rank: 2
		Configured Clock Speed: 1600 MHz

Handle 0x0021, DMI type 20, 35 bytes
Memory Device Mapped Address
		Starting Address: 0x00000000000
		Ending Address: 0x001FFFFFFFF
		Range Size: 8 GB
		Physical Device Handle: 0x0020
		Memory Array Mapped Address Handle: 0x001F
		Partition Row Position: 1


When you have an error then you should have an address of the error. In DMIDecode you look for the Memory Device Mapped Address and this provides a starting and ending address, just find the section where your address would fit in. In our case 0x0004c6f2e06 falls between 0x0 and 0x001FFFFFFFF.

Now you look at the "Physical Device Handle" value, which in our case is 0x0020. This relates directly to "Handle 0x0020", so we scroll up to look at that entry, which tells you all about the RAM stick: the part number, speed, and thankfully the serial number, which you can use to identify the stick.
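
The lookup described above can be sketched in a few lines of Python. This is only an illustration that takes the dmidecode table at face value: the ranges are copied from the Memory Device Mapped Address entries posted in this thread, `DIMM_RANGES` and `locate_dimm` are made-up names, and with interleaving enabled the real address-to-stick mapping may not be this simple.

```python
# (start, end, physical device handle, locator, serial number),
# copied from the dmidecode output posted in this thread.
DIMM_RANGES = [
    (0x00000000000, 0x001FFFFFFFF, 0x0020, "BANK 0 DIMM0", "20716568"),
    (0x00200000000, 0x003FFFFFFFF, 0x0022, "BANK 1 DIMM0", "20805821"),
    (0x00400000000, 0x005FFFFFFFF, 0x0024, "BANK 0 DIMM1", "20716567"),
    (0x00600000000, 0x007FFFFFFFF, 0x0026, "BANK 1 DIMM1", "20805822"),
]


def locate_dimm(address):
    """Return (handle, locator, serial) for the range holding address, or None."""
    for start, end, handle, locator, serial in DIMM_RANGES:
        if start <= address <= end:
            return handle, locator, serial
    return None  # address lies outside every mapped range


if __name__ == "__main__":
    # The worked example address plus the addresses from the MCA messages.
    reported = [
        0x0004c6f2e06,
        0x818e45508, 0x51bd011c0, 0x80ae56130, 0x4ec722100,
        0x37e1f2b00, 0x2fd059900,
    ]
    for addr in reported:
        hit = locate_dimm(addr)
        if hit is None:
            print(f"{addr:#013x}: not inside any mapped range")
        else:
            handle, locator, serial = hit
            print(f"{addr:#013x}: handle {handle:#06x}, {locator}, serial {serial}")
```

Run against the addresses reported in this thread, the two that start with 0x8 come back as unmapped because they sit above 0x007FFFFFFFF.
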

I hope this helps some and doesn't confuse you.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
I hope this helps some and doesn't confuse you.

Ah, now I see the linkage - that's very helpful towards understanding. I gather from your comment though that it affects more than one stick ... I'll study and come back. Get CPU stress test underway first.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
Get CPU stress test underway first.

Well, here's something interesting - CPU temperature exceeded upper non-recoverable 90C <2 minutes into first stress test (not a max level test). Pulled the plug as temp rising relatively fast. Single case fan running normally, ambient 70F, A/C on pulling in ~35F air from outside (compressor not running), NAS case closed .... This is a passive cooled CPU, no CPU fan. The BMC log history was blank as a result of me updating the BMC a few days ago so there may have been excursions before that I do not have info about.
I think I have to get a ticket in with iXsystems on this replacement board.

@joeschmuck I'll study the messages and the dmidecode report later tomorrow - thanks once again for the pointers.
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
I've quit the memory test, too. I just hit 95C on the CPU a few minutes into multi-threaded Memtest86+.
 