Uncorrectable ECC Error

Status
Not open for further replies.

antivirus

Dabbler
Joined
Jun 19, 2014
Messages
24
I've set up a FreeNAS box using a Supermicro X8DT3-F motherboard and 96GB of ECC RAM. Last year the server hung and popped up an error saying "Un-Correctable DRAM ECC Error Detected at CPU02/DIMM1A". I hit F1 to resume and rebooted the server. The error happened again a few weeks later.

I figured that specific DIMM had an issue, so I removed it, and the server chugged along just fine for another 8 months before throwing another error, this time for CPU02/DIMM2B. At this point I decided screw it, ordered 196GB of all-new RAM, and installed it a few weeks ago. Today the server froze up again and I got the same error on CPU02/DIMM1A. I'm starting to suspect it may be CPU2 that is bad or slots 1A and 2B instead of my RAM going bad. Any thoughts?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I'm starting to suspect it may be CPU2 that is bad or slots 1A and 2B instead of my RAM going bad. Any thoughts?
Quite possible. Unfortunately, the only way to test it is to try.
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
Switch the CPUs and if the error follows to the other set of DIMM slots, it's the Proc?
 

antivirus

Dabbler
Joined
Jun 19, 2014
Messages
24
Switch the CPUs and if the error follows to the other set of DIMM slots, it's the Proc?

Great idea! I'm dreading spending more money on a new CPU/mobo after buying 196GB of RAM. I'll switch the CPUs and report back.
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
If you're getting uncorrectable errors every few months, you're probably getting correctable errors more often.

Check for those in the IPMI event log.

Motherboard could be faulty. Could be dust in the slots. Could be electrical noise.
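
For anyone wanting to do that check from a shell rather than the web UI, here is a minimal sketch of counting ECC entries in a dump of the IPMI System Event Log. It assumes the pipe-separated layout that `ipmitool sel list` prints and that memory events are described as "Correctable ECC" / "Uncorrectable ECC"; the exact wording can vary by board and BMC firmware, so treat the match strings as assumptions. The sample records and the function name `count_ecc_events` are made up for illustration and are not from this server.

```python
from collections import Counter

# Hypothetical sample in the pipe-separated layout of `ipmitool sel list`.
# These records are invented for illustration only.
SAMPLE_SEL = """\
   1 | 03/02/2017 | 04:11:09 | Memory #0x01 | Correctable ECC | Asserted
   2 | 03/09/2017 | 22:40:51 | Memory #0x01 | Correctable ECC | Asserted
   3 | 04/15/2017 | 13:05:37 | Memory #0x01 | Uncorrectable ECC | Asserted
   4 | 04/15/2017 | 13:06:02 | Fan #0x32 | Lower Critical going low | Asserted
"""


def count_ecc_events(sel_text):
    """Count correctable and uncorrectable ECC entries in SEL text."""
    counts = Counter()
    for line in sel_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 5:
            continue  # not a SEL record line
        event = fields[4].lower()
        # Test "uncorrectable" first: it contains the substring "correctable".
        if "uncorrectable ecc" in event:
            counts["uncorrectable"] += 1
        elif "correctable ecc" in event:
            counts["correctable"] += 1
    return counts


counts = count_ecc_events(SAMPLE_SEL)
print(counts["correctable"], "correctable,", counts["uncorrectable"], "uncorrectable")
```

On a live box the text would come from the BMC instead of the sample string, for example by saving the output of `ipmitool sel list` to a file and reading it in. A log with uncorrectable entries but zero correctable ones may mean correctable events are not being logged at all.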
 

antivirus

Dabbler
Joined
Jun 19, 2014
Messages
24
If you're getting uncorrectable errors every few months, you're probably getting correctable errors more often.

Check for those in the IPMI event log.

Motherboard could be faulty. Could be dust in the slots. Could be electrical noise.

Log doesn't show any correctable errors, just a few uncorrectable ones (mentioned above in DIMM 1A and 2B).
 

antivirus

Dabbler
Joined
Jun 19, 2014
Messages
24
So I switched the CPUs and the same errors occurred. I replaced all the RAM with new modules and kept DIMM 1A and 2B empty (the previously problematic slots). Everything ran smoothly for 6 months. Today I got another error, this time on DIMM2A on CPU2 :(

I'm thinking the motherboard may be faulty now and I'll have to rebuild the whole box from scratch...
 

Stux

MVP
Joined
Jun 2, 2016
Messages
4,419
So, it could also be the CPU socket, which is part of the motherboard. Or power delivery to the CPU socket?

:(

It's always CPU Socket 2 that gets errors, right?

I still maintain you should be seeing correctable errors quite often, if you're getting uncorrectable errors every 6-8 months.
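
To make the "always socket 2" pattern explicit, here is a small sketch that tallies the error locations reported so far in this thread by socket and by slot. The list is transcribed from the posts above (normalizing "CPU02" to "CPU2") and leaves out the errors after the CPU swap, since their slots were not stated; the helper name `tally` is just for illustration.

```python
from collections import Counter

# Error locations as reported in the posts above, oldest first.
reported = [
    "CPU2/DIMM1A",  # first hang
    "CPU2/DIMM1A",  # a few weeks later
    "CPU2/DIMM2B",  # about 8 months after removing the first DIMM
    "CPU2/DIMM1A",  # after replacing all the RAM
    "CPU2/DIMM2A",  # with slots 1A and 2B left empty
]


def tally(locations):
    """Return (errors per socket, errors per slot) as two Counters."""
    sockets = Counter(loc.split("/")[0] for loc in locations)
    slots = Counter(loc.split("/")[1] for loc in locations)
    return sockets, slots


sockets, slots = tally(reported)
print(dict(sockets))
print(dict(slots))
```

Every entry lands on the same socket while the slot and the installed DIMMs keep changing, which is why the socket, the board, or its power delivery look more suspect than the memory itself.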
 

antivirus

Dabbler
Joined
Jun 19, 2014
Messages
24
So, it could also be the CPU socket, which is part of the motherboard. Or power delivery to the CPU socket?

:(

It's always CPU Socket 2 that gets errors, right?

I still maintain you should be seeing correctable errors quite often, if you're getting uncorrectable errors every 6-8 months.

Yup, it is always the CPU2 socket getting errors. I think there may be a BIOS setting related to ECC that I need to look into. I'll have to find some time to pull this server out of production and check.

I don't see any correctable errors, which may be due to my logging level; I'll look at that in the BIOS as well. Hopefully I don't have to build another server.
 