Memory Error?

Status
Not open for further replies.

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
So I started getting a bunch of these in my kernel log:

MCA: Bank 5, Status 0xd404c78000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 1 memory error
MCA: Address 0x45beb4fc0

Unless I'm completely deprecated in the memory department myself, I believe that's an ECC error?

Anyhoo, this being an x86 box, before I went digging into hardware I did the usual "if in doubt, reboot" routine. No more messages for a day or so now since the reboot. Like with small children, silence is suspicious, very suspicious. Any thoughts/rants on the subject?

(Yes Virginia I should probably run memtest on it for a while. I would if I had a spare array...)
 

BigDave

FreeNAS Enthusiast
Joined
Oct 6, 2013
Messages
2,479
If you can't test it in that machine, replace ALL sticks and then test the pulls...
How else are you gonna find the misbehaving child :-/
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
BigDave said:
If you can't test it in that machine, replace ALL sticks and then test the pulls...
How else are you gonna find the misbehaving child :-/

That would also be a wonderful plan if, in fact, I had extra sticks and another machine in which to test the originals. :rolleyes::p:cool:
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
If it's an ECC correction, there should be something in the IPMI log to confirm it. It should even tell you exactly which module was affected.

It could be a simple bitflip though.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
If you go the memtest route you'll get errors from ECC corrections that are different from non-ECC errors (and especially uncorrectable errors with ECC RAM). You will be able to distinguish between the two with memtest.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You can always try to decode the info by hand, see chapter 15.

But yeah, without doing the hard work it looks like you had an ECC correction. Personally I'd just print a copy and stuff it away for future reference. A single event is not a cause for panic; this is the sort of thing that ECC is expected to catch. If you can afford to take the machine down and run memtest86 then that's a great thing to do if you're paranoid.

If you get a repeat within a month, however, then you pull out your printed copy and compare and see if maybe one of your modules is just a little flaky.
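
For anyone who does want to decode the status word by hand, here is a minimal Python sketch. It assumes the architectural IA32_MCi_STATUS layout from the Intel SDM (flag bits 63 down to 57, MCA error code in bits 15:0, memory-controller codes of the form 0000 0000 1MMM CCCC). The function and dictionary names are made up for illustration, and the model-specific bits are deliberately left undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed layout, per the Intel SDM).
FLAGS = {
    "VAL": 63,    # register holds a valid error
    "OVER": 62,   # an earlier error was overwritten (overflow)
    "UC": 61,     # uncorrected error; clear means the error was corrected
    "EN": 60,     # error reporting was enabled
    "MISCV": 59,  # MISC register is valid
    "ADDRV": 58,  # ADDR register is valid
    "PCC": 57,    # processor context corrupt
}

# MMM field of a memory-controller error code (0000 0000 1MMM CCCC).
MEM_OPS = {0: "generic", 1: "read", 2: "write", 3: "address/command", 4: "scrub"}


def decode_mci_status(status):
    """Decode the flag bits and, where applicable, the memory-controller code."""
    result = {name: bool((status >> bit) & 1) for name, bit in FLAGS.items()}
    code = status & 0xFFFF                          # bits 15:0, MCA error code
    result["mca_code"] = code
    result["model_code"] = (status >> 16) & 0xFFFF  # model-specific, not decoded
    # Memory-controller errors match the pattern 0000 0000 1MMM CCCC.
    if code & 0xFF80 == 0x0080:
        result["mem_op"] = MEM_OPS.get((code >> 4) & 0x7, "reserved")
        channel = code & 0xF
        result["channel"] = None if channel == 0xF else channel  # 0xF: unspecified
    return result


# Status word from the first post in this thread.
info = decode_mci_status(0xD404C78000910091)
print(info)
```

For the status in the first post this gives UC clear (corrected), OVER set, ADDRV set, and a memory read error on channel 1, which lines up with the kernel's own "COR OVER RD channel 1 memory error" summary.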
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
So I got dozens of these in the logs but none since a reboot. IPMI only had this:
[Attachment: Screen Shot 2014-09-05 at 8.10.57 AM.JPG]

Anything power related makes me suspicious because it causes all kinds of weird, apparently unrelated, ****. But again, nothing since the reboot. Yes, the system is on a UPS with NUT monitoring.
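
With dozens of entries in the log, a quick tally can make the "compare against your printed copy" step less tedious. The sketch below is only an illustration: it assumes the log lines look exactly like the ones quoted in the first post, the function name `tally_mca` is hypothetical, and the sample text simply repeats that one report twice.

```python
import re
from collections import Counter

# Patterns match the line format quoted in the first post (an assumption).
BANK_RE = re.compile(r"MCA: Bank (\d+), Status (0x[0-9a-fA-F]+)")
ADDR_RE = re.compile(r"MCA: Address (0x[0-9a-fA-F]+)")


def tally_mca(log_text):
    """Count MCA reports per (bank, status, address).

    Reports that carry no Address line are not counted by this sketch.
    """
    counts = Counter()
    bank = status = None
    for line in log_text.splitlines():
        m = BANK_RE.search(line)
        if m:
            bank, status = int(m.group(1)), m.group(2)
            continue
        m = ADDR_RE.search(line)
        if m and bank is not None:
            counts[(bank, status, m.group(1))] += 1
            bank = status = None  # wait for the next "Bank" line
    return counts


# Sample input: the report from the first post, repeated twice.
report = """MCA: Bank 5, Status 0xd404c78000910091
MCA: Global Cap 0x0000000000000806, Status 0x0000000000000000
MCA: Vendor "GenuineIntel", ID 0x406d8, APIC ID 0
MCA: CPU 0 COR OVER RD channel 1 memory error
MCA: Address 0x45beb4fc0
"""
counts = tally_mca(report * 2)
for key, n in counts.items():
    print(key, n)
```

Whether the same bank and address keep recurring or the addresses wander is the kind of detail worth noting down next to the printout if the errors come back.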
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
That's kind of ick. Is that on a FreeNAS Mini?
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
Yup. I'm trying to decide if this was a temporary glitch or something I need to log a support call with iX about. I'm somewhat inclined to let sleeping systems lie unless it does it again. The mini isn't "Enterprise" gear with redundant power, so I wouldn't necessarily expect it to recover in the event of power weirdness. We have had a few storms through here with lights flickering, so that's a thing. The UPS isn't in-line, so a short transient event could conceivably have glitched the power circuitry.
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
Those warnings have been going on for over a month. Sounds like a loose plug issue, but it could be power supply or line power or UPS problem.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
It could also be crappy motherboard voltage monitoring. At least the desktop boards have a certain reputation for inaccuracy.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
DaPlumber said:
Yup. I'm trying to decide if this was a temporary glitch or something I need to log a support call with iX about. I'm somewhat inclined to let sleeping systems lie unless it does it again. The mini isn't "Enterprise" gear with redundant power, so I wouldn't necessarily expect it to recover in the event of power weirdness. We have had a few storms through here with lights flickering, so that's a thing. The UPS isn't in-line, so a short transient event could conceivably have glitched the power circuitry.

I think you've got a great handle on it. Follow your instinct.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
My guess is that since they didn't last more than 1 second it was a temporary power glitch from your power lines. It may also be responsible for your memory error.

When it goes low (or high) and actually stays that way for an extended period, that's a clue it's not power lines. At least, not normally in the USA. In other countries that can be the norm, unfortunately. :(
 

rogerh

Guru
Joined
Apr 18, 2014
Messages
1,111
If the voltage errors are due to power line glitches, then the UPS is not reacting quickly enough to keep those glitches from reaching the motherboard. This is unsatisfactory and suggests the UPS may not prevent data loss. Most UPSs have some powerline filtering, and in any case these are low-voltage errors, so this suggests the particular power supply does not maintain output long enough for the UPS to go on battery. I would carefully check that the power lead is well inserted at both ends. These kettle-type leads are in any case not fit for purpose: any connection living outside a chassis should have locking connectors at both ends. That is a hard-to-overcome design fault, though. I would also check that the multiway plug from PSU to motherboard is well seated.
If there are no problems here, I would be looking at a more expensive UPS with a constantly running inverter. Do they call these "online"?
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Yes, those are online supplies. They take a little more power but the battery is always supplying the inverter, which is always running.

It is also possible that the thresholds on the IPMI are overly sensitive and need adjusting, but I haven't looked at that.
 

DaPlumber

Patron
Joined
May 21, 2014
Messages
246
Yeah, wall-juice here in Texas is usually clean enough to not warrant the online/inline inverter UPS cost. There's other equipment on that UPS, and it hasn't shown a problem. I usually expect a PSU to have big enough capacitors to cover the UPS switchover. Power weirdness happens. Shrug. Since the issue hasn't happened again, I'm guessing that the reboot allowed whatever state the PSU got into to reset. In my professional life I would never consider this situation, but at home: Champagne taste and Beer budget... :rolleyes:
 