Hardware Error - Ram?

Zain · Dec 23, 2022

I suppose it's probably time for a re-paste of the CPU anyways. I'll probably get through the holiday weekend here and next week I'll pull the cooler and cpu and re-seat it all (including the ram since it's under this massive cooler). Little bit of dust buildup in there anyways that could also be cleaned up.

Thanks.

joeschmuck · Dec 23, 2022

Redcoat said:
@joeschmuck , looks like you missed the ref

Yea, it was the reference at the bottom of the message. My brain was having a rough day.

Redcoat · Dec 23, 2022

joeschmuck said:
Yea, it was the reference at the bottom of the message. My brain was having a rough day.

Thx!

joeschmuck · Dec 23, 2022

Zain said:
As soon as I add a 5th or 6th stick, the errors populate almost immediately, and it doesn't seem to matter which other slots the sticks are installed in (D1/C1/A1/B1).

Based on what you posted, the valid configurations are two DIMMs, four DIMMs, or all 8 DIMMs, not any other variation. 5 or 6 DIMMs may not be a proper configuration.

As for the M.2 SSD, yes, I imagine it could be related. Pull those out and test the system with all the RAM installed. See if it makes a difference. Also, the power supply certainly could be the issue; just because it's new means absolutely nothing. Infant mortality is very real.

Did you know that DUST can and does short out electronics? I'm not saying you will see it often but in the government/military systems it's taken very seriously, along with silver migration. So while this is not your issue, definitely clean that out. I generally blow compressed air across my computers to clean them out at least twice a year. It builds up so quickly.

Zain · Dec 25, 2022

6 passes on memtest86+ (took a long time this time, ≈ a day) with 0 errors. Going to let it continue going through the night tonight though.

TrueNAS has become unresponsive several times over the last couple of days: the UI won't come up, it doesn't show up on the network, and apps aren't working, as if the system were off. The attached monitor still shows menu options but they cannot be navigated. I have to forcibly shut it down and start it again. It seems to happen within a couple of hours of being online. Think this can be attributed to the RAM issue?

Davvo · Dec 25, 2022

How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but could also be an M.2 issue.

Zain · Dec 25, 2022

Davvo said:
How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but could also be an M.2 issue.

The third time the unresponsiveness issue happened I decided to start a longer instance of memtest (since it was gonna be down anyways).

Zain · Dec 26, 2022

Can anyone elaborate on this line from the shell, please? Is it some type of error, and is it good or bad?

Dec 26 14:07:51 truenas kernel: mpt3sas 0000:08:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM

Also noticing a bunch of these:
Dec 26 16:15:32 truenas kernel: net_ratelimit: 66 callbacks suppressed
Dec 26 16:15:37 truenas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth304b57d8: link becomes ready
Dec 26 16:15:37 truenas kernel: kube-bridge: port 25(veth304b57d8) entered blocking state
Dec 26 16:15:37 truenas kernel: kube-bridge: port 25(veth304b57d8) entered disabled state
Dec 26 16:15:37 truenas kernel: device veth304b57d8 entered promiscuous mode

I think I've also, through process of elimination, narrowed the ECC errors down to possibly two sticks. Hopefully I'm not posting this prematurely, but fingers crossed.

Ericloewe · Dec 26, 2022

Zain said:
Dec 26 14:07:51 truenas kernel: mpt3sas 0000:08:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM

Is that at boot time?

Zain · Dec 26, 2022

Ericloewe said:
Is that at boot time?

No, that is while the system has been in operation for an hour or so.

Zain · Dec 26, 2022

Davvo said:
How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but could also be an M.2 issue.

Just booting into memtest and letting it run, all sticks in the system.

Davvo · Dec 26, 2022

Zain said:
Just booting into memtest and letting it run, all sticks in the system.

See you in a week.

Zain · Dec 26, 2022

Wait, is one pass a single stick?

Davvo · Dec 26, 2022

Zain said:
Wait, is one pass a single stick?

No, but you want at least a few days of test.
Some people here do up to 3 weeks, I believe.

Zain · Dec 26, 2022

I would think that, as frequent as the errors are in the TN logs, memtest would show errors rather quickly (within a couple of passes?). For testing that long, I presume another system would be required. That said, I'm pretty confident that I have isolated the culprit(s) here.

Thanks.

Ericloewe · Dec 26, 2022

You're not wrong, but with multiple bad DIMMs in one batch, they're all suspect. Keep in mind some might be more marginal than others.

Zain · Dec 27, 2022

How do you identify which slot is which here, based on the error log?

Dec 22 00:48:31 truenas kernel: mce: [Hardware Error]: Machine check events logged
Dec 22 00:48:31 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x103e230 offset:0x700 grain:64 syndrome:0x8901)

I presume "channel:1" means exactly what you'd think, channel 1 of the memory controller, but what does "csrow" correspond to? The Fatal1ty X399 Professional Gaming manual is attached, in case it helps.

Using dmidecode, I think I've mapped the slots correctly, but I'm unsure how to translate the csrow portion of the error to tell me exactly which slot the ECC errors are occurring in.

TIA
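As a starting point for the question above, here is a minimal Python sketch that pulls the memory controller, csrow, and channel numbers out of EDAC lines in the format quoted in this post and tallies corrected errors per location. The function names are made up for illustration. The sysfs path it builds assumes the kernel exposes the legacy csrow layout under /sys/devices/system/edac, which may not be the case on every system, and the label file can be empty if the board's labels were never registered. It does not by itself tell you which physical slot a csrow/channel pair is.

```python
import re
from collections import Counter

# Matches EDAC lines in the format quoted in the post above, e.g.
# "EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:...)"
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)

def tally_edac(lines):
    """Sum reported error counts per (mc, csrow, channel) location."""
    counts = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            key = (int(m["mc"]), int(m["csrow"]), int(m["channel"]))
            counts[key] += int(m["count"])
    return counts

def sysfs_label_path(mc, csrow, channel):
    """Hypothetical helper: where a DIMM label *may* live, assuming the
    legacy csrow sysfs layout is present on this kernel."""
    return f"/sys/devices/system/edac/mc/mc{mc}/csrow{csrow}/ch{channel}_dimm_label"

# The two log lines from the post above.
sample = [
    "Dec 22 00:48:31 truenas kernel: mce: [Hardware Error]: Machine check events logged",
    "Dec 22 00:48:31 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 "
    "(csrow:2 channel:1 page:0x103e230 offset:0x700 grain:64 syndrome:0x8901)",
]

for (mc, csrow, channel), n in tally_edac(sample).items():
    print(f"mc{mc} csrow{csrow} channel{channel}: {n} error(s)")
    print("  label file to check:", sysfs_label_path(mc, csrow, channel))
```

Feeding it the whole log instead of two lines would show whether the errors cluster on one or two csrow/channel pairs, which can then be compared against the sticks already singled out by elimination.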

Davvo · Dec 27, 2022

This might be helpful.

How to solve EDAC DIMM CE Error · Site Reliability Engineer HandBook

s905060.gitbooks.io

Zain · Dec 27, 2022

Yeah, I was referencing that but I couldn't get dmidecode -t memory | grep 'Locator: DIMM' to populate anything in shell.

Plus I'm not sure how to convert the quad-channel slot configuration of my board to the table that was provided.

Thanks.
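One possible reason the grep returns nothing (an assumption, not verified on this board) is that the Locator strings simply don't start with "DIMM"; another is that dmidecode was not run as root. A small Python sketch that lists the Locator and Bank Locator of every "Memory Device" block without assuming any prefix is below. The sample text and its slot names are hypothetical, used only to exercise the parser; real output comes from running dmidecode -t memory as root.

```python
def parse_memory_devices(text):
    """Collect Locator / Bank Locator / Size from each 'Memory Device' block."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}              # start of a new DMI type 17 block
            devices.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key in ("Locator", "Bank Locator", "Size"):
                current[key] = value
        elif not line:
            current = None            # a blank line ends the block
    return devices

# Hypothetical sample in dmidecode's layout; slot names are made up.
SAMPLE = """\
Handle 0x0040, DMI type 17, 40 bytes
Memory Device
    Size: 16 GB
    Locator: SLOT_A1
    Bank Locator: P0 CHANNEL A

Handle 0x0041, DMI type 17, 40 bytes
Memory Device
    Size: No Module Installed
    Locator: SLOT_A2
    Bank Locator: P0 CHANNEL A
"""

for dev in parse_memory_devices(SAMPLE):
    print(dev.get("Locator"), "|", dev.get("Bank Locator"), "|", dev.get("Size"))

# On the real system (as root), something like:
#   import subprocess
#   out = subprocess.run(["dmidecode", "-t", "memory"],
#                        capture_output=True, text=True).stdout
#   parse_memory_devices(out)
```

Seeing the actual Locator and Bank Locator strings should make it easier to line the slots up with the channel names in the board manual.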

Davvo · Dec 27, 2022

This has more info

https://support.siliconmechanics.com/portal/en/kb/articles/identify-bad-dimm-from-edac
