Hardware Error - Ram?

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
I suppose it's probably time to re-paste the CPU anyway. I'll get through the holiday weekend, and next week I'll pull the cooler and CPU and re-seat everything (including the RAM, since it sits under this massive cooler). There's also a bit of dust buildup in there that could be cleaned up.

Thanks.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
As soon as I add a 5th or 6th stick, the errors populate almost immediately, and it doesn't seem to matter which other slots the sticks are installed in (D1/C1/A1/B1).
Based on your posting, the supported configurations are two, four, or all eight DIMMs, with no other variation. Five or six DIMMs may not be a proper configuration.

As for the M.2 SSD, yes, I imagine it could be related. Pull those out and test the system with all the RAM installed. See if it makes a difference. The power supply could certainly be the issue too; the fact that it's new means absolutely nothing. Infant mortality is very real.

Did you know that DUST can and does short out electronics? I'm not saying you will see it often but in the government/military systems it's taken very seriously, along with silver migration. So while this is not your issue, definitely clean that out. I generally blow compressed air across my computers to clean them out at least twice a year. It builds up so quickly.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
6 passes on memtest86+ (took a long time this time, ≈ a day) with 0 errors. I'm going to let it keep running overnight, though.

TrueNAS has become unresponsive several times over the last couple of days: the UI won't come up, it doesn't show up on the network, and apps aren't working, as if the system were off. The attached monitor still shows menu options, but they cannot be navigated. I have to forcibly shut it down and start it again. It seems to happen within a couple of hours of being online. Do you think this can be attributed to the RAM issue?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but it could also be an M.2 issue.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but it could also be an M.2 issue.
The third time the unresponsiveness issue happened, I decided to start a longer memtest run (since the system was going to be down anyway).
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Can anyone explain this line from the shell, please? Is it an error, or is it harmless?

Dec 26 14:07:51 truenas kernel: mpt3sas 0000:08:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM

Also noticing a bunch of these:
Dec 26 16:15:32 truenas kernel: net_ratelimit: 66 callbacks suppressed
Dec 26 16:15:37 truenas kernel: IPv6: ADDRCONF(NETDEV_CHANGE): veth304b57d8: link becomes ready
Dec 26 16:15:37 truenas kernel: kube-bridge: port 25(veth304b57d8) entered blocking state
Dec 26 16:15:37 truenas kernel: kube-bridge: port 25(veth304b57d8) entered disabled state
Dec 26 16:15:37 truenas kernel: device veth304b57d8 entered promiscuous mode

I think I've also, through process of elimination, narrowed the ECC errors down to possibly two sticks. Hopefully I'm not posting this prematurely, but fingers crossed.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Dec 26 14:07:51 truenas kernel: mpt3sas 0000:08:00.0: invalid VPD tag 0x00 (size 0) at offset 0; assume missing optional EEPROM
Is that at boot time?
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
How are you running memtest if you have the system up? Anyway, that could definitely be a RAM issue, but it could also be an M.2 issue.
Just booting into memtest and letting it run, all sticks in the system.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Wait, is one pass a single stick?
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Wait, is one pass a single stick?
No, but you want at least a few days of testing.
Some people here run it for up to 3 weeks, I believe.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Given how frequent the errors are in the TN logs, I would think memtest would show errors rather quickly (within a couple of passes?). Testing for that long would presumably require another system. That said, I'm pretty confident that I have isolated the culprit(s) here.

Thanks.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
You're not wrong, but with multiple bad DIMMs in one batch, they're all suspect. Keep in mind some might be more marginal than others.
 

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
How do you identify which slot is which here, based on the error log?

Dec 22 00:48:31 truenas kernel: mce: [Hardware Error]: Machine check events logged
Dec 22 00:48:31 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 (csrow:2 channel:1 page:0x103e230 offset:0x700 grain:64 syndrome:0x8901)

I presume "channel:1" means exactly what you'd think, channel 1 of the memory controller, but what does "csrow" correspond to? The Fatal1ty X399 Professional Gaming manual is attached, in case it helps.

Using dmidecode, I think I've mapped the slots correctly, but I'm unsure how to translate the csrow portion of the error to tell me exactly which slot the ECC errors are occurring in.
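For anyone following along, here is a rough Python sketch of how the EDAC lines above could be tallied per csrow/channel pair, so the noisiest location stands out. The regex is an assumption based only on the log format quoted in this post, and `tally_edac` is a made-up helper name, not part of any tool. It does not answer the csrow-to-slot question; it only groups the errors.

```python
import re
from collections import Counter

# Assumed pattern, derived only from the EDAC line format quoted in this thread.
EDAC_RE = re.compile(
    r"EDAC MC(?P<mc>\d+): (?P<count>\d+) (?P<kind>CE|UE) .*?"
    r"\(csrow:(?P<csrow>\d+) channel:(?P<channel>\d+)"
)

def tally_edac(lines):
    """Sum error counts per (memory controller, csrow, channel, kind)."""
    totals = Counter()
    for line in lines:
        m = EDAC_RE.search(line)
        if m:
            key = (int(m["mc"]), int(m["csrow"]), int(m["channel"]), m["kind"])
            totals[key] += int(m["count"])
    return totals

# The two log lines quoted above, used as sample input.
sample = [
    "Dec 22 00:48:31 truenas kernel: mce: [Hardware Error]: Machine check events logged",
    "Dec 22 00:48:31 truenas kernel: EDAC MC0: 1 CE on mc#0csrow#2channel#1 "
    "(csrow:2 channel:1 page:0x103e230 offset:0x700 grain:64 syndrome:0x8901)",
]

for (mc, csrow, channel, kind), n in sorted(tally_edac(sample).items()):
    print(f"MC{mc} csrow {csrow} channel {channel}: {n} {kind}")
```

Feeding it the full kernel log instead of `sample` would show whether the errors cluster on one csrow/channel pair or spread across several. Depending on the kernel and EDAC driver, /sys/devices/system/edac/mc/ may also expose per-csrow or per-DIMM counters and labels, though the labels are often left empty.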

TIA

1672177867986.png
 

Attachments

  • Fatal1ty X399 Professional Gaming.pdf
    2.5 MB · Views: 80

Zain

Contributor
Joined
Mar 18, 2021
Messages
124
Yeah, I was referencing that, but I couldn't get dmidecode -t memory | grep 'Locator: DIMM' to return anything in the shell.

Plus I'm not sure how to convert the quad-channel slot configuration of my board to the table that was provided.
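In case the grep pattern simply doesn't match how this board labels its slots, a minimal Python sketch like the one below could list every "Memory Device" block regardless of the label text. The sample text is hypothetical (not output from this system), and `parse_memory_devices` is a made-up helper name; the "Locator", "Bank Locator", and "Size" field names are the ones dmidecode normally prints for DMI type 17 entries.

```python
def parse_memory_devices(text):
    """Split `dmidecode -t memory` style output into one dict per 'Memory Device' block."""
    devices = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "Memory Device":
            current = {}              # start of a new DIMM slot entry
            devices.append(current)
        elif not line:
            current = None            # a blank line ends the current block
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
    return devices

# Hypothetical sample input; real labels vary by motherboard and BIOS.
SAMPLE = """\
Handle 0x0030, DMI type 17, 40 bytes
Memory Device
\tLocator: DIMM 0
\tBank Locator: P0 CHANNEL A
\tSize: 16 GB

Handle 0x0031, DMI type 17, 40 bytes
Memory Device
\tLocator: DIMM 1
\tBank Locator: P0 CHANNEL A
\tSize: No Module Installed
"""

for dev in parse_memory_devices(SAMPLE):
    print(dev.get("Locator"), "|", dev.get("Bank Locator"), "|", dev.get("Size"))
```

On the real system, the text could come from running dmidecode -t memory as root and passing its output to the function. Comparing the "Locator" and "Bank Locator" values against the slot names silkscreened on the board is still a manual step, and this sketch does not translate csrow numbers into slots.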

Thanks.
 