Random Reboots

Kenfolk

Explorer
Joined
Sep 4, 2016
Messages
51
I have a Supermicro box, 24 bays, a little over 100TB in RaidZ6. Over the past few months it has started randomly rebooting. Sometimes it gives me a "non-correctable DRAM ECC error detected" message along with a slot; sometimes it reboots back up just fine. I've run memtest, and I've pulled the RAM sticks from every slot it reported (the slots kept changing).

I thought an HBA card might have gone bad (I have 3 of them), since this started a little after I added the 3rd HBA and the last batch of drives, so I replaced the HBA card. Thought it might be the boot drive, so I replaced my boot drive. It has 2 PSUs in it, and I did have to replace one of those a couple of months ago. I've noticed that it tends to reboot more often when I'm transferring data off the NAS to another external drive.

For the life of me, though, I can't figure out what the actual problem might be. The fact that it sometimes reports a non-correctable DRAM ECC error initially made me think that a stick of RAM went bad, but I have 18 sticks split into 2 groups of 9 on CPU1 and CPU2. When I get the error for, say, CPU1 Dimm1A, I take out all the sticks from CPU1, and the next time it says CPU2 Dimm1A, for example. I can't imagine all of the RAM sticks or all of the slots going bad, and it is weird that it has issues far more often when I'm trying to transfer data. I'm kinda at my wits' end.

I bought the unit used about 3-4 years ago, if that helps any. I've read that a bad PSU can sometimes cause weird issues; is there a chance I need to replace my other PSU? (The issues started before the one PSU went bad, so I don't think my new PSU is causing all this.) Any advice or input would be greatly appreciated, thanks!
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
RAM controllers are on the CPU die these days. Since you’re seeing memory errors move around, this suggests the CPU is starting to go bad, or you have a thermal issue. Have you tried reseating your CPU and reinstalling the HSF with fresh thermal compound?
 

Kenfolk

Explorer
Joined
Sep 4, 2016
Messages
51
I have not. Since I've seen it say both CPU1 and 2, I'm guessing it might be good if I try and reseat and put fresh thermal compound on both of them?
 

Kenfolk

Explorer
Joined
Sep 4, 2016
Messages
51
So, quick update: I removed CPU2 (the thermal paste did look like it needed to be reapplied). I haven't put it back yet, but the system has become way more stable. No longer getting the RAM errors or anything. I do have another question, though: I have 3 HBA cards, so how can I tell if one of them is going bad? Is there a quick way to test them?

*edit*

Never mind on the bad HBA card; it did in fact fully die after I posted that. Seems like I had been dealing with a failing HBA card and some type of issue with one of the CPUs.
 

JaimieV

Guru
Joined
Oct 12, 2012
Messages
742
It might not have been a bad CPU. The system timing will be different with two CPUs in place, which may have been enough to tickle the bad HBA's issues. It'll be worth trying it again with CPU2 reinstalled.
 

Kenfolk

Explorer
Joined
Sep 4, 2016
Messages
51
Just an update: it has been over 2 weeks since I reseated my CPU and I've had no issues. No restarts, even when transferring large amounts of data. My guess is that the HBA was just bad enough to cause those random issues. I'm just glad it finally crapped out fully, or it would've taken me a long time to actually find it.
 