M11SDV-8C-LN4F (AMD Epyc 3251) crashing

JJT211

Patron
Joined
Jul 4, 2014
Messages
323
I just upgraded my system a few weeks ago and all has been running great up until now.

Yesterday I was out of town and noticed my Plex server was down. I came home and found the screen in the first pic on the IPMI console.



esxi_failure.jpg




I then power cycled the system, and everything came back up just fine. I was working on my Zabbix VM all night without issue.

Well I just woke up this morning and found this. It happened again, although with a different output.

esxi_failure2.png




How do I troubleshoot this? Is there any way to find related logs?



Here's my complete build. In bold is the recently upgraded hardware.



FreeNAS Stable VM under ESXi 6.7
Supermicro M11SDV-8C+-TLN4F
AMD EPYC 3251 8 Core/16 Thread 2.5 GHz base/3.1 GHz Turbo

64GB (2x32GB) Samsung DDR4 2133MHz ECC RDIMM
250GB m.2 SATA SSD
LSI SAS 9207-8i PCI-E 3.0 HBA
6x4TB WD Red RAID Z2
Supermicro 920W Platinum PSU-SQ
Supermicro CSE-826BE16-R920LPB 2U Server
BPN-SAS2-826EL1 Backplane




A few things to note: I was previously running ESXi with a FreeNAS VM and that LSI card on a Supermicro Xeon-D 1521 board. I bought the LSI card off eBay about 2 months ago when I made the transition from bare-metal FreeNAS to ESXi. It was already in IT mode, although only on firmware version 15.00. It ran on v15 until a few days ago, when I flashed it to the newest 20.00.07.00 firmware. Everything seemed to be fine up until now.



EDIT: I just rebooted and everything appears to be fine for now. From the looks of the second pic, this appears to be a hardware CPU issue. What do y'all think?

EDIT2: OK, it just happened again. I'm RMA'ing this mofo. Consider this post a report on this model.

esxi_failure3.png
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
And it worked fine for a couple of weeks? Did anything change?
 

JJT211

Patron
Joined
Jul 4, 2014
Messages
323
No, I meant it worked fine for a few weeks and then just started doing this. I'm in the RMA process right now. I've never seen a 'purple' screen of death before... lol
 

Mlovelace

Guru
Joined
Aug 19, 2014
Messages
1,111
Out of curiosity did you have IPv6 globally disabled? There is a known issue currently with 6.5 and 6.7 where you can get PSOD if IPv6 is globally disabled.
 

JJT211

Patron
Joined
Jul 4, 2014
Messages
323
Mlovelace said:
Out of curiosity did you have IPv6 globally disabled? There is a known issue currently with 6.5 and 6.7 where you can get PSOD if IPv6 is globally disabled.

Hmmm....I'm pretty sure I did actually. And I might have just disabled it recently as well.

Did a little bit of research, and this says that it's been patched as of 6.7u2 though. However, the article also notes that they had previously released a patch in 6.5u1 and quickly pulled it.

How old is the 6.7u2 update? I'm an ESXi noob, only been messing around with it for about a month.

EDIT: Actually, I think the previous link may be inaccurate. According to the KB (which is current as of 2 Jul 2019), there is currently no resolution for 6.7u2; the fix only applies to the recent 6.5u3. This would be some amazing news if true. I won't have a chance to test it until later this week. I'm gonna see what Supermicro says. I sent them the screenshots with my RMA.
 

JJT211

Patron
Joined
Jul 4, 2014
Messages
323
UPDATE:

So I'm finally back home. My RMA got approved, but I figured I'd reinstall the board to see if it was just that IPv6 disable bug.

Turns out it's the first RAM DIMM socket that went bad. The board isn't reading the stick of RAM, and the BIOS doesn't see it at all. I tried swapping memory sticks, and the same slot still doesn't register any RAM installed, although oddly it's still giving a temp reading to the IPMI. I even took the RAM to a different board, and it registered just fine there. What absolutely confirms it's the slot is the entry below from the IPMI health event log. It's from right when I started having problems.

Code:
    2019/07/05 03:38:16    Memory(OEM)    Correctable ECC / other correctable memory error @DIMMA1
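
For anyone else chasing this kind of crash, here is a rough Python sketch for tallying memory ECC events per DIMM slot from an exported copy of the health event log. It assumes every line looks like the single entry above (date, time, sensor, description, then "@DIMMxx"); other BMC firmware may format entries differently. The names `parse_event` and `ecc_errors_per_dimm` are made up for illustration and are not part of any Supermicro tool.

```python
import re
from collections import Counter

# Pattern modeled on the one event line quoted above (assumption: other
# entries in the exported log follow the same column layout).
EVENT_RE = re.compile(
    r"^\s*(?P<date>\d{4}/\d{2}/\d{2})\s+(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<sensor>\S+)\s+(?P<desc>.*?)\s*@\s*(?P<dimm>DIMM\w+)\s*$"
)


def parse_event(line):
    """Return the fields of a memory event line as a dict, or None if it does not match."""
    match = EVENT_RE.match(line)
    return match.groupdict() if match else None


def ecc_errors_per_dimm(lines):
    """Count ECC-related events per DIMM slot label."""
    counts = Counter()
    for line in lines:
        event = parse_event(line)
        if event and "ECC" in event["desc"]:
            counts[event["dimm"]] += 1
    return counts


# The only sample available is the entry from this thread.
sample = [
    "    2019/07/05 03:38:16    Memory(OEM)    Correctable ECC / other correctable memory error @DIMMA1",
]
print(ecc_errors_per_dimm(sample))
```

If one slot label keeps showing up no matter which stick is installed in it, that points at the slot or board rather than the DIMM, which matches what the swap test showed here.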
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
At least that's a straightforward fix.
 