had an unscheduled system reboot

rebo00

Dabbler
Joined
May 29, 2020
Messages
24
Hi,

I've looked at a few threads about unscheduled reboots but couldn't find troubleshooting steps for getting to the bottom of the issue, as most of them turned out to be dodgy RAM or PSUs.

Any help is appreciated. I suspect the Solarflare NIC is causing the issue, but I'm not 100% sure.

This is a new build as of around June this year and as far as I know it's always had this issue.

MemTest86: ran all tests for 4 passes; it took about 4 hours and reported no errors.
Hard drive burn-in: completed following this guide; after many days, the final SMART test found no errors.

OS: TrueNAS-13.0-U5.3
CPU: 13th Gen Intel(R) Core(TM) i5-13500
Motherboard: ASRock Z790 PG Lightning/D4
RAM: 2 x 16GB Corsair Vengeance LPX DDR4 3200 C16
Onboard NIC: Realtek 2.5Gbps, not in use
Additional NIC: Solarflare SFN6122F SF329-9021-R7
HBA: LSI SAS 9300-16I
PSU: Fractal Design ION Gold 550 Watt

Pool1: Media - 10 x 18TB CMR disks
Pool2: Plugins - 2 x 500GB NVMe M.2

Thanks,

Andy
 

Redcoat

MVP
Joined
Feb 18, 2014
Messages
2,925
The Solarflare NICs run warm, so I'd look at whether you have enough "front to back" air flow across the card to keep it cool. This is not a criticism of the NIC - I run 2 of them in my TrueNAS boxes and they are fine performers.

Same comments would go for your HBA.

You aren't overclocking are you?
 

rebo00

I've not overclocked it but will double check that tomorrow evening.
I had a problem with my HBA caused by overheating (see this post) so I added additional fans to the case and installed small fans on the HBA and the Solarflare card.
I initially had issues with the Solarflare card in my Windows PC (see this post) but it's been stable since I sorted the drive issue so that's why I suspect it's the cause of the issue in the TrueNAS server.
Also, I just remembered that the server will occasionally boot up with no connectivity on the NIC. From what I remember, the server still reports the network card as fine, though I'm not 100% sure of that.

Andy
 

Redcoat

rebo00 said: "I initially had issues with the Solarflare card in my Windows PC (see this post)"
Yes, I read that one but discounted it in this case, as it was Windows-specific and your issue is with BSD-based TrueNAS, at least so it seems. EDIT - and SFs with FreeNAS/TrueNAS have been golden in my experience.

How are you feeding power to 10 drives from the 6 connectors of your PSU? Are you using Molex splitters?
 

rebo00

Three 6-pin to 4 x SATA power cables direct from the PSU, no converters or splitters.
XMP was enabled, so the RAM was running at 3200 MHz. I've turned that off and it has gone back down to 2133 MHz.
CPU Boost is set to Auto. There isn't an option to disable it, only to set the boost manually.

EDIT: Also noticed the BIOS was on 3.01 (19/10/2022) so that's been updated to 9.01 (17/10/2023) - https://pg.asrock.com/mb/Intel/Z790 PG LightningD4/index.asp#BIOS

XMP 2.0 profile:
PXL_20231105_215314973.jpg


XMP Profile disabled
PXL_20231105_215440278.jpg



Andy
 

Redcoat

Any random reboots since the changes you made?

Have you reseated all components, connectors, add-on cards, checked heat sink thermal paste? I'd suspect thermal or power issues if there are no system error messages.
 

rebo00

No, but it can sometimes be days between reboots and other times hours, so I'll keep it running to see if it happens again.

I can't see it being a thermal issue. The server is in the loft and, thanks to British weather, it's currently 7 °C in there, but I probably wouldn't be able to verify the thermals of all components without a FLIR.

Are any historical logs kept after a reboot, so I can see whether errors were logged before it happened, or are they wiped on reboot?

Example of current CPU temps
1699228061895.png
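
For anyone wanting numbers rather than a screenshot, here is a minimal sketch that parses per-core CPU temperatures from the text output of `sysctl dev.cpu` on a FreeBSD-based system. It assumes the coretemp driver is loaded and that lines look like `dev.cpu.0.temperature: 35.0C`; the function names and the 80 °C limit are illustrative choices, not anything TrueNAS defines. It only covers the CPU, so it says nothing about the HBA or the NIC.

```python
import re
from typing import Dict

# Assumed line format from the coretemp driver: "dev.cpu.<N>.temperature: <T>C"
TEMP_RE = re.compile(r"^dev\.cpu\.(\d+)\.temperature:\s*([\d.]+)C\s*$")


def parse_cpu_temps(sysctl_output: str) -> Dict[int, float]:
    """Map core number to temperature in Celsius for every matching line."""
    temps: Dict[int, float] = {}
    for line in sysctl_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps


def over_limit(temps: Dict[int, float], limit_c: float = 80.0) -> Dict[int, float]:
    """Return only the cores at or above limit_c (the default is an arbitrary example)."""
    return {core: t for core, t in temps.items() if t >= limit_c}
```

To feed it live data, something like `subprocess.run(["sysctl", "dev.cpu"], capture_output=True, text=True).stdout` could be passed to `parse_cpu_temps`, and the result logged periodically so there is a temperature history to check after the next reboot.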


Andy
 

Redcoat

rebo00 said: "server is in the loft and thanks to British weather it's currently 7c in there"
As a British expat I know about lofts and British weather... I might be concerned about gradual thermal cycling on connectors (betting it was a lot hotter than 7 °C in June when you built this system), or about humidity in the roof space (with potential for condensation and subsequent corrosion).

/var/log/messages is where you should find the TrueNAS system messages.

EDIT: you can go to System > Advanced > Save Debug and look through the resulting file for clues, too.
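
As a rough aid for reading /var/log/messages, the sketch below collects the lines logged just before each boot and checks whether a shutdown was recorded. It rests on two assumptions: that each boot writes a `---<<BOOT>>---` kernel line (as recent FreeBSD releases do), and that a clean shutdown leaves a `shutdown:`, `reboot:` or `syslogd: exiting on signal` line. All function names are hypothetical, and the marker strings should be adjusted to whatever the log actually contains.

```python
from collections import deque
from typing import Iterable, List, Sequence

# Assumptions: adjust these to match what your log actually contains.
BOOT_MARKER = "---<<BOOT>>---"
CLEAN_HINTS = ("syslogd: exiting on signal", "shutdown:", "reboot:")


def lines_before_boots(lines: Iterable[str], context: int = 20,
                       marker: str = BOOT_MARKER) -> List[List[str]]:
    """For each boot marker, return up to `context` lines logged just before it."""
    recent: deque = deque(maxlen=context)
    chunks: List[List[str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if marker in line:
            chunks.append(list(recent))
            recent.clear()
        else:
            recent.append(line)
    return chunks


def looks_clean(preceding: Sequence[str],
                hints: Sequence[str] = CLEAN_HINTS) -> bool:
    """True if any line before the boot suggests an orderly shutdown."""
    return any(hint in line for line in preceding for hint in hints)


def report(path: str = "/var/log/messages") -> None:
    """Print a verdict and the last few log lines before every boot found."""
    with open(path, errors="replace") as handle:
        for number, chunk in enumerate(lines_before_boots(handle), 1):
            verdict = ("shutdown logged" if looks_clean(chunk)
                       else "no shutdown message (possible crash or power loss)")
            print(f"boot {number}: {verdict}")
            for line in chunk[-5:]:
                print("    " + line)
```

Calling `report()` on the server prints one entry per boot. A boot with no shutdown message in front of it points towards a hard reset or power loss rather than a software-initiated restart, which fits the thermal or power theory.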
 

rebo00

Thanks,

I looked through some of the log files but can't see anything obvious between system boot-ups.

The house survey said the loft had damp issues, so I've had plenty of ventilation installed and it's nice and airy. I also have a temperature and humidity sensor up there just in case. There are no condensation issues anywhere, but I'll keep an eye on it.

48 hours and no random reboot yet.

Andy
 

Redcoat

Thx for the follow-up. If you can "manually" drop the CPU speed I'd give that a go if you continue to have random reboots.
Good luck.
 

rebo00

6 days and no random restarts, so either the XMP profile on the RAM caused the issue or it was a BIOS issue and was fixed when I updated the firmware.
Because I have to know what the issue was, I'm going to enable the XMP profile again and see if it has the random restarts :)

Andy
 

rebo00

I actually left it as it was for a couple of weeks just to be sure, and still no random restarts.
I've enabled XMP again and it's been stable for about a week now, so it looks like it was a BIOS issue that the firmware update fixed.
Thanks again, Redcoat.
 

Redcoat

:smile::smile:
 