SOLVED TrueNAS keeps restarting every 6-10 min

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
All of a sudden, my NAS started restarting randomly tonight.

I'm running TrueNAS SCALE 23.

---

SKIP TO https://www.truenas.com/community/threads/truenas-keeps-restarting-every-6-10-min.114668/post-795150. I summarize and detail everything there.

Leaving the rest of this post for historical purposes.

---

I think it's a power-related issue because it doesn't appear to shut down; it only starts up again randomly.

I tested a bunch of things, but I'm wondering if I've overloaded the +5V rail on my PSUs in some way.

Logs

It looks like it's been happening quite a few times tonight, and only the startup is logged. That makes me think it's a hardware issue:
[screenshot: dc5a834004241b3cd1a9e4e764ae3ddf0ec33934.png]


I went ahead and shut down the server for tonight to avoid any other issues.

Why SSDs?

I used to have 68 SSDs in this NAS until a month ago, when I started an upgrade to swap the 60 hot-swap bays for 128 hot-swap bays.

During this upgrade, I found that I was missing 2 power connectors, so I only plugged in 96 drives.

After talking to the chassis manufacturer, I found out my NAS wasn't in a supported configuration and that it shouldn't even be working because each PSU only has 50A of +5V power. I later found out it worked only because the chassis evenly distributes the power draw between both PSUs, effectively giving me 100A of +5V power so long as I don't care about having redundant PSUs.

Even though I had 96 SSDs installed, I was only using 80 of them until I could switch this over to dRAID with all 128 drives, but that requires some +5V offloading, which I'm going to do soon but haven't yet done. The other night, I removed 15 of those unused drives, so only 1 remains. That puts me at 81 SSDs in this system.

All but 6 of my SSDs are Crucial MX500 drives, which are rated at 1.7A on the back label (138A on the +5V rail if they actually drew 1.7A each), but the power draw from my PSUs, even the peak readings, has never worked out to more than a rough estimate of 0.8A per drive. I left my NAS in this configuration until I had time to offload the +5V rail.
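For my own sanity, here's the back-of-the-envelope +5V math I'm working from as a quick Python sketch. The drive count, rated draw, measured draw, and PSU ratings are the numbers above; treat it as a rough estimate, not a proper power budget.

Code:
DRIVES = 81              # SSDs currently installed
RATED_AMPS = 1.7         # per-drive +5V rating printed on the MX500
MEASURED_AMPS = 0.8      # rough per-drive peak I've actually observed
PSU_5V_AMPS = 50         # +5V capacity of a single PSU
COMBINED_5V_AMPS = 100   # both PSUs sharing the load (no redundancy)

worst_case = DRIVES * RATED_AMPS     # ~137.7 A if every drive pulled its rating
observed = DRIVES * MEASURED_AMPS    # ~64.8 A based on what I've measured

print(f"Worst case (label rating): {worst_case:.1f} A")
print(f"Observed peak estimate:    {observed:.1f} A")
print(f"Headroom on one PSU:       {PSU_5V_AMPS - observed:.1f} A")
print(f"Headroom on both PSUs:     {COMBINED_5V_AMPS - observed:.1f} A")

Even at the observed ~0.8A per drive, that's roughly 65A of +5V, which is already more than one PSU's 50A rail on its own. That's why the load-sharing detail matters and why I want to offload the +5V rail.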

I have a solution for this +5V rail problem, but my dad, who was gonna help me out, couldn't make it this weekend.

The Issue Begins

Tonight, I was copying some data from my main rig to my NAS (to the SSD zpool), and suddenly, I lost connection. That's because it restarted. And then, it happened again, and again, and again every 6-10 minutes or so.

I checked a bunch of things like:
  1. Unplugging the UPS USB cables from the NAS.
  2. Plugging each PSU into a different UPS.
  3. Pulling 48 SSDs out and putting them into two separate 24-drive rackmount cases I bought in case the +5V offloading idea failed.
  4. Turning off any SSD snapshot backup tasks; although, I left the hourly snapshot task running.
I checked the +5V rail in Supermicro's BMC for my motherboard. It reported 4.99V. That seems low, but it was also unchanging.

The peak power draw numbers, even with all the drives in my NAS, were nowhere near the values I've seen during a scrub on this zpool. They also went down 40% from 470W to 290W after pulling those 48 drives. That's about 3.75W per drive (0.75A of +5V).

This is the reported wattage with the system powered off:
[screenshot: 1701687004946.png]

Not sure what's going on there. The system is OFF.

Looking at the power draw over the last 6 hours since this started: I pulled 48 SSDs out, and while the average power draw decreased, the peak power draw doesn't look like it changed at all:
[screenshot: 1701687118826.png]


Just to be clear, the system really is powered off:
[screenshot: 1701687257706.png]

[screenshot: 1701687312889.png]


I also checked the temps on my CPU and other hardware from the motherboard's IPMI panel. Everything seems in order. Processor is at 60C idle, but it's a 16-core Epyc with a half-height cooler and a single 60mm Noctua fan. I don't think 60C would force a reboot considering processors tend to throttle around 90-100C.

The IPMI web app also doesn't report any logs other than the fans spinning down, but those errors have shown for over a year even with an ASRock Rack motherboard.

[screenshot: 1701686940160.png]

Please Help!

Even after all these checks and changes, I still had the same restarting issue.

Now I'm wondering if it's something else.

Is there a good way to check what might be going on?

UPDATE
At this point, I've determined it's certain SAS ports on 1 or more of the SAS controllers causing the problem. I'm 95% certain this is the issue.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I tried one last test and watched the UPS's reported power numbers and marked down what the IPMI panel showed:
[screenshot: 1701689497810.png]


These numbers reported by the IPMI web app match what I saw on the UPS, except for the average power draw. On average, it was between 290W and 315W. If I went to the "Storage" page in TrueNAS, it jumped up to that 390W number, and then went back to ~310W after.

I was able to do that multiple times no problem.

Then I loaded the Plex Dash app on my phone. That killed it. All the data is on the main SSD zpool, but a separate 2-drive mirror SSD zpool has all the TrueNAS apps. I don't know what actually happened, but when I loaded that app on my phone the fan noise got softer, and I noticed it was rebooting.

Sadly, even though I was in the room listening, I didn't hear any beeps or boops that you'd expect to hear in the event of a power failure.
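If I want to catch the exact draw at the moment it dies instead of eyeballing two screens, I could log the UPS and BMC readings side by side with something like this. It's only a sketch: it assumes NUT's `upsc` and `ipmitool` are available on the box, `ups@localhost` is a placeholder for my actual NUT UPS name, and not every BMC supports the DCMI power reading.

Code:
import datetime
import subprocess
import time

UPS_NAME = "ups@localhost"  # placeholder; use the real NUT UPS name


def run(cmd):
    """Run a command and return its stdout, or an error string."""
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
        return out.stdout.strip()
    except Exception as exc:
        return f"error: {exc}"


with open("power-log.txt", "a") as log:
    while True:
        stamp = datetime.datetime.now().isoformat(timespec="seconds")
        ups_load = run(["upsc", UPS_NAME, "ups.load"])  # load as % of UPS capacity
        dcmi = run(["ipmitool", "dcmi", "power", "reading"])
        # keep just the instantaneous wattage line if the BMC reports one
        watts = next((l.strip() for l in dcmi.splitlines() if "Instantaneous" in l), dcmi)
        log.write(f"{stamp}  ups.load={ups_load}  {watts}\n")
        log.flush()
        time.sleep(5)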
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
You can try to open the IPMI console from a PC and start a video capture, then boot up the server and try to access some resources. Perhaps there are kernel messages printed to the console when the problem occurs.
Another possible reason for reboots is defective RAM: the server starts and works fine until you access data that hits the problematic RAM DIMM, and then the server reboots. Run memtest.
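You can also dump the BMC's event log from the OS after each boot so you have a copy to compare between crashes; a rough sketch, assuming ipmitool is installed and can reach the local BMC:

Code:
import datetime
import subprocess

# Dump the BMC's System Event Log (SEL) to a timestamped file so entries
# can be compared across reboots. The SEL lives in the BMC, so it survives
# the crashes even when nothing shows up in the OS logs.
stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
sel = subprocess.run(["ipmitool", "sel", "list"],
                     capture_output=True, text=True, check=True).stdout
path = f"sel-{stamp}.txt"
with open(path, "w") as f:
    f.write(sel)
print(f"wrote {path} ({len(sel.splitlines())} entries)")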
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I finally got something.

So this time, I swapped all +5V power from my SSDs to +12V power. Any devices that already use +5V power, I didn't touch.

I put the boot drives and TrueNAS Apps SSDs in; no problem so far. Running like this for 2 hours.

I connected my HDDs over external SAS, still fine. I ran it for maybe 20 min like this.

I started plugging in SSDs, and then it happened again:
Code:
  2023-12-05 03:09:21 Critical Interrupt  [CI-0004] , PCI PERR @Bus05 (DevFn00) - Assertion
  2023-12-05 03:09:21 Critical Interrupt  [CI-0004] , PCI PERR @Bus04 (DevFn00) - Assertion
  2023-12-05 03:09:21 Critical Interrupt  [CI-0004] , PCI PERR @Bus42 (DevFn00) - Assertion

I wasn't even done plugging in the drives. I got 13 in, and it was already rebooting.

These errors are related to my LSI cards, but why? Too hot? Not liking these drives anymore?

The error messages are logged one hour ahead of my timezone, but that could be a daylight saving time difference. Not sure if that's the issue.
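To figure out which cards Bus04, Bus05, and Bus42 actually are, I can try matching those bus numbers against lspci. Rough sketch below; one assumption is that the BMC's bus numbering lines up with what the OS reports, which isn't guaranteed, so I'd treat any match as a hint rather than proof.

Code:
import subprocess

# Bus numbers from the Critical Interrupt / PCI PERR entries above.
PERR_BUSES = {"04", "05", "42"}

lspci = subprocess.run(["lspci"], capture_output=True, text=True).stdout
for line in lspci.splitlines():
    bus = line.split(":", 1)[0]  # "42:00.0 Serial Attached SCSI ..." -> "42"
    if bus in PERR_BUSES:
        print(line)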
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Looking at the other log messages, some have green checkmarks, but after reading a bit into it, it looks like AOC_NIC7 (whatever that is) is overheating?

A network interface? Which one? Why overheating?

Code:
  [green]  2023-12-05 01:28:04  Temperature  [IPMI-1009] AOC_NIC7 Temp, Upper Critical - going high - Deassertion
  [green]  2023-12-05 01:28:04  Temperature  [IPMI-1011] AOC_NIC7 Temp, Upper Non-recoverable - going high - Deassertion
  [green]  2023-12-05 01:27:55  AOC Network  [AOC-1001] System NIC in Slot 7, Health transition to Normal from Upper Critical - Assertion
  [red]    2023-12-05 00:32:05  Fan          [IPMI-2004] FAN4, Lower Non-recoverable - going low - Assertion
  [red]    2023-12-05 00:28:39  Temperature  [IPMI-1011] AOC_NIC7 Temp, Upper Non-recoverable - going high - Assertion
  [red]    2023-12-05 00:28:31  AOC Network  [AOC-1007] System NIC in Slot 7, Health transition to Upper Critical from Upper Non-Critical - Assertion
  [red]    2023-12-05 00:27:55  Temperature  [IPMI-1009] AOC_NIC7 Temp, Upper Critical - going high - Assertion

I'm only using the onboard IPMI NIC. The other two 10Gb NICs are unused.

There is a PCIe ConnectX-6. Five days ago, I lost connection to my NAS. It was the NIC not responding. I connected another SFP28 adapter and tried swapping around fiber cables and SFP28 adapters.

I eventually had to restart the NAS to fix it.

Is that the card overheating now? It always had a single SFP28 adapter up until 5 days ago.

From what I remember in the UEFI and manual, the closest PCIe slot to the CPU is slot 7.
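To keep an eye on that card's temperature from the OS instead of waiting for IPMI alarms, I could poll the kernel's hwmon sensors. Whether the ConnectX-6 driver actually exposes a temperature there on this system is an assumption on my part; this just lists whatever the kernel provides:

Code:
from pathlib import Path

# List every temperature sensor exposed under /sys/class/hwmon.
for hwmon in sorted(Path("/sys/class/hwmon").glob("hwmon*")):
    name_file = hwmon / "name"
    name = name_file.read_text().strip() if name_file.exists() else "?"
    for temp in sorted(hwmon.glob("temp*_input")):
        label_file = hwmon / temp.name.replace("_input", "_label")
        label = label_file.read_text().strip() if label_file.exists() else temp.name
        celsius = int(temp.read_text().strip()) / 1000  # values are millidegrees
        print(f"{name:15s} {label:25s} {celsius:5.1f} C")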
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
Remove all additional cards that are not needed for basic operation of the server and try to connect all the drives again. Perhaps this problematic NIC is disturbing operations on the PCIe bus and causing the server to reboot.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I think that's it.

My HDD array and SSDs are all connected to SAS controllers in the other 6 slots, so if you try to read from one and the NIC is triggered in any way, that could cause a failure.

I put my finger on the heatsink for the NIC. OH MAN! It's hot. It burns my finger, as do the heatsinks for the other cards. Usually, they're cool to the touch, but since it's winter and this is in my HVAC room (pumping heat), that's probably related.

I propped open the door which helps cool down the room, but I'm wondering if I just don't have enough airflow anymore for the amount of heat being generated by these cards.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Here's another thing I noticed:
[screenshot: 1701765848113.png]


I don't know what these fans are. Why does it care about FAN4? What even is that?

There are a bunch of fan ports on the board, but none are occupied by a fan except the one next to the I/O shield with the CPU fan.

I tried plugging my 5 Noctua NF-12 chassis fans into the motherboard, but no matter what I did, they spun really slowly.

To mitigate this, I plugged them all into the PSU directly (it has fan cables). I took the yellow-wire connector and plugged that into the board. The fans are split 3 and 2 across 2 separate ports this way.
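From what I've read, that FAN4 "Lower Non-recoverable - going low" alarm is what you get when the BMC sees no tach signal (or an RPM below its lower thresholds), which slow-spinning Noctuas can trigger. On Supermicro BMCs the thresholds can apparently be read and lowered with ipmitool; here's how I'd try it, with the 100/200/300 values being placeholder guesses rather than recommendations:

Code:
import subprocess


def ipmi(*args):
    """Run an ipmitool subcommand and print whatever it returns."""
    out = subprocess.run(["ipmitool", *args], capture_output=True, text=True)
    print(out.stdout or out.stderr)


# Show FAN4's current reading and thresholds first.
ipmi("sensor", "get", "FAN4")

# Lower thresholds are given as: non-recoverable, critical, non-critical (RPM).
# 100/200/300 are placeholders; check the fan's actual minimum RPM before setting.
ipmi("sensor", "thresh", "FAN4", "lower", "100", "200", "300")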
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
You have NOCTUA fans in a chassis with 120 bays? Holy shit. You've probably got super bad airflow, which is reflected by everything showing symptoms of overheating.

Pull the Noctua crap and reinstall the OEM fans. If this is a Supermicro, Dell, etc., chassis, the airflow engineering in these is designed to use high static pressure differentials to force air in through the various millimeter-ish wide gaps between the drives and the chassis. This is very hard work and is generally a noisy business; most people who replace their fans with Noctua fans are trying to "silence" their servers based on bad Internet advice or their own previous gaming experience. You need to get rid of the heat.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
OMG, it's the NIC!

"Network AOC" stands for add-on card. The model number looks similar to the redundant PSUs, so I figured it was them:

[screenshot: 1701766335054.png]


This is making more sense, but the temperature is reading fine now. I still can't access the NAS over its NIC at the moment.

---

It's working now. Must've needed to cool down a bit. No restarts either, and I put 128 SSDs in here. I also added eight +12V to +5V DC-DC converters to help with the SSD load on the PSU.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
You have NOCTUA fans in a chassis with 120 bays? Holy shit. You've probably got super bad airflow, which is reflected by everything showing symptoms of overheating.

Pull the Noctua crap and reinstall the OEM fans. If this is a Supermicro, Dell, etc., chassis, the airflow engineering in these is designed to use high static pressure differentials to force air in through the various millimeter-ish wide gaps between the drives and the chassis. This is very hard work and is generally a noisy business; most people who replace their fans with Noctua fans are trying to "silence" their servers based on bad Internet advice or their own previous gaming experience. You need to get rid of the heat.
You're probably right.

I didn't do it on Internet advice; I just have a bunch of these Noctua fans lying around and the fans in here are the same 120mm size with the same thickness. This is a Storinator XL60 chassis.

There aren't any HDDs in this case. It's all SSDs that produce considerably less heat. I've had 2 Noctuas in the back and 1 in the front for quite some time now. The CPU fan is also a 60mm Noctua because it's a low-profile cooler.

The stock fans had these little o-rings that, once they came off, wouldn't easily go back on, so I decided to just swap them for Noctuas. I only had 30 SSDs in here when I made that change though. Now I have up to 123 (gonna be 124 to 128) in the chassis, so that logic isn't the same anymore.

The Noctuas definitely spin slower than the stock fans, and the airflow feels less, so I don't mind changing them back to stock.

I think my issue here is that I moved these Storinators to the HVAC room since it has more space than under my stairs, and they can sit farther from the door (keeping server noise from permeating my house). It's the eight 19mm screamers on the redundant PSUs that cause noise issues, not the case fans.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Swapped all the fans back to the stock fans:
[screenshot: 1701779295010.png]


I also have two SFP28 adapters in this card right now and it's gone down from 65C to 39C.
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
I didn't do it on Internet advice; I just have a bunch of these Noctua fans lying around and the fans in here are the same 120mm size with the same thickness. This is a Storinator XL60 chassis.
Forget the SSDs (though how air flows around them and/or escapes through the gaps that these O-rings sought to close may be of relevance…).

My HDD array and SSDs are all connected to SAS controllers in the other 6 slots, so if you try to read from one and the NIC is triggered in any way, that could cause a failure.

I put my finger on the heatsink for the NIC. OH MAN! It's hot. Burns my finger. As do the heatsinks for the other cards.
Because your case lacks proper SAS backplanes and expanders, you have six HBAs in there, plus the NIC. Over 100 W of heat. What is providing airflow to cool these cards?
You need to address this. Fortunately, the overheating NIC gave a warning by tripping the NAS over. Overheating HBAs can corrupt your pool—and maybe they are doing that silently.
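If you want to check whether the pools have already taken damage, zpool will tell you what it knows so far; a minimal sketch (run it on the NAS itself, and keep in mind a clean report only covers blocks ZFS has actually read or scrubbed):

Code:
import subprocess


def zpool(*args):
    return subprocess.run(["zpool", *args], capture_output=True, text=True).stdout


# "-x" prints "all pools are healthy" or a summary of the sick ones;
# "-v" includes the list of files with permanent errors, if any.
print(zpool("status", "-x"))
print(zpool("status", "-v"))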
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Forget the SSDs (though how air flows around them and/or escapes through the gaps that these O-rings sought to close may be of relevance…).


Because your case lacks proper SAS backplanes and expanders, you have six HBAs in there, plus the NIC. Over 100 W of heat. What is providing airflow to cool these cards?
You need to address this. Fortunately, the overheating NIC gave a warning by tripping the NAS over. Overheating HBAs can corrupt your pool—and maybe they are doing that silently.
Even if I had SAS expanders, I'd still have 6 cards in here. They'd just be all SAS expanders rather than SAS controllers. They'd still produce heat as my SAS expanders get pretty hot.

There are two 120mm fans blowing straight through these cards, but they pull air from the front of the case, and that air now has to pass through 128 SSDs first. I could feel the heat coming off all those SAS cards before, but they're running cooler right now. The middle card is the one running hot at the moment.

---

UPDATE: The restart occurred again, this time after I ran a scrub on my main SSD pool. I'd run 2 scrubs last month with this same pool and 96 SSDs in the front chassis with no problem.
What has changed since before?
  1. I'm using the stock chassis fans again which blow a lot more air than the Noctuas.
  2. All the SSDs are in the case now, but none of them are pulling +5V from the PSU; only +12V.
  3. I'm only running one SFP28 fiber jack, but I did run a 10Gb Ethernet jack that was previously unused. Gonna remove that now as well because I am getting "heat critical" messages about the integrated 10Gb jack. Just to be clear, I haven't done any large network transactions.
  4. All my cards feel cool from the back, meaning they're shedding the heat pretty well.
  5. Looks like I am still having NIC overheating messages. It's possible I need to move the NIC in between other cards and farther away from the CPU.
These numbers are from 2 hours ago, pretty sure:
[screenshot: 1701786834502.png]


Oh man... Another restart just occurred. And they'll probably keep happening because it's in a "zpool scrub" loop now. After each restart, it resumes the scrub, which then causes another restart.
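One way to break the loop might be to pause the scrub right after the box comes back up, before it gets far enough to trip another reboot. A rough sketch; "ssd-pool" is a placeholder for my actual pool name:

Code:
import subprocess

POOL = "ssd-pool"  # placeholder; substitute the real pool name

# `zpool scrub -p` pauses an in-progress scrub (it errors if none is running).
result = subprocess.run(["zpool", "scrub", "-p", POOL],
                        capture_output=True, text=True)
print(result.stdout or result.stderr or f"scrub on {POOL} paused")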
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
All my cards feel cool from the back, meaning they're shedding the heat pretty well.

Or that there's insufficient airflow over them, and you're either not getting sufficient CFM or cool air is mixing from "above" the card, both of which can happen easily if you're not engineering for airflow.

One of the weird things about servers is that you often end up needing to direct airflow with shrouds and placement. I once saw a site where they had torn out various shrouds such as the clear Supermicro plastic ones with the little flexible loopy thing, totally unaware that this was designed to guide chassis airflow from the fan bulkhead fans. Or removing the CPU airflow shroud(s). Bad juju.
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
Remove the NIC that is overheating; if you can't use the integrated NIC, plug in a 1Gb one just so you can still monitor and manage the NAS.
The scrub will start, and you can check whether the restart happens again. If the scrub finishes without a restart, the problem is this NIC; if the NAS restarts again, something else is causing the problem.
 

HoneyBadger

actually does care
Administrator
Moderator
iXsystems
Joined
Feb 6, 2014
Messages
5,112
Or that there's insufficient airflow over them, and you're either not getting sufficient CFM or cool air is mixing from "above" the card, both of which can happen easily if you're not engineering for airflow.

One of the weird things about servers is that you often end up needing to direct airflow with shrouds and placement. I once saw a site where they had torn out various shrouds such as the clear Supermicro plastic ones with the little flexible loopy thing, totally unaware that this was designed to guide chassis airflow from the fan bulkhead fans. Or removing the CPU airflow shroud(s). Bad juju.
The other counter-intuitive case is that there's bad contact between the heatsink and the component beneath it.

A user may say "the heatsink feels cold" - but that's because the heatsink isn't making contact, due to improper fitting, degraded thermal interface material, or similar - so the chip underneath is warming up and not passing its heat on to the heatsink for dissipation.

I'd suggest a spot thermometer to check temperatures if you have one - heatsinks in operation can get hot enough to cause serious burns, so don't simply touch it to figure out if it's hot.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
There's a key piece I haven't been considering in all this. If the power is really being cut, why is the system rebooting? Is it a blip? My UPSs never make any beeping noises, so that brings me to the PSU or motherboard itself.

Is this affecting the IPMI hardware as well? It should be sensitive to power issues. Maybe that's why nothing's logged? But wouldn't I also get logged out of the IPMI web view? I don't get logged out, and in fact, the "remote control" screen view stays open when this occurs. The screen will blank like a regular restart occurred, and then the "hit DEL for UEFI" screen shows up.

To be clear, I'm seeing NO health reports since early this morning where I had two SFP28 adapters plugged into the NIC. Is it possible that's messed up as well? Maybe the motherboard is causing problems and forcing restarts? It only takes bridging two of those wires.

Or that there's insufficient airflow over them, and you're either not getting sufficient CFM or cool air is mixing from "above" the card, both of which can happen easily if you're not engineering for airflow.

One of the weird things about servers is that you often end up needing to direct airflow with shrouds and placement. I once saw a site where they had torn out various shrouds such as the clear Supermicro plastic ones with the little flexible loopy thing, totally unaware that this was designed to guide chassis airflow from the fan bulkhead fans. Or removing the CPU airflow shroud(s). Bad juju.
This case didn't have any sort of shroud, but I get what you mean. It's possible all the airflow is going right out of the top of the case and not through the cards.

Just to be clear, it's been working for a year; although, over the last year, I made a number of changes.

Changes I can remember in-order over the last year (first four are within the first 6 months):
  1. Change to using 3 SAS controllers.
  2. Swapped two case fans in the back for Noctuas that weren't spinning very fast.
  3. Change to using 4 SAS controllers, 2 x 24i, 2 x 16i.
  4. Add a 16e SAS controller.
  5. Add a 25Gb Ethernet card.
  6. Swap two 16i cards for 24i and add a 5th 24i (just two months ago).
  7. Added a new dedicated 20A circuit just for these servers (9 days ago).
  8. Added a second UPS to separate the main chassis from the HDD one (9 days ago).
  9. Swapped a front case fan for a Noctua because the rubber grommets around the screws came out.
  10. Added another UPS. Originally had this split between both PSUs on the main chassis, but now it's only for the network switch. (7 days ago).
  11. Add a second SFP28 adapter to the Ethernet card (6 days ago).
  12. Copied some files and noticed the 6-10min reboot loop (3 days ago).
  13. Swapped out 2 more case fans for Noctuas (1 day ago).
  14. Swapped all fans back to stock (6 hours later).
A bunch of stuff is suspect like a new circuit and UPSs, but only one corresponds with these reboots: the second SFP28 adapter.

Remove the NIC that is overheating; if you can't use the integrated NIC, plug in a 1Gb one just so you can still monitor and manage the NAS.
The scrub will start, and you can check whether the restart happens again. If the scrub finishes without a restart, the problem is this NIC; if the NAS restarts again, something else is causing the problem.
The motherboard has two 10Gb NICs; although, one of them was also complaining about overheating yesterday:
[screenshot: 1701817582547.png]


The 1Gb NIC is for IPMI.

At this point, I think removing the 16e SAS card and the 25Gb NIC is gonna be my first task. Even if those aren't the cause, they can help me narrow down the issue. I think putting the 16e card next to the 25Gb NIC could be cause for concern.

The other counter-intuitive case is that there's bad contact between the heatsink and the component beneath it.

A user may say "the heatsink feels cold" - but that's because the heatsink isn't making contact, due to improper fitting, degraded thermal interface material, or similar - so the chip underneath is warming up and not passing its heat on to the heatsink for dissipation.

I'd suggest a spot thermometer to check temperatures if you have one - heatsinks in operation can get hot enough to cause serious burns, so don't simply touch it to figure out if it's hot.

Do you mean the NIC? That could very well be the case. I could put some good thermal paste on there.

I have one of those thermometers for kids that you can use from a distance. I can try that.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
There's a key piece I haven't been considering in all this. If the power is really being cut, why is the system rebooting? Is it a blip? My UPSs never make any beeping noises, so that brings me to the PSU or motherboard itself.

A system brownout can cause unpredictable and various problems. Proper sizing of the power supply is important; your server lives or dies by the sufficiency of its power supply. This can happen externally (i.e. sucking down 16A of a 20A circuit is right on the breaking limit for circuit derating), a UPS that cannot quite shoulder the load (especially a "double conversion" UPS), a PSU that is flaking on one or more rails due to excessive draw, power cabling that fails to obey ampacity limits for the various cables, connectors, and wire gauges inside the server, or just the good ol' classic failure to properly project required power. Certainly a reboot on brownout is possible.

In general, the time to debug a system is during the burn-in and testing phase, which should be the several weeks after you build the server and before you deploy it. Problem servers may take over a month to prove their stability.
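As a concrete example of the external side of that math, here's roughly the arithmetic, using the 20A circuit and the wattage numbers from earlier in this thread; the 120V figure and the 80% continuous-load derating are assumptions for a typical North American branch circuit.

Code:
# External power budget sanity check for an assumed 120V/20A branch circuit.
CIRCUIT_VOLTS = 120            # assumed; not stated in the thread
CIRCUIT_AMPS = 20              # the new dedicated circuit mentioned earlier
CONTINUOUS_FACTOR = 0.80       # typical derating for continuous loads

continuous_limit = CIRCUIT_VOLTS * CIRCUIT_AMPS * CONTINUOUS_FACTOR  # 1920 W

for label, watts in [("peak with all drives", 470), ("after pulling 48 SSDs", 290)]:
    pct = watts / continuous_limit * 100
    print(f"{label}: {watts} W = {pct:.0f}% of the {continuous_limit:.0f} W continuous limit")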
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I just updated the UEFI, so I'm setting things back to how they were manually. Not sure how else to do it.

I think I figured out why it's restarting (this is how it was set before):

[screenshot: 1701821529923.png]


It's set to restart on AC power loss.
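The same behavior can usually be checked (and changed) from the OS through the BMC instead of rebooting into the UEFI setup; a small sketch, assuming in-band ipmitool access works on this board:

Code:
import subprocess


def ipmi(*args):
    out = subprocess.run(["ipmitool", *args], capture_output=True, text=True)
    print(out.stdout or out.stderr)


ipmi("chassis", "status")          # includes the current "Power Restore Policy"
ipmi("chassis", "policy", "list")  # lists the policies this BMC supports
# To change it, e.g. stay off after AC loss:
# ipmi("chassis", "policy", "always-off")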
 