SSD Pool/Network freezing

nickdems

Dabbler
Joined
Jun 7, 2023
Messages
13
Hi all,

Recently I've been having quite a few inconsistencies in the performance of my NAS. Specifically, it freezes very often, for 2-20 seconds, then resumes normally. Everything freezes: file transfers, VMs, jails, even the TrueNAS GUI.

I have a sneaky feeling this is related either to my SSD pool or to the mobo's Realtek NIC.

Here are the pc specs:
  • cpu: ryzen 5 3600
  • memory: 64gb ddr4
  • mobo: msi B450 TOMAHAWK MAX II (Realtek® 8111H Gigabit LAN controller)
  • pci: Realtek RTL8125B 2.5gbe for direct link to pc
  • pci: Dell PERC H200 (flashed IT)

The system has these drives:
  • Direct mobo sata:
    • 2x250gb ssd mirror boot drive
    • 4x8tb hdd (WD Red pro) raidz1 Tank pool
  • HBA:
    • 4x16tb hdd (exos x16) raidz1 Tank pool
    • 2x500gb ssd mirror SSD_Pool
truenas# zpool list -v
NAME                                              SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ  FRAG     CAP  DEDUP  HEALTH  ALTROOT
SSD_Pool                                          460G  55.7G   404G        -         -   21%     12%  1.00x  ONLINE  /mnt
  mirror-0                                        460G  55.7G   404G        -         -   21%  12.10%      -  ONLINE
    gptid/04444716-0c81-11ee-8b50-2cf05de61ea9    464G      -      -        -         -     -       -      -  ONLINE
    gptid/04349e2f-0c81-11ee-8b50-2cf05de61ea9    464G      -      -        -         -     -       -      -  ONLINE
Tank                                             87.3T  64.4T  22.9T        -         -   12%     73%  1.00x  ONLINE  /mnt
  raidz1-0                                       29.1T  23.4T  5.67T        -         -   19%  80.50%      -  ONLINE
    gptid/2ba46015-18bf-11ed-9bbf-2cf05de61ea9   7.28T      -      -        -         -     -       -      -  ONLINE
    gptid/2ba076b3-18bf-11ed-9bbf-2cf05de61ea9   7.28T      -      -        -         -     -       -      -  ONLINE
    gptid/2ba250b6-18bf-11ed-9bbf-2cf05de61ea9   7.28T      -      -        -         -     -       -      -  ONLINE
    gptid/2ba35dac-18bf-11ed-9bbf-2cf05de61ea9   7.28T      -      -        -         -     -       -      -  ONLINE
  raidz1-1                                       58.2T  41.0T  17.2T        -         -   10%  70.50%      -  ONLINE
    gptid/9589f772-0c86-11ee-8b50-2cf05de61ea9   14.6T      -      -        -         -     -       -      -  ONLINE
    gptid/956f2cc9-0c86-11ee-8b50-2cf05de61ea9   14.6T      -      -        -         -     -       -      -  ONLINE
    gptid/95821028-0c86-11ee-8b50-2cf05de61ea9   14.6T      -      -        -         -     -       -      -  ONLINE
    gptid/957ad862-0c86-11ee-8b50-2cf05de61ea9   14.6T      -      -        -         -     -       -      -  ONLINE
boot-pool                                         222G  3.03G   219G        -         -    0%      1%  1.00x  ONLINE  -
  mirror-0                                        222G  3.03G   219G        -         -    0%   1.36%      -  ONLINE
    ada1p2                                        222G      -      -        -         -     -       -      -  ONLINE
    ada0p2                                        222G      -      -        -         -     -       -      -  ONLINE

The SSD_Pool runs 3 VMs:
  • 1 super light, just runs Pi-hole (1 core, 1 GB RAM)
  • 1 light, just runs a VPN and Prowlarr (1 core, 2 GB RAM)
  • 1 heavy, runs the *arr suite and qBittorrent (3 cores, 24 GB RAM)
Additionally, I have a Plex jail, as well as another qBittorrent jail.

I've tried running htop and iotop on all three VMs but can't pinpoint anything related to the freezes.
This may or may not be related, but I was reading the thread "Device is causing slow io on SSD drive?", where Patrick M. Hausen noted that SATA SSDs could be bottlenecking/choking the system?

Historically, when I restart the heavy VM, the situation seems to get better for a short period of time before going back to freezing often.


As mentioned, when the system freezes, even the UI is frozen, so I can't see much. Once it unblocks, I can see that the CPU had a small spike in interrupts, and of course the network I/O dropped to 0.

Today there was a big spike, which is the reason I've come to post here for help, and I even got an alert from the server:
Device /dev/gptid/04349e2f-0c81-11ee-8b50-2cf05de61ea9 is causing slow I/O on pool SSD_Pool.

There's nothing in the SMART tests to indicate any further issue.

Here are the pool disk reports from that time:
disk_usage.jpg


Is there anything I can run to debug this issue further? Is it worth reducing the CPU/RAM allocated to the VMs?
I'm open to any suggestions/help; I'm pretty new to the whole ecosystem.

Thanks for your time,
Nicholas


PS: this may be irrelevant, but I've noticed that the SSD pool seems to hit 100% usage quite often, every 15-20 minutes, and this doesn't necessarily line up with the freezes. (Why is it often the two disks at different times?)
 

Attachments

  • Capture.JPG

nickdems

Oh, I thought the spoilers would truncate the images rather than blur them out. I will see if I can edit the post later.

I wanted to add that the SSD_Pool consists of 2x Crucial BX500 500GB 2.5" SATA SSDs.
And the jails run on Tank.

It might be worth saying that the qBittorrent in the VM usually gets around 50 MB/s sustained download speed, whereas the qBittorrent in the jail sustains 100 MB/s all day.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
Hmm, looks like you used ISPOILER, which I didn't even know was a thing. It blurs things like a Russian surveillance drone, so you probably don't want it around. Regular SPOILER has a hidey button thingy, as usual.
I also fixed the console output, which should have been CODE (it automatically gets collapsed down to a possibly-too-small area).
 

nickdems

Ericloewe said:
  Hmm, looks like you used ISPOILER, which I didn't even know was a thing. It blurs things like a Russian surveillance drone, so you probably don't want it around. Regular SPOILER has a hidey button thingy, as usual.
  I also fixed the console output, which should have been CODE (it automatically gets collapsed down to a possibly-too-small area).

I clicked on the "inline spoiler" button (left of "more options...") rather than the "show/don't show" one (right of "more options...").
 

ChrisRJ

Wizard
Joined
Oct 23, 2020
Messages
1,919
The first thing you need to do is try to find a way to trigger this behavior. Otherwise we can only guess.

As to guessing, though, the NICs are possible culprits here. Realtek NICs have a truly bad reputation and may well be behind this. Also see the recommended readings in my signature.
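
Before a trigger can be found, it helps to have timestamps for the freezes. A minimal Python sketch of a stall logger follows, assuming python3 is available in the TrueNAS shell. The probe path, the interval and the threshold are illustrative assumptions, not values from this thread. The loop should wake up every half second; when it wakes up late, either the whole system stalled or the small fsync'd write to the pool was blocked.

```python
import os
import time


def find_stalls(timestamps, interval, threshold):
    """Return (tick_before_stall, extra_delay) for every late tick.

    timestamps: monotonic times recorded once per loop iteration.
    interval:   intended spacing between ticks, in seconds.
    threshold:  extra delay (seconds) tolerated before calling it a stall.
    """
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        extra = (cur - prev) - interval
        if extra > threshold:
            stalls.append((prev, extra))
    return stalls


def fsync_probe(path):
    """Write a few bytes to `path` and force them to disk,
    so that blocked pool I/O shows up as a late tick."""
    with open(path, "wb") as fh:
        fh.write(b"probe")
        fh.flush()
        os.fsync(fh.fileno())


def watch(probe_path, interval=0.5, threshold=1.0, max_ticks=None):
    """Tick every `interval` seconds; print a line whenever a tick is late.

    The printed wall-clock time is the moment the stall ENDED.
    """
    last = time.monotonic()
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        time.sleep(interval)
        fsync_probe(probe_path)
        now = time.monotonic()
        extra = (now - last) - interval
        if extra > threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print(f"{stamp} stall: tick was {extra:.1f}s late", flush=True)
        last = now
        ticks += 1
    return ticks


# Example with a hypothetical path on the affected pool (runs until interrupted):
# watch("/mnt/SSD_Pool/stall_probe.tmp")
```

Running one copy against SSD_Pool and one against Tank might show whether only one pool blocks or the whole box does. Keep in mind the probe itself adds a small sync write twice a second.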
 

Ericloewe

nickdems said:
  I clicked on the "inline spoiler" button (left of "more options...") rather than the "show/don't show" one (right of "more options...").

So that's what the symbol is supposed to stand for... I've mostly dismissed it whenever I run across it.
 

Davvo

MVP
Joined
Jul 12, 2022
Messages
3,222
Please read the following page regarding BBcodes :smile:
 

Ericloewe

Ooh, ICODE/CODE=rich sounds like something I've occasionally needed...
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
Any reason why the HDDs are split between motherboard and HBA? I would rather put all HDDs on the HBA and the SSDs on the motherboard ports.

You have two Realtek NICs. Are they both in use, and if so how?
 

nickdems

Etorix said:
  Any reason why the HDDs are split between motherboard and HBA? I would rather put all HDDs on the HBA and the SSDs on the motherboard ports.

  You have two Realtek NICs. Are they both in use, and if so how?

The HDD split happened because I first had a zvol before getting an HBA, and I wasn't sure I could move the drives from the mobo's controller to the HBA without issues or some extra process, so I just left them there.

The second NIC (2.5GbE) is a PCIe card connected to my main PC for faster file transfers; it's not used by anything else.

Thanks
 

Etorix

nickdems said:
  The HDD split happened because I first had a zvol before getting an HBA, and I wasn't sure I could move the drives from the mobo's controller to the HBA without issues or some extra process, so I just left them there.

You can move or reshuffle drives at will. ZFS tracks them by UUID (the gptid/... names in your zpool list output), not by device name under /dev.

nickdems said:
  The second NIC (2.5GbE) is a PCIe card connected to my main PC for faster file transfers; it's not used by anything else.

And the on-board NIC is not connected at all? Having multiple interfaces on the same network is bad.
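
To make "on the same network" concrete, here is a small Python sketch using the standard ipaddress module. It reports which interfaces have addresses in overlapping subnets. The interface names and addresses are made up for illustration; substitute whatever the NAS actually has configured.

```python
import ipaddress
from itertools import combinations


def overlapping_interfaces(interfaces):
    """Return pairs of interface names whose configured subnets overlap.

    interfaces: dict of name -> address in CIDR form, e.g. "192.168.1.10/24".
    """
    nets = {
        name: ipaddress.ip_interface(addr).network
        for name, addr in interfaces.items()
    }
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


# Hypothetical configuration: both NICs in 192.168.1.0/24 (the problematic case).
same_net = {"re0": "192.168.1.10/24", "re1": "192.168.1.11/24"}

# Hypothetical configuration: the direct link has its own small subnet.
split_net = {"re0": "192.168.1.10/24", "re1": "10.10.10.1/30"}

print(overlapping_interfaces(same_net))   # [('re0', 're1')]
print(overlapping_interfaces(split_net))  # []
```

If the direct PC link shares a subnet with the router-facing port, giving it a separate subnet is the usual way out.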
 

nickdems

Etorix said:
  You can move or reshuffle drives at will. ZFS tracks them by UUID, not by device name under /dev.

  And the on-board NIC is not connected at all? Having multiple interfaces on the same network is bad.

Oh OK, then I might do that next time I shut it down. I thought it would be more complicated.

Sorry for the confusion about the NICs, maybe my terminology is wrong.
The onboard 1GbE is plugged into the router and is used by all VMs, jails, etc.
The 2.5GbE is a PCIe card that connects directly to my PC. Safe to ignore, I suppose; I was just listing the server parts.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

 

ChrisRJ

nickdems said:
  The onboard 1GbE is plugged into the router and is used by all VMs, jails, etc.
  The 2.5GbE is a PCIe card that connects directly to my PC. Safe to ignore, I suppose; I was just listing the server parts.

In a scenario like this it is never a good idea to ignore anything. The overall system is complex and I can think of multiple reasons why the additional NIC might be the root cause.

As mentioned above: if you want to approach this in a structured way (which greatly increases the chance of fixing things), you must first isolate the conditions under which the problem occurs. If that is not possible, resolving things is not impossible, but certainly more difficult.

The first step is always to remove, wherever possible, things from the equation. In other words: see if unplugging the additional NIC, which is not suitable anyway, has an impact.
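
One way to judge whether removing something "has an impact" is to compare how often the freezes happen per hour in each configuration. A minimal Python sketch follows, assuming you have noted the freeze times somehow (by hand or from a log). All numbers and labels below are invented for illustration and are not measurements from this system.

```python
def stalls_per_hour(stall_times, periods):
    """Return the stall rate (stalls per hour) for each labelled test period.

    stall_times: times of observed freezes, in seconds since the test began.
    periods:     dict of label -> (start, end) in the same time base.
    """
    rates = {}
    for label, (start, end) in periods.items():
        hours = (end - start) / 3600
        if hours <= 0:
            raise ValueError(f"empty test period: {label}")
        count = sum(1 for t in stall_times if start <= t < end)
        rates[label] = count / hours
    return rates


# Invented example data: two hours with both NICs, then two hours with
# the 2.5GbE card unplugged.
example_stalls = [100, 900, 3000, 5000, 8000]
example_periods = {
    "both NICs connected": (0, 7200),
    "2.5GbE unplugged": (7200, 14400),
}

print(stalls_per_hour(example_stalls, example_periods))
# {'both NICs connected': 2.0, '2.5GbE unplugged': 0.5}
```

Change one thing per test period (NIC, heavy VM stopped, torrent client paused) and keep each period long enough to catch several freezes, otherwise the rates say very little.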
 