SOLVED TrueNAS keeps restarting every 6-10 min

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Just enabled logging of system boots:
[screenshot]


This should help me track down when they occur.
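
For reference, the journal also keeps its own record of boots; a quick way to cross-check when the resets happened (assuming systemd-journald, which TrueNAS SCALE uses):
Code:
# List every boot the journal knows about, with first/last timestamps:
journalctl --list-boots

# Kernel messages from the tail end of the previous boot
# (often empty after a hard reset, which is itself a clue):
journalctl -k -b -1 | tail -50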

I also set it to stay powered off after AC power loss. That should also help identify what just happened.

On top of this, I've removed the NIC and the 16e SAS controller, since the primary data pool only needs the 5 x 24i cards to operate.

---

Are these console errors related to my issue? I just noticed them after doing this UEFI update:

[screenshots of the console errors]
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Update again.

After removing the NIC and 16e card, I upgraded the UEFI but ran into issues, possibly from a config change relating to which drives were SATA vs NVMe. Either way, it wouldn't boot into the UEFI menu anymore, so I flashed it again (which wipes the config), and now it's booting.

First thing I did was import my main zpool and start a scrub. I had it going for an hour before I said "lemme copy over some files now", and it died as soon as I started copying; this time over the built-in 10Gb NIC.

I finally got some sort of health report this time:
[screenshot of the health report]


Is it the PSUs then?

Why does this only happen when I copy files over the network? My PC runs a backup every 12 hours and when idle, so I wonder if that's also triggering it.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Swapped the PSUs with those in another server and got a bunch of error messages (time is now 1 hour ahead):

[screenshot of the error messages]


This proves a few things:
  1. The system can run on 1 PSU even under load (`zpool scrub`).
  2. It logs error messages if one PSU gets disconnected. (If both get disconnected, that's a different situation.)
I'm still not sure if this is a software or hardware issue. I would assume hardware, but it looks like something did a force-reset, as if it physically shorted the reset-button pin on my motherboard, yet only when certain things happen in software. That's the weird part.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I'm getting closer to the actual issue.

Even after swapping the PSUs, copying files over SMB killed it again. It was running fine with just the scrub though.

Since the scrub works, that means reads work. But a scrub doesn't test writes at all!

I tried a `fio` benchmark to see what happens and guess what? It rebooted again! Now we're seeing a pattern.
Code:
fio --ioengine=libaio --filename=/mnt/Bunnies/performanceTest --direct=1 --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=20G --time_based --name=fio
rm /mnt/Bunnies/performanceTest


Any idea why only writes, not reads, are causing an entire system meltdown?
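
Since a hard reset tends to wipe out whatever would have been logged, one way to catch the last kernel messages is to watch the console from a second machine while triggering a write. A rough sketch, assuming the board has an IPMI BMC (the address and credentials below are placeholders):
Code:
# From another machine: attach to the NAS console over Serial-over-LAN
ipmitool -I lanplus -H <bmc-ip> -U <user> -P '<password>' sol activate

# In a separate SSH session on the NAS, stream kernel messages while the
# write test runs (a panic may still only land on the physical/SOL console):
dmesg --follow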
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
Can you try the fio test on the boot pool and on "/tmp" to check whether every write causes a system reboot, or only writes to your pool?
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
What drive is `/tmp`?

I made the script a bit gentler this time:
Code:
# /tmp
fio --ioengine=libaio --filename=/tmp/performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm /tmp/performanceTest
   READ: bw=5843MiB/s (6126MB/s), 5843MiB/s-5843MiB/s (6126MB/s-6126MB/s), io=57.1GiB (61.3GB), run=10001-10001msec
  WRITE: bw=6055MiB/s (6350MB/s), 6055MiB/s-6055MiB/s (6350MB/s-6350MB/s), io=59.1GiB (63.5GB), run=10001-10001msec

# boot-pool
fio --ioengine=libaio --filename=./performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm ./performanceTest
   READ: bw=6664MiB/s (6988MB/s), 6664MiB/s-6664MiB/s (6988MB/s-6988MB/s), io=65.1GiB (69.9GB), run=10002-10002msec
  WRITE: bw=6839MiB/s (7171MB/s), 6839MiB/s-6839MiB/s (7171MB/s-7171MB/s), io=66.8GiB (71.7GB), run=10002-10002msec

# TrueNAS-Apps
fio --ioengine=libaio --filename=/mnt/TrueNAS-Apps/performanceTest --sync=0 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm /mnt/TrueNAS-Apps/performanceTest
   READ: bw=6134MiB/s (6432MB/s), 6134MiB/s-6134MiB/s (6432MB/s-6432MB/s), io=59.9GiB (64.3GB), run=10001-10001msec
  WRITE: bw=6316MiB/s (6623MB/s), 6316MiB/s-6316MiB/s (6623MB/s-6623MB/s), io=61.7GiB (66.2GB), run=10001-10001msec

Looks like it's hitting memory, but that's fine, right? It should still be writing to the pool eventually, right?
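
Worth noting: without `--direct=1` these runs mostly measure the page cache, and `/tmp` is normally a RAM-backed tmpfs anyway, which would explain the ~6GB/s everywhere. A hedged variant that forces the data out to the pool before fio reports success:
Code:
# Confirm what /tmp actually is (tmpfs = RAM):
findmnt /tmp

# Same test, but flush everything to the device at the end so the writes
# really reach the pool:
fio --ioengine=libaio --filename=/mnt/TrueNAS-Apps/performanceTest --end_fsync=1 --rw=readwrite --bs=16M --runtime=10 --size=1G --time_based --name=fio
rm /mnt/TrueNAS-Apps/performanceTest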

Tried doing this on the Bunnies pool (all SSDs) and this is what happened right before it died:

[screenshot of the console output right before the crash]
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
It's literally any write to this pool. `rm` didn't cause problems, but `touch` did:

[screenshot]
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
This is someone else who had the same, or a very similar, issue to mine:
And started to write/read from it, that creates the panic, but if the pool is mounted in read-only mode there is no panic/reboot

I think this is the issue I'm running into, but why would it randomly start a few days ago?
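
If that report is the same failure mode, a read-only import would be one way to confirm it without risking another reset. A rough sketch from the shell (the TrueNAS middleware normally manages imports, so this is just for testing):
Code:
zpool export Bunnies
zpool import -o readonly=on Bunnies
# Reads and scrubs should work; any write attempt should now fail with a
# read-only error instead of (per that report) panicking the box.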
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I got a few ZFS error emails from TrueNAS:
ZFS has detected that a device was removed.

impact: Fault tolerance of the pool may be compromised.
eid: 11
class: statechange
state: UNAVAIL
host: storeman
time: 2023-12-06 01:29:52-0600
vpath: /dev/sdak2
vguid: 0x83BF13979724B965
pool: Bunnies (0x3B56F42B4AAFB1A2)
I wish it would send me serial numbers rather than GUIDs. I don't know which GUID goes to which drive, nor what drive `/dev/sdak` was at the time, and I can't verify it anymore because the drive letters change every reboot.
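
For what it's worth, the GUID-to-drive mapping can be pulled while the pool is imported; a sketch (note the email prints the GUID in hex while `zpool status -g` prints decimal, so the values may need converting):
Code:
# Show vdev GUIDs instead of device names:
zpool status -g Bunnies

# Map the current sdXX names to serial numbers in one go:
lsblk -o NAME,MODEL,SERIAL,SIZE

# Or for a single suspect device (if it still exists under that letter):
smartctl -i /dev/sdak | grep -i serial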

I got this email 4 times (for 4 separate drives), which suggests one of the SAS controller cables had issues as I was moving things around.

It's also strange that the errors came in at 12:09a, 12:57a, 1:01a, and 1:30a. Those are very random times for these drives to start failing. Did these four drives cause the system shutdowns I experienced earlier, or are they just the last drives written to before the system gives up and forces a reboot?

EDIT: There were more of these errors; 6 total so far as I skim my emails, all on different device names. But since drive letters change after a reboot, it could still have been the same physical drive each time.
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
Perhaps SAS cables or an overheating HBA.
Do you close the case after each change, or do you test with the case open? With so many HBAs in the system, if the case is open the airflow can't reach them and they can overheat.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Perhaps SAS cables or an overheating HBA.
Do you close the case after each change, or do you test with the case open? With so many HBAs in the system, if the case is open the airflow can't reach them and they can overheat.
Yeah, I close it back up to reduce heat.

This time when I started my NAS, a bunch of drives weren't showing up. I found 7 of them failed, removed, or unavailable, all in the same group; possibly on the same SAS controller.
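
One way to check whether those 7 drives really share a controller is the PCI path each block device hangs off; a sketch (the PCI address below is just an example, and `lsscsi` may or may not be installed):
Code:
# The sysfs symlink shows the full PCI path; the controller is the
# PCI function in the middle, e.g. .../0000:41:00.0/host10/...
ls -l /sys/block/sdak

# Or list every disk with its SCSI host / SAS address to group them by HBA:
lsscsi -t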
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I pulled 2 of the 24i SAS cards and re-added the 16e card. Then I hooked the remaining 3 x 24i SAS cards back up.

This is a start:

[screenshot]

I find it distressing that both drives in a mirror are resilvering together:

[screenshot of the mirror resilvering]


None of these drives needed resilvering until something happened during the previous startup. It was a clean startup too, from when I shut down the system myself to check something.

I'm going to assume this was related to a SAS controller, but even after 10-15 minutes, the system still rebooted.

Something is still triggering it. It could be that multiple SAS controllers are bad, but I have no way of knowing. Any ideas?

I'm really upset this was working for months and suddenly decided to throw a fit and start rebooting itself.

Without my touching anything, the system is currently resilvering, hopefully for the next 6 hours; I'm at 48 minutes and counting. It usually reboots again around the one-hour mark, and I think that's because I had hourly snapshots. Those are disabled now though.

Once I can get this pool back to a working state, I can go back to finding out why writing to this pool causes issues.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
After the 12 hours of resilvering and the scrub completed, there were no errors.

My next plan is to use 5 SAS expanders and 1 SAS card to track down which is/are bad.

36 of my SSDs are unused right now, and I can use them to set up a new pool (after exporting my main one).

Should be pretty quick to figure out which of the 5 SAS controllers is problematic at that point, provided my SAS expanders are fine.

The rebooting on write is still suspect.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
I created a new pool and have written to it successfully twice now in different zpool configurations.

There was one thing I did before this issue began that I'd forgotten:
I rearranged all the drives so there were no gaps between the ones in my main SSD pool and the ones that were currently spare.

That tells me it's possible certain ports on certain SAS controllers are problematic. Also, some of my mirrors are more filled up than others, so I bet close to zero bytes get written to those, meaning I can pinpoint which drives are more active than others and potentially problematic.
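
One way to see that in practice is per-vdev I/O stats while a test write runs; mirrors that are nearly full should show little or no write traffic:
Code:
# Per-vdev read/write ops and bandwidth, refreshed every 5 seconds:
zpool iostat -v Bunnies 5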

After I redid all the +5V power and got all 125 drives in here, I rearranged them again, so it makes sense that the SAS controllers could be the issue. It's possible I completely avoided one during this whole transition because of drive positions and how filled up each drive in the zpool was.

But... the reset only happens when writing, and it really seems like an OS issue, as if ZFS hit an error and panicked.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
Here's where I'm at now. STILL completely confused:
  1. My main zpool (Bunnies) causes reboots only when writing to it. It would even reboot if I `touch`'d a new file, but now it takes more than that such as a 20GiB `fio` test.
  2. I can create a new SSD zpool, and it works fine with both reads and writes. I even moved around 4 SAS ports and was still successful.
  3. My HDD pool in another chassis (connected via 3 ports on the 16e) uses 2 SSDs for metadata, and writes to it work fine.
My main zpool has 4 x Intel Optane NVMe drives for metadata. Writes should hit those first, and then the SSDs. And since I was potentially having PCIe issues, I wonder if those drives are related. Maybe one of them went bad?

I tried to run SMART tests on them, but TrueNAS says SMART tests can't run on these drives.

I want to write some sort of data to these drives to see if I can force a system reboot. TrueNAS should have made 2 partitions, so I should be able to format and write to one to test it, right? Or is there another way to test?
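
Even if the GUI won't schedule SMART self-tests on NVMe, the drives' health and error logs should still be readable from the shell; a sketch (assuming `nvme-cli` is available, which it generally is on SCALE):
Code:
smartctl -x /dev/nvme0          # full SMART/health dump for the first Optane
nvme smart-log /dev/nvme0       # media errors, temperature, percentage used
nvme error-log /dev/nvme0       # controller error-log entries, if any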

Another thing I want to do is test each SAS controller card one by one, but so far the issue only seems to occur when writing to my main zpool, so I really have no way of pinning down the actual cause.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
After searching around, I found an unanswered Stack Overflow question where someone was asked to run:
Code:
journalctl | egrep 'kernel.*nvme'

Look what I found:
Code:
Dec 06 02:59:51 storeman kernel: nvme1n1: detected capacity change from 1875385008 to 0

I wonder if it's related.
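
A capacity dropping to 0 usually means the device fell off the bus and reattached, so the PCIe link for that slot may be worth a look; a sketch (nvme1 sits at 0000:03:00.0 per the log below):
Code:
# Link capability vs. current link status for the nvme1 slot:
lspci -vvs 03:00.0 | grep -iE 'lnkcap|lnksta'

# Any PCIe/AER error messages the kernel logged around that time:
journalctl -k | grep -iE 'aer|pcieport'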

Here's the rest of that log:
Code:
# journalctl | egrep 'kernel.*nvme'
Dec 06 02:56:09 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 02:56:09 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 02:56:09 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 02:56:09 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 02:56:09 storeman kernel:  nvme1n1: p1
Dec 06 02:56:09 storeman kernel:  nvme2n1: p1
Dec 06 02:56:09 storeman kernel:  nvme3n1: p1
Dec 06 02:56:09 storeman kernel:  nvme0n1: p1
Dec 06 02:59:51 storeman kernel: nvme1n1: detected capacity change from 1875385008 to 0
Dec 06 02:59:56 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 02:59:56 storeman kernel: nvme 0000:03:00.0: enabling device (0000 -> 0002)
Dec 06 02:59:56 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 02:59:56 storeman kernel:  nvme1n1: p1
Dec 06 03:18:56 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:18:56 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:18:56 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 03:18:56 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 03:18:56 storeman kernel:  nvme1n1: p1
Dec 06 03:18:56 storeman kernel:  nvme2n1: p1
Dec 06 03:18:56 storeman kernel:  nvme0n1: p1
Dec 06 03:18:56 storeman kernel:  nvme3n1: p1
Dec 06 03:29:03 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 03:29:03 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 03:29:03 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 03:29:03 storeman kernel:  nvme2n1: p1
Dec 06 03:29:03 storeman kernel:  nvme3n1: p1
Dec 06 03:29:03 storeman kernel:  nvme0n1: p1
Dec 06 03:29:03 storeman kernel:  nvme1n1: p1
Dec 06 04:05:58 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 04:05:58 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 04:05:58 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 04:05:58 storeman kernel:  nvme2n1: p1
Dec 06 04:05:58 storeman kernel:  nvme3n1: p1
Dec 06 04:05:58 storeman kernel:  nvme0n1: p1
Dec 06 04:05:58 storeman kernel:  nvme1n1: p1
Dec 06 15:02:31 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 15:02:32 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 15:02:32 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 15:02:32 storeman kernel:  nvme1n1: p1
Dec 06 15:02:32 storeman kernel:  nvme3n1: p1
Dec 06 15:02:32 storeman kernel:  nvme0n1: p1
Dec 06 15:02:32 storeman kernel:  nvme2n1: p1
Dec 06 15:48:25 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 15:48:25 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 15:48:25 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme1n1: p1
Dec 06 15:48:25 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme2n1: p1
Dec 06 15:48:25 storeman kernel:  nvme0n1: p1
Dec 06 15:48:25 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 15:48:25 storeman kernel:  nvme3n1: p1
Dec 06 16:47:31 storeman kernel: Command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 16:47:31 storeman kernel: Kernel command line: BOOT_IMAGE=/ROOT/23.10.0.1@/boot/vmlinuz-6.1.55-production+truenas root=ZFS=boot-pool/ROOT/23.10.0.1 ro libata.allow_tpm=1 amd_iommu=on iommu=pt kvm_amd.npt=1 kvm_amd.avic=1 intel_iommu=on zfsforce=1 nvme_core.multipath=N
Dec 06 16:47:31 storeman kernel: nvme nvme0: pci function 0000:02:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme1: pci function 0000:03:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme2: pci function 0000:04:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme3: pci function 0000:05:00.0
Dec 06 16:47:31 storeman kernel: nvme nvme2: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme1: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme0: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel: nvme nvme3: 31/0/0 default/read/poll queues
Dec 06 16:47:31 storeman kernel:  nvme3n1: p1
Dec 06 16:47:31 storeman kernel:  nvme0n1: p1
Dec 06 16:47:31 storeman kernel:  nvme2n1: p1
Dec 06 16:47:31 storeman kernel:  nvme1n1: p1

Not sure if anything is actually wrong with it, because there's clearly a size reported here:
[screenshot showing the drive's reported capacity]
 

Etorix

Wizard
Joined
Dec 30, 2020
Messages
2,134
An Optane metadata vdev with an SSD pool? That's borderline crazy.

You have not fully described your system and how everything is wired and powered. From the symptoms, it could be that there's enough power for reading but not for writing.
 

Sawtaytoes

Patron
Joined
Jul 9, 2022
Messages
221
This was enough to cause it to reboot:

[screenshot]



A few seconds after performing this task is when it rebooted. Is there something telling it, "after writing a file, go ahead and force-reboot the system"?
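
If it is ZFS (or anything else in the kernel) panicking on that write, whether the box sits at a panic message or silently reboots depends on a couple of settings; a quick sketch to check them:
Code:
# Non-zero values here make the kernel auto-reboot on a panic/oops instead
# of hanging with a message on the console:
sysctl kernel.panic kernel.panic_on_oops

# The pool's failmode property; "panic" would turn a fatal pool I/O error
# into a kernel panic (the default is "wait"):
zpool get failmode Bunnies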

An Optane metadata vdev with an SSD pool? That's borderline crazy.

You have not fully described your system and how everything is wired and powered. From the symptoms, it could be that there's enough power for reading but not for writing.
Hardware Specs
  • Chassis: 45Drives Storinator XL60.
  • Motherboard: Supermicro H12SSL-NT.
  • CPU: AMD EPYC 7313P (16-core).
  • RAM: 256GB -> 8 sticks of 32GB DDR4 3200.
  • PCIe1: ConnectX-6 -> PCIe 4.0 x4 -> 2-port 25Gb SFP28.
  • PCIe2: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
  • PCIe3: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
  • PCIe4: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
  • PCIe5: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
  • PCIe6: LSI 9305 24i -> PCIe 3.0 x8 (all 6 ports plugged into direct-attach backplanes with 24 SSDs).
  • PCIe7: LSI 9305 16e -> PCIe 3.0 x8 (3 ports plugged into SAS expanders in another Storinator XL60 chassis with 60 HDDs).
  • NVMe1: 960GB Intel Optane 905p -> PCIe 3.0 x4
  • NVMe2: 960GB Intel Optane 905p -> PCIe 3.0 x4
  • SlimSAS1: SlimSAS to 2 x U.2 -> 2 x 960GB Intel Optane 905p -> PCIe 3.0 x4
  • SlimSAS2: SlimSAS to 2 x miniSAS HD (only 6 SSDs connected, but both cables plugged in)
There are no free PCIe slots in this system, although if I wanted to use bifurcation, there are an additional 20 lanes available on the x16 slots.
zpool Specs

My main Bunnies zpool is going to change to a multi-vdev dRAID configuration once I figure out what's wrong. When I copied snapshots to my HDD pool, some were missed, so I need to copy those again.
  1. boot-pool -> 2 x 60GB Corsair Force SSDs
  2. TrueNAS-Apps -> 2 x 2TB Crucial MX500 SSDs
  3. Bunnies
    -> 80 x 2TB and 4TB Crucial MX500 SSDs and 4 x 960GB Intel Optane 905p
    -> This is where I store my files.
    -> I back up this pool to Wolves and also to another zpool on an offsite NAS.
  4. Wolves
    -> 60 x 10TB HGST Helium HDDs and 2 x 2TB Crucial MX500 SSDs as metadata.
    -> All data in this pool is a backup of Bunnies and another pool on my offsite NAS.
Code:
# zpool status -vL
  pool: Bunnies
 state: ONLINE
  scan: scrub repaired 0B in 08:58:36 with 0 errors on Wed Dec  6 14:05:11 2023
remove: Removal of vdev 1 copied 1.79T in 1h37m, completed on Tue Oct 31 07:02:38 2023
        955M memory used for removed device mappings
config:

        NAME           STATE     READ WRITE CKSUM
        Bunnies        ONLINE       0     0     0
          mirror-0     ONLINE       0     0     0
            sdch2      ONLINE       0     0     0
            sdcp2      ONLINE       0     0     0
          mirror-2     ONLINE       0     0     0
            sdfc2      ONLINE       0     0     0
            sdw2       ONLINE       0     0     0
          mirror-3     ONLINE       0     0     0
            sdg2       ONLINE       0     0     0
            sdi2       ONLINE       0     0     0
          mirror-4     ONLINE       0     0     0
            sdfj2      ONLINE       0     0     0
            sde2       ONLINE       0     0     0
          mirror-5     ONLINE       0     0     0
            sdfo2      ONLINE       0     0     0
            sdaa2      ONLINE       0     0     0
          mirror-8     ONLINE       0     0     0
            sdae2      ONLINE       0     0     0
            sdff2      ONLINE       0     0     0
          mirror-11    ONLINE       0     0     0
            sdfi2      ONLINE       0     0     0
            sdz2       ONLINE       0     0     0
          mirror-15    ONLINE       0     0     0
            sdfp2      ONLINE       0     0     0
            sdfe2      ONLINE       0     0     0
          mirror-16    ONLINE       0     0     0
            sdh2       ONLINE       0     0     0
            sdfr2      ONLINE       0     0     0
          mirror-17    ONLINE       0     0     0
            sdv2       ONLINE       0     0     0
            sdfg2      ONLINE       0     0     0
          mirror-18    ONLINE       0     0     0
            sdfh2      ONLINE       0     0     0
            sdfd2      ONLINE       0     0     0
          mirror-20    ONLINE       0     0     0
            sdf2       ONLINE       0     0     0
            sdfq2      ONLINE       0     0     0
          mirror-24    ONLINE       0     0     0
            sdk2       ONLINE       0     0     0
            sdcq2      ONLINE       0     0     0
          mirror-26    ONLINE       0     0     0
            sdad2      ONLINE       0     0     0
            sdd2       ONLINE       0     0     0
          mirror-27    ONLINE       0     0     0
            sdco2      ONLINE       0     0     0
            sdca2      ONLINE       0     0     0
          mirror-28    ONLINE       0     0     0
            sdcn2      ONLINE       0     0     0
            sdal2      ONLINE       0     0     0
          mirror-29    ONLINE       0     0     0
            sdce2      ONLINE       0     0     0
            sdn2       ONLINE       0     0     0
          mirror-31    ONLINE       0     0     0
            sdeh2      ONLINE       0     0     0
            sdan2      ONLINE       0     0     0
          mirror-32    ONLINE       0     0     0
            sdcg2      ONLINE       0     0     0
            sda2       ONLINE       0     0     0
          mirror-33    ONLINE       0     0     0
            sdc2       ONLINE       0     0     0
            sdb2       ONLINE       0     0     0
          mirror-34    ONLINE       0     0     0
            sdcb2      ONLINE       0     0     0
            sdp2       ONLINE       0     0     0
          mirror-35    ONLINE       0     0     0
            sdcf2      ONLINE       0     0     0
            sdq2       ONLINE       0     0     0
          mirror-36    ONLINE       0     0     0
            sdfb2      ONLINE       0     0     0
            sdao2      ONLINE       0     0     0
          mirror-37    ONLINE       0     0     0
            sdai2      ONLINE       0     0     0
            sdcv2      ONLINE       0     0     0
          mirror-38    ONLINE       0     0     0
            sdcl2      ONLINE       0     0     0
            sdaq2      ONLINE       0     0     0
          mirror-39    ONLINE       0     0     0
            sdap2      ONLINE       0     0     0
            sdfk2      ONLINE       0     0     0
          mirror-40    ONLINE       0     0     0
            sdfn2      ONLINE       0     0     0
            sdfl2      ONLINE       0     0     0
          mirror-41    ONLINE       0     0     0
            sdfm2      ONLINE       0     0     0
            sdcj2      ONLINE       0     0     0
          mirror-42    ONLINE       0     0     0
            sdcm2      ONLINE       0     0     0
            sdck2      ONLINE       0     0     0
          mirror-43    ONLINE       0     0     0
            sdj2       ONLINE       0     0     0
            sdah2      ONLINE       0     0     0
          mirror-44    ONLINE       0     0     0
            sdde2      ONLINE       0     0     0
            sdcu2      ONLINE       0     0     0
          mirror-45    ONLINE       0     0     0
            sdcs2      ONLINE       0     0     0
            sdeo2      ONLINE       0     0     0
          mirror-46    ONLINE       0     0     0
            sdaj2      ONLINE       0     0     0
            sdu2       ONLINE       0     0     0
          mirror-47    ONLINE       0     0     0
            sdt2       ONLINE       0     0     0
            sdac2      ONLINE       0     0     0
          mirror-48    ONLINE       0     0     0
            sdaf2      ONLINE       0     0     0
            sdcd2      ONLINE       0     0     0
          mirror-49    ONLINE       0     0     0
            sdci2      ONLINE       0     0     0
            sdak2      ONLINE       0     0     0
          mirror-50    ONLINE       0     0     0
            sdam2      ONLINE       0     0     0
            sdcc2      ONLINE       0     0     0
          mirror-51    ONLINE       0     0     0
            sdo2       ONLINE       0     0     0
            sdep2      ONLINE       0     0     0
          mirror-52    ONLINE       0     0     0
            sdr2       ONLINE       0     0     0
            sdag2      ONLINE       0     0     0
          mirror-53    ONLINE       0     0     0
            sdl2       ONLINE       0     0     0
            sdm2       ONLINE       0     0     0
        special
          mirror-13    ONLINE       0     0     0
            nvme1n1p1  ONLINE       0     0     0
            nvme0n1p1  ONLINE       0     0     0
          mirror-14    ONLINE       0     0     0
            nvme3n1p1  ONLINE       0     0     0
            nvme2n1p1  ONLINE       0     0     0
        spares
          sds2         AVAIL 

errors: No known data errors

  pool: TrueNAS-Apps
 state: ONLINE
  scan: resilvered 232K in 00:00:00 with 0 errors on Wed Dec  6 03:07:23 2023
config:

        NAME          STATE     READ WRITE CKSUM
        TrueNAS-Apps  ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            sdfw1     ONLINE       0     0     0
            sdfx2     ONLINE       0     0     0

errors: No known data errors

  pool: Wolves
 state: ONLINE
  scan: scrub repaired 0B in 05:11:50 with 0 errors on Thu Nov 23 10:07:44 2023
config:

        NAME                  STATE     READ WRITE CKSUM
        Wolves                ONLINE       0     0     0
          draid2:5d:15c:1s-0  ONLINE       0     0     0
            sdbt2             ONLINE       0     0     0
            sdbs2             ONLINE       0     0     0
            sdbu2             ONLINE       0     0     0
            sddv2             ONLINE       0     0     0
            sded2             ONLINE       0     0     0
            sdas2             ONLINE       0     0     0
            sddp2             ONLINE       0     0     0
            sdbg2             ONLINE       0     0     0
            sddn2             ONLINE       0     0     0
            sdbl2             ONLINE       0     0     0
            sdbm2             ONLINE       0     0     0
            sdbn2             ONLINE       0     0     0
            sdbi2             ONLINE       0     0     0
            sdbj2             ONLINE       0     0     0
            sdaw2             ONLINE       0     0     0
          draid2:5d:15c:1s-1  ONLINE       0     0     0
            sddx2             ONLINE       0     0     0
            sdax2             ONLINE       0     0     0
            sdda2             ONLINE       0     0     0
            sday2             ONLINE       0     0     0
            sdby2             ONLINE       0     0     0
            sdbv2             ONLINE       0     0     0
            sdbz2             ONLINE       0     0     0
            sdds2             ONLINE       0     0     0
            sdbw2             ONLINE       0     0     0
            sdbx2             ONLINE       0     0     0
            sddt2             ONLINE       0     0     0
            sddq2             ONLINE       0     0     0
            sdar2             ONLINE       0     0     0
            sdbc2             ONLINE       0     0     0
            sddo2             ONLINE       0     0     0
          draid2:5d:15c:1s-2  ONLINE       0     0     0
            sdaz2             ONLINE       0     0     0
            sdcx2             ONLINE       0     0     0
            sdba2             ONLINE       0     0     0
            sdbb2             ONLINE       0     0     0
            sddr2             ONLINE       0     0     0
            sdbd2             ONLINE       0     0     0
            sdbf2             ONLINE       0     0     0
            sdbe2             ONLINE       0     0     0
            sdcr2             ONLINE       0     0     0
            sdbk2             ONLINE       0     0     0
            sdbp2             ONLINE       0     0     0
            sdbh2             ONLINE       0     0     0
            sdea2             ONLINE       0     0     0
            sddz2             ONLINE       0     0     0
            sdee2             ONLINE       0     0     0
          draid2:5d:15c:1s-3  ONLINE       0     0     0
            sdeg2             ONLINE       0     0     0
            sdct2             ONLINE       0     0     0
            sdef2             ONLINE       0     0     0
            sdeq2             ONLINE       0     0     0
            sddu2             ONLINE       0     0     0
            sdei2             ONLINE       0     0     0
            sdat2             ONLINE       0     0     0
            sdec2             ONLINE       0     0     0
            sdau2             ONLINE       0     0     0
            sdeb2             ONLINE       0     0     0
            sdav2             ONLINE       0     0     0
            sddw2             ONLINE       0     0     0
            sdbo2             ONLINE       0     0     0
            sdbq2             ONLINE       0     0     0
            sdbr2             ONLINE       0     0     0
        special
          mirror-4            ONLINE       0     0     0
            sdab              ONLINE       0     0     0
            sdfs              ONLINE       0     0     0
        spares
          draid2-0-0          AVAIL 
          draid2-1-0          AVAIL 
          draid2-2-0          AVAIL 
          draid2-3-0          AVAIL 

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:15 with 0 errors on Sat Dec  2 03:45:17 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            sdft3   ONLINE       0     0     0
            sdfu3   ONLINE       0     0     0

errors: No known data errors

Summary of the reboot issue

Bunnies is the one that has the force-reboot issue when writing to it. This began happening 3, maybe 4 days ago now?

I thought it was a power issue

2 days before, I had already removed 15 SSDs that I wasn't currently using. After it started happening, I thought it was a power issue, so I removed 48 SSDs (of 85), but I still had the problem.

I already planned to add 128 SSD bays in here for my 123 SSDs, so I completely redid the power wiring and took every one of the 128 SSD slots off the +5V rail on the PSUs. Also note, I swapped the PSUs in this server with the ones in the other Storinator XL60 chassis, again thinking power was the issue.

I thought it was a heat issue

At some point, I noticed heat issues because I shifted all the fans to Noctua NF-12s during this 128 SSD bay transition. I put them all back to stock, and the heat issues went away. Still, I removed the ConnectX-6 card, and the reboots became less frequent.

I only recently found out why:
The real problem occurs when writing to Bunnies.

Because my PCs back up their data every 12 hours and when idle, whenever I stepped away from one, it would start writing data to Bunnies, forcing the reboot situation. When I removed the ConnectX-6 and switched to the onboard NICs, the DNS hostname didn't match, so I could only access my NAS by IP. Because of this, my Windows boxes stopped backing up, which lengthened the time between reboots to whenever my snapshots ran.

It was the writes

Since the reboots started occurring about every hour, and I take hourly snapshots, it was pretty clear the snapshot tasks were the writes triggering the forced reboots.

After disabling all snapshots and backup tasks, and with Windows no longer able to access the NAS, the system stayed up for over 10 hours doing a `zpool scrub`. I'm now 100% certain that writes are the issue, and only writes to the Bunnies pool. I'm not yet certain whether the issue is physical (NVMe or SATA SSDs) or TrueNAS itself.

How to figure out what's wrong?

Is there a zdb way I can check this out? If it's just a corrupt pool, I already planned to nix it and convert it to dRAID, but I first need to move off some more recent snapshots. That requires manually running `zfs send` to Wolves.
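
As for zdb, one common read-only check is a full metadata/block traversal; a sketch (zdb reads the pool without the kernel's locking, so on a live imported pool the results can be noisy, and this will take a long time on a pool this size):
Code:
# Traverse the pool, verify metadata checksums, skip leak detection (-L):
zdb -bcsvL Bunnies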
 

nKk

Dabbler
Joined
Jan 8, 2018
Messages
42
To check whether TrueNAS itself has an issue, you can try booting another Linux distro that uses the same or a newer version of ZFS, importing the Bunnies pool, and testing writes.
If the resets continue, TrueNAS isn't the problem, but it could still be a software issue - ZFS or something else.
Or you could try TrueNAS CORE, if CORE's ZFS can import a pool from SCALE, to check whether anything different shows up in the logs.
 