drive failures across multiple pools

dealy663

Dabbler
Joined
Dec 4, 2021
Messages
32
Hi

I've been having some instability problems on my TrueNAS SCALE system. It was crashing pretty infrequently (2-3 times over the past year, up until last week). I updated some hardware (added a GPU) last Friday. Then on Saturday morning it crashed again, but this time when it came back up one of the drives had failed: a single drive, 1 year old, 7200 RPM, 12 TB. So I thought, aha, the drive was flaky and causing some system instability. I removed the drive, proceeded with the GPU install, and got my system back up and running again yesterday. During the process there were many reboots as I was trying to figure out passthrough. Then today I saw a message in the logs indicating that another pool was degraded and one of the mirrored drives was offline. I shut down the machine, checked all the connections, and rebooted, and then that mirrored drive reported that it was OK. A bit later a different mirrored pool reported that it was degraded with one drive offline, while the first pool was reporting 0 errors.

In my frustration I shut down the machine and shook my head, and now I'm sitting here asking if anyone has seen this type of behavior before. How can different drives in different pools develop problems like this? I'm pretty sure the first single-drive failure was real. But the pool failures are hard to explain or understand. I'm running 32 GB of ECC RAM and host 3 apps (TrueCharts, Pi-hole, and Plex). I have 2 VMs: one is the new one with the GPU; the other doesn't put much strain on the system.

I'm running Bluefin, which I upgraded to last weekend. The system crashes started before that, but the first real, noticeable drive failure came after the upgrade to Bluefin.

Any suggestions would be appreciated.

Thanks, Derek
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
I suggest..........
That you post your hardware as per forum rules - we need context
 

dealy663

Dabbler
Joined
Dec 4, 2021
Messages
32
Thanks for pointing out the hardware rules. I never noticed that link before. My sig should show the details now.

I restarted the machine this morning and disconnected the drive that I believe has the actual failure. So far the system is showing all drives and pools as healthy. This is very perplexing to me. Yesterday I was seeing drive/pool errors even when I had the bad drive disconnected, so that isn't the explanation.
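For anyone double-checking a suspect disk, smartctl from the shell shows whether the drive itself is logging errors (as opposed to a cable or controller hiccup). A rough sketch, with /dev/sdX standing in for the real device:

sudo smartctl -a /dev/sdX        # health summary, attributes, and error log
sudo smartctl -t long /dev/sdX   # start a long self-test; re-run -a later for the result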

My most important data was luckily backed up offsite, but I'm still feeling kind of uneasy about the rest of what is on all of these disks. Interestingly, none of the SSDs reported any failures. None of the hardware in this computer is more than 1.5 years old.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
zpool status in code blocks please

Also can you explain your zpool (and vdev) setup please
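If it helps, both of those can be pulled straight from the TrueNAS shell; zpool status reports pool health and zpool list -v breaks each pool out into its vdevs and member disks:

sudo zpool status -v    # pool/vdev health, error counters, and any files with errors
sudo zpool list -v      # capacity and layout per pool, vdev, and disk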
 

dealy663

Dabbler
Joined
Dec 4, 2021
Messages
32
There are 3 standalone pools
  • Andromeda-Pool (HDD)
    • temp-slow dataset
    • timemachine dataset
  • fast-pool (NVME)
    • 4 small datasets and 1 zvol
  • med-pool (SSD)
    • medio DS
    • test-med DS
I have 2 mirrored pools

  • Aphrodite-MPool
    • 2 12TB HDDs mirror
      • FamiliyPhotos-DS
      • PhotoRaid-DS
  • StripeHD1
    • 2 4TB HDDs
      • Datasets
        • iso images
        • ix applications
        • media
        • rekpvt - unused, deletable
        • rektvp - unused, deletable
        • roytestds - unused
        • ubuntucairo-ds
      • ZVols
        • memphis-disk - hosts latest VM
        • tanis-bbmh5n - unused, to be deleted


I'm not sure what you mean by code blocks, but here is the output:

derek@TrueNAS ~ % sudo zpool status
  pool: Andromeda-Pool
 state: ONLINE
  scan: scrub repaired 0B in 05:34:08 with 0 errors on Sun May 28 05:34:10 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        Andromeda-Pool                            ONLINE       0     0     0
          6f0372c2-b11a-467c-b954-7b1e09ac6c39    ONLINE       0     0     0

errors: No known data errors

  pool: Aphrodite-MPool
 state: ONLINE
  scan: resilvered 1.69M in 00:00:01 with 0 errors on Tue Jun 13 08:14:38 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        Aphrodite-MPool                             ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            11687310-8ac2-4d6e-8b68-eedaea70e11c    ONLINE       0     0     0
            43f6d3e4-40cf-49c2-88b5-3fbd74d3e4de    ONLINE       0     0     0

errors: No known data errors

  pool: StripeHD1
 state: ONLINE
  scan: resilvered 164M in 00:00:14 with 0 errors on Mon Jun 12 22:46:51 2023
config:

        NAME                                        STATE     READ WRITE CKSUM
        StripeHD1                                   ONLINE       0     0     0
          mirror-0                                  ONLINE       0     0     0
            f03b03fe-25f6-44be-9d73-22b0018b0016    ONLINE       0     0     0
            c3614f66-99b6-4a8e-a237-b6496f10ce21    ONLINE       0     0     0

errors: No known data errors

  pool: boot-pool
 state: ONLINE
status: Some supported and requested features are not enabled on the pool.
        The pool can still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(7) for details.
  scan: scrub repaired 0B in 00:00:44 with 0 errors on Mon Jun 12 03:45:45 2023
config:

        NAME         STATE     READ WRITE CKSUM
        boot-pool    ONLINE       0     0     0
          sde3       ONLINE       0     0     0

errors: No known data errors

  pool: fast-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:02:12 with 0 errors on Sun May 14 00:02:14 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        fast-pool                                 ONLINE       0     0     0
          e957685d-e4e3-42f4-a06e-ddffe3d5071f    ONLINE       0     0     0

errors: No known data errors

  pool: med-pool-ssd
 state: ONLINE
  scan: scrub repaired 0B in 00:00:38 with 0 errors on Sun May 21 00:00:39 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        med-pool-ssd                              ONLINE       0     0     0
          eb2af596-53b8-4725-a5fa-59a33da907d6    ONLINE       0     0     0

errors: No known data errors
 

Whattteva

Wizard
Joined
Mar 5, 2013
Messages
1,824
In my frustration I shut down the machine and shook my head, and now I'm sitting here asking if anyone has seen this type of behavior before. How can different drives in different pools develop problems like this? I'm pretty sure the first single-drive failure was real. But the pool failures are hard to explain or understand.
Never seen this specific type of behavior, but I've seen random stability problems quite often in the last 20 years of building PCs (mostly prosumer gear). In my opinion, this looks a bit like that: you have some latent stability problem that is finally catching up to you and happening more frequently. It's quite unfortunate, as your gear looks like it's at most only a couple of years old.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Since this appears to be a stability problem at first glance, I'd recommend running both a CPU stress test and a RAM test for an extended period of time. For example: run the RAM test for 1 week, just let it test non-stop. Maybe it will fail and point you in a direction. If it fails, I'd restart and run the test again, and if it fails again, did it fail at the same place, or near the same place? Repeatability of a failure is very important here. Take good notes. As for the CPU stress test, I personally would say 24 hours is good enough, but there are other gurus here who would test for a whole week. That is the difference between a home user and an IT specialist, I guess.

Components do fail, and it's no fun when they do.

You say you were installing a GPU, something that would put the power supply under more strain, so don't forget that as well. I would test with the GPU installed and all the hard drives connected, the same conditions the server normally runs under.
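A rough sketch of what that testing could look like from the shell, assuming stress-ng and memtester happen to be installed (a bootable MemTest86 USB stick is the more thorough option for the RAM side, since memtester can only exercise memory the OS lends it):

sudo stress-ng --cpu 0 --cpu-method matrixprod --timeout 24h --metrics-brief   # load every core for 24 hours
sudo memtester 4G 3                                                            # test 4 GiB of RAM for 3 passes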

Good Luck
 