Unsure Why Pool failed

Stickeris

Dabbler
Joined
Sep 3, 2022
Messages
18
Hey Forums,

I just got back from a two-week trip to find my server not running. When I restart it, I am faced with a ton of issues. The main one seems to be that my primary pool will not import at boot, and if I import it manually it starts giving me read and write errors almost immediately. These errors compound until the system crashes or I shut it down. I am a little unsure where to start. My boot pool looks fine and the middleware is running. I'm not seeing any issues at startup other than IPMI not starting, but my board doesn't come with IPMI, so that's not surprising.

Here are the 3 critical errors I get at startup:
1. Failed to sync TRUECHARTS catalog: [EFAULT] Invalid operation: ==
2. Failed to configure Kubernetes cluster for Applications: Missing (a list of all Kubernetes files on my main pool)
3. Pool -Mainpool- state is OFFLINE: None

I am running with
  • Intel E52650I-v2
  • Asus P9X79 WS
  • 64 GB RAM
  • TrueNAS-SCALE-22.02.3
  • 2 onboard 1 Gb Intel NICs
  • Boot off 2 120 GB IronWolf SSDs
  • 1 zpool with 4 IronWolf 8TB drives in RAIDZ1 with 1x 256GB NVMe cache
About 4 months ago I installed a Dell H200 that I got from another hobbyist. It was already flashed and set up for ZFS, and it has been working with no issues since install.

Any help would be appreciated. Thank you.
 

Samuel Tai

Never underestimate your own stupidity
Moderator
Joined
Apr 24, 2020
Messages
5,399
Please attach the output of zpool import, so we can see the state of the pool.
 

Stickeris

The output is simply
no pools available to import

If I run
Code:
zpool import -N name-of-my-pool


it waits a second, imports, and then starts failing. I can run that command and post the output of zpool status if that would help.
 

Samuel Tai

What about zpool import -f -F -R /mnt name-of-my-pool? -N imports the pool without mounting any of its datasets, which would lead to the errors you see. (You may have intended -n instead, which, combined with -F, does a dry run of the recovery without actually performing it.)
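As a rough sketch of how the flags mentioned above combine, assuming Python 3 is available on the host: `build_import_command` is a hypothetical helper written for this thread, not part of ZFS or TrueNAS. It only assembles the argument list, so nothing touches the pool unless the list is later handed to something like `subprocess.run`.

```python
from typing import List, Optional


def build_import_command(
    pool: str,
    altroot: Optional[str] = "/mnt",
    force: bool = True,
    recovery: bool = True,
    dry_run: bool = False,
    no_mount: bool = False,
) -> List[str]:
    """Assemble a `zpool import` argument list (hypothetical helper)."""
    if dry_run and not recovery:
        # -n is only meaningful as a modifier of the -F recovery option.
        raise ValueError("-n only has an effect together with -F")
    cmd = ["zpool", "import"]
    if force:
        cmd.append("-f")  # import even if the pool appears to be in use elsewhere
    if recovery:
        cmd.append("-F")  # recovery mode for a pool that will not import normally
    if dry_run:
        cmd.append("-n")  # with -F: check whether recovery would work, change nothing
    if no_mount:
        cmd.append("-N")  # import without mounting any datasets
    if altroot:
        cmd += ["-R", altroot]  # alternate root, /mnt in the command above
    cmd.append(pool)
    return cmd


if __name__ == "__main__":
    # Dry run first, then the command as suggested above.
    print(" ".join(build_import_command("name-of-my-pool", dry_run=True)))
    print(" ".join(build_import_command("name-of-my-pool")))
```

Running the dry-run variant first is a cautious choice, since `-F` may discard the most recent transactions in order to make the pool importable.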
 

Stickeris

About an hour ago I attempted to run this command. Before I did, I tried running zpool import again and got this:

Code:
   pool: Main-Pool
     id: 4362685561878101459
  state: ONLINE
status: One or more devices were being resilvered.
 action: The pool can be imported using its name or numeric identifier.
 config:

        Main-Pool                                 ONLINE
          raidz1-0                                ONLINE
            612e0129-8cbb-4933-a35f-6f3bdcbc0c87  ONLINE
            02037d57-5b22-4a6a-8a37-4196d1bcc066  ONLINE
            0ad52698-6084-4086-adec-c602064f76a1  ONLINE
            846551cd-7fb8-45b4-9884-c522355dac45  ONLINE
        cache
          ee53604f-71de-4fa9-a977-198fa048bed8



When I tried to run the command, I got this response:
Code:
cannot import 'Main-Pool': one or more devices is currently unavailable


The resilvering seems to be taking quite a while, as I have tried this command twice with the same result. I think it may be hanging on the resilver, and that may be the source of my issues.
 

Samuel Tai

Look in /var/log/middlewared.log.* for any entries containing your pool member GUIDs. That may give you an idea of which disks are involved.
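A minimal sketch of that log search, assuming Python 3 on the host and that the logs are plain text (or gzip-compressed once rotated, which is an assumption). `find_guid_lines` is a hypothetical helper, and the glob below also picks up the current `middlewared.log`, which is slightly wider than the pattern given above. The GUIDs are the pool members from the zpool import output earlier in the thread.

```python
import glob
import gzip
from typing import Dict, Iterable, List


def find_guid_lines(paths: Iterable[str], guids: Iterable[str]) -> Dict[str, List[str]]:
    """Return, per GUID, the log lines (prefixed with file:line) that mention it."""
    wanted = [g.lower() for g in guids]
    hits: Dict[str, List[str]] = {g: [] for g in wanted}
    for path in paths:
        # Assumption: rotated logs may be gzip-compressed; everything else is plain text.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as fh:
            for lineno, line in enumerate(fh, 1):
                lowered = line.lower()
                for guid in wanted:
                    if guid in lowered:
                        hits[guid].append(f"{path}:{lineno}: {line.rstrip()}")
    return hits


if __name__ == "__main__":
    members = [
        "612e0129-8cbb-4933-a35f-6f3bdcbc0c87",
        "02037d57-5b22-4a6a-8a37-4196d1bcc066",
        "0ad52698-6084-4086-adec-c602064f76a1",
        "846551cd-7fb8-45b4-9884-c522355dac45",
        "ee53604f-71de-4fa9-a977-198fa048bed8",  # cache device
    ]
    logs = sorted(glob.glob("/var/log/middlewared.log*"))
    for guid, lines in find_guid_lines(logs, members).items():
        print(f"{guid}: {len(lines)} matching line(s)")
        for line in lines[-5:]:  # show only the most recent few per member
            print("   ", line)
```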
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
ZFS does not resilver exported pools. What that status means is that the pool was being resilvered at the time of the shutdown or crash. It will likely restart the resilver if you are able to import the pool.
 

Stickeris

The pool mounted as degraded and is still resilvering.

It looks like the error was with 846551cd-7fb8-45b4-9884-c522355dac45, as it is now FAULTED with too many errors. That said, I don't think the issue is with the drives. I will wait the full 2 days it wants to resilver, but I'm doubtful. Can I remove a faulted drive via detach without causing errors during the resilver?

If so, I will use Seagate's Disk Doctor to run a full scan/repair on it.

My problem now is that this is the second time I have had to do this. Each time, I go through the process of checking each drive and repairing any issues, and then they reload and work fine for 6 months before I have to do the whole process over again. Could this be a motherboard issue? I really don't trust my motherboard and am wondering if it or the CPU could be at fault. That is why I installed an HBA.
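Before pulling or testing a drive, it helps to know which physical disk a member such as 846551cd-7fb8-45b4-9884-c522355dac45 actually is. The sketch below assumes the member names shown by TrueNAS SCALE are partition UUIDs that appear as symlinks under `/dev/disk/by-partuuid` (the usual Linux udev layout); `resolve_partuuid` is a hypothetical helper, not a TrueNAS API.

```python
import os
from typing import Optional


def resolve_partuuid(
    partuuid: str, by_partuuid_dir: str = "/dev/disk/by-partuuid"
) -> Optional[str]:
    """Follow the by-partuuid symlink to the partition node, e.g. /dev/sdX1.

    Returns None when no such link exists, for example when the disk has
    dropped off the bus entirely.
    """
    link = os.path.join(by_partuuid_dir, partuuid)
    if not os.path.lexists(link):
        return None
    return os.path.realpath(link)


if __name__ == "__main__":
    for member in (
        "612e0129-8cbb-4933-a35f-6f3bdcbc0c87",
        "02037d57-5b22-4a6a-8a37-4196d1bcc066",
        "0ad52698-6084-4086-adec-c602064f76a1",
        "846551cd-7fb8-45b4-9884-c522355dac45",
    ):
        print(member, "->", resolve_partuuid(member) or "not present")
```

A member that resolves to nothing is a hint that the disk is not visible to the system at all, which points at cabling, the backplane, or the controller rather than at bad sectors.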
 

Samuel Tai

Please report the output of zpool status -v Main-Pool, as well as details of your enclosure. You may have a bad SATA port, or bad memory.
 

Stickeris

My web GUI crashed; luckily SSH still works.

Here is the output:
Code:
  pool: Main-Pool
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Mon Mar 20 17:17:47 2023
    114G scanned at 27.7M/s, 95.5G issued at 23.2M/s, 6.33T total
    508K resilvered, 1.47% done, 3 days 06:27:21 to go

config:

    NAME                                      STATE     READ WRITE CKSUM
    Main-Pool                                 DEGRADED     0     0     0
      raidz1-0                                DEGRADED   736   120     0
        612e0129-8cbb-4933-a35f-6f3bdcbc0c87  ONLINE       0     0     0  (resilvering)
        02037d57-5b22-4a6a-8a37-4196d1bcc066  ONLINE       7    56     0
        0ad52698-6084-4086-adec-c602064f76a1  ONLINE     469   121     0
        846551cd-7fb8-45b4-9884-c522355dac45  FAULTED  1.18K 1.17K     0  too many errors

    cache
      ee53604f-71de-4fa9-a977-198fa048bed8    ONLINE       0     0     0

errors: List of errors unavailable: pool I/O is currently suspended
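
The per-device counters in the config table above can be tabulated with a short script. This is a sketch only: `parse_rows` and `rows_with_errors` are hypothetical helpers, the sample text is copied from the output above, and the conversion assumes that a `K` suffix in zpool's counters means a multiple of 1024.

```python
import re
from typing import List, NamedTuple

STATUS_SAMPLE = """\
    NAME                                      STATE     READ WRITE CKSUM
    Main-Pool                                 DEGRADED     0     0     0
      raidz1-0                                DEGRADED   736   120     0
        612e0129-8cbb-4933-a35f-6f3bdcbc0c87  ONLINE       0     0     0  (resilvering)
        02037d57-5b22-4a6a-8a37-4196d1bcc066  ONLINE       7    56     0
        0ad52698-6084-4086-adec-c602064f76a1  ONLINE     469   121     0
        846551cd-7fb8-45b4-9884-c522355dac45  FAULTED  1.18K 1.17K     0  too many errors
    cache
      ee53604f-71de-4fa9-a977-198fa048bed8    ONLINE       0     0     0
"""

ROW = re.compile(
    r"^\s*(?P<name>\S+)\s+(?P<state>ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(?P<read>\S+)\s+(?P<write>\S+)\s+(?P<cksum>\S+)"
)


class Row(NamedTuple):
    name: str
    state: str
    read: int
    write: int
    cksum: int


def parse_count(text: str) -> int:
    """Convert a counter such as '736' or '1.18K' to an approximate integer."""
    suffixes = {"K": 1024, "M": 1024**2, "G": 1024**3}  # assumption: binary multiples
    if text[-1] in suffixes:
        return round(float(text[:-1]) * suffixes[text[-1]])
    return int(text)


def parse_rows(status_text: str) -> List[Row]:
    """Extract every row that has a state and three counters; skips headers."""
    rows = []
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append(Row(m["name"], m["state"], parse_count(m["read"]),
                            parse_count(m["write"]), parse_count(m["cksum"])))
    return rows


def rows_with_errors(status_text: str) -> List[Row]:
    """Rows with any non-zero counter, worst first."""
    bad = [r for r in parse_rows(status_text) if r.read + r.write + r.cksum > 0]
    return sorted(bad, key=lambda r: r.read + r.write + r.cksum, reverse=True)


if __name__ == "__main__":
    for r in rows_with_errors(STATUS_SAMPLE):
        print(f"{r.name:40} {r.state:9} read={r.read} write={r.write} cksum={r.cksum}")
```

On the sample above, three of the four RAIDZ1 members report read or write errors, not only the faulted one.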


My enclosure is a rack-mounted Rosewill RSV-L4000U 4U server chassis. The drives are all in their bays. The SSDs that run the boot pool are also in bays, although less securely. The NVMe cache is sitting on a PCIe adapter.

The SSDs are connected directly to the motherboard, and the zpool runs through the HBA. I have had an issue in the past where I felt the motherboard's SATA controller had failed, and I switched to the HBA, which seemed to fix it. Memory-wise, I'm running 64GB in an 8x8 configuration.

Also thank you for the help thus far
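As a sanity check on the resilver estimate in the status output above, the remaining time is roughly the data not yet issued divided by the issue rate. This is a sketch that assumes zpool's T, G, and M are binary units (TiB, GiB, MiB); the function names are made up for illustration.

```python
def resilver_eta_seconds(total_tib: float, issued_gib: float, rate_mib_s: float) -> float:
    """Seconds left = (total - issued) / issue rate, all converted to MiB."""
    total_mib = total_tib * 1024 * 1024
    issued_mib = issued_gib * 1024
    return (total_mib - issued_mib) / rate_mib_s


def percent_done(total_tib: float, issued_gib: float) -> float:
    """Share of the total that has been issued so far, in percent."""
    return 100.0 * (issued_gib * 1024) / (total_tib * 1024 * 1024)


def format_eta(seconds: float) -> str:
    """Render seconds in the 'N days HH:MM:SS' style zpool uses."""
    whole = int(seconds)
    days, rem = divmod(whole, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{days} days {hours:02d}:{minutes:02d}:{secs:02d}"


if __name__ == "__main__":
    # Figures from the scan lines: 95.5G issued at 23.2M/s, 6.33T total.
    eta = resilver_eta_seconds(6.33, 95.5, 23.2)
    print(format_eta(eta))
    print(f"{percent_done(6.33, 95.5):.2f}% done")
```

The result lands a little over three days and about 1.47% done, in line with what zpool reports, so the estimate is driven by the low 23.2M/s issue rate rather than by any reporting glitch.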
 

Samuel Tai

I'm curious how stable your chassis is physically. You may be experiencing sympathetic vibrations or resonance from drive to drive, causing head flutter. This can explain the excessive read and write errors you're seeing.

Looking at the Rosewill site, the drive caddies don't appear to be securely fastened to the drives, but are just friction fit. Try moving the drives around to leave a space between drives to change the drive-to-drive resonance.

Also, from your description, it doesn't look like you have ECC RAM, so this may also be a case of in-RAM corruption spuriously leading ZFS to think reads and writes are corrupt. This is much less likely, unless your motherboard is somehow applying an overclock. It's generally a bad idea to overclock with ZFS.
 

Stickeris

No overclock, and not ECC. I will try rearranging the chassis. Would it be a good idea to test each drive with Seagate Disk Doctor and then reconnect them to the pool?
 

Samuel Tai

I'm skeptical Seagate Disk Doctor understands ZFS well enough to attempt a repair. You could try a diagnostic so long as it doesn't write to the drive.
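One non-writing option, swapped in here as an alternative to the vendor tool, is the drive's own SMART self-test via `smartctl` from smartmontools. The sketch below assumes `smartctl` is installed and that the disk is visible as a normal `/dev/sdX` node behind the HBA; `smart_check_commands` is a hypothetical helper that only builds the commands and does not run them.

```python
from typing import Dict, List


def smart_check_commands(device: str) -> Dict[str, List[str]]:
    """Build read-only smartctl invocations for one disk (hypothetical helper)."""
    if not device.startswith("/dev/"):
        raise ValueError(f"expected a /dev/... path, got {device!r}")
    return {
        "health": ["smartctl", "-H", device],                # overall health verdict
        "attributes": ["smartctl", "-A", device],            # vendor attribute table
        "start_long_test": ["smartctl", "-t", "long", device],   # extended self-test
        "selftest_log": ["smartctl", "-l", "selftest", device],  # results of past tests
    }


if __name__ == "__main__":
    # /dev/sdX is a placeholder; substitute the real device node.
    for label, cmd in smart_check_commands("/dev/sdX").items():
        print(f"{label:16} {' '.join(cmd)}")
```

The extended self-test runs inside the drive and can take many hours on an 8TB disk, so the self-test log is the place to look for the outcome afterwards.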
 

Stickeris

Short update: I had a spare HBA lying around, so I swapped it in and booted up. I detached the pool and reset it. I had forgotten that the spare HBA is not set to IT mode, so I switched back to the original HBA, this time with the drives staggered, two in each bay. I booted up and the system imported Main-Pool with no issues. I've been monitoring it and so far there are no errors. Thank you all for the help. I will post an update if I experience another system crash, and I will start looking for a better case.
 