My pool is unhealthy and I don't understand why

BlazeStar · Oct 6, 2023

Hello,

Using TrueNAS-13.0-U5.3

My pool CFO is reported as unhealthy:

When I go into the status, it shows me this:

I recently ran a scrub, which came back with zero errors.

Also running the status command gives me the following output:

Code:

root@truenas[~]# zpool status -v CFO
  pool: CFO
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 11:13:45 with 0 errors on Sun Sep 24 07:42:57 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    CFO                                             ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX              ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX              ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX              ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX              ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        CFO/FreePBX-tw80i8:<0x1>

I don't understand what it means.

How is my pool unhealthy and how can I fix it?
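
For anyone reading along: the `<0x1>` entry is an object ID rather than a file path, which is the form zpool status -v prints when it cannot resolve the damaged object to a path. As a rough sketch, the entries under "errors:" can be pulled out and labeled like this. The helper name list_pool_errors is made up for illustration and is not a ZFS command.

```shell
#!/bin/sh
# Sketch: list the entries under "errors:" in `zpool status -v` output and
# label each as an object ID (dataset:<0x...>) or a regular file path.
# list_pool_errors is a hypothetical helper name, not part of ZFS.
list_pool_errors() {
    awk '
        /^errors:/ { in_errors = 1; next }   # entries follow this line
        in_errors && NF {
            entry = $0
            sub(/^[ \t]+/, "", entry)         # drop leading indentation
            kind = (entry ~ /:<0x[0-9a-fA-F]+>$/) ? "object-id" : "path"
            print kind "\t" entry
        }
    '
}

# Usage on a live system (reads the real pool status):
#   zpool status -v CFO | list_pool_errors
```

If the pool reports no known data errors, nothing follows the "errors:" line and the helper prints nothing.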

winnielinnie · Oct 6, 2023

ZFS metadata error.

Did you try "clearing" the error? Running a subsequent scrub?

Try in this order:

1. zpool clear CFO
2. Then run a scrub (overnight?)
3. Check if the error still exists with zpool status -v CFO
4. Try to clear the error again if it does

You might have to run the scrub twice. For the second pass, if you're short on time, I believe canceling the scrub can yield the same result and clear the error message from the log.
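
The order above can be sketched as a small shell helper. This is only a sketch: DRY_RUN and the function names are made up for illustration, and it assumes an OpenZFS release that has zpool wait (2.0 or later, as far as I know). With DRY_RUN=1, the default here, the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch of the clear -> scrub -> re-check sequence.
# DRY_RUN, run, clear_and_rescrub and cancel_scrub are hypothetical names;
# only the zpool subcommands themselves are real.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"      # print the command instead of running it
    else
        "$@"
    fi
}

clear_and_rescrub() {
    pool=$1
    run zpool clear "$pool"            # reset the pool's error counters
    run zpool scrub "$pool"            # start a scrub (runs in the background)
    run zpool wait -t scrub "$pool"    # block until the scrub ends (assumes OpenZFS 2.0+)
    run zpool status -v "$pool"        # check whether the error is still listed
}

cancel_scrub() {
    run zpool scrub -s "$1"            # stop a scrub that is in progress
}

# Usage:
#   clear_and_rescrub CFO              # prints the four commands
#   DRY_RUN=0 clear_and_rescrub CFO    # actually runs them
```

If the error is still listed afterwards, run the same function a second time for the second pass.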

Otherwise, what do SMART logs and selftests report about the drives used for this pool?

EDIT: Did anything happen prior to this? Unsafe shutdown? Power outage? Failed replication? Created a new dataset? Destroyed a dataset?

BlazeStar · Oct 10, 2023

winnielinnie said:
ZFS metadata error.

Did you try "clearing" the error? Running a subsequent scrub?

Try in this order:

zpool clear CFO

Then run a scrub (overnight?)

Check if the error still exists with zpool status -v CFO

Try to clear the error again if it does

You might have to run the scrub twice. For the second pass, if you're short on time, I believe canceling the scrub can yield the same result and clear the error message from the log.

So I ran the clear command, and then ran two subsequent scrubs.

The UNHEALTHY notice still shows up.

winnielinnie said:
Otherwise, what do SMART logs and selftests report about the drives used for this pool?

All short and long scans have completed with ZERO errors.

winnielinnie said:
EDIT: Did anything happen prior to this? Unsafe shutdown? Power outage? Failed replication? Created a new dataset? Destroyed a dataset?

Not to my knowledge.
The TrueNAS box is on a UPS and always shuts down safely.
There's no replication; I run S3 cloud syncs regularly and they seem fine.
I have not messed around with datasets.

The last thing I remember doing is upgrading to TrueNAS 13 and upgrading the pool's feature flags, but the error started showing up after that.

Paul5 · Oct 11, 2023

I don't use RAID, but when I get errors such as unhealthy or offline, the first thing I do is shut down, then unplug and replug/reseat the connectors about 4 or 5 times on both ends. Then I do the zpool clear > zpool status > scrub > zpool status routine as above.

If I also get the corrupted files message, then I just restore a backup after reseating the connectors as described above.

Knock on wood, so far I've been lucky and that fixes things.

Jailer · Oct 11, 2023

BlazeStar said:
How is my pool unhealthy and how can I fix it?

With metadata corruption and no obvious indicators of what caused it, I'd be doing some rigorous testing of my hardware.

joeschmuck · Oct 11, 2023

You may need to export the pool and then import it, followed by yet another scrub to verify all is good. This has fixed some metadata-type errors in the past. I would first do the other steps listed above, as they are common solutions as well. Thankfully you have a backup should you need it during this problem.
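
A rough sketch of the export/import route from the command line follows. The helper name export_import_plan is made up, and the -R /mnt altroot is an assumption based on TrueNAS mounting pools under /mnt. The function only prints the commands so they can be reviewed first. On TrueNAS the web UI's export/disconnect and import are probably the safer way to do this, since they keep the system's own records in sync.

```shell
#!/bin/sh
# Sketch: print the export -> import -> scrub -> status sequence for a pool.
# export_import_plan is a hypothetical helper; it prints commands, it does not run them.
export_import_plan() {
    pool=$1
    printf 'zpool export %s\n' "$pool"            # detach the pool from the system
    printf 'zpool import -R /mnt %s\n' "$pool"    # re-import; altroot /mnt is an assumption
    printf 'zpool scrub %s\n' "$pool"             # verify everything after the import
    printf 'zpool status -v %s\n' "$pool"         # see whether the error is gone
}

# Usage:
#   export_import_plan CFO          # review the printed commands
#   export_import_plan CFO | sh     # run them once they look right
```

Anything using the pool (VMs, jails, shares) needs to be stopped first, or the export will fail because the pool is busy.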

Out of curiosity, what is your hardware setup? Why do I ask? It may shed some light on the situation. Maybe not. But we typically ask for this data to be provided in order to help us help you.

Have you just powered off the unit, waited a few minutes, and powered it back on? You would be surprised how many things that fixes. I don't think it will work here but...

In the SMART data for your hard drives, what does the UDMA_CRC_Error_Count value look like? Zero is perfect. If there is a nonzero value, that is fine so long as it's not incrementing over time. If it's incrementing, that is a bad thing and likely points to a data connector.
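
As a sketch of how to read that value: on SATA drives it is normally SMART attribute ID 199 in the smartctl -A table, with the raw count in the last column. The helper name crc_error_count and the device name /dev/ada0 are only examples. SAS drives report errors in a different format, so this would not apply to them.

```shell
#!/bin/sh
# Sketch: pull the raw value of SMART attribute 199 (UDMA_CRC_Error_Count)
# out of `smartctl -A` output. crc_error_count is a hypothetical helper name.
crc_error_count() {
    # Attribute rows start with the numeric ID; the raw value is the 10th column.
    awk '$1 == 199 { print $10 }'
}

# Usage on a live system (device name is an example; adjust to your disks):
#   smartctl -A /dev/ada0 | crc_error_count
```

Note the number for each drive, then run it again after a few days of normal use to see whether it has moved.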
