My pool is unhealthy and I don't understand why

BlazeStar

Patron
Joined
Apr 6, 2014
Messages
383
Hello,

Using TrueNAS-13.0-U5.3

My pool CFO is reported as unhealthy:

[Screenshot: 1696617761643.png]


When I go in the status, it shows me this:

[Screenshot: 1696617804985.png]


I recently ran a scrub, which came back with zero errors.

Also running the status command gives me the following output:

Code:
root@truenas[~]# zpool status -v CFO
  pool: CFO
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 11:13:45 with 0 errors on Sun Sep 24 07:42:57 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    CFO                                             ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX  ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX  ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX  ONLINE       0     0     0
        gptid/XXXX-cbf2-11ec-b2fb-XXXX  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        CFO/FreePBX-tw80i8:<0x1>


I don't understand what it means.

How is my pool unhealthy and how can I fix it?
 
Joined
Oct 22, 2019
Messages
3,641
ZFS metadata error.
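
For context: as far as I know, an entry such as `CFO/FreePBX-tw80i8:<0x1>` is a dataset name followed by an object number, which is how `zpool status -v` lists an error it cannot map to a file path. Below is a minimal sketch for pulling those entries out of the status output. The helper name `list_unresolved_objects` is made up for illustration and is not a ZFS tool.

```shell
#!/bin/sh
# list_unresolved_objects (hypothetical helper): read `zpool status -v`
# output on stdin and print "<dataset> <object number>" for every entry
# in the permanent-errors list of the form dataset:<0xNN>.
list_unresolved_objects() {
    awk '
        /^errors: Permanent errors/ { in_errors = 1; next }
        in_errors {
            line = $0
            sub(/^[ \t]+/, "", line)   # drop leading indentation
            sub(/[ \t]+$/, "", line)   # drop trailing whitespace
            if (line ~ /:<0x[0-9a-fA-F]+>$/) {
                n = index(line, ":<0x")
                # dataset is everything before the colon; the object
                # number is the text between "<" and ">"
                print substr(line, 1, n - 1), substr(line, n + 2, length(line) - n - 2)
            }
        }
    '
}

# Example: zpool status -v CFO | list_unresolved_objects
```

Entries that resolve to a normal file path are skipped by this sketch on purpose; it only reports the ones ZFS could not name.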

Did you try "clearing" the error? Running a subsequent scrub?

Try in this order:
  1. zpool clear CFO
  2. Then run a scrub (overnight?)
  3. Check if the error still exists with zpool status -v CFO
  4. Try to clear the error again if it does
You might have to run the scrub twice. For the second pass, if time isn't available, I believe canceling the scrub can also yield the same result and clear the error message from the log.
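
The sequence above could be scripted roughly as follows. This is only a sketch: `clear_scrub_check` and `POLL_SECONDS` are names I made up, and the wait loop assumes `zpool status` prints "scrub in progress" while a scrub is running.

```shell
#!/bin/sh
# Sketch of the clear -> scrub -> check sequence for one pool.
# POLL_SECONDS is a hypothetical knob: how often to re-check scrub progress.
POLL_SECONDS="${POLL_SECONDS:-300}"

clear_scrub_check() {
    pool="$1"
    zpool clear "$pool" || return 1    # step 1: reset the error counters
    zpool scrub "$pool" || return 1    # step 2: start a scrub
    # Wait while the status output still reports a running scrub.
    while zpool status "$pool" | grep -q "scrub in progress"; do
        sleep "$POLL_SECONDS"
    done
    zpool status -v "$pool"            # step 3: see whether the error is still listed
}

# Example: clear_scrub_check CFO
```

If the error is still listed afterwards, running the function a second time covers step 4 and the second scrub.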

Otherwise, what do SMART logs and selftests report about the drives used for this pool?
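
A rough way to collect that from the shell, assuming `smartctl` is installed and the disks show up under `/dev` with names like `ada0` (the names are an assumption, adjust them for your system). `smart_report` is a made-up helper name.

```shell
#!/bin/sh
# smart_report (hypothetical helper): print the overall health verdict
# and the self-test log for each disk name given as an argument.
smart_report() {
    for disk in "$@"; do
        echo "== $disk =="
        smartctl -H "/dev/$disk"            # overall health self-assessment
        smartctl -l selftest "/dev/$disk"   # history of short/long self-tests
    done
}

# Example on a FreeBSD-based system, where kern.disks lists the disk names:
#   smart_report $(sysctl -n kern.disks)
```

Note that `kern.disks` may also list devices that do not support SMART, so some entries can simply report an error.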


EDIT: Did anything happen prior to this? Unsafe shutdown? Power outage? Failed replication? Created a new dataset? Destroyed a dataset?
 

BlazeStar

Patron
Joined
Apr 6, 2014
Messages
383
So I ran the clear command, and then ran two subsequent scrubs.

The UNHEALTHY notice still shows up.

Otherwise, what do SMART logs and selftests report about the drives used for this pool?

All short and long tests have completed with zero errors.

EDIT: Did anything happen prior to this? Unsafe shutdown? Power outage? Failed replication? Created a new dataset? Destroyed a dataset?

Not to my knowledge.
The TrueNAS box is on a UPS and always shuts down safely.
There's no replication; I run S3 cloud syncs regularly and they seem fine.
I have not messed around with datasets.

The last thing I remember doing is upgrading to TrueNAS 13 and running the feature flags upgrade; the error started showing up after that.
 

Paul5

Contributor
Joined
Jun 17, 2013
Messages
117
I don't use RAID, but when I get errors such as unhealthy or offline, the first thing I do is shut down, then unplug and replug/reseat the connectors about 4 or 5 times on both ends. Then I do the zpool clear > zpool status > scrub > zpool status sequence as above.

If I also get the corrupted-files message, I just restore a backup after reseating the connectors as described.

Knock on wood, so far I've been lucky and that fixes things.
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,977
How is my pool unhealthy and how can I fix it?
With metadata corruption and no obvious indicator of what caused it, I'd be doing some rigorous testing of my hardware.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
You may need to export the pool and then import it, followed by yet another scrub to verify all is good. This has fixed some metadata-type errors in the past. I would first do the other steps listed above, as they are common solutions as well. Thankfully you have a backup should you need it during this problem.
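
From the command line, the export/import cycle would look roughly like the sketch below. `reimport_and_scrub` is a made-up name, and on TrueNAS I would assume the web UI's export and import is the safer route, since the system itself keeps track of the pool. Anything using the pool (shares, jails, VMs) has to be stopped first or the export will fail.

```shell
#!/bin/sh
# reimport_and_scrub (hypothetical helper): export the pool, import it
# again, then start a fresh scrub. Stops at the first failing step.
reimport_and_scrub() {
    pool="$1"
    zpool export "$pool" || return 1   # fails if the pool is still in use
    zpool import "$pool" || return 1
    zpool scrub "$pool"                # check with `zpool status -v` once it finishes
}

# Example: reimport_and_scrub CFO
```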

Out of curiosity, what is your hardware setup? Why do I ask? It may shed some light on the situation. Maybe not. But we typically ask for this data to be provided in order to help us help you.

Have you just powered off the unit, waited a few minutes, and powered it back on? You would be surprised what that fixes. I don't think it will work here but...

In the SMART data for your hard drives, what does the UDMA_CRC_Error_Count value (attribute 199) look like? Zero is perfect. If it is nonzero but not incrementing over time, that is fine. If it is incrementing, that is a bad thing and likely points to a data cable or connector.
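
A small sketch for checking this, assuming SATA drives where `smartctl -A` lists the CRC counter as attribute ID 199 with the raw value in the tenth column. `crc_raw_value` and `crc_trend` are made-up helper names, and the device name in the example is an assumption.

```shell
#!/bin/sh
# crc_raw_value (hypothetical helper): read `smartctl -A` output on stdin
# and print the raw value of attribute 199 (the UDMA CRC error counter).
crc_raw_value() {
    awk '$1 == 199 { print $10; exit }'
}

# crc_trend (hypothetical helper): compare an earlier reading with the
# current one and print zero, stable, or increasing.
crc_trend() {
    prev="$1"
    cur="$2"
    if [ "$cur" -eq 0 ]; then
        echo "zero"
    elif [ "$cur" -gt "$prev" ]; then
        echo "increasing"
    else
        echo "stable"
    fi
}

# Example: note today's value, then compare again in a few days.
#   smartctl -A /dev/ada0 | crc_raw_value
#   crc_trend 12 12
```

Only "increasing" is the worrying result here; a nonzero value that stays put usually reflects an old cabling problem.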
 