Unhealthy Pool Won't Clear After Removing Corrupted File - No errors Now

dizydre21

Dabbler
Joined
Apr 10, 2023
Messages
15
Hello,

The title pretty much says it. "Pool1" came up unhealthy yesterday morning. I had a snapshot being taken and then a task to replicate it to my backup NAS. I thought maybe something weird happened during the snapshot, but a scrub flagged one movie file (a new one) as corrupt. That was the only error. I used "rm -r" on the directory, and since then I have run two scrubs and run "zpool clear Pool1" several times. There are no more errors, but running "zpool clear" doesn't seem to do anything. Am I doing something wrong, or is there something else that I need to do? Short SMART tests pass, and I can run a long one if needed, although a long one passed last weekend.

Hardware list below.

Asus Z370 Prime-A
I7-8700k (2 cores from Proxmox for Truenas)
32GB 3000MHz RAM - Non ECC (24GB for Truenas)
Two 2x6TB Vdevs - Seagate Ironwolf 5400RPM HDDs ST6000VN001-2BB186
970 Evo Boot Drive for Proxmox (250GB)
Two Sandisk SATA SSDs - One is the dedicated disk for TrueNAS
LSI-9211-8i - HDDs connected here and passed through to TrueNAS VM - In a PCIEx16 slot running at x8
RTX-2070 Super -Installed in first x16 slot, but running in x8-x8 with the LSI card
 

dizydre21

Dabbler
Joined
Apr 10, 2023
Messages
15
I would like to add that a long SMART test passed yesterday. I still have one error, but I don't know where to go to delete the file. The directory it shows in zpool status looks like a snapshot. I deleted the snapshot that I took that day and deleted the movie file in question. What else needs to be deleted?


Code:

root@truenas[~]# zpool status -v
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:33:48 with 1 errors on Wed Apr 26 23:46:07 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/b2f0fab8-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
            gptid/b2a8bbc6-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/b343667c-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
            gptid/b2e5e8cf-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        Pool1/Media/Videos@auto-2023-04-26_07-20:/Movies_UHD/The Godfather (1972)/The Godfather (1972) Remux-2160p Proper.mkv

  pool: boot-pool
 state: ONLINE
  scan: scrub repaired 0B in 00:00:04 with 0 errors on Wed Apr 26 03:45:04 2023
config:

        NAME        STATE     READ WRITE CKSUM
        boot-pool   ONLINE       0     0     0
          da4p2     ONLINE       0     0     0



Code:

root@truenas[~]# zpool status -x
  pool: Pool1
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: scrub repaired 0B in 05:33:48 with 1 errors on Wed Apr 26 23:46:07 2023
config:

        NAME                                            STATE     READ WRITE CKSUM
        Pool1                                           ONLINE       0     0     0
          mirror-0                                      ONLINE       0     0     0
            gptid/b2f0fab8-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
            gptid/b2a8bbc6-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
          mirror-1                                      ONLINE       0     0     0
            gptid/b343667c-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0
            gptid/b2e5e8cf-d9a2-11ed-a259-245ebe68032c  ONLINE       0     0     0

errors: 1 data errors, use '-v' for a list

 

dizydre21

Dabbler
Joined
Apr 10, 2023
Messages
15
I was just reading back through the zpool status results and realized that the snapshot it references was for Pool1/Media/Videos@auto...........

I deleted the wrong snapshot. I just deleted the correct one and started another scrub. Will report back.
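
For anyone landing here with the same symptom: an entry in the "Permanent errors" list of the form dataset@snapshot:/path means the damaged blocks are still referenced by that snapshot, so deleting the live file alone does not release them. Below is a minimal sketch of splitting such an entry into its snapshot name and file path. The helper names snapshot_of and path_in_snapshot are made up for illustration, and the zfs/zpool commands in the trailing comments are only an outline of the follow-up, not something to paste blindly.

```shell
#!/bin/sh
# Split one entry from the "Permanent errors" list of `zpool status -v`
# into the snapshot name and the file path inside that snapshot.
# Entries for files referenced by a snapshot look like:
#   dataset@snapshot:/path/to/file
# snapshot_of and path_in_snapshot are hypothetical helper names.

snapshot_of() {
    case "$1" in
        *@*:/*) printf '%s\n' "${1%%:/*}" ;;   # text before the first ":/"
        *)      return 1 ;;                     # not a snapshot-style entry
    esac
}

path_in_snapshot() {
    case "$1" in
        *@*:/*) printf '/%s\n' "${1#*:/}" ;;   # text after the first ":/"
        *)      return 1 ;;
    esac
}

# Entry copied from the zpool status -v output earlier in this thread.
entry='Pool1/Media/Videos@auto-2023-04-26_07-20:/Movies_UHD/The Godfather (1972)/The Godfather (1972) Remux-2160p Proper.mkv'

snap=$(snapshot_of "$entry")
echo "snapshot: $snap"
echo "file:     $(path_in_snapshot "$entry")"

# On the NAS itself the follow-up would be roughly (not run here):
#   zfs list -t snapshot -r Pool1/Media/Videos   # confirm the snapshot exists
#   zfs destroy "$snap"                          # drop the snapshot holding the bad blocks
#   zpool scrub Pool1                            # re-check the pool
```

Checking the snapshot name against the zfs list output before destroying anything guards against removing the wrong snapshot.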
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

Welcome to the forums.

Sorry to hear you're having problems. Happy to see that you posted a good summary of your hardware and operating environment. I usually have to pull teeth a bit to get details, so I am very happy to start today this way.

I believe that board is a gaming system and the CPU is overclockable. Your system has several strikes against it, including the lack of ECC memory, possible overclocking factors, and also PCIe passthru, which sometimes does not work correctly when you have unusual PCIe setups on non-server motherboards.

Please make certain that you are NOT doing any sort of overclocking, and have reset everything to conservative default values. A bit flip while playing GTA5 may not be fatal, but your memory and data paths are not protected and a bit flip can result in corruption in the pool. Once bad data is written, ZFS does not have any "fsck" or "chkdsk" type tools to restore correctness to the pool.

Additionally, PCIe passthru on non-server boards is sometimes dodgy, and using Proxmox with an HBA is probably a bad idea. It represents a large area of unknowns. Hopefully you burned your system in for at least several weeks of stressful I/O using something like solnet-array-test-v3 and other suggestions in the build guide.


Without the safety belts of ECC and a known good server board, testing the reliability of the system is really left in your hands. ZFS is very sensitive to failures and the stressy nature of operations such as scrubs and resilvers can bring out problems with your system, especially if (for example) maybe you inadvertently left it overclocked.
 

dizydre21

Dabbler
Joined
Apr 10, 2023
Messages
15
Thanks for the reply.

My hardware was used because I only needed to buy the disks for it. I will likely pick up a server board in the coming months to a year, but I wanted to test virtualization out. I had a number of issues getting it working, but I think it is relatively stable at this point. The HBA gave me some trouble too, but it wasn't too terribly bad. My data is also in several places, including a TrueNAS running bare metal and a cheap Synology NAS, both with mirrored disks. It's mostly just media, which I would be pissed to lose, but I wouldn't be out of a job or anything.

I deleted the crummy snapshot and my recent scrub passed and cleared the unhealthy status. All is in the green now. I basically misread the error all 767345345 times that I read it until the last lol.
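
A note on why the earlier attempts did nothing: as far as I understand it, "zpool clear" resets the per-device error counters, while the list of permanent errors is only revised as scrubs complete, which matches what happened here. A small sketch of checking the result from a script follows. pool_is_clean is a hypothetical helper name, and it assumes the usual "errors: No known data errors" line that zpool status prints when the list is empty.

```shell
#!/bin/sh
# Decide from `zpool status` text whether a pool still reports data errors.
# pool_is_clean is a hypothetical helper: it reads status output on stdin
# and succeeds only if it finds the "No known data errors" line.

pool_is_clean() {
    grep -q '^errors: No known data errors'
}

# Sample lines: the first is from the output earlier in this thread,
# the second is what zpool status prints once the error list is empty.
before="errors: 1 data errors, use '-v' for a list"
after='errors: No known data errors'

if printf '%s\n' "$before" | pool_is_clean; then
    echo "before: clean"
else
    echo "before: errors remain"
fi

if printf '%s\n' "$after" | pool_is_clean; then
    echo "after: clean"
else
    echo "after: errors remain"
fi

# On the NAS (not run here):
#   zpool status Pool1 | pool_is_clean && echo "Pool1 is healthy"
```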
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680

You have a sense of humor about it, so that's good; you'll probably do well here. Once stuff gets to a certain level of complexity, us dumb humans sometimes get derailed onto a dead siding by some misunderstanding or bad understanding. For your future possible server, feel free to refer to the forum guides on hardware selection. There are inexpensive used server options out there that are great deals if you know where to look.
 

dizydre21

Dabbler
Joined
Apr 10, 2023
Messages
15
Roger that and will do. Thanks
 