Permanent errors on volume with hex codes

Joined
Apr 30, 2016
Messages
12
I have some permanent errors on one of my volumes, but these are shown as hex codes. For example:
Code:
...
pool: oracle02
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 2.77M in 0 days 00:08:18 with 0 errors on Mon Apr 29 00:00:12 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        oracle02                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x722>:<0x0>
        <0x78c>:<0x0>
        <0x2b9>:<0x0>
        <0x2b9>:<0x572>
        <0x6c6>:<0x15>
        <0x6c6>:<0x32>
        <0x6cc>:<0x0>
        <0x6cc>:<0xe>
oracle#


Reading back through older forum posts suggests that the only way to resolve this is to restore from a backup and recreate the pool.

The volume appears to be okay, and the errors I was seeing at 11.2 (https://www.ixsystems.com/community/threads/losing-zfs-pool-overnight.75994/) are no longer causing me an issue at 11.1. Maybe I just got unlucky on the last reboot, because these errors usually cleared on reboot or when I removed any temporary volumes used during testing.

I found a very useful post here http://unixetc.co.uk/2012/01/22/zfs-corruption-persists-in-unlinked-files/ which discussed the issue. Towards the end of the article, it suggests doing a scrub and then stopping that scrub immediately.
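For anyone finding this later, the commands boil down to something like this (pool name as in my output above; this is just my reading of that article, so treat it as a sketch rather than an official procedure):
Code:
zpool scrub oracle02       # start a scrub
zpool scrub -s oracle02    # -s stops the scrub that is in progress
zpool status -v oracle02   # check whether the error list is gone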

The net result is that I now have a clean volume. Question is, has anyone else been in this position and doubted the reliability? It is almost too easy!

Code:
...
pool: oracle02
state: ONLINE
  scan: scrub canceled on Mon Apr 29 22:29:15 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        oracle02                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: No known data errors
oracle#

-paul
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Hi Paul,

I have never seen errors listed as you have. I suspect this could be metadata corruption, or maybe something along the lines of iocage/jails.
I wouldn't trust running a scrub and then cancelling it as a reliable way to suppress the errors.
I would think that as soon as you complete your next scrub, the errors will be flagged again if they do still exist on your pool.
Unless you clear the state of the pool, those errors are supposed to remain present. Otherwise, what is the point of having them in the first place if the next scrub will clear them up?
From experience, I believe ZFS to be reliable.
It is possible for a corrupted file present in a dataset to have its error flagged during a scrub, but I would think destroying the dataset should cause the error to be erased.
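As an aside, the hex pairs are <dataset object ID>:<object number within that dataset>, so something like zdb could tell you which dataset an ID points at, if that dataset still exists (once it has been destroyed the ID can't be resolved to a name, which is when zpool status falls back to printing hex). This is only a rough sketch I have not run against your pool; 0x722 is 1826 in decimal:
Code:
zdb -d oracle02 | grep "ID 1826,"
# on FreeNAS you may need to point zdb at the system's cache file, e.g.:
# zdb -U /data/zfs/zpool.cache -d oracle02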
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Question is, has anyone else been in this position and doubted the reliability? It is almost too easy!
No, I have been very lucky and I have not had any data errors despite having to replace every drive in my storage multiple times over the last eight years, since 2011.

I would let the scrub run.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I suspect this could be metadata corruption,
That is what I suspect.
Unless you clear the state of the pool, those errors are supposed to remain present.
I have seen threads where a scrub corrected errors for someone. It depends on the type and cause of the error I suppose. I do agree that stopping the scrub early is not a good answer.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I have seen threads where a scrub corrected errors for someone. It depends on the type and cause of the error I suppose. I do agree that stopping the scrub early is not a good answer.
I understand that running a scrub will trigger resilvering of a failed block, and the error will be corrected if enough redundancy exists, but this error will not go into the corrupted file list.
I doubt an error flagging a block as causing file corruption can be fixed on its own, even when a scrub is run again, because ZFS would have fixed the issue in the first place.
One scenario I can see where this could potentially work is a mirror or RAIDZ-x pool where all of the redundancy disks have been taken out of the pool (physical disk removal, but the pool still has them defined, so it shows as degraded) and an error then occurs on the pool; that error will be unrecoverable at the time. So running a scrub on the pool in that degraded state will flag the file as unrecoverable.
If the redundancy disks are inserted back, the resilver will proceed, and if the file is unmodified since the disks were removed, the resilver should theoretically be able to reconstruct it, because it would have access to the remaining redundant blocks.
Hence the file is no longer corrupted and no longer needs to be part of the flagged corrupted-file list.
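If someone wanted to see that behaviour for themselves, a rough and untested sketch with throwaway file-backed vdevs would look something like the following; every name here is made up for the example:
Code:
truncate -s 1G /tmp/d1 /tmp/d2
zpool create testpool mirror /tmp/d1 /tmp/d2
cp somefile /testpool/           # data written while both sides of the mirror are present
zpool offline testpool /tmp/d2   # redundancy removed; the pool reports DEGRADED
# ... damage some of the file's blocks on /tmp/d1 here ...
zpool scrub testpool             # no intact copy online, so the file is flagged as a permanent error
zpool online testpool /tmp/d2    # reattach; a resilver runs
zpool scrub testpool             # the untouched copies on /tmp/d2 can now repair /tmp/d1
zpool status -v testpool         # the error list should eventually clear
zpool destroy testpool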
 
Joined
Apr 30, 2016
Messages
12
I suspect this could be metadata corruption, or maybe something along the lines of iocage/jails.
Yes, I thought that too, and several of the entries were files or volume roots (the <0x0> hex code).

I do intend to let a scrub run, but I think the point was that cancelling it forced the tail end of the scrub. Before that, though, I want to see whether the backup process which caused all the problems in that other thread works overnight; it takes quite a while to run a full scrub. I have not rebooted yet and will do that shortly before the backups kick in.

To be perfectly honest, I was surprised to see it come back clean given what I had read elsewhere. I was really just pointing out that article and seeing if anyone had used it previously.

-paul
 
Joined
Apr 30, 2016
Messages
12
Came back clean after a reboot too.

-paul
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Hi Paul,

Reading your first post, I noticed that your pool "oracle02" (not to be mistaken for a dataset) had been removed from the system. I think of this issue as either a hardware fault, or maybe your pool got disconnected through unknown means (power saving, a FreeNAS 11.2 driver issue on the HBA side...).
It doesn't strike me that one of your disks was at fault, but rather the entire pool, hence the low checksum error counts.
I suspect there may be some docker/iocage/SMB issue causing the crash under significant load. The HBA driver could also do that.

From the link to the Unix post, it seems to me the user was dealing with a clone of one of his snapshots, and it is conceivable he did something that would trigger errors if the data that was cloned was already corrupted.
But up to this point, this is just a hypothesis on my part.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Came back clean after a reboot too.

-paul
A reboot doesn't change the state of a failed pool.
If you had a failed pool and attached it to a brand new system without the error having been cleared, it would still appear on the new system, pointing to the corrupted files.

If error reporting were volatile, I would be extremely worried, as it would let corrupted data slip through unnoticed, which would defeat the purpose of ZFS.
But just so you know, I believe running a scrub will bring the old errors back, except in the cases I described previously.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Unless you clear the state of the pool, those errors are supposed to remain present.
A reboot will also clear pool errors. It won't resolve missing/failed devices, and it obviously won't fix corrupted data on disk, but it will (temporarily) get rid of the kinds of error messages OP is seeing. But if the errors are actually there, the next scrub will show them again.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
A reboot will also clear pool errors. It won't resolve missing/failed devices, and it obviously won't fix corrupted data on disk, but it will (temporarily) get rid of the kinds of error messages OP is seeing. But if the errors are actually there, the next scrub will show them again.
I should have written:
"the list of corrupted files is supposed to remain present until "zpool clear tank" is run"
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
"the list of corrupted files are supposed to remain present until the "zpool clear tank" is run"
I don't believe this is correct. I know that other sorts of pool errors (read/write/checksum) are reset on reboot, but I don't recall specifically what happens with file/metadata errors.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
The output literally tells you exactly what to do. Pool is corrupt and needs to be rebuilt.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
The output literally tells you exactly what to do. Pool is corrupt and needs to be rebuilt.
It doesn't say the pool is corrupt. The pool is still in good health, otherwise the content would not still be accessible.
I would not destroy the pool in order to restore from backup. If there are snapshots, then the file in question could be searched for with "zfs diff ...", I think.
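Something along these lines, where the dataset and snapshot names are only placeholders:
Code:
# list files created, modified, renamed or deleted between the snapshot and the live dataset
zfs diff oracle02/mydataset@before oracle02/mydataset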
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It doesn't say the pool is corrupt. The pool is still in good health, otherwise the content would not still be accessible.
I would not destroy the pool in order to restore from backup. If there are snapshots, then the file in question could be searched for with "zfs diff ...", I think.
Well you're wrong. Good luck recovering the pool.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Not my pool.
The result of the scrub should tell us where the OP stands.
 
Joined
Apr 30, 2016
Messages
12
I suspect there may be some docker/iocage/SMB issue causing the crash under significant load. The HBA driver could also do that.
That was exactly it. After I upgraded from 11.1 to 11.2, I found that the pool became unavailable midway through backups being written to it via SMB. That particular pool has a port multiplier involved (I know, I know). It had worked for me for years, but caused problems at 11.2. Now that I have reverted to 11.1, the backups no longer fail under load: 700GB was written overnight without issue. The pool still appears clean.

Possibly a saving grace is that almost all the permanent errors were on datasets which no longer exist (oracle02/iocage, oracle02/.system/samba4, oracle02/tmp/media, etc.) and several files in /var/db/system/... which also no longer exist, as the OS has been reinstalled.

A scrub is now underway.

-paul
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
A scrub is now underway
Let's wait for... Anyway, I've seen a case on our forums where a pool had metadata corruption; IIRC things seemed to go well for a while, but later the pool kept being unstable and finally had to be re-created anyway...

Maybe you will at least be able to extract some files if you do need to destroy the pool... Do you have a backup?

Sent from my phone
 
Joined
Apr 30, 2016
Messages
12
The scrub result is positive.

Code:
  pool: oracle02
 state: ONLINE
  scan: scrub repaired 0 in 1 days 14:47:23 with 0 errors on Wed May  1 22:30:09 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    oracle02                                        ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: No known data errors


As to backups, I do have the most important files on a separate NAS drive and now on the second pool, but I would lose some files I would rather not lose if I lost the whole pool.

The intention was to have a daily replication task to back up the main pool. I will have to set that up again now that I have gone back to 11.1.
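For the record, the manual equivalent of what I plan to set up again is roughly the following; the snapshot and target names are placeholders, and in FreeNAS this would normally be a periodic snapshot task plus a replication task in the GUI rather than commands run by hand:
Code:
zfs snapshot -r oracle02@backup-20190502
zfs send -R oracle02@backup-20190502 | ssh backupnas zfs receive -u backuppool/oracle02
# later runs would use incremental sends: zfs send -R -i <previous-snapshot> <new-snapshot>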

-paul
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
<0x722>:<0x0> <0x78c>:<0x0> <0x2b9>:<0x0> <0x2b9>:<0x572> <0x6c6>:<0x15> <0x6c6>:<0x32> <0x6cc>:<0x0> <0x6cc>:<0xe>
I wonder why these errors have disappeared?

Sent from my phone
 