Permanent errors on volume with hex codes

Joined
Apr 30, 2016
Messages
12
I have some permanent errors on one of my volumes, but these are shown as hex codes. For example:
Code:
...
pool: oracle02
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: resilvered 2.77M in 0 days 00:08:18 with 0 errors on Mon Apr 29 00:00:12 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        oracle02                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0x722>:<0x0>
        <0x78c>:<0x0>
        <0x2b9>:<0x0>
        <0x2b9>:<0x572>
        <0x6c6>:<0x15>
        <0x6c6>:<0x32>
        <0x6cc>:<0x0>
        <0x6cc>:<0xe>
oracle#


Reading back through older forum posts suggests that the only way to resolve this is to restore from a backup and recreate the pool.

The volume appears to be okay, and the errors I was seeing at 11.2 (https://www.ixsystems.com/community/threads/losing-zfs-pool-overnight.75994/) are no longer causing me an issue at 11.1. Maybe I just got unlucky on the last reboot, because these errors usually cleared on reboot or when I removed any temporary volumes used during testing.

I found a very useful post here http://unixetc.co.uk/2012/01/22/zfs-corruption-persists-in-unlinked-files/ which discussed the issue. Towards the end of the article, it suggests doing a scrub and then stopping that scrub immediately.
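For anyone finding this later, the commands boil down to something like this (pool name as in my output above; this is just my reading of that article, so treat it as a sketch rather than an official procedure):
Code:
zpool scrub oracle02       # start a scrub
zpool scrub -s oracle02    # -s stops the scrub that is in progress
zpool status -v oracle02   # check whether the error list is gone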

The net result is that I now have a clean volume. Question is, has anyone else been in this position and doubted the reliability? It is almost too easy!

Code:
...
pool: oracle02
state: ONLINE
  scan: scrub canceled on Mon Apr 29 22:29:15 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        oracle02                                        ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
            gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: No known data errors
oracle#

-paul
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Hi Paul,

I have never seen errors listed as you have. I suspect this could be metadata corruption, or maybe something along the lines of iocage/jails.
I wouldn't trust running a scrub and then cancelling it as a reliable way to suppress the errors.
I would think that as soon as you complete your next scrub, the errors will be flagged again if they do still exist on your pool.
Unless you clear the state of the pool, those errors are supposed to remain present. Otherwise, what is the point of having them in the first place if the next scrub will clear them up?
From experience, I believe ZFS to be reliable.
It is possible for a corrupted file present in a dataset to have its error flagged during a scrub, but I would think destroying the dataset should cause the error to be erased.
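As an aside, the hex pairs are <dataset object ID>:<object number within that dataset>, so something like zdb could tell you which dataset an ID points at, if that dataset still exists (once it has been destroyed the ID can't be resolved to a name, which is when zpool status falls back to printing hex). This is only a rough sketch I have not run against your pool; 0x722 is 1826 in decimal:
Code:
zdb -d oracle02 | grep "ID 1826,"
# on FreeNAS you may need to point zdb at the system's cache file, e.g.:
# zdb -U /data/zfs/zpool.cache -d oracle02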
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
Question is, has anyone else been in this position and doubted the reliability? It is almost too easy!
No, I have been very lucky and I have not had any data errors despite having to replace every drive in my storage multiple times over the last eight years, since 2011.

I would let the scrub run.
 

Chris Moore

Hall of Famer
Joined
May 2, 2015
Messages
10,080
I suspect this could be metadata corruption,
That is what I suspect.
Unless you clear the state of the pool, those errors are supposed to remain present.
I have seen threads where a scrub corrected errors for someone. It depends on the type and cause of the error I suppose. I do agree that stopping the scrub early is not a good answer.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
I have seen threads where a scrub corrected errors for someone. It depends on the type and cause of the error I suppose. I do agree that stopping the scrub early is not a good answer.
I understand that running a scrub will trigger resilvering of a failed block, and the error will be corrected if enough redundancy exists, but this error will not go into the corrupted file list.
I doubt an error flagging a block as causing file corruption can be fixed on its own, even when a scrub is run again, because ZFS would have fixed the issue in the first place.
One scenario I can see where this could potentially work is a mirror or RAIDZ-x pool where all of the redundancy disks have been taken out of the pool (physical disk removal, but the pool still has them defined, so it shows as degraded) and an error then occurs on the pool; that error will be unrecoverable at the time. So running a scrub on the pool in that degraded state will flag the file as unrecoverable.
If the redundancy disks are inserted back, the resilver will proceed, and if the file is unmodified since the disks were removed, the resilver should theoretically be able to reconstruct it, because it would have access to the remaining redundant blocks.
Hence the file is no longer corrupted and no longer needs to be part of the flagged corrupted-file list.
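If someone wanted to see that behaviour for themselves, a rough and untested sketch with throwaway file-backed vdevs would look something like the following; every name here is made up for the example:
Code:
truncate -s 1G /tmp/d1 /tmp/d2
zpool create testpool mirror /tmp/d1 /tmp/d2
cp somefile /testpool/           # data written while both sides of the mirror are present
zpool offline testpool /tmp/d2   # redundancy removed; the pool reports DEGRADED
# ... damage some of the file's blocks on /tmp/d1 here ...
zpool scrub testpool             # no intact copy online, so the file is flagged as a permanent error
zpool online testpool /tmp/d2    # reattach; a resilver runs
zpool scrub testpool             # the untouched copies on /tmp/d2 can now repair /tmp/d1
zpool status -v testpool         # the error list should eventually clear
zpool destroy testpool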
 
Joined
Apr 30, 2016
Messages
12
I suspect this could be metadata corruption, or maybe something along the lines of iocage/jails.
Yes, I thought that too, and several of the entries were files or volume roots (the <0x0> hex code).

I do intend to let a scrub run, but I think the point was that cancelling it forced the tail end of the scrub. Before that, though, I want to see whether the backup process which caused all the problems in that other thread works overnight; it takes quite a while to run a full scrub. I have not rebooted yet and will do that shortly before the backups kick in.

To be perfectly honest, I was surprised to see it come back clean given what I had read elsewhere. I was really just pointing out that article and seeing if anyone had used it previously.

-paul
 
Joined
Apr 30, 2016
Messages
12
Came back clean after a reboot too.

-paul
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Hi Paul,

Reading your first post, I noticed that your pool "oracle02" (not to be mistaken for a dataset) had been removed from the system. I think of this issue as either a hardware fault, or maybe your pool got disconnected through unknown means (power saving, a FreeNAS 11.2 driver issue on the HBA side...).
It doesn't strike me that one of your disks was at fault, but rather the entire pool, hence the low checksum error counts.
I suspect there may be some docker/iocage/SMB issue causing the crash under significant load. The HBA driver could also do that.

From the link to the Unix post, it seems to me the user was dealing with a clone of one of his snapshots, and it is conceivable he did something that would trigger errors if the data that was cloned was already corrupted.
But up to this point, this is just a hypothesis on my part.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Came back clean after a reboot too.

-paul
A reboot doesn't change the state of a failed pool.
If you had a failed pool and attached it to a brand new system without the error having been cleared, it would still appear on the new system, pointing to the corrupted files.

If error reporting were volatile, I would be extremely worried, as it would let corrupted data slip through unnoticed, which would defeat the purpose of ZFS.
But just so you know, I believe running a scrub will bring the old errors back, except in the cases I described previously.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Unless you clear the state of the pool, those errors are supposed to remain present.
A reboot will also clear pool errors. It won't resolve missing/failed devices, and it obviously won't fix corrupted data on disk, but it will (temporarily) get rid of the kinds of error messages OP is seeing. But if the errors are actually there, the next scrub will show them again.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
A reboot will also clear pool errors. It won't resolve missing/failed devices, and it obviously won't fix corrupted data on disk, but it will (temporarily) get rid of the kinds of error messages OP is seeing. But if the errors are actually there, the next scrub will show them again.
I should have written:
"the list of corrupted files is supposed to remain present until "zpool clear tank" is run"
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
"the list of corrupted files are supposed to remain present until the "zpool clear tank" is run"
I don't believe this is correct. I know that other sorts of pool errors (read/write/checksum) are reset on reboot, but I don't recall specifically what happens with file/metadata errors.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
The output literally tells you exactly what to do. Pool is corrupt and needs to be rebuilt.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
The output literally tells you exactly what to do. Pool is corrupt and needs to be rebuilt.
It doesn't say the pool is corrupt. The pool is still in good health, otherwise the content would not still be accessible.
I would not destroy the pool in order to restore from backup. If there are snapshots, then the file in question could be searched for with "zfs diff ...", I think.
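Something along these lines, where the dataset and snapshot names are only placeholders:
Code:
# list files created, modified, renamed or deleted between the snapshot and the live dataset
zfs diff oracle02/mydataset@before oracle02/mydataset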
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It doesn't say the pool is corrupt. The pool is still in good health, otherwise the content would not still be accessible.
I would not destroy the pool in order to restore from backup. If there are snapshots, then the file in question could be searched for with "zfs diff ...", I think.
Well you're wrong. Good luck recovering the pool.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Not my pool.
The result of the scrub should tell us where the OP stands.
 
Joined
Apr 30, 2016
Messages
12
I suspect there may be some docker/iocage/SMB issue causing the crash under significant load. The HBA driver could also do that.
That was exactly it. After I upgraded from 11.1 to 11.2, I found that the pool became unavailable midway through backups being written to it via SMB. That particular pool has a port multiplier involved (I know, I know). It had worked for me for years, but caused problems at 11.2. Now that I have reverted to 11.1, the backups no longer fail under load: 700GB was written overnight without issue. The pool still appears clean.

Possibly a saving grace is that almost all the permanent errors were on datasets which no longer exist (oracle02/iocage, oracle02/.system/samba4, oracle02/tmp/media, etc.) and several files in /var/db/system/... which also no longer exist, as the OS has been reinstalled.

A scrub is now underway.

-paul
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
A scrub is now underway
Let's wait for... Anyway, I've seen a case on our forums where a pool had metadata corruption; IIRC things seemed to go well for a while, but later the pool kept being unstable and finally had to be re-created anyway...

Maybe you will at least be able to extract some files if you do need to destroy the pool... Do you have a backup?

Sent from my phone
 
Joined
Apr 30, 2016
Messages
12
The scrub result is positive.

Code:
  pool: oracle02
 state: ONLINE
  scan: scrub repaired 0 in 1 days 14:47:23 with 0 errors on Wed May  1 22:30:09 2019
config:

    NAME                                            STATE     READ WRITE CKSUM
    oracle02                                        ONLINE       0     0     0
      raidz2-0                                      ONLINE       0     0     0
        gptid/74ea4d6e-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/762f22f2-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/770f7817-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/77f3f862-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0
        gptid/78e9f9f4-ffe3-11e8-aec2-941882388da4  ONLINE       0     0     0

errors: No known data errors


As to backups, I do have the most important files on a separate NAS drive and now on the second pool, but I would lose some files I would rather not lose if I lost the whole pool.

The intention was to have a daily replication task to back up the main pool. I will have to set that up again now that I have gone back to 11.1.
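For the record, the manual equivalent of what I plan to set up again is roughly the following; the snapshot and target names are placeholders, and in FreeNAS this would normally be a periodic snapshot task plus a replication task in the GUI rather than commands run by hand:
Code:
zfs snapshot -r oracle02@backup-20190502
zfs send -R oracle02@backup-20190502 | ssh backupnas zfs receive -u backuppool/oracle02
# later runs would use incremental sends: zfs send -R -i <previous-snapshot> <new-snapshot>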

-paul
 

pro lamer

Guru
Joined
Feb 16, 2018
Messages
626
<0x722>:<0x0> <0x78c>:<0x0> <0x2b9>:<0x0> <0x2b9>:<0x572> <0x6c6>:<0x15> <0x6c6>:<0x32> <0x6cc>:<0x0> <0x6cc>:<0xe>
I wonder why these errors have disappeared?

Sent from my phone
 