One or more devices has experienced an error resulting in data corruption.

Status
Not open for further replies.

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
I'm currently trying to get to the bottom of this issue. I'm using an ASRock C2750D4I and 32GB of ECC RAM. This system has been running for ~2 years without issue.

I had a power failure back on April 13th that lasted several hours -- my NAS did not power off before the UPS died. I didn't get around to powering it back on until several days later, on the 17th, at which point a scrub started and CKSUM errors were ultimately listed on my pool and raidz2 vdev.

Is there any situation related to a power failure that could conceivably result in this happening? SMART data for my drives looks good (I can post it if that would be helpful).

Code:
[root@athena] /var/log# zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Apr 19 18:00:35 2016
        973G scanned out of 12.2T at 356M/s, 9h11m to go
        0 repaired, 7.79% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     2
      raidz2-0                                      ONLINE       0     0     4
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/Movies/foo1
        /mnt/Storage/Media/Video/Movies/foo2
 

dlavigne

Guest
Did the scrub finish? If so, does zpool status look any better?
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
After that second scrub finished:

Code:
[root@athena] ~# zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 8h0m with 1 errors on Wed Apr 20 02:00:41 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     1
      raidz2-0                                      ONLINE       0     0     2
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/Movies/foo3


Since I have backups of all the data I care about, I decided to run another scrub and see what happened...

Code:
zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 288K in 7h48m with 1 errors on Wed Apr 20 15:26:00 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     1
      raidz2-0                                      ONLINE       0     0     2
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     5
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     1
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     1
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     2
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/TV Shows/foo4


I think this points to a failing controller -- I haven't verified that these four drives are on the same controller, but I would be willing to bet that they are. I know that some people have reported issues with the Marvell controller on this board, so perhaps that is my issue. Very odd that it suddenly appeared after a power failure.
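One way to check that bet, sketched below for FreeBSD/FreeNAS: `glabel status` maps each gptid label from the zpool output to a device node (e.g. `adaNpM`), and `camcontrol devlist` shows which bus each device hangs off, so you can see whether the drives with CKSUM errors share a controller. The commands are FreeBSD-only, so this sketch just prints a note elsewhere.

```shell
# Map gptid labels to device nodes, then list devices per bus/controller.
# glabel and camcontrol exist only on FreeBSD; guard so this degrades cleanly.
if command -v glabel >/dev/null 2>&1 && command -v camcontrol >/dev/null 2>&1; then
    glabel status | grep gptid   # e.g. "gptid/ad20cddd-...  N/A  ada2p2"
    camcontrol devlist           # devices listed with their scbus numbers
    result="listed"
else
    result="skipped (glabel/camcontrol are FreeBSD-only)"
fi
echo "$result"
```

Cross-referencing the scbus numbers against `pciconf -lv` output would then tell you whether those buses belong to the Intel or the Marvell controller.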
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
Have you checked the SMART status of your drives?
No issues with the drives.

Is there any way to check which file(s) were affected by the last scrub? It notes that 288K was repaired, but I have no way of knowing which files. Does it even matter? Can any of the data on this vdev be trusted? I have backups of the important data, but they will be time-consuming to retrieve and assimilate back into the previous structure (mainly photos).
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
If it was repaired then it's not corrupted, otherwise it couldn't have been repaired :)
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
If it was repaired then it's not corrupted, otherwise it couldn't have been repaired :)
I don't think this is a safe assumption to make, is it?

If the controller is failing (which it certainly appears to be), it would be entirely possible for the repair process to come to the wrong consensus, or, even if the right consensus is reached, for bad data to get written to multiple drives.

I have the system powered off and I will not power it back up until I have a replacement board. I'm basically trying to establish just how much I can trust this data once I have the system back up and running.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If it scrubs and says there are no errors, then there are no errors.

If it scrubs and says a file is damaged, then that file will give an error when you try to read it.

If a file is reported as damaged and then goes back to OK, it was never damaged. The errors are happening in your hardware during the scrub read, but they don't exist on the disks. When it 'corrects', it is probably just overwriting good data with good data.

ZFS is virtually never going to write bad data over good data. (Emphasis on ZFS.) In the worst case, it can't get the data to match the checksum, so it declares it bad for that scrub and writes nothing.
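A minimal sketch of that verification step: ZFS stores each block's checksum in its parent block pointer, so corrupted data cannot vouch for itself on read-back. Here `sha256sum` stands in for the pool's checksum algorithm (real pools default to fletcher4), and the "block" and the corruption are made up for illustration.

```shell
# Stand-in for ZFS read verification: the checksum of a block lives in its
# parent, so a corrupted read-back fails the comparison.
blk=$(mktemp)
printf 'good data' > "$blk"
stored=$(sha256sum "$blk" | awk '{print $1}')   # checksum kept in the parent block
printf 'go0d data' > "$blk"                     # simulate a corrupted read-back
readback=$(sha256sum "$blk" | awk '{print $1}')
if [ "$readback" = "$stored" ]; then
    echo "verification passed"
else
    echo "corruption detected"                  # ZFS would now try a reconstruction
fi
rm -f "$blk"
```

Only a reconstruction that actually matches the stored checksum gets written back; nothing that fails the comparison is ever accepted as "repaired" data.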
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
If a file is reported as damaged and then goes back to OK, it was never damaged. The errors are happening in your hardware during the scrub read, but they don't exist on the disks. When it 'corrects', it is probably just overwriting good data with good data.
Okay, but to play devil's advocate here -- certainly the situation could arise where two out of three reads happen to have the same bit flipped. In that situation, wouldn't bad data end up being written to the drive?

I know it's unlikely, but it's possible isn't it?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
ZFS is virtually never going to write bad data over good data.

Except with bad non-ECC RAM. But then it's not the fault of ZFS anymore anyway.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Except with bad non-ECC RAM.
...and even then, your bad read (due to bad RAM) would have to produce a result that matched an also-bad checksum before ZFS would decide that the incorrect data is correct and write it somewhere else.
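For a rough sense of how unlikely that coincidence is: assuming a 256-bit checksum (both fletcher4 and sha256 emit 256 bits), a random corruption matches a given stored checksum with probability about 2^-256, which `awk` can put an order of magnitude on:

```shell
# Probability that a randomly corrupted block matches a fixed 256-bit checksum.
awk 'BEGIN { printf "p ~ %.3g\n", 2^-256 }'
# prints: p ~ 8.64e-78
```

With fletcher4 the failure modes are not uniformly random, so this is only an order-of-magnitude argument, but the conclusion stands: the bad data and the bad checksum agreeing by accident is not a realistic event.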
 