One or more devices has experienced an error resulting in data corruption.

Status
Not open for further replies.

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
I'm currently trying to get to the bottom of this issue. I'm using an ASRock C2750D4I and 32GB of ECC RAM. This system has been running for ~2 years without issue.

I had a power failure back on April 13th that lasted several hours -- my NAS did not power off before the UPS died. I didn't get around to powering it back on until several days later, on the 17th, at which point a scrub started and CKSUM errors were ultimately listed on my pool and raidz2 vdev.

Is there any situation related to a power failure that could conceivably result in this happening? SMART data for my drives looks good (I can post it if that would be helpful).

Code:
[root@athena] /var/log# zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub in progress since Tue Apr 19 18:00:35 2016
        973G scanned out of 12.2T at 356M/s, 9h11m to go
        0 repaired, 7.79% done
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     2
      raidz2-0                                      ONLINE       0     0     4
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/Movies/foo1
        /mnt/Storage/Media/Video/Movies/foo2
 

dlavigne

Guest
Did the scrub finish? If so, does zpool status look any better?
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
After that second scrub finished:

Code:
[root@athena] ~# zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 0 in 8h0m with 1 errors on Wed Apr 20 02:00:41 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     1
      raidz2-0                                      ONLINE       0     0     2
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/Movies/foo3


Since I have backups of all the data I care about, I decided to run another scrub and see what happened...

Code:
zpool status -v
  pool: Storage
state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://illumos.org/msg/ZFS-8000-8A
  scan: scrub repaired 288K in 7h48m with 1 errors on Wed Apr 20 15:26:00 2016
config:

    NAME                                            STATE     READ WRITE CKSUM
    Storage                                         ONLINE       0     0     1
      raidz2-0                                      ONLINE       0     0     2
        gptid/ad20cddd-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     5
        gptid/ad8e11d8-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     1
        gptid/adfad154-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/ae652233-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     1
        gptid/aecc5b01-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     0
        gptid/af388815-437f-11e3-95c5-bc5ff4c941e8  ONLINE       0     0     2
    cache
      ada1p1                                        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /mnt/Storage/Media/Video/TV Shows/foo4


I think this points to a failing controller -- I haven't verified that these four drives are on the same controller, but I would be willing to bet that they are. I know that some people have reported issues with the Marvell controller on this board, so perhaps that is my issue. Very odd that it suddenly appeared after a power failure.
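One way to check that bet, sketched below for FreeBSD/FreeNAS: `glabel status` maps each gptid label from the zpool output to a device node (e.g. `adaNpM`), and `camcontrol devlist` shows which bus each device hangs off, so you can see whether the drives with CKSUM errors share a controller. The commands are FreeBSD-only, so this sketch just prints a note elsewhere.

```shell
# Map gptid labels to device nodes, then list devices per bus/controller.
# glabel and camcontrol exist only on FreeBSD; guard so this degrades cleanly.
if command -v glabel >/dev/null 2>&1 && command -v camcontrol >/dev/null 2>&1; then
    glabel status | grep gptid   # e.g. "gptid/ad20cddd-...  N/A  ada2p2"
    camcontrol devlist           # devices listed with their scbus numbers
    result="listed"
else
    result="skipped (glabel/camcontrol are FreeBSD-only)"
fi
echo "$result"
```

Cross-referencing the scbus numbers against `pciconf -lv` output would then tell you whether those buses belong to the Intel or the Marvell controller.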
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
Have you checked the SMART status of your drives?
No issues with the drives.

Is there any way to check which file(s) were affected by the last scrub? It notes that 288K was repaired, but I have no way of knowing which files. Does it even matter? Can any of the data on this vdev be trusted? I have backups of the important data, but they will be time-consuming to retrieve and assimilate back into the previous structure (mainly photos).
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
If it was repaired then it's not corrupted, otherwise it couldn't have been repaired :)
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
If it was repaired then it's not corrupted, otherwise it couldn't have been repaired :)
I don't think this is a safe assumption to make, is it?

If the controller is failing (which it certainly appears to be), it would be entirely possible for the repair process to come to the wrong consensus, or, even if the right consensus is reached, for bad data to get written to multiple drives.

I have the system powered off and I will not power it back up until I have a replacement board. I'm basically trying to establish just how much I can trust this data once I have the system back up and running.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
If it scrubs and says there are no errors, then there are no errors.

If it scrubs and says a file is damaged, then that file will give an error when you try to read it.

If a file is reported as damaged and then goes back to OK, it was never damaged. The errors are happening in your hardware during the scrub read, but they don't exist on the disks. When it 'corrects', it is probably just overwriting good data with good data.

ZFS is virtually never going to write bad data over good data. (Emphasis on ZFS.) In the worst case, it can't get the data to match the checksum, so it declares it bad for that scrub and writes nothing.
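A minimal sketch of that verification step: ZFS stores each block's checksum in its parent block pointer, so corrupted data cannot vouch for itself on read-back. Here `sha256sum` stands in for the pool's checksum algorithm (real pools default to fletcher4), and the "block" and the corruption are made up for illustration.

```shell
# Stand-in for ZFS read verification: the checksum of a block lives in its
# parent, so a corrupted read-back fails the comparison.
blk=$(mktemp)
printf 'good data' > "$blk"
stored=$(sha256sum "$blk" | awk '{print $1}')   # checksum kept in the parent block
printf 'go0d data' > "$blk"                     # simulate a corrupted read-back
readback=$(sha256sum "$blk" | awk '{print $1}')
if [ "$readback" = "$stored" ]; then
    echo "verification passed"
else
    echo "corruption detected"                  # ZFS would now try a reconstruction
fi
rm -f "$blk"
```

Only a reconstruction that actually matches the stored checksum gets written back; nothing that fails the comparison is ever accepted as "repaired" data.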
 

kelchm

Dabbler
Joined
Dec 19, 2013
Messages
12
If a file is reported as damaged and then goes back to OK, it was never damaged. The errors are happening in your hardware during the scrub read, but they don't exist on the disks. When it 'corrects', it is probably just overwriting good data with good data.
Okay, but to play devil's advocate here -- certainly the situation could arise where two out of three reads happen to have the same bit flipped. In that situation, wouldn't bad data end up being written to the drive?

I know it's unlikely, but it's possible isn't it?
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
ZFS is virtually never going to write bad data over good data.

Except with bad non-ECC RAM. But then it's not the fault of ZFS anymore anyway.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
Except with bad non-ECC RAM.
...and even then, your bad read (due to bad RAM) would have to produce a result that matched an also-bad checksum before ZFS would decide that the incorrect data is correct and write it somewhere else.
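For a rough sense of how unlikely that coincidence is: assuming a 256-bit checksum (both fletcher4 and sha256 emit 256 bits), a random corruption matches a given stored checksum with probability about 2^-256, which `awk` can put an order of magnitude on:

```shell
# Probability that a randomly corrupted block matches a fixed 256-bit checksum.
awk 'BEGIN { printf "p ~ %.3g\n", 2^-256 }'
# prints: p ~ 8.64e-78
```

With fletcher4 the failure modes are not uniformly random, so this is only an order-of-magnitude argument, but the conclusion stands: the bad data and the bad checksum agreeing by accident is not a realistic event.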
 