Errors while scrubbing a raidz1 vdev (disk failed while replacing a failed disk)

phretor · Jan 1, 2013

Hello.

In my FreeNAS 8.2 box I created a raidz1 vdev of 4x2TB disks (the vdev has been working for about 2 years). The system has 4GB of RAM (an upgrade to 16GB is planned). In the last year, I replaced 3 out of 4 disks (SMART errors). The system is scheduled to scrub the vdev weekly. I never encountered any data errors and so far I am quite happy with zfs.

Today, I stumbled upon the first occurrence of data errors. As this is my first experience with zfs, I need some feedback on how to handle them.

Recently, the fourth disk developed some bad sectors, so I decided to replace it. While resilvering the vdev (with the new, fourth disk in), several data errors were reported, and both disks 3 and 4 showed several thousand checksum errors.

After a second scrub, the situation is as follows:

Code:

[phretor@opentank ~]$ zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  raidz1                                        ONLINE       0     0     0
	    gpt/disk0                                   ONLINE       0     0     0
	    ada1p2                                      ONLINE       0     0     0
	    ada2p2                                      ONLINE       0     0     7
	    gptid/f70f9292-4f60-11e2-9b34-00270e2f08e1  ONLINE       0     0     0

errors: 494390 data errors, use '-v' for a list
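
With a longer member list it can help to pull out only the devices whose counters are nonzero. The following is a minimal sketch, assuming the `zpool status` text has been saved to a file first; the function name `flag_devices` and the file name in the usage comment are made up for illustration. It only parses text and never touches the pool.

```shell
#!/bin/sh
# List pool members whose READ, WRITE or CKSUM counter is nonzero.
# Reads `zpool status` text on stdin (hypothetical helper, not a ZFS tool).
flag_devices() {
    awk '
        # The config table starts at its header row.
        $1 == "NAME" && $2 == "STATE" { in_table = 1; next }
        # The "errors:" summary line ends the table.
        /^errors:/ { in_table = 0 }
        # Device rows carry a name, a state and three counters.
        in_table && NF >= 5 && ($3 != 0 || $4 != 0 || $5 != 0) {
            printf "%s read=%s write=%s cksum=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (saved_status.txt is an assumed file name):
#   zpool status tank > saved_status.txt
#   flag_devices < saved_status.txt
```

Run against the output above, this should report only `ada2p2`, the one member with a nonzero CKSUM column.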

Unfortunately, `zpool status -v` hangs the system (maybe too many files). Meanwhile, I ordered a replacement for the third drive, because I discovered that it was failing too. SMART errors here: http://pastebin.com/XmCBh4E6

So, to conclude, I've got some questions:

1. Are these data errors a sign of actual damage or data loss, or are they recoverable somehow?
2. While waiting for the replacement disk, is there something I can do to reduce the risk of losing the data completely?
3. Is there a way to enumerate the files affected by data errors (so that I can see if I have a spare copy of these files on some other machines), without hanging the system?

Last, I am aware that WD green disks are not the best choice for a NAS. I learned this the hard way. I am planning to migrate to a better hardware configuration, but first I need to take care of these issues.

Thanks in advance for any feedback.

ripkurrle · Jan 3, 2013

I'm new to zfs troubleshooting, but you might want to try 'zdb -l /dev/ada0p2' on the disks to see if anything comes up. Apparently it's the zfs debugger.

http://docs.oracle.com/cd/E23823_01/html/816-5166/zdb-1m.html

paleoN · Jan 3, 2013

phretor said:
Recently, the fourth disk developed some bad sectors, so I decided to replace it. While resilvering the vdev (with the new, fourth disk in), several data errors were reported, and both disks 3 and 4 showed several thousand checksum errors.

If I understand correctly, you have a raidz1 array with two failing disks at the same time.

phretor said:
So, to conclude, I've got some questions:

1. Are these data errors a sign of actual damage or data loss, or are they recoverable somehow?
2. While waiting for the replacement disk, is there something I can do to reduce the risk of losing the data completely?
3. Is there a way to enumerate the files affected by data errors (so that I can see if I have a spare copy of these files on some other machines), without hanging the system?

1. Actual data loss. You have a single-parity array and consequently are only protected against a single disk failure. Restore from backup.
2. It appears to be too late. Do not use the pool except to copy as much data off as you can.
3. You could try to offline ada2p2, if that's what it's stuck waiting on.
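
For the third question, one thing that might be worth trying is sending the `zpool status -v` output to a file in the background and reading the file list from there afterwards. This is only a rough sketch: whether it avoids the hang on this system is unknown, the helper name `list_damaged` and the file paths are invented for illustration, and it assumes the list is introduced by the usual "Permanent errors have been detected in the following files:" line.

```shell
#!/bin/sh
# Print the entries listed under "Permanent errors" in saved
# `zpool status -v` text read from stdin (hypothetical helper).
list_damaged() {
    awk '
        # Everything after this heading is the list of damaged entries.
        /Permanent errors have been detected/ { in_list = 1; next }
        in_list && NF > 0 {
            sub(/^[ \t]+/, "")   # drop the leading indentation
            print                # keep the rest verbatim, spaces included
        }
    '
}

# Usage (paths are assumptions; keep the output off the damaged pool):
#   zpool status -v tank > /var/tmp/tank-status.txt 2>&1 &
#   list_damaged < /var/tmp/tank-status.txt | sort > /var/tmp/damaged.txt
```

The resulting list can then be compared against copies on other machines. Entries that show up as `dataset:<0x...>` rather than a path refer to objects ZFS could not map back to a file name.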

phretor said:
The system is scheduled to scrub the vdev weekly.

IMO, this is a bit excessive.

ripkurrle said:
I'm new to zfs troubleshooting but you might want to try 'zdb -l /dev/ada0p2' on the disks to see if anything comes up

Do you even know what that command does? I don't see it being useful here.
