Errors while scrubbing a raidz1 vdev (disk failed while replacing a failed disk)

Status
Not open for further replies.

phretor

Cadet
Joined
Dec 26, 2012
Messages
1
Hello.

In my FreeNAS 8.2 box I created a raidz1 vdev of 4x2TB disks, which has been running for about 2 years. The system has 4GB of RAM (an upgrade to 16GB is planned). In the last year, I replaced 3 out of 4 disks because of SMART errors. The system is scheduled to scrub the vdev weekly. Until now I had never encountered any data errors, and so far I am quite happy with zfs.

Today, I stumbled upon the first occurrence of data errors. As this is my first experience with zfs, I need some feedback on how to handle them.

Recently, the fourth disk developed some bad sectors, so I decided to replace it. While resilvering the vdev (with the new, fourth disk in), several data errors were reported, and both disks 3 and 4 showed several thousand checksum errors.

After a second scrub, the situation is as follows:

Code:
[phretor@opentank ~]$ zpool status
  pool: tank
 state: ONLINE
status: One or more devices has experienced an error resulting in data
	corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
	entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: none requested
config:

	NAME                                            STATE     READ WRITE CKSUM
	tank                                            ONLINE       0     0     0
	  raidz1                                        ONLINE       0     0     0
	    gpt/disk0                                   ONLINE       0     0     0
	    ada1p2                                      ONLINE       0     0     0
	    ada2p2                                      ONLINE       0     0     7
	    gptid/f70f9292-4f60-11e2-9b34-00270e2f08e1  ONLINE       0     0     0

errors: 494390 data errors, use '-v' for a list

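For readers comparing notes, here is a minimal sketch of how the non-verbose status output above could be filtered to show only the devices with non-zero error counters. The function name `list_failing_devices` is made up for illustration, and the parsing assumes the column layout shown in the output above (NAME, STATE, READ, WRITE, CKSUM); other ZFS versions may print it differently.

```shell
#!/bin/sh
# Hypothetical helper: reads plain `zpool status` text on stdin and prints the
# names of devices whose READ, WRITE or CKSUM counter is non-zero.
# Assumes the five-column config table layout shown in the output above.
list_failing_devices() {
    awk '
        # The header row marks the start of the config table.
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }
        # A blank line ends the table.
        in_config && NF == 0 { in_config = 0 }
        # Inside the table, print any row with a non-zero error counter.
        in_config && NF >= 5 && ($3 + $4 + $5) > 0 { print $1 }
    '
}
```

It would be used as `zpool status tank | list_failing_devices`. It only reads the device counters, so it says nothing about which files are affected.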

Unfortunately, `zpool status -v` hangs the system (maybe too many files). Meanwhile, I ordered a replacement for the third drive, because I discovered that it was failing too. SMART errors here: http://pastebin.com/XmCBh4E6

So, to conclude, I've got some questions:

  • are these data errors a sign of actual damage or data loss, or are they recoverable somehow?
  • while waiting for the replacement disk, is there something that I can do to reduce the risk of losing the data completely?
  • is there a way to enumerate the files affected by data errors (so that I can see if I have a spare copy of these files on some other machines), without hanging the system?

Lastly, I am aware that WD Green disks are not the best choice for a NAS. I learned this the hard way. I am planning to migrate to a better hardware configuration, but first I need to take care of these issues.

Thanks in advance for any feedback.
 

paleoN

Wizard
Joined
Apr 22, 2012
Messages
1,403
Recently, the fourth disk developed some bad sectors, so I decided to replace it. While resilvering the vdev (with the new, fourth disk in), several data errors were reported, and both disks 3 and 4 showed several thousand checksum errors.
If I understand correctly, you have a raidz1 array with two failing disks at the same time.

So, to conclude, I've got some questions:

  • are these data errors a sign of actual damage or data loss, or are they recoverable somehow?
  • while waiting for the replacement disk, is there something that I can do to reduce the risk of losing the data completely?
  • is there a way to enumerate the files affected by data errors (so that I can see if I have a spare copy of these files on some other machines), without hanging the system?
  • Actual data loss. You have a single parity array and consequently are only protected against a single disk failure. Restore from backup.

  • It appears to be too late. Do not use the pool except to copy as much data off as you can.

  • You could try taking ada2p2 offline, if that's what `zpool status -v` is stuck waiting on.
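As a rough sketch of the last two points, something like the following could be used to take the suspect disk offline and then copy data off. Everything configurable here is an assumption: the pool and disk names come from the status output in the first post, `/mnt/tank/` is the assumed mount point, `/mnt/rescue/` is a hypothetical destination on a separate healthy disk, and `run` and `rescue_plan` are names made up for this example. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: adjust POOL, BAD_DISK, SRC and DEST before using any of this.
POOL="${POOL:-tank}"            # pool name from the status output
BAD_DISK="${BAD_DISK:-ada2p2}"  # disk showing checksum errors
SRC="${SRC:-/mnt/tank/}"        # assumed mount point of the pool
DEST="${DEST:-/mnt/rescue/}"    # hypothetical target on a different disk
DRY_RUN="${DRY_RUN:-1}"         # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

rescue_plan() {
    # Take the failing disk offline, as suggested above.
    run zpool offline "$POOL" "$BAD_DISK"
    # Copy everything that is still readable to the rescue location.
    run rsync -a "$SRC" "$DEST"
}
```

Source the file and call `rescue_plan` to review the commands, then repeat with `DRY_RUN=0` once they look right. Whether taking the disk offline helps at all depends on what the system is actually hanging on, so treat that step as optional.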

The system is scheduled to scrub the vdev weekly.
IMO, this is a bit excessive.



I'm new to zfs troubleshooting but you might want to try 'zdb -l /dev/ada0p2' on the disks to see if anything comes up
Do you even know what that command does? I don't see it being useful here.
 