Ok, I was curious and decided to play around a bit.
First, I made some zero-filled files to emulate the disk drives, then assembled some of them into two RAIDZ1 vdevs in a pool called "scratchpool":
Code:
# for device in {00..20}; do dd if=/dev/zero bs=1M count=100 of="./${device}.img"; done
# zpool create scratchpool \
raidz1 "/root/zfs-sandbox/00.img" "/root/zfs-sandbox/01.img" "/root/zfs-sandbox/02.img" "/root/zfs-sandbox/03.img" "/root/zfs-sandbox/04.img" \
raidz1 "/root/zfs-sandbox/05.img" "/root/zfs-sandbox/06.img" "/root/zfs-sandbox/07.img" "/root/zfs-sandbox/08.img" "/root/zfs-sandbox/09.img"
Then I filled up that pool with files of various sizes and random data.
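For completeness, filling the pool looked roughly like this. The file names and sizes here are invented, and I'm writing into a local stand-in directory instead of the pool mountpoint so the sketch runs anywhere:

```shell
# Stand-in for the pool mountpoint (would be e.g. /scratchpool in the real test)
TARGET=./scratchpool-demo
mkdir -p "$TARGET/dir1"

for i in {00..19}; do
  # each file gets a random size between 1 KiB and 512 KiB of random data
  dd if=/dev/urandom bs=1K count=$(( RANDOM % 512 + 1 )) \
     of="$TARGET/dir1/$i.random" 2>/dev/null
done
```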
After that, as a first test, I overwrote 10 KiB in the middle of one device and scrubbed the pool to check that scrubbing works as intended:
Code:
# dd if=/dev/zero bs=1K count=10 seek=51200 conv=notrunc of=04.img
# zpool scrub scratchpool
# zpool status scratchpool
  pool: scratchpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 32K in 0h0m with 0 errors on Sat Jan 23 13:57:26 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            /root/zfs-sandbox/00.img  ONLINE       0     0     0
            /root/zfs-sandbox/01.img  ONLINE       0     0     0
            /root/zfs-sandbox/02.img  ONLINE       0     0     0
            /root/zfs-sandbox/03.img  ONLINE       0     0     0
            /root/zfs-sandbox/04.img  ONLINE       0     0     1
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors
So far so good, I think. I also tested nuking an entire device and replacing it:
Code:
# dd if=/dev/zero bs=1M count=100 of=03.img
# zpool scrub scratchpool
# zpool status scratchpool
  pool: scratchpool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jan 23 13:48:26 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   DEGRADED     0     0     0
          raidz1-0                    DEGRADED     0     0     0
            /root/zfs-sandbox/00.img  ONLINE       0     0     0
            /root/zfs-sandbox/01.img  ONLINE       0     0     0
            /root/zfs-sandbox/02.img  ONLINE       0     0     0
            /root/zfs-sandbox/03.img  UNAVAIL      0     0     0  corrupted data
            /root/zfs-sandbox/04.img  ONLINE       0     0     0
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors
Code:
# zpool replace scratchpool /root/zfs-sandbox/03.img /root/zfs-sandbox/13.img
# zpool status scratchpool
  pool: scratchpool
 state: ONLINE
  scan: resilvered 80.8M in 0h0m with 0 errors on Sat Jan 23 13:48:59 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            /root/zfs-sandbox/00.img  ONLINE       0     0     0
            /root/zfs-sandbox/01.img  ONLINE       0     0     0
            /root/zfs-sandbox/02.img  ONLINE       0     0     0
            /root/zfs-sandbox/13.img  ONLINE       0     0     0
            /root/zfs-sandbox/04.img  ONLINE       0     0     0
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors
Then came the main test: I corrupted 10 KiB right in the middle of one device to emulate a URE, completely zeroed out another to emulate a full drive failure, and started a replacement:
Code:
# dd if=/dev/zero bs=1K count=10 seek=51200 conv=notrunc of=07.img
# dd if=/dev/zero bs=1M count=100 of=08.img
# zpool replace scratchpool /root/zfs-sandbox/08.img /root/zfs-sandbox/14.img
And now the pool looks like this:
Code:
# zpool status -v scratchpool
  pool: scratchpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 81.3M in 0h0m with 1 errors on Sat Jan 23 14:06:29 2016
config:

        NAME                            STATE     READ WRITE CKSUM
        scratchpool                     ONLINE       0     0     2
          raidz1-0                      ONLINE       0     0     0
            /root/zfs-sandbox/00.img    ONLINE       0     0     0
            /root/zfs-sandbox/01.img    ONLINE       0     0     0
            /root/zfs-sandbox/02.img    ONLINE       0     0     0
            /root/zfs-sandbox/13.img    ONLINE       0     0     0
            /root/zfs-sandbox/04.img    ONLINE       0     0     0
          raidz1-1                      ONLINE       0     0     4
            /root/zfs-sandbox/05.img    ONLINE       0     0     0
            /root/zfs-sandbox/06.img    ONLINE       0     0     0
            /root/zfs-sandbox/07.img    ONLINE       0     0     0
            replacing-3                 UNAVAIL      0     0     0
              /root/zfs-sandbox/08.img  UNAVAIL      0     0     0  corrupted data
              /root/zfs-sandbox/14.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /root/zfs-sandbox/scratchpool/dir1/88.random
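Since the status output names the damaged file, recovery would just be restoring that one file and clearing the error state. A rough sketch, assuming a backup exists at a hypothetical /backup path (guarded so it does nothing on a machine without ZFS):

```shell
BACKUP=/backup/dir1/88.random                         # hypothetical backup copy
DAMAGED=/root/zfs-sandbox/scratchpool/dir1/88.random  # file flagged by zpool status

if command -v zpool >/dev/null 2>&1 && [ -e "$BACKUP" ]; then
  cp "$BACKUP" "$DAMAGED"    # overwrite the corrupted file with the good copy
  zpool clear scratchpool    # reset the error counters
  zpool scrub scratchpool    # re-verify; status should come back clean
else
  echo "sketch only: zpool or the backup copy is not present here"
fi
```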
One weird thing: I had also manually checksummed each file before corrupting the whole shebang, and the checksum for the file which ZFS says is irrecoverably corrupted still verifies as good. I'm not quite sure why; my best guess is that the read was served from the ARC (the in-memory cache) rather than from disk. Very curious, methinks.
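For reference, the manual checksum comparison went roughly like this (the directory layout here is a stand-in, not the actual pool contents):

```shell
# Stand-in for the pool's contents; in the real test this would be the
# pool mountpoint with the actual random files.
mkdir -p ./scratchpool-demo/dir1
dd if=/dev/urandom bs=1K count=64 of=./scratchpool-demo/dir1/88.random 2>/dev/null

# Before corrupting anything: record a checksum manifest.
find ./scratchpool-demo -type f -name '*.random' -exec sha256sum {} + > ./manifest.sha256

# After the experiments: verify every file against the manifest.
# Any file that no longer matches is reported as FAILED.
sha256sum -c ./manifest.sha256
```

One caveat with this approach: a re-read straight after the corruption may be served from the ARC, so exporting and re-importing the pool (or rebooting) before re-checksumming would force the data to actually come off disk.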
Now, as @leadeater said, if you are diligent about scrubbing your pool, the chances of this happening can be vastly reduced; had I scrubbed before the drive failure, the URE would have been caught and repaired while full redundancy was still available. What I'm not 100% sure about is how a conventional RAID would handle this scenario, as I don't have much experience on that front. I have read that the entire rebuild might fail, whereas with ZFS I can still access most of my data and rely on it not being silently corrupted: the resilver completes as far as it can and produces a list of the files that couldn't be recovered, so I can restore those from a backup, or at least know not to trust them. But leadeater might know more about how a conventional RAID would handle such a failure.
Personally, I would say that UREs are probably a lesser concern as long as you scrub diligently. I would be more worried about another drive failing entirely while the pool is rebuilding, especially in a big pool where a resilver might take several days.
Side note: if anyone spots any flaws in my methodology or has suggestions for other failure modes to test, feel free to mention them. My experience is mostly based on using ZFS in a home environment for close to three years rather than in a professional setting, so it's conceivable that I've overlooked something.