Plex suddenly stopped working

djdwosk97 · Jan 22, 2016

pirateghost said:
He was actually asking why you don't have your vdevs in a pool. You're kind of fighting one of the great things about zfs. Pooled storage

Like I said towards the beginning of this thread, I didn't really know what I was doing when I initially set it up (and still don't really).

Is there any real advantage towards doing what you and @SweetAndLow are describing?

pirateghost · Jan 22, 2016

djdwosk97 said:
Like I said towards the beginning of this thread, I didn't really know what I was doing when I initially set it up (and still don't really).

Is there any real advantage towards doing what you and @SweetAndLow are describing?

Pooled storage is the advantage.

When you share out your files, its all centrally located. You don't have to segregate files or try to decide what pool what data goes on.

mattbbpl · Jan 22, 2016

djdwosk97 said:
You mean why aren't the 4tb drives in some kind of array? I didn't like likelihood of a RaidZ1 array failing upon rebuilding and I didn't want to buy a fourth drive for RAIDZ2.

Interesting tradeoff. So in the event of a single drive failure, you've traded the possibility of losing all of your data for the certainty of losing a quarter of your data. I guess that could make sense in some use cases/circumstances. If the data is at all important, I'd recommend ensuring you have backups in such a way that the data on the failed disk can be easily identified, separated, and restored without having to touch the data on the other three disks. I'm assuming you know that you also sacrificed some of the strong suits of ZFS (such as self-correcting scrubs) and have compensated for that in your use cases/practices.

djdwosk97 · Jan 22, 2016

mattbbpl said:
Interesting tradeoff. So in the event of a single drive failure, you've traded the possibility of losing all of your data for the certainty of losing a quarter of your data. I guess that could make sense in some use cases/circumstances. If the data is at all important, I'd recommend ensuring you have backups in such a way that the data on the failed disk can be easily identified, separated, and restored without having to touch the data on the other three disks. I'm assuming you know that you also sacrificed some of the strong suits of ZFS (such as self-correcting scrubs) and have compensated for that in your use cases/practices.

The data isn't super important -- if it was I would've bought a fourth 4tb drive in put them in RAIDZ2.

But I figured it's better to lose 4tb of data (and have a total of 12tb total storage) than 8tb (and only 8tb storage) seeing as RaidZ1 doesn't give me all that much better reliability due to the likelihood of a URE when rebuilding.

Joshua Parker Ruehlig · Jan 22, 2016

djdwosk97 said:
The data isn't super important -- if it was I would've bought a fourth 4tb drive in put them in RAIDZ2.

But I figured it's better to lose 4tb of data (and have a total of 12tb total storage) than 8tb (and only 8tb storage) seeing as RaidZ1 doesn't give me all that much better reliability due to the likelihood of a URE when rebuilding.

a URE during the rebuilding of raidz would result in a single corrupt file. the same URE ALWAYS results in a corrupt file in your current setup.
but with a functioning raidz (or any type of zfs redundancy) wouldn't result in any loss.

djdwosk97 · Jan 22, 2016

Joshua Parker Ruehlig said:
a URE during the rebuilding of raidz would result in a single corrupt file. the same URE ALWAYS results in a corrupt file in your current setup.
but with a functioning raidz (or any type of zfs redundancy) wouldn't result in any loss.

I was under the impression from multiple sources that a URE would result in an array failing to rebuild.

Joshua Parker Ruehlig · Jan 22, 2016

djdwosk97 said:
I was under the impression from multiple sources that a URE would result in an array failing to rebuild.

I think some hardware raid solutions have that scary possibility but not with zfs.

djdwosk97 · Jan 22, 2016

Joshua Parker Ruehlig said:
I think some hardware raid solutions have that scary possibility but not with zfs.

Do you have a source to back that up? I've found very conflicting answers I the Internet.

The feature that makes ZFS less vulnerable is the pool scrubbing which will find UREs and work around it (re-write the data somewhere else so that the next rebuild/resilver won't fail because of it). If a URe happens after the last scrub, but before the rebuild happens, then you are still affected.

Joshua Parker Ruehlig · Jan 22, 2016

djdwosk97 said:
Do you have a source to back that up? I've found very conflicting answers I the Internet.

personal experience*, but have you seen someone complaining of their raidz failing from a single disk+URE?

* I've built a 5 disk faulted raidz before for a system where I was still waiting to buy the last disk.

djdwosk97 · Jan 22, 2016

Joshua Parker Ruehlig said:
personal experience*, but have you seen someone complaining of their raidz failing from a single disk+URE?

* I've built a 5 disk faulted raidz before for a system where I was still waiting to buy the last disk.

No offense but I wouldn't put much faith in personal experience (especially on such a small scale) due to the large number of outside factors.

Joshua Parker Ruehlig · Jan 22, 2016

djdwosk97 said:
No offense but I wouldn't put much faith in personal experience (especially on such a small scale) due to the large number of outside factors.

you can test it out yourself. create a zpool with files/partition vdevs. write some files to it. delete one of vdevs. force write to one of the other vdevs with dd.
run a scrub and it'll complain of the corrupted file(s) but I doubt the whole pool will be faulted from a small write to one of the vdevs.

djdwosk97 · Jan 22, 2016

Joshua Parker Ruehlig said:
you can test it out yourself. create a zpool with files/partition vdevs. write some files to it. delete one of vdevs. force write to one of the other vdevs with dd.
run a scrub and it'll complain of the corrupted file(s) but I doubt the whole pool will be faulted from a small write to one of the vdevs.

But that's not the same as running into a URE during a rebuild...is it?

Joshua Parker Ruehlig · Jan 22, 2016

djdwosk97 said:
But that's not the same as running into a URE during a rebuild...is it?

hmm, it puts zfs in the same situation of not having enough data to reconstruct a single file. but I'm not 100% sure if a URE is worse than bitrot.

djdwosk97 · Jan 22, 2016

Joshua Parker Ruehlig said:
hmm, it puts zfs in the same situation of not having enough data to reconstruct a single file. but I'm not 100% sure if a URE is worse than bitrot.

Alright, well for now it's not really a concern. I already need 12tb of storage (and that's still growing) and I can't really afford to buy more drives.

Joshua Parker Ruehlig · Jan 22, 2016

here's another person saying the same thing as I, not sure his knowledge on the topic is better than mine though
http://forums.nas4free.org/viewtopic.php?f=66&t=8052

djdwosk97 · Jan 23, 2016

Joshua Parker Ruehlig said:
here's another person saying the same thing as I, not sure his knowledge on the topic is better than mine though
http://forums.nas4free.org/viewtopic.php?f=66&t=8052

Some more anecdotal evidence:

Ok, I was curious and decided to play around a bit.

First, I made some zero-filled files for emulating the disk drives, then assembled two RAIDZ1 devices out of some of them into a pool called "scratchpool":

Code:

# for device in {00..20};do;dd if=/dev/zero bs=1M count=100 of="./$device.img";done
# zpool create scratchpool \
     raidz1 "/root/zfs-sandbox/00.img" "/root/zfs-sandbox/01.img" "/root/zfs-sandbox/02.img" "/root/zfs-sandbox/03.img" "/root/zfs-sandbox/04.img" \
     raidz1 "/root/zfs-sandbox/05.img" "/root/zfs-sandbox/06.img" "/root/zfs-sandbox/07.img" "/root/zfs-sandbox/08.img" "/root/zfs-sandbox/09.img"

Then I filled up that pool with files of various sizes and random data.

After that, to start out I corrupted a few bits in the middle of one device and scrubbed the pool to see that scrubbing is working as intended:

Code:

# dd if=/dev/zero bs=1K count=10 seek=51200 conv=notrunc of=04.img
# zpool scrub scratchpool
# zpool status scratchpool

pool: scratchpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 32K in 0h0m with 0 errors on Sat Jan 23 13:57:26 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            /root/zfs-sandbox/11.img  ONLINE       0     0     0
            /root/zfs-sandbox/10.img  ONLINE       0     0     0
            /root/zfs-sandbox/12.img  ONLINE       0     0     0
            /root/zfs-sandbox/13.img  ONLINE       0     0     0
            /root/zfs-sandbox/04.img  ONLINE       0     0     1
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors

So far so good, I think. I also tested nuking an entire device and replacing it:

Code:

# dd if=/dev/zero bs=1M count=100 of=03.img
# zpool scrub scratchpool
# zpool status scratchpool
pool: scratchpool
state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-4J
  scan: scrub repaired 0 in 0h0m with 0 errors on Sat Jan 23 13:48:26 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   DEGRADED     0     0     0
          raidz1-0                    DEGRADED     0     0     0
            /root/zfs-sandbox/11.img  ONLINE       0     0     0
            /root/zfs-sandbox/10.img  ONLINE       0     0     0
            /root/zfs-sandbox/12.img  ONLINE       0     0     0
            /root/zfs-sandbox/03.img  UNAVAIL      0     0     0  corrupted data
            /root/zfs-sandbox/04.img  ONLINE       0     0     0
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors

Code:

# zpool replace scratchpool /root/zfs-sandbox/03.img /root/zfs-sandbox/13.img
# zpool status scratchpool
  pool: scratchpool
state: ONLINE
  scan: resilvered 80.8M in 0h0m with 0 errors on Sat Jan 23 13:48:59 2016
config:

        NAME                          STATE     READ WRITE CKSUM
        scratchpool                   ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            /root/zfs-sandbox/11.img  ONLINE       0     0     0
            /root/zfs-sandbox/10.img  ONLINE       0     0     0
            /root/zfs-sandbox/12.img  ONLINE       0     0     0
            /root/zfs-sandbox/13.img  ONLINE       0     0     0
            /root/zfs-sandbox/04.img  ONLINE       0     0     0
          raidz1-1                    ONLINE       0     0     0
            /root/zfs-sandbox/05.img  ONLINE       0     0     0
            /root/zfs-sandbox/06.img  ONLINE       0     0     0
            /root/zfs-sandbox/07.img  ONLINE       0     0     0
            /root/zfs-sandbox/08.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img  ONLINE       0     0     0

errors: No known data errors

Then I nuked one complete device to emulate drive failure, and corrupted a few bits right in the middle of another device, emulating an URE:

Code:

# dd if=/dev/zero bs=1K count=10 seek=51200 conv=notrunc of=07.img
# dd if=/dev/zero bs=1M count=100 of=08.img
# zpool replace scratchpool /root/zfs-sandbox/08.img /root/zfs-sandbox/14.img

And now the pool looks like this:

Code:

# zpool status -v scratchpool

pool: scratchpool
state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 81.3M in 0h0m with 1 errors on Sat Jan 23 14:06:29 2016
config:

        NAME                            STATE     READ WRITE CKSUM
        scratchpool                     ONLINE       0     0     2
          raidz1-0                      ONLINE       0     0     0
            /root/zfs-sandbox/11.img    ONLINE       0     0     0
            /root/zfs-sandbox/10.img    ONLINE       0     0     0
            /root/zfs-sandbox/12.img    ONLINE       0     0     0
            /root/zfs-sandbox/13.img    ONLINE       0     0     0
            /root/zfs-sandbox/04.img    ONLINE       0     0     0
          raidz1-1                      ONLINE       0     0     4
            /root/zfs-sandbox/05.img    ONLINE       0     0     0
            /root/zfs-sandbox/06.img    ONLINE       0     0     0
            /root/zfs-sandbox/07.img    ONLINE       0     0     0
            replacing-3                 UNAVAIL      0     0     0
              /root/zfs-sandbox/08.img  UNAVAIL      0     0     0  corrupted data
              /root/zfs-sandbox/14.img  ONLINE       0     0     0
            /root/zfs-sandbox/09.img    ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /root/zfs-sandbox/scratchpool/dir1/88.random

One weird thing: I also manually checksummed each file before corrupting the whole shebang, and the checksum for the file which ZFS says is irrecoverably corrupted is still verifying as good. Not quite sure why, to be honest. Very curious methinks.

Now, as @[member=leadeater] said, if you are diligent with scrubbing your pool, the chances of this happening can be vastly reduced. If I had scrubbed before the drive failure, I could have recovered from the URE. What I'm not 100% sure about is how a conventional RAID would handle this, as I don't really have much experience on that front. I have read that the entire rebuild might fail in a scenario like this, whereas in ZFS, I can still access most of my data and can rely on that not being corrupted. ZFS will mostly rebuild, and give me a list of files which can't be properly recovered, thus enabling me to restore them from a backup, or if I don't have one, at least not rely on those files still being alright. But leadeater might know more about how a conventional RAID would handle such a failure.

Personally, I would say that UREs are probably of a lesser concern as long as you are diligent with scrubbing. I would be more concerned about another drive failing entirely while your pool is rebuilding, especially if it's a big pool where a resilvering operation might take several days.

Side note: If anyone discovers any flaws in my methodology or has suggestions for testing other sorts of failures, feel free to mention that. My experiences are mostly based from using ZFS in a home environment for close to three years, not from a professional setting, so it's conceivable that I might have overlooked something.

Based on the test I did above, I would expect the failure to be localized in most cases. The failed pool and its files from my example are still accessible, it's just that one file (dir1/88.random) which ZFS says is corrupt. However, if the corruption was not on a file but instead on the file system's metadata, that might be a different story. But I'm not quite sure how to test that, because I'd need to be able to specifically corrupt metadata, and I don't know a way in which I can specifically target that. If anyone knows, I'd be curious to try it though.

Joshua Parker Ruehlig · Jan 23, 2016

djdwosk97 said:
Some more anecdotal evidence:

Good stuff there. As for trying to corrupt metadata, that would be more difficult then corrupting a file. If I recall ZFS keeps multiple copies of metadata on disk.

djdwosk97 · Jan 23, 2016

Joshua Parker Ruehlig said:
Good stuff there. As for trying to corrupt metadata, that would be more difficult then corrupting a file. If I recall ZFS keeps multiple copies of metadata on disk.

Another follow up post/test to the above -- possibly worth reading if you're curious: http://linustechtips.com/main/topic/532609-is-a-ure-an-issue-in-zfs/?p=7076627

Important Announcement for the TrueNAS Community.

Plex suddenly stopped working

djdwosk97

Patron

pirateghost

Unintelligible Geek

mattbbpl

Patron

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Joshua Parker Ruehlig

Hall of Famer

djdwosk97

Patron

Similar threads