Weird ZFS on Linux behavior, scrubs on a striped pool

Arwen · Sep 5, 2023

While responding to another post, I described a similar setup of mine that uses a striped ZFS pool.

Many years ago, around 2015, I built a miniature fanless desktop computer as my media server. The OS is Linux, Gentoo distro, with only the software I needed installed. It has a 1TB mSATA SSD and a 2TB SATA laptop HDD. I configured it to use about 40GB of each storage device for a ZFS Mirrored OS pool, with the rest ZFS Striped into my media pool.

When I originally built it, I did not have automatic ZFS scrubs. Every now and then, probably at intervals of more than a month or two, I'd run a manual ZFS scrub. The OS always seemed fine, but I updated it several times a year, which meant many files were read.

As I expected, the scrubs on the media pool found occasional unrecoverable file errors. I had planned on that happening, so I had multiple backups, both on-line and off-line. So, easy enough to restore. Most were in larger video files, which statistically makes sense, as they would have a higher chance of encountering a bad block.
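The size effect is easy to put rough numbers on. Here is a minimal sketch, assuming every block fails independently with the same small probability. The per-block rate and the 128 KiB block size are made-up illustrative values, not measurements from this pool.

```python
import math


def p_file_damaged(file_bytes: int,
                   block_bytes: int = 128 * 1024,
                   p_block: float = 1e-9) -> float:
    """Chance that at least one block of a file is unreadable.

    Assumes independent, equally likely block failures (a simplification).
    block_bytes and p_block are illustrative assumptions.
    """
    n_blocks = -(-file_bytes // block_bytes)  # ceiling division
    # 1 - (1 - p)^n, computed in a numerically stable way for tiny p
    return -math.expm1(n_blocks * math.log1p(-p_block))


small = p_file_damaged(5 * 1024**2)   # a 5 MiB music file: 40 blocks
large = p_file_damaged(8 * 1024**3)   # an 8 GiB video file: 65536 blocks
ratio = large / small                 # how much more exposed the video is
```

With a failure rate this small, the ratio comes out very close to the ratio of block counts (65536 / 40), so under these assumptions a file's exposure grows nearly in proportion to its size.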

But, the weird thing is, since I enabled automatic twice a month ZFS scrubs around 2018, (both OS pool and media pool), I have not had a single error in years. And this thing has been alive since about 2015, close to 8 years.
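For anyone wanting a similar twice-a-month schedule, here is a rough sketch of a script that a daily cron job or timer could call. The pool names and the choice of the 1st and 15th are assumptions for illustration only, and this is not necessarily how the scrubs on this machine are scheduled. The only external command used is `zpool scrub <pool>`.

```python
import subprocess
from datetime import date

POOLS = ["ospool", "mediapool"]  # hypothetical pool names
SCRUB_DAYS = (1, 15)             # assumed schedule: twice a month


def scrub_due(today: date, scrub_days=SCRUB_DAYS) -> bool:
    """True when today is one of the chosen days of the month."""
    return today.day in scrub_days


def scrub_command(pool: str) -> list[str]:
    """Build the argv list for starting a scrub on one pool."""
    return ["zpool", "scrub", pool]


def main() -> None:
    """Start a scrub on every pool if today is a scrub day.

    Intended to be called once a day as root; not called here.
    """
    if not scrub_due(date.today()):
        return
    for pool in POOLS:
        # zpool scrub returns right away; the scrub itself runs in the background
        subprocess.run(scrub_command(pool), check=True)
```

Many distros already ship a scrub cron job or systemd timer with their ZFS packages, so it is worth checking for one before rolling your own.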

I call that weird.

Now to be fair, I did install a USB powered, external fan, (about 3"), blowing air directly across the computer. I doubt that could have made the storage devices more reliable, but maybe...

So, from this perspective, ZFS scrubs seem to magically heal a Striped pool when encountering bad blocks.

Okay, I am not serious about "magic". But, at a guess, reading a block that is starting to fail, (but that disk ECC can still fix), actually causes either the storage device or ZFS to spare it out before data loss. Thus, ZFS never has to deal with a completely failed block.

Anyone else with thoughts on the matter?

sretalla · Sep 5, 2023

I agree it seems that you've been lucky, but there's nothing to say that disks must have errors, so it's not so surprising/miraculous.

I guess forcing regular activity may somehow help the disks to maintain health in one way or another, but as you say, there's no place to automatically get recovery information in a stripe, so it does indeed require all disks to accurately give back all the data they were given, all the time, for no scrub errors to appear.

It may also have something to do with your knowledge of how to properly maintain temperature of the disks and not spinning them down/up.

Ericloewe · Sep 5, 2023

I'd say that regular errors are not really expected, per se. So some combination of bad luck/good luck is possible.

That said, better cooling could make a difference. Beyond that, regular reads on the SSD would also trigger its own internal error correction, which is probably only semi-transparent to SMART data. So, the more stuff the SSD's internal ECC picks up on and corrects, the less of it filters up to ZFS. Conceptually, anyway.

Arwen · Sep 5, 2023

Hmm, maybe the dozen or so bad blocks I experienced in the first years of that Striped pool, were the weakest blocks. Now I have reached the plateau, and next will be the decline.

I've been searching for a replacement for the last 2 years, but nothing small supports ECC memory and enough storage. There are embedded computers with ECC memory, yet they lack enough storage for a media server. The ones with ECC memory support and enough storage tend to be Mini-ITX or larger.

Johnny Fartpants · Sep 6, 2023

I understand this is not the question, but would copies=2 have helped here? I've heard of people using this for a little extra protection when only a single disk is involved, e.g. desktops, laptops, external drives, etc.

Appreciate it would come at a storage space cost.

Ericloewe · Sep 6, 2023

Yes, but it would sort of defeat the purpose of having a stripe with no redundancy. Might as well move to a mirror at that point.

Johnny Fartpants · Sep 6, 2023

Ericloewe said:
Yes, but it would sort of defeat the purpose of having a stripe with no redundancy. Might as well move to a mirror at that point.

True, I guess it depends on why a system is set up that way in the first place.

Arwen · Sep 6, 2023

Yes, "copies=2" could have helped. And especially since I have mismatched sizes, (1TB SSD & 2TB HDD). (You can't Mirror a 2TB disk to a 1TB disk.)

But, I need all the storage for media, (and growth). So, no, I would not implement "copies=2" for this use.
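The space trade-off can be sketched with some back-of-the-envelope arithmetic. This uses the nominal sizes from the first post (1TB SSD, 2TB HDD, about 40GB of each set aside for the OS pool) and ignores ZFS metadata, padding and compression, so the figures are rough estimates only.

```python
def usable_capacity_gb(ssd_gb: int = 1000, hdd_gb: int = 2000,
                       os_slice_gb: int = 40) -> dict:
    """Rough usable media space, in decimal GB, for three layouts.

    Ignores metadata overhead and compression (a deliberate simplification).
    """
    ssd_part = ssd_gb - os_slice_gb   # what is left of the SSD for media
    hdd_part = hdd_gb - os_slice_gb   # what is left of the HDD for media
    stripe = ssd_part + hdd_part
    return {
        "stripe": stripe,                      # current layout
        "stripe_copies_2": stripe / 2,         # every block stored twice
        "mirror": min(ssd_part, hdd_part),     # limited by the smaller device
    }


caps = usable_capacity_gb()
# caps == {"stripe": 2920, "stripe_copies_2": 1460.0, "mirror": 960}
```

So with mismatched devices, copies=2 would keep roughly 1.46TB usable versus under 1TB for a mirror, at the price of half the stripe. Unlike a mirror, though, copies=2 is not designed to survive the loss of a whole device.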

Perhaps my next media server will have enough ports that I can Mirror a pair of disks, (HDD or SSD).
