RAIDZ1 pool lost two disks at once but data seems to be fine, how? [solved]

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
Edit: Thanks for the help, guys. It seems I was mistaken about the meaning of 'degraded' and it was simply a bit of metadata corruption. I've moved the data to another pool, and I'll just destroy and rebuild the corrupted pool once I get a couple more drives for extra redundancy.

First off, let me be clear: if it turns out that my data is gone or corrupted, it is entirely my fault for being stupid. I'm not complaining about anything here, just curious.

I have a pool of four disks configured in a RAIDZ1. Overnight, two of the disks went into 'degraded' status, causing the pool to go 'degraded' as well (they were bought used and I may have abused them a bit, so I'm not surprised they failed, tbh). As I understand it, in a RAIDZ1, losing one disk is fine, but losing two or more disks should mean all the data is gone, right? I could still see all the datasets I had on the pool, so for shits and giggles I used zfs send and receive to move the main dataset over to a different, healthy pool. It completed with no errors or warnings of any kind. I set up an SMB share for the new dataset, connected to it from a Windows machine, and my files are just as I left them. I haven't scanned all 600 GB of data to make sure nothing is corrupted, but I tried opening a variety of files from different directories and nothing seemed corrupted or missing. So while I don't know for sure if all my data is there, I know that at least a good portion of it is. How is this possible given that I lost two disks in a RAIDZ1? Or am I totally mistaken and 'degraded' doesn't mean failed, or RAIDZ1 isn't what I think?
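
For reference, the send/receive was roughly along these lines (the dataset name and destination pool are just placeholders here, not necessarily what I actually typed):
Code:
# snapshot the dataset on the degraded pool
zfs snapshot mainpool/data@migrate
# stream it over to the healthy pool
zfs send mainpool/data@migrate | zfs receive -v otherpool/data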

Just a note, when I configure a new pool to replace this one, I will be going with RAIDZ2 instead of RAIDZ1, as per the overwhelming advice on this forum.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
A degraded disk isn't a lost disk, it's an ill disk - but still breathing.
 
Joined
Jun 15, 2022
Messages
674
What it means, in very, very general terms, is that the controller on the hard disk drive found issues and logged them in the S.M.A.R.T. system. It could be that there is enough wrong for the "healthy" flag on the drive to be set to "unhealthy." TrueNAS saw the issues (or the flag) and is alerting you that the disks have problems.

If you read up on smartctl, you'll know how to check the logs and see what the drive controller sees.
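
For example, something along these lines will show a drive's health assessment and its logs (the device name is just an illustration, yours will differ):
Code:
# overall health self-assessment
smartctl -H /dev/ada0
# full SMART attributes plus error and self-test logs
smartctl -a /dev/ada0
# just the drive's error log
smartctl -l error /dev/ada0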

If you're going to buy used drives, you might want Z3.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
What it means, in very, very general terms, is that the controller on the hard disk drive found issues and logged them in the S.M.A.R.T. system
No, it really doesn't. SMART errors and ZFS errors are pretty much independent of each other, and what OP is describing are ZFS errors. There may or may not also be SMART errors.
Or am I totally mistaken and 'degraded' doesn't mean failed
This, pretty much. ZFS is smart enough to know where the errors are; the output of zpool status -v will tell you if there are any data errors, and if so, where.
 

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
If you're going to buy used drives, you might want Z3.
I may just do that. I am buying used drives (I know it's a bad idea and probably doesn't actually save me any money in the long run, but the data on here is either not that important or is backed up elsewhere as well).
 

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
No, it really doesn't. SMART errors and ZFS errors are pretty much independent of each other, and what OP is describing are ZFS errors. There may or may not also be SMART errors.

This, pretty much. ZFS is smart enough to know where the errors are; the output of zpool status -v will tell you if there are any data errors, and if so, where.
zpool status -v gave me this result:
Code:
pool: mainpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 19.3M in 00:00:14 with 0 errors on Fri Feb 17 13:04:34 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        mainpool                                  DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            fd63c05e-7489-403c-bf61-da733420985e  DEGRADED     0     0     0  too many errors
            1c3b9e43-85e6-47e5-8d0f-4fa263796ee8  DEGRADED     0     0     0  too many errors
            6f8be4e6-c839-4d63-b02f-8d28699913c0  ONLINE       0     0     0
            d91f42c0-d5a8-4eb0-aa26-a03c7808a8c1  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xb40>:<0x1>

I have no idea what those files are, but since I've already moved all the data over to another pool, I'm just going to destroy the degraded pool, add a couple more drives, and use RAIDZ2/3. I'll also run a few longer tests on all the drives to see if they're physically on their way out or if the corruption was the result of me pulling the power during a write or something.
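
For the longer tests I'm thinking of something like this on each drive (device names are just examples):
Code:
# start an extended offline self-test; it runs in the drive's background and can take hours
smartctl -t long /dev/ada0
# once it finishes, review the self-test and error logs
smartctl -a /dev/ada0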
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
<0xb40>:<0x1>
That indicates metadata corruption, so destroying and recreating the pool is likely your best bet. But if you've got your data off it, you should be in good shape.
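If it helps, from the command line the rebuild is just a destroy and a create, something like this (device names are illustrative, and on TrueNAS you'd normally do the create through the UI instead):
Code:
# destroys the pool and everything on it - only once the data is safely copied off
zpool destroy mainpool
# recreate as RAIDZ2 across six disks
zpool create mainpool raidz2 da0 da1 da2 da3 da4 da5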
 
Joined
Jun 15, 2022
Messages
674
I may just do that, I am buying used drives (I know it's a bad idea and probably doesn't actually save me any money in the long run, but the data on here is either not that important or is backed up elsewhere as well)
Depending on the HDD grade (consumer/server), HDD condition (heavily used for crypto-mining or light use as occasional storage), and your intended use, buying used server drives can save money.

If the drives are going to be used hard 24/7 in a business environment, you probably want to buy new. Server drives are usually rated for a five-year reliable lifespan, but there are lots of variables.

If it's home use, then you might come out ahead by a long shot with used server drives.

---
I'll have to look into @danb35's statement on ZFS errors. It seems he's correct, and learning is a good thing. Thanks Dan.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
knowing-is-half-the-battle.jpg
 