RAIDZ1 pool lost two disks at once but data seems to be fine, how? [solved]

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
Edit: Thanks for the help, guys. It seems I was mistaken about the meaning of 'degraded' and it was simply a bit of metadata corruption. I've moved the data to another pool, and I'll just destroy and rebuild the corrupted pool once I get a couple more drives for extra redundancy.

First off, let me be clear: if it turns out that my data is gone or corrupted, it is entirely my fault for being stupid. I'm not complaining about anything here, just curious.

I have a pool of four disks configured in a RAIDZ1. Overnight, two of the disks went into 'degraded' status, causing the pool to go 'degraded' as well (they were bought used and I may have abused them a bit, so I'm not surprised they failed, tbh). As I understand it, in a RAIDZ1, losing one disk is fine, but losing two or more disks should mean all the data is gone, right? I could still see all the datasets I had on the pool, so for shits and giggles I used zfs send and receive to move the main dataset over to a different, healthy pool. It completed with no errors or warnings of any kind. I set up an SMB share for the new dataset, connected to it from a Windows machine, and my files are just as I left them. I haven't scanned all 600 GB of data to make sure nothing is corrupted, but I tried opening a variety of files from different directories and nothing seemed corrupted or missing. So while I don't know for sure if all my data is there, I know that at least a good portion of it is. How is this possible given that I lost two disks in a RAIDZ1? Or am I totally mistaken and 'degraded' doesn't mean failed, or RAIDZ1 isn't what I think?
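
For reference, the send/receive was roughly along these lines (the dataset name and destination pool are just placeholders here, not necessarily what I actually typed):
Code:
# snapshot the dataset on the degraded pool
zfs snapshot mainpool/data@migrate
# stream it over to the healthy pool
zfs send mainpool/data@migrate | zfs receive -v otherpool/data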

Just a note, when I configure a new pool to replace this one, I will be going with RAIDZ2 instead of RAIDZ1, as per the overwhelming advice on this forum.
 

NugentS

MVP
Joined
Apr 16, 2020
Messages
2,947
A degraded disk isn't a lost disk, it's an ill disk - but still breathing.
 
Joined
Jun 15, 2022
Messages
674
What it means, in very, very general terms, is that the controller on the hard disk drive found issues and logged them in the S.M.A.R.T. system. It could be that there is enough wrong for the "healthy" flag on the drive to be set to "unhealthy." TrueNAS saw the issues (or the flag) and is alerting you that the disks have problems.

If you read up on smartctl, you'll know how to check the logs and see what the drive controller sees.
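
For example, something along these lines will show a drive's health assessment and its logs (the device name is just an illustration, yours will differ):
Code:
# overall health self-assessment
smartctl -H /dev/ada0
# full SMART attributes plus error and self-test logs
smartctl -a /dev/ada0
# just the drive's error log
smartctl -l error /dev/ada0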

If you're going to buy used drives, you might want Z3.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
What it means, in very, very general terms, is that the controller on the hard disk drive found issues and logged them in the S.M.A.R.T. system
No, it really doesn't. SMART errors and ZFS errors are pretty much independent of each other, and what OP is describing are ZFS errors. There may or may not also be SMART errors.
Or am I totally mistaken and 'degraded' doesn't mean failed
This, pretty much. ZFS is smart enough to know where the errors are; the output of zpool status -v will tell you if there are any data errors, and if so, where.
 

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
If you're going to buy used drives, you might want Z3.
I may just do that. I am buying used drives (I know it's a bad idea and probably doesn't actually save me any money in the long run, but the data on here is either not that important or is backed up elsewhere as well).
 

jblack

Dabbler
Joined
Feb 2, 2023
Messages
19
No, it really doesn't. SMART errors and ZFS errors are pretty much independent of each other, and what OP is describing are ZFS errors. There may or may not also be SMART errors.

This, pretty much. ZFS is smart enough to know where the errors are; the output of zpool status -v will tell you if there are any data errors, and if so, where.
zpool status -v gave me this result:
Code:
pool: mainpool
state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-8A
  scan: resilvered 19.3M in 00:00:14 with 0 errors on Fri Feb 17 13:04:34 2023
config:

        NAME                                      STATE     READ WRITE CKSUM
        mainpool                                  DEGRADED     0     0     0
          raidz1-0                                DEGRADED     0     0     0
            fd63c05e-7489-403c-bf61-da733420985e  DEGRADED     0     0     0  too many errors
            1c3b9e43-85e6-47e5-8d0f-4fa263796ee8  DEGRADED     0     0     0  too many errors
            6f8be4e6-c839-4d63-b02f-8d28699913c0  ONLINE       0     0     0
            d91f42c0-d5a8-4eb0-aa26-a03c7808a8c1  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <0xb40>:<0x1>

I have no idea what those files are, but since I've already moved all the data over to another pool, I'm just going to destroy the degraded pool, add a couple more drives, and use RAIDZ2/3. I'll also run a few longer tests on all the drives to see if they're physically on their way out or if the corruption was the result of me pulling the power during a write or something.
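
For the longer tests I'm thinking of something like this on each drive (device names are just examples):
Code:
# start an extended offline self-test; it runs in the drive's background and can take hours
smartctl -t long /dev/ada0
# once it finishes, review the self-test and error logs
smartctl -a /dev/ada0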
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
<0xb40>:<0x1>
That indicates metadata corruption, so destroying and recreating the pool is likely your best bet. But if you've got your data off it, you should be in good shape.
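If it helps, from the command line the rebuild is just a destroy and a create, something like this (device names are illustrative, and on TrueNAS you'd normally do the create through the UI instead):
Code:
# destroys the pool and everything on it - only once the data is safely copied off
zpool destroy mainpool
# recreate as RAIDZ2 across six disks
zpool create mainpool raidz2 da0 da1 da2 da3 da4 da5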
 
Joined
Jun 15, 2022
Messages
674
I may just do that, I am buying used drives (I know it's a bad idea and probably doesn't actually save me any money in the long run, but the data on here is either not that important or is backed up elsewhere as well)
Depending on the HDD grade (consumer/server), HDD condition (heavily used for crypto-mining or light use as occasional storage), and your intended use, buying used server drives can save money.

If the drives are going to be used hard 24/7 in a business environment, you probably want to buy new. Server drives are usually rated for a five-year reliable lifespan, but there are lots of variables.

If it's home use, then you might come out ahead by a long shot with used server drives.

---
I'll have to look into @danb35's statement on ZFS errors. It seems he's correct, and learning is a good thing. Thanks Dan.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,504
knowing-is-half-the-battle.jpg
 