RAID Z2 - 2 Drives Faulted, Next Steps?

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Hi everyone,

This is not the first time I have had drives go bad... but it is the first time I have had 2 fault at the same time. Thankfully it's a Z2 array, and I have a cold spare waiting to go in. But with 2 drives down, I am a little wary of how to approach this.

Would it make sense to do a zpool clear to force it to think everything is OK, then replace one of the drives with my cold spare, see how the resilver goes, do a scrub, and monitor the situation? I don't want to get myself into a worse situation by jumping to conclusions prematurely.
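
Roughly what I have in mind (placeholders for the disk identifiers; I would pull the actual gptids from zpool status before running anything):
Code:
# clear the fault state so the pool stops treating the drives as faulted
zpool clear pergamum

# replace the worst drive with the cold spare; the old disk stays attached while it resilvers
zpool replace pergamum gptid/<faulted-drive> gptid/<cold-spare>

# once the resilver finishes, scrub and watch the error counters
zpool scrub pergamum
zpool status -v pergamum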

A few SMART errors have popped up over the past few SMART tests; I guess I didn't dig deep enough into them because I thought the drives were still working fine (I have had erroneous errors in the past that didn't actually result in bad drives or any corruption). Looking at the SMART status of da5 (the drive with 62 faults according to zpool status), I am seeing:
Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   199   199   051    Pre-fail  Always       -       43
  3 Spin_Up_Time            0x0027   186   161   021    Pre-fail  Always       -       5700
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       234
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52765
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       131
194 Temperature_Celsius     0x0022   121   105   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   055   000    Old_age   Always       -       608
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged


da7 (61 faults):

Code:
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   184   159   021    Pre-fail  Always       -       5766
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       235
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   028   028   000    Old_age   Always       -       52758
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       228
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       224
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       107
194 Temperature_Celsius     0x0022   120   105   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged


Zpool status:
Code:
  pool: pergamum
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 1.89M in 06:25:16 with 0 errors on Thu Dec 21 06:25:36 2023
config:

    NAME                                            STATE     READ WRITE CKSUM
    pergamum                                        DEGRADED     0     0     0
      raidz2-0                                      DEGRADED     0     0     0
        gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac  ONLINE       0     0     0
        gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793  ONLINE       0     0     0
        gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee  FAULTED     61     0     0  too many errors
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0
        gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  FAULTED     62     0     0  too many errors
        gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee  ONLINE       0     0     0

errors: No known data errors


What should my next steps here be?

To stave off any questions about hardware: it can all be found in my signature, but the controller is an H310 feeding a SAS expander, and it has been in good working order for 7+ years (minus a few drive failures over the years). It's possible a SAS -> SATA cable (or two) is going bad; it's happened to me before. But I am thinking this is not that sort of situation.
 


LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
I added in my cold spare to attempt to replace da5, and TrueNAS hung... So I went to reboot it, and it hung some more. Eventually I just shut it off via the Proxmox web UI, turned off the host, reseated the HBA, replugged cables, etc.

During the shutdown, I saw this pop up on the Proxmox console for TrueNAS:

[screenshot: console error output during shutdown]


At boot, after reseating the HBA and cables, I got this:

[screenshot: console errors after reboot]


With the above errors being shown on the console, I logged into the TrueNAS web UI to see the pool as online, OK, and in the middle of a resilver??? 72% done with a resilver, to be precise. I was worried the above errors were indicative of a bad HBA, so I turned off TrueNAS via Proxmox (if I have a bad controller, I really don't want it writing shit data across my entire pool... although if it really was 72% complete and that was the case, I am worried I may be SOL anyway?).

Not entirely sure what is happening here, but TrueNAS is turned off until I can get a better understanding of what may be going on. Or maybe (hopefully) I am misunderstanding the above errors, and it's simply reporting that da6 (whichever drive this ended up being on this reboot) is just really, really not happy about things, and it's not actually a controller failure?
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
I can't offer much advice, except for 2 things:
  • If you have a spare disk slot (or can export that single disk pool), a ZFS replace in place is best; see the command sketch after this list. Meaning, if the existing drive has not failed completely, it can be used as partial redundancy for the re-silver. (Remember, ZFS checks anything it reads for consistency via checksums...)
  • Did you pass through the HBA to the TrueNAS VM?
    It's not clear from your posts. If not, that is a problem...
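
In command terms, the replace-in-place from the first bullet looks roughly like this (your pool name, placeholder gptids; in TrueNAS you would normally do it through the GUI's disk Replace option, which ends up doing the equivalent):
Code:
# do NOT offline or pull the old drive first; leave it attached as partial redundancy
zpool replace pergamum gptid/<old-faulted-disk> gptid/<new-disk>

# zpool status then shows a temporary "replacing-N" vdev holding both disks;
# the old disk is detached automatically once the new one finishes resilvering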
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
I can't offer much advice, except for 2 things:
  • If you have a spare disk slot (or can export that single disk pool), a ZFS replace in place is best; see the command sketch after this list. Meaning, if the existing drive has not failed completely, it can be used as partial redundancy for the re-silver. (Remember, ZFS checks anything it reads for consistency via checksums...)
  • Did you pass through the HBA to the TrueNAS VM?
    It's not clear from your posts. If not, that is a problem...
I do have a spare slot, and I have plugged in a cold spare. How would I go about using the current drives (including the suspected failed drive(s)) to do this, exactly? I guess I am used to removing the old drive before replacing it; I usually have completely failed drives, not ones that are "sort of" working.

Yes, the HBA is passed through to the TrueNAS VM.
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
Well, assuming the TrueNAS GUI is not nonsensical, I am replacing the worst of the 2 drives with my cold spare now. It looks like it was already resilvering the current worst drive (it is again listed as da6), so once that is finished it will start the replacement with the cold spare.

I am not sure why it is trying to resilver the currently bad drive since it was faulted out... I guess when I rebooted earlier it kicked off a resilver since the drive did come back online? Either way, the pool is currently listed as online with no data errors across the pool itself, but the worst offending drive is currently showing 10 read errors.

Code:
  pool: pergamum
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
    continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Thu Dec 21 11:48:40 2023
    19.5T scanned at 30.0G/s, 16.6T issued at 25.6G/s, 20.5T total
    32K resilvered, 81.23% done, 00:02:33 to go
config:

    NAME                                              STATE     READ WRITE CKSUM
    pergamum                                          ONLINE       0     0     0
      raidz2-0                                        ONLINE       0     0     0
        gptid/ab0351e8-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/abbfceac-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/e3c7752a-1fc4-11ea-8e70-000c29cab7ac    ONLINE       0     0     0
        gptid/6ebdcf54-ac93-11ec-b2a3-279dd0c48793    ONLINE       0     0     0
        gptid/ae0d7e64-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/aeca106f-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/af89686d-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        gptid/b04ad4fc-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0
        replacing-8                                   ONLINE       0     0     0
          gptid/b10b6452-44ea-11e8-8cad-e0071bffdaee  ONLINE      10     0     0  (resilvering)
          gptid/a1020a2d-a04d-11ee-8a53-0002c95458ac  ONLINE       0     0     0  (awaiting resilver)
        gptid/b1d949c1-44ea-11e8-8cad-e0071bffdaee    ONLINE       0     0     0

errors: No known data errors


Thanks for the advice; we shall see how this goes. The new drive to replace the potential second bad drive shows up tomorrow. Assuming all is well and nothing gets worse, I plan to badblocks it as I normally do prior to deploying a drive, and then repeat this process. If the second suspected bad drive starts to yeet itself from the array, I may forgo badblocks, as I think I would rather have a likely-working new drive than a known bad drive in my array. We will see...
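
For reference, the burn-in I run is a destructive badblocks write pass, roughly like this (device name is just an example, and it wipes the disk, so only ever on an empty drive):
Code:
# -b 4096 = block size, -w = destructive write-mode test, -s = show progress, -v = verbose
badblocks -b 4096 -wsv /dev/daX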
 

Arwen

MVP
Joined
May 17, 2014
Messages
3,611
ZFS remembers if a scrub or re-silver was in progress at a reboot or crash, and will resume such on pool import.

Based on what I see, your failing disks were simply marked faulted because they had too many errors in a short time. On reboot, the drive was "onlined", thus ZFS is attempting to restore it to service. Kinda silly in your case, but just wait it out. Once that old disk is re-silvered as best it can be, the replacement disk's re-silver will start. And when that is done, you can remove the old disk.

So, unless your pool was stuck somehow, you should not have rebooted. Oh well, it's working now.


There is also one other thing ZFS will do on reboot: resume dataset deletion. Today's asynchronous dataset / zVol deletion is much less impactful, but it nevertheless does take time. It does not show up in zpool status; you have to use zpool get freeing POOL to see how much is left to free up. In your case this should not be an issue, just letting you know what else resumes at pool import.
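
In your case that check would just be:
Code:
zpool get freeing pergamum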
 

LIGISTX

Guru
Joined
Apr 12, 2015
Messages
525
ZFS remembers if a scrub or re-silver was in progress at a reboot or crash, and will resume such on pool import.

Based on what I see, your failing disks were simply marked faulted because they had too many errors in a short time. On reboot, the drive was "onlined", thus ZFS is attempting to restore it to service. Kinda silly in your case, but just wait it out. Once that old disk is re-silvered as best it can be, the replacement disk's re-silver will start. And when that is done, you can remove the old disk.

So, unless your pool was stuck somehow, you should not have rebooted. Oh well, it's working now.


There is also one other thing ZFS will do on reboot: resume dataset deletion. Today's asynchronous dataset / zVol deletion is much less impactful, but it nevertheless does take time. It does not show up in zpool status; you have to use zpool get freeing POOL to see how much is left to free up. In your case this should not be an issue, just letting you know what else resumes at pool import.
Thanks for the info. All seems to be restored to health. The bad disk did end up falling out of the array again during the resilver; too many errors were seen and ZFS faulted it out. But the cold spare was resilvered, and the pool is once again fully healthy.

I did have 2 faulted drives yesterday, but after the reboot one of them was onlined and has remained online through the resilver process - that said, I have a new drive being delivered today that I plan to swap it out with. The known bad drive I will take in to be recycled; as for the second drive that is currently working fine, I am wondering if it was in fact able to reallocate some sectors and is now "happily" going about its life. Once I swap it out for the new one, I will badblocks it a few times and see if the SMART results change. If not... I will hold onto it as a marginally trustworthy cold spare, I suppose.
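
The attributes I will be watching between passes look something like this (device name is an example):
Code:
# any growth in these counters between badblocks passes and the drive gets retired for good
smartctl -A /dev/daX | grep -E 'Reallocated|Current_Pending|Offline_Uncorrectable|UDMA_CRC'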
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
It's possible a SAS -> SATA cable (or two) is going bad
No. The SMART self-test is completely internal to the drive. If a SMART Long/Extended test fails, then the drive should be replaced, which is what you are doing.
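
For reference, kicking off and checking an extended test looks roughly like this (da5 as the example device):
Code:
# start an extended self-test; it runs entirely inside the drive and takes hours on a big disk
smartctl -t long /dev/da5

# afterwards, check the self-test log and the overall health verdict
smartctl -l selftest /dev/da5
smartctl -H /dev/da5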

Wow, over 52,700 hours (over 6 years), that is a long time.
 