Disk of rz2 pool faulted, how to recover

G Brown

Dabbler
Joined
Jan 2, 2014
Messages
31
Thanks in advance for any clarity!

During a scrub of my RZ2 pool, disk da4 generated many errors and was faulted. This is the current state:

Code:
  pool: rz2
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
    Sufficient replicas exist for the pool to continue functioning in a
    degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
    repaired.
  scan: scrub repaired 0 in 14h13m with 0 errors on Sun Mar 15 14:13:11 2020
config:

    NAME                                                STATE     READ WRITE CKSUM
    rz2                                                 DEGRADED     0     0     0
     raidz2-0                                          DEGRADED     0     0     0
       gptid/17adb8b7-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
       gptid/1873efe0-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
       gptid/195e2a96-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0
       gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c.eli  FAULTED     45   134     0  too many errors
       gptid/1b098197-b58b-11e5-b4f6-000c298fbe2c.eli  ONLINE       0     0     0

errors: No known data errors




I completed a nine-hour SMART long self-test with smartctl and didn't see any errors:

Code:
>>>freenas: root@/mnt/rz2/data/gb$: smartctl -t long /dev/da4

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 545 minutes for test to complete.
Test will complete after Mon Mar 16 20:38:10 2020

*****

>>>freenas: root@/mnt/rz2/data/gb$:  smartctl -a /dev/da4
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
 

Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   171   170   021    Pre-fail  Always       -       8441
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       44
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   027   027   000    Old_age   Always       -       53682
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       44
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       42
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       4375
194 Temperature_Celsius     0x0022   122   113   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged




In the system log there were many SCSI errors:

Code:
Mar 15 07:41:08 freenas         (da4:mps0:0:3:0): READ(16). CDB: 88 00 00 00 00 01 79 1f 4c 88 00 00 00 08 00 00 length 4096 SMID 780 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas         (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b e8 00 00 50 00 length 40960 SMID 765 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas         (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b 38 00 00 58 00 length 45056 SMID 556 terminated ioc 804b scsi 0 state c xfer 0
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b 38 00 00 58 00
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): CAM status: SCSI Status Error
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI status: Check Condition
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI sense: NOT READY asc:4,0 (Logical unit not ready, cause not reportable)
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): Retrying command (per sense data)
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): READ(10). CDB: 28 00 c1 f8 3b e8 00 00 50 00
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): CAM status: SCSI Status Error
Mar 15 07:41:08 freenas (da4:mps0:0:3:0): SCSI status: Check Condition


This leads me to think that maybe a cable became unseated or was not making good contact. I opened the case and carefully wiggled the cables attached to da4.
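To make the log lines a bit more readable, here is a minimal sketch that pulls the starting LBA and block count out of the CDB bytes shown above. It assumes the standard READ(10) and READ(16) layouts (opcode 28 or 88 in the first byte); the function name decode_read_cdb is just something I made up for illustration.

```shell
#!/bin/sh
# Decode the starting LBA and block count from a READ(10) or READ(16) CDB
# given as space-separated hex bytes, as printed in the log lines above.
# decode_read_cdb is a hypothetical helper name, not a system tool.
decode_read_cdb() {
    set -- $1            # split the hex string into positional parameters
    case $1 in
        28) # READ(10): LBA in bytes 2-5, transfer length in bytes 7-8
            lba=$(( 0x$3$4$5$6 ))
            blocks=$(( 0x$8$9 ))
            ;;
        88) # READ(16): LBA in bytes 2-9, transfer length in bytes 10-13
            lba=$(( 0x$3$4$5$6$7$8$9${10} ))
            blocks=$(( 0x${11}${12}${13}${14} ))
            ;;
        *)  echo "unsupported opcode: $1" >&2
            return 1
            ;;
    esac
    echo "lba=$lba blocks=$blocks"
}

# Example: decode_read_cdb '28 00 c1 f8 3b 38 00 00 58 00'
```

The block counts come out as 88 and 8 for the two CDBs above, and 88 x 512 = 45056 and 8 x 512 = 4096 match the "length" fields in the log. Reads at quite different addresses all being terminated in the same second seems to fit a drive or link that briefly dropped out rather than one bad sector, but that is my reading, not something the log proves.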

So that is where we are now. What should I do next? I am thinking of running zpool clear to clear the errors on the faulted disk, presumably the form of the command that references just disk da4. However, since the only errors showing for the pool are on the faulted disk, would it make any difference to issue zpool clear for the whole pool instead? Other factors to note: this RZ2 pool is encrypted, and I have two backups of the important data.
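For reference, these are the two forms of the command I am weighing. The sketch below only prints them (a dry run); it reads zpool status output and builds the per-device form from any row whose STATE column says FAULTED. The helper names faulted_devices and print_clear_commands are my own, and I am assuming the pool member is addressed by the gptid name shown in the status output rather than by da4.

```shell
#!/bin/sh
# Print the names of config rows in `zpool status` output (stdin) whose
# STATE column is FAULTED, skipping the pool's own row.
# Both function names here are hypothetical helpers.
faulted_devices() {
    awk -v pool="$1" 'NF >= 5 && $2 == "FAULTED" && $1 != pool { print $1 }'
}

# Dry run: print, but do not execute, the two forms of `zpool clear`.
# Usage: zpool status rz2 | print_clear_commands rz2
print_clear_commands() {
    pool=$1
    echo "zpool clear $pool"                  # form 1: whole pool
    faulted_devices "$pool" | while read -r dev; do
        echo "zpool clear $pool $dev"         # form 2: one device only
    done
}
```

Run against the status output above, this should print the pool-wide form followed by the same command with the faulted gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c.eli appended. Nothing is changed until one of the printed lines is actually run by hand.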

Another thing I discovered is that the system device name da4 always seems to be associated with the same physical disk. In this forum it is said that this relationship can change. Perhaps it does change when cables are swapped around, but in my system it has stayed the same across reboots.
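Since the pool lists its members by gptid rather than by daN name, one way to check the association is to look the disk up in the output of glabel status. This is a rough sketch that filters that output for a given disk; gptids_for_disk is a made-up helper name, and the partition number in the example comment is an assumption about how my disks are laid out.

```shell
#!/bin/sh
# Given `glabel status` output on stdin, print the gptid label(s) whose
# component is a partition of the named disk (e.g. da4 -> da4p1, da4p2).
# gptids_for_disk is a hypothetical helper name.
gptids_for_disk() {
    awk -v disk="$1" '
        $1 ~ /^gptid\// && $3 ~ ("^" disk "p[0-9]+$") { print $1 }
    '
}

# Usage: glabel status | gptids_for_disk da4
# Assumed example row (partition number is a guess):
#   gptid/1a3e1bae-b58b-11e5-b4f6-000c298fbe2c     N/A  da4p2
```

Running this after each reboot and comparing the result with the FAULTED line in zpool status would show whether da4 still maps to the same gptid.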

I did see information on replacing a disk in the manual, but nothing covering this situation.

In the forum some have said one must offline a disk first. If I just clear the errors, will it automatically rejoin the pool?

Fortunately the system has been very stable and this hasn't happened until now, so I thought I would check in with the forum expertise before I make a mistake.

Thanks.
 

G Brown

An update:

After listening to crickets for a while and examining several posts, I decided to reboot. After a minute or so the web GUI showed the login page and I logged in. It showed the pool as locked, so I entered the passphrase and unlocked it. It took about five minutes before the GUI returned and showed the volumes. After logging in, zpool status showed it was resilvering the problem disk, and all errors were gone. After 45 minutes the resilvering was complete, with 1 checksum error.

I do think it may be better to reboot than to just clear the error, because FreeNAS puts swap partitions on all the disks. Correct me if I'm wrong, but it would seem that when a disk is flaky, subsystems below the ZFS file system could also be affected, so just clearing the ZFS errors may not help the swap partition. Just guessing.

Question: does anybody know what happens to the system when a flaky disk disrupts the swap system? Is it wise to put swap partitions there?

So after the reboot I did see one checksum error, but that may be understandable because the disk dropped out of the resilvering process. So I cleared the error.
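While waiting for an answer, one thing that can at least be checked is whether any active swap device sits on the suspect disk. The sketch below filters swapinfo output for a given disk name. I am assuming swap devices show up with names like /dev/da4p1.eli and that the third column is the used amount in 1K blocks; swap_on_disk is a name I invented.

```shell
#!/bin/sh
# Read `swapinfo` output on stdin and print swap devices that live on the
# named disk, with the amount of swap currently used (1K blocks).
# swap_on_disk is a hypothetical helper; device naming is an assumption.
swap_on_disk() {
    awk -v disk="$1" '
        $1 ~ ("^/dev/" disk "p[0-9]+") { print $1, "used_kb=" $3 }
    '
}

# Usage: swapinfo | swap_on_disk da4
```

If this prints a device with a nonzero used amount, then pages really are swapped out to the flaky disk, which is the case I am worried about.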

Discussions of SMART monitoring are interesting: Google says one third of their failed disks showed no errors, and Backblaze says their statistics show one quarter of failed disks failed without showing any monitoring errors. The five SMART attributes that Backblaze uses are:

* SMART 5 – Reallocated_Sector_Count.

* SMART 187 – Reported_Uncorrectable_Errors.

* SMART 188 – Command_Timeout.

* SMART 197 – Current_Pending_Sector_Count.

* SMART 198 – Offline_Uncorrectable.

Put the other way around, about three quarters of their failed disks did show an error in at least one of these. https://www.backblaze.com/blog/hard-drive-smart-stats/
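As a quick check against that list, here is a small sketch that scans a smartctl attribute table and prints any of the five attributes with a nonzero raw value. It assumes the ten-column table layout shown in my earlier output, with RAW_VALUE in the tenth column; nonzero_key_attrs is just a name I picked.

```shell
#!/bin/sh
# Read `smartctl -A /dev/daN` output on stdin and print any of the five
# attributes listed above (5, 187, 188, 197, 198) whose RAW_VALUE is not 0.
# Output: attribute ID, attribute name, raw value.
# nonzero_key_attrs is a hypothetical helper name.
nonzero_key_attrs() {
    awk '
        $1 ~ /^(5|187|188|197|198)$/ && NF >= 10 && $10 != 0 {
            print $1, $2, $10
        }
    '
}

# Usage: smartctl -A /dev/da4 | nonzero_key_attrs
```

For the da4 table posted above this should print nothing, since attributes 5, 197 and 198 all have a raw value of 0 and this drive does not report 187 or 188 at all. An empty result only means none of the five is flagged, not that the disk is healthy.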

So I have a replacement disk; what are the chances it was treated as fragile by Amazon during shipment? Does anybody know who ships disks marked “fragile”?

Question: Does anybody know of a good SATA cable company/source?

I will do a scrub this evening and see if things seem OK.