Which drive is failing?

Status
Not open for further replies.

ghostlobster

Dabbler
Joined
Oct 10, 2012
Messages
36
OK, so my nightly automated status report flagged a potential error last night, saying that a drive might be failing. When I log into my admin console, I have the flashing yellow alert up top, which shows the following:
Code:
The volume RAID (ZFS) status is UNKNOWN: One or more devices has experienced an unrecoverable error. An attempt was made to correct the error. Applications are unaffected. Determine if the device needs to be replaced, and clear the errors using 'zpool clear' or replace the device with 'zpool replace'.


The section of that report regarding drive health is as follows:
Code:
Checking status of zfs pools:
NAME       SIZE  ALLOC   FREE    CAP  DEDUP  HEALTH  ALTROOT
RAID      3.62T  1.12T  2.50T    30%  1.00x  ONLINE  /mnt
external  1.81T  38.0G  1.78T     2%  1.00x  ONLINE  /mnt

  pool: RAID
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 0h4m with 0 errors on Sat Oct 27 08:35:25 2012
config:

        NAME                                            STATE     READ WRITE CKSUM
        RAID                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/b63615f8-18b8-11e2-a5a9-001617ec2159  ONLINE       0     0     0
            gptid/b6e20947-18b8-11e2-a5a9-001617ec2159  ONLINE       0     0     1
            gptid/b76d6eb2-18b8-11e2-a5a9-001617ec2159  ONLINE       0     0     0
            gptid/b7f68eba-18b8-11e2-a5a9-001617ec2159  ONLINE       0     0     0


I ran zpool clear RAID yesterday and it returned nothing: no error, no report. As you can see, a scrub against the pool took only 4 minutes.
I've installed another 1TB drive in my box to replace the failing one, but I can't isolate which drive is actually having the issue. Volume status shows all drives as online, but the drive on ADA2 is listing "1" in the checksum column. Is that the one that needs replacing? Is that "1" the report of the potential error? I've labeled all my drives with their serial numbers, nice and visible, so I won't have to yank the entire rack to figure out which one is which; that part will be easy.
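For reference, here's roughly how I tried to match the gptid from the zpool status output to an adaX device from the shell (not sure I'm reading it right, so corrections welcome):
Code:
# list GPT labels and the devices they live on
# (on a stock FreeNAS layout the ZFS data partition is p2)
glabel status

# or grep for the specific gptid showing the checksum error
glabel status | grep b6e20947-18b8-11e2-a5a9-001617ec2159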
So, should I do a Replace against the drive on ADA2 within the FreeNAS interface?
Thanks
 

ghostlobster

Dabbler
Joined
Oct 10, 2012
Messages
36
No responses? I want to fix this issue, but would really like confirmation that this is the right way to go: replacing the drive on ADA2. The checksum count is up to 3 now.
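For what it's worth, this is all I've been running from the shell to watch that counter (nothing fancy, just the pool status):
Code:
# per-device READ/WRITE/CKSUM error counters for the pool
zpool status -v RAID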
Thanks
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Have you tried running SMART diagnostics on the hard drives? That's what I'd do first.
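Something along these lines should work from the shell (ada2 here is just going by your post; substitute whichever disk you want to test):
Code:
# start a short self-test on the suspect disk (takes a couple of minutes)
smartctl -t short /dev/ada2

# once it's done, check the attributes and the self-test log for errors
smartctl -a /dev/ada2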
 

Yell

Explorer
Joined
Oct 24, 2012
Messages
74
It might be a pending sector on the device.

Below are my notes on fixing it. Be sure to substitute your disk for ada3, and review every command before running it.

Code:
## get the block in question
smartctl -a  /dev/ada3 | grep -e ATTRIBUTE_NAME -e Current_Pending_Sector -e "Short"

## Output:
#> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
#> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       1
#> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
#> #1  Short offline       Completed: read failure       90%        17         967814544

# Note the "1" under RAW_VALUE and the  LBA_of_first_error

## offline the disk
zpool offline datapool ada3


## remove RAW write protection
sysctl kern.geom.debugflags=0x10

## zero the block (data on this block IS LOST!)
## this will force the disk to remap the block
dd bs=512 seek=967814544 if=/dev/zero of=/dev/ada3 count=1

## re-enable RAW write protection
sysctl kern.geom.debugflags=0x0

## start new smart test
smartctl -t short /dev/ada3

## check SMART Reallocated_Sector_Ct and Current_Pending_Sector
smartctl -a  /dev/ada3 | grep -e ATTRIBUTE_NAME -e Reallocated_Sector_Ct -e Current_Pending_Sector -e "Short"

# Output:
#> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
#>  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -        0
#> 197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
#> # 1  Short offline       Completed without error       00%        20         -
#> # 2  Short offline       Completed: read failure       90%        20         967814544
#> # 3  Short offline       Completed: read failure       90%        17         967814544

## success: pending sector cleared and the new short test completed without error :D

## online the disk
zpool online datapool ada3

## scrub again, because zeroing that block may have destroyed a data/parity block ZFS was using
zpool scrub datapool
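
## (optional, not part of the original steps) once the scrub completes,
## confirm the error counters are back at zero
zpool status -v datapool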
 