What are correct steps for Currently unreadable AND Offline uncorrectable sectors

gmacman · Jul 25, 2016

I am a complete noob so if you can be fairly descriptive in your help I would appreciate it.

I have the following alerts:

Device: /dev/ada1, 2 Currently unreadable (pending) sectors
Device: /dev/ada1, 2 Offline uncorrectable sectors
Device: /dev/ada1, Self-Test Log error count increased from 0 to 1

I have read thru a number of threads including this very good one about the "currently unreadable" errors:

https://forums.freenas.org/index.ph...-1-currently-unreadable-pending-sectors.9824/

And this one on the "offline uncorrectable" errors:

https://forums.freenas.org/index.ph...lesector-offline-uncorrectable-sectors.22131/

As well as this:

http://illumos.org/msg/ZFS-8000-9P

Again, noob here, so although all this may be easy for some people to grasp, it is not for me. And I want to make sure I am following the action steps correctly. The data is non-critical stuff (just streaming videos, photos, home movies, etc.), and all of it is backed up elsewhere.

I have ordered a new drive because, well, I really should have one lying around just in case but haven't because the system hasn't been up for more than 6 months and...I'm a noob. It will be here in 2 days, but what should I be doing right now? I am running FreeNAS-9.3-STABLE-201506292332 with 5 discs in a raidz1.

From what I've read I think I still need to determine if this drive is truly going to fail. Is this true? And if so, what are the correct steps to follow?

Here is the daily run output email I got:

Code:


Checking status of zfs pools:
NAME           SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
freenas-boot  7.44G   527M  6.92G         -      -     6%  1.00x  ONLINE  -
nasgran       13.6T  8.14T  5.49T         -    25%    59%  1.00x  ONLINE  /mnt

  pool: nasgran
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 0 in 7h32m with 0 errors on Sun Jul  3 07:32:51 2016
config:

        NAME                                            STATE     READ WRITE CKSUM
        nasgran                                         ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/0fa48cbd-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/108c7fb2-7947-11e5-a665-0cc47a7369be  ONLINE       0     0    57
            gptid/117ac8a3-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/125d9027-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/133c230f-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0

errors: No known data errors

And here is the output from the smartctl -q noserial -a /dev/ada1 I ran:

Code:

steve@freenas:~ % sudo smartctl -q noserial -a /dev/ada1
Password:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p16 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Green
Device Model:  WDC WD30EZRX-00SPEB0
Firmware Version: 80.00A80
User Capacity:  3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  5400 rpm
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Mon Jul 25 09:56:47 2016 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84)   Offline data collection activity
           was suspended by an interrupting command from host.
           Auto Offline Data Collection: Enabled.
Self-test execution status:  ( 121)   The previous self-test completed having
           the read element of the test failed.
Total time to complete Offline
data collection:      (43560) seconds.
Offline data collection
capabilities:         (0x7b) SMART execute Offline immediate.
           Auto Offline data collection on/off support.
           Suspend Offline collection upon new
           command.
           Offline surface scan supported.
           Self-test supported.
           Conveyance Self-test supported.
           Selective Self-test supported.
SMART capabilities:  (0x0003)   Saves SMART data before entering
           power-saving mode.
           Supports SMART auto save timer.
Error logging capability:  (0x01)   Error logging supported.
           General Purpose Logging supported.
Short self-test routine
recommended polling time:     (  2) minutes.
Extended self-test routine
recommended polling time:     ( 437) minutes.
Conveyance self-test routine
recommended polling time:     (  5) minutes.
SCT capabilities:     (0x7035)   SCT Status supported.
           SCT Feature Control supported.
           SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   217   175   021    Pre-fail  Always       -       6141
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       9
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       6214
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   094   094   000    Old_age   Always       -       319798
194 Temperature_Celsius     0x0022   118   114   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       2
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure        90%             6209  2747906360

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So again, what (and please be fairly detailed) are the correct steps I need to follow?

m0nkey_ · Jul 25, 2016

gmacman said:
but what should I be doing right now

Praying that your backup is good :) You have backups, right?

But in all seriousness, you have RAIDZ2, which means you can lose any two drives. If it goes now, your data should still be safe.

When you do get the new drive, follow the drive replacement procedure in the documentation: http://doc.freenas.org/9.10/freenas_storage.html#replacing-a-failed-drive
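For anyone who wants to double-check the RAIDZ2 observation, here is a minimal Python sketch that reads the config section of the `zpool status` output posted above and reports the parity level plus any member disk with non-zero error counters. The function name `summarize_config` is made up for illustration, and the parsing assumes the counters are plain integers and that member disks are listed as `gptid/...`, as they are in this thread.

```python
import re

# Config section copied from the zpool status output earlier in this thread.
ZPOOL_CONFIG = """\
        NAME                                            STATE     READ WRITE CKSUM
        nasgran                                         ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/0fa48cbd-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/108c7fb2-7947-11e5-a665-0cc47a7369be  ONLINE       0     0    57
            gptid/117ac8a3-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/125d9027-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
            gptid/133c230f-7947-11e5-a665-0cc47a7369be  ONLINE       0     0     0
"""


def summarize_config(text):
    """Return (parity, flagged) where parity is the raidz level found
    (or None) and flagged maps each member disk with non-zero counters
    to its (read, write, cksum) tuple. Hypothetical helper."""
    parity = None
    flagged = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns; skip the header row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, _state, read, write, cksum = fields
        vdev = re.fullmatch(r"raidz(\d)-\d+", name)
        if vdev:
            parity = int(vdev.group(1))
            continue
        # Assumption: counters are plain integers, as in the output above.
        counts = (int(read), int(write), int(cksum))
        if name.startswith("gptid/") and any(counts):
            flagged[name] = counts
    return parity, flagged


parity, flagged = summarize_config(ZPOOL_CONFIG)
print("raidz parity level:", parity)
for disk, (read, write, cksum) in flagged.items():
    print(f"{disk}: read={read} write={write} cksum={cksum}")
```

On the posted output this finds a `raidz2` vdev and a single disk with checksum errors, which matches what the status text says.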

danb35 · Jul 25, 2016

The drive has failed a SMART self-test, which is a bad sign. If it's under warranty, RMA it and replace it, following the manual's instructions as @m0nkey_ linked above. Make sure you use WDIDLE3.EXE to set the park timer to a sensible value, which you apparently didn't do on this drive. Do that on your other drives, too. Make sure you set up regular SMART tests, which you also don't seem to have done yet, to more proactively monitor the situation.
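To put a number on the park timer remark, here is a rough back-of-the-envelope calculation in Python using only the two raw values from the SMART table posted above (attributes 9 and 193). It assumes the parks were spread evenly over the drive's powered-on life, which is a simplification.

```python
# Raw values taken from the smartctl attribute table posted above.
power_on_hours = 6214      # attribute 9, Power_On_Hours
load_cycle_count = 319798  # attribute 193, Load_Cycle_Count

# Average head parks per powered-on hour, assuming an even spread.
cycles_per_hour = load_cycle_count / power_on_hours

# Average time between parks, in seconds.
seconds_per_cycle = 3600 / cycles_per_hour

print(f"about {cycles_per_hour:.1f} load cycles per hour")
print(f"about one park every {seconds_per_cycle:.0f} seconds")
```

That works out to roughly 51 parks per hour, or about one every 70 seconds, which is why the timer setting gets called out.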

gmacman · Jul 25, 2016

m0nkey_ said:
Praying that your backup is good :) You have backups, right?

But in all seriousness, you have RAIDZ2, which means you can lose any two drives. If it goes now, your data should still be safe.

When you do get the new drive, follow the drive replacement procedure in the documentation: http://doc.freenas.org/9.10/freenas_storage.html#replacing-a-failed-drive

Yes, everything is either fully backed up (some stuff duplicate backed up) or stuff I really don't care about because it can be replaced very easily.

;)

But I believe I had set it up with only RAIDz1. But again, a real noob here so maybe I did overcompensate.

My big question is...should I be doing anything right now, like taking the disc offline?

gmacman · Jul 25, 2016

danb35 said:
The drive has failed a SMART self-test, which is a bad sign. If it's under warranty, RMA it and replace it, following the manual's instructions as @m0nkey_ linked above. Make sure you use WDIDLE3.EXE to set the park timer to a sensible value, which you apparently didn't do on this drive. Do that on your other drives, too. Make sure you set up regular SMART tests, which you also don't seem to have done yet, to more proactively monitor the situation.

Thanks. Yes, under warranty. I thought I had set up SMART tests, but clearly I did not, will do that. Again, should I take the drive offline or wait the 2 days until the new one gets here?

m0nkey_ · Jul 25, 2016

gmacman said:
My big question is...should I be doing anything right now, like taking the disc offline?

There is nothing to do until you receive the replacement drive.

danb35 · Jul 25, 2016

gmacman said:
should I be doing anything right now, like taking the disc offline?

No, there's no reason to do that--it would just reduce your redundancy. The drive is still working, just not as well as it should be. If possible, I'd suggest doing an advance RMA, and replacing the disk without offline-ing it first. That way you shouldn't ever need to compromise redundancy.

cyberjock · Jul 25, 2016

The correct step is to replace that drive. It's failing based on multiple SMART parameters (1, 7, 197, 198, 200) and the failed extended test that was performed 5 hours ago.
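If it helps to see where those attribute IDs come from, here is a small Python sketch that scans rows copied from the smartctl table posted earlier and lists the watched attributes whose raw value is non-zero. The `WATCHED` set and the function name `nonzero_watched` are illustrative choices, not anything smartctl defines, and how the raw values of attributes 1 and 7 should be read varies by vendor. In the table as posted, attribute 199 has a raw value of 0.

```python
# Rows copied from the smartctl attribute table earlier in this thread.
SMART_TABLE = """\
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  2
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  9
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  2
198 Offline_Uncorrectable  0x0030  200  200  000  Old_age  Offline  -  2
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  2
"""

# Illustrative watch list (an assumption, not an official one).
WATCHED = {1, 5, 7, 196, 197, 198, 199, 200}


def nonzero_watched(table):
    """Map attribute ID -> (name, raw value) for watched attributes
    with a non-zero raw value. Hypothetical helper for illustration."""
    hits = {}
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows have ten columns and start with a numeric ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED and raw.isdigit() and int(raw) > 0:
            hits[attr_id] = (name, int(raw))
    return hits


hits = nonzero_watched(SMART_TABLE)
for attr_id in sorted(hits):
    name, raw = hits[attr_id]
    print(f"{attr_id:>3} {name}: raw value {raw}")
```

The "5 hours ago" figure comes from the same output: Power_On_Hours is 6214 and the failed extended test is logged at a lifetime of 6209 hours.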

gmacman · Jul 25, 2016

Many thanks @m0nkey_ and @danb35 and @cyberjock

Stux · Jul 25, 2016

If you pull the drive now, then 20% of your array will be damaged (and correctable), whereas right now only a tiny fraction is.

You have a backup, you have raidz2, ZFS is correcting errors for you, and you know a drive is flaky and have a replacement on the way.

All is good.

When the new drive arrives, refresh your backup, then use the replace option on the failing drive with the new drive.

If you can't add a sixth drive temporarily to do the replace you'll need to offline the old one first.

gmacman · Jul 28, 2016

First, thanks to all for being so polite with this noob and not just saying "this guy is an idiot", when in fact, I was being an idiot.

My new drive arrived, and I followed @cyberjock's recommendations in his wdidle3 thread and set the new WD Green drive to a 300-second idle timer. Then I followed the manual to replace the drive: took the old one offline, shut down, physically swapped the drives, rebooted, and ran the replace command. Everything went smoothly after the ~8 hr resilver.

I'm now doing an RMA to ship back the bad one, all the drives are only 9 mo old and still have 1 1/4 years of warranty left. So I'm just going to let the remaining 4 drives "go bad" and go thru this exercise four more times.

:(

But I believe this is the right path to take since ALL of the existing drives have well over 300,000 Load Cycle Counts which, correct me if I'm wrong, is the usage number that is way over the lifespan average. So I don't believe it makes sense to do the wdidle3 on them to fix the parking timer, since they are already "old age". Better to let them die (within the warranty period) and have the fun of shipping back 4 more RMAs.

Does this make sense?

cyberjock · Jul 28, 2016

The early models were rated to 250k cycles. Some people have claimed that some of the newer models have been rated to 500k. I haven't been able to confirm an actual number so I can't vouch for whether that's true or not.

In any case, I'd still wdidle the drives. Even if they were rated for 250k cycles, them being rated for that only means the manufacturer has a high degree of confidence that's the lowest value you should see that could result in a fail condition directly (or something like that). Some people have recorded over 1 million cycles and the drive is still in great shape.
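As a rough illustration of what the timer change buys, here is a Python sketch that projects the load cycle count over the remaining warranty period. It borrows numbers already posted in this thread: the replaced drive's SMART values (319798 cycles in 6214 hours), the 300-second timer, and 1 1/4 years of warranty left. Two assumptions to flag: the drives stay powered on continuously, and a 300-second idle timer allows at most one park per 300 seconds, so 12 per hour is treated as an upper bound rather than a prediction. `projected_cycles` is a hypothetical helper.

```python
HOURS_PER_YEAR = 24 * 365


def projected_cycles(current, cycles_per_hour, years):
    """Load cycle count after `years` of continuous operation at a
    constant rate. Hypothetical helper for illustration."""
    return current + cycles_per_hour * HOURS_PER_YEAR * years


current = 319798                # Load_Cycle_Count from the posted SMART data
observed_rate = 319798 / 6214   # cycles per hour so far (about 51)
capped_rate = 3600 / 300        # assumed upper bound with a 300 s idle timer
warranty_years = 1.25           # remaining warranty mentioned in the thread

unchanged = projected_cycles(current, observed_rate, warranty_years)
capped = projected_cycles(current, capped_rate, warranty_years)

print(f"timer left alone: about {unchanged:,.0f} cycles at end of warranty")
print(f"300 s timer:      at most about {capped:,.0f} cycles at end of warranty")
```

Under those assumptions the count lands somewhere around 880k if nothing changes, versus no more than about 450k with the 300-second timer.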

gmacman · Jul 28, 2016

cyberjock said:
The early models were rated to 250k cycles. Some people have claimed that some of the newer models have been rated to 500k. I haven't been able to confirm an actual number so I can't vouch for whether that's true or not.

In any case, I'd still wdidle the drives. Even if they were rated for 250k cycles, them being rated for that only means the manufacturer has a high degree of confidence that's the lowest value you should see that could result in a fail condition directly (or something like that). Some people have recorded over 1 million cycles and the drive is still in great shape.

But if I do change the idle timer, won't that, in theory, help the drive possibly last longer? And thus perhaps live past the warranty period? Or is your point that, hey, if it could last till 500,000 (or more) and I do the timer reset, then this drive will go for a few more years and last as long as one would expect? Whereas if not, and it's like the one that failed, it will fail regardless before the warranty deadline 15 months from now and I'll do the RMA.

Side question, to do the wdidle on a drive currently in use, what specifically are my steps? Is it: offline disk, remove disk, (should I scrub prior to this), then connect to the pc I use for wdidle, fix timer, physical replace, reboot, turn back online?

cyberjock · Jul 28, 2016

What I'm saying is that even if you leave the wdidle setting alone, you aren't likely to see any of those disks RMA because of the high load cycle count. You are likely only wearing out disks that may fail prematurely after the warranty. So save yourself the heartache and simply turn on the wdidle setting. :P

As for the steps for using wdidle, I'd look at my wdidle thread as it has all the info you'll need.
