Raw Read Errors

Glorious1 · Jan 29, 2015

I started a scrub of one of my volumes and noticed in the console right away that one of the drives had a bunch of read errors. The smartctl output (below) shows 78 raw read errors. Last time I looked it was 0. Further down it reports 5 ATA errors.

Does this mean the drive is hosed? Of the four WD Red NASs I bought, I've already returned one. This would be a 50% failure rate in less than 2 months!

Code:

 $ sudo smartctl -a /dev/ada1
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p8 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red (AF)
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0F613XU
LU WWN Device Id: 5 0014ee 65a73cfc9
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Jan 29 09:32:13 2015 MST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (38940) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 391) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       78
  3 Spin_Up_Time            0x0027   178   173   021    Pre-fail  Always       -       6100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       229
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1181
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       66
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       35
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       733
194 Temperature_Celsius     0x0022   115   114   000    Old_age   Always       -       35
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 5
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 5 occurred at disk power-on lifetime: 1180 hours (49 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 b8 74 3e 45  Error: UNC at LBA = 0x053e74b8 = 87979192

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 74 3e 45 08   7d+20:35:57.593  READ DMA
  ec 00 01 00 00 00 40 08   7d+20:35:57.593  IDENTIFY DEVICE
  c8 00 00 28 74 3e 45 08   7d+20:35:50.591  READ DMA
  e5 00 00 00 00 00 40 08   7d+20:35:50.591  CHECK POWER MODE
  c8 00 00 28 74 3e 45 08   7d+20:35:43.592  READ DMA

Error 4 occurred at disk power-on lifetime: 1180 hours (49 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 78 74 3e 45  Error: UNC at LBA = 0x053e7478 = 87979128

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 74 3e 45 08   7d+20:35:50.591  READ DMA
  e5 00 00 00 00 00 40 08   7d+20:35:50.591  CHECK POWER MODE
  c8 00 00 28 74 3e 45 08   7d+20:35:43.592  READ DMA
  c8 00 00 28 74 3e 45 08   7d+20:35:36.591  READ DMA
  c8 00 00 28 74 3e 45 08   7d+20:35:29.592  READ DMA

Error 3 occurred at disk power-on lifetime: 1180 hours (49 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 7f 74 3e 45  Error: UNC at LBA = 0x053e747f = 87979135

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 74 3e 45 08   7d+20:35:43.592  READ DMA
  c8 00 00 28 74 3e 45 08   7d+20:35:36.591  READ DMA
  c8 00 00 28 74 3e 45 08   7d+20:35:29.592  READ DMA

Error 2 occurred at disk power-on lifetime: 1180 hours (49 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 df 74 3e 45  Error: UNC at LBA = 0x053e74df = 87979231

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 74 3e 45 08   7d+20:35:36.591  READ DMA
  c8 00 00 28 74 3e 45 08   7d+20:35:29.592  READ DMA

Error 1 occurred at disk power-on lifetime: 1180 hours (49 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 c8 74 3e 45  Error: UNC at LBA = 0x053e74c8 = 87979208

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 00 28 74 3e 45 08   7d+20:35:29.592  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1087         -
# 2  Short offline       Completed without error       00%       967         -
# 3  Short offline       Completed without error       00%       663         -
# 4  Extended offline    Completed without error       00%       525         -
# 5  Short offline       Completed without error       00%       254         -
# 6  Short offline       Completed without error       00%       206         -
# 7  Short offline       Completed without error       00%       180         -
# 8  Short offline       Completed without error       00%       156         -
# 9  Short offline       Completed without error       00%       132         -
#10  Extended offline    Completed without error       00%        82         -
#11  Extended offline    Completed without error       00%        24         -
#12  Short offline       Completed without error       00%        17         -
#13  Conveyance offline  Completed without error       00%        17         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

cyberjock · Jan 29, 2015

The drive isn't hosed, but something is wrong. The values of 196-200 are zero.

So I'd tend to think you have something going on with vibration or something.. maybe the PSU isn't providing clean power and that is messing with the hard drive.

I do notice that #9 and #12 indicate that you power off your box on average every 17 hours of power-on time... we recommend disks stay on 24x7 for longevity.
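
For reference, that figure comes from dividing attribute 9 (Power_On_Hours) by attribute 12 (Power_Cycle_Count). A minimal Python sketch using the raw values from the output above; the function name is only illustrative:

```python
def hours_per_power_cycle(power_on_hours: int, power_cycle_count: int) -> float:
    """Average power-on hours between power cycles (attribute 9 / attribute 12)."""
    if power_cycle_count <= 0:
        raise ValueError("power_cycle_count must be positive")
    return power_on_hours / power_cycle_count


# Raw values from the smartctl output in the first post.
average = hours_per_power_cycle(power_on_hours=1181, power_cycle_count=66)
print(f"{average:.1f} hours per power cycle")
```

This is only an average over the drive's whole life, so a burst of reboots early on would pull the number down even if the box now stays up for weeks.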

Fraoch · Jan 29, 2015

It doesn't look like you have automatic SMART testing on; there are irregular periods between tests.

Turn automatic SMART reporting on with frequent extended testing as the drive is somewhat suspect - test at least once a week, maybe more. Make sure e-mail alerts are working.

You won't be able to return the drive as it is because there are no SMART errors or reallocated sectors, but keep a close eye on it through regular tests.
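
To see how irregular the spacing is, the LifeTime(hours) column of the self-test log in the first post can be differenced. A rough Python sketch; the helper name is made up for illustration, and the numbers are power-on hours rather than calendar time:

```python
# LifeTime(hours) column from the self-test log in the first post, newest first.
lifetimes = [1087, 967, 663, 525, 254, 206, 180, 156, 132, 82, 24, 17, 17]


def self_test_gaps(hours_newest_first):
    """Return the power-on hours between consecutive self-tests, oldest gap first."""
    ordered = sorted(hours_newest_first)
    return [later - earlier for earlier, later in zip(ordered, ordered[1:])]


gaps = self_test_gaps(lifetimes)
print(gaps)
print("longest gap:", max(gaps), "hours")
```

A regular weekly schedule on a box that stays powered would show gaps near 168 hours each, which is not what this log looks like.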

Glorious1 · Jan 29, 2015

I DO have automatic smart testing turned on, but for some reason the smart scheduling in FreeNAS is not working right for me. I'll have all the drives listed in a schedule, and then most will get the test and a few won't. Plus for a while my schedule settings omitted some drives. But I will definitely keep an eye on this one and test it weekly.

As far as the power cycles, assuming that doesn't include spindowns, it must be the sheer number of times I have had to shut down the server or reboot it for installation of something, testing, troubleshooting etc. It is much less frequent since the first month. Otherwise the server stays on 24/7.

I admit I AM flouting all the experts' recommendations with this particular volume and have it set to spin down. Except for smart tests or scrubs, it spins 20 minutes a day for a backup task and then goes back to standby. Maybe I am wavering on that policy . . .

Fraoch · Jan 29, 2015

Glorious1 said:
I DO have automatic smart testing turned on, but for some reason the smart scheduling in FreeNAS is not working right for me. I'll have all the drives listed in a schedule, and then most will get the test and a few won't. Plus for a while my schedule settings omitted some drives. But I will definitely keep an eye on this one and test it weekly.

I've seen it where the drop-down list of disks in the GUI isn't highlighting what I think it should be highlighting, missing a disk or two.

Glorious1 · Jan 29, 2015

Fraoch said:
I've seen it where the drop-down list of disks in the GUI isn't highlighting what I think it should be highlighting, missing a disk or two.

That's precisely what happened to me for a while. Now I have double- and triple-checked that all drives are selected, but it still seems that a few drives don't get tested.

cyberjock · Jan 29, 2015

Glorious1 said:
As far as the power cycles, assuming that doesn't include spindowns, it must be the sheer number of times I have had to shut down the server or reboot it for installation of something, testing, troubleshooting etc. It is much less frequent since the first month. Otherwise the server stays on 24/7.

Ah.. coffee kicked in and I do see your spindowns > power cycles.

So here's where I flog you a little.. your disks are typically only rated for about 2000 spinups/spindowns. You're already past 10% of that.. your value is 229. So for just 2 months of life, you really are burning through their designed lifetime pretty fast. :/ Your drive is showing the exact behavior we are trying to avoid by telling people not to spin down the disks. So you can ignore our advice.. but realize that you are doing exactly what we are telling you not to do, and you are seeing the exact behavior we try to avoid. Just for comparison, the 24 disks I used to have, all 3+ years old (some 4 years old), don't even have 300 yet. I've also had 3 whole failures across all 24 disks since I bought them. Compare and contrast. ;)
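
The arithmetic behind that, as a small Python sketch. The 2000 figure is the rough rating quoted in this post, not a number taken from a WD datasheet, and the projection assumes the drive keeps spinning up at the same rate per power-on hour:

```python
RATED_START_STOPS = 2000  # rough figure quoted in this post, not an official WD number
start_stop_count = 229    # attribute 4 raw value from the first post
power_cycle_count = 66    # attribute 12 raw value
power_on_hours = 1181     # attribute 9 raw value

# Share of the assumed budget already consumed.
used_fraction = start_stop_count / RATED_START_STOPS

# Starts that were not caused by a full power cycle, i.e. spin-ups from standby.
spinups_from_standby = start_stop_count - power_cycle_count

# Power-on hours at which the assumed budget would be reached at the current rate.
projected_hours = RATED_START_STOPS * power_on_hours / start_stop_count

print(f"budget used: {used_fraction * 100:.2f}%")
print("spin-ups from standby:", spinups_from_standby)
print(f"budget reached at about {projected_hours:.0f} power-on hours")
```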

Anyway, your read errors basically hint that something is wrong, but there's not enough info to prove what the problem actually is. Could be vibration (that would be my first guess). It could be a crappy power source. I recommend people mount disks fully, with all screws, and not use chassis that have those rubber grommets that are supposed to quiet the disks. Not sure if you've done all of those things or not, but that's about all the advice I can provide. The disks won't qualify for an RMA unless they are actually failing, and there is no indicator that they are actually failing. The only indicator that you have is that the first read attempt is failing, but subsequent read attempts succeed.
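
For anyone wanting to see where on the disk the five logged errors landed, the LBAs can be rebuilt from the register dump in the error log. A minimal Python sketch, assuming the usual 28-bit LBA layout (SN, CL, CH, plus the low nibble of DH), which matches the LBAs smartctl printed; the function and variable names are illustrative only:

```python
def lba28_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the ATA task-file registers.

    SN holds bits 0-7, CL bits 8-15, CH bits 16-23, and the low
    nibble of DH holds bits 24-27.
    """
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# (SN, CL, CH, DH) from the "registers were" line of each logged error.
error_registers = {
    5: (0xB8, 0x74, 0x3E, 0x45),
    4: (0x78, 0x74, 0x3E, 0x45),
    3: (0x7F, 0x74, 0x3E, 0x45),
    2: (0xDF, 0x74, 0x3E, 0x45),
    1: (0xC8, 0x74, 0x3E, 0x45),
}
error_lbas = {n: lba28_from_registers(*regs) for n, regs in error_registers.items()}

# The READ DMA that was retried: SN=28 CL=74 CH=3e DH=45 with SC=00.
# A sector count of 0 in a 28-bit READ DMA means 256 sectors.
read_start = lba28_from_registers(0x28, 0x74, 0x3E, 0x45)
read_end = read_start + 256 - 1

span_sectors = max(error_lbas.values()) - min(error_lbas.values()) + 1
all_inside_read = all(read_start <= lba <= read_end for lba in error_lbas.values())

print(error_lbas)
print(f"span: {span_sectors} sectors ({span_sectors * 512 // 1024} KiB at 512-byte logical sectors)")
print("all inside the retried read:", all_inside_read)
```

All five errors fall within one small region covered by the same repeated read request, so the log describes one trouble spot rather than errors scattered across the platter.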

Good luck.

Robert Smith · Jan 29, 2015

Note: when you replace a drive it falls out of the SMART schedule; you have to manually go and re-add the drive via the drop-down box.

Glorious1 · Jan 29, 2015

I can't find any official WD info on start/stop cycle ratings or thresholds for the Red. I'm not disagreeing, but 2000 is a shockingly low number. This SMART system is so bizarre because nobody can tell how many raw start/stop counts it takes before the VALUE reaches the THRESHOLD of 0. If people with Reds that have Start/Stop VALUE < 100 could post their start/stop line, that might shed some light on it.

This is the first I've heard that the rubber grommets are not good. I can't change that unless I gut my case and customize the whole rack arrangement, which I don't have the stomach for. But I rather suspect the drive itself, since only this one of 7 drives has shown a problem. Now the scrub is done and a long test is running on that drive.

Good to know about the SMART schedule, need to check that whenever removing or adding drives.

Fraoch · Jan 29, 2015

Glorious1 said:
If people with Reds that have Start/Stop VALUE < 100 could post their start/stop line, that might shed some light on it.

Code:

<snip>
Device Model:  WDC WD20EFRX-68EUZN0

ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  0
  3 Spin_Up_Time  0x0027  172  171  021  Pre-fail  Always  -  4400
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  24
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  097  097  000  Old_age  Always  -  2685
 10 Spin_Retry_Count  0x0032  100  253  000  Old_age  Always  -  0
 11 Calibration_Retry_Count 0x0032  100  253  000  Old_age  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  24
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  4
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  91
194 Temperature_Celsius  0x0022  123  117  000  Old_age  Always  -  24
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

From one of my 2 TB Reds.
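
Putting the two Start_Stop_Count lines posted in this thread side by side, as a small Python sketch. The parser is a hypothetical helper that assumes the ten-column layout smartctl printed above:

```python
def parse_attribute_row(row: str) -> dict:
    """Split one smartctl attribute row into the fields of interest.

    Assumes the column order ID, NAME, FLAG, VALUE, WORST, THRESH,
    TYPE, UPDATED, WHEN_FAILED, RAW_VALUE shown in this thread.
    """
    fields = row.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "raw": int(fields[9]),
    }


rows = {
    "WD30EFRX (first post)": "  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       229",
    "WD20EFRX (this post)": "  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  24",
}

for drive, row in rows.items():
    attr = parse_attribute_row(row)
    print(f"{drive}: raw={attr['raw']} value={attr['value']} thresh={attr['thresh']}")
```

Both drives still report a normalized VALUE of 100, at 24 and at 229 raw starts, so these two samples alone do not show how the raw count maps onto VALUE.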
