SMART - Raw_Read_Error_Rate on WD Red?

Phobos · Sep 8, 2014

Hi!

Short version: Raw_Read_Error_Rate has crept up to 1 on a new WD Red after running SMART tests and badblocks. RMA the drive?

Long version:
System:

Supermicro X10SLH-F-O
Xeon E3-1231 v3
4 x 8GB Samsung ECC Memory (from Supermicro's compatibility list)
6 x 4TB Western Digital Reds (WD40EFRX)
Seasonic 460W Fanless (SS-460FL2 Active PFC F3)
Antec 300 Illusion
FreeNAS 9.2.1.7

Built the system, original PSU was DOA, ran memtest for 48 hours, memory checked out. Then ran SMART tests. My testing went something like this:

For each disk:
$ smartctl -t short /dev/adaX
$ smartctl -t conveyance /dev/adaX
$ smartctl -t long /dev/adaX

Then:
$ sysctl kern.geom.debugflags=0x10

And for each disk:
$ badblocks -ws /dev/adaX
$ smartctl -t long /dev/adaX
$ smartctl -A /dev/adaX
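
The sequence above can be written down as a plan before anything is run. This is only a sketch: the function name `burn_in_plan` and the device names are made up for illustration, and it prints the commands rather than executing them. Two things to keep in mind: `badblocks -w` is a destructive write test, and `smartctl -t` only starts a self-test in the background, so each test has to finish before the next one is started.

```python
from typing import List


def burn_in_plan(devices: List[str]) -> List[List[str]]:
    """Build the command sequence from the post, in order, without running it."""
    plan: List[List[str]] = []

    # First pass: the three SMART self-tests on every disk.
    for dev in devices:
        for test in ("short", "conveyance", "long"):
            plan.append(["smartctl", "-t", test, dev])

    # Allow raw writes to the disks (FreeBSD GEOM debug flag from the post).
    plan.append(["sysctl", "kern.geom.debugflags=0x10"])

    # Second pass: destructive badblocks run, another long test, then the attributes.
    for dev in devices:
        plan.append(["badblocks", "-ws", dev])
        plan.append(["smartctl", "-t", "long", dev])
        plan.append(["smartctl", "-A", dev])

    return plan


if __name__ == "__main__":
    # Hypothetical device names; substitute your own.
    for cmd in burn_in_plan(["/dev/ada0", "/dev/ada1"]):
        print(" ".join(cmd))
```

Printing the plan first makes it easy to double-check the device names before anything destructive is run.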

First time I did this everything went fine. Disks were staying around 30 deg C, they weren't aggressively spinning down, no errors reported. SMART values all looked sane. (The runtime for the long tests and badblocks seemed to vary between the drives, but I'm guessing this is due to slightly different rotation speeds?)

I rebooted and repeated the SMART and badblocks tests again. Tests passed (again), but I noticed that Raw_Read_Error_Rate is now 1 on one of the disks. This disk is the "oldest" of the bunch: I bought the drives over a period of ~2 months and was playing with FreeNAS before I had all the drives.

Should I be worried? I'm running the long tests again on that drive, and am thinking of running badblocks again. The system has been on for several weeks straight at this point, but besides testing it has been unused (been really busy... :/). Thoughts?

(I tried searching the forums for Raw_Read_Error_Rate but just found a lot of people posting their smartctl output... sorry if this has been asked many times before...)

Code:

# smartctl -A /dev/ada4
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   180   178   021    Pre-fail  Always       -       7983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       20
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       829
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       20
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       25
194 Temperature_Celsius     0x0022   123   119   000    Old_age   Always       -       29
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SirMaster · Sep 8, 2014

There is no problem here. The RAW value of that attribute often isn't meaningful as a decimal number. What you typically look at are the VALUE, WORST, and THRESH columns. When a disk starts to get read errors, VALUE drops, and you don't want it to reach THRESH: that's the level at which the disk manufacturer considers the disk failed. WORST records the lowest VALUE has ever been, since the rate of read errors can go both up and down.

All disks are going to get read errors like that and as long as it's a low enough amount of them then the disk's error correction systems keep it working reliably as intended.
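
To make the VALUE / WORST / THRESH comparison concrete, here is a minimal sketch that parses attribute rows like the ones posted above and flags any attribute whose normalized value has reached its threshold. The helper names (`parse_attributes`, `at_or_below_threshold`) are invented for this example, and the parser assumes the ten-column layout shown in this thread. A threshold of 000 is treated as "never fails", and "value less than or equal to threshold" is used as the failure condition, which is how smartmontools documents it.

```python
from typing import List, NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int
    worst: int
    thresh: int
    raw: str


# Three rows copied from the smartctl -A output posted in this thread.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
"""


def parse_attributes(text: str) -> List[SmartAttr]:
    """Parse attribute rows; skip headers and anything without a numeric ID."""
    attrs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs.append(SmartAttr(
            attr_id=int(fields[0]),
            name=fields[1],
            value=int(fields[3]),
            worst=int(fields[4]),
            thresh=int(fields[5]),
            raw=" ".join(fields[9:]),
        ))
    return attrs


def at_or_below_threshold(attrs: List[SmartAttr]) -> List[SmartAttr]:
    """Return attributes whose normalized VALUE has reached THRESH."""
    return [a for a in attrs if a.thresh > 0 and a.value <= a.thresh]


if __name__ == "__main__":
    parsed = parse_attributes(SAMPLE)
    for a in parsed:
        print(f"{a.name}: value={a.value} worst={a.worst} thresh={a.thresh} raw={a.raw}")
    print("failing:", [a.name for a in at_or_below_threshold(parsed)])
```

For the drive in this thread nothing is flagged: Raw_Read_Error_Rate sits at 200 against a threshold of 51, even though its raw value is 1.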

cyberjock · Sep 9, 2014

A value of 1 for raw error rates is cause to be cautious but not cause for concern. The fact that badblocks found no errors and the disk has no bad sectors identified by SMART tells me the disk is fine and your drive may have hit one of those UREs or some kind of temporary situation.

Anyway, I wouldn't worry about it. I'd use the disk. But I'd definitely be sure to set up a good SMART schedule as well as a good scrub schedule just as a precaution. Of course, you should be setting those up even if the raw error rate was zero.

Ericloewe · Sep 9, 2014

The raw read error rate is not particularly relevant if it's low.

One of my brand-new WD Reds almost immediately wound up with a Read Error Rate of 2 and a Multi Zone Error Rate of 1.
The latter went back to 0 after a few rounds of badblocks, the former stayed at 2. Badblocks found no errors in 8 passes.

tl;dr - just make sure the drives are monitored (and the pool is scrubbed) so you're warned if things deteriorate.

SirMaster · Sep 9, 2014

Raw read error rate would not indicate a URE if it's the only attribute that's changing.

Raw read error rate counts the raw errors before error correction is applied by the disk firmware. A URE would be an error that could not be corrected by the ECC within the disk and would show up in a different SMART attribute as well.

cyberjock · Sep 9, 2014

SirMaster said:
Raw read error rate would not indicate a URE if it's the only attribute that's changing.

Raw read error rate counts the raw errors before error correction is applied by the disk firmware. A URE would be an error that could not be corrected by the ECC within the disk and would show up in a different SMART attribute as well.

That's not true. WD, Seagate and Toshiba have different definitions for what hits the raw read error rate. ;)

That's why I said it "may" have been a URE. I can't remember which is which though. But, you'd be surprised to find that typically one out of every 1000 reads has to be corrected by ECC within the disk. There are tools that allow you to actually monitor the number of ECC corrections since the disk was powered on, and it can very quickly turn into rather high numbers. ;)

SirMaster · Sep 9, 2014

Yeah, with the sheer size and speed of hard drives, read errors are happening all the time, but most of the time the disk can correct them itself so the OS never needs to know about them.

From what I understand a URE is when the disk can't fix it by retrying or by ECC and then it's up to the OS ultimately (ZFS) to deal with the error.

My Seagate drives use the raw read error rate differently. They have a very large number in the raw column which I believe specifies just how many of these "soft" raw read errors are occurring per some timeframe, hence "rate". And the current, worst, and thresh values are a score of how low or high the current rate is compared to what the manufacturer deems acceptable.

Phobos · Sep 9, 2014

Thanks for all the replies! Long test on that drive finished with no errors. Running another round of `badblocks` just to be sure...

cyberjock said:
Anyway, I wouldn't worry about it. I'd use the disk. But I'd definitely be sure to set up a good SMART schedule as well as a good scrub schedule just as a precaution. Of course, you should be setting those up even if the raw error rate was zero.

That's the plan! (Following your recommendations from here.)

SirMaster said:
From what I understand a URE is when the disk can't fix it by retrying or by ECC and then it's up to the OS ultimately (ZFS) to deal with the error.

But that’s the whole point of the `zpool scrub`, right? The scrub checks all the data in the pool, and if it does encounter a URE it can heal itself, correct?

SirMaster said:
My Seagate drives use the raw read error rate differently. They have a very large number in the raw column which I believe specifies just how many of these "soft" raw read errors are occurring per some timeframe, hence "rate". And the current, worst, and thresh values are a score of how low or high the current rate is compared to what the manufacturer deems acceptable.

Interesting. I will keep an eye on all the drives over the next few weeks to see if the WD Reds are doing something similar. All the disks have the same VALUE and WORST (200), but only this one disk has a RAW_VALUE of 1.

BrooklynMatty · Oct 28, 2014

Ericloewe said:
The raw read error rate is not particularly relevant if it's low.

One of my brand-new WD Reds almost immediately wound up with a Read Error Rate of 2 and a Multi Zone Error Rate of 1.
The latter went back to 0 after a few rounds of badblocks, the former stayed at 2. Badblocks found no errors in 8 passes.

tl;dr - just make sure the drives are monitored (and the pool is scrubbed) so you're warned if things deteriorate.

Eric,

I have the same issue you had. I am rerunning badblocks, hoping the values will go back down. Is this RMA-worthy if they don't?

Code:

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       6
  3 Spin_Up_Time            0x0027   187   182   021    Pre-fail  Always       -       5616
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       11
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       86
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       11
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       24
194 Temperature_Celsius     0x0022   122   117   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

cyberjock · Oct 28, 2014

Unless you can run a short (or long) SMART test and have it fail, the drive generally isn't considered "worthy" of an RMA. That depends on the manufacturer of your disk though, so you'd need to see their RMA requirements.

BrooklynMatty · Oct 28, 2014

cyberjock said:
Unless you can run a short (or long) SMART test and have it fail, the drive generally isn't considered "worthy" of an RMA. That depends on the manufacturer of your disk though, so you'd need to see their RMA requirements.

Cyberjock - I followed your steps and did a short test, long test, badblocks random, short test, long test. No errors were reported by badblocks, and this was the only drive with SMART data that looked concerning. It is a WD 3TB Red drive, btw.

Ericloewe · Oct 28, 2014

Your values are a bit higher. I didn't worry too much about mine, since the error rates are poorly defined and monitoring is in place to warn me if needed.

I'd recommend you stress that drive as much as you can. If it holds up, stop worrying (and optionally start loving the bomb - or in this case, RAIDZ2). If it fails, you can RMA it and the problem is solved.

BrooklynMatty · Oct 28, 2014

Ericloewe said:
Your values are a bit higher. I didn't worry too much about mine, since the error rates are poorly defined and monitoring is in place to warn me if needed.

I'd recommend you stress that drive as much as you can. If it holds up, stop worrying (and optionally start loving the bomb - or in this case, RAIDZ2). If it fails, you can RMA it and the problem is solved.

I'm running another round of badblocks testing as we speak, just in case. I'm just so upset I need to wait another few days to get my beast up and running, can't wait :)

Thanks everyone for advice!

Robert Smith · Oct 28, 2014

For Seagate drives, the least significant 32 bits of the Raw_Read_Error_Rate and Seek_Error_Rate raw values hold the number of operations, and the most significant 16 bits hold the error count. As such, for Seagate drives, if the raw value of an error rate attribute is below 4294967296 (2^32), then no errors have occurred.
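
As a rough illustration of that bit layout, the split can be sketched in a few lines. This assumes the 48-bit layout exactly as described in this post (it has not been checked against vendor documentation here), and the function name `split_seagate_raw` is made up for the example. It does not apply to the WD Red values earlier in the thread.

```python
from typing import Tuple


def split_seagate_raw(raw: int) -> Tuple[int, int]:
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumed layout, per the post above: low 32 bits are the operation
    count, high 16 bits are the error count.
    """
    operations = raw & 0xFFFFFFFF      # least significant 32 bits
    errors = (raw >> 32) & 0xFFFF      # most significant 16 bits
    return errors, operations


if __name__ == "__main__":
    # Any raw value below 2^32 decodes to zero errors under this layout.
    for raw in (123456789, 2**32 - 1, 2**32, (3 << 32) | 1000):
        errors, operations = split_seagate_raw(raw)
        print(f"raw={raw}: errors={errors}, operations={operations}")
```

This is why a large decimal number in the raw column does not by itself mean a large number of errors on such a drive.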

cyberjock · Oct 28, 2014

Robert Smith said:
For Seagate drives, the least significant 32 bits of the Raw_Read_Error_Rate and Seek_Error_Rate raw values hold the number of operations, and the most significant 16 bits hold the error count. As such, for Seagate drives, if the raw value of an error rate attribute is below 4294967296 (2^32), then no errors have occurred.

Actually I think that's for Seagate and Hitachis.
