ATA Error Count

panicos · Aug 23, 2020

Hello,

My setup: HP MicroServer N54L with four WD Red 3TB disks.
After a power outage, FreeNAS wouldn't boot anymore. I connected a monitor to the server and saw it was stuck in a loop, showing an ATA error for one of the disks (I didn't note the error and cannot reproduce it now). Initially I thought the problem was in the boot pool, but based on that error I physically removed that disk and the system booted. I scanned the disk with HDD Sentinel, found a bad sector on it and repaired it. Now the system boots normally.
However, I ran some SMART tests on that disk and the output shows ATA errors, which I can't quite interpret.
What are those errors about? Do they refer to ATA power or data?
Based on the SMART output, is my disk about to die? Do I need to replace it?

Code:

 #sudo smartctl -a /dev/ada0
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p7 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WCC4NJTT0J7J
LU WWN Device Id: 5 0014ee 20a996dc3
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Aug 23 14:23:47 2020 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (41460) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 416) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1376
  3 Spin_Up_Time            0x0027   179   177   021    Pre-fail  Always       -       6033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       185
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   031   031   000    Old_age   Always       -       50832
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       183
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       161
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       779
194 Temperature_Celsius     0x0022   114   108   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 150 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 150 occurred at disk power-on lifetime: 50534 hours (2105 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 90 02 40 40  Error: UNC 16 sectors at LBA = 0x00400290 = 4194960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 90 02 40 40 08      00:05:11.599  READ DMA
  c8 00 10 90 02 40 40 08      00:05:08.224  READ DMA
  c8 00 10 90 02 40 40 08      00:05:04.850  READ DMA
  c8 00 10 90 02 40 40 08      00:05:01.476  READ DMA
  c8 00 10 90 02 40 40 08      00:04:58.093  READ DMA

Error 149 occurred at disk power-on lifetime: 50534 hours (2105 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 90 02 40 40  Error: UNC 16 sectors at LBA = 0x00400290 = 4194960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 90 02 40 40 08      00:05:08.224  READ DMA
  c8 00 10 90 02 40 40 08      00:05:04.850  READ DMA
  c8 00 10 90 02 40 40 08      00:05:01.476  READ DMA
  c8 00 10 90 02 40 40 08      00:04:58.093  READ DMA

Error 148 occurred at disk power-on lifetime: 50534 hours (2105 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 90 02 40 40  Error: UNC 16 sectors at LBA = 0x00400290 = 4194960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 90 02 40 40 08      00:05:04.850  READ DMA
  c8 00 10 90 02 40 40 08      00:05:01.476  READ DMA
  c8 00 10 90 02 40 40 08      00:04:58.093  READ DMA

Error 147 occurred at disk power-on lifetime: 50534 hours (2105 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 90 02 40 40  Error: UNC 16 sectors at LBA = 0x00400290 = 4194960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 90 02 40 40 08      00:05:01.476  READ DMA
  c8 00 10 90 02 40 40 08      00:04:58.093  READ DMA

Error 146 occurred at disk power-on lifetime: 50534 hours (2105 days + 14 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 10 90 02 40 40  Error: UNC 16 sectors at LBA = 0x00400290 = 4194960

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  c8 00 10 90 02 40 40 08      00:04:58.093  READ DMA

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     50832         -
# 2  Short offline       Completed without error       00%     50820         -
# 3  Short offline       Completed without error       00%     50796         -
# 4  Extended offline    Completed without error       00%     50795         -
# 5  Extended offline    Aborted by host               90%     50725         -
# 6  Extended offline    Completed: read failure       90%     50725         4194960
# 7  Extended offline    Completed without error       00%     50446         -
# 8  Extended offline    Completed without error       00%     49709         -
# 9  Extended offline    Completed without error       00%     48991         -
#10  Extended offline    Completed without error       00%     48256         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 4

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Thx.

joeschmuck · Aug 23, 2020

Your hard drive is in good shape. Nothing flags as a problem. You did have a problem a little over 100 hours ago, which could have been caused by the power outage.

I see you are running a SMART Extended/Long test about once a month and no SMART Short tests. I would recommend running a SMART Short test once a day; it takes 2 minutes to complete and will provide some confidence. I'd also recommend changing to a weekly SMART Extended/Long test for full confidence, but that test takes almost 7 hours provided you are not actively using the NAS (good to schedule while you are sleeping).

Your WD Red drive has quite a few hours on it (almost 6 years), so it should be tested a bit more often, and you should realistically expect a failure at some point, especially if you have several drives of the same age in the system. I hope they last you a few more years; that would be great to see. Look at my link below for help on SMART results.

panicos · Aug 23, 2020

joeschmuck said:
Your hard drive is in good shape. Nothing flags as a problem.

So these lines are not a problem? Should I ignore them because they are just historical?
"ATA Error Count: 150"
"Error 150 occurred at disk power-on lifetime"
"Error: UNC 16 sectors at LBA = 0x00400290 = 4194960"

joeschmuck said:
I see you are doing an SMART Extended/Long test once a month and no SMART Short tests.

Where do you see that, and how?
I actually have both SMART short and long tests scheduled: the short one on Monday, then the long one on Tuesday, and so on every weekday. I have now changed it as you suggested: one long test per week, and a short test on each of the remaining days.

joeschmuck · Aug 23, 2020

Correct. It looks like those errors occurred when you had a failure during the extended test, and they never occurred elsewhere. Just disregard them.
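
For anyone who wants to read those error entries directly, here is a minimal Python sketch of how the register dump `40 51 10 90 02 40 40` maps to "UNC 16 sectors at LBA = 0x00400290". The bit names follow the usual ATA error-register layout for read commands; the `ER_BITS` table and the `decode_error` helper are illustrative names of my own, not part of smartctl, and the sketch assumes a 28-bit LBA command such as the READ DMA (`c8`) shown in the log.

```python
# ATA error-register (ER) bits commonly reported for read commands.
# Illustrative subset only; names follow the usual ATA layout.
ER_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT", 0x01: "AMNF"}


def decode_error(er, sc, sn, cl, ch, dh):
    """Decode one 'registers were' row from the SMART error log.

    Assumes a 28-bit LBA command: the LBA is built from the low
    nibble of DH, then CH, CL and SN (most to least significant).
    """
    flags = [name for bit, name in ER_BITS.items() if er & bit]
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return flags, sc, lba


# Register values from the log above: ER=40 SC=10 SN=90 CL=02 CH=40 DH=40
flags, sectors, lba = decode_error(0x40, 0x10, 0x90, 0x02, 0x40, 0x40)
print(flags, sectors, hex(lba), lba)  # ['UNC'] 16 0x400290 4194960
```

UNC means the drive could not read the data at that address (an uncorrectable data error), so these entries are about data on the platter rather than power. All five logged entries point at the same LBA, 4194960, which is also the LBA reported by the failed extended self-test.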

Simple math: subtract the lifetime hours between the Extended test results.
I don't see that this particular drive was tested as you describe, except for tests 1, 2 and 3. Prior to that it was all Extended testing only. So it's a good thing that you now test more frequently. SMART was designed to try to notify the operator of an impending failure about 24 hours in advance. That obviously can't happen for certain hardware failures, but data failures, and some hardware failures, can generally be warned of. With the age of this drive, it's prudent to test a bit more frequently, and it sounds like you are doing that. And you seriously have a good drive.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     50832         -
# 2  Short offline       Completed without error       00%     50820         -
# 3  Short offline       Completed without error       00%     50796         -
# 4  Extended offline    Completed without error       00%     50795         -
# 5  Extended offline    Aborted by host               90%     50725         -
# 6  Extended offline    Completed: read failure       90%     50725         4194960
# 7  Extended offline    Completed without error       00%     50446         -
# 8  Extended offline    Completed without error       00%     49709         -
# 9  Extended offline    Completed without error       00%     48991         -
#10  Extended offline    Completed without error       00%     48256         -
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 4
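
The subtraction can be sketched in a few lines of Python. This is only an illustration using the log rows from this thread; the `ROW` pattern and `parse_log` helper are names I made up for the example, and the regular expression is only known to fit the rows shown here.

```python
import re

# Self-test log rows copied from the smartctl output in this thread.
LOG = """\
# 1  Short offline       Completed without error       00%     50832         -
# 2  Short offline       Completed without error       00%     50820         -
# 3  Short offline       Completed without error       00%     50796         -
# 4  Extended offline    Completed without error       00%     50795         -
# 5  Extended offline    Aborted by host               90%     50725         -
# 6  Extended offline    Completed: read failure       90%     50725         4194960
# 7  Extended offline    Completed without error       00%     50446         -
# 8  Extended offline    Completed without error       00%     49709         -
# 9  Extended offline    Completed without error       00%     48991         -
#10  Extended offline    Completed without error       00%     48256         -
"""

# Num, test kind, status, remaining %, lifetime hours, LBA of first error.
ROW = re.compile(
    r"^#\s*(\d+)\s+(Short|Extended) offline\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)$"
)


def parse_log(text):
    """Return one dict per self-test row, newest first (as smartctl prints them)."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            rows.append(
                {"kind": m.group(2), "status": m.group(3), "hours": int(m.group(5))}
            )
    return rows


rows = parse_log(LOG)

# Hours between consecutive Extended tests, newest gap first.
extended = [r["hours"] for r in rows if r["kind"] == "Extended"]
gaps = [newer - older for newer, older in zip(extended, extended[1:])]

# Hours between the most recent test and the failed Extended test.
failed = next(r["hours"] for r in rows if "read failure" in r["status"])
since_failure = rows[0]["hours"] - failed

print(gaps)           # [70, 0, 279, 737, 718, 735]
print(since_failure)  # 107
```

The three oldest gaps (737, 718 and 735 hours) are each roughly 30 days, which is where "once a month" comes from, and the failed test was 107 power-on hours before the latest one, hence "a little over 100 hours ago".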

So, I will tell you that your drive did have one value higher than I expected: ID 1 Raw_Read_Error_Rate is typically zero or very close to zero. It being over 10 made me look harder. Typically the WD Red reports only true read errors, so I can't explain it. Normally you can ignore this value on many drives because the cause is not always hardware related. For example, your computer asks for a file and the hard drive grabs it, but then the hard drive grabs the next file and puts it into internal RAM in an effort to speed up read operations, but the computer doesn't want that next file. This generates an error for ID 1. As I said, the WD Reds seem to be pretty good about not reporting these read-ahead errors. But you have no other error indicators at all.
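
As a rough illustration of "no other error indicators", here is a small Python sketch that scans a few rows of the attribute table posted above. The `WATCHED` set of IDs is my own assumption about which raw counters are commonly checked (reallocated, pending, offline-uncorrectable and CRC counts); it is not an official list, and `parse_attrs` is a hypothetical helper that only handles rows in the layout shown in this thread.

```python
# A few attribute rows copied from the smartctl output in this thread.
ATTRS = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1376
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

# Assumption: IDs whose raw value should normally stay at zero.
WATCHED = {5, 196, 197, 198, 199}


def parse_attrs(text):
    """Parse rows of: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW."""
    attrs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 10 and parts[0].isdigit():
            attrs.append(
                {
                    "id": int(parts[0]),
                    "name": parts[1],
                    "value": int(parts[3]),
                    "thresh": int(parts[5]),
                    "raw": int(parts[9]),
                }
            )
    return attrs


attrs = parse_attrs(ATTRS)

# Flag an attribute if its normalized value has dropped to the threshold,
# or if it is a watched counter with a non-zero raw value.
flagged = [
    a["name"]
    for a in attrs
    if a["value"] <= a["thresh"] or (a["id"] in WATCHED and a["raw"] != 0)
]

print(flagged)  # []
```

With the values from this drive nothing is flagged. ID 1 has a raw value of 1376, but its normalized value (200) is well above its threshold (51), so it does not trip the check.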

Also look at my Hard Drive Troubleshooting Guide link below. It will help you identify what to pay attention to, and of course, if in doubt, post it and someone will offer you some good advice.

Cheers,
-Joe

panicos · Aug 23, 2020

joeschmuck said:
Correct. It looks like those errors occurred when you had a failure during the extended test, and they never occurred elsewhere. Just disregard them.
So, I will tell you that your drive did have one value higher than I expected: ID 1 Raw_Read_Error_Rate is typically zero or very close to zero. It being over 10 made me look harder. Typically the WD Red reports only true read errors, so I can't explain it. Normally you can ignore this value on many drives because the cause is not always hardware related. For example, your computer asks for a file and the hard drive grabs it, but then the hard drive grabs the next file and puts it into internal RAM in an effort to speed up read operations, but the computer doesn't want that next file. This generates an error for ID 1. As I said, the WD Reds seem to be pretty good about not reporting these read-ahead errors. But you have no other error indicators at all.

Also look at my Hard Drive Troubleshooting Guide link below. It will help you identify what to pay attention to, and of course, if in doubt, post it and someone will offer you some good advice.

Thank you, Joe.
