Not sure how to interpret SMART, please confirm

angelus249 · Dec 19, 2014

Hello everyone,

I run the latest 9.3.0 with all updates, on 4x 4TB disks in a RAID-Z configuration. The WebGUI storage section reports a healthy status, but I seem to have a problem with my ada3 disk. I'm not sure how to interpret the readings, though.

Code:

[root@NAS] ~# smartctl -i -H -A -l error /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA400
Serial Number:    xxx
LU WWN Device Id: 5 000039 58bd8262c
Firmware Version: FP1A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec 19 16:10:02 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7289
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       224
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       1392
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       67
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       20
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       36 (Min/Max 17/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       25
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       136
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   097   097   000    Old_age   Always       -       1392
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       296
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 378 (device log contains only the most recent five errors)
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 378 occurred at disk power-on lifetime: 1391 hours (57 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 18 58 7e ce 40  Error: WP at LBA = 0x00ce7e58 = 13532760

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 10 58 e8 a5 99 40 00      00:10:17.236  WRITE FPDMA QUEUED
  61 18 58 b8 79 40 40 00      00:10:17.236  WRITE FPDMA QUEUED
  61 10 58 70 79 40 40 00      00:10:17.235  WRITE FPDMA QUEUED
  61 20 58 88 7e 40 40 00      00:10:17.235  WRITE FPDMA QUEUED
  61 08 58 60 7e 40 40 00      00:10:17.235  WRITE FPDMA QUEUED

Error 377 occurred at disk power-on lifetime: 1391 hours (57 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 c0 d0 5b cd 40  Error: UNC at LBA = 0x00cd5bd0 = 13458384

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 00 18 49 d2 40 00      00:10:12.836  READ FPDMA QUEUED
  60 08 f8 78 a6 0a 40 00      00:10:12.824  READ FPDMA QUEUED
  60 08 f0 70 ae cf 40 00      00:10:12.796  READ FPDMA QUEUED
  60 90 e8 b8 ad 7d 40 00      00:10:12.774  READ FPDMA QUEUED
  60 48 e0 48 8d 7d 40 00      00:10:12.774  READ FPDMA QUEUED

Error 376 occurred at disk power-on lifetime: 1391 hours (57 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 10 f0 aa cd 40  Error: UNC at LBA = 0x00cdaaf0 = 13478640

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 20 10 e0 aa cd 40 00      00:10:03.618  READ FPDMA QUEUED
  61 08 08 f8 bd c0 40 00      00:10:03.618  WRITE FPDMA QUEUED
  61 08 00 f8 bb c0 40 00      00:10:03.618  WRITE FPDMA QUEUED
  61 08 f8 f8 03 40 40 00      00:10:03.618  WRITE FPDMA QUEUED
  61 08 f0 f8 01 40 40 00      00:10:03.618  WRITE FPDMA QUEUED

Error 375 occurred at disk power-on lifetime: 1391 hours (57 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 98 48 51 cc 40  Error: UNC at LBA = 0x00cc5148 = 13390152

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 a0 50 d2 d1 40 00      00:09:56.409  READ FPDMA QUEUED
  60 08 98 48 51 cc 40 00      00:09:56.409  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      00:09:56.409  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:09:56.408  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      00:09:56.408  SET MULTIPLE MODE

Error 374 occurred at disk power-on lifetime: 1391 hours (57 days + 23 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 90 48 51 cc 40  Error: UNC at LBA = 0x00cc5148 = 13390152

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 98 50 d2 d1 40 00      00:09:50.318  READ FPDMA QUEUED
  60 08 90 48 51 cc 40 00      00:09:50.318  READ FPDMA QUEUED
  60 08 88 50 b6 cd 40 00      00:09:50.318  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      00:09:50.317  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      00:09:50.317  SET FEATURES [Enable read look-ahead]
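
For anyone wondering where the "LBA = ..." figures in the error entries come from: they can be rebuilt from the register bytes printed just above them. The following is a minimal Python sketch, assuming the summary error log uses the classic 28-bit layout (SN, CL and CH as the low three bytes, plus the low nibble of DH as the top four bits). The function name is made up for illustration.

```python
def lba_from_registers(sn: int, cl: int, ch: int, dh: int) -> int:
    """Rebuild a 28-bit LBA from the SN, CL, CH and DH register bytes."""
    # Assumed layout: DH low nibble = bits 24-27, CH = 16-23, CL = 8-15, SN = 0-7.
    return ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn


# "SN CL CH DH" bytes copied from errors 378 and 377 in the log above.
for label, regs in (("378", "58 7e ce 40"), ("377", "d0 5b cd 40")):
    sn, cl, ch, dh = (int(byte, 16) for byte in regs.split())
    lba = lba_from_registers(sn, cl, ch, dh)
    print(f"Error {label}: LBA = 0x{lba:08x} = {lba}")
```

For these two entries the result matches what smartctl printed (13532760 and 13458384), so the failing sectors sit close together on the disk.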

I see "SMART overall-health self-assessment test result: PASSED", but compared with the other three disks, a few values differ. The first block below is from the other disks, the second from ada3:

Code:

  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       11
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0

Code:

  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       224
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       67
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       25
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       136

These are read errors, right? Should I be worried? The disks are fairly new.
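
The side-by-side comparison above can also be automated. This is only a rough Python sketch: it parses attribute rows in the `smartctl -A` table layout shown in this thread and reports non-zero raw values for attributes 5, 196, 197 and 198. Treating exactly those four IDs as bad-sector indicators is an assumption of the sketch, and the function names are hypothetical.

```python
# Attribute IDs treated as bad-sector indicators (an assumption of this sketch).
WATCHED_IDS = {5, 196, 197, 198}

OTHER_DISKS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       11
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
"""

ADA3 = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       224
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       67
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       25
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       136
"""


def parse_attributes(report):
    """Map attribute ID -> (name, raw value) for each attribute row."""
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows have ten columns, a numeric ID first and the raw value tenth.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def flag_bad_sector_counters(attrs):
    """Return {ID: raw value} for watched attributes with a non-zero raw value."""
    return {attr_id: raw for attr_id, (_name, raw) in attrs.items()
            if attr_id in WATCHED_IDS and raw > 0}


for label, report in (("other disks", OTHER_DISKS), ("ada3", ADA3)):
    print(label, flag_bad_sector_counters(parse_attributes(report)))
```

With the rows pasted above, nothing is flagged for the other disks, while ada3 is flagged on attributes 5, 196 and 197. G-Sense_Error_Rate is ignored on purpose because it is not in the watched set.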

Also, one last question. If I remove the drive for RMA, can I use/access my storage in the meantime (without the parity bit/disk so to speak)?

Cheers

danb35 · Dec 19, 2014

Why aren't you running regular SMART tests on your drives?

That said, yes, those look like bad sectors, and I'd probably RMA the drive. Your pool will continue to operate, though without parity. If you're using RAIDZ1, you're playing with fire at that point. Better to advance exchange the drive if possible.

Ericloewe · Dec 19, 2014

Drive's as useful as a paperweight with 136 pending sectors. As danb35 said, with RAIDZ, you're better off turning the system off and keeping it that way until you are ready to resilver with a new drive.

joeschmuck · Dec 19, 2014

If your drive is still under warranty, get it RMA'd and, as Ericloewe stated, I'd recommend turning off your NAS until you have the replacement drive in hand. For the RMA, I'd also use the advance shipment method so you can have the drive in your hands quicker. Also realize that a 4TB drive will take a long time to resilver, so when you start that up, just leave your system alone until it's done.

Follow the manual!

danb35 · Dec 19, 2014

joeschmuck said:
Follow the manual!

+1. @angelus249, the manual has click-by-click instructions for replacing a failed disk. Follow them. DO NOT try to add the new disk to your array. If you see red text that says, "You are trying to add a virtual device of type 'stripe'...", you are doing it very wrong, and need to stop lest you irreparably damage your pool.

And then set up regular SMART tests on your disks.

SirMaster · Dec 19, 2014

Have you done a scrub since you saw the pending sectors?

In my setup, I have seen a scrub (which will force a read of all the sectors) cause the pending sectors to remap successfully and bring the pending count back to 0.

Reallocated sectors and reallocated events are not great, but I have seen enough disks operate fine for years with a low number of these. Disks have extra sectors specifically for the purpose of remapping bad sectors, and the drive is designed to handle a certain number of them. The real problem is when these numbers rise regularly, meaning there is some mechanical/electromagnetic issue with the drive and you will soon exhaust the spare sectors. If it's just a bad part of the platter instead, and the reallocated sectors and events come from a single isolated incident, then the rest of the disk might still work fine for years.

danb35 · Dec 19, 2014

A scrub does not force a read of all sectors, but a long SMART test should. With a disk that's only two months old, though, these numbers are troublesome, to say the least.

SirMaster · Dec 19, 2014

Well, I didn't mean all sectors on the disk. I meant all sectors that actually have data. Presumably the bad sectors are the ones that hold data because pending sectors happen when you try to write data to the disk. The benefit of scrubbing is that ZFS can remap the pending sectors even if they are unreadable because it can use its checksums and redundancy.

But yes, you should also run a long SMART test from time to time.

angelus249 · Dec 19, 2014

Thanks gentlemen for all the feedback!

RMA: as expected. Already contacted my supplier to get a replacement drive.
SMART: yes, well, I started setting it up, including email notifications, but I got stuck at one point, then real life happened and I just forgot about it ;) Will set it up for the future.
Follow the manual: sure, I always do. Especially with regard to the core part, my data :)
RAID-Z1: Also as expected, it would work, but it's advisable not to.

Just an update: I've been out and didn't do anything since I posted here and in the meantime my pending sectors decreased from 136 back to 48.

Code:

[root@NAS] ~# smartctl -i -H -A -l error /dev/ada3
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p5 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA MD04ACA400
Serial Number:    xxx
LU WWN Device Id: 5 000039 58bd8262c
Firmware Version: FP1A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Dec 19 21:39:41 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       7289
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       15
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       240
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       1398
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       67
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       6
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       20
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       37 (Min/Max 17/51)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       27
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       48
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   253   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   097   097   000    Old_age   Always       -       1397
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       296
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
ATA Error Count: 388 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 388 occurred at disk power-on lifetime: 1394 hours (58 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 08 30 1a cd 40  Error: UNC at LBA = 0x00cd1a30 = 13441584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 10 10 e0 64 aa 40 00      02:45:42.173  READ FPDMA QUEUED
  60 08 08 30 1a cd 40 00      02:45:42.173  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      02:45:42.173  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      02:45:42.173  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      02:45:42.173  SET MULTIPLE MODE

Error 387 occurred at disk power-on lifetime: 1394 hours (58 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 00 30 1a cd 40  Error: UNC at LBA = 0x00cd1a30 = 13441584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 10 08 e0 64 aa 40 00      02:45:37.822  READ FPDMA QUEUED
  60 08 00 30 1a cd 40 00      02:45:37.822  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      02:45:37.822  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      02:45:37.822  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      02:45:37.822  SET MULTIPLE MODE

Error 386 occurred at disk power-on lifetime: 1394 hours (58 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 f8 30 1a cd 40  Error: UNC at LBA = 0x00cd1a30 = 13441584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 10 00 e0 64 aa 40 00      02:45:33.631  READ FPDMA QUEUED
  60 08 f8 30 1a cd 40 00      02:45:33.631  READ FPDMA QUEUED
  ef 02 00 00 00 00 40 00      02:45:33.631  SET FEATURES [Enable write cache]
  ef aa 00 00 00 00 40 00      02:45:33.631  SET FEATURES [Enable read look-ahead]
  c6 00 10 00 00 00 40 00      02:45:33.631  SET MULTIPLE MODE

Error 385 occurred at disk power-on lifetime: 1394 hours (58 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 98 30 1a cd 40  Error: UNC at LBA = 0x00cd1a30 = 13441584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 08 f8 00 3a ac 40 00      02:45:29.631  READ FPDMA QUEUED
  60 10 f0 e0 64 aa 40 00      02:45:29.631  READ FPDMA QUEUED
  60 08 e8 58 1d 7e 40 00      02:45:29.622  READ FPDMA QUEUED
  60 08 e0 00 1d 7e 40 00      02:45:29.617  READ FPDMA QUEUED
  60 08 d8 88 1c 7e 40 00      02:45:29.614  READ FPDMA QUEUED

Error 384 occurred at disk power-on lifetime: 1394 hours (58 days + 2 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 41 60 30 1a cd 40  Error: WP at LBA = 0x00cd1a30 = 13441584

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  61 08 90 b8 45 12 40 00      02:45:24.450  WRITE FPDMA QUEUED
  61 08 90 28 8d 7d 40 00      02:45:24.430  WRITE FPDMA QUEUED
  61 08 90 98 8d 7d 40 00      02:45:24.430  WRITE FPDMA QUEUED
  61 08 90 20 8d 7d 40 00      02:45:24.430  WRITE FPDMA QUEUED
  60 08 88 e8 f5 7d 40 00      02:45:24.420  READ FPDMA QUEUED

Anyway, I'll replace it. Thanks again for your help!

Meh, I'm really unlucky lately. My tablet is fucked up, one of my desktop PC hdds is screwed, now one of the NAS disks... let's hope it ends here ;)

angelus249 · Dec 19, 2014

Code:

  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       1064
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       4480

LoL, guess it's time to shut down the NAS :D
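
SirMaster's point earlier in the thread, that the real trouble is counters that keep rising, can be checked against the three readings posted here. A small Python sketch follows. The reading labels and the function name are made up for illustration, and only attributes 5 and 197 are used because the last post shows nothing else.

```python
# Raw values for attributes 5 and 197, copied from the three readings in this thread.
READINGS = [
    ("first post", {"Reallocated_Sector_Ct": 224, "Current_Pending_Sector": 136}),
    ("update", {"Reallocated_Sector_Ct": 240, "Current_Pending_Sector": 48}),
    ("last post", {"Reallocated_Sector_Ct": 1064, "Current_Pending_Sector": 4480}),
]


def deltas(readings):
    """Return (label, attribute, change) tuples between consecutive readings."""
    changes = []
    for (_, previous), (label, current) in zip(readings, readings[1:]):
        for name, value in current.items():
            changes.append((label, name, value - previous[name]))
    return changes


for label, name, change in deltas(READINGS):
    print(f"{label}: {name} {change:+d}")
```

The reallocated count goes up at every step (+16, then +824), and the pending count, after briefly dropping by 88, jumps by 4432. That is the steadily worsening pattern rather than a single isolated event.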
