Do I need to change drive?

Charlie86 · Nov 1, 2020

Is this just one bad sector, or do I need to replace the drive?

Someone advised me to rewrite the bad sector, but I am not sure how to do it or whether it will even help.




root@freenas[~]# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:05 with 0 errors on Mon Oct 26 03:45:05 2020
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2     ONLINE       0     0     0

errors: No known data errors

  pool: pinja
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 22.5M in 0 days 00:06:03 with 0 errors on Sat Oct 31 05:29:20 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        pinja                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/2fa34fbe-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/2fa64f55-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/314d0b14-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/3159a97a-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/31945268-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/3175f853-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/318f7f9e-2ce7-11e9-9460-000c297069bb  ONLINE     493 5.31K    16
            gptid/31b925b3-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     1
            gptid/31e0836a-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0

errors: No known data errors


root@freenas[~]# smartctl -a /dev/da1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A1VL72
LU WWN Device Id: 5 000c50 09c760bdd
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov  1 01:40:58 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error unsupported field in scsi command
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  107) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 361) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   084   006    Pre-fail  Always       -       224197299
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  Always       -       9824
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       125677711
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       31797
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       37
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1447
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       42950328333
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       396
190 Airflow_Temperature_Cel 0x0022   064   043   045    Old_age   Always   In_the_past 36 (0 25 38 33 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       125
194 Temperature_Celsius     0x0022   036   057   000    Old_age   Always       -       36 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       30544
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       30544
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1447 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1447 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+13:25:38.924  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:35.111  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:35.111  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:33.008  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:32.991  READ FPDMA QUEUED

Error 1446 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+13:01:14.037  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 d8 03 40 40 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 d8 01 40 40 00   4d+13:01:12.815  WRITE FPDMA QUEUED

Error 1445 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+12:53:14.964  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+12:53:14.780  WRITE FPDMA QUEUED

Error 1444 occurred at disk power-on lifetime: 31786 hours (1324 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e0 ff ff ff 4f 00   4d+01:53:04.010  READ FPDMA QUEUED
  60 00 60 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.008  READ FPDMA QUEUED

Error 1443 occurred at disk power-on lifetime: 31785 hours (1324 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+01:16:58.046  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+01:16:55.626  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+01:16:55.626  WRITE FPDMA QUEUED
  61 00 08 48 04 40 40 00   4d+01:16:55.625  WRITE FPDMA QUEUED
  61 00 08 48 02 40 40 00   4d+01:16:55.625  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Pitfrr · Nov 1, 2020

I would change the drive right away...
According to attribute #198 of the SMART data, you have more than 30k bad sectors!
And a lot of errors!
The zpool status output also shows non-zero values in the last three columns, which means there were read, write and checksum errors on the pool. ZFS has corrected them, but it is still very concerning.
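
To spot rows like that without reading the whole table by eye, here is a minimal Python sketch. It assumes the `zpool status` output has already been captured as text in the layout shown above; the function name `devices_with_errors` and the `SAMPLE` string are only for illustration. The counters are kept as the strings zpool prints (for example `5.31K`) rather than converted, since the abbreviated value is approximate anyway.

```python
import re

# Matches device rows such as:
#   gptid/318f...  ONLINE     493 5.31K    16
# i.e. exactly five whitespace-separated fields, the second in upper case.
ROW = re.compile(r"^\s*(\S+)\s+([A-Z]+)\s+(\S+)\s+(\S+)\s+(\S+)\s*$")

def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        name, _state, read, write, cksum = match.groups()
        if name == "NAME":  # skip the column header row
            continue
        if (read, write, cksum) != ("0", "0", "0"):
            flagged[name] = (read, write, cksum)
    return flagged

# Excerpt of the pool table from the first post.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        pinja                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/318f7f9e-2ce7-11e9-9460-000c297069bb  ONLINE     493 5.31K    16
            gptid/31b925b3-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     1
            gptid/31e0836a-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
errors: No known data errors
"""

for device, counters in devices_with_errors(SAMPLE).items():
    print(device, counters)
```

On the excerpt this flags two disks: the one with read, write and checksum errors, and a second disk with a single checksum error.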

And your disk has more than 31k power-on hours without a single SMART test... you really should have SMART tests scheduled...

I would:

Back up the data (or make sure the backups are good)
Replace the drive as soon as possible (*)
Configure SMART tests

(*) Another concern: you have a RAIDz1 pool with 3TB drives?
RAIDz1 is discouraged with drives larger than 1 or 2TB, so there is a higher risk that another drive fails while this one is being replaced... I'm not saying it's going to happen, just that the bigger the drive, the higher the risk...

sretalla · Nov 1, 2020

Pitfrr said:
According to attribute #198 of the SMART data, you have more than 30k bad sectors!

I don't think that means what you think it means... Seagate disks (for some incomprehensible reason) store their values in a two-part structure which smartctl reads as one number, so you see a really big number when it's actually a small one.

Have a look at this:

Reading SMART Data on Seagate Drives - sdx1.net

I agree that the disk probably does need replacement as it has clearly logged bad sectors and those will only continue to grow.

Pitfrr · Nov 1, 2020

Thanks for the link on the SMART data for Seagate!!
I knew Seagate had some strange interpretation for some attributes (like #1, #7 and so on) but I didn't know that applied to #198! :-D

joeschmuck · Nov 1, 2020

Oh yes, this drive is failing hard.

sretalla said:
I don't think that means what you think it means... Seagate disks (for some incomprehensible reason) store their values in a two-part structure which smartctl reads as one number, so you see a really big number when it's actually a small one.

Have a look at this:

Reading SMART Data on Seagate Drives - sdx1.net

I agree that the disk probably does need replacement as it has clearly logged bad sectors and those will only continue to grow.

Your reference would only apply to IDs 1 and 7 for this drive (unless you have other reference material), but I do like the link and will need to update my troubleshooting guide to include it.

IDs 5, 197, 198, and 199 are still valid counts. ID 199 is zero here, but it would be a real value if it were non-zero.
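
For anyone wanting to see what the two-part structure means for IDs 1 and 7, here is a rough Python sketch. It assumes the layout commonly described for these Seagate attributes: a 48-bit raw value whose upper 16 bits hold the error count and whose lower 32 bits hold the number of operations. That layout is an assumption taken from write-ups like the linked one, not something confirmed in this thread, and the function name `split_seagate_rate` is made up for the example.

```python
def split_seagate_rate(raw):
    """Split a 48-bit raw value into (error_count, operation_count).

    Assumed layout (not confirmed by the thread):
    upper 16 bits = errors, lower 32 bits = total operations.
    """
    errors = (raw >> 32) & 0xFFFF
    operations = raw & 0xFFFFFFFF
    return errors, operations

# Raw values for IDs 1 and 7 from the smartctl output in the first post.
for name, raw in [("Raw_Read_Error_Rate", 224197299),
                  ("Seek_Error_Rate", 125677711)]:
    errors, operations = split_seagate_rate(raw)
    print(f"{name}: {errors} errors in {operations} operations")
```

Both raw values are below 2^32, so under this assumed layout the error part is zero for each. The split should not be applied to IDs 5, 197 and 198, which are plain counts.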

I'm curious why the OP didn't think about changing the drive before all these error messages piled up; I would have thought the first 5 would be the key. Also, as previously mentioned, run SMART self-tests! Set up a daily short test; I also like a weekly long/extended test on all your hard drives.

@Charlie86 My advice is to back up all your data and replace that failed drive. Then run a SMART short test and long test, and replace any failing drives one at a time. And set up routine SMART tests for all your drives.

Good Luck

Charlie86 · Nov 1, 2020

Resilvering with the new drive is in progress. Fingers crossed everything will be OK :) Thanks for the advice.

BTW: Is there any limitation with SMART tests because I run FreeNAS as a VM on ESXi? Disks are in passthrough mode.

joeschmuck · Nov 1, 2020

Charlie86 said:
Resilvering with the new drive is in progress. Fingers crossed everything will be OK :) Thanks for the advice.

BTW: Is there any limitation with SMART tests because I run FreeNAS as a VM on ESXi? Disks are in passthrough mode.

First of all, good luck on the resilvering.

Second, you should have no issues running SMART tests on drives that are in passthrough. You had no issues running smartctl, so you should be setting up routine SMART tests. As I said before, I like a daily short test and a weekly long test, but others prefer a weekly short test and a monthly long test. Since the short test takes about 2 minutes, there is no reason not to do it daily. The long test takes considerably more time (6 hours minimum), and if you have a very active pool, I'd do one drive at a time, one per day, when you think usage will be low.
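
The "one drive at a time, one per day" idea can be sketched in a few lines of Python. This only works out which weekday each drive would get its long test; it does not configure anything, and on FreeNAS the actual schedule would still be entered through the SMART test tasks. The device names `da1` to `da9` are hypothetical stand-ins for the nine disks in the pool.

```python
DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def stagger_long_tests(drives):
    """Assign each drive one weekday for its long test, cycling through the week."""
    return {drive: DAYS[i % len(DAYS)] for i, drive in enumerate(drives)}

# Hypothetical device names for a nine-disk pool.
drives = [f"da{n}" for n in range(1, 10)]

for drive, day in stagger_long_tests(drives).items():
    print(f"{drive}: short test daily, long test on {day}")
```

With nine drives and seven days, two of the days end up with two long tests each, so those would need different start hours or a schedule that spans more than one week.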
