Do I need to change the drive?

Charlie86

Explorer
Joined
Sep 28, 2017
Messages
71
Is this just one bad sector, or do I need to replace the drive?

Someone advised me to rewrite the bad sector, but I am not sure how to do it or whether it would even help.


root@freenas[~]# zpool status
  pool: freenas-boot
 state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:05 with 0 errors on Mon Oct 26 03:45:05 2020
config:

        NAME          STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          da0p2       ONLINE       0     0     0

errors: No known data errors

  pool: pinja
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 22.5M in 0 days 00:06:03 with 0 errors on Sat Oct 31 05:29:20 2020
config:

        NAME                                            STATE     READ WRITE CKSUM
        pinja                                           ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/2fa34fbe-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/2fa64f55-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/314d0b14-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/3159a97a-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/31945268-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/3175f853-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0
            gptid/318f7f9e-2ce7-11e9-9460-000c297069bb  ONLINE     493 5.31K    16
            gptid/31b925b3-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     1
            gptid/31e0836a-2ce7-11e9-9460-000c297069bb  ONLINE       0     0     0

errors: No known data errors


root@freenas[~]# smartctl -a /dev/da1
smartctl 7.0 2018-12-30 r4883 [FreeBSD 11.3-RELEASE-p11 amd64] (local build)
Copyright (C) 2002-18, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate NAS HDD
Device Model:     ST3000VN000-1HJ166
Serial Number:    W6A1VL72
LU WWN Device Id: 5 000c50 09c760bdd
Firmware Version: SC60
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5900 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Sun Nov  1 01:40:58 2020 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status command failed: scsi error unsupported field in scsi command
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 107) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 361) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x10bd) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   105   084   006    Pre-fail  Always       -       224197299
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       37
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  Always       -       9824
  7 Seek_Error_Rate         0x000f   081   060   030    Pre-fail  Always       -       125677711
  9 Power_On_Hours          0x0032   064   064   000    Old_age   Always       -       31797
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       37
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       1447
188 Command_Timeout         0x0032   100   096   000    Old_age   Always       -       42950328333
189 High_Fly_Writes         0x003a   001   001   000    Old_age   Always       -       396
190 Airflow_Temperature_Cel 0x0022   064   043   045    Old_age   Always   In_the_past 36 (0 25 38 33 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       36
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       125
194 Temperature_Celsius     0x0022   036   057   000    Old_age   Always       -       36 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   001   001   000    Old_age   Always       -       30544
198 Offline_Uncorrectable   0x0010   001   001   000    Old_age   Offline      -       30544
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 1447 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1447 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+13:25:38.924  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:35.111  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:35.111  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:33.008  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+13:25:32.991  READ FPDMA QUEUED

Error 1446 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+13:01:14.037  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 d8 03 40 40 00   4d+13:01:12.815  WRITE FPDMA QUEUED
  61 00 08 d8 01 40 40 00   4d+13:01:12.815  WRITE FPDMA QUEUED

Error 1445 occurred at disk power-on lifetime: 31797 hours (1324 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+12:53:14.964  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+12:53:14.963  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+12:53:14.780  WRITE FPDMA QUEUED

Error 1444 occurred at disk power-on lifetime: 31786 hours (1324 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 e0 ff ff ff 4f 00   4d+01:53:04.010  READ FPDMA QUEUED
  60 00 60 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.009  READ FPDMA QUEUED
  60 00 20 ff ff ff 4f 00   4d+01:53:04.008  READ FPDMA QUEUED

Error 1443 occurred at disk power-on lifetime: 31785 hours (1324 days + 9 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  60 00 20 ff ff ff 4f 00   4d+01:16:58.046  READ FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+01:16:55.626  WRITE FPDMA QUEUED
  61 00 08 ff ff ff 4f 00   4d+01:16:55.626  WRITE FPDMA QUEUED
  61 00 08 48 04 40 40 00   4d+01:16:55.625  WRITE FPDMA QUEUED
  61 00 08 48 02 40 40 00   4d+01:16:55.625  WRITE FPDMA QUEUED

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



 

Pitfrr

Wizard
Joined
Feb 10, 2014
Messages
1,531
I would change the drive right away...
According to attribute #198 of the SMART data, you have over 30k bad sectors!
And a lot of errors!
The zpool status also shows non-zero values in the last three columns, which means there were read, write and checksum errors on the pool. ZFS has corrected them, but it is still very concerning.
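
To make the "last three columns" concrete, here is a minimal Python sketch that picks out the pool members with non-zero READ/WRITE/CKSUM counters. The sample rows are copied from the `zpool status` output in this thread. The helper names `parse_count` and `devices_with_errors` are made up for illustration, and treating the `K` suffix as a multiple of 1024 is an assumption about how zpool abbreviates counters, so the expanded number is only approximate.

```python
# Sample rows copied from the `zpool status` output in this thread.
SAMPLE = """\
NAME STATE READ WRITE CKSUM
gptid/3175f853-2ce7-11e9-9460-000c297069bb ONLINE 0 0 0
gptid/318f7f9e-2ce7-11e9-9460-000c297069bb ONLINE 493 5.31K 16
gptid/31b925b3-2ce7-11e9-9460-000c297069bb ONLINE 0 0 1
"""

# Exponent applied to `base` for each suffix.
SUFFIXES = {"K": 1, "M": 2, "G": 3, "T": 4}


def parse_count(text, base=1024):
    """Turn an abbreviated counter such as '5.31K' into an approximate integer.

    base=1024 is an assumption about how zpool abbreviates its counters.
    """
    text = text.strip()
    if text and text[-1].upper() in SUFFIXES:
        return round(float(text[:-1]) * base ** SUFFIXES[text[-1].upper()])
    return int(text)


def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter."""
    flagged = {}
    for line in status_text.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue
        name, _state, *counters = fields
        try:
            read, write, cksum = (parse_count(c) for c in counters)
        except ValueError:
            continue  # header row or other non-numeric line
        if read or write or cksum:
            flagged[name] = (read, write, cksum)
    return flagged


if __name__ == "__main__":
    for dev, (r, w, c) in devices_with_errors(SAMPLE).items():
        print(f"{dev}: read={r} write={w} cksum={c}")
```

On the sample this flags two members: the drive with errors in all three columns and a second drive with a single checksum error.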

And your disk has over 31k power-on hours without ever having run a SMART test... you really should have those SMART tests scheduled...

I would:
  • Back up the data (or make sure the backups are good)
  • Replace the drive as soon as possible (*)
  • Configure SMART tests

(*) Another concern: you're running a RAIDz1 pool with 3TB drives?
RAIDz1 is discouraged with drives larger than 1 or 2TB, so there is a higher risk that another drive could fail while the bad one is being replaced... I'm not saying it's going to happen, just that the bigger the drive, the higher the risk... :)
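
One common way to put a rough number on "the bigger the drive, the higher the risk" is the back-of-the-envelope estimate sketched below in Python. It rests on several assumptions that are not from this thread: a non-recoverable read error rate of 1 per 10^14 bits (a typical datasheet figure, not something measured on this pool), the 8 surviving drives being read in full, and errors being independent. A ZFS resilver only reads allocated blocks, and a single read error normally damages a block rather than the whole pool, so treat the result as a pessimistic illustration, not a prediction. `p_read_error` is a made-up helper name.

```python
import math


def p_read_error(surviving_drives, drive_bytes, ure_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while reading
    every surviving drive in full (Poisson approximation).

    ure_per_bit is an assumed datasheet-style rate, not a measured value.
    """
    bits_read = surviving_drives * drive_bytes * 8
    expected_errors = bits_read * ure_per_bit
    return -math.expm1(-expected_errors)  # equals 1 - exp(-expected_errors)


if __name__ == "__main__":
    # 9-wide RAIDz1 as in this thread: 8 drives are read to rebuild the 9th.
    for size_tb in (1, 2, 3):
        p = p_read_error(surviving_drives=8, drive_bytes=size_tb * 1e12)
        print(f"{size_tb} TB drives: {p:.0%} chance of at least one read error")
```

The point of the sketch is only the trend: under the same assumptions, the estimated chance of hitting a read error during the rebuild grows with drive size.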
 

sretalla

Powered by Neutrality
Moderator
Joined
Jan 1, 2016
Messages
9,703
Pitfrr said: "According to attribute #198 of the SMART data, you have over 30k bad sectors!"
I don't think that means what you think it means... Seagate disks (for some incomprehensible reason) store their values in a two-part structure which smartctl reads as a single number, so you see a really big number when it's actually a small one.

Have a look at this:

I agree that the disk probably does need replacement as it has clearly logged bad sectors and those will only continue to grow.
 

Pitfrr

Thanks for the link on the SMART data for Seagate!!
I knew Seagate had some strange interpretation for some attributes (like #1, #7 and so on) but I didn't know that applied to #198! :-D
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Oh yes, this drive is failing hard.

sretalla said:
"I don't think that means what you think it means... Seagate disks (for some incomprehensible reason) store their values in a two-part structure which smartctl reads as a single number, so you see a really big number when it's actually a small one.

Have a look at this:

I agree that the disk probably does need replacement as it has clearly logged bad sectors and those will only continue to grow."

Your reference would only apply to IDs 1 and 7 for this drive (unless you have other reference material), but I do like the link and will need to update my troubleshooting guide to include the reference material.

IDs 5, 197, 198, and 199 are still valid counts. ID 199 is zero here, but it would be a real value if it were non-zero.
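
As a sanity check on which raw values could be packed and which read as plain counts, here is a small Python sketch. It assumes the commonly described Seagate layout for IDs 1 and 7: a 48-bit raw value whose upper 16 bits are an error count and whose lower 32 bits are an operation count. That layout comes from community write-ups, not from a Seagate specification, and `split_seagate_raw` is a made-up helper name. The raw values are the ones from the smartctl output in this thread.

```python
def split_seagate_raw(raw):
    """Split a 48-bit raw value into (upper 16 bits, lower 32 bits).

    Assumed meaning for Seagate IDs 1 and 7: (error count, operation count).
    """
    return (raw >> 32) & 0xFFFF, raw & 0xFFFFFFFF


# Raw values copied from the smartctl attribute table in this thread.
RAW_VALUES = {
    "1 Raw_Read_Error_Rate": 224197299,
    "7 Seek_Error_Rate": 125677711,
    "5 Reallocated_Sector_Ct": 9824,
    "197 Current_Pending_Sector": 30544,
    "198 Offline_Uncorrectable": 30544,
}

if __name__ == "__main__":
    for name, raw in RAW_VALUES.items():
        upper, lower = split_seagate_raw(raw)
        print(f"{name}: raw={raw} upper16={upper} lower32={lower}")
```

For IDs 1 and 7 the large raw number sits entirely in the lower 32 bits, the part assumed to be an operation counter. For IDs 5, 197 and 198 the raw values are far below 2^16, so no split of this kind changes them, which is consistent with reading them as plain counts.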

I'm curious why the OP didn't think about changing the drive before accumulating all these error messages; I would have thought the first 5 would be the key. Also, as previously mentioned, run SMART self-tests! Set up a daily short test; I also like a weekly long/extended test on all your hard drives.

@Charlie86 My advice is to back up all your data and replace that failed drive. Then run a SMART short test and a long test on the other drives, and replace any failing drives one at a time. Finally, set up routine SMART tests for all your drives.

Good Luck
 

Charlie86

Resilvering with the new drive is in progress. Fingers crossed everything will be OK :) Thanks for the advice.

BTW: Are there any limitations with SMART tests because I run FreeNAS as a VM on ESXi? The disks are in passthrough mode.
 

joeschmuck

First of all, good luck on the resilvering.

Second, you should have no issues running SMART tests on drives that are in passthrough. You had no issues running smartctl, so you should be setting up routine SMART tests. As I said before, I like a daily short test and a weekly long test, but others prefer a weekly short test and a monthly long test. Since the short test takes about 2 minutes, there is no reason not to do it daily. The long test takes considerably more time (6 hours minimum), and if you have a very active pool, I'd do one drive at a time, one per day, when you think usage will be low.
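
The "one drive at a time, one per day" idea for long tests can be sketched in Python as follows. The device names `da1` to `da9` are placeholders for the nine pool members and `plan_tests` is a made-up helper; on FreeNAS the schedule would normally be configured as S.M.A.R.T. test tasks in the web UI rather than with a script. The only real command referenced is `smartctl -t short` / `smartctl -t long`, and the sketch only prints the commands instead of running them.

```python
from datetime import date


def plan_tests(drives, day):
    """Return the smartctl commands for `day`: a long test for the one drive
    whose turn it is, and a short test for every other drive."""
    long_drive = drives[day.toordinal() % len(drives)]
    commands = []
    for drive in drives:
        kind = "long" if drive == long_drive else "short"
        commands.append(["smartctl", "-t", kind, f"/dev/{drive}"])
    return commands


if __name__ == "__main__":
    drives = [f"da{i}" for i in range(1, 10)]  # placeholder device names
    for cmd in plan_tests(drives, date.today()):
        print(" ".join(cmd))  # dry run: print only, do not execute
```

With nine drives this rotation gives each drive a long test every nine days rather than weekly, which is the price of never running two long tests on the same day.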
 