Zpool DEGRADED - Device Faulted (Read Write Checksum) too many errors

Status
Not open for further replies.

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Hi
I am having an issue with a hard drive that gets faulted from time to time with lots of read, write and checksum errors, leaving the zpool in a DEGRADED state.
I know which hard drive is causing the problem, but I am not sure what the underlying problem is.
Is the hard drive dying? Does it not need to be replaced yet? Or is the problem a cable or something else? o_O

I noticed that restarting the server usually brings the hard drive back online, and it goes into a resilvering state until that is done. Then it continues to function online until some time passes and this happens again. So I restart, and the cycle repeats.
I logged it manually and this happens every 2-3 weeks. (This hard drive is not even half a year old.)
I have been looking for a troubleshooting guide on the forum but found nothing :eek:
I would like to understand how to interpret these messages and how to proceed.
I don't need advice saying "Replace the disk" without explaining what all these messages mean :rolleyes:
Actually I already have a disk here, ready as a replacement if needed. But I want to understand these messages so I can handle future issues myself and troubleshoot other problems more easily :D
I am not even sure I am running the correct SMART commands :oops: so feel free to comment on that too


I tried to run a long SMART test on the drive several times with "smartctl -t long /dev/da6".
But it failed, if I understand the "smartctl -l selftest /dev/da6" output correctly:
Code:
=== START OF READ SMART DATA SECTION ===   
SMART Self-test log structure revision number 1   
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error   
# 1  Extended offline  Completed: read failure  90%  7675  110904936   
# 2  Short offline  Completed without error  00%  7674  -   
# 3  Extended offline  Completed: read failure  90%  7162  323967016   
# 4  Extended offline  Completed: read failure  90%  7150  229286488 
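For what it's worth, here is how I tally the failed runs from a saved copy of that log (a sketch; the sample lines below just stand in for the real smartctl output, and selftest.log is my own filename):

```shell
# Sketch: count failed self-test runs in a saved copy of the log
# (smartctl -l selftest /dev/da6 > selftest.log). The sample lines
# below stand in for a real log.
cat > selftest.log <<'EOF'
# 1  Extended offline  Completed: read failure  90%  7675  110904936
# 2  Short offline  Completed without error  00%  7674  -
# 3  Extended offline  Completed: read failure  90%  7162  323967016
# 4  Extended offline  Completed: read failure  90%  7150  229286488
EOF
grep -c 'read failure' selftest.log   # -> 3
```

The grep is just pattern-matching the status column; nothing in the command is drive-specific.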



And the longer output from "smartctl -a /dev/da6" says:
Code:
=== START OF INFORMATION SECTION ===   
Model Family:  Seagate NAS HDD   
Device Model:  ST4000VN000-1H4168   
Serial Number:  S300ZDEF   
LU WWN Device Id: 5 000c50 0758aa501   
Firmware Version: SC46   
User Capacity:  4,000,787,030,016 bytes [4.00 TB]   
Sector Sizes:  512 bytes logical, 4096 bytes physical   
Rotation Rate:  5900 rpm   
Form Factor:  3.5 inches   
Device is:  In smartctl database [for details use: -P show]   
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b   
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)   
Local Time is:  Sun Jan 24 21:43:39 2016 CET   
SMART support is: Available - device has SMART capability.   
SMART support is: Enabled   
   
=== START OF READ SMART DATA SECTION ===   
SMART overall-health self-assessment test result: PASSED   
   
General SMART Values:   
Offline data collection status:  (0x00) Offline data collection activity   
  was never started.   
  Auto Offline Data Collection: Disabled.   
Self-test execution status:  (  0) The previous self-test routine completed   
  without error or no self-test has ever   
  been run.   
Total time to complete Offline   
data collection:  (  117) seconds.   
Offline data collection   
capabilities:  (0x73) SMART execute Offline immediate.   
  Auto Offline data collection on/off support.   
  Suspend Offline collection upon new   
  command.   
  No Offline surface scan supported.   
  Self-test supported.   
  Conveyance Self-test supported.   
  Selective Self-test supported.   
SMART capabilities:  (0x0003) Saves SMART data before entering   
  power-saving mode.   
  Supports SMART auto save timer.   
Error logging capability:  (0x01) Error logging supported.   
  General Purpose Logging supported.   
Short self-test routine   
recommended polling time:  (  1) minutes.   
Extended self-test routine  
recommended polling time:  ( 533) minutes.   
Conveyance self-test routine   
recommended polling time:  (  2) minutes.   
SCT capabilities:  (0x10bd) SCT Status supported.   
  SCT Error Recovery Control supported.   
  SCT Feature Control supported.   
  SCT Data Table supported.   
   
SMART Attributes Data Structure revision number: 10   
Vendor Specific SMART Attributes with Thresholds:   
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE   
  1 Raw_Read_Error_Rate  0x000f  109  093  006  Pre-fail  Always  -  232940963   
  3 Spin_Up_Time  0x0003  092  092  000  Pre-fail  Always  -  0   
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  72   
  5 Reallocated_Sector_Ct  0x0033  097  097  010  Pre-fail  Always  -  4464   
  7 Seek_Error_Rate  0x000f  082  060  030  Pre-fail  Always  -  174781515   
  9 Power_On_Hours  0x0032  092  092  000  Old_age  Always  -  7674   
 10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0   
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  72   
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0   
187 Reported_Uncorrect  0x0032  001  001  000  Old_age  Always  -  1214   
188 Command_Timeout  0x0032  100  098  000  Old_age  Always  -  62   
189 High_Fly_Writes  0x003a  100  100  000  Old_age  Always  -  0   
190 Airflow_Temperature_Cel 0x0022  071  060  045  Old_age  Always  -  29 (Min/Max 18/32)   
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  0   
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  72   
193 Load_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  72   
194 Temperature_Celsius  0x0022  029  040  000  Old_age  Always  -  29 (0 18 0 0 0)   
197 Current_Pending_Sector  0x0012  099  098  000  Old_age  Always  -  296   
198 Offline_Uncorrectable  0x0010  099  098  000  Old_age  Offline  -  296   
199 UDMA_CRC_Error_Count  0x003e  200  200  000  Old_age  Always  -  0   
   
SMART Error Log Version: 1   
ATA Error Count: 1214 (device log contains only the most recent five errors)   
  CR = Command Register [HEX]   
  FR = Features Register [HEX]   
  SC = Sector Count Register [HEX]   
  SN = Sector Number Register [HEX]   
  CL = Cylinder Low Register [HEX]   
  CH = Cylinder High Register [HEX]   
  DH = Device/Head Register [HEX]   
  DC = Device Command Register [HEX]   
  ER = Error register [HEX]   
  ST = Status register [HEX]   
Powered_Up_Time is measured from power on, and printed as   
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,   
SS=sec, and sss=millisec. It "wraps" after 49.710 days.   
   
Error 1214 occurred at disk power-on lifetime: 7646 hours (318 days + 14 hours)   
  When the command that caused the error occurred, the device was active or idle.   
   
  After command completion occurred, registers were:   
  ER ST SC SN CL CH DH   
  -- -- -- -- -- -- --   
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455   
   
  Commands leading to the command that caused the error were:   
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name   
  -- -- -- -- -- -- -- --  ----------------  --------------------   
  60 00 c0 ff ff ff 4f 00  47d+06:34:49.832  READ FPDMA QUEUED   
  60 00 10 90 02 40 40 00  47d+06:34:49.828  READ FPDMA QUEUED   
  ef 10 02 00 00 00 00 00  47d+06:34:49.755  SET FEATURES [Enable SATA feature]   
  ef 02 00 00 00 00 00 00  47d+06:34:49.755  SET FEATURES [Enable write cache]   
  ef aa 00 00 00 00 00 00  47d+06:34:49.755  SET FEATURES [Enable read look-ahead]   
   
Error 1213 occurred at disk power-on lifetime: 7646 hours (318 days + 14 hours)   
  When the command that caused the error occurred, the device was active or idle.   
   
  After command completion occurred, registers were:   
  ER ST SC SN CL CH DH   
  -- -- -- -- -- -- --   
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455   
   
  Commands leading to the command that caused the error were:   
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name   
  -- -- -- -- -- -- -- --  ----------------  --------------------   
  60 00 10 90 02 40 40 00  47d+06:34:37.581  READ FPDMA QUEUED   
  60 00 c0 ff ff ff 4f 00  47d+06:34:37.577  READ FPDMA QUEUED   
  60 00 10 ff ff ff 4f 00  47d+06:34:37.577  READ FPDMA QUEUED   
  60 00 10 ff ff ff 4f 00  47d+06:34:37.577  READ FPDMA QUEUED   
  ef 10 02 00 00 00 00 00  47d+06:34:37.352  SET FEATURES [Enable SATA feature]   
   
Error 1212 occurred at disk power-on lifetime: 7646 hours (318 days + 14 hours)   
  When the command that caused the error occurred, the device was active or idle.   
   
  After command completion occurred, registers were:   
  ER ST SC SN CL CH DH   
  -- -- -- -- -- -- --   
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455   
   
  Commands leading to the command that caused the error were:   
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name   
  -- -- -- -- -- -- -- --  ----------------  --------------------   
  60 00 10 ff ff ff 4f 00  47d+06:34:28.827  READ FPDMA QUEUED   
  60 00 10 ff ff ff 4f 00  47d+06:34:28.827  READ FPDMA QUEUED   
  60 00 10 90 02 40 40 00  47d+06:34:28.827  READ FPDMA QUEUED   
  60 00 c0 ff ff ff 4f 00  47d+06:34:28.827  READ FPDMA QUEUED   
  60 00 80 ff ff ff 4f 00  47d+06:34:28.827  READ FPDMA QUEUED   
Error 1211 occurred at disk power-on lifetime: 7646 hours (318 days + 14 hours)   
  When the command that caused the error occurred, the device was active or idle.   
   
  After command completion occurred, registers were:   
  ER ST SC SN CL CH DH   
  -- -- -- -- -- -- --   
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455   
   
  Commands leading to the command that caused the error were:   
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name   
  -- -- -- -- -- -- -- --  ----------------  --------------------   
  60 00 20 ff ff ff 4f 00  47d+06:34:25.080  READ FPDMA QUEUED   
  60 00 80 ff ff ff 4f 00  47d+06:34:25.077  READ FPDMA QUEUED   
  61 00 20 ff ff ff 4f 00  47d+06:34:25.077  WRITE FPDMA QUEUED   
  60 00 00 ff ff ff 4f 00  47d+06:34:25.077  READ FPDMA QUEUED   
  ea 00 00 00 00 00 00 00  47d+06:34:25.076  FLUSH CACHE EXT   
   
Error 1210 occurred at disk power-on lifetime: 7646 hours (318 days + 14 hours)   
  When the command that caused the error occurred, the device was active or idle.   
   
  After command completion occurred, registers were:   
  ER ST SC SN CL CH DH   
  -- -- -- -- -- -- --   
  40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455   
   
  Commands leading to the command that caused the error were:   
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name   
  -- -- -- -- -- -- -- --  ----------------  --------------------   
  60 00 00 ff ff ff 4f 00  47d+06:34:20.326  READ FPDMA QUEUED   
  60 00 20 ff ff ff 4f 00  47d+06:34:20.326  READ FPDMA QUEUED   
  60 00 20 ff ff ff 4f 00  47d+06:34:20.326  READ FPDMA QUEUED   
  60 00 20 ff ff ff 4f 00  47d+06:34:20.326  READ FPDMA QUEUED   
  60 00 20 ff ff ff 4f 00  47d+06:34:20.326  READ FPDMA QUEUED   
   
SMART Self-test log structure revision number 1   
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error   
# 1  Short offline  Completed without error  00%  7674  -   
# 2  Extended offline  Completed: read failure  90%  7162  323967016   
# 3  Extended offline  Completed: read failure  90%  7150  229286488   
   
SMART Selective self-test log data structure revision number 1   
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS   
  1  0  0  Not_testing   
  2  0  0  Not_testing   
  3  0  0  Not_testing   
  4  0  0  Not_testing   
  5  0  0  Not_testing
Selective self-test flags (0x0):   
  After scanning selected spans, do NOT read-scan remainder of disk.   
If Selective self-test is pending on power-up, resume after 0 minute delay. 


FreeNAS-9.3
Raid z3 pool
ECC RAM
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
That is one BAD DRIVE! Back up your data NOW and replace the drive. IDs 5, 197, and 198 tell it all. Look at my tagline for "Decode Your SMART Data" and click on the link. It will explain this stuff to you, but the drive is very bad.

EDIT: I noticed the power-on hours are very low. Was this an RMA'd drive or a refurb unit?
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
SuperMicro MBD-X10SL7-F-O
Intel Core i3-4150 3,5GHz 4MB Socket 1150
4x Samsung 8GB DDR3L ECC 1600MHz 1.35V UDIMM
IBM ServeRAID M1015 (IT Mode)
Multilane SAS-kabel SFF-8087
SEAGATE NAS 4TB (x11) Raidz3 pool
SEAGATE 3TB (x5) Raidz3 pool

No, this was a brand-new Seagate NAS drive.
Does the data say it is older? (Could it be a refurb unit not labeled as refurb?!)

The SMART link was good. But the question is, what are good values here?
For example, should 197 Current_Pending_Sector or 198 Offline_Uncorrectable be below 20 to be considered good enough? Or above 100 to be a bad drive?

How dangerous is it to have this disk in a raidz3 pool? I mean, is the worst thing that could happen that it dies completely? Or is there a risk of data corruption because of one failing disk in a raidz3 pool?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Why wait for it to die completely when you can replace it now under warranty? This is why you have redundancy in your pool, so it can still operate while you source a replacement.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
I am already replacing it at this moment, just curious whether one dying hard drive can destroy anything permanently in a raidz3 pool, like corrupting the pool or anything else. It's an answer that helps me understand better :)
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Does the data say it is older? (Could it be a refurb unit not labeled as refurb?!)
Nope, just asking because it failed so quickly. RMA the drive.

For example, should 197 Current_Pending_Sector or 198 Offline_Uncorrectable be below 20 to be considered good enough? Or above 100 to be a bad drive?
A value of "0" (ZERO) is good. Values above zero indicate a problem with the heads, the platters, or both. If I had a value of 5, I'd run a SMART long test daily for maybe a month to see if the value increases. If it increases, I'd replace the drive. If ID 5 starts creeping up then again, it's bad news. Zero is where it should be; anything above that is a valid RMA.
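That watch-the-trend routine is easy to script. A minimal sketch, assuming you saved the attribute table with smartctl -A /dev/da6 > attrs.txt (the sample lines below are pasted in so the awk has something to read; the filename is just an example):

```shell
# Sketch: flag any nonzero raw value for the three attributes that
# matter most for media health (5, 197, 198). Zero is healthy.
cat > attrs.txt <<'EOF'
  5 Reallocated_Sector_Ct   0x0033  097  097  010  Pre-fail  Always  -  4464
  9 Power_On_Hours          0x0032  092  092  000  Old_age   Always  -  7674
197 Current_Pending_Sector  0x0012  099  098  000  Old_age   Always  -  296
198 Offline_Uncorrectable   0x0010  099  098  000  Old_age   Offline -  296
EOF
awk '($1==5 || $1==197 || $1==198) && $NF+0 > 0 {
    print "WARN: ID " $1 " (" $2 ") raw=" $NF
}' attrs.txt
```

Run something like this daily from cron and compare against yesterday's file; a raw value that keeps growing on ID 5 or 197 is exactly the trend described above.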

just curious whether one dying hard drive can destroy anything permanently in a raidz3 pool
No. That is why you have RAIDZ3: you can have up to three hard drives fail without risking your data.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
From this, your drive is dead.



This might be from a bad cable. Please post your full hardware.

Regarding the cable, is the "60 00 c0 ff ff ff 4f 00  47d+06:34:49.832  READ FPDMA QUEUED" line indicating something wrong with it?
What more can I do about this?
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Nope, just asking because it failed so quickly. RMA the drive.


A value of "0" (ZERO) is good. Values above zero indicate a problem with the heads, the platters, or both. If I had a value of 5, I'd run a SMART long test daily for maybe a month to see if the value increases. If it increases, I'd replace the drive. If ID 5 starts creeping up then again, it's bad news. Zero is where it should be; anything above that is a valid RMA.


No. That is why you have RAIDZ3: you can have up to three hard drives fail without risking your data.

I myself was surprised the drive is already faulting; this is a NAS disk and is supposed to last much longer. So I thought I'd ask here what all this means and how serious it is before replacing :)
Anyway, the replacement drive is already installed and resilvering. And tomorrow I will check the disk in Windows with SeaTools so I have an RMA reason.

When resilvering or scrubbing it says "resilvered ....... with 0 errors".
When is there a risk of errors during resilvering or scrubbing? Only in case I have 2 dead drives in my raidz3? Or are there other scenarios that I need to be aware of?
 

Bulldog

Dabbler
Joined
Jan 16, 2016
Messages
18
This is why I stay far away from Seagate drives. The last two I had fail both died within a day of each other, both with the click of death, not even a year old. WD/HGST is where I've gone.
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
This is why I stay far away from Seagate drives. The last two I had fail both died within a day of each other, both with the click of death, not even a year old. WD/HGST is where I've gone.
Seagates are usually a few bucks cheaper, is the thing. If money is tight, and you have to make a penny scream, then the 5TB Toshiba for around $140US is something I'd recommend over any Seagate model at any price.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
OK guys, you got me :oops: I haven't set up SMART tests. The reason is I am still learning FreeNAS. But now this is prioritized and I will dig into it in the next few days.

Regarding disk brands, I am a Seagate fan :D I've had very good experience with Seagate disks, unlike WD.
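From what I've read so far, FreeNAS schedules these tests through smartd under the hood, so the config it manages looks roughly like this (a sketch of smartd.conf entries, not necessarily what the GUI actually writes out; the device name and times are just examples):

```
# Sketch of a /usr/local/etc/smartd.conf entry: monitor everything (-a),
# enable automatic offline testing (-o) and attribute autosave (-S),
# run a short test every day at 02:00 and a long test every Saturday
# at 03:00 (the regex is smartd's T/MM/DD/d/HH scheduling syntax).
/dev/da6 -a -o on -S on -s (S/../.././02|L/../../6/03)
```

In the GUI you just save an equivalent per-disk schedule in the S.M.A.R.T. Tests section instead of editing this file by hand.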
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Nope, just asking because it failed so quickly. RMA the drive.


A value of "0" (ZERO) is good. Values above zero indicate a problem with the heads, the platters, or both. If I had a value of 5, I'd run a SMART long test daily for maybe a month to see if the value increases. If it increases, I'd replace the drive. If ID 5 starts creeping up then again, it's bad news. Zero is where it should be; anything above that is a valid RMA.


No. That is why you have RAIDZ3: you can have up to three hard drives fail without risking your data.

Well, that disk is already out of my FreeNAS. But the strange thing is it shows as OK when using HD Tune?! My retailer uses it for checking RMA drives. So I had to tell them that the drive didn't pass the Seagate Long Generic test. But I expected it to be in worse condition, given how bad its SMART values looked :rolleyes:
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
OK guys, you got me :oops: I haven't set up SMART tests. The reason is I am still learning FreeNAS. But now this is prioritized and I will dig into it in the next few days.

Regarding disk brands, I am a Seagate fan :D I've had very good experience with Seagate disks, unlike WD.
That's fine, but:

Remember you're dealing with people in here that specifically make their living knowing about NAS, or who spend the vast majority of their free time working with NAS software and hardware.

***OVERWHELMINGLY*** this community advises against Seagates. We may disagree on which drives we *DO* recommend, but we certainly do not disagree, for the most part, on which drives we do *NOT* recommend. Seagates.
 

Bulldog

Dabbler
Joined
Jan 16, 2016
Messages
18
That's fine, but:

Remember you're dealing with people in here that specifically make their living knowing about NAS, or who spend the vast majority of their free time working with NAS software and hardware.

***OVERWHELMINGLY*** this community advises against Seagates. We may disagree on which drives we *DO* recommend, but we certainly do not disagree, for the most part, on which drives we do *NOT* recommend. Seagates.


He will learn the hard way =). Maybe the 2nd or 3rd time they just fail for no reason with low hours. Complete garbage drives.
 

arameen

Contributor
Joined
Sep 4, 2014
Messages
145
Hehehe, Seagate seems really unpopular here :) Not sure if they suddenly became so bad.
I have been using Seagate for more than 10 years and had no issues. OK, I didn't use those in a NAS, but my Windows machines have been on every day from morning to evening, working with more load than my NAS machine, which is used mostly as safe storage.
Anyway, advice noted :) Now some FreeNAS questions:

How bad is it when the zpool shows read, write and/or checksum errors for a certain drive?
After replacing the possibly failing drive I am already getting read, write and checksum errors on 2 other drives :eek:o_O
I will do a long SMART test on them, but until that is done I wonder how serious these read, write and/or checksum errors really are?!
Is it enough to just clear my zpool next time I see them and keep an extra eye on the drive, instead of removing it from the pool? Possibly scrub?
Or do I have to prepare replacement drives already?
 

DrKK

FreeNAS Generalissimo
Joined
Oct 15, 2013
Messages
3,630
If you are getting read/write/cksum errors on your pool, then you have a problem.
These are probably problems that ZFS is correcting for you, but they should not be occurring.

For example, I have never had a single read, write, or cksum error, ever, on my pool.
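Those counters are the READ/WRITE/CKSUM columns of zpool status, and you can scan a saved copy for any device with a nonzero count. A sketch (the status lines below are made-up sample data; zstatus.txt is just an example filename):

```shell
# Sketch: list devices whose read/write/cksum counters are nonzero in a
# saved `zpool status` dump (zpool status tank > zstatus.txt).
cat > zstatus.txt <<'EOF'
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz3-0  DEGRADED     0     0     0
            da1     ONLINE       0     0     0
            da6     FAULTED     12    37     5  too many errors
EOF
awk '$3 ~ /^[0-9]+$/ && $3+$4+$5 > 0 { print $1 ": " $3 "r/" $4 "w/" $5 "c" }' zstatus.txt
```

Keep in mind that `zpool clear` only resets these counters; it does not fix whatever caused them. If the same device keeps reappearing after a clear and a scrub, suspect the drive, the cable, or the controller port.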
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,994
Each drive manufacturer has periods of time where their product shines or fails. Seagate used to shine in the past, but right now it's not doing so well for longevity in general. If someone wants to go with Seagate drives, that is fine and maybe they will have better luck; after all, Seagate's luck has to change sometime.
For example, I have never had a single read, write, or cksum error, ever, on my pool.
I have, but it was self-induced. This occurred in an early FreeNAS version, 8.x I think, and I was doing a lot of testing. I forget exactly what I was doing, probably writing data directly to a sector on the hard drive to see if a scrub would notice the corrupt file.
 