SMART error?

George51

Contributor
Joined
Feb 4, 2014
Messages
126
Got this email (twice at exactly the same time) last night:

This message was generated by the smartd daemon running on:

host name: freenas
DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, Self-Test Log error count increased from 0 to 1

Device info:
WDC WD10EFRX-68PJCN0, S/N:WD-WCC4J3542112, WWN:5-0014ee-25f5645f2, FW:01.01A01, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent.

Since then, I've run a short test, and a long test is currently running on ada1. Here is the bottom part of the output of smartctl -a /dev/ada1 (I don't know how to scroll up within the shell window...):


Code:
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0                                          
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2                                          
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       75                                         
194 Temperature_Celsius     0x0022   113   102   000    Old_age   Always       -       30                                         
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1                                          
                                                                                                                                  
SMART Error Log Version: 1                                                                                                        
No Errors Logged                                                                                                                  
                                                                                                                                  
SMART Self-test log structure revision number 1                                                                                   
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                   
# 1  Short offline       Completed without error       00%      7251         -                                                    
# 2  Extended offline    Completed: read failure       40%      7249         1219691232                                           
# 3  Short offline       Completed without error       00%      7228         -                                                    
# 4  Short offline       Completed without error       00%      7204         -                                                    
# 5  Short offline       Completed without error       00%      7180         -                                                    
# 6  Short offline       Completed without error       00%      7156         -                                                    
# 7  Short offline       Completed without error       00%      7132         -                                                    
# 8  Extended offline    Completed without error       00%      7082         -                                                    
# 9  Short offline       Completed without error       00%      7061         -                                                    
#10  Short offline       Completed without error       00%      7037         -                                                    
#11  Short offline       Completed without error       00%      7013         -                                                    
#12  Short offline       Completed without error       00%      6989         -                                                    
#13  Short offline       Completed without error       00%      6965         -                                                    
#14  Extended offline    Completed without error       00%      6914         -                                                    
#15  Short offline       Completed without error       00%      6893         -                                                    
#16  Short offline       Completed without error       00%      6869         -                                                    
#17  Short offline       Completed without error       00%      6845         -                                                    
#18  Short offline       Completed without error       00%      6821         -                                                    
#19  Short offline       Completed without error       00%      6797         -                                                    
#20  Extended offline    Completed without error       00%      6747         -                                                    
#21  Short offline       Completed without error       00%      6725         -                                                    
                                                                                                                                  
SMART Selective self-test log data structure revision number 1                                                                    
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                      
    1        0        0  Not_testing                                                                                              
    2        0        0  Not_testing                                                                                              
    3        0        0  Not_testing                                                                                              
    4        0        0  Not_testing                                                                                              
    5        0        0  Not_testing                                                                                              
Selective self-test flags (0x0):                                                                                                  
  After scanning selected spans, do NOT read-scan remainder of disk.                                                              
If Selective self-test is pending on power-up, resume after 0 minute delay.


Where do I go from here?


Cheers
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,455
If you ssh to the machine rather than using the Shell button from the web GUI, you'll usually be able to scroll back and capture everything. Or you could "page" the output, so you could read it all, by doing "smartctl -a /dev/ada1 | more". Or you could send the output to a file on your pool by doing "smartctl -a /dev/ada1 > smartinfo.txt".

But all of those aside, your drive appears to be failing. If the long SMART test you're currently running also shows an error, I'd RMA the drive if under warranty, or buy a new one if not, and replace it.
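If you save the output to a file as suggested above, a short script can pick out the self-test entries that did not finish cleanly. The following is only a minimal Python sketch: the function name and the regular expression are illustrative assumptions based on the row layout posted earlier in this thread, and other drives or smartctl versions may format the log differently.

```python
import re

# One row of the "SMART Self-test log" table as printed by smartctl -a.
# The column layout is assumed from the output posted in this thread.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+"      # entry number, e.g. "# 2" or "#10"
    r"(?P<desc>.+?)\s{2,}"       # test description, e.g. "Extended offline"
    r"(?P<status>.+?)\s+"        # status text
    r"(?P<remaining>\d+)%\s+"    # percentage of the test remaining
    r"(?P<hours>\d+)\s+"         # power-on hours when the test ran
    r"(?P<lba>\S+)\s*$"          # LBA of first error, or "-"
)


def failed_self_tests(smartctl_output):
    """Return (num, description, status, lifetime_hours, lba) for every
    log entry whose status is not 'Completed without error'.

    Note: this also reports aborted or in-progress entries, not only
    read failures.
    """
    failures = []
    for line in smartctl_output.splitlines():
        m = ROW.match(line.strip())
        if m and m.group("status") != "Completed without error":
            lba = None if m.group("lba") == "-" else int(m.group("lba"))
            failures.append(
                (int(m.group("num")), m.group("desc"), m.group("status"),
                 int(m.group("hours")), lba)
            )
    return failures


# Sample rows copied from the log posted above.
SAMPLE = """\
# 1  Short offline       Completed without error       00%      7251         -
# 2  Extended offline    Completed: read failure       40%      7249         1219691232
# 3  Short offline       Completed without error       00%      7228         -
"""

if __name__ == "__main__":
    for entry in failed_self_tests(SAMPLE):
        print(entry)
```

To use it on a real capture, read the saved file (for example smartinfo.txt) and pass its contents to failed_self_tests().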
 

George51

Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p10 amd64] (local build)                                                        
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                              
Device Model:     WDC WD10EFRX-68PJCN0                                                                                             
Serial Number:    WD-WCC4J3542112                                                                                                  
LU WWN Device Id: 5 0014ee 25f5645f2                                                                                               
Firmware Version: 01.01A01                                                                                                         
User Capacity:    1,000,204,886,016 bytes [1.00 TB]                                                                                
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Device is:        In smartctl database [for details use: -P show]                                                                  
ATA Version is:   ACS-2 (minor revision not indicated)                                                                             
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Sat Mar  7 16:19:52 2015 GMT                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                          
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                   
General SMART Values:                                                                                                              
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                    
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                  
Total time to complete Offline                                                                                                     
data collection:                (13800) seconds.                                                                                   
Offline data collection                                                                                                            
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                        
                                        command.                                                                                   
                                        Offline surface scan supported.                                                            
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                            
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                            
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                            
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 157) minutes.
Conveyance self-test routine                                                                                                       
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x303d) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                           
  3 Spin_Up_Time            0x0027   133   131   021    Pre-fail  Always       -       4325                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       24                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7259                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       24
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       2                                           
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       75                                          
194 Temperature_Celsius     0x0022   114   102   000    Old_age   Always       -       29                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Extended offline    Completed without error       00%      7254         -                                                     
# 2  Short offline       Completed without error       00%      7251         -                                                     
# 3  Extended offline    Completed: read failure       40%      7249         1219691232                                            
# 4  Short offline       Completed without error       00%      7228         -                                                     
# 5  Short offline       Completed without error       00%      7204         -                                                     
# 6  Short offline       Completed without error       00%      7180         -                                                     
# 7  Short offline       Completed without error       00%      7156         -                                                     
# 8  Short offline       Completed without error       00%      7132         -                                                     
# 9  Extended offline    Completed without error       00%      7082         -                                                     
#10  Short offline       Completed without error       00%      7061         -                                                     
#11  Short offline       Completed without error       00%      7037         -                                                     
#12  Short offline       Completed without error       00%      7013         -                                                     
#13  Short offline       Completed without error       00%      6989         -                                                     
#14  Short offline       Completed without error       00%      6965         -                                                     
#15  Extended offline    Completed without error       00%      6914         -                                                     
#16  Short offline       Completed without error       00%      6893         - 
#17  Short offline       Completed without error       00%      6869         -                                                     
#18  Short offline       Completed without error       00%      6845         -                                                     
#19  Short offline       Completed without error       00%      6821         -                                                     
#20  Short offline       Completed without error       00%      6797         -                                                     
#21  Extended offline    Completed without error       00%      6747         -                                                     
1 of 1 failed self-tests are outdated by newer successful extended offline self-test # 1                                           
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.  


So that's the full smartctl -a /dev/ada1 | more output. The second extended SMART test passed, so where does that leave me? Is the drive failing, or did it just have a funny few minutes? Setting up SSH is something I have been meaning to do. This may convince me to get round to setting it up and making sure it's secure.
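As a rough cross-check of where the read failure sits, the LBA from the failed extended test can be converted to a byte offset and a position on the disk. This is a sketch only: it assumes the LBA_of_first_error column counts 512-byte logical sectors (the logical sector size smartctl reports above), it takes the capacity from the same output, and the function name is made up for illustration.

```python
# Figures copied from the smartctl output above.
LOGICAL_SECTOR_BYTES = 512            # "Sector Sizes: 512 bytes logical"
PHYSICAL_SECTOR_BYTES = 4096          # "... 4096 bytes physical"
CAPACITY_BYTES = 1_000_204_886_016    # "User Capacity"
FAILED_LBA = 1219691232               # LBA_of_first_error of the failed test


def locate_lba(lba, capacity_bytes=CAPACITY_BYTES):
    """Return (byte_offset, physical_sector_index, fraction_of_disk).

    Assumes the LBA is expressed in logical (512-byte) sectors.
    """
    byte_offset = lba * LOGICAL_SECTOR_BYTES
    physical_sector = byte_offset // PHYSICAL_SECTOR_BYTES
    fraction = byte_offset / capacity_bytes
    return byte_offset, physical_sector, fraction


if __name__ == "__main__":
    offset, phys, frac = locate_lba(FAILED_LBA)
    print(f"byte offset:     {offset}")
    print(f"physical sector: {phys}")
    print(f"position:        {frac:.1%} of the disk")
```

Under those assumptions the error lands a little over 62% of the way through the disk, which seems consistent with the failed test stopping at 40% remaining.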
 

Bidule0hm

Server Electronics Sorcerer
Joined
Aug 5, 2013
Messages
3,710
The drive has (at least) one crappy sector: just bad enough to trigger a read error, but not bad enough to be reallocated. See the description of ID #197 in this table ;)

In your place, I would keep an eye on this drive.
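One way to keep an eye on it, sketched in Python: pull the raw values of the sector-related attributes out of saved smartctl -a output and flag any that are non-zero. The choice of IDs (5, 196, 197, 198) and the helper name are assumptions for illustration rather than an official list, and the sample rows are copied from the output posted above.

```python
# Attribute IDs related to bad or reallocated sectors (an assumed selection).
WATCHED_IDS = {5, 196, 197, 198}


def watched_raw_values(smartctl_output):
    """Map attribute name -> raw value for the watched attribute IDs.

    Attribute rows have 10 whitespace-separated columns:
    ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
    Assumes the raw value of these attributes is a plain integer, which
    holds for the output in this thread but may not for every drive.
    """
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if (len(fields) >= 10 and fields[0].isdigit()
                and int(fields[0]) in WATCHED_IDS):
            values[fields[1]] = int(fields[9])
    return values


# Sample rows copied from the attribute table posted above.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       7259
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

if __name__ == "__main__":
    raw = watched_raw_values(SAMPLE)
    nonzero = {name: value for name, value in raw.items() if value != 0}
    print(nonzero if nonzero else "all watched attributes are zero")
```

Running this against a fresh capture every so often and comparing the results would show whether any of these counters start to climb.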
 

George51

Okay, so to bring this thread back: since this error, I've had no issues with the drive until yesterday morning, when I got this email:

"This message was generated by the smartd daemon running on:

host name: freenas
DNS domain: local

The following warning/error was logged by the smartd daemon:

Device: /dev/ada1, Self-Test Log error count increased from 0 to 1

Device info:
WDC WD10EFRX-68PJCN0, S/N:WD-WCC4J3542112, WWN:5-0014ee-25f5645f2, FW:01.01A01, 1.00 TB

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation.
No additional messages about this problem will be sent."

To start with I thought it was a different drive, until I pulled up this old thread again and realised it was the same one. The results of "smartctl -a /dev/ada1 | more" this time round are below:


Code:
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p13 amd64] (local build)                                                        
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org                                                        
                                                                                                                                   
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                              
Device Model:     WDC WD10EFRX-68PJCN0                                                                                             
Serial Number:    WD-WCC4J3542112                                                                                                  
LU WWN Device Id: 5 0014ee 25f5645f2                                                                                               
Firmware Version: 01.01A01                                                                                                         
User Capacity:    1,000,204,886,016 bytes [1.00 TB]                                                                                
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Device is:        In smartctl database [for details use: -P show]                                                                  
ATA Version is:   ACS-2 (minor revision not indicated)                                                                             
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Sun Apr 19 08:56:01 2015 BST                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled                                                                                                          
                                                                                                                                   
=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                   
General SMART Values:                                                                                                              
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                    
Self-test execution status:      (   0) The previous self-test routine completed                                                   
                                        without error or no self-test has ever                                                     
                                        been run.                                                                                  
Total time to complete Offline                                                                                                     
data collection:                (13800) seconds.                                                                                   
Offline data collection                                                                                                            
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                        
                                        command.                                                                                   
                                        Offline surface scan supported.                                                            
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                            
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                            
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                            
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 157) minutes.
Conveyance self-test routine                                                                                                       
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x303d) SCT Status supported.                                                                      
                                        SCT Error Recovery Control supported.                                                      
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported.                                                                  
                                                                                                                                   
SMART Attributes Data Structure revision number: 16                                                                                
Vendor Specific SMART Attributes with Thresholds:                                                                                  
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                           
  3 Spin_Up_Time            0x0027   135   131   021    Pre-fail  Always       -       4225                                        
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       40                                          
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   089   089   000    Old_age   Always       -       8280                                        
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       40
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       9                                           
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       99                                          
194 Temperature_Celsius     0x0022   116   102   000    Old_age   Always       -       27                                          
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1                                           
                                                                                                                                   
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                   
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Extended offline    Completed: read failure       40%      8254         1219691232                                            
# 2  Short offline       Completed without error       00%      8234         -                                                     
# 3  Short offline       Completed without error       00%      8210         -                                                     
# 4  Short offline       Completed without error       00%      8186         -                                                     
# 5  Short offline       Completed without error       00%      8162         -                                                     
# 6  Short offline       Completed without error       00%      8138         -                                                     
# 7  Extended offline    Completed without error       00%      8087         -                                                     
# 8  Short offline       Completed without error       00%      8066         -                                                     
# 9  Short offline       Completed without error       00%      8042         -                                                     
#10  Short offline       Completed without error       00%      8018         -                                                     
#11  Short offline       Completed without error       00%      7994         -                                                     
#12  Short offline       Completed without error       00%      7970         -                                                     
#13  Extended offline    Completed without error       00%      7920         -                                                     
#14  Short offline       Completed without error       00%      7898         -                                                     
#15  Short offline       Completed without error       00%      7874         -                                                     
#16  Short offline       Completed without error       00%      7851         -
#17  Short offline       Completed without error       00%      7827         -                                                     
#18  Short offline       Completed without error       00%      7803         -                                                     
#19  Extended offline    Completed without error       00%      7753         -                                                     
#20  Short offline       Completed without error       00%      7732         -                                                     
#21  Short offline       Completed without error       00%      7708         -                                                     
                                                                                                                                   
SMART Selective self-test log data structure revision number 1                                                                     
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay. 


Is this drive getting worse? I am not comfortable reading or interpreting the SMART results. So far this time round, I haven't run anything. By default I run weekly long tests and daily short tests, as well as monthly scrubs.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
Well, the SMART test failed, so it should qualify for an RMA.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,455
Code:
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       40%      8254         1219691232                      
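
For anyone unsure how to read that row, here is a minimal Python sketch (not part of smartctl or FreeNAS) that pulls the fields out of a pasted self-test log. The function names are made up for illustration, the regex assumes columns are separated by two or more spaces as in the paste above, counting any status that mentions "failure" as a failed test is a heuristic, and the 512-byte logical sector size is an assumption about this drive.

```python
import re

# Matches one row of the "SMART Self-test log" table as pasted in this thread.
# Assumes columns are separated by at least two spaces.
ROW = re.compile(
    r"#\s*(?P<num>\d+)\s+"
    r"(?P<desc>.+?)\s{2,}"
    r"(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+"
    r"(?P<hours>\d+)\s+"
    r"(?P<lba>\d+|-)\s*$"
)


def parse_selftest_log(text):
    """Return one dict per self-test row found in `text`."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m is None:
            continue  # header or unrelated line
        rows.append({
            "num": int(m["num"]),
            "description": m["desc"],
            "status": m["status"],
            "remaining_pct": int(m["remaining"]),
            "lifetime_hours": int(m["hours"]),
            "first_error_lba": None if m["lba"] == "-" else int(m["lba"]),
        })
    return rows


def failed_tests(rows):
    """Heuristic: any status mentioning 'failure' counts as a failed test."""
    return [r for r in rows if "failure" in r["status"].lower()]


LOG = """\
# 1  Extended offline    Completed: read failure       40%      8254         1219691232
# 2  Short offline       Completed without error       00%      8234         -
"""

POWER_ON_HOURS = 8280  # attribute 9 (Power_On_Hours) in the OP's output
SECTOR_BYTES = 512     # assumption: 512-byte logical sectors

for bad in failed_tests(parse_selftest_log(LOG)):
    hours_ago = POWER_ON_HOURS - bad["lifetime_hours"]
    offset = bad["first_error_lba"] * SECTOR_BYTES
    print(f"test #{bad['num']} ({bad['description']}): {bad['status']}")
    print(f"  stopped with {bad['remaining_pct']}% left, {hours_ago} power-on hours ago")
    print(f"  first bad LBA {bad['first_error_lba']} (byte offset {offset})")
```

With the OP's numbers, the failing extended test ran 26 power-on hours before the output was captured (8280 - 8254), and 40% remaining means it stopped roughly 60% of the way through the scan.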
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
We need a bit more clarity here. Could an expert please apply their mind?
(1) Agreed it has passed SMART. It says so.
(2) It has (a) read failure(s). A "read failure"
What does this sum to, though, in terms of, say, FreeNAS (and ZFS) in relation to, say, other systems?
Is it a disk that is really about to fail, or is it just not performing well enough for ZFS?
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,175
(1) Agreed it has passed SMART.
No:
Code:
 
SMART Self-test log structure revision number 1                                                                                    
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                    
# 1  Extended offline    Completed: read failure

(2) It has (a) read failure(s). A "read failure"
What does this sum to
Unless you enjoy philosophical discussions about the existence of data once it was stored but cannot be read... A bad disk.
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Yes - "to be or not to be data" isn't going to help much ;)
The point is how to make sense of:
"SMART overall-health self-assessment test result: PASSED"
but
then a "read failure"
 
Johnnie Black

Joined
May 10, 2017
Messages
838
SMART overall-health self-assessment test result: PASSED

SMART overall-health only reflects the SMART attributes: it fails only if one or more attributes are failing NOW. It doesn't even consider, for example, a few pending sectors, so it's not a good way to assess a disk's health. A SMART extended test is a much better way of determining whether a disk is good or not, sometimes together with analyzing some SMART attributes.
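
As a rough illustration of why the overall result can still say PASSED, here is a simplified Python model of the check. It assumes the rule described in the smartctl man page (an attribute has failed when its normalized VALUE is less than or equal to THRESH) and assumes that a THRESH of 000 can never trip. In reality the drive itself reports the overall status; `failing_now` and `overall_health` are made-up names, and the rows are copied from the OP's attribute table.

```python
# (name, VALUE, THRESH, RAW_VALUE) copied from the OP's attribute table.
ATTRS = [
    ("Raw_Read_Error_Rate",    200,  51, 0),
    ("Reallocated_Sector_Ct",  200, 140, 0),
    ("Current_Pending_Sector", 200,   0, 0),
    ("Multi_Zone_Error_Rate",  200,   0, 1),
]


def failing_now(value, thresh):
    # Assumption: an attribute fails when normalized VALUE <= THRESH,
    # and THRESH == 0 means there is no threshold to trip.
    return thresh > 0 and value <= thresh


def overall_health(attrs):
    # Simplified model: FAILED only if at least one attribute is failing now.
    failed = any(failing_now(value, thresh) for _, value, thresh, _ in attrs)
    return "FAILED" if failed else "PASSED"


print(overall_health(ATTRS))
for name, value, thresh, raw in ATTRS:
    if thresh == 0:
        print(f"{name}: THRESH is 000, so it cannot fail the overall check (raw={raw})")
```

Under this model, pending sectors and the multi-zone error rate have no way to flip the overall result on this drive, which is why the self-test log has to be checked separately.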
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Johnnie Black - thanks.
I'm wrestling with the fact that 3 and potentially 4 disks in my system have gone bad. I'm posting here the results of all SMART tests and zpool status. I'll also give the hardware profile at the bottom. The system (this particular pool) had been running without obvious glitches for over 6 months before dying in a fairly epic fashion.
I'd really appreciate your view on the disks and whether this really is a horrible batch of disks or whether something in my system could be underlying this mess.

I've attached the smart long test results for the four disks.

This last disk was in the array but it faulted, so I burned in a new disk, replaced it and started what I hoped was going to be a smooth resilvering process (before the 2nd and then 3rd disk started to give problems). I put this disk in another Linux machine to re-run SMART (to make sure something weird isn't happening in my FreeNAS system).


The result of zpool status shows:


Code:
root@freenas:~ # zpool status
  pool: NAS2
state: ONLINE
  scan: scrub repaired 0 in 0 days 02:07:10 with 0 errors on Sun Jul 14 02:07:11 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2                                            ONLINE       0     0     0
          raidz2-0                                      ONLINE       0     0     0
            gptid/01dd6f7b-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/024dced5-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/02bcb705-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0
            gptid/0329aca0-ccf5-11e3-8e0e-60a44ce69c93  ONLINE       0     0     0

errors: No known data errors

  pool: NAS2_data
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:12:45 with 0 errors on Sun Jul 14 00:12:48 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        NAS2_data                                       ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/025b0bdd-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/03993d48-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/04da7480-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0
            gptid/063a9ab5-c146-11e8-9d80-ac1f6b2542fe  ONLINE       0     0     0

errors: No known data errors

  pool: VM-Datastore
state: UNAVAIL
status: One or more devices are faulted in response to persistent errors.  There are insufficient replicas for the pool to
        continue functioning.
action: Destroy and re-create the pool from a backup source.  Manually marking the device
        repaired using 'zpool clear' may allow some data to be recovered.
  scan: resilvered 0 in 0 days 00:00:00 with 0 errors on Thu Aug 15 17:42:52 2019
config:

        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    UNAVAIL    733    68     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      DEGRADED  1001   190   523
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  DEGRADED   284   283   566  too many errors
            gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe  FAULTED      0 1.18K 2.04K  too many errors

errors: No known data errors

  pool: freenas-boot
state: ONLINE
  scan: scrub repaired 0 in 0 days 00:00:22 with 0 errors on Fri Aug  9 03:45:22 2019
config:

        NAME        STATE     READ WRITE CKSUM
        freenas-boot  ONLINE       0     0     0
          ada5p2    ONLINE       0     0     0

errors: No known data errors
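
For picking the troubled devices out of a long listing like this, a small Python sketch can help. It is only an illustration: `devices_with_errors` is a made-up name, the set of state words is an assumption rather than an exhaustive list, and counters are compared as text so that abbreviated values such as 1.18K do not need converting.

```python
# Assumption: state words that can appear in the STATE column.
STATES = {"ONLINE", "DEGRADED", "FAULTED", "UNAVAIL", "OFFLINE", "REMOVED"}


def devices_with_errors(status_text):
    """Return (name, state, read, write, cksum) for rows that are not clean."""
    flagged = []
    for line in status_text.splitlines():
        fields = line.split()
        # Device rows look like: NAME STATE READ WRITE CKSUM [note...]
        if len(fields) < 5 or fields[1] not in STATES:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            flagged.append((name, state, read, write, cksum))
    return flagged


# Rows copied from the VM-Datastore pool above.
SAMPLE = """\
        NAME                                            STATE     READ WRITE CKSUM
        VM-Datastore                                    UNAVAIL    733    68     0
          mirror-0                                      ONLINE       0     0     0
            gptid/2bb4cca3-bb50-11e9-b818-ac1f6b2542fe  ONLINE       0     0     0
            gptid/f9b84327-aecb-11e8-8fb0-ac1f6b2542fe  ONLINE       0     0     0
          mirror-1                                      DEGRADED  1001   190   523
            gptid/fbc1273f-aecb-11e8-8fb0-ac1f6b2542fe  DEGRADED   284   283   566  too many errors
            gptid/fcc066d2-aecb-11e8-8fb0-ac1f6b2542fe  FAULTED      0 1.18K 2.04K  too many errors
"""

for row in devices_with_errors(SAMPLE):
    print(*row)
```

On the sample it reports the pool itself, mirror-1 and both of its members, while mirror-0 and its disks are left out.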


For hardware:
Motherboard: Asus kcma-d8 dual socket
CPU: AMD Opteron 4122 (4 core no threading) x2
RAM: 32Gb Kingston ECC
Raid card: Asus Pike 2008 - running the LSI 2008 chipset in jbod mode
Power: FSP Twins (500w) connected to UPS
HDD: 4 x 2TB WD Red (RAIDZ2), 4 x 2TB WD Red (RAIDZ1), 4 x 3TB WD Red (striped mirror). From another post, it seems I've messed up the config of this.
System disks: 2 x 120Gb Intel SSDs
System: FreeNAS 11.1-U7
 

Attachments

  • old da0.txt
    10.2 KB · Views: 418
  • da2.txt
    7.5 KB · Views: 404
  • da1.txt
    7.1 KB · Views: 366
  • da0.txt
    7.1 KB · Views: 373
Johnnie Black

Joined
May 10, 2017
Messages
838
They are all failing. Some attributes to keep an eye on with WD drives, besides the more common pending and reallocated sectors, are:

Code:
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always    
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline     


Ideally these should be 0. A non-zero value on either or both doesn't mean the disk is failing: as long as the extended SMART test completes without error, the disk is good. But it's never a good sign, and if they keep increasing the disk will likely fail sooner rather than later. When the extended SMART test fails, the disk is failing, i.e., there are bad sectors.
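
A quick way to check a pasted attribute table for those values is sketched below in Python. It is only an illustration: `nonzero_watched` and the `WATCH` set are made-up names, the attribute names are the ones smartctl prints for WD drives in this thread, and the rows are copied from the first post's output.

```python
# Attributes mentioned above: 1, 200, plus pending and reallocated sectors.
WATCH = {
    "Raw_Read_Error_Rate",
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Multi_Zone_Error_Rate",
}


def nonzero_watched(table_text):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    hits = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns: ID, NAME, FLAG, VALUE, WORST,
        # THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCH and raw.isdigit() and int(raw) > 0:
            hits[name] = int(raw)
    return hits


TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1
"""

print(nonzero_watched(TABLE))
```

On the first post's table this flags only Multi_Zone_Error_Rate with a raw value of 1. Saving the result after each extended test makes it easy to see whether the numbers keep climbing.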

P.S. It would be nice to see smartctl -x output for all of them; there's some more info there, including logged errors and max lifetime operating temps.
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Thanks for the confirmation and taking time to look.
I am trying to understand this problem. They are not dead as per smart -- yet - but are on the way to the grave. Yet FreeNAS has marked two of them as faulted. That renders them essentially unusable, I guess?
I see the faults seem to cluster around a short period of time (45x hours). I presume if these were all at the same time then it would be some physical knock to the system. Otherwise does this just look like a bad batch?

Extra information is attached.
 

Attachments

  • old da0-x.txt
    17.2 KB · Views: 383
  • da2-x.txt
    21.3 KB · Views: 354
  • da0-x.txt
    16.4 KB · Views: 357
  • da1-x.txt
    21.4 KB · Views: 302
Johnnie Black

Joined
May 10, 2017
Messages
838
They are not dead as per smart -- yet - but are on the way to the grave.

In fact, in cases like these, especially when there aren't pending sectors, the bad sectors can sometimes be intermittent, i.e., if you run a couple more extended tests it might pass again, and the disk can even be good for some time, but in my experience it's more likely to keep failing or fail again soon.

Historic max temperatures are good, so either you had bad luck or there was some other external factor at play, like high vibrations, power issues, etc.; most likely just bad luck.
 

James S

Explorer
Joined
Apr 14, 2014
Messages
91
Thanks for the input. I've sent the 4 WD drives back under warranty. The 3TiB were replaced with 4TiB. So I don't know if the 3s were known to be troublesome.
 

nick31

Dabbler
Joined
Nov 29, 2017
Messages
12
Funny thing, yesterday I enabled SMART testing for all disks in my system. 2 of them are WDC WD30PURX-64P6ZY0 WD Purple 3TB.
They were never used, just powered on for 1-2 years. So all critical SMART data is at 0.
Nevertheless, today I was greeted with the same errors as OP for both drives:
# 1 Short offline Completed: read failure 40% 11115 9655736
# 1 Short offline Completed: read failure 40% 5150 590976
This coincidence makes me think that there are some bugs with firmware and maybe we shouldn't trust the test results. Why would these drives fail the test without any usage?

Edit: after further investigation it seems the sectors are indeed bad. I get read errors with sg_verify. Glad I didn't use these drives.
 