Problem with HDD - unreadable (pending) sectors

ltkrogoth

Dabbler
Joined
Jun 3, 2016
Messages
12
Update 10.07.2019

For everyone who uses the search and has the same problem, here's a tl;dr.
For detailed instructions, check the user guide.

1: I had problems reading my files and long lags when accessing my drive.
2: Checked FreeNAS: it reported unreadable sectors on one of my drives (ada4).
   - You should set up regular SMART tests with email alerts to catch this early on.
3: Ordered a new HDD (same size, same manufacturer).
4: Replaced the broken HDD.
5: Resilvered (took 13 h for my 3 TB WD Red, 70% full).
6: Running smoothly ever since.

I am using a ZFS pool with RAIDZ2.


Hello,

I hope someone can help me with some advice. I set up my FreeNAS with a ZFS RAIDZ2 pool 4 years ago and it ran smoothly until yesterday. I am using 8 WD Red 3 TB HDDs. Today I noticed heavy read delays on my server while accessing some files. After logging in, I first checked the console and found an error:

Code:
Jul  3 21:30:36 freenas smartd[2778]: Device: /dev/ada4, 31 Currently unreadable (pending) sectors
Jul  3 21:30:36 freenas smartd[2778]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to mymail@hoster.de produced unexpected output (167 bytes) to STDOUT/STDERR:
Jul  3 21:30:36 freenas smartd[2778]: usage: smart_alert.py [-h] [-d DEV]
Jul  3 21:30:36 freenas smartd[2778]: smart_alert.py: error: unrecognized arguments: -s SMART error (CurrentPendingSector) detected on host: freenas markus@angelmahr.de
Jul  3 21:30:36 freenas smartd[2778]: Warning via /usr/local/www/freenasUI/tools/smart_alert.py to mymail@hoster.de: failed (32-bit/8-bit exit status: 512/2)
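
For anyone who wants to watch for this message automatically, here is a minimal Python sketch that pulls the device and the pending-sector count out of a smartd line like the first one above. The regex and the function name `parse_pending_warning` are my own assumptions, built from that single log line; they are not something FreeNAS ships.

```python
import re
from typing import Optional, Tuple

# Pattern assumed from the one smartd line quoted in this thread.
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<dev>\S+), "
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def parse_pending_warning(line: str) -> Optional[Tuple[str, int]]:
    """Return (device, pending_count), or None if the line is unrelated."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    return match.group("dev"), int(match.group("count"))


# The first log line from the console output above.
sample_line = (
    "Jul  3 21:30:36 freenas smartd[2778]: Device: /dev/ada4, "
    "31 Currently unreadable (pending) sectors"
)
print(parse_pending_warning(sample_line))
```

Lines that do not match (for example the smart_alert.py usage errors) simply return None, so the function can be run over a whole log.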


I also ran smartctl on every drive.

ada1 is a small SSD used for FreeNAS and as a cache.

ada2
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                              
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   181   176   021    Pre-fail  Always       -       5950                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       867                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11338                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       852
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       88                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       832                                        
194 Temperature_Celsius     0x0022   117   112   000    Old_age   Always       -       33                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1                                                                                                        
No Errors Logged



ada3
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   181   177   021    Pre-fail  Always       -       5908                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       789                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11206                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       784
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       745                                        
194 Temperature_Celsius     0x0022   116   110   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1                                                                                                        
No Errors Logged



ada4
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5868                                      
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       6041                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       816                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11231                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       811
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       75                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       772                                        
194 Temperature_Celsius     0x0022   116   113   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31                                        
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged



ada5
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   174   168   021    Pre-fail  Always       -       6291                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       815                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11231                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       810
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       74                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       773                                        
194 Temperature_Celsius     0x0022   116   112   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged



ada6
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   184   179   021    Pre-fail  Always       -       5783                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       814                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11232                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       811
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       73                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       772                                        
194 Temperature_Celsius     0x0022   116   112   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged



ada7
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   176   170   021    Pre-fail  Always       -       6200                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       810                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11232                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       810
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       69                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       773                                        
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged  



ada8
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   176   170   021    Pre-fail  Always       -       6200                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       810                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11232                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       810
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       69                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       773                                        
194 Temperature_Celsius     0x0022   116   111   000    Old_age   Always       -       34                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged  



ada9
Code:
                                                                                                                                  
SMART Attributes Data Structure revision number: 16                                                                               
Vendor Specific SMART Attributes with Thresholds:                                                                                
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                  
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0                                          
  3 Spin_Up_Time            0x0027   182   177   021    Pre-fail  Always       -       5866                                      
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       784                                        
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                          
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                          
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11206                                      
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       784
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       47                                        
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       745                                        
194 Temperature_Celsius     0x0022   115   110   000    Old_age   Always       -       35                                        
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                          
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0                                          
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                          
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                          
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                          
                                                                                                                                 
SMART Error Log Version: 1                                                                                                        
No Errors Logged   


screencapture-192-168-3-86-2019-07-03-22_33_14.png

My ZFS status

1562186774677.png

My disc/ volume setup. All healthy here.

raidz2.JPG

I am a little confused: why is read/write all 0 here?

Can someone please help me with some advice. Should I replace all pre-fail HDDs or only ada4? If I replace ada4 and I do have some serious read problems accessing my files - will ZFS resilver for good or will some data be lost?

Thanks and regards,
Markus
 

Attachments

  • volume.JPG
  • 1562186745698.png

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,458
What is so difficult to understand about "offline (uncorrectable) sectors"? The very limited SMART data you've posted doesn't indicate if you've ever run any SMART self-tests, or what the results were, but nonetheless the situation is pretty obvious, isn't it?
 

myoung

Explorer
Joined
Mar 14, 2018
Messages
70
What is so difficult to understand about "offline (uncorrectable) sectors"? The very limited SMART data you've posted doesn't indicate if you've ever run any SMART self-tests, or what the results were, but nonetheless the situation is pretty obvious, isn't it?

What is wrong with you?

Can someone please help me with some advice. Should I replace all pre-fail HDDs or only ada4? If I replace ada4 and I do have some serious read problems accessing my files - will ZFS resilver for good or will some data be lost?

The words pre-fail in your SMART attributes don't mean those drives are about to fail.

Attributes are one of two possible types: Pre-failure or Old age. Pre-failure Attributes are ones which, if less than or equal to their threshold values, indicate pending disk failure. Old age, or usage Attributes, are ones which indicate end-of-product life from old-age or normal aging and wearout, if the Attribute value is less than or equal to the threshold. Please note: the fact that an Attribute is of type 'Pre-fail' does not mean that your disk is about to fail! It only has this meaning if the Attribute's current Normalized value is less than or equal to the threshold value.
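
To make that rule concrete, here is a small Python sketch that applies it to one row of the smartctl attribute table. `SmartAttr`, `parse_attr_row` and `is_failing` are hypothetical helper names, and the column layout is assumed to match the tables posted above (ID#, name, flag, VALUE, WORST, THRESH, type, ...).

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int      # normalized VALUE column
    worst: int      # normalized WORST column
    thresh: int     # THRESH column
    attr_type: str  # "Pre-fail" or "Old_age"
    raw: str        # RAW_VALUE column, kept as text


def parse_attr_row(line: str) -> SmartAttr:
    """Split one attribute row; assumes the 10-column layout shown in this thread."""
    f = line.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[6], f[9].strip())


def is_failing(attr: SmartAttr) -> bool:
    """The man page rule: failing only if normalized value <= threshold."""
    return attr.value <= attr.thresh


# Row 1 from the ada4 table above: type is Pre-fail, but 200 is well above 51.
ada4_row = "  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5868"
attr = parse_attr_row(ada4_row)
print(attr.name, attr.attr_type, "failing" if is_failing(attr) else "not failing")
```

Note that this check only looks at the normalized value. A raw count such as Current_Pending_Sector = 31 will not trip it, so the raw column still needs a look.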

As @danb35 pointed out, you should run SMART tests before going further: smartctl -t long /dev/adaX. Also schedule recurring tests in the GUI.
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,458
What is wrong with you?
An inability to fathom how the meaning of "offline (uncorrectable) sectors" could be anything other than blindingly obvious, apparently. Either that, or the expectation that folks will give at least the most minuscule amount of thought to what an error message might mean.

Edit: Probably also some frustration that OP couldn't be bothered to expend the slightest amount of effort in finding the answer for himself, too. After all, there are dozens of threads here on this exact error message.
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
will ZFS resilver for good or will some data be lost?

Never had any issues replacing disks. Please read the forum for how to do it and replace one by one as needed. Wait until resilver finishes.
 

ltkrogoth

Dabbler
Joined
Jun 3, 2016
Messages
12
Thanks to all of you for your answers. I know I am not very familiar with FreeNAS or SMART tests. That's the reason I asked here. I didn't expect that I could annoy someone with my question.

Here are some more details.
- I ran the short smartctl test because the long one would have needed 8 hours per drive. I thought the test might stress the drive, and I wanted to keep the stress to a minimum. Also, I didn't want to wait 8 more days to ask here. That server holds my clients' files from the last few years, and they are very precious to me. I do have external backups, but they are partial and some weeks old. So I want to rescue these files if possible.
- Of course I googled and searched, but I am not familiar with SMART tests and I thought some experts could give me more feedback here.

I am going to run the long test next week. But I already ordered one more drive to replace the drive 4.
Thanks for the answers and your time. If you don't feel my question is worth your time, just skip it. I won't be mad about that, but I look forward to getting some help if someone is willing.

Regards,
Markus
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,458
I didn't expect that I could annoy someone with my question.
Doing a bit of your own homework really is encouraged--as I mentioned, this exact question has been asked several times before, so you could have had an answer right away by searching the forum, rather than posting and waiting for an answer. But my
What is so difficult to understand about "offline (uncorrectable) sectors"?
was actually a question, not a rant--you (and many others before you) posted a thread saying (roughly) "what does offline (uncorrectable) sectors mean?", when it seems pretty obvious to me what it means. I'm trying to see where the disconnect is.
Also I didn't want to wait 8 more days to ask here.
There would not be any need to; you can (and ordinarily would) run tests on all the disks simultaneously.
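
As a rough illustration, the Python sketch below builds one `smartctl -t long` command per disk and starts them one after another. The test itself runs inside the drive, so each command returns right away and the eight tests end up running in parallel. The device list (ada2 to ada9) is taken from this thread; the helper names and the `dry_run` switch are my own assumptions, so adjust them for your system.

```python
import subprocess
from typing import List

# Data disks from this thread (ada1 is the SSD); an assumption for any other system.
DEVICES = [f"/dev/ada{n}" for n in range(2, 10)]


def long_test_commands(devices: List[str]) -> List[List[str]]:
    """Build one 'smartctl -t long <device>' command per disk."""
    return [["smartctl", "-t", "long", dev] for dev in devices]


def start_long_tests(devices: List[str], dry_run: bool = True) -> None:
    """Print the commands, or run them (needs root) when dry_run is False."""
    for cmd in long_test_commands(devices):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Returns immediately; the drive runs the self-test in the background.
            subprocess.run(cmd, check=True)


start_long_tests(DEVICES)  # dry run: only prints the eight commands
```

Once the tests have finished, `smartctl -l selftest /dev/adaX` shows the self-test log for each drive.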
 

melloa

Wizard
Joined
May 22, 2016
Messages
1,749
But I already ordered one more drive to replace the drive 4.

It is best practice to burn in disks before putting them into production: Hard Drive Burn-in Testing. That lets you identify bad disks and RMA them under warranty, besides avoiding future issues.

You might find other resources searching the forum.

would have needed 8 hours per drive.

You will find that disk burn-in and other tests (memory, for instance) are time-consuming, but they avoid future issues. You can use tmux to run multiple tasks as needed. Take a look at the information on the site I've linked.

I've also used @Spearfoot's scripts as a base for my own, so take a look at his repository as well.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
The words pre-fail in your SMART attributes don't mean those drives are about to fail.
While technically true, keep in mind that even with running SMART tests, the SMART tests were designed to hopefully provide an error message within 24 hours prior to a failure. I've got to tell you, short of an electrical/electronic failure a SMART test will warn you of media errors in advance. Your drive ada4 Pending Sectors Count = 31 is a huge warning sign. If it were 1 to 5 sectors then you could watch to see if it grows but 31 is high.
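
If you want to turn that rule of thumb into a quick check, here is a minimal Python sketch. It reads the raw Current_Pending_Sector value from `smartctl -A` style output and sorts it into "ok", "watch" or "replace". The 1 to 5 "watch" band comes from the paragraph above; treating everything above 5 as "replace" is my own assumption for the sketch, as are the function names.

```python
def pending_sectors(smart_output: str) -> int:
    """Return the raw Current_Pending_Sector count from an attribute table."""
    for line in smart_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] == "Current_Pending_Sector":
            return int(fields[9])
    raise ValueError("Current_Pending_Sector not found")


def triage(count: int, watch_limit: int = 5) -> str:
    """0 is fine, 1..watch_limit means keep an eye on it, more means replace.

    The cut-off above watch_limit is an assumption, not an official threshold.
    """
    if count == 0:
        return "ok"
    if count <= watch_limit:
        return "watch"
    return "replace"


# Two rows copied from the ada4 table earlier in this thread.
ada4_excerpt = """\
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31
"""
print(triage(pending_sectors(ada4_excerpt)))
```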

My advice: replace the drive under RMA and, as the others have said, do the replacement IAW the User Guide; you would be surprised what is in there. I always recommend that a person read the User Guide from cover to cover, twice. Even I forget all the great help in there. A drive burn-in is also a very good idea. It takes a while, but it should weed out any problems and infant mortality.

And also, we do expect people to do some research, it's a simple search function (listed in our forum rules) and for this problem I'd have used something like "freenas Currently unreadable (pending) sectors" where "freenas" is the application we are running and "Currently unreadable (pending) sectors" is the error message.

Good Luck.
 

ltkrogoth

Dabbler
Joined
Jun 3, 2016
Messages
12
Thank you everyone for the support. Next time I will use the search before asking. But as you can see, I was unsure about the solution and I was hoping to get someone giving me advice.

The replacement disk will arrive today, and I will replace the drive and resilver following the user guide.

To complete my post, here are the results of the long test. It aborted after 10% with a read error.

Code:
=== START OF INFORMATION SECTION ===                                                                                               
Model Family:     Western Digital Red                                                                                               
Device Model:     WDC WD30EFRX-68EUZN0                                                                                             
Serial Number:    WD-WCC4N6HJP7EK                                                                                                   
LU WWN Device Id: 5 0014ee 2b7b08ea6                                                                                               
Firmware Version: 82.00A82                                                                                                         
User Capacity:    3,000,592,982,016 bytes [3.00 TB]                                                                                 
Sector Sizes:     512 bytes logical, 4096 bytes physical                                                                           
Rotation Rate:    5400 rpm                                                                                                         
Device is:        In smartctl database [for details use: -P show]                                                                   
ATA Version is:   ACS-2 (minor revision not indicated)                                                                             
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)                                                                           
Local Time is:    Tue Jul  9 08:04:53 2019 CEST                                                                                     
SMART support is: Available - device has SMART capability.                                                                         
SMART support is: Enabled   

=== START OF READ SMART DATA SECTION ===                                                                                           
SMART overall-health self-assessment test result: PASSED                                                                           
                                                                                                                                    
General SMART Values:                                                                                                               
Offline data collection status:  (0x00) Offline data collection activity                                                           
                                        was never started.                                                                         
                                        Auto Offline Data Collection: Disabled.                                                     
Self-test execution status:      ( 121) The previous self-test completed having                                                     
                                        the read element of the test failed.                                                       
Total time to complete Offline                                                                                                     
data collection:                (39360) seconds.                                                                                   
Offline data collection                                                                                                             
capabilities:                    (0x7b) SMART execute Offline immediate.                                                           
                                        Auto Offline data collection on/off support.                                               
                                        Suspend Offline collection upon new                                                         
                                        command.                                                                                   
                                        Offline surface scan supported.                                                             
                                        Self-test supported.                                                                       
                                        Conveyance Self-test supported.                                                             
                                        Selective Self-test supported.                                                             
SMART capabilities:            (0x0003) Saves SMART data before entering                                                           
                                        power-saving mode.                                                                         
                                        Supports SMART auto save timer.                                                             
Error logging capability:        (0x01) Error logging supported.                                                                   
                                        General Purpose Logging supported.                                                         
Short self-test routine                                                                                                             
recommended polling time:        (   2) minutes.                                                                                   
Extended self-test routine                                                                                                         
recommended polling time:        ( 395) minutes.                                                                                   
Conveyance self-test routine                                         
recommended polling time:        (   5) minutes.                                                                                   
SCT capabilities:              (0x703d) SCT Status supported.                                                                       
                                        SCT Error Recovery Control supported.                                                       
                                        SCT Feature Control supported.                                                             
                                        SCT Data Table supported. 
                                        
SMART Attributes Data Structure revision number: 16                                                                                 
Vendor Specific SMART Attributes with Thresholds:                                                                                   
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE                                   
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       5868                                         
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       6050                                         
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       817                                         
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0                                           
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0                                           
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       11247                                       
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0                                           
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0                                           
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       812                                         
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       75                                           
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       773                                         
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       31                                           
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0                                           
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       31                                           
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0                                           
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0                                           
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0                                           
                                                                                                                                    
SMART Error Log Version: 1                                                                                                         
No Errors Logged                                                                                                                   
                                                                                                                                    
SMART Self-test log structure revision number 1                                                                                     
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error                                     
# 1  Extended offline    Completed: read failure       90%     11233         1347439366                                             
# 2  Short offline       Completed without error       00%     11231         -                                                     
# 3  Extended offline    Aborted by host               90%     11231         -                                                     
# 4  Short offline       Completed without error       00%     11230         -                                                     
                                                                                                                                    
SMART Selective self-test log data structure revision number 1                                                                     
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS                                                                                       
    1        0        0  Not_testing                                                                                               
    2        0        0  Not_testing                                                                                               
    3        0        0  Not_testing                                                                                               
    4        0        0  Not_testing                                                                                               
    5        0        0  Not_testing                                                                                               
Selective self-test flags (0x0):                                                                                                   
  After scanning selected spans, do NOT read-scan remainder of disk.                                                               
If Selective self-test is pending on power-up, resume after 0 minute delay.          
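
For anyone who wants to pull the failing entry out of that self-test log automatically, here is a minimal sketch. It assumes the column layout shown above (columns separated by two or more spaces), and the function name is made up for illustration.

```python
import re

# One row of the "SMART Self-test log" table, e.g.
# "# 1  Extended offline    Completed: read failure       90%     11233         1347439366"
# Columns are assumed to be separated by two or more spaces, as in the log above.
ROW = re.compile(
    r"^#\s*(?P<num>\d+)\s+(?P<desc>.+?)\s{2,}(?P<status>.+?)\s{2,}"
    r"(?P<remaining>\d+)%\s+(?P<hours>\d+)\s+(?P<lba>\S+)$"
)


def failed_self_tests(log_text):
    """Return one dict per self-test row whose status mentions a failure."""
    failures = []
    for line in log_text.splitlines():
        match = ROW.match(line.strip())
        if match is None or "failure" not in match.group("status"):
            continue
        lba = match.group("lba")
        failures.append({
            "num": int(match.group("num")),
            "test": match.group("desc"),
            "status": match.group("status"),
            "remaining_percent": int(match.group("remaining")),
            "lifetime_hours": int(match.group("hours")),
            # "-" in the log means no LBA was recorded for that test.
            "first_error_lba": int(lba) if lba.isdigit() else None,
        })
    return failures


# Rows copied from the self-test log above.
SAMPLE_LOG = """\
# 1  Extended offline    Completed: read failure       90%     11233         1347439366
# 2  Short offline       Completed without error       00%     11231         -
# 3  Extended offline    Aborted by host               90%     11231         -
"""

if __name__ == "__main__":
    for failure in failed_self_tests(SAMPLE_LOG):
        print(failure)
```

Only the first row is reported, since the other two either completed without error or were aborted by the host.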
 

Jailer

Not strong, but bad
Joined
Sep 12, 2014
Messages
4,974
Something else that needs mentioning that is addressed in the user guide as well as the forum is setting up periodic smart tests and email alerts. You obviously don't have this set up otherwise you would have received an email alert with the first pending sector that popped up on the drive in question. Read the user guide, check out the resource section and search the forum for how to set this up.
 

ltkrogoth

Dabbler
Joined
Jun 3, 2016
Messages
12
Something else that needs mentioning that is addressed in the user guide as well as the forum is setting up periodic smart tests and email alerts. You obviously don't have this set up otherwise you would have received an email alert with the first pending sector that popped up on the drive in question. Read the user guide, check out the resource section and search the forum for how to set this up.

Actually, I did set up periodic short and long SMART tests. But unfortunately I made a mistake and only set that up for one drive instead of every drive... dumb, I know. I also noticed that I get an error from FreeNAS when it tries to send me an email. As soon as the resilvering is finished I have to clean up a lot of stuff, including repairing and updating FreeNAS.
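
A quick way to catch that kind of gap is to compare the disks in the pool against the disks that actually have a test scheduled. This is only a sketch: both lists are typed in by hand here, and the device names and the function name are assumptions for illustration.

```python
def untested_disks(pool_disks, scheduled_disks):
    """Return the pool disks that have no SMART test scheduled, sorted by name."""
    return sorted(set(pool_disks) - set(scheduled_disks))


if __name__ == "__main__":
    # Hypothetical example: three pool disks, but only one has a test scheduled.
    pool = ["ada2", "ada3", "ada4"]
    scheduled = ["ada2"]
    print(untested_disks(pool, scheduled))  # prints ['ada3', 'ada4']
```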

I have now replaced the bad drive, deactivated all scrubs and snapshots, and started the resilvering. It was easier than expected. I am very careful with my data, using not only RAID-Z2 but also a regular backup to an external drive. But the chance of losing data panics me... :/

Update: Resilvering finished after approx. 13 hours. Everything is running fine and smooth. I updated FreeNAS to 11.2 and set up weekly SMART tests with email alerts. Thanks to all of you for the help!
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
We all learn from our mistakes but it's easy to not select all the drives in the original GUI, not sure about the new GUI.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Actually, I did set up periodic short and long SMART tests. But unfortunately I made a mistake and only set that up for one drive instead of every drive... dumb, I know. I also noticed that I get an error from FreeNAS when it tries to send me an email. As soon as the resilvering is finished I have to clean up a lot of stuff, including repairing and updating FreeNAS.

I have now replaced the bad drive, deactivated all scrubs and snapshots, and started the resilvering. It was easier than expected. I am very careful with my data, using not only RAID-Z2 but also a regular backup to an external drive. But the chance of losing data panics me... :/

Update: Resilvering finished after approx. 13 hours. Everything is running fine and smooth. I updated FreeNAS to 11.2 and set up weekly SMART tests with email alerts. Thanks to all of you for the help!
Don't deactivate scrubs or snapshots. You will forget to enable them and there is zero reason to turn them off for a resilver.
 

ltkrogoth

Dabbler
Joined
Jun 3, 2016
Messages
12
We all learn from our mistakes but it's easy to not select all the drives in the original GUI, not sure about the new GUI.

In the old GUI it was darn easy to miss drives, as I experienced... the new GUI is more intuitive.

Don't deactivate scrubs or snapshots. You will forget to enable them and there is zero reason to turn them off for a resilver.

I wanted all available capacity for the resilvering. Doesn't the resilvering slow down during scrubs and snapshots because of the heavy read and write load on the drives? Anyway, I've already activated everything again :)
 

danb35

Hall of Famer
Joined
Aug 16, 2011
Messages
15,458
Snapshots do very little to the pool (they write a very small amount of metadata), and a scrub won't interfere with a resilver.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Snapshots do very little to the pool (they write a very small amount of metadata), and a scrub won't interfere with a resilver.
I agree, snapshots are free and a resilver is actually a scrub that also writes data.
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
and a scrub won't interfere with a resilver.
It has been my experience that if a scrub starts while resilvering, the extra hard drive activity will slow down the resilver. It won't stop it, but as an example only: if the resilver would have taken 24 hours before the scrub started, with a scrub running it could take an additional 4 hours. The scrub will also take longer to complete. If this is not correct, I'm all ears. I'm just answering the question, and it is true that there should be no problems doing both at the same time. If you have a vdev/pool that is one drive failure away from losing the pool, ensuring a scrub does not occur is key to finishing a resilver faster. This includes any data access, so I'd take the NAS off the network to inhibit extra hard drive activity.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
It has been my experience that if a scrub starts while resilvering, the extra hard drive activity will slow down the resilver. It won't stop it, but as an example only: if the resilver would have taken 24 hours before the scrub started, with a scrub running it could take an additional 4 hours. The scrub will also take longer to complete. If this is not correct, I'm all ears. I'm just answering the question, and it is true that there should be no problems doing both at the same time. If you have a vdev/pool that is one drive failure away from losing the pool, ensuring a scrub does not occur is key to finishing a resilver faster. This includes any data access, so I'd take the NAS off the network to inhibit extra hard drive activity.

Code:
root@freenas[/mnt/tank/d1]# zpool scrub tank
cannot scrub tank: currently resilvering


You cannot scrub a pool while it is resilvering.
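
If you script your scrubs, you can check for this up front instead of hitting the error. The following is only a sketch: it assumes the `scan:` line of `zpool status` contains the words "resilver in progress" while a resilver is running, and the function names are made up for illustration.

```python
import subprocess


def is_resilvering(zpool_status_text):
    """True if the `scan:` line of `zpool status` reports a resilver in progress."""
    for line in zpool_status_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("scan:") and "resilver in progress" in stripped:
            return True
    return False


def scrub_if_idle(pool):
    """Start a scrub only when the pool is not resilvering.

    Returns True if the scrub was started, False if it was skipped.
    Needs the zpool command and enough privileges to run it.
    """
    status = subprocess.run(
        ["zpool", "status", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    if is_resilvering(status):
        print(f"{pool}: resilver in progress, not starting a scrub")
        return False
    subprocess.run(["zpool", "scrub", pool], check=True)
    return True
```

Calling `scrub_if_idle("tank")` on the pool from the example above would skip the scrub until the resilver has finished.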
 

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Code:
root@freenas[/mnt/tank/d1]# zpool scrub tank
cannot scrub tank: currently resilvering


You cannot scrub a pool while it is resilvering.
OMG, my mind is failing me. Time to stop giving bad advice. I must have been thinking about running a SMART Long test but my mind is still telling me it was a scrub, unless you could run both back in 2011. Eh, maybe my heart attack is affecting me more than I know, I mean it wasn't a stroke but I'm not a doctor, but I will be telling my doctor the next time I see him. If you see me making bad statements, call me on it. I think everyone knows I'd rather not post than give bad advice. Well time to feed the animals, hopefully I'll give the cats the cat food and the dogs the dog food.
 