Pending sector, how to force reallocate (and should I)?

ProblematicLlama · Aug 17, 2016

Hi,
I have a pending sector on one of my disks (the other 4 are completely healthy). I have read that I can force reallocation of a pending sector by using dd to write to the bad sector. I found someone's handy guide here: https://dekoder.wordpress.com/2014/10/08/fixing-freenas-currently-unreadable-pending-sectors-error/

One section that's thrown a small spanner in the works for me, however, is that it says to get the sector size. In their example, the logical and physical sector sizes are the same: 512 bytes. However, on my disk, smartctl reports:

Code:

Sector Sizes:     512 bytes logical, 4096 bytes physical

If I'm running dd to write to the bad sector, what should I set the bs to?

Also the command in the guide doesn't seem to actually refer to the sector size, not sure if it's a mistake or if it's legit. Their command shows (note: I'm keeping the example numbers just for consistency with the guide, obviously the LBA is different in my case):

Code:

dd if=/dev/zero of=/dev/ada2 bs=892134344 count=1 seek= conv=noerror,sync

which appears to be using the LBA for the bs but nothing for the seek. Is this legit? The way everything before it is worded implies the command should go something more like this:

Code:

dd if=/dev/zero of=/dev/ada2 bs=512 count=1 seek=892134344 conv=noerror,sync
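
For reference, dd counts `seek` in units of `bs`, so the write starts at byte offset bs × seek. A minimal sketch of that arithmetic follows; the helper name `dd_write_offset` is made up for illustration, and the LBA is the guide's example number, not one from this disk:

```shell
# Byte offset at which dd starts writing: seek is counted in blocks of bs bytes.
# dd_write_offset is a hypothetical helper, not part of dd itself.
dd_write_offset() {
    bs=$1
    seek=$2
    echo $(( bs * seek ))
}

# With bs=512, seek=<LBA> lands exactly on the 512-byte logical sector <LBA>.
dd_write_offset 512 892134344
```

By the same arithmetic, a seek of 0 starts at byte 0 of the disk whatever the block size, so the LBA only selects the target sector when it is passed as `seek`.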

Below is the output of my SMART report (smartctl -q noserial -a /dev/ada0)

Code:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Wed Aug 17 14:35:20 2016 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:         (40080) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 402) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       32
  3 Spin_Up_Time            0x0027   179   176   021    Pre-fail  Always       -       6016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11716
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       249
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       2

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     11704         -
# 2  Short offline       Completed without error       00%     11680         -
# 3  Short offline       Completed without error       00%     11656         -
# 4  Short offline       Completed without error       00%     11634         -
# 5  Short offline       Completed without error       00%     11608         -
# 6  Short offline       Completed without error       00%     11585         -
# 7  Short offline       Completed without error       00%     11536         -
# 8  Short offline       Completed: read failure       90%     11513         105062840
# 9  Short offline       Completed: read failure       90%     11489         105062840
#10  Short offline       Completed: read failure       90%     11465         105062840
#11  Short offline       Completed: read failure       10%     11441         105062840
#12  Short offline       Completed without error       00%     11417         -
#13  Extended offline    Completed: read failure       90%     11392         105062840
#14  Short offline       Completed: read failure       90%     11369         105062840
#15  Short offline       Completed: read failure       90%     11345         105062840
#16  Short offline       Completed: read failure       10%     11321         105062840
#17  Short offline       Completed: read failure       90%     11297         105062840
#18  Short offline       Completed: read failure       90%     11273         105062841
#19  Short offline       Completed: read failure       90%     11249         105062840
#20  Short offline       Completed without error       00%     11201         -
#21  Extended offline    Completed: read failure       90%     11190         105059376

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

So my questions really are:

1) What command should I be using to force reallocate this sector? Should I even be doing this or is it safer to just leave it and let the system attempt to write to the bad sector in its own time?
2) While it's there, can you spot anything else alarming in my SMART output that indicates I need to replace this drive immediately, or am I relatively safe to keep it in there?

Thanks in advance for your help

Nick2253 · Aug 17, 2016

Unless I'm missing something, Current_Pending_Sector is zero. But as a bigger issue, I would not even bother. The OS will attempt to write to it eventually, and everything will be handled then. Pending sectors are not uncommon, so there's no sense in losing sleep over them.

From a safety perspective, I wouldn't want to use dd in a destructive manner on a ZFS drive. ZFS does a fantastic job of managing your disks, and I wouldn't want to get in the way of that by trying to change the data underneath ZFS.

SweetAndLow · Aug 17, 2016

Why would you want to do this?

ProblematicLlama · Aug 17, 2016

Nick2253 said:
Unless I'm missing something, Current_Pending_Sector is zero. But as a bigger issue, I would not even bother. The OS will attempt to write to it eventually, and everything will be handled then. Pending sectors are not uncommon, so there's no sense in losing sleep over them.

From a safety perspective, I wouldn't want to use dd in a destructive manner on a ZFS drive. ZFS does a fantastic job of managing your disks, and I wouldn't want to get in the way of that by trying to change the data underneath ZFS.

Thanks - I'll leave it alone then. That's interesting, the previous time I'd run the command there was a pending sector and the FreeNAS GUI was showing an alert for it which I hadn't dismissed - I guess it must have already been dealt with in the meantime and I didn't re-check my SMART output that I posted! Oops..

Though now I'm finding it a bit weird that I had a pending sector but the SMART output shows 0 pending sectors and 0 reallocated sectors... Am I missing something?

SweetAndLow said:
Why would you want to do this?

I've just seen that it's been done and figured it gets the reallocation out of the way right away at a time where I'm available to deal with it if anything goes wrong - "fail early" and all that. That being said I guess it's incredibly unlikely that the spare sectors are also bad so really I suppose there isn't much point.

wblock · Aug 17, 2016

To go back to the related question: block size really doesn't matter. With 512-byte blocks, a 512-byte drive writes one block. But so does a drive with 4K-byte blocks--they can't write less than one whole block. So using 512 would work for both.

Agreed with the others, I would not mess with this sort of thing on a drive that is currently in use in any way. If you want to, swap out that drive for a good one, then experiment.

Nick2253 · Aug 17, 2016

ProblematicLlama said:
Though now I'm finding it a bit weird that I had a pending sector but the SMART output shows 0 pending sectors and 0 reallocated sectors... Am I missing something?

A pending sector is not necessarily a bad sector. However, based on whatever algorithm the drive's controller uses, it raised a flag. On a future attempt to use the sector, if it's still "bad", it gets reallocated. If it's not "bad", it gets used.

rs225 · Aug 17, 2016

I would do it, because it seems to be causing your long/short tests to fail.

I use a bs=4k count=1, and divide the LBA by 8 (drop remainder). The drive can't write just 512-bytes, it has to write 4096. So first it has to read the bad sector. To avoid this, you just overwrite the whole 4k sector.

Your logs even show that the read fails on the 'neighbor' LBA sometimes (which is actually the same sector).

If it works, your tests may complete, or at least fail somewhere else. If it never succeeds, consider replacing the drive.
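
The divide-by-8 step can be sketched as below, assuming 512-byte logical and 4096-byte physical sectors as reported for this drive. The helper name `phys_sector` is made up for illustration. It shows that the two failing LBAs in the log fall in the same physical sector:

```shell
# Map a 512-byte logical LBA to the index of the 4096-byte physical sector
# that contains it. Shell integer division drops the remainder.
# phys_sector is a hypothetical helper name.
phys_sector() {
    echo $(( $1 / 8 ))
}

phys_sector 105062840   # prints 13132855
phys_sector 105062841   # prints 13132855 (same physical sector)
```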

Nick2253 · Aug 17, 2016

His last seven short tests completed without error, so that's not really a reason to do anything.

rs225 · Aug 17, 2016

Nick2253 said:
His last seven short tests completed without error, so that's not really a reason to do anything.

No, but the two failed long tests are, particularly since both long tests stopped with 90% of the drive untested, and within 2MB of each other, indicating a possibly bad area on that drive.
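
A quick check of the "within 2MB" figure, using the two extended-test LBAs from the log (105062840 and 105059376) and assuming 512-byte logical sectors. The helper name `lba_distance_bytes` is made up for illustration:

```shell
# Distance in bytes between two LBAs, assuming 512-byte logical sectors.
# lba_distance_bytes is a hypothetical helper name.
lba_distance_bytes() {
    a=$1
    b=$2
    if [ "$a" -lt "$b" ]; then
        delta=$(( b - a ))
    else
        delta=$(( a - b ))
    fi
    echo $(( delta * 512 ))
}

lba_distance_bytes 105062840 105059376   # prints 1773568 (about 1.7 MB)
```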

Nick2253 · Aug 17, 2016

Then before randomly poking at the disk with dd, I'd run another long test, and see if it fails. If the long test fails, then we can go from there. But the last long test was about 2 weeks ago, so I'd say that it's not super applicable to what we are doing today.

Stux · Aug 17, 2016

I've had a couple of pending sectors on a drive. It caused alerts and smart failures in the gui until I wrote over the block with dd and then ran a scrub.

Disk has been fine ever since.

ProblematicLlama · Aug 18, 2016

OK so I tried to run another long test and it failed early again on a read, and again the FreeNAS GUI shows:

CRITICAL: Aug. 9, 2016, 10 a.m. - Device: /dev/ada0, 1 Currently unreadable (pending) sectors
CRITICAL: Aug. 18, 2016, 10:32 a.m. - Device: /dev/ada0, new Self-Test Log error at hour timestamp 11735

However, the smartctl -a output again shows no pending or reallocated sectors, and this time the extended test failed at a different, earlier LBA (89067040) than the previous failures (105062840).

Code:

smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Firmware Version: 82.00A82
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Aug 18 12:39:00 2016 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121)    The previous self-test completed having
                    the read element of the test failed.
Total time to complete Offline 
data collection:         (40080) seconds.
Offline data collection
capabilities:             (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:     (   2) minutes.
Extended self-test routine
recommended polling time:     ( 402) minutes.
Conveyance self-test routine
recommended polling time:     (   5) minutes.
SCT capabilities:           (0x703d)    SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       32
  3 Spin_Up_Time            0x0027   179   176   021    Pre-fail  Always       -       6016
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       17
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   084   084   000    Old_age   Always       -       11738
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       17
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       4
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       250
194 Temperature_Celsius     0x0022   119   113   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed: read failure       90%     11735         89067040
# 2  Short offline       Completed without error       00%     11704         -
# 3  Short offline       Completed without error       00%     11680         -
# 4  Short offline       Completed without error       00%     11656         -
# 5  Short offline       Completed without error       00%     11634         -
# 6  Short offline       Completed without error       00%     11608         -
# 7  Short offline       Completed without error       00%     11585         -
# 8  Short offline       Completed without error       00%     11536         -
# 9  Short offline       Completed: read failure       90%     11513         105062840
#10  Short offline       Completed: read failure       90%     11489         105062840
#11  Short offline       Completed: read failure       90%     11465         105062840
#12  Short offline       Completed: read failure       10%     11441         105062840
#13  Short offline       Completed without error       00%     11417         -
#14  Extended offline    Completed: read failure       90%     11392         105062840
#15  Short offline       Completed: read failure       90%     11369         105062840
#16  Short offline       Completed: read failure       90%     11345         105062840
#17  Short offline       Completed: read failure       10%     11321         105062840
#18  Short offline       Completed: read failure       90%     11297         105062840
#19  Short offline       Completed: read failure       90%     11273         105062841
#20  Short offline       Completed: read failure       90%     11249         105062840
#21  Short offline       Completed without error       00%     11201         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
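
To see at a glance which LBAs the self-test log has failed on, the log lines can be filtered as in the sketch below. It assumes the log layout shown above (entries start with `#` and the LBA is the last column); the helper name `failing_lbas` is made up for illustration:

```shell
# Read a smartctl self-test log on stdin and print each distinct
# LBA_of_first_error once, in ascending order. Entries that completed
# without error have "-" in the last column and are skipped.
# failing_lbas is a hypothetical helper name.
failing_lbas() {
    awk '/^#/ && $NF ~ /^[0-9]+$/ { print $NF }' | sort -un
}

# Possible usage on the live system:
#   smartctl -l selftest /dev/ada0 | failing_lbas
```

Run against the log above, this would list 89067040, 105062840 and 105062841.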

I'm wondering if it's worth doing this dd thing to try to force reallocate the sector so that I can run a full smart test, as the fact it doesn't get very far worries me a bit - as someone earlier pointed out, there could be a load of other bad sectors later on in the drive.

Is this a good idea and if so exactly what command would I run? Based on rs225's comment I'm assuming it would be something like:

Code:

dd if=/dev/zero of=/dev/ada0 bs=4096 count=1 seek=13132855 conv=noerror,sync

Where 13132855 is 105062840/8 (although looking at it the 1st Extended Offline test seems to hit a much earlier LBA so not sure whether to try the same for that).
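
As a sanity check on the numbers, here is a sketch that only prints the dd command for a given LBA and does not run it. It assumes 512-byte logical and 4096-byte physical sectors; the helper name `print_dd_for_lba` is made up for illustration. Running the printed command destroys whatever is stored in that sector.

```shell
# Print (do NOT execute) a dd command that would overwrite the whole
# 4096-byte physical sector containing the given 512-byte LBA.
# print_dd_for_lba is a hypothetical helper name.
print_dd_for_lba() {
    dev=$1
    lba=$2
    echo "dd if=/dev/zero of=$dev bs=4096 count=1 seek=$(( lba / 8 )) conv=noerror,sync"
}

print_dd_for_lba /dev/ada0 105062840   # seek=13132855
print_dd_for_lba /dev/ada0 89067040    # seek=11133380
```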

Nick2253 · Aug 18, 2016

In my opinion, a failing SMART test is a huge red flag, and I'd RMA the drive. Most manufacturers won't balk if you have a failed SMART test, so just to be on the safe side, you might as well replace it.

@rs225 might have a different opinion on what steps to take. If there's a way to fix it, then I'd be interested to see the process, though I can't say I'd do the same on any of my personal drives.

rosabox · Aug 18, 2016

First I would suggest replacing the drive.
Then, if it's out of warranty, you can try to "fix" the bad sectors by doing a "secure erase" (which obviously erases the whole disk!), and then try the Extended Offline test again.
I was able to "fix" a couple of drives this way but I never trusted them again and used them only for testing purposes.
Secure Erase: http://cmrr.ucsd.edu/people/Hughes/secure-erase.html

Robert Trevellyan · Aug 18, 2016

If you're determined to force the issue, I suggest removing it from the system and running badblocks destructive on it.

Stux · Aug 21, 2016

The dd thing worked for me in the past. Cleared one sector, drive has been fine since.

ProblematicLlama · Aug 28, 2016

Hi:

I have since bought a replacement drive (so I can use it as a spare in the future) and I've run badblocks on the thing to check it's OK. A weird thing happened though - on the third set of tests, I got a critical alert email containing the following:

Device: /dev/ada0, unable to open device

I then tried to run smartctl -a /dev/ada0 and it comes back with:

/dev/ada0: Unable to detect device type
Please specify device type with the -d option.

Anyway, I rebooted the server and ran smartctl -a again on it, and it seemed to show up fine. Then I decided to start a long test (smartctl -t long /dev/ada0). Part way into the test, again, I ran smartctl -a /dev/ada0 and I got the above output again (please specify device type with -d option). I have since rebooted again, and yet again, smartctl -a works and running smartctl -t long /dev/ada0 seems to be going OK now.

Now obviously the fact that I suddenly got "unable to open device" towards the end of the badblocks test is a very bad thing (and I have no idea why that had happened) but has anyone ever experienced smartctl complaining about device type before, especially after it's already worked before?

I'm guessing right now my best option is to run another test on it with the HDD attached to a different SATA port with a different power and data cable, just to be sure. Luckily I bought the thing from Amazon and they're generally insanely good for returns if I need to.

Stux · Aug 29, 2016

Did you add this to the same system with all the other drives?

Did you just overload your psu?

Try unplugging all the other drives and see if it works fine. If it does, you need to fix your power setup, if it doesn't, it's an RMA, and lucky you tested the drive

ProblematicLlama · Aug 29, 2016

Stux said:
Did you add this to the same system with all the other drives?

Did you just overload your psu?

Try unplugging all the other drives and see if it works fine. If it does, you need to fix your power setup, if it doesn't, it's an RMA, and lucky you tested the drive

That's a good point, hadn't even thought of that. Would there be any indication in logs etc that that's what happened?

The long smart test ran fine yesterday after that reboot so it was clearly a bit of a random issue.. I'd rather not go removing all the other drives from the system to replicate it as it would take days for the issue to show again, if at all - and I couldn't really confirm it either way.

The only thing of note in the SMART test was that the load cycle count is somehow at 30 already which seems a bit high for a new drive that's been on for 4 days and is set to always on in the GUI. < Ignore that, it's on 13 which sounds more reasonable. Still a little high given my year old hard drives are on 100 but still, probably completely unrelated.

I think it's quite implausible that it would be overloading the PSU somehow as hard drives don't draw loads of power and adding one is fairly minor. My system specs are the following:

C2750D4I motherboard (with Intel Atom C2750 SoC CPU)
6x WD Red 3TB
4x8GB RAM
1x Noctua NF-A14 PWM Fan
2x Noctua NF-A9 PWM fans
A 64GB USB stick

I'm running all that on a 350w Seasonic PSU and have never hit problems before with 5 drives, but if you look at all those parts they should never even get close to the 350w even on full load. From what I can tell I could probably add quite a few more hard drives in there (if space wasn't an issue) as looking at power consumption benchmarks I found online, those drives only use 5.4 watts at peak.

Bidule0hm · Aug 29, 2016

ProblematicLlama said:
The only thing of note in the SMART test was that the load cycle count is somehow at 30 already which seems a bit high for a new drive that's been on for 4 days and is set to always on in the GUI. < Ignore that, it's on 13 which sounds more reasonable. Still a little high given my year old hard drives are on 100 but still, probably completely unrelated.

Don't worry about that, it is very low actually.

ProblematicLlama said:
I think it's quite implausible that it would be overloading the PSU somehow as hard drives don't draw loads of power and adding one is fairly minor. My system specs are the following:

C2750D4I motherboard (with Intel Atom C2750 SoC CPU)
6x WD Red 3TB
4x8GB RAM
1x Noctua NF-A14 PWM Fan
2x Noctua NF-A9 PWM fans
A 64GB USB stick

I'm running all that on a 350w Seasonic PSU and have never hit problems before with 5 drives, but if you look at all those parts they should never even get close to the 350w even on full load. From what I can tell I could probably add quite a few more hard drives in there (if space wasn't an issue) as looking at power consumption benchmarks I found online, those drives only use 5.4 watts at peak.

In fact that's wrong. A drive draws about 30-35 W during spin-up. See https://forums.freenas.org/index.php?threads/proper-power-supply-sizing-guidance.38811/ and https://forums.freenas.org/index.php?threads/how-to-measure-the-drive-spin-up-peak-current.38885/ for more info ;)
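
A rough worst-case sketch of the drive spin-up load, using the 30-35 W per-drive figure and the 6 drives mentioned in this thread. It assumes all drives spin up at the same time (no staggered spin-up) and leaves out the motherboard, RAM and fans; the helper name `spinup_watts` is made up for illustration:

```shell
# Combined spin-up draw in watts for a number of drives, assuming they all
# spin up simultaneously. Per-drive wattage is an input, not a measurement.
# spinup_watts is a hypothetical helper name.
spinup_watts() {
    drives=$1
    per_drive=$2
    echo $(( drives * per_drive ))
}

spinup_watts 6 30   # prints 180
spinup_watts 6 35   # prints 210
```

So six drives could briefly pull roughly 180-210 W at power-on under these assumptions, which is a much larger share of a 350 W supply than the 5.4 W running figure suggests.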
