Hard Drive Troubleshooting Guide (Basic Common Failures)

Status
Not open for further replies.

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Please note: This guide has been moved to the Resources Section of the forum and can be found at this hyperlink.

Future updates will be made there. The original version is archived below.




Hard Drive Troubleshooting Guide

This guide covers the most routine single hard drive failures that are encountered and is not meant to cover every situation, specifically we will check to see if you have a physical drive failure or a communications error. If this guide fails to solve your problem, please open a new thread in the Help forum, list your hardware specs (FreeNAS version, Hardware Configuration), your failure and all indications, and specify that you used this guide and the step which failed to help you if appropriate. If there is an error or improvement you would like to suggest to this procedure, contact one of the forum moderators or the author and your inputs will be evaluated.

How to use this guide:

1. It is assumed you have some knowledge on how to open up a Shell window and perform some minor Linux/FreeBSD commands.

2. All the steps in this guide are non-destructive so you can safely perform these steps without further risk to your data.

3. We cannot take into account all formats of an error message but we used “?” to indicate any value. Additionally if we list an error message format, please keep in mind that as the software changes, the format may change and we will not update the guide every time a minor format of a message occurs.

4. The drive identifier in each command will be “ada0” however the user must enter the identifier for the suspect drive such as “ada4” or maybe “da4”. The failure message should indicate the drive identifier.

5. Once you have identified the failed drive serial number, write it down because drive identifiers “ada0” can change and the serial number is the best way to track and replace your drive if required.

6. You may be referenced to use the FreeNAS User Manual to conduct specific procedures.

7. Appendix A: Examples Error Messages

8. Appendix B: S.M.A.R.T. Data, What’s Important to Me?



Routine Procedures:

These few procedures will be run often so to minimize placing these steps all over the procedure, they will be written here and the user will refer here when directed to run one of them.

Output SMART Status Results

This procedure will display the hard drive data, including error information.

1) Open a shell (can be done via the GUI or SSH using something like Putty).

2) Type “smartctl –a /dev/ada0” where “ada0” is the subject drive. If the output scrolls off the screen then enter “smartctl –a /dev/ada0 | more” and the screen will only fill one page at a time.

3) Note the items asked about in the troubleshooting text.

Perform SMART Long Test

A SMART Long Test conducts a test of the drive electronics and a read of the entire drive surface. This test should be run periodically by setting it up in the FreeNAS GUI for automatic accomplishment. Different users have different opinions on how frequently this should be done, the author prefers once a week for the Long tests and daily for the Short tests.

1) Open a shell (can be done via the GUI or SSH using something like Putty).

2) Type “smartctl –t long /dev/ada0” where “ada0” is the drive identifier. Note how long it will take for the test to complete. You may still use your system however it will slow down the testing.

3) Once the period of time has lapsed for the testing, obtain a SMART Status Result and return to the troubleshooting text.



Troubleshooting Procedure

What type of failure did you received?

1) Error stating:
a. ID 5 Relocated Sector Count, ID 197 Current Pending Sector Count, ID 198 Offline, Uncorrectable Sector Count, or (?da?:ata?:?:?:?): CAM status: ATA Status Error, Pool is Degraded, or if you just don't know where to start.​

b. Timeout errors or any Communication Errors (ID 199 UDMA CRC Error Count).​

2) If “a” then goto step 3, If “b” then goto step 4.


Physical Drive Failure

3) This procedure troubleshoots common physical drive failures.

a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198. (Note: For detailed explanation of what each of these IDs represent, visit the S.M.A.R.T. Wiki website)​

b. If any of the IDs are greater than zero (0) then the drive has failed for RMA purposes.​

c. If ALL of the IDs are zero (0), then run a SMART Long Test and after the test has completed, conduct Output SMART Status Results. If ALL of the IDs are still at zero (0), ensure you are troubleshooting the correct drive and if you are, proceed to step 4 because the hard drive does not indicate a hardware failure at this point.​

d. If any of the IDs are 1 to 5 then you may be able to retain the drive however if you’re troubleshooting it, it’s not likely you desire to retain the drive even if it's slowly failing. If you do retain it, it’s highly recommended that you run frequent SMART Long Tests on the drive to ensure the IDs values do not increase. If they increase at all then replace the drive.​

e. If replacing the drive follow the FreeNAS User Guide on how to replace a failed drive. If you have an encrypted drive, ensure you take appropriate precautions per the FreeNAS User Guide.​

f. Exit this guide.​



Drive Communications Failure

4) This procedure troubleshoots common communications errors for a single drive failure.

a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198.​

b. Inspect ID’s 5, 197, and 198 and if any value is greater than zero (0), the drive may have an unrelated failure. Goto to Step 3 after finishing this troubleshooting.​

c. Replace the DATA cable between the hard drive (utilizing the serial number to identify the suspect drive) and controller. (Note: The data cable is the most common cause of drive communications errors.)​

d. If the problem is not fixed, Swap the DATA cables between the suspect drive and a nearby drive (at the drive connections). (Note: We are trying to isolate the problem to the hard drive or something else.)​

e. If the problem goes away, it’s likely the DATA cable is still the cause, you may exit this procedure however keep an eye open for future failures. (Note: At times a poor connection may cause this error or a marginal data cable.)​

f. If the problem still exists, run the Output SMART Status Results and verify the drive serial number.​

g. If the drive serial number changed, continue with step h, if the serial number did not change continue with step j. (Note: If the drive serial number changed then the failure could be the DATA cable or drive controller.)​

h. Relocate the DATA cable for the failing drive (remember to use the serial number) to another DATA port on the controller or motherboard that does appear to be working.​

i. If the failure still exists then run the Output SMART Status Results and verify the drive serial number that failed. If it has not changed then the DATA cable is suspect. Goto step k.​

j. If the problem remains with the same drive then the hard drive electronics are suspect and the drive can be considered defective, replace the drive.​

k. If the problem still exists, this is not a common failure and post your failure in the FreeNAS forums.​

l. Exit this guide.​



APPENDIX A


Example Error Messages

Hard Drive Failure Messages


Email Messages:

CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors

CRITICAL: Device: /dev/ada1, 817 Currently unreadable (pending) sectors

CRITICAL: Device: /dev/ada1, 2397 Offline uncorrectable sectors


SMART Results Output

(Note: Items in red are failure indications)

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family: Western Digital Red
Device Model: WDC WD20EFRX-68AX9N0
Serial Number: WD-WMC300411000
LU WWN Device Id: 5 0014ee 6ad787ae3
Firmware Version: 80.00A80
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Wed Jan 27 15:41:21 2016 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (27840) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 281) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 31
3 Spin_Up_Time 0x0027 176 174 021 Pre-fail Always - 4175
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 340
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 16
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 061 061 000 Old_age Always - 28532
10 Spin_Retry_Count 0x0032 100 100 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 100 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 148
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 61
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 278
194 Temperature_Celsius 0x0022 120 107 000 Old_age Always - 27
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 42

198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 42

199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 200 200 000 Old_age Offline - 0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 28522 -
# 2 Short offline Completed without error 00% 28498 -
# 3 Short offline Completed without error 00% 28474 -
# 4 Extended offline Completed: read failure 70% 28455 - 543988376
# 5 Short offline Completed without error 00% 28426 -
# 6 Short offline Completed without error 00% 28330 -
# 7 Extended offline Completed without error 00% 28312 -

SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Hard Drive Communications Error Messages

(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 90 b2 b9 40 2e 00 00 01 00 00

(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error

(ada1:ahcich1:0:0:0): Retrying command





APPENDIX B

S.M.A.R.T. Data, What’s Important to Me?

When troubleshooting a hard drive failure we utilize the built in SMART diagnostics, part of every hard drive. These results can be used to justify an RMA as well. A SMART test will not isolate communication errors, it will only validate the physical hard drive. If you want to get some good information visit the Wiki for S.M.A.R.T at this link: https://en.wikipedia.org/wiki/S.M.A.R.T.


The important data we look at are as follows:

1) Serial Number
2) ID 5 Relocated Sector Count
3) ID 197 Current Pending Sector Count
4) ID 198 Offline Uncorrectable Sector Count
5) ID 199 UDMA CRC Error Count

Other notable data are:

6) ID 194 Temperature
7) ID 200 MultiZone Error Rate
8) Extended Self-test Time (value in minutes)
9) SMART Self-test logs, specifically the results of the self tests

If ID’s 5, 197, or 198 have any value greater than zero (0) then there has been some defect identified in the media. If ID 194 Temperature is above 40C then you may have a cooling issue and this could shorten the life of your drive. Many manufacturers will not accept an RMA if the temperature of the drive exceeds a certain value (manufacturer specific) as this voids the warranty.

ID 199 is a communications error between the drive electronics and the drive controller. The drive controller is part of your motherboard or an add-on card. Typically this error code results in replacement of the SATA cable to correct the situation. This wouldn't typically be a condition to RMA a drive however it is possible that the hard drive electronics has failed or someone broke the SATA connector on the drive, but that is not the typical failure we see.

ID 200 MultiZone Error Rate can be the cause of a drive failure although a value in this location doesn't always mean it's the fault. It is notable if there are no other failing indications.

The SMART Self-test logs indicate the last time you conducted a SMART test, the type of test, it’s completion status, how far it completed, and the hours of the results (Hours is a value in relation to the ID 9 Power On Hours value.)

It is always a good thing to run a SMART Long test if you doubt the integrity of your drive.

What data is not important? ID 1 is generally data which cannot be extrapolated into anything useful and does not always indicate a problem. You are best to ignore this value.
 
Last edited by a moderator:

joeschmuck

Old Man
Moderator
Joined
May 28, 2011
Messages
10,970
Extra Testing - Bad Blocks

While I personally would prefer to RMA my hard drive or install a new replacment, some people may decide that they want to run some further testing on their hard drives such as Bad Blocks because they have an issue which drove them to this hard drive troubleshooting in the first place. There is a nice thread here which documents quite a bit on how to run Bad Blocks for burn-in testing but here are the instructions for just a single drive. Drive "ada0" will be used in our example here with a Long Read Test failure at LBA 1144448. Also, read this entire section before running the test, there is nothing worse than destroying the wrong drive.

If you had a SMART Short or Long read test failure then we should test that section of the hard drive first because if it keeps failing then the drive is not salvageable. Record the failing LBA and then add at least 100,000 and subtract at least 100,000 to that count. These will become the ending and starting LBAs. You can subtract or add a larger value than 100,000 and it's not a bad idea if you get zero errors during your first run. Once you are all done with the troubled area you can run badblocks on the entire surface and ensure there are no other problem locations.

WARNING: THIS IS A DESTRUCTIVE TEST, VERIFY YOUR DRIVE BY THE SERIAL NUMBER!

1) Open up an SSH window.

First we need to allow you to perform RAW Write operations otherwise you will get an error message stating the operation is not permitted.

2) Type "sysctl kern.geom.debugflags=0x10" and hit Enter.

Now comes the destructive part...
At the end of the command line is the ending LBA and then the starting LBA, in that order. We will run the test 20 times.

3) Type "badblocks -b 4096 -ws -c 64 -p 20 /dev/ada0 1244448 1044448".

So lets say step 3 identifies a few more blocks and fixes them. Next you should adjust the ending and starting LBA number for a larger area such as +/- 200,000. If you don't have any new failures there, then go to step 4 and test the entire drive. If a failure occurs, go back to step 3 for the offending LBA's and repeat.

4) Type "badblocks -b 4096 -ws -c 64 /dev/ada0".

Once you are able to get through the entire badblocks program you can perform step 5 or reboot the machine.

5) Type "sysctl kern.geom.debugflags=0x00" and hit Enter.

Good Luck!
 
Status
Not open for further replies.
Top