Hard Drive Troubleshooting Guide (All Versions of FreeNAS)

UPDATE: 22 September 2018 - Added Drive Data Refreshing
UPDATE: 2 April 2017 - Added support for FreeNAS Corral (FreeNAS 10 and beyond)
UPDATE: 1 November 2020 - Added ID 1 and 7 description for Seagate drives at bottom of Appendix B

This guide covers the most common single hard drive failures and is not meant to cover every situation. Specifically, we will check whether you have a physical drive failure or a communications error. If this guide fails to solve your problem, please open a new thread in the Help forum, list your hardware specs (FreeNAS version, hardware configuration), your failure and all indications, and state that you used this guide and, if appropriate, which step failed to help you. If you would like to suggest a correction or improvement to this procedure, contact one of the forum moderators or the author and your input will be evaluated.

How to use this guide:

1. It is assumed you know how to open a Shell window and run some basic Linux/FreeBSD commands.

2. All the steps in the main Troubleshooting Procedure are non-destructive, so you can safely perform them without further risk to your data. The extra tests in Appendix C write to the drive, and the Bad Blocks test there is destructive.

3. We cannot take into account every format of an error message, so we use “?” to indicate any value. Additionally, if we list an error message format, keep in mind that the format may change as the software changes, and we will not update the guide every time a minor format change occurs.

4. The drive identifier in each command will be “ada0”, however you must enter the identifier for the suspect drive, such as “ada4” or maybe “da4”. The failure message should indicate the drive identifier.

5. Once you have identified the failed drive's serial number, write it down, because drive identifiers such as “ada0” can change and the serial number is the best way to track and replace your drive if required.

6. You may be referred to the FreeNAS User Guide to conduct specific procedures.

7. Appendix A: Example Error Messages

8. Appendix B: S.M.A.R.T. Data, What’s Important to Me?

9. Appendix C: Extra Testing - Drive Data Refreshing and Bad Blocks

Routine Procedures:

These few procedures are run often, so rather than repeat the steps throughout the guide, they are written here once and you will be referred back here when directed to run one of them.

Output SMART Status Results

This procedure will display the hard drive data, including error information.

1) Open a shell (can be done via the GUI or SSH using something like PuTTY). If using FreeNAS 10 from the GUI Console, type "shell" to enter the shell and type "exit" when completed. For FreeNAS 11.x or greater you may select "Shell" from the left-hand pane.

2) Type smartctl -a /dev/ada0 where “ada0” is the subject drive. If the output scrolls off the screen, enter smartctl -a /dev/ada0 | more and the screen will fill only one page at a time.

3) Note the items asked about in the troubleshooting text.

4) The following output does not mean the hard drive completely passed. It is a poor summary, and all the data must be examined to ensure no errors exist(ed):

Code:

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
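
Because the PASSED line alone is not enough, it can help to pull the serial number and the key IDs out of the full output. The following is only a minimal sketch in Python: the function names are made up for illustration, the sample text is copied from Appendix A, and it assumes the ten-column attribute table that smartctl prints for ATA drives.

```python
import re

# IDs this guide treats as the key indicators (see Appendix B).
KEY_IDS = (5, 197, 198, 199)


def parse_serial_number(smartctl_output):
    """Return the drive serial number, or None if the line is missing."""
    match = re.search(r"^Serial Number:\s*(\S+)", smartctl_output, re.MULTILINE)
    return match.group(1) if match else None


def parse_key_attributes(smartctl_output):
    """Return {attribute ID: raw value} for the key IDs found in the text."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns and start with a numeric ID.
        if len(fields) >= 10 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in KEY_IDS and fields[9].isdigit():
                found[attr_id] = int(fields[9])
    return found


# Sample lines taken from the Appendix A output.
sample = """\
Serial Number:  WD-WMC300411000
SMART overall-health self-assessment test result: PASSED
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  16
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  42
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  42
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
"""

print(parse_serial_number(sample))
print(parse_key_attributes(sample))
```

Note that this sample drive reports PASSED even though IDs 5, 197, and 198 are all non-zero, which is exactly why the summary line should not be trusted on its own.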

Perform SMART Long Test

A SMART Long Test tests the drive electronics and reads the entire drive surface. This test should be run periodically by scheduling it in the FreeNAS GUI. Different users have different opinions on how frequently this should be done; the author prefers once a week for the Long tests and daily for the Short tests.

1) Open a shell (can be done via the GUI or SSH using something like PuTTY).

2) Type smartctl -t long /dev/ada0 where “ada0” is the drive identifier. Note how long the test will take to complete. You may still use your system, however doing so will slow down the testing.

3) Once the test time has elapsed, obtain a SMART Status Result and return to the troubleshooting text.
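
If you want to know the expected test time before starting, the drive reports it in the regular SMART output as the "Extended self-test routine recommended polling time". Here is a small sketch in Python that reads that value; the function name is hypothetical and the sample lines are copied from Appendix A.

```python
import re


def extended_test_minutes(smartctl_output):
    """Return the drive's estimated Long test duration in minutes, or None."""
    match = re.search(
        r"Extended self-test routine\s+recommended polling time:\s*\(\s*(\d+)\)\s*minutes",
        smartctl_output,
    )
    return int(match.group(1)) if match else None


# Sample lines taken from the Appendix A output.
sample = """\
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 281) minutes.
"""

minutes = extended_test_minutes(sample)
print(f"{minutes} minutes, roughly {minutes / 60:.1f} hours")
```

Treat the figure as a minimum, since using the system during the test slows it down.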

Troubleshooting Procedure

What type of failure did you receive?

1) Error stating:

a. ID 5 Reallocated Sector Count, ID 197 Current Pending Sector Count, ID 198 Offline Uncorrectable Sector Count, or (?da?:ata?:?:?:?): CAM status: ATA Status Error, Pool is Degraded, or if you just don't know where to start.

b. Timeout errors or any Communication Errors (ID 199 UDMA CRC Error Count).

2) If “a” then go to step 3. If “b” then go to step 4.

Physical Drive Failure

3) This procedure troubleshoots common physical drive failures.

a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198. (Note: For detailed explanation of what each of these IDs represent, visit the S.M.A.R.T. Wiki website)

b. If any of the IDs are greater than zero (0) then the drive has failed for RMA purposes.

c. If ALL of the IDs are zero (0), then run a SMART Long Test and after the test has completed, conduct Output SMART Status Results. If ALL of the IDs are still at zero (0), ensure you are troubleshooting the correct drive and if you are, proceed to step 4 because the hard drive does not indicate a hardware failure at this point.

d. If any of the IDs are 1 to 5 then you may be able to retain the drive, however if you’re troubleshooting it, you probably don't want to keep a drive that is slowly failing. If you do retain it, it’s highly recommended that you run frequent SMART Long Tests on the drive to ensure the ID values do not increase. If they increase at all then replace the drive.

e. If replacing the drive, follow the FreeNAS User Guide on how to replace a failed drive. If you have an encrypted drive, ensure you take appropriate precautions per the FreeNAS User Guide.

f. Exit this guide.
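
The decisions in steps 3b through 3d can be summarized in a few lines of code. This is only an illustrative sketch in Python: the function names and the labels "clean", "watch", and "replace" are invented here, and the threshold of 5 is simply the one used in step 3d.

```python
def assess_media(id5, id197, id198):
    """Classify the raw values of IDs 5, 197 and 198 per steps 3b to 3d."""
    worst = max(id5, id197, id198)
    if worst == 0:
        return "clean"    # step 3c: run a Long Test, then go to step 4
    if worst <= 5:
        return "watch"    # step 3d: RMA-eligible, or keep it and monitor
    return "replace"      # steps 3b and 3e: replace the drive


def should_replace_watched_drive(previous, current):
    """Step 3d: any increase in a watched count means replace the drive."""
    return any(now > before for before, now in zip(previous, current))


print(assess_media(0, 0, 0))
print(assess_media(0, 3, 0))
print(assess_media(16, 42, 42))  # values from the Appendix A example
print(should_replace_watched_drive((1, 0, 0), (2, 0, 0)))
```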

Drive Communications Failure

4) This procedure troubleshoots common communications errors for a single drive failure.

a. Conduct Output SMART Status Results and record the drive Serial Number, IDs 5, 197, and 198.

b. Inspect IDs 5, 197, and 198 and if any value is greater than zero (0), the drive may have an unrelated failure. Go to step 3 after finishing this troubleshooting.

c. Replace the DATA cable between the hard drive (utilizing the serial number to identify the suspect drive) and controller. (Note: The data cable is the most common cause of drive communications errors.)

d. If the problem is not fixed, swap the DATA cables between the suspect drive and a nearby drive (at the drive connections). (Note: We are trying to isolate the problem to the hard drive or something else.)

e. If the problem goes away, it’s likely the DATA cable or its connection was the cause. You may exit this procedure, however keep an eye open for future failures. (Note: At times a poor connection or a marginal data cable may cause this error.)

f. If the problem still exists, run the Output SMART Status Results and verify the drive serial number.

g. If the serial number of the failing drive changed, continue with step h. If the serial number did not change, continue with step j. (Note: If the serial number changed, the failure stayed with the cable and port rather than following the drive, so the failure could be the DATA cable or drive controller.)

h. Relocate the DATA cable for the failing drive (remember to use the serial number) to another DATA port on the controller or motherboard that does appear to be working.

i. If the failure still exists then run the Output SMART Status Results and verify the drive serial number that failed. If it has not changed then the DATA cable is suspect. Goto step k.

j. If the problem remains with the same drive then the hard drive electronics are suspect and the drive can be considered defective, replace the drive.

k. If the problem still exists, this is not a common failure. Post your failure in the FreeNAS forums.

l. Exit this guide.
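
The cable swap in steps d through j is an isolation test: the serial number of whichever drive reports the error after the swap tells you which side is at fault. A minimal sketch of that reasoning in Python follows; the function name, the serial numbers, and the result labels are all made up for illustration.

```python
def isolate_fault(suspect_serial, failing_serial_after_swap):
    """Interpret the result of swapping DATA cables between two drives.

    failing_serial_after_swap is the serial number of the drive that still
    reports errors after the swap, or None if the errors stopped.
    """
    if failing_serial_after_swap is None:
        return "resolved"             # step e: likely a poor connection
    if failing_serial_after_swap == suspect_serial:
        return "drive"                # step j: drive electronics suspect
    return "cable_or_controller"      # steps h and i: cable or port suspect


print(isolate_fault("SERIAL-A", None))
print(isolate_fault("SERIAL-A", "SERIAL-A"))
print(isolate_fault("SERIAL-A", "SERIAL-B"))
```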

APPENDIX A

Example Error Messages

Hard Drive Failure Messages

Email Messages:

CRITICAL: Device: /dev/ada3, 1 Currently unreadable (pending) sectors

CRITICAL: Device: /dev/ada1, 817 Currently unreadable (pending) sectors

CRITICAL: Device: /dev/ada1, 2397 Offline uncorrectable sectors

SMART Results Output

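(Note: The failure indications in this output are the non-zero raw values of IDs 5, 197, and 198, and the "Completed: read failure" entry in the self-test log.)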

Code:

smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p28 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:  Western Digital Red
Device Model:  WDC WD20EFRX-68AX9N0
Serial Number:  WD-WMC300411000
LU WWN Device Id: 5 0014ee 6ad787ae3
Firmware Version: 80.00A80
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Device is:  In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:  Wed Jan 27 15:41:21 2016 EST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
  was never started.
  Auto Offline Data Collection: Disabled.
Self-test execution status:  (  0) The previous self-test routine completed
  without error or no self-test has ever
  been run.
Total time to complete Offline
data collection:  (27840) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  2) minutes.
Extended self-test routine
recommended polling time:  ( 281) minutes.
Conveyance self-test routine
recommended polling time:  (  5) minutes.
SCT capabilities:  (0x70bd) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x002f  200  200  051  Pre-fail  Always  -  31
  3 Spin_Up_Time  0x0027  176  174  021  Pre-fail  Always  -  4175
  4 Start_Stop_Count  0x0032  100  100  000  Old_age  Always  -  340
  5 Reallocated_Sector_Ct  0x0033  200  200  140  Pre-fail  Always  -  16
  7 Seek_Error_Rate  0x002e  200  200  000  Old_age  Always  -  0
  9 Power_On_Hours  0x0032  061  061  000  Old_age  Always  -  28532
10 Spin_Retry_Count  0x0032  100  100  000  Old_age  Always  -  0
11 Calibration_Retry_Count 0x0032  100  100  000  Old_age  Always  -  0
12 Power_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  148
192 Power-Off_Retract_Count 0x0032  200  200  000  Old_age  Always  -  61
193 Load_Cycle_Count  0x0032  200  200  000  Old_age  Always  -  278
194 Temperature_Celsius  0x0022  120  107  000  Old_age  Always  -  27
196 Reallocated_Event_Count 0x0032  200  200  000  Old_age  Always  -  0
197 Current_Pending_Sector  0x0032  200  200  000  Old_age  Always  -  42
198 Offline_Uncorrectable  0x0030  100  253  000  Old_age  Offline  -  42
199 UDMA_CRC_Error_Count  0x0032  200  200  000  Old_age  Always  -  0
200 Multi_Zone_Error_Rate  0x0008  200  200  000  Old_age  Offline  -  0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline  Completed without error  00%  28522  -
# 2  Short offline  Completed without error  00%  28498  -
# 3  Short offline  Completed without error  00%  28474  -
# 4  Extended offline  Completed: read failure  70%  28455  543988376
# 5  Short offline  Completed without error  00%  28426  -
# 6  Short offline  Completed without error  00%  28330  -
# 7  Extended offline  Completed without error  00%  28312  -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Hard Drive Communications Error Messages

(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 90 b2 b9 40 2e 00 00 01 00 00

(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error

(ada1:ahcich1:0:0:0): Retrying command
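
The text in parentheses at the start of each message identifies the drive, which is what you need for the commands in this guide. As a rough sketch (Python, with a hypothetical function name, and assuming the message layout shown above), the identifier and the CAM status can be extracted like this:

```python
import re

# Matches the "(device:channel:?:?:?): message" layout of the lines above.
CAM_LINE = re.compile(r"^\((\w+):(\w+):\d+:\d+:\d+\): (.*)$")


def parse_cam_line(line):
    """Split one console message into device, channel and message text."""
    match = CAM_LINE.match(line.strip())
    if not match:
        return None
    device, channel, message = match.groups()
    return {"device": device, "channel": channel, "message": message}


# The three example messages from this appendix.
log = [
    "(ada1:ahcich1:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 90 b2 b9 40 2e 00 00 01 00 00",
    "(ada1:ahcich1:0:0:0): CAM status: Uncorrectable parity/CRC error",
    "(ada1:ahcich1:0:0:0): Retrying command",
]

parsed = [parse_cam_line(line) for line in log]
devices = {entry["device"] for entry in parsed}
statuses = [e["message"] for e in parsed if e["message"].startswith("CAM status:")]
print(devices)
print(statuses)
```

Here the suspect drive is ada1, and the CRC error in the status line points toward the communications procedure in step 4.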

APPENDIX B

S.M.A.R.T. Data, What’s Important to Me?

When troubleshooting a hard drive failure we use the built-in SMART diagnostics that are part of every hard drive. These results can be used to justify an RMA as well. A SMART test will not isolate communication errors; it will only validate the physical hard drive. For more background, visit the Wikipedia article on S.M.A.R.T. at this link: https://en.wikipedia.org/wiki/S.M.A.R.T.

The important data we look at are as follows:

1) Serial Number
2) ID 5 Reallocated Sector Count
3) ID 197 Current Pending Sector Count
4) ID 198 Offline Uncorrectable Sector Count
5) ID 199 UDMA CRC Error Count

Other notable data are:

6) ID 194 Temperature
7) ID 200 MultiZone Error Rate
8) Extended Self-test Time (value in minutes)
9) SMART Self-test logs, specifically the results of the self tests

If IDs 5, 197, or 198 have any value greater than zero (0) then some defect has been identified in the media. If ID 194 Temperature is above 40°C then you may have a cooling issue, and this could shorten the life of your drive. Many manufacturers will not accept an RMA if the temperature of the drive has exceeded a certain value (manufacturer specific), as this voids the warranty.

ID 199 counts communications errors between the drive electronics and the drive controller. The drive controller is part of your motherboard or an add-on card. Typically this error is corrected by replacing the SATA cable. It wouldn't normally be a reason to RMA a drive, however it is possible that the hard drive electronics have failed or someone broke the SATA connector on the drive, but that is not the typical failure we see.

ID 200 MultiZone Error Rate can indicate the cause of a drive failure, although a non-zero value here doesn't always mean it is the fault. It is notable if there are no other failing indications.

Wear Level (ID# is manufacturer specific) applies to SSDs and indicates the percentage of write life remaining. In a typical SSD you have approximately 2000 erase/write cycles per memory block (4k block), however better SSDs are being manufactured that will last longer, and over-provisioning creates the illusion of longer life. If your wear level drops to zero then you will not be able to write to your SSD again and it may fail to operate at all.

The SMART Self-test logs indicate the last time you conducted a SMART test, the type of test, its completion status, how far it completed, and the hours of the results (Hours is a value in relation to the ID 9 Power On Hours value.)
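
To see how the LifeTime(hours) column relates to ID 9, subtract it from the current Power On Hours. The sketch below (Python, hypothetical function name) parses one row of the self-test log and works out how long ago the failed test in Appendix A ran. It assumes the column layout shown in Appendix A, with the LBA column holding either "-" or a number.

```python
import re

# "# N  <test type>  <status>  <remaining>%  <lifetime hours>  <LBA or ->"
LOG_ROW = re.compile(r"^#\s*(\d+)\s+(.+?)\s{2,}(.+?)\s+(\d+)%\s+(\d+)\s+(.*)$")


def parse_self_test_row(line):
    """Parse one row of the SMART self-test log into a dict, or None."""
    match = LOG_ROW.match(line.strip())
    if not match:
        return None
    number, test, status, remaining, hours, rest = match.groups()
    tokens = rest.split()
    last = tokens[-1] if tokens else "-"
    return {
        "number": int(number),
        "test": test,
        "status": status,
        "remaining_percent": int(remaining),
        "hours": int(hours),
        "lba": int(last) if last.isdigit() else None,
    }


power_on_hours = 28532  # ID 9 raw value in the Appendix A example
failed = parse_self_test_row(
    "# 4  Extended offline  Completed: read failure  70%  28455  543988376"
)
passed = parse_self_test_row(
    "# 1  Short offline  Completed without error  00%  28522  -"
)
hours_ago = power_on_hours - failed["hours"]
print(failed["status"], "at LBA", failed["lba"], "about", hours_ago, "hours ago")
```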

It is always a good thing to run a SMART Long test if you doubt the integrity of your drive.

What data is not important?
Much of the other data is manufacturer specific, so even if a value looks meaningful, odds are it is not important, since there are other well-known values that do maintain meaning. A great example is ID 1 - Raw Read Error Rate. This value represents errors in reading data, right? Yes, however how it is counted differs between manufacturers, and it's all due to design. Let me explain (this is likely not accurate, but it's an attempt to show that drives handle internal functions differently):

Drive "A" is told to read XX sectors of data and it returns those sectors + the next YY sectors of data in its internal buffer.
Drive "B" is told to read XX sectors of data and it returns those sectors + the next YY sectors of data in its internal buffer.
Drive "A" next finds out that the extra YY sectors worth of data isn't required and since it provided exactly what was requested of it, all is good in the world and it completed the operation.
Drive "B" next finds out that the extra YY sectors worth of data isn't required but this drive is programmed differently and it has more data to provide and this creates an internal issue as the drive now thinks it read the wrong data or too much data, something. The end result is the ID 1 value is incremented.

If ID 1 is a low, stable value then you could likely use it as a failure indicator, however if it's a high number that is constantly changing, you can just ignore it.

There are other values you can ignore and some values you should pay attention to however the most important values are listed above.

UPDATE (1 Nov 2020): How to read Seagate drive IDs 1 and 7. Note that even if these values turn out to be greater than zero after the conversion, they are still not important unless you have other key indicators of failure:

Written by sdx1

Reading SMART Data on Seagate Drives - sdx1.net


Seagate hard drives often report extremely high read and seek error rates in SMART data. After pronouncing many such drives defective, I did some research on the subject and found that Seagate drives use non-standard formatting for read and seek error rates, and every piece of software I've used reports these values incorrectly. While the easiest advice on this subject is to ignore these values in the absence of other drive issues, there is a way to read them.

The Process

The vast majority of programs that read SMART data rely on smartctl (Windows download), which we'll use to read these values.

Assuming that our disk is located at /dev/hda (which is typical for smartctl on Windows and older Linuxes), we can get the disk status using smartctl -A /dev/hda. The raw read error rate will be reported as something like 130917967 and the seek something like 13219996990. This is clearly wrong, because the drive hasn't failed or reported any SMART errors, and the normalized value still shows this attribute to be within acceptable margins.

This is because the read and seek error rates are actually recorded as 48-bit values (12 hexadecimal digits), where the first 16 bits (4 hexadecimal digits) represent the total number of read or seek errors and the last 32 bits (8 hex digits) represent the total number of reads or seeks attempted. We can get these values with the command smartctl -v 1,hex48 -v 7,hex48 -A /dev/hda.

Using this command, our read and seek error values become 0x000007cda64f and 0x000313f9253e respectively. The first four digits following the x (0000 and 0003) are the total number of errors. As you can see, read and seek errors are only 0 and 3 respectively for this particular drive. The total numbers of reads and seeks are 130917967 and 335095102 respectively, obtained by taking the last eight digits and converting from hexadecimal to decimal.
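
The same conversion can be done directly on the decimal raw value without rerunning smartctl. This is a small sketch in Python (the function name is made up) that splits the 48-bit value exactly as described above, using the two example values from this section:

```python
def decode_seagate_raw(raw_value):
    """Split a 48-bit raw value into (error count, operation count)."""
    errors = (raw_value >> 32) & 0xFFFF   # upper 16 bits: errors
    operations = raw_value & 0xFFFFFFFF   # lower 32 bits: reads or seeks
    return errors, operations


read_errors, reads = decode_seagate_raw(130917967)     # ID 1 raw value
seek_errors, seeks = decode_seagate_raw(13219996990)   # ID 7 raw value

print(f"read errors: {read_errors} out of {reads} reads")
print(f"seek errors: {seek_errors} out of {seeks} seeks")
```

The results match the hex48 output above: 0 read errors and 3 seek errors.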

APPENDIX C

Extra Testing - Drive Data Refreshing and Bad Blocks

Drive Data Refreshing
If you are seeing a few Pending Sector errors during a SMART test then you could try simply refreshing the data on your hard drive. I can't tell you that this will make your hard drive all better, but it should not hurt, provided you enter the commands properly. The procedure reads every sector of the hard drive and writes it back to the same location. It is important to know that it does not read just the data you have stored; it reads and rewrites the entire drive surface. This makes it a long process (many hours), and the time to complete depends on the size of the hard drive, how fast it reads and writes, and any other operations the drive is doing.

1) First we need to disable the GEOM safety check that blocks writes to a disk that is in use, otherwise you will get an error message stating the operation is not permitted. Type sysctl kern.geom.debugflags=0x10 and hit Enter.

2) Next we use the "dd" command to read the hard drive data and write it back, thus refreshing the data. Type dd if=/dev/ada0 of=/dev/ada0 bs=1m and hit Enter.

NOTE: This will take a while and depends entirely on the hard drive size and speed, not on the amount of data stored.

3) Once you are done, either reboot FreeNAS or type sysctl kern.geom.debugflags=0x00 and hit Enter to restore the write protection of the mounted drive. My personal preference is to reboot FreeNAS, as then you know that it has returned to normal operation.

BAD Blocks
While I personally would prefer to RMA my hard drive or install a new replacement, some people may want to run further testing, such as Bad Blocks, on the drive whose problem brought them to this guide in the first place. There is a nice thread here which documents quite a bit on how to run Bad Blocks for burn-in testing, but here are the instructions for just a single drive. I also recommend that the drive you are testing not be part of an active pool/vdev; remove it first. Drive "ada0" will be used in our example, with a Long Read Test failure at LBA 1144448. Also, read this entire section before running the test; there is nothing worse than destroying the wrong drive.

Because Bad Blocks generally takes several days to run (likely a full week) on a hard drive, I have broken the troubleshooting down into what I feel are reasonable steps in order to test the drive as quickly as possible. If you had a SMART Short or Long read test failure then we will test that section of the hard drive first, because if it keeps failing then the drive is not salvageable. If you did not have a SMART read failure and just want to test the entire drive, the procedure allows for that as well.

Record the failing LBA, then subtract at least 100,000 from it and add at least 100,000 to it. These become the starting and ending LBAs. You can use a larger value than 100,000, and it's not a bad idea to do so if you get zero errors during your first run. Once you are done with the troubled area you can run badblocks on the entire surface to ensure there are no other problem locations.

There is one assumption, and that is that you are running this testing on your FreeNAS system. You can instead place your hard drive in any other computer and boot FreeNAS from a USB flash drive, or use an Ubuntu Live CD or some other software; the instructions are basically the same.
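
The window arithmetic is simple, but it is easy to slip a digit when typing a destructive command, so here is a small sketch in Python using the example LBA from this section. The function names are invented for illustration. One caveat that is an addition to the original procedure and worth checking against the badblocks man page: badblocks counts its starting and ending positions in units of the -b block size, while SMART reports the failing LBA in logical sectors. The second function shows the conversion under the assumption of 512-byte logical sectors and a 4096-byte badblocks block size; check the "Sector Sizes" line of your own drive.

```python
def lba_window(failing_lba, margin=100_000):
    """Return (start, end) positions surrounding the failing LBA."""
    start = max(0, failing_lba - margin)  # never go below the first sector
    end = failing_lba + margin
    return start, end


def sectors_to_blocks(lba, sector_bytes=512, block_bytes=4096):
    """Convert a sector address to a block number of the given block size.

    Assumes 512-byte logical sectors and a 4096-byte block size by default.
    """
    return lba * sector_bytes // block_bytes


start, end = lba_window(1144448)
print("window in sectors:", start, "to", end)
print("failing LBA as a 4096-byte block number:", sectors_to_blocks(1144448))
```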

After all of this testing and fixing of your hard drive, if you have another failure several months down the road, you can rest assured that drive is having physical component failure and you can toss the drive into the recycle bin, after you take it apart and get those high strength magnets out and stick them to your refrigerator. They are painful to get back off!

WARNING: THIS IS A DESTRUCTIVE TEST, VERIFY YOUR DRIVE BY THE SERIAL NUMBER!

Setup:
Because this is destructive you should take precautions to prevent accidental damage to your good hard drives containing data. Power off your system and physically disconnect all your good hard drives, leaving only the suspect drive connected. Use the serial number to be sure. Now you can power on and your system should boot FreeNAS. Alternatively, you can boot an Ubuntu Live CD/ISO, open a terminal window, and go to step 3 below. If you are using Ubuntu Live, I will assume you know what you are doing and do not need step-by-step instructions, so the steps below are just a guide for you.

1) Open up an SSH window.

Note added by @wblock 2018-01-22: This section recommended enabling the kern.geom.debugflags sysctl. Many people still think it has something to do with allowing raw writes. It does not. Instead, it disables a safety system that is intended to prevent writes to disks that are in use (say, by having a mounted filesystem). From man 4 geom:

0x10 (allow foot shooting)
Allow writing to Rank 1 providers. This would, for example,
allow the super-user to overwrite the MBR on the root disk or
write random sectors elsewhere to a mounted disk. The
implications are obvious.

To summarize, this option should generally not be needed. It only makes it possible to harm data. Any disk you are going to overwrite with data should not be mounted or have anything you wish to keep. In fact, best practice is to not be erasing or stress-testing drives on a system that has actual data on it. Since those disks will not have mounted filesystems, this sysctl will not affect being able to write to them. In fact, it will only make it possible to blow away things that are in use.

2) First we need to allow you to perform RAW Write operations otherwise you will get an error message stating the operation is not permitted. Type sysctl kern.geom.debugflags=0x10 and hit Enter.

Now comes the destructive part...
3) If you did not have a SMART read failure or just want to run badblocks and walk away, go to step 8.

4) At the end of the command line is the ending LBA and then the starting LBA, in that order. Type badblocks -b 4096 -wsv -c 64 -p 10 /dev/ada0 1244448 1044448.
Note: The "-p 10" tells badblocks to keep repeating the scan until 10 consecutive passes find no new bad blocks; you can increase or decrease this. I chose a value of 10 for no real reason other than I want to be very sure the surface holds up. Also note that badblocks counts the ending and starting positions in units of the "-b" block size, so confirm they are in the same units as your failing LBA.

5) So let's say step 4 identifies a few more blocks and fixes (I use that term loosely) them. Next, widen the area by adjusting the ending and starting LBA numbers, for example to +/- 200,000. If you don't have any failures there, go to step 6, otherwise go back to step 4 and repeat until you have no errors.

6) Now we need to run another SMART Long test to see if there are any more offending sectors which can be quickly picked up. Type smartctl -t long /dev/ada0

7) If the SMART Long test fails the Extended Read test again, go back to step 4 using the new LBA, otherwise continue to step 8.

8) Now that we can get through an entire SMART Long Read test we are ready to run Bad Blocks on the entire hard drive surface. This testing will take considerable time to run, likely several days. Type badblocks -b 4096 -wsv -c 64 /dev/ada0.

Once you are able to get through the entire badblocks program you can perform step 9 or reboot the machine, I prefer to reboot.

9) If you are not running Ubuntu Live, type sysctl kern.geom.debugflags=0x00 and hit Enter.

Good Luck!


joeschmuck

I agree, it could use some updating and I'm more than happy letting someone else provide some good solid facts to update it.

Solnet-Array-Test should test those large hard drives. Here is the resource link.
https://www.truenas.com/community/resources/solnet-array-test.1/

Thank you for the kind words. I still want to update the guide to cover other situations but my hardware is still limited. If you have any constructive comments, please PM me and I'll take them into consideration.

I would like to update/expand coverage to include using camcontrol commands but my hardware configuration limits me on what I can do and I only try to publish items which I can test myself. I don't like giving out second hand advice.
