HDD stops working

biigniick · Jun 6, 2013

Hi all,
I have had a FreeNAS server running at my house for about a year. Ever since I updated to the new version and the ZFS format, I have been having intermittent trouble with the HDD connection or something similar. It only happens under a heavy load lasting 30 minutes to an hour. The SMART light turns yellow on the web interface and says the drive stopped responding. If I click shutdown, the server never fully shuts down. A restart does not fix the problem. A hard reset and an hour of rest seems to help.

I'd like some help troubleshooting or thinking of why this is happening.

Could this be a thermal issue? Or something else?

Thanks,
- nick

cyberjock · Jun 6, 2013

That sounds like a disk is failing.

You can try viewing your disk's SMART data with the command smartctl -a /dev/adaXX, substituting your drive's device name for adaXX.

Additionally you can do a zpool status to see if ZFS has found any corruption.
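As a rough illustration of what to look for in that output, here is a minimal Python sketch that scans the device table of zpool status for any non-zero READ, WRITE, or CKSUM counter. The function name and the sample text are hypothetical, and the sketch assumes the table starts at the NAME/STATE header row and ends at the next blank line.

```python
def devices_with_errors(status_text):
    """Return {device: (read, write, cksum)} for rows with a non-zero counter.

    Counters are kept as strings because zpool may abbreviate large values.
    """
    bad = {}
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True          # header row: device rows follow
            continue
        if in_table and not fields:
            in_table = False         # blank line ends the device table
            continue
        if in_table and len(fields) >= 5:
            counts = tuple(fields[2:5])
            if any(c != "0" for c in counts):
                bad[fields[0]] = counts
    return bad


# Hypothetical sample: one healthy device and one with checksum errors.
sample = """\
config:

	NAME        STATE     READ WRITE CKSUM
	tank        ONLINE       0     0     0
	  ada0      ONLINE       0     0     0
	  ada1      ONLINE       0     0     3

errors: No known data errors
"""
print(devices_with_errors(sample))
```

An empty result means every counter in the table is zero, which is what a healthy pool should show.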

You can also run a long SMART test on each drive with the command smartctl -t long /dev/adaXX and then review the drive's error log after the test finishes (usually 2-5 hours; the output of the command will tell you how many minutes the test should take). Do not run the long SMART test at the same time as high disk activity or a scrub. The error log can be viewed with the command smartctl -a /dev/adaXX.

If you don't have a backup I'd also recommend you make a backup of the most important data on your drive.

This could be a thermal issue if the hard drives are overheating. Since you asked if it could be heat related I assume that you have reason to think that heat might be an issue. If you run the first command I provided above it will tell you the temperature in degrees C. For example, mine says my disk is 30C...

Code:

194 Temperature_Celsius     0x0022   122   103   000    Old_age   Always       -       30

Ideally hard drives should be kept below 40C, and should always be below 45C. I've seen Seagate hard drives that get above 45C start having errors with reading and writing that go away when the hard drive is cooler. I could run a long test and if the drive was over 45C it would give tons of errors. Cool the drive below 45C and the drive works just fine.
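To make the 40C and 45C guidance concrete, here is a minimal Python sketch that pulls the raw temperature out of an attribute row like the one above and compares it with those two limits. The function names and the "ok" / "warm" / "too hot" labels are made up for illustration, and the sketch assumes the raw value begins at the tenth whitespace-separated column, as it does in the sample row.

```python
# Limits taken from the advice above: ideally below 40C, always below 45C.
IDEAL_LIMIT_C = 40
HARD_LIMIT_C = 45


def parse_temperature(row):
    """Return the raw temperature in degrees C from a smartctl attribute row."""
    fields = row.split()
    # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW...
    return int(fields[9])


def classify(temp_c):
    """Label a temperature against the two limits (labels are hypothetical)."""
    if temp_c < IDEAL_LIMIT_C:
        return "ok"
    if temp_c < HARD_LIMIT_C:
        return "warm"
    return "too hot"


row = "194 Temperature_Celsius     0x0022   122   103   000    Old_age   Always       -       30"
temp = parse_temperature(row)
print(temp, classify(temp))
```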

If you have any more questions or comments, it would also help to post your system hardware, FreeNAS version, and configuration, along with any relevant background on how this issue started or when it comes up.

biigniick · Jun 6, 2013

ok, thanks for the reply. pardon my noobness. . . here is the first command

Code:

[root@freenas] ~# smartctl -a /dev/ada0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda LP
Device Model:     ST31500541AS
Serial Number:    6XW0JC3G
LU WWN Device Id: 5 000c50 01b76bdb9
Firmware Version: CC94
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Thu Jun  6 12:14:54 2013 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection: 		(  684) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   1) minutes.
Extended self-test routine
recommended polling time: 	 ( 390) minutes.
Conveyance self-test routine
recommended polling time: 	 (   2) minutes.
SCT capabilities: 	       (0x103f)	SCT Status supported.
					SCT Error Recovery Control supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   116   099   006    Pre-fail  Always       -       101971942
  3 Spin_Up_Time            0x0003   100   100   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2818
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   064   060   030    Pre-fail  Always       -       3100557
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       8720
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       63
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       446683414633
189 High_Fly_Writes         0x003a   099   099   000    Old_age   Always       -       1
190 Airflow_Temperature_Cel 0x0022   062   041   045    Old_age   Always   In_the_past 38 (0 11 38 36 0)
194 Temperature_Celsius     0x0022   038   059   000    Old_age   Always       -       38 (0 19 0 0 0)
195 Hardware_ECC_Recovered  0x001a   052   032   000    Old_age   Always       -       101971942
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       220980362355274
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       910494070
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       109251125

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      8582         -
# 2  Extended offline    Completed without error       00%      8249         -
# 3  Extended offline    Completed without error       00%      7897         -
# 4  Extended offline    Completed without error       00%      7573         -
# 5  Extended offline    Completed without error       00%      7165         -
# 6  Extended offline    Completed without error       00%      6938         -
# 7  Extended offline    Completed without error       00%      6602         -
# 8  Extended offline    Completed without error       00%      6266         -
# 9  Extended offline    Completed without error       00%      5859         -
#10  Extended offline    Completed without error       00%      5522         -
#11  Extended offline    Completed without error       00%      5114         -
#12  Extended offline    Completed without error       00%      4394         -
#13  Extended offline    Completed without error       00%      4057         -
#14  Extended offline    Completed without error       00%      3699         -
#15  Extended offline    Completed without error       00%      3421         -
#16  Extended offline    Completed without error       00%      3037         -
#17  Extended offline    Completed without error       00%      2701         -
#18  Extended offline    Completed without error       00%      2293         -
#19  Extended offline    Completed without error       00%      1957         -
#20  Extended offline    Completed without error       00%      1549         -
#21  Extended offline    Completed without error       00%      1213         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

that seems to have no errors. . . and running at 38C
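One column that is easy to miss in a table like the one above is WHEN_FAILED. smartctl prints "-" there unless an attribute's normalized value is, or has at some point been, at or below its threshold. The following Python sketch lists any rows where that column is not "-". The function name is hypothetical, and the sample rows are copied from the output above.

```python
def flagged_attributes(table):
    """Return (name, when_failed) for rows whose WHEN_FAILED column is not '-'."""
    flagged = []
    for line in table.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit():
            if fields[8] != "-":
                flagged.append((fields[1], fields[8]))
    return flagged


sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
190 Airflow_Temperature_Cel 0x0022   062   041   045    Old_age   Always   In_the_past 38 (0 11 38 36 0)
194 Temperature_Celsius     0x0022   038   059   000    Old_age   Always       -       38 (0 19 0 0 0)
"""
print(flagged_attributes(sample))
```

In this output the only flagged row is Airflow_Temperature_Cel, whose WORST value (041) is below its THRESH (045), so the drive appears to have crossed its temperature threshold at some earlier point even though it reads 38C now.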

here is the zpool status

Code:

[root@freenas] ~# zpool status
  pool: FreeNAS
 state: ONLINE
  scan: scrub repaired 0 in 0h58m with 0 errors on Sun May 19 00:58:15 2013
config:

	NAME                                          STATE     READ WRITE CKSUM
	FreeNAS                                       ONLINE       0     0     0
	  gptid/46d897cf-a20a-11e1-a67c-0013d40fc6f0  ONLINE       0     0     0

errors: No known data errors

also looks good. i'll run the long check today. it says it will be done tonight.

Code:

[root@freenas] ~# smartctl -t long /dev/ada0
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p6 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 390 minutes for test to complete.
Test will complete after Thu Jun  6 18:49:04 2013

Use smartctl -X to abort test.

thanks,
- nick

cyberjock · Jun 6, 2013

SMART data looks good. I'm expecting the long test will be fine.

paleoN · Jun 6, 2013

Code:

188 Command_Timeout         0x0032   100   090   000    Old_age   Always       -       446683414633

cyberjock · Jun 6, 2013

It's a Seagate. By virtue of the brand I ignore all numbers that are more than 9 digits. They have a weird system for those large values, based on the bits and their location. In my experience, if a command times out because of a hardware or firmware issue, you'll have an extended log entry with some error bits and such.

I do know that the entire Seagate line of hard drives suffers from some kind of command reset (or similar), making them poorly suited for hardware RAIDs. They do work fine in a ZFS array though.

paleoN · Jun 7, 2013

cyberjock said:
It's a Seagate. By virtue of the brand I ignore all numbers that are more than 9 digits. They have a weird system for those large values, based on the bits and their location.

It may very well be a quirk of that firmware/model combination. It appeared to operate normally in earlier firmware. A value of 105 would also be enough to cause problems.
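For anyone wondering where the 105 comes from: it falls out if the raw Command_Timeout value is read as three 16-bit counters packed into one number. That layout is a commonly reported interpretation for Seagate drives and is an assumption here, not something the smartctl output states. A small Python sketch of the arithmetic (the function name is made up):

```python
def split_words(raw):
    """Split a 48-bit raw SMART value into three 16-bit words, high to low."""
    return ((raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF)


# Raw Command_Timeout value from the smartctl output in this thread.
print(split_words(446683414633))  # (104, 104, 105)
```

Checking by hand: 104 * 2^32 + 104 * 2^16 + 105 = 446683414633, so the low word is 105.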

cyberjock said:
In my experience, if a command times out because of a hardware or firmware issue, you'll have an extended log entry with some error bits and such.

Sometimes.

biigniick said:
Ever since I updated to the new version and the ZFS format, I have been having intermittent trouble with the HDD connection or something similar. It only happens under a heavy load lasting 30 minutes to an hour.

Where is this drive and how is it connected?

biigniick · Jun 8, 2013

i had the failure happen again today. here is the message.

Code:

SMART error (FailedOpenDevice) detected on host: freenas.local
The following warning/error was logged by the smartd daemon:

Device: /dev/ada0, unable to open device

For details see host's SYSLOG.

You can also use the smartctl utility for further investigation. 
No additional email messages about this problem will be sent.

the server was under an extended heavy load. still not sure why this is happening. . .

thanks for your help,
- nick

cyberjock · Jun 8, 2013

Here's my last experience with Seagate drives (3 years ago):

I had a 12-drive RAID6 and drives would randomly drop out of the array during periods of heavy loading. There was a 20+ page forum post discussing the issue, and the solution was to leave Seagates behind. I was a little more than pissed because the 12 drives I had were pretty much brand new but I couldn't return them. The array was so unreliable I was scared to even stream my DVD collection off of it. The server sat off for over a month until I had another $2500 to spend on more hard drives, and I went with the WD Greens (what I have now). Never had any of those same issues, even with the exact same hardware. Those drives then sat in a box and were later sold for about $30 apiece to a friend (I paid something like $150 apiece and used them for less than 120 days).

My advice based on my experience is to bail on that drive and get something else. I used and recommended Seagates between 2000 and 2008. After that rather expensive mistake I'll never trust them again. I'm all WD and very happy with the drives I have. Lucky for me I'm all Intel SSD now except for my server so I don't even have to worry about failures on my desktops/laptop.

I'm really not sure what happened to Seagate, but when a hard drive manufacturer has a damn Wikipedia page devoted to listing their issues, that's a bad sign! I have 2 friends who have still bought them despite my experience and advice. Both have had to RMA more than 50% of their drives in the first year.

bernardc · Jun 13, 2013

I have about 20 of the Seagate Barracuda 7200.14 ST3000DM001 3TB and have not yet had a serious failure. (Non-serious failures include one that keeps reporting 8 bad blocks, which amounts to 16 kB of data, on the SMART test.) The 12 in the raidz2 pool scrub without error every week and SMART test without a peep every hour.

I did have a near-death experience with cheap external power connectors and/or SAS fanout cables that I used in place of a hot-swap backplane. Transplanting everything into a Supermicro chassis (which has a backplane) saved the pool; thank you protosd for guiding the way on that. So check your external affairs.

Apollo · Jun 13, 2013

Hi Biigniick,

What drive controller do you have?
I have a Highpoint 4320 controller with six 1.5TB drives in RAID 6, and currently the case is wide open. On a hot day under heavy drive load, I would lose the entire array and the system would become unresponsive (no loss of data, I just have to power down and power up again). Adding airflow to the hpt4320 has solved the issue. The same is true of the Seagate drives in that machine. With reduced or improper airflow, any one drive running in the mid to high 40s (degrees C) would start dropping out, with bad sectors appearing.

I have newly set up FreeNAS on an AMD Athlon 2 processor with 2 GB RAM and ZFS on six 1.5TB drives in RAIDZ2 (I know, I am shy of a few KB...), and I experience network connection loss while doing bidirectional access to the NAS. I am currently exploring an upgrade to a better system, but until then it will have to do. In this case, FreeNAS is not accessible over the network (loss of connection with rsync, Samba...). FreeNAS still works locally, and zpool status shows all the drives to be OK. Rebooting FreeNAS is enough to bring the network back online.

I suspect you may be experiencing any of these 3 issues.

Regards.

cyberjock · Jun 14, 2013

Apollo-

Welcome to the forums. I'm going to point out a few things but if you want to discuss them further you should start a new thread.

1. Hard drives should be kept below 40C at all times for reliability and longevity reasons. This has been mentioned many times in the forum and there is some good info on the topic to back up the 40C recommendation.
2. Using a RAID6 (hardware RAID) with ZFS is not a good choice. The concerns have been mentioned in the FreeNAS manual and in the forums dozens of times. Quite a few people have lost their data because they made the poor choice of mixing hardware RAID with ZFS.
3. 2GB of RAM is far less than you should be using for ZFS. See section 1.4.2 for the recommendations (hint: you should have 6GB of RAM minimum before you even create a ZFS pool). Having less than 6GB with ZFS can cause performance issues and stability issues, and has cost quite a few people all of their data because ZFS becomes unstable. Not surprisingly, you are having issues.

Apollo · Jun 15, 2013

Cyberjock,

I wanted to be more generic than specific and give Biigniick some options to help him identify the cause of the failures.

1. My hpt4320 RAID6 array is a second NTFS array on my Windows 7 machine, and it is the one that drops off if airflow to the controller is not enough.
2. The FreeNAS RAIDZ2 is on a Highpoint 2320 with the drives set as JBODs. I read and understood the reasons behind that before setting it up.
3. The FreeNAS computer is not adequate, I know, but it works as a temporary solution (considering the possible pitfalls) until I decide on the proper hardware. Currently a Xeon 12xx V3 is my choice, but as some people have stated, FreeNAS may not be ready for it, nor for the new motherboard chipsets and peripherals yet.
