FreeNAS Dropping Harddrives?

Geoffrey Lee · Oct 20, 2012

Hi,

I've been using FreeNAS for about a year and a half, with no problems up until now.

Right now the NAS is simply dropping hard drives randomly. I thought it was a bad hard drive that was getting dropped from the array, because it was consistently the same one each time. So I went out and got a new hard drive to replace it, and after I got it plugged in and tested it, I noticed that a different drive had been dropped from the array (I hadn't actually replaced the old one yet). So now I'm not sure if I've got a bad drive or not. One thing I noticed that was kinda funny is that if a drive hasn't been dropped and I run a SMART test on it using smartctl, the test comes back clean, but the output also says "Read SMART Log Directory failed". Is that a cause for worry?

The only other thing I can think of is that I switched the motherboard and CPU in the NAS a while ago, but I cannot remember if the disk dropping started before or after that =/

Right now I've got 3x Seagate Barracuda 1.5TB's in RAID-Z, and my replacement disk is of the same model.

Using an ASUS M4A79XTD-EVO, Phenom X2 555, 8GB of RAM.

Anyways, where should I start looking next? Do you think that the switched motherboard could be an issue? Or should I still be looking at the HDD's?

cyberjock · Oct 20, 2012

From personal experience, Seagate had some serious firmware issues with their hard disks. I built a small hardware RAID with Seagates in 2009 and within 90 days I had 3 drives that would randomly drop out. Over the next 3 months over 50% of the drives had been dropped at some point randomly. Thank goodness I didn't lose 3 drives at the same time, but I did have 2 at several times. Sometimes a drive would drop when the system was idle and no other computers were on in the house, sometimes when very busy. There's a very long thread on the Seagate forums talking about the problem and Seagate's only answer is that the drives are not designed for any type of RAID except "Desktop RAID". They later clarified that "Desktop RAID" means 2 drives in a RAID-1 on Windows.

There was some uproar because there is no clear definition for "Desktop RAID" for the industry and that is misleading.

I'm not saying your problem is the same as mine, but what you describe is exactly what I was experiencing. The only solution was to get rid of the Seagates. I had tried a different RAID controller, but the issue persisted. I switched the array disks out for WD as each one dropped from the array (it only took 6 weeks to see them all drop out), and now they are all WD and I have never had the problem since.

Prior to this all of my non-enterprise RAIDs have been Seagates since 2003 with never a single problem. I will never buy Seagate drives for any type of RAID ever again until I have personally seen a system with Seagate drives without problems. I realize this means that Seagate is pretty much no-go forever since I don't expect them to mail me $2k worth of free hard drives to win me back. But considering I had a box of hard drives I couldn't use, trust, or sell that is what it will take. I know my response is a little extreme, but since I had spent $1900 on hard drives to have them randomly stop working forcing me to spend another $1700 on hard drives within 6 months was completely unacceptable and something I will not allow to be repeated. I don't have that kind of money to be throwing around for hard drives a company won't stand behind. I still have all of those hard drives in a box. I still refuse to use them because I feel I can't trust them despite them passing every test I could run on them.

The REALLY crappy thing was that the RAID controller was listed as verified compatible with the hard drives and the exact firmware versions. The hard drives listed the RAID controller as compatible with the exact firmware versions. Neither company wanted to accept responsibility. But Highpoint(my RAID manufacturer) agreed to let me change out the controller for free because of my situation. Of course, the issue did not go away.

Fool me once, shame on me.

Geoffrey Lee · Oct 20, 2012

Didn't know Seagate was having RAID issues. So they're pretty much no good for RAID-Z? That's unfortunate, everything was working fine until now =/

I'm not sure if I want to swap out all of my drives for WD though. The weird thing is that they'll be re-detected if I restart my NAS, and everything will be fine. I might try using different SATA ports on the motherboard, maybe it might just be one of them that's funny? I haven't monitored my NAS too closely though, because I just use it for media storage and back-ups, so I don't know if the drive dropping happens at a specific system event, or if it's just totally random. I have found that they'll sometimes drop when I'm running SMART tests on them though. I'll keep messing with it and maybe figure out exactly what's going on.

Thanks for the input!

cyberjock · Oct 20, 2012

I'd say do a scrub after they are dropped. I'm betting the scrub will find errors.

My drives wouldn't be redetected until I rebooted either. Of course, once they were detected the RAID controller would immediately start doing a parity check of the entire array, and sometimes I'd lose a second drive before the parity check completed. It got really scary when I lost 2 drives and a parity check was running to recover one of them. Doing a recovery of TBs of data isn't my idea of fun.

paleoN · Oct 21, 2012

Geoffrey Lee said:
That's unfortunate, everything was working fine until now =/

Given that you recently switched the motherboard and CPU this might suggest it had something to do with that change.

Geoffrey Lee said:
I might try using different SATA ports on the motherboard, maybe it might just be one of them that's funny?

It would be more than one as multiple drives are dropping.

Geoffrey Lee said:
I have found that they'll sometimes drop when I'm running SMART tests on them though.

This is odd. What command are you using to run the tests? Also, the SMART output for each drive:

Code:

smartctl -q noserial -a /dev/adaX

Despite noobsauce80's valid bitterness regarding Seagate, if you truly didn't have problems before I don't think it's likely the drives themselves are at fault.

Geoffrey Lee · Oct 22, 2012

Thanks for the reply.

I've been using smartctl to test the drives, and I'll post the results when I get back. I ran a scrub last night, and haven't checked the results of that either so I'll post it up as well.

And as much as I want to point to the new motherboard causing the problem, I can't remember if I had this issue before or after I swapped hardware. I'm usually pretty careful with this kind of stuff, but I'm pretty sure I had just overlooked the error. I'm just glad the data isn't super critical to me, so if something goes south it isn't the end of the world (Crosses fingers)

Geoffrey Lee · Oct 22, 2012

Okay, so here's the zpools status and the smartctl results from each drive

Code:

[root@freenas] ~# zpool status
  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub completed after 4h7m with 0 errors on Sun Oct 21 12:14:34 2012
config:

        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            gpt/ada0  ONLINE       0     0     0
            gpt/ada1  ONLINE      24 41.0K     0
            gpt/ada2  ONLINE       0     0     0

errors: No known data errors
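
The pool state above still reads ONLINE, so the per-device counters are easy to skim past. As a rough sketch (the helper name `zpool_error_rows` is made up for illustration, and the sample is copied from the output above), this filters a `zpool status` listing down to the rows whose READ, WRITE or CKSUM counters are not zero:

```shell
# Print every row of the `zpool status` config table whose READ, WRITE or
# CKSUM column is not a plain 0. Counters can carry a suffix such as 41.0K,
# so they are compared as strings rather than as numbers.
zpool_error_rows() {
    awk '
        $1 == "NAME" && $2 == "STATE" { in_config = 1; next }   # header of the config table
        in_config && NF == 0          { in_config = 0 }          # blank line ends the table
        in_config && NF >= 5 && ($3 != "0" || $4 != "0" || $5 != "0") {
            print $1, "read=" $3, "write=" $4, "cksum=" $5
        }
    '
}

# Sample taken from the status output above. On a live box this would be:
#   zpool status data | zpool_error_rows
zpool_error_rows <<'EOF'
        NAME          STATE     READ WRITE CKSUM
        data          ONLINE       0     0     0
          raidz1      ONLINE       0     0     0
            gpt/ada0  ONLINE       0     0     0
            gpt/ada1  ONLINE      24 41.0K     0
            gpt/ada2  ONLINE       0     0     0

errors: No known data errors
EOF
# prints: gpt/ada1 read=24 write=41.0K cksum=0
```

The only row it prints here is gpt/ada1, with 24 read and 41.0K write errors, even though ada0 is the drive reported as dropping most often.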

Code:

[root@freenas] ~# smartctl -a /dev/ada0
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (Adv. Format)
Device Model:     ST1500DL003-9VT16L
Serial Number:    5YD4NWEL
LU WWN Device Id: 5 000c50 03946dfd2
Firmware Version: CC4A
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 22 15:16:46 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  633) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       2234720
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       180
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       10697098
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4717
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       179
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       4295032837
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   061   046   045    Old_age   Always       -       39 (Min/Max 36/43)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       113
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       180
194 Temperature_Celsius     0x0022   039   054   000    Old_age   Always       -       39 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   037   009   000    Old_age   Always       -       2234720
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       130390912143986
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3648190973
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       3200778660

(pass0:ahcich2:0:0:0): SMART. ACB: b0 d5 00 4f c2 40 00 00 00 00 01 00
(pass0:ahcich2:0:0:0): CAM status: Command timeout
Read SMART Log Directory failed.

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4670         -
# 2  Short offline       Completed without error       00%      4670         -
# 3  Short offline       Completed without error       00%      4670         -
# 4  Short offline       Completed without error       00%      4670         -

(pass0:ahcich2:0:0:0): SMART. ACB: b0 d5 09 4f c2 40 00 00 00 00 01 00
(pass0:ahcich2:0:0:0): CAM status: Command timeout
Error SMART Read Selective Self-Test Log failed: Unknown error: 0
Smartctl: SMART Selective Self Test Log Read Failed

Code:

[root@freenas] ~# smartctl -a /dev/ada1
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (Adv. Format)
Device Model:     ST1500DL003-9VT16L
Serial Number:    5YD4J6BA
LU WWN Device Id: 5 000c50 0393f2ebc
Firmware Version: CC4A
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 22 15:21:34 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  643) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   119   099   006    Pre-fail  Always       -       216316744
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       180
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   069   060   030    Pre-fail  Always       -       10358631
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4718
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       179
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       7
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   063   050   045    Old_age   Always       -       37 (Min/Max 33/39)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       112
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       180
194 Temperature_Celsius     0x0022   037   050   000    Old_age   Always       -       37 (0 18 0 0 0)
195 Hardware_ECC_Recovered  0x001a   032   006   000    Old_age   Always       -       216316744
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       23420456669810
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       63121710
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       2783892279

(pass1:ahcich3:0:0:0): SMART. ACB: b0 d5 00 4f c2 40 00 00 00 00 01 00
(pass1:ahcich3:0:0:0): CAM status: Command timeout
Read SMART Log Directory failed.

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4718         -
# 2  Short offline       Interrupted (host reset)      40%      4670         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Code:

[root@freenas] ~# smartctl -a /dev/ada2
smartctl 5.42 2011-10-20 r3458 [FreeBSD 8.2-RELEASE-p9 amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda Green (Adv. Format)
Device Model:     ST1500DL003-9VT16L
Serial Number:    5YD4GLJS
LU WWN Device Id: 5 000c50 03916f884
Firmware Version: CC4A
User Capacity:    1,500,301,910,016 bytes [1.50 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  ATA-8-ACS revision 4
Local Time is:    Mon Oct 22 15:20:34 2012 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (  633) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x30b7) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   111   099   006    Pre-fail  Always       -       33318472
  3 Spin_Up_Time            0x0003   093   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       173
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   030    Pre-fail  Always       -       10609995
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       4716
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       172
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   098   000    Old_age   Always       -       11
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   050   045    Old_age   Always       -       35 (Min/Max 32/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       110
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       173
194 Temperature_Celsius     0x0022   035   050   000    Old_age   Always       -       35 (0 17 0 0 0)
195 Hardware_ECC_Recovered  0x001a   025   006   000    Old_age   Always       -       33318472
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       245947007242864
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       3632879875
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       987876413

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4669         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

For reference, it's /dev/ada0 that usually drops out on me. If you guys have any insight it would be appreciated!
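
One number in the ada0 dump stands out: attribute 188 Command_Timeout has a raw value of 4295032837, against 7 and 11 on the other two drives. On Seagate drives that raw field is often read as three packed 16-bit counters rather than one big number. That layout is an assumption here, not something confirmed in this thread, and what each word counts is vendor-specific. A minimal sketch of the split (the function name `split_raw16` is made up for illustration):

```shell
# Split a 48-bit SMART raw value into three 16-bit words: high, middle, low.
# Assumption: the drive packs attribute 188 this way.
split_raw16() {
    raw=$1
    high=$(( (raw >> 32) & 0xFFFF ))
    mid=$((  (raw >> 16) & 0xFFFF ))
    low=$((   raw        & 0xFFFF ))
    echo "$high $mid $low"
}

split_raw16 4295032837   # ada0, prints: 1 1 5
split_raw16 7            # ada1, prints: 0 0 7
split_raw16 11           # ada2, prints: 0 0 11
```

If that reading is right, the ada0 counters are in the same single-digit range as the other two drives, not in the billions.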

paleoN · Oct 23, 2012

Geoffrey Lee said:
And as much as I want to point to the new motherboard causing the problem, I can't remember if I had this issue before or after I swapped hardware. I'm usually pretty careful with this kind of stuff, but I'm pretty sure I had just overlooked the error.

That seems a completely random decision to me, but OK?

What were the smartctl commands you were using that were causing them to drop when running the SMART tests? What else was the array doing at that time?

Geoffrey Lee · Oct 23, 2012

This is sort of strange. I posted a full dump of the SMART tests last night, but the post hasn't shown up yet. I'll try again later.

I was using

Code:

smartctl -t short /dev/ada0
smartctl -a /dev/ada0

And after I checked the SMART drive output, I would get an error that the SMART test log cannot be read/found, and the drive would drop out a short time later.

This happened a few times, but I haven't been able to repeat the drive dropping reliably.
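
One thing that might be worth ruling out: the smartctl dumps list a recommended polling time of 1 minute for the short self-test, and the ada1 self-test log has an entry marked "Interrupted (host reset)". A hedged sketch of starting the test, waiting out the drive's own estimate, and then reading only the self-test log instead of the full -a dump. The function names are made up for illustration, and there is no evidence here that polling too early is what triggers the drop; this only removes one variable.

```shell
# Extract the short self-test polling time (in minutes) from smartctl
# capabilities output supplied on stdin.
short_test_minutes() {
    awk '
        /^Short self-test routine/ { want = 1; next }
        want && /recommended polling time/ {
            gsub(/[^0-9]/, "", $0)    # keep only the digits, e.g. "(   1) minutes." -> "1"
            print; exit
        }
    '
}

# Start a short test, wait the recommended time plus one minute, then read
# just the self-test log. Falls back to 2 minutes if no time could be parsed.
run_short_test() {
    dev=$1
    mins=$(smartctl -c "$dev" | short_test_minutes)
    smartctl -t short "$dev"
    sleep $(( (${mins:-2} + 1) * 60 ))
    smartctl -l selftest "$dev"
}
```

It would be run as `run_short_test /dev/ada0`. Nothing in the block touches a drive until that function is called.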

cyberjock · Oct 23, 2012

I couldn't get my drives to drop reliably either. I was really hoping to find a test that would cause a drive to drop from the array in my case because then it made troubleshooting possible. Up until I started abandoning my Seagate drives I wasn't convinced that my (then beloved) Seagate would bend me over like it was. I also wasn't sure if replacing the disks with a different brand would fix the problem. I wasn't too hip on the idea of spending $1.5k+ just to find out if it were the drives. Instead I could never trigger it which was only more frustrating.

I was really looking for solid irrefutable proof that it was the drives before I started spending big money again, but according to the Seagate thread there was no way to prove it. :(

Please keep at it though. Just because my situation is very similar to yours does not mean we are the same. Especially since my fix would involve spending more money. ;)

Geoffrey Lee · Nov 1, 2012

Hrm. So I watched my NAS pretty carefully over the last week or so, and I haven't seen a drive drop again yet, and it seems like they're more reliable after a scrub (does this even make sense?)

I'll keep checking it periodically, but it looks like my data isn't in any serious jeopardy (fingers crossed). Hopefully these Seagate drives will be good for another while.

Thanks for all the input!

jlpellet · Nov 6, 2012

I've started noting a similar problem. About every 30 minutes, I get the following logged:

Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada4, 162 Currently unreadable (pending) sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada4, 162 Offline uncorrectable sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada5, 36 Currently unreadable (pending) sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada5, 36 Offline uncorrectable sectors

A scrub shows no errors, and smartctl -t short also shows no error for ada4 but a read error on ada5.

I think I'm going to replace the drive on ada5 and use Seagate's tool to test on another drive. All of the drives are Seagate 1.5TB LP's.

Ideas welcome.
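
For tracking whether those counts move between smartd runs, a small sketch that boils the log lines down to the latest pending-sector count per device. The function name `pending_by_device` is made up for illustration, and it assumes smartd messages keep the wording quoted above.

```shell
# Summarise smartd "pending sectors" messages: latest count seen per device.
# Reads syslog-style lines on stdin.
pending_by_device() {
    awk '
        /smartd/ && /Currently unreadable \(pending\) sectors/ {
            for (i = 1; i <= NF; i++) {
                if ($i == "Device:") {
                    dev = $(i + 1); sub(/,$/, "", dev)   # "/dev/ada4," -> "/dev/ada4"
                    count[dev] = $(i + 2)                 # number right after the device
                }
            }
        }
        END { for (d in count) print d, count[d] }
    ' | sort
}

# Sample lines copied from the log above.
pending_by_device <<'EOF'
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada4, 162 Currently unreadable (pending) sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada4, 162 Offline uncorrectable sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada5, 36 Currently unreadable (pending) sectors
Nov 6 10:07:05 freenas smartd[3403]: Device: /dev/ada5, 36 Offline uncorrectable sectors
EOF
```

On a live system the input would be the system log, for example `pending_by_device < /var/log/messages`, assuming that is where smartd writes on the box in question.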

bollar · Nov 11, 2012

Just to add another datapoint, I have had two old drives drop offline since I installed the system last week. Both were 2TB Seagate Barracuda LP / ST32000542AS and had been in a Netgear ReadyNAS for years before this assignment. They are both approaching end of warranty -- the one that failed today has a warranty expiration of March 2013. Given they're so old, I wouldn't have thought twice about it, if I hadn't seen this thread.

Here's what the log shows:

Code:

Nov 11 05:39:25 freenas kernel: mps0: mpssas_scsiio_timeout checking sc 0xffffff80005b6000 cm 0xffffff80005ffb18
Nov 11 05:39:25 freenas kernel: (da4:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 length 0 SMID 683 command timeout cm 0xffffff80005ffb18 ccb 0xffffff001ee40800
Nov 11 05:39:25 freenas kernel: mps0: mpssas_alloc_tm freezing simq
Nov 11 05:39:25 freenas kernel: mps0: timedout cm 0xffffff80005ffb18 allocated tm 0xffffff80005c9148
Nov 11 05:39:25 freenas kernel: mps0: mpssas_scsiio_timeout checking sc 0xffffff80005b6000 cm 0xffffff8000608be0
Nov 11 05:39:25 freenas kernel: (da4:mps0:0:4:0): WRITE(10). CDB: 2a 0 54 da f2 58 0 0 2b 0 length 22016 SMID 796 command timeout cm 0xffffff8000608be0 ccb 0xffffff001ee5e800
Nov 11 05:39:25 freenas kernel: mps0: queued timedout cm 0xffffff8000608be0 for processing by tm 0xffffff80005c9148
Nov 11 05:39:29 freenas kernel: (da4:mps0:0:4:0): WRITE(10). CDB: 2a 0 54 da f2 58 0 0 2b 0 length 22016 SMID 796 completed timedout cm 0xffffff8000608be0 ccb 0xffffff001ee5e800 during recovery ioc 804b scsi 0 state c xfer (da4:mps0:0:4:0): WRITE(10). CDB: 2a 0 54 da f2 58 0 0 2b 0 length 22016 SMID 796 terminated ioc 804b scsi 0 state c xfer 0
Nov 11 05:39:29 freenas kernel: (da4:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 length 0 SMID 683 completed timedout cm 0xffffff80005ffb18 ccb 0xffffff001ee40800 during recovery ioc 8048 scsi 0 state c xf(noperiph:mps0:0:4:0): SMID 1 abort TaskMID 683 status 0x4a code 0x0 count 2
Nov 11 05:39:29 freenas kernel: (noperiph:mps0:0:4:0): SMID 1 finished recovery after aborting TaskMID 683
Nov 11 05:39:29 freenas kernel: mps0: mpssas_free_tm releasing simq
Nov 11 05:39:49 freenas afpd[4230]: transmit: Request to dbd daemon (db_dir /mnt/bollar/backups/timemachine) timed out.
Nov 11 05:40:06 freenas kernel: (da4:mps0:0:4:0): TEST UNIT READY. CDB: 0 0 0 0 0 0 length 0 SMID 686 terminated ioc 804b scsi 0 state 0 xfer 0
Nov 11 05:40:08 freenas kernel: mps0: mpssas_alloc_tm freezing simq
Nov 11 05:40:11 freenas cnid_dbd[4221]: read: Connection reset by peer
Nov 11 05:40:11 freenas kernel: mps0: mpssas_remove_complete on handle 0x000e, IOCStatus= 0x0
Nov 11 05:40:11 freenas kernel: mps0: mpssas_free_tm releasing simq
Nov 11 05:40:11 freenas kernel: (da4:mps0:0:4:0): lost device - 1 outstanding, 1 refs
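
For anyone wading through a log like the one above, here is a rough sketch (not an official tool) that tallies command timeouts and "lost device" events per disk. The function name `summarize` and the regex are my own assumptions, and the sample lines are copied from the excerpt above.

```python
import re
from collections import Counter

# Matches the CAM device prefix, e.g. "(da4:mps0:0:4:0)".
# Assumption: disks show up as daN behind an mps controller, as in the log above.
DEVICE_RE = re.compile(r"\((da\d+):mps\d+:\d+:\d+:\d+\)")

def summarize(lines):
    """Count 'command timeout' and 'lost device' messages per disk."""
    timeouts = Counter()
    lost = Counter()
    for line in lines:
        match = DEVICE_RE.search(line)
        if match is None:
            continue  # controller-level message with no disk attached
        dev = match.group(1)
        if "command timeout" in line:
            timeouts[dev] += 1
        if "lost device" in line:
            lost[dev] += 1
    return timeouts, lost

# Three lines taken verbatim from the excerpt above.
sample = [
    "Nov 11 05:39:25 freenas kernel: (da4:mps0:0:4:0): SYNCHRONIZE CACHE(10). CDB: 35 0 0 0 0 0 0 0 0 0 length 0 SMID 683 command timeout cm 0xffffff80005ffb18 ccb 0xffffff001ee40800",
    "Nov 11 05:39:25 freenas kernel: (da4:mps0:0:4:0): WRITE(10). CDB: 2a 0 54 da f2 58 0 0 2b 0 length 22016 SMID 796 command timeout cm 0xffffff8000608be0 ccb 0xffffff001ee5e800",
    "Nov 11 05:40:11 freenas kernel: (da4:mps0:0:4:0): lost device - 1 outstanding, 1 refs",
]

timeouts, lost = summarize(sample)
for dev in sorted(set(timeouts) | set(lost)):
    print(f"{dev}: {timeouts[dev]} timeouts, {lost[dev]} lost-device events")
```

On a live box you would pass in the lines of the system log (typically /var/log/messages on FreeBSD) instead of `sample`. If the counts pile up on one disk, suspect that drive or its cable; if they spread across several disks, the controller or power is more likely.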

BTW, I regularly send drives back to Seagate when they fail and lately, I have been getting the Seagate Barracuda Green (Adv. Format) / ST2000DL003-9VT166 in return.

jlpellet · Nov 11, 2012

Turns out both drives were giving the problem during SMART testing, which defaults to 30-minute intervals. I replaced each in turn, resilvering between replacements, and seem to have had no data loss. Both of the drives failed Seagate's diagnostic & are in the process of warranty repair.

cyberjock · Nov 11, 2012

Woohoo! No data loss! Cheer for redundancy! That's the kind of post I like to see!

Ken Almond · Aug 8, 2014

I just had (what sounds like) the same problem. I struggled for over a year with an Intel GLT33M (if I remember) motherboard because it was left over. Recently (2 months ago) I gave up and installed a Gigabyte GA-EP45T-UD3LR v1.1 motherboard. The zpool was read automatically on boot (even though it was a different motherboard) - very cool - and all was rock solid. Till today - I simply added a bigger fan. This involved moving the power connectors and SATA cables here and there... but that's it.

Then, on reboot, I started getting 'detached'. I rebooted at least 10 times, wiggled things, plugged/unplugged, and changed a SATA cable to a different port on the motherboard (I'm using 5 ports of 6). It seemed to 'randomly' detach 1 (or more, in a couple of tries) of 3 of my 5 disks. The disks ARE all Seagate 3TB 'standard' (cheap).

Note: Even when one of drives was 'REMOVED' - the smart tests worked and showed all was well at SMART level.

So, I changed AHCI to IDE in the BIOS and it came up OK (for now). Running zpool scrub....

It is a little disconcerting to not know and 'be afraid' (this all takes unexpected time - 2 hrs so far). Changing disks = $600 at least (5 x $110).

I wish there was some way to diagnose, report, or a process to more deterministically figure this kind of stuff out. I understand... and don't know HOW to do it... but I can still wish. I was really hoping a new (modern) motherboard would fix things.

Bottom line - I don't think drives were bad in this case.
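
As a partial answer to the wish for a more deterministic check: a minimal sketch that scans `zpool status` text and lists every row that is not ONLINE or has non-zero error counters. The function name `unhealthy_devices` is hypothetical, and the sample output is made up for illustration (pool and gptid names are not from a real system).

```python
# States that `zpool status` prints in the STATE column of its config table.
KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "OFFLINE", "UNAVAIL", "REMOVED"}

def unhealthy_devices(status_text):
    """Return (name, state, read, write, cksum) for every config row that is
    not ONLINE or that has a non-zero error counter."""
    problems = []
    for line in status_text.splitlines():
        fields = line.split()
        # Config rows look like: NAME STATE READ WRITE CKSUM [notes]
        if len(fields) < 5 or fields[1] not in KNOWN_STATES:
            continue
        name, state, read, write, cksum = fields[:5]
        if state != "ONLINE" or (read, write, cksum) != ("0", "0", "0"):
            problems.append((name, state, read, write, cksum))
    return problems

# Illustrative output only: pool and gptid names are made up.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

    NAME                 STATE     READ WRITE CKSUM
    tank                 DEGRADED     0     0     0
      raidz1-0           DEGRADED     0     0     0
        gptid/example-1  ONLINE       0     0     0
        gptid/example-2  REMOVED      0     0     0
        gptid/example-3  ONLINE       0     0     0
"""

problems = unhealthy_devices(SAMPLE_STATUS)
for name, state, read, write, cksum in problems:
    print(f"{name}: {state} (read={read} write={write} cksum={cksum})")
```

On a real system the text would come from running `zpool status` (for example via `subprocess.run(["zpool", "status"], capture_output=True, text=True).stdout`). Logging the result on a schedule at least shows which disks drop and when. It will not tell you whether the cause is the drive, the cable, the controller, or the power supply.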

cyberjock · Aug 8, 2014

Well, I really don't know what to say.

But, as you are using desktop boards, I can't say I'm surprised. When choosing to use desktop boards with FreeNAS you're kind of in a 'no man's land' and the behavior can range from non-functional to partially functional with quirks to fully functional. It's totally hit and miss. Since you won't find many people who use the same board you do, it's usually something where you just have to hope and pray that you aren't, someday, going to be hit with some quirk that eats your pool for a midnight snack.

I love Gigabyte motherboards and all of my desktops are made with them. But under no circumstances would I ever recommend Gigabyte as a FreeNAS server motherboard.

You are right in that the drives probably aren't bad. But considering that the motherboard alone isn't what we'd recommend, I have to wonder whether you've made other non-recommended choices that may also be keeping your system from behaving properly.

Ken Almond · Aug 9, 2014

Right after the post above, after 10 minutes, I had a disk detach. Multiple reboot attempts - each time there was more than 1 detach!! So I shut it down to replace the power supply.

I remembered reading that a bad power supply can cause 'random disk errors'. I had a 3-4 yr old power supply, I added a heavier fan to bring disk temps down from 38 C to 30 C, and we had a power event in the middle of the night. I have systems on APC, but maybe the bigger fan was the trigger?

In any case, I replaced the power supply and now things look solid again. I don't think I had much damage, as there was no activity (no read/write) going on during this whole episode, but I noticed that 3 of 5 disks showed (resilvering) simultaneously for about 10 mins as it worked thru my 9TB. Then it cleared, 0 errors, and I'm doing a scrub.

I don't understand the (resilvering) in () parens as opposed to "gptid/xxxx ONLINE 0 0 0 x.x resilvered".

>But, as you are using desktop boards, I can't say I'm surprised.
Yes, I agree. And cheap Seagate disks. But the alternatives for VMware NFS are 'really expensive' - $2K to $10K, it seems, just to get started. I hadn't really appreciated the Free part of FreeNAS until I started looking at alternatives. It's also a great education to work with FreeNAS. I also think other systems tend to gloss over the data validation aspect - at least I have trouble finding definitive statements on this.

Ken Almond · Sep 15, 2014

Wanted to do a follow-up to my Aug 8/9th postings on this thread.

To recap, on Aug 8th, I added a larger fan for better cooling - and then I had trouble. Who would have thought that a bigger fan could be just enough to stress or overload the power supply (it was 600 watt - i.e. enough power in theory - BUT 3-4 years old).

After replacing the power supply on Aug 9 - all is now 100% rock solid. Several GBs of writes and several scrubs and all is working perfectly.
