Replace Bad Drive - System keeps rebooting...

Status
Not open for further replies.

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Howdy all. I had a drive ready to fail about a month ago with unrecoverable sectors, and immediately initiated an RMA of the drive with WD.

The drives I use in my NAS are:
Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model: WDC WD30EZRX

The first drive was simple. I offlined the disk, powered down the NAS, performed the swap, replaced the disk with the new one via the GUI, and all was well after about 14 hours of resilvering.

I've been keeping a close eye on it, as I'm pretty sure that once one drive goes, more will follow. Sure enough, over the weekend, I noticed another drive starting to report SMART errors. I RMA'd that drive, and performed the replacement tonight... Then the issues started.

The resilvering process was going, and then I noticed that, all of a sudden, ada0 (the drive I just replaced) had simply disappeared. Oddly enough, resilvering was still going... I decided I should reboot and see if the drive would come back, thinking that maybe WD sent me a bad drive as an RMA... The drive came back, and resilvering continued after the reboot.

Subsequently the system has dropped the drive once more and rebooted automatically on its own twice now... I've done NOTHING but replace the drive. No changes whatsoever. I've looked through the logs and can't seem to pinpoint what the issue is. I'm thinking that I will allow this resilvering to finish, and then issue another RMA on this drive as well.

The weird thing is that the SMART short test passed without issue. I'm performing a long test now while the resilvering continues...

Thoughts?
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Did you burn in the drive / pre-test it before use? Short SMART/long SMART/badblocks?
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Honestly, no. I don't have a system to run those types of tasks in, unless they are something that I can do in the FreeNAS box itself...

This is only the second time I've ever had to replace a hard drive in FreeNAS, and I've run about 6 complete builds without any issues so far...

Any recommendations on how I could pre-test it within FreeNAS? Maybe Offline it and run some commands from the terminal?
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
With the crazy high hard drive capacities we're all at now, I don't understand why you wouldn't do a little bit of testing when you get the drives.. There are many tools out there to run tests / diagnostics etc on drives before even booting FreeNAS..

You did a memtest at least, right? I'm not experienced with your hardware (Intel G41 chipset). Non-ECC RAM can be a pool eater.. Have the Greens been WDIDLEd to disable / change the timer? Look at the SMART values, maybe something is off.. Loose cable etc? Not a fan of the Greens to be honest.. I mirrored mine when my RMA came in.. Made sure WDIDLE had taken effect and set them up to sleep.. I rarely touch them these days.. I don't plan on buying more WD drives for the time being..
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Yea, I did a memory check and verified all cables, and the SMART values all look OK. I need to enable WDIDLE; I've only just read about that tonight. The array has been running for almost 2 years now, and I haven't had a single problem until this week, and WDIDLE wasn't anything mentioned back then. I need a backup before messing with the WDIDLE stuff though.

It's been up for almost 4 hours now, and everything seems to be going ok. I'll check back in with this thread if something changes...
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Run some SMART tests on all drives.. Short and long.. Conveyance etc.. Maybe even badblocks on the new drive.. What is the load cycle count on the older Greens? Run "smartctl -a -q noserial /dev/ada0", then the same for ada1 etc, to get SMART info for all drives.. Have you scheduled automatic SMART tests and scrubs on the drives?

That drive controller is onboard right? Not an add-on card etc?
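
For anyone following along, the commands suggested above could be generated for every drive with something like the sketch below. It only builds and prints the command lines rather than running them. The device names ada0 to ada3 are taken from this thread and are an assumption for any other system; the helper names are made up for illustration.

```python
import shlex

# Device names as they appear in this thread; adjust for your own system.
DEVICES = [f"/dev/ada{i}" for i in range(4)]

def info_command(device):
    """Full SMART report with serial numbers hidden, as quoted in the post."""
    return ["smartctl", "-a", "-q", "noserial", device]

def selftest_command(device, kind):
    """Command that starts a drive self-test of the given kind."""
    if kind not in ("short", "long", "conveyance"):
        raise ValueError(f"unknown self-test type: {kind}")
    return ["smartctl", "-t", kind, device]

# Print the commands so they can be reviewed before being pasted into a shell.
for dev in DEVICES:
    print(shlex.join(info_command(dev)))
    print(shlex.join(selftest_command(dev, "long")))
```

Printing instead of executing keeps the sketch harmless on a box that is busy resilvering; the printed lines can be run by hand once the pool is idle.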
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
I have run SMART tests on all drives, both short and long. Short tests are scheduled once a week, long tests are scheduled every 2 months... None of the drives are having any issues.

I wasn't familiar with the conveyance test, so I will perform those too. Any recommendation on how often to schedule the conveyance test?

I can't post the smartctl output right now because I'm at work, but I know the LCC is around 230000 for the two older drives... I've read that's not good.

Scrubs are also scheduled, but I can't remember how often they run, I want to say every 15 days.

The card is an add-in PCI card.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Just to confirm what do you mean by disappearing? Literally dropping from the system?

You had no issues whatsoever with that PCI card? How many drives are you running.. You can't use motherboard ports? Maybe another memtest is in order.. Random reboots are not good.. I would probably pull all the data until you can get a handle on what is going on..

The only other thing I had issues with was when I switched to the SuperMicro board and had Watchdog-triggered resets..
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Thanks for all the help. I was able to VPN home and get some data... First, the smartctl outputs:

OUTPUT FOR ADA0 - Drive that keeps dropping out
Code:
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:    WDC WD30EZRX-00D8PB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar 20 10:27:37 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x80)    Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 241)    Self-test routine in progress...
                    10% of test remaining.
Total time to complete Offline
data collection:        (40560) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 407) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x7035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  100  253  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  165  165  021    Pre-fail  Always      -      6750
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      8
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      14
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      8
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      7
193 Load_Cycle_Count        0x0032  200  200  000    Old_age  Always      -      14
194 Temperature_Celsius    0x0022  110  105  000    Old_age  Always      -      40
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  100  253  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  100  253  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      80%        2        -
# 2  Short offline      Completed without error      00%        0        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


OUTPUT FOR ADA1 - One of the original drives
Code:
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:    WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar 20 10:28:01 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (52800) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 507) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  157  147  021    Pre-fail  Always      -      9133
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      39
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  071  071  000    Old_age  Always      -      21481
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      35
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      17
193 Load_Cycle_Count        0x0032  117  117  000    Old_age  Always      -      249845
194 Temperature_Celsius    0x0022  112  095  000    Old_age  Always      -      40
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%    21476        -
# 2  Short offline      Completed without error      00%    21448        -
# 3  Short offline      Completed without error      00%    21424        -
# 4  Short offline      Completed without error      00%    21400        -
# 5  Short offline      Completed without error      00%    21376        -
# 6  Short offline      Completed without error      00%    21352        -
# 7  Short offline      Completed without error      00%    21328        -
# 8  Short offline      Completed without error      00%    21305        -
# 9  Short offline      Completed without error      00%    21281        -
#10  Short offline      Completed without error      00%    21257        -
#11  Short offline      Completed without error      00%    21233        -
#12  Short offline      Completed without error      00%    21210        -
#13  Short offline      Completed without error      00%    21186        -
#14  Short offline      Completed without error      00%    21162        -
#15  Short offline      Completed without error      00%    21142        -
#16  Short offline      Completed without error      00%    21114        -
#17  Short offline      Completed without error      00%    21090        -
#18  Short offline      Completed without error      00%    21067        -
#19  Short offline      Completed without error      00%    21044        -
#20  Short offline      Completed without error      00%    21019        -
#21  Short offline      Completed without error      00%    20995        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


OUTPUT FOR ADA2 - One of the original drives
Code:
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:    WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar 20 10:28:05 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (51000) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 490) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  158  147  021    Pre-fail  Always      -      9058
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      39
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  200  200  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  071  071  000    Old_age  Always      -      21481
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      35
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      17
193 Load_Cycle_Count        0x0032  117  117  000    Old_age  Always      -      250149
194 Temperature_Celsius    0x0022  111  096  000    Old_age  Always      -      41
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline      Completed without error      00%    21480        -
# 2  Short offline      Completed without error      00%    21448        -
# 3  Short offline      Completed without error      00%    21424        -
# 4  Short offline      Completed without error      00%    21400        -
# 5  Short offline      Completed without error      00%    21376        -
# 6  Short offline      Completed without error      00%    21352        -
# 7  Short offline      Completed without error      00%    21328        -
# 8  Short offline      Completed without error      00%    21304        -
# 9  Short offline      Completed without error      00%    21280        -
#10  Short offline      Completed without error      00%    21256        -
#11  Short offline      Completed without error      00%    21232        -
#12  Short offline      Completed without error      00%    21209        -
#13  Short offline      Completed without error      00%    21185        -
#14  Short offline      Completed without error      00%    21161        -
#15  Short offline      Completed without error      00%    21145        -
#16  Short offline      Completed without error      00%    21114        -
#17  Short offline      Completed without error      00%    21090        -
#18  Short offline      Completed without error      00%    21066        -
#19  Short offline      Completed without error      00%    21050        -
#20  Short offline      Completed without error      00%    21018        -
#21  Short offline      Completed without error      00%    20994        -
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



OUTPUT FOR ADA3 - Replaced about 2 weeks ago without issue.

Code:
=== START OF INFORMATION SECTION ===
Model Family:    Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:    WDC WD30EZRX-00MMMB0
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:    512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:  ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Thu Mar 20 10:28:09 2014 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
 
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
 
General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (  0)    The previous self-test routine completed
                    without error or no self-test has ever
                    been run.
Total time to complete Offline
data collection:        (50700) seconds.
Offline data collection
capabilities:              (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine
recommended polling time:      (  2) minutes.
Extended self-test routine
recommended polling time:      ( 487) minutes.
Conveyance self-test routine
recommended polling time:      (  5) minutes.
SCT capabilities:            (0x3035)    SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
 
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG    VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate    0x002f  200  200  051    Pre-fail  Always      -      0
  3 Spin_Up_Time            0x0027  100  253  021    Pre-fail  Always      -      0
  4 Start_Stop_Count        0x0032  100  100  000    Old_age  Always      -      6
  5 Reallocated_Sector_Ct  0x0033  200  200  140    Pre-fail  Always      -      0
  7 Seek_Error_Rate        0x002e  100  253  000    Old_age  Always      -      0
  9 Power_On_Hours          0x0032  100  100  000    Old_age  Always      -      347
10 Spin_Retry_Count        0x0032  100  253  000    Old_age  Always      -      0
11 Calibration_Retry_Count 0x0032  100  253  000    Old_age  Always      -      0
12 Power_Cycle_Count      0x0032  100  100  000    Old_age  Always      -      6
192 Power-Off_Retract_Count 0x0032  200  200  000    Old_age  Always      -      3
193 Load_Cycle_Count        0x0032  194  194  000    Old_age  Always      -      19119
194 Temperature_Celsius    0x0022  113  110  000    Old_age  Always      -      39
196 Reallocated_Event_Count 0x0032  200  200  000    Old_age  Always      -      0
197 Current_Pending_Sector  0x0032  200  200  000    Old_age  Always      -      0
198 Offline_Uncorrectable  0x0030  200  200  000    Old_age  Offline      -      0
199 UDMA_CRC_Error_Count    0x0032  200  200  000    Old_age  Always      -      0
200 Multi_Zone_Error_Rate  0x0008  200  200  000    Old_age  Offline      -      0
 
SMART Error Log Version: 1
No Errors Logged
 
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]
 
 
SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Yes, the first two times it literally disappeared from the system, and there were errors about ATA_IDENTIFY:

Code:
Mar 19 20:12:41 beowulf-freenas kernel: ata6: SIGNATURE: ffffffff
Mar 19 20:12:41 beowulf-freenas kernel: (ada0:ata6:0:0:0): WRITE_DMA48. ACB: 35 00 48 69 2d 40 27 00 00 00 30 00
Mar 19 20:12:41 beowulf-freenas kernel: (ada0:ata6:0:0:0): CAM status: Command timeout
Mar 19 20:12:41 beowulf-freenas kernel: (ada0:ata6:0:0:0): Error 5, Periph was invalidated
Mar 19 20:12:41 beowulf-freenas kernel: ata6: timeout waiting to issue command
Mar 19 20:12:41 beowulf-freenas kernel: ata6: error issuing ATA_IDENTIFY command
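
Since /var/log/messages can get long, a small script can tally which device keeps timing out. This is only a sketch: the regular expression is written against the log lines shown above, and other FreeBSD versions or drivers may word these messages differently. The function name is made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as:
#   "... kernel: (ada0:ata6:0:0:0): CAM status: Command timeout"
CAM_STATUS = re.compile(
    r"\((?P<dev>[a-z]+\d+):[^)]*\): CAM status: (?P<status>.+)$"
)

def count_cam_errors(lines):
    """Return a Counter keyed by (device, CAM status) for the given log lines."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line.rstrip())
        if match:
            counts[(match.group("dev"), match.group("status"))] += 1
    return counts

# Three of the lines from the excerpt above, used as sample input.
sample = [
    "Mar 19 20:12:41 beowulf-freenas kernel: (ada0:ata6:0:0:0): WRITE_DMA48. ACB: 35 00 48 69 2d 40 27 00 00 00 30 00",
    "Mar 19 20:12:41 beowulf-freenas kernel: (ada0:ata6:0:0:0): CAM status: Command timeout",
    "Mar 19 20:12:41 beowulf-freenas kernel: ata6: error issuing ATA_IDENTIFY command",
]
for (dev, status), n in count_cam_errors(sample).items():
    print(dev, status, n)
```

To run it against the real log, pass an open file instead of the sample list, for example `count_cam_errors(open("/var/log/messages"))`.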


Interestingly enough, the resilvering completed about an hour ago, but the old drive was still listed. I detached the old drive, and resilvering started again... Is that normal?

I have yet to perform a scrub since replacing this second drive. I did one when I replaced the first drive without issue.

I have the /var/log/messages files, and can attach those if you'd like.

After the first two times when the drive just dropped out with the above errors, I rebooted each time, and it came back. After resilvering for about an hour last night, the system cold-booted by itself, but when it came back up, everything seemed OK, and the resilvering completed this morning (about an hour ago).
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Resilvering started on the new drive? I'm confused, do you mean the old drive was still listed in the GUI? Did you replace the drive according to the documentation? The Greens will all need wdidle by the looks of it..
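
To put the wdidle remark in numbers, the Load_Cycle_Count and Power_On_Hours raw values from the SMART output posted above can be turned into a head-parking rate. A rough sketch, with the figures copied from that output; the function name is made up for illustration:

```python
# (Load_Cycle_Count, Power_On_Hours) raw values from the smartctl output above.
drives = {
    "ada0": (14, 14),
    "ada1": (249845, 21481),
    "ada2": (250149, 21481),
    "ada3": (19119, 347),
}

def load_cycles_per_hour(load_cycles, power_on_hours):
    """Average number of head load/unload cycles per powered-on hour."""
    if power_on_hours <= 0:
        raise ValueError("power-on hours must be positive")
    return load_cycles / power_on_hours

for name, (cycles, hours) in drives.items():
    print(f"{name}: {load_cycles_per_hour(cycles, hours):.1f} load cycles per hour")
```

The two original drives come out at a bit under 12 cycles per hour, and the drive replaced two weeks ago at roughly 55 per hour, so the idle timer is clearly parking the heads often on all of the drives that have been in service for a while.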
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Yea, I followed the instructions in the documentation, and it says under 6.3.12 Replacing a Failed Drive, "4. If the replaced disk continues to be listed after resilvering is complete, click its entry and use the “Detach” button to remove the disk from the list."

So, the resilvering finished and the old disk was still there in the GUI. I followed the procedure and detached the old failed disk, and immediately the pool started resilvering...

AND... About 35 min ago the drive dropped out again, and the system is now down a drive.

I'm thinking that I'm going to take the replacement drive out, and put back in the old drive, because at least that one works reliably (but has bad sectors), then RMA the replacement drive, and see if the 3rd drive fixes the issue before returning anything. It's got to be this drive, it's literally the ONLY thing that changed.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
Make sure you have a backup.. The RMA drive may be faulty as well.. If you haven't been having drops before, it's probably another bad drive?
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Yea, they're obviously both bad. I'm hoping that the 3rd time is the charm for this... We'll see.

The problem is, this *IS* my backup server. It's the only thing I have with enough space to store everything. That's why I chose Z1 for the single drive fault tolerance (RAID5). Looks like once this thing settles itself out I will be purchasing an external drive to back up this array to, purchasing a few more hard drives and creating a RAIDZ2 from scratch... That will give me 2 drive tolerance.
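
As a back-of-the-envelope check on what the RAIDZ2 rebuild would cost in space, usable capacity is roughly the number of data drives times the drive size. This sketch ignores ZFS metadata and padding overhead. It assumes the pool is the four 3 TB drives shown in the SMART output; the six-drive layout is purely hypothetical, since the number of extra drives isn't stated.

```python
def usable_tb(drive_count, drive_tb, parity):
    """Approximate usable space in TB: data drives times drive size.

    Ignores ZFS overhead, so real-world figures will be somewhat lower.
    """
    if drive_count <= parity:
        raise ValueError("need more drives than parity devices")
    return (drive_count - parity) * drive_tb

# Assumed current layout: four 3 TB drives in RAIDZ1 (one parity drive).
print("RAIDZ1, 4 x 3 TB:", usable_tb(4, 3, parity=1), "TB")
# Same four drives rebuilt as RAIDZ2 (two parity drives).
print("RAIDZ2, 4 x 3 TB:", usable_tb(4, 3, parity=2), "TB")
# Hypothetical six-drive RAIDZ2 after buying two more drives.
print("RAIDZ2, 6 x 3 TB:", usable_tb(6, 3, parity=2), "TB")
```

In other words, moving the same four drives to RAIDZ2 trades about a third of the usable space for the second drive of fault tolerance, which is why adding drives at the same time makes sense.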
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
You should have multiple copies of your data..
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
"I need to enable WDIDLE, I've only just read about that tonight. The array has been running for almost 2 years now, and I haven't had a single problem until this week, and WDIDLE wasn't anything mentioned back then. I need a backup before messing with the WDIDLE stuff though."

1. WDIDLE existed before I bought my drives in 2010.. so it's definitely been around. I remember doing lots of homework when I was picking my drives in 2010, and EVERYONE that was a server aficionado said that wdidle was virtually a requirement if you wanted to use Greens in a server.

2. Yes, technically a backup is a good idea, but wdidle is very safe. I've never heard of it damaging someone's data or the drive.

Controller: Promise SATA300 TX4

That's not exactly the world's most recommended controller for FreeNAS. So you may be seeing some latent problem rearing its ugly head as part of the problem.


"The problem is, this *IS* my backup server. It's the only thing I have with enough space to store everything. That's why I chose Z1 for the single drive fault tolerance (RAID5). Looks like once this thing settles itself out I will be purchasing an external drive to back up this array to, purchasing a few more hard drives and creating a RAIDZ2 from scratch... That will give me 2 drive tolerance."

Yeah, that's highly recommended.

One other thing is that all your SMART data shows your disks at 39C+. Per the Google white paper, hard drives should be kept below 40C at all times; at 40C+ failure rates skyrocket. And if I'm not mistaken, all of your disks except one were too hot. I'm not sure if those temps were taken during a scrub or not, but if you weren't running a scrub then you're actually worse off, because the drives will often heat up 2-5C when doing a scrub, so you are literally cooking all of your disks every time you do a scrub.
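
The temperature check described here can be automated by pulling attribute 194 out of each drive's smartctl report. A minimal sketch, using the Temperature_Celsius rows posted earlier in the thread as sample input. The 40C limit is the one quoted in this post, not a vendor specification, and the function name is made up for illustration.

```python
THRESHOLD_C = 40  # limit quoted in the post above; not a vendor specification

def raw_temperature(smartctl_output):
    """Return the raw value of attribute 194 (Temperature_Celsius), or None."""
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have ten columns; the raw value starts in the tenth.
        if len(fields) >= 10 and fields[0] == "194":
            return int(fields[9])
    return None

# Attribute 194 rows copied from the four reports posted earlier in the thread.
reports = {
    "ada0": "194 Temperature_Celsius    0x0022  110  105  000    Old_age  Always      -      40",
    "ada1": "194 Temperature_Celsius    0x0022  112  095  000    Old_age  Always      -      40",
    "ada2": "194 Temperature_Celsius    0x0022  111  096  000    Old_age  Always      -      41",
    "ada3": "194 Temperature_Celsius    0x0022  113  110  000    Old_age  Always      -      39",
}

temps = {dev: raw_temperature(text) for dev, text in reports.items()}
too_hot = sorted(
    dev for dev, temp in temps.items()
    if temp is not None and temp >= THRESHOLD_C
)
print("At or above", THRESHOLD_C, "C:", ", ".join(too_hot))
```

With the posted values this flags ada0, ada1 and ada2, which matches the "all except one" remark; only ada3 at 39C stays under the limit.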
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
@cyberjock - Thanks so much for your insight! I read so many posts around this forum and you really have a handle on this system.

1. I will plan on running WDIDLE tonight on the 3 functioning drives using the bootable ISO.
2. I'm looking into getting an IBM RAID controller like you mention, but again... These things all cost $$.
3. Looks like I'll also be purchasing an external drive that I can back up to. My plan is then to switch to RAIDZ2 and rebuild my array. At that point, I can just back up the most important stuff on the external, and hope that dual redundancy is good enough for my setup.
4. I will look at the placement of the server and possibly adding an additional case fan to try and bring the temperature down.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Regarding #3.. you are using non-ECC RAM.. that's a recipe for sudden disaster. So yeah.. proceed at your own peril.
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
I just really don't want to have to spend $1000 on a supermicro system... Maybe I will...
 

jsylvia007

Explorer
Joined
Oct 4, 2011
Messages
84
Hey @cyberjock...

I can pick up one of these: http://www.supermicro.com/products/motherboard/Xeon1333/5000P/X7DB3.cfm with two of these: http://ark.intel.com/products/33081/Intel-Xeon-Processor-E5430-12M-Cache-2_66-GHz-1333-MHz-FSB on the cheap...

Would you say as long as I added memory and the IBM RAID Card I'd be in better shape? I understand it's old hardware, but, total cost is 1/3 the price of buying all new, and honestly, I'm not a heavy hitter when it comes to use. Mostly it sits there idle, and serves up a video or a backup file a few times a week.
 