Drives keep resilvering

Xander May · Apr 29, 2017

Just ordered three Seagate 2 TB hard drives and dropped them into a RAID-Z config. I've been trying to copy from my external HDD over to them to repopulate my media collection on the server, but every time I do, I hit some critical error. The first time, I tried plugging the external hard drive into the server and importing the disk. That caused the drives to go into resilver mode and then the server crashed. Now I'm going over the network, and after 5gb they re-silvered and now say they're 'okay'. Are these disks going to fail in short order or is this a normal thing that happens that I don't know about?

Also, I know this is hardware, but since I have two problems I might as well go for the two-birds-one-stone approach: my network is wired for a 1 Gbit/s connection, but I only get occasional spikes into the gigabit range, with the usual sustained speed to FreeNAS being in the 200-300 Mbit/s range.

Dice · Apr 29, 2017

Hello,
Your story is slightly confusing.
1. Have you established a fresh raidz1 consisting of 3 drives?
2. External hdd connected to FreeNAS is ...a can of worms.
3.

Xander May said:
Now I'm going over the network, and after 5gb they re-silvered and now say they're 'okay'. Are these disks going to fail in short order or is this a normal thing that happens that I don't know about?

Try to cover the gaps in the story and provide a coherent version. Drives just don't resilver out of the blue.

gpsguy · Apr 29, 2017

Please provide detailed hardware information, as well as FreeNAS version. See the forum rules - https://forums.freenas.org/index.php?threads/updated-forum-rules-4-11-17.45124/

Are you connecting via wifi? When you see the slow speeds, what are you doing? Copying 10,000 tiny files? Please provide more information.

Xander May said:
Also, I know this is hardware, but since I have two problems I might as well go for the two-birds-one-stone approach: my network is wired for a 1 Gbit/s connection, but I only get occasional spikes into the gigabit range, with the usual sustained speed to FreeNAS being in the 200-300 Mbit/s range.

Xander May · Apr 29, 2017

Dice said:
Hello,
Your story is slightly confusing.
1. Have you established a fresh raidz1 consisting of 3 drives?
2. External hdd connected to FreeNAS is ...a can of worms.
3.

Try to cover the gaps in the story and provide a coherent version. Drives just don't resilver out of the blue.

Alright. If I provided you with an unedited video of my life during that hour, you would see:
Me starting up the machine FreeNAS is on
Me navigating to xxx.xxx.x.5 on the network and logging into FreeNAS
Me enabling SMB, then creating a RAID-Z1 with the three new drives
Me creating the share
Me using Unstoppable Copier to begin the file copy to the share
5 minutes later, me refreshing the FreeNAS page to see a CRITICAL ERROR red light saying my drives were now resilvering

Xander May · Apr 29, 2017

gpsguy said:
Please provide detailed hardware information, as well as FreeNAS version. See the forum rules - https://forums.freenas.org/index.php?threads/updated-forum-rules-4-11-17.45124/

Are you connecting via wifi? When you see the slow speeds, what are you doing? Copying 10,000 tiny files? Please provide more information.

MOBO: Asus M5A97 R2
Platform AMD Phenom(tm) II X4 965 Processor
Memory 8080MB Kingston
HDD: 3x2tb Seagate Barracuda Drives
Everything is wired, as I said in the original post, and I said I was repopulating a media collection, so it's a bunch of 200 MB to 1 GB files.

Xander May · Apr 29, 2017

Now I've got this error:

However, files are still copying to it fine and the pool status still says healthy...

gpsguy · Apr 29, 2017

Please do a zpool status -v and post the results in code tags.

Xander May · Apr 29, 2017

Code:

  pool: Big1
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 13.5M in 0h7m with 0 errors on Sat Apr 29 12:18:28 2017
config:

        NAME                                            STATE     READ WRITE CKSUM
        Big1                                            ONLINE       0     0     0
          raidz1-0                                      ONLINE       0     0     0
            gptid/086611dd-2c84-11e7-ac9d-60a44c300e94  ONLINE       0     0     0
            gptid/08f9edbb-2c84-11e7-ac9d-60a44c300e94  ONLINE       0     0     0
            gptid/0a853489-2c84-11e7-ac9d-60a44c300e94  ONLINE       0     0   149

errors: No known data errors
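
One quick way to spot which pool member is accumulating errors is to filter the device rows for non-zero counters. This is only a minimal sketch, not a FreeNAS tool: the function name flag_zpool_errors is made up for illustration, and it assumes the NAME / STATE / READ / WRITE / CKSUM column layout of the output above.

```shell
#!/bin/sh
# Hypothetical helper: reads `zpool status` text on stdin and prints every
# device row whose READ, WRITE or CKSUM counter is non-zero.
flag_zpool_errors() {
    awk '
        # Device rows have at least five fields and a known state in column 2.
        NF >= 5 && $2 ~ /^(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)$/ {
            if ($3 + $4 + $5 > 0)
                printf "%s READ=%s WRITE=%s CKSUM=%s\n", $1, $3, $4, $5
        }
    '
}

# Usage (on the NAS): zpool status Big1 | flag_zpool_errors
```

Fed the status output above, it would print only the third disk (gptid/0a853489-...), the one with 149 checksum errors.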

Dice · Apr 29, 2017

Please do a smartctl -a /dev/ada6p2

Xander May · Apr 29, 2017

Code:

/$ smartctl -a /dev/ada6p2
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

/dev/ada6p2: Unable to detect device type
Please specify device type with the -d option.

Use smartctl -h to get a usage summary

Guessing you meant:

Code:

/$ smartctl -a /dev/ada6
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  ST2000DM006-2DM164
Serial Number:  Z504H3AF
LU WWN Device Id: 5 000c50 0a26c6b70
Firmware Version: CC26
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Sat Apr 29 13:12:24 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)   Offline data collection activity
		   was completed without error.
		   Auto Offline Data Collection: Enabled.
Self-test execution status:  (  0)   The previous self-test routine completed
		   without error or no self-test has ever
		   been run.
Total time to complete Offline
data collection:	  (  97) seconds.
Offline data collection
capabilities:		 (0x7b) SMART execute Offline immediate.
		   Auto Offline data collection on/off support.
		   Suspend Offline collection upon new
		   command.
		   Offline surface scan supported.
		   Self-test supported.
		   Conveyance Self-test supported.
		   Selective Self-test supported.
SMART capabilities:  (0x0003)   Saves SMART data before entering
		   power-saving mode.
		   Supports SMART auto save timer.
Error logging capability:  (0x01)   Error logging supported.
		   General Purpose Logging supported.
Short self-test routine
recommended polling time:	 (  1) minutes.
Extended self-test routine
recommended polling time:	 ( 235) minutes.
Conveyance self-test routine
recommended polling time:	 (  2) minutes.
SCT capabilities:	 (0x1085)   SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  104  100  006  Pre-fail  Always  -  6494784
  3 Spin_Up_Time  0x0003  096  095  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  10
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000f  100  253  030  Pre-fail  Always  -  71062
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  139
 10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  9
183 Runtime_Bad_Block  0x0032  100  100  000  Old_age  Always  -  0
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0
187 Reported_Uncorrect  0x0032  100  100  000  Old_age  Always  -  0
188 Command_Timeout  0x0032  065  001  000  Old_age  Always  -  18225
189 High_Fly_Writes  0x003a  100  100  000  Old_age  Always  -  0
190 Airflow_Temperature_Cel 0x0022  073  070  045  Old_age  Always  -  27 (Min/Max 26/27)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  5
193 Load_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  17
194 Temperature_Celsius  0x0022  027  040  000  Old_age  Always  -  27 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0010  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x003e  200  196  000  Old_age  Always  -  33200
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  139 (124 137 0)
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  202858357
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  1286895

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Dice · Apr 29, 2017

Xander May said:
Guessing you meant:

Good catch.

Seagate drives are a bit 'unorthodox' to diagnose due to their proprietary SMART values.
However, what stands out are:
ID1
ID7
ID188
ID199
On most other drives, these would not show high raw values if the drive were healthy.
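
To keep an eye on just those four attributes without scrolling through the whole dump, something like the following could work. It is a sketch under assumptions: show_smart_raw is a made-up name, and it relies on RAW_VALUE being the tenth column of the attribute table, as in the output posted above.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -A` (or -a) text on stdin and prints
# the ID, name and RAW_VALUE of the attribute IDs given as arguments.
show_smart_raw() {
    ids=" $* "
    awk -v ids="$ids" '
        # Attribute rows start with a numeric ID; RAW_VALUE is column 10.
        $1 ~ /^[0-9]+$/ && NF >= 10 && index(ids, " " $1 " ") {
            print $1, $2, $10
        }
    '
}

# Usage (on the NAS): smartctl -A /dev/ada6 | show_smart_raw 1 7 188 199
```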

There are no SMART tests logged yet (scheduling them is part of setting up your FreeNAS; tips in this thread).

For now, I suggest you run a short and then a long SMART test.
smartctl -t short /dev/ada6
Wait 2 mins and then:
smartctl -t long /dev/ada6
This probably takes a couple of hours.
Once done, the smartctl -a /dev/ada6 will tell you if it passed the test.

From what I gather at the moment, you've potentially received a faulty drive.
Since Seagate's raw numbers are opaque to any tools other than their own proprietary ones, the drives need to pass SMART tests at the very least.

A recommended procedure to test hardware prior to committing them to a pool can be found here:
https://forums.freenas.org/index.php?threads/how-to-hard-drive-burn-in-testing.21451/
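
The short-then-long sequence above can be wrapped in a small script. This is a rough sketch, not the burn-in procedure from the linked thread: run and smart_test_sequence are made-up names, the 235 minute wait is simply the extended-test polling time this particular drive reports, and by default the script only prints what it would do.

```shell
#!/bin/sh
# Hypothetical wrapper around the sequence above. By default it only prints
# the commands (DRY_RUN=1); set DRY_RUN=0 to actually execute them.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

smart_test_sequence() {
    dev=$1
    run smartctl -t short "$dev"     # start the short self-test
    run sleep 120                    # wait 2 minutes, as suggested above
    run smartctl -t long "$dev"      # start the extended self-test
    run sleep 14100                  # 235 minutes, per this drive's polling time
    run smartctl -l selftest "$dev"  # print the self-test log
}

# Usage: DRY_RUN=0 smart_test_sequence /dev/ada6
```

The tests run inside the drive itself, so the machine has to stay up and the drive must not be reset while they are in progress.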

Xander May · Apr 29, 2017

Alrighty, I will test the drives now. The reason there are no smart tests is because I've actually had only about 2 hours with the drives before that issue. Before that they were just powered on doing nothing.

gpsguy · Apr 29, 2017

BTW, your motherboard has an onboard Realtek 8111F NIC. It's known to be crappy under FreeNAS. Replace it with an Intel Pro/1000 CT (about $30 USD) for much better performance.

Also, 8GB is the bare minimum of RAM needed for FreeNAS. If you plan to add plugins like Plex, etc. - you'll need to add additional RAM.

Xander May · Apr 29, 2017

gpsguy said:
BTW, your motherboard has an onboard Realtek 8111F NIC. It's known to be crappy under FreeNAS. Replace it with an Intel Pro/1000 CT (about $30 USD) for much better performance.

Also, 8GB is the bare minimum of RAM needed for FreeNAS. If you plan to add plugins like Plex, etc. - you'll need to add additional RAM.

Did not know about the Realtek issue, but will for sure be getting more ram. Poor college student and all, just wanted to cobble together a server to download to while I'm at work for the summer.

Xander May · Apr 29, 2017

These seem to be from the short tests. Ran on two drives for fun....well shit...

Waited until after the time it told me the long test would take. Here is the new smartctl:

Code:

 smartctl -a /dev/ada6
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:  ST2000DM006-2DM164
Serial Number:  Z504H3AF
LU WWN Device Id: 5 000c50 0a26c6b70
Firmware Version: CC26
User Capacity:  2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:  512 bytes logical, 4096 bytes physical
Rotation Rate:  7200 rpm
Form Factor:  3.5 inches
Device is:  Not in smartctl database [for details use: -P showall]
ATA Version is:  ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:  Sat Apr 29 17:44:16 2017 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
  was completed without error.
  Auto Offline Data Collection: Enabled.
Self-test execution status:  (  41) The self-test routine was interrupted
  by the host with a hard or soft reset.
Total time to complete Offline
data collection:  (  97) seconds.
Offline data collection
capabilities:  (0x7b) SMART execute Offline immediate.
  Auto Offline data collection on/off support.
  Suspend Offline collection upon new
  command.
  Offline surface scan supported.
  Self-test supported.
  Conveyance Self-test supported.
  Selective Self-test supported.
SMART capabilities:  (0x0003) Saves SMART data before entering
  power-saving mode.
  Supports SMART auto save timer.
Error logging capability:  (0x01) Error logging supported.
  General Purpose Logging supported.
Short self-test routine
recommended polling time:  (  1) minutes.
Extended self-test routine
recommended polling time:  ( 235) minutes.
Conveyance self-test routine
recommended polling time:  (  2) minutes.
SCT capabilities:  (0x1085) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME  FLAG  VALUE WORST THRESH TYPE  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate  0x000f  105  100  006  Pre-fail  Always  -  7711280
  3 Spin_Up_Time  0x0003  096  095  000  Pre-fail  Always  -  0
  4 Start_Stop_Count  0x0032  100  100  020  Old_age  Always  -  11
  5 Reallocated_Sector_Ct  0x0033  100  100  010  Pre-fail  Always  -  0
  7 Seek_Error_Rate  0x000f  100  253  030  Pre-fail  Always  -  84021
  9 Power_On_Hours  0x0032  100  100  000  Old_age  Always  -  143
 10 Spin_Retry_Count  0x0013  100  100  097  Pre-fail  Always  -  0
 12 Power_Cycle_Count  0x0032  100  100  020  Old_age  Always  -  10
183 Runtime_Bad_Block  0x0032  100  100  000  Old_age  Always  -  0
184 End-to-End_Error  0x0032  100  100  099  Old_age  Always  -  0
187 Reported_Uncorrect  0x0032  100  100  000  Old_age  Always  -  0
188 Command_Timeout  0x0032  098  001  000  Old_age  Always  -  8590085161
189 High_Fly_Writes  0x003a  100  100  000  Old_age  Always  -  0
190 Airflow_Temperature_Cel 0x0022  074  070  045  Old_age  Always  -  26 (Min/Max 25/27)
191 G-Sense_Error_Rate  0x0032  100  100  000  Old_age  Always  -  0
192 Power-Off_Retract_Count 0x0032  100  100  000  Old_age  Always  -  5
193 Load_Cycle_Count  0x0032  100  100  000  Old_age  Always  -  19
194 Temperature_Celsius  0x0022  026  040  000  Old_age  Always  -  26 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012  100  100  000  Old_age  Always  -  0
198 Offline_Uncorrectable  0x0010  100  100  000  Old_age  Offline  -  0
199 UDMA_CRC_Error_Count  0x003e  200  196  000  Old_age  Always  -  36124
240 Head_Flying_Hours  0x0000  100  253  000  Old_age  Offline  -  143 (1 141 0)
241 Total_LBAs_Written  0x0000  100  253  000  Old_age  Offline  -  262013717
242 Total_LBAs_Read  0x0000  100  253  000  Old_age  Offline  -  1440312

SMART Error Log Version: 1
ATA Error Count: 2
  CR = Command Register [HEX]
  FR = Features Register [HEX]
  SC = Sector Count Register [HEX]
  SN = Sector Number Register [HEX]
  CL = Cylinder Low Register [HEX]
  CH = Cylinder High Register [HEX]
  DH = Device/Head Register [HEX]
  DC = Device Command Register [HEX]
  ER = Error register [HEX]
  ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 139 hours (5 days + 19 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 00 00 00 00 00 ff  00:12:06.665  NOP [Abort queued commands]
  b0 d4 00 82 4f c2 40 00  00:11:46.607  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 40 00  00:11:46.546  SMART READ DATA
  ec 00 01 00 00 00 40 00  00:11:46.542  IDENTIFY DEVICE
  ef 02 00 00 00 00 40 00  00:09:25.564  SET FEATURES [Enable write cache]

Error 1 occurred at disk power-on lifetime: 139 hours (5 days + 19 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 00 00 00 00 00  Error: ABRT

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  00 00 00 00 00 00 00 ff  00:09:25.410  NOP [Abort queued commands]
  b0 d4 00 81 4f c2 40 00  00:09:05.368  SMART EXECUTE OFF-LINE IMMEDIATE
  b0 d0 01 00 4f c2 40 00  00:09:05.304  SMART READ DATA
  ec 00 01 00 00 00 40 00  00:09:05.300  IDENTIFY DEVICE
  b0 d4 00 7f 4f c2 40 00  00:07:17.564  SMART EXECUTE OFF-LINE IMMEDIATE

SMART Self-test log structure revision number 1
Num  Test_Description  Status  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive  Interrupted (host reset)  90%  139  -
# 2  Short captive  Interrupted (host reset)  70%  139  -
# 3  Extended offline  Aborted by host  90%  139  -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
  1  0  0  Not_testing
  2  0  0  Not_testing
  3  0  0  Not_testing
  4  0  0  Not_testing
  5  0  0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
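
Since there are now two dumps of the same drive about four and a half hours apart, the counters can be compared directly. The sketch below assumes each dump was saved to a text file; smart_raw_delta and the file names are made up for illustration. For attribute 199 (UDMA_CRC_Error_Count) the two dumps in this thread show 33200 and 36124, a growth of 2924. That counter tracks CRC errors on the link between drive and controller, so a steadily rising value is usually read as a hint toward cabling or the controller rather than the platters, though Seagate raw values should be interpreted with caution.

```shell
#!/bin/sh
# Hypothetical helper: prints how much the RAW_VALUE of one attribute grew
# between two saved `smartctl -A` dumps (older file first).
smart_raw_delta() {
    id=$1; before=$2; after=$3
    # RAW_VALUE is assumed to be column 10 of the attribute row.
    old=$(awk -v id="$id" '$1 == id && NF >= 10 { print $10; exit }' "$before")
    new=$(awk -v id="$id" '$1 == id && NF >= 10 { print $10; exit }' "$after")
    echo $((new - old))
}

# Usage: smart_raw_delta 199 smart-1312.txt smart-1744.txt
```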

Robert Trevellyan · Apr 29, 2017

It looks like you have flaky hardware. Not clear if it's the drive(s) or something else.

Xander May · Apr 29, 2017

Robert Trevellyan said:
It looks like you have flaky hardware. Not clear if it's the drive(s) or something else.

To be honest, I'm willing to bet it's the motherboard at this point. I recently switched from one motherboard to another. IMMEDIATELY after that, my HBA stopped working properly. Now these new hard drives too. Fan....Tastic

Robert Trevellyan · Apr 29, 2017

Seems reasonable. The SMART output doesn't show any obvious issues.

Dice · Apr 29, 2017

But this however:

Code:

SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended captive Interrupted (host reset) 90% 139 -
# 2 Short captive Interrupted (host reset) 70% 139 -
# 3 Extended offline Aborted by host 90% 139 -

...The SMART tests don't finish properly. It is a definite worry to me. If it is user-induced - retry. If not.... something is fishy.
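
When retrying, a small filter makes it easy to check whether any logged test failed to finish. This is a sketch only: unfinished_selftests is a made-up name, and it assumes log entries begin with "#" and that a clean run reads "Completed without error", as smartctl normally prints.

```shell
#!/bin/sh
# Hypothetical helper: reads `smartctl -l selftest` text on stdin and prints
# every logged test that did not complete without error.
unfinished_selftests() {
    awk '
        # Log entries start with "#"; header and prose lines do not.
        /^#/ && $0 !~ /Completed without error/ { print }
    '
}

# Usage (on the NAS): smartctl -l selftest /dev/ada6 | unfinished_selftests
```

Against the log quoted above it would print all three entries, since none of them completed.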

Robert Trevellyan · Apr 30, 2017

Yes, but that's consistent with some other flaky piece of hardware causing the drive to disconnect and reset intermittently, which could also lead to the ZFS checksum errors.

EDIT: for clarity - it could well be the drive itself, but didn't you see trouble on more than one drive?
