Two drives failing at the same time?

mmolinari · Mar 3, 2017

Hi,
I've been running this cheap home server for more than 6 years:
- Intel i5-650
- ZOTAC Motherboard H55ITX-A-E
- 8GB non-ECC RAM

I recently added a SAS 9207-8I controller.

The server has a pool with two RAID-Z2 vdevs. It has been running fine until a couple of weeks ago, when this happened:

Code:

Feb 18 18:45:25 PigZilla (ada1:ata2:0:1:0): CAM status: Command timeout
Feb 18 18:45:25 PigZilla (ada1:ata2:0:1:0): Retrying command
Feb 18 18:53:13 PigZilla (ada1:ata2:0:1:0): FLUSHCACHE48. ACB: ea 00 00 00 00 40 00 00 00 00 00 00
Feb 18 18:53:13 PigZilla (ada1:ata2:0:1:0): CAM status: Command timeout
Feb 18 18:53:13 PigZilla (ada1:ata2:0:1:0): Retrying command
Feb 18 18:53:13 PigZilla ada0 at ata2 bus 0 scbus2 target 0 lun 0
Feb 18 18:53:13 PigZilla ada0: <ST4000DM000-1F2168 CC52> s/n W300C91L detached
Feb 18 18:53:13 PigZilla ada1 at ata2 bus 0 scbus2 target 1 lun 0
Feb 18 18:53:13 PigZilla ada1: <ST4000DX000-1C5160 CC42> s/n Z1Z0084Z detached
Feb 18 18:53:14 PigZilla swap_pager: I/O error - pagein failed; blkno 62,size 8192, error 6
Feb 18 18:53:14 PigZilla GEOM_ELI: vm_fault: pager read error, pid 1200 (devd)
Feb 18 18:53:14 PigZilla Device ada0p1.eli destroyed.
Feb 18 18:53:14 PigZilla GEOM_ELI: Detached ada0p1.eli on last close.
Feb 18 18:53:14 PigZilla GEOM_ELI: Device ada1p1.eli destroyed.
Feb 18 18:53:14 PigZilla GEOM_ELI: Detached ada1p1.eli on last close.
Feb 18 18:53:14 PigZilla swap_pager: I/O error - pagein failed; blkno 524321,size 16384, error 6
Feb 18 18:53:14 PigZilla vm_fault: pager read error, pid 1200 (devd)
Feb 18 18:53:14 PigZilla kernel: Failed to write core file for process devd (error 14)
Feb 18 18:53:14 PigZilla kernel: Failed to write core file for process devd (error 14)
Feb 18 18:53:14 PigZilla kernel: pid 1200 (devd), uid 0: exited on signal 11
Feb 18 18:53:14 PigZilla zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_14071799796770780259_vdev_8531010438066722682.case.
Feb 18 18:53:14 PigZilla zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_14071799796770780259_vdev_15017151504146195539.case.
Feb 18 18:53:14 PigZilla (ada0:ata2:0:0:0): Periph destroyed
Feb 18 18:53:14 PigZilla (ada1:ata2:0:1:0): Periph destroyed

S.M.A.R.T. was (and still is) fine for the drives; I've attached the output in the footer. I run scrubs every month, and have never had a single issue.
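A side note on reading that output: the Command_Timeout raw value of 42950328335 on the ST4000DX000 looks alarming, but it may simply be three 16-bit counters packed into one 48-bit number (the other drive prints the same attribute as `0 0 0`). Here is a minimal Python sketch under that assumption; `split_raw48` is a hypothetical helper for illustration, not part of smartctl.

```python
from typing import Tuple


def split_raw48(raw: int) -> Tuple[int, int, int]:
    """Split a 48-bit SMART raw value into three 16-bit words (high, mid, low).

    Assumption: the drive packs three counters into the raw field.
    """
    if not 0 <= raw < (1 << 48):
        raise ValueError("raw value does not fit in 48 bits")
    return (raw >> 32) & 0xFFFF, (raw >> 16) & 0xFFFF, raw & 0xFFFF


# Command_Timeout raw value reported for the ST4000DX000 (s/n Z1Z0084Z)
print(split_raw48(42950328335))  # -> (10, 10, 15)
```

If the assumption holds, the drive has logged a handful of timeouts rather than billions, which would fit the command timeouts in the log above.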

When I rebooted the server, the affected drives were no longer visible to it, so I connected them to the new controller (which had a couple of spare ports). The controller detected the drives, and I got this:

Code:

pool: tank
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://illumos.org/msg/ZFS-8000-9P
  scan: resilvered 12.9M in 0h0m with 0 errors on Mon Feb 20 19:55:24 2017
config:
	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	25
		gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	23
		gptid/c50380dc-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c5720042-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/64867e5c-9c51-11e6-8898-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c64f25ac-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/0264cad6-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7ec466a4-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7fba0667-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/d29093db-f391-11e6-8dca-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/8150e980-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/822ae3a5-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
errors: No known data errors

The drives with the checksum errors are the affected drives.
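To keep an eye on which devices carry checksum errors between tests, the status output can be filtered with a small script. This is only a sketch: `devices_with_cksum_errors` is a hypothetical helper, it assumes the column layout shown above, and it would miss abbreviated counts such as `1.2K`.

```python
import re

# One device row of `zpool status`: name, state, READ, WRITE, CKSUM.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)


def devices_with_cksum_errors(status_text):
    """Return {device name: CKSUM count} for rows with a non-zero CKSUM column."""
    result = {}
    for line in status_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(5)) > 0:
            result[m.group(1)] = int(m.group(5))
    return result


# Rows copied from the status output above (whitespace normalised).
sample = """\
  tank                                          ONLINE  0  0  0
    raidz2-0                                    ONLINE  0  0  0
      gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4  ONLINE  0  0  25
      gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4  ONLINE  0  0  23
      gptid/c50380dc-14e3-11e5-930a-00012e2ccfc4  ONLINE  0  0  0
"""

for name, count in devices_with_cksum_errors(sample).items():
    print(name, count)
```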

I read the linked article and I decided to clear the errors, but after a few minutes the number of CKSUM errors went up again:

Code:

	gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	20
	gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	20

After a while, as expected, ZFS decided to detach the drives.

Note how the number is the same for both drives: in every subsequent test, the number of checksum errors *always* increased at the same rate and at the same time on both drives.

Then I ran memtest on the server for 4 days, and it found no errors.

Today I connected all the drives to a new PC, also powering the affected drives from the new PSU, and I still get the same results, although the number of checksum errors seemed to increase more slowly.

Whenever I see the checksum errors, I also see errors like these in the logs:

Code:

zfsd: CaseFile::Serialize: Unable to open /etc/zfs/cases/pool_14071799796770780259_vdev_15017151504146195539.case.

and

Code:

daemon.log:Mar  3 11:10:47 PigZilla zfsd: ZFS: Notify  class=ereport.fs.zfs.checksum ena=2331296708105873409 parent_guid=2206385843928437561 parent_type=raidz pool=tank pool_context=0 pool_failmode=continue pool_guid=14071799796770780259 subsystem=ZFS timestamp=1488532247 type=ereport.fs.zfs.checksum vdev_guid=15017151504146195539 vdev_path=/dev/gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4 vdev_type=disk zio_err=0 zio_objset=193 zio_offset=1892544516096 zio_size=4096
daemon.log:Mar  3 11:10:47 PigZilla zfsd: ZFS: Notify  class=ereport.fs.zfs.checksum ena=2331296708105873409 parent_guid=2206385843928437561 parent_type=raidz pool=tank pool_context=0 pool_failmode=continue pool_guid=14071799796770780259 subsystem=ZFS timestamp=1488532247 type=ereport.fs.zfs.checksum vdev_guid=8531010438066722682 vdev_path=/dev/gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4 vdev_type=disk zio_err=0 zio_objset=193 zio_offset=1892544516096 zio_size=4096

These last errors always come in pairs.
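The pairing can be checked mechanically by grouping the zfsd checksum reports on `ena`, `zio_offset` and `zio_size`. In the two lines above, all three fields are identical and only the vdev differs. A rough Python sketch follows; `group_checksum_events` is a hypothetical helper, and the sample lines are trimmed copies of the log entries above.

```python
import re
from collections import defaultdict

FIELD = re.compile(r"(\w+)=(\S+)")  # key=value pairs in a zfsd Notify line


def group_checksum_events(log_lines):
    """Map (ena, zio_offset, zio_size) to the sorted vdev paths that reported it."""
    groups = defaultdict(set)
    for line in log_lines:
        if "class=ereport.fs.zfs.checksum" not in line:
            continue
        fields = dict(FIELD.findall(line))
        key = (fields.get("ena"), fields.get("zio_offset"), fields.get("zio_size"))
        groups[key].add(fields.get("vdev_path"))
    return {key: sorted(paths) for key, paths in groups.items()}


# Trimmed copies of the two daemon.log lines (most fields omitted).
lines = [
    "Mar  3 11:10:47 PigZilla zfsd: ZFS: Notify  class=ereport.fs.zfs.checksum "
    "ena=2331296708105873409 "
    "vdev_path=/dev/gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4 "
    "zio_offset=1892544516096 zio_size=4096",
    "Mar  3 11:10:47 PigZilla zfsd: ZFS: Notify  class=ereport.fs.zfs.checksum "
    "ena=2331296708105873409 "
    "vdev_path=/dev/gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4 "
    "zio_offset=1892544516096 zio_size=4096",
]

for key, paths in group_checksum_events(lines).items():
    print(key, "->", len(paths), "vdev(s)")
```

Two vdevs reporting under one key would be consistent with both reports coming from a single logical read, but that is my interpretation of the fields and not a diagnosis.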

I think I've ruled out:
- CPU / motherboard / memory
- controller (I’ve connected the affected drives to both the MB and the controller ports)
- SATA cables
- PSU and power cables

So my question is: are both drives failing at the same time, am I another example of why you should use ECC RAM, or am I missing something else? I’d be ok if it turns out the pool is corrupted, I have backups of what matters, but I’d like to know how to proceed.

Thanks a lot!
Marco

S.M.A.R.T. output for the affected drives:

Code:

[root@PigZilla] /var/log# smartctl -a /dev/da6
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda XT
Device Model:	 ST4000DX000-1C5160
Serial Number:	Z1Z0084Z
LU WWN Device Id: 5 000c50 035a0bd05
Firmware Version: CC42
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Mar  3 12:20:18 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.

Total time to complete Offline 
data collection:		 (  584) seconds.

Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 516) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:		   (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   117   094   006	Pre-fail  Always	   -	   165030344
  3 Spin_Up_Time			0x0003   089   089   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   139
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   084   060   030	Pre-fail  Always	   -	   284547522
  9 Power_On_Hours		  0x0032   081   081   000	Old_age   Always	   -	   16959
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   137
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   087   087   000	Old_age   Always	   -	   13
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   42950328335
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   069   006   045	Old_age   Always   In_the_past 31 (13 48 31 27 0)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   96
193 Load_Cycle_Count		0x0032   097   097   000	Old_age   Always	   -	   7230
194 Temperature_Celsius	 0x0022   031   094   000	Old_age   Always	   -	   31 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   16836 (126 169 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   496698347
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   3420019828

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 16936		 -
# 2  Short offline	   Completed without error	   00%	 16925		 -
# 3  Short offline	   Completed without error	   00%	 16901		 -
# 4  Short offline	   Completed without error	   00%	 16877		 -
# 5  Short offline	   Completed without error	   00%	 16853		 -
# 6  Short offline	   Completed without error	   00%	 16829		 -
# 7  Short offline	   Completed without error	   00%	 16805		 -
# 8  Short offline	   Completed without error	   00%	 16781		 -
# 9  Extended offline	Completed without error	   00%	 16768		 -
#10  Short offline	   Completed without error	   00%	 16757		 -
#11  Short offline	   Completed without error	   00%	 16733		 -
#12  Short offline	   Completed without error	   00%	 16709		 -
#13  Short offline	   Completed without error	   00%	 16685		 -
#14  Short offline	   Completed without error	   00%	 16661		 -
#15  Short offline	   Completed without error	   00%	 16637		 -
#16  Short offline	   Completed without error	   00%	 16613		 -
#17  Extended offline	Completed without error	   00%	 16600		 -
#18  Short offline	   Completed without error	   00%	 16589		 -
#19  Short offline	   Completed without error	   00%	 16565		 -
#20  Short offline	   Completed without error	   00%	 16541		 -
#21  Short offline	   Completed without error	   00%	 16517		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing

Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@PigZilla] /var/log# smartctl -a /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Desktop HDD.15
Device Model:	 ST4000DM000-1F2168
Serial Number:	W300C91L
LU WWN Device Id: 5 000c50 06977b8b5
Firmware Version: CC52
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Fri Mar  3 12:20:50 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  612) seconds.
Offline data collection
capabilities:			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	 (   1) minutes.
Extended self-test routine
recommended polling time:	 ( 538) minutes.
Conveyance self-test routine
recommended polling time:	 (   2) minutes.
SCT capabilities:		   (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   112   099   006	Pre-fail  Always	   -	   45917704
  3 Spin_Up_Time			0x0003   092   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   73
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   083   060   030	Pre-fail  Always	   -	   202195012
  9 Power_On_Hours		  0x0032   083   083   000	Old_age   Always	   -	   15214
 10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   73
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   096   096   000	Old_age   Always	   -	   4
190 Airflow_Temperature_Cel 0x0022   067   060   045	Old_age   Always	   -	   33 (Min/Max 30/33)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   55
193 Load_Cycle_Count		0x0032   097   097   000	Old_age   Always	   -	   6508
194 Temperature_Celsius	 0x0022   033   040   000	Old_age   Always	   -	   33 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   15058h+44m+26.796s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   31633084667
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   109535584255

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 15191		 -
# 2  Short offline	   Completed without error	   00%	 15180		 -
# 3  Short offline	   Completed without error	   00%	 15156		 -
# 4  Short offline	   Completed without error	   00%	 15132		 -
# 5  Short offline	   Completed without error	   00%	 15109		 -
# 6  Short offline	   Completed without error	   00%	 15085		 -
# 7  Short offline	   Completed without error	   00%	 15061		 -
# 8  Short offline	   Completed without error	   00%	 15037		 -
# 9  Extended offline	Completed without error	   00%	 15024		 -
#10  Short offline	   Completed without error	   00%	 15013		 -
#11  Short offline	   Completed without error	   00%	 14989		 -
#12  Short offline	   Completed without error	   00%	 14965		 -
#13  Short offline	   Completed without error	   00%	 14941		 -
#14  Short offline	   Completed without error	   00%	 14917		 -
#15  Short offline	   Completed without error	   00%	 14893		 -
#16  Short offline	   Completed without error	   00%	 14869		 -
#17  Extended offline	Completed without error	   00%	 14856		 -
#18  Short offline	   Completed without error	   00%	 14845		 -
#19  Short offline	   Completed without error	   00%	 14821		 -
#20  Short offline	   Completed without error	   00%	 14797		 -
#21  Short offline	   Completed without error	   00%	 14773		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.

If Selective self-test is pending on power-up, resume after 0 minute delay.

joeschmuck · Mar 3, 2017

I see nothing wrong with your hard drives, SMART looks good for them.
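For anyone who wants to repeat that check on their own output, here is a rough Python sketch that flags non-zero raw values for a few attributes. The watch list (IDs 5, 197, 198 and 199) is an assumption about which attributes are worth watching, not an official rule, and `nonzero_watched` is a hypothetical helper.

```python
import re

# One row of the smartctl attribute table; captures ID, name and the
# leading integer of RAW_VALUE.
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+\d+\s+\d+\s+\d+"
    r"\s+\S+\s+\S+\s+\S+\s+(\d+)"
)

# Assumption: reallocated, pending, offline-uncorrectable and CRC counts.
WATCHED = {5, 197, 198, 199}


def nonzero_watched(smart_text):
    """Return {attribute name: raw value} for watched attributes with raw != 0."""
    hits = {}
    for line in smart_text.splitlines():
        m = ROW.match(line)
        if m and int(m.group(1)) in WATCHED and int(m.group(3)) != 0:
            hits[m.group(2)] = int(m.group(3))
    return hits


# Rows copied from the ST4000DX000 output above (whitespace normalised).
sample = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036  Pre-fail  Always   -  0
197 Current_Pending_Sector  0x0012   100   100   000  Old_age   Always   -  0
198 Offline_Uncorrectable   0x0010   100   100   000  Old_age   Offline  -  0
199 UDMA_CRC_Error_Count    0x003e   200   200   000  Old_age   Always   -  0
"""

print(nonzero_watched(sample))  # -> {}
```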

Did you run a scrub since the first failure? If you haven't run one since the last problem, you should. Once the scrub is done, ensure it completed without any errors, or maybe there are some corrupt files that need to be manually deleted/restored.

You could have a bad controller but since you moved it to a new computer, did you also move the controller with it?

EDIT: Is FreeNAS running on a bare metal machine or VM?

mmolinari · Mar 4, 2017

Hi,
thank you for the reply.

joeschmuck said:
Did you run a scrub since the first failure? If you haven't run one since the last problem, you should. Once the scrub is done, ensure it completed without any errors, or maybe there are some corrupt files that need to be manually deleted/restored.

I hadn't run a scrub, so I ran one now, and here is the result:

Code:

 pool: tank
 state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 4.80M in 20h6m with 0 errors on Sun Mar  5 04:20:24 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											DEGRADED	 0	 0	 0
	  raidz2-0									  DEGRADED	 0	 0	 0
		gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4  DEGRADED	 0	 0   391  too many errors
		gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4  DEGRADED	 0	 0   389  too many errors
		gptid/c50380dc-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c5720042-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/64867e5c-9c51-11e6-8898-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c64f25ac-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/0264cad6-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7ec466a4-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7fba0667-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/d29093db-f391-11e6-8dca-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/8150e980-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/822ae3a5-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0

errors: No known data errors

I then cleared the errors, and FreeNAS has now started its scheduled monthly scrub. Should I let it run?

joeschmuck said:
You could have a bad controller but since you moved it to a new computer, did you also move the controller with it?

When the first failure happened, the affected drives were connected to the motherboard of the old server, which I'm not using anymore.

Now that I've connected everything to the new PC, 4 drives are connected to the motherboard, and 8 drives are connected to the 9207-8I, including the affected drives.

So yes I moved the controller, but the drives were not using it when the first failure happened.

Since you mention a bad controller: is it possible that the old server's motherboard is failing and caused the first failure, which corrupted data on those drives, and that ZFS has now repaired everything? In other words, I believe I was misinterpreting the checksum errors that kept popping up. I thought they were new, but maybe they were old errors that ZFS was finding as it used the drives?

joeschmuck said:
Is FreeNAS running on a bare metal machine or VM?

Bare metal.

Thanks a lot!
Marco

joeschmuck · Mar 5, 2017

What are the results of the second scrub?

This could have been caused by many things, and hardware failure is one of them. I would only be speculating about what caused it.

If the scrub reported more errors then I'd suggest you run another SMART Long Test on both drives and report them. Have you looked at the Hard Drive Troubleshooting Guide (link in my signature)?

mmolinari · Mar 6, 2017

joeschmuck said:
What are the results of the second scrub?

Here are the results:

Code:

 pool: tank
 state: ONLINE

status: One or more devices has experienced an unrecoverable error.  An
	attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
	using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://illumos.org/msg/ZFS-8000-9P
  scan: scrub repaired 32K in 20h6m with 0 errors on Mon Mar  6 04:23:31 2017
config:

	NAME											STATE	 READ WRITE CKSUM
	tank											ONLINE	   0	 0	 0
	  raidz2-0									  ONLINE	   0	 0	 0
		gptid/c4205d4d-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c4a2f333-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c50380dc-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c5720042-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 1
		gptid/64867e5c-9c51-11e6-8898-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/c64f25ac-14e3-11e5-930a-00012e2ccfc4  ONLINE	   0	 0	 0
	  raidz2-1									  ONLINE	   0	 0	 0
		gptid/0264cad6-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7ec466a4-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/7fba0667-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/d29093db-f391-11e6-8dca-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/8150e980-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0
		gptid/822ae3a5-d0fd-11e6-a095-00012e2ccfc4  ONLINE	   0	 0	 0

errors: No known data errors

So the scrub found another checksum error, but it's on a different disk. While I've obviously replaced disks over 6 years, I've run scrubs once a month and they were always clean. The disks from the first event (c4205d4d and c4a2f333) now show no checksum errors. The disk with the newly reported checksum error (c5720042) is connected to the 9207-8I, like c4205d4d, while c4a2f333 is now connected to the new motherboard. All three disks were connected to the old motherboard when the first event happened.

joeschmuck said:
If the scrub reported more errors then I'd suggest you run another SMART Long Test on both drives and report them.

I've run Long Tests on all drives, and they were clean, including the drive with the new checksum error. I've attached the SMART results below.

Next I'll use your Hard Drive Troubleshooting Guide, and follow up.

Thanks!
Marco

SMART results:

Code:

[root@PigZilla] ~# smartctl -a /dev/ada1 
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda XT
Device Model:	 ST4000DX000-1C5160
Serial Number:	Z1Z0084Z
LU WWN Device Id: 5 000c50 035a0bd05
Firmware Version: CC42
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Mar  6 15:28:52 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  584) seconds.
Offline data collection
capabilities:			(0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	(   1) minutes.
Extended self-test routine
recommended polling time:	( 516) minutes.
Conveyance self-test routine
recommended polling time:	(   2) minutes.
SCT capabilities:		  (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   109   094   006	Pre-fail  Always	   -	   24514629
  3 Spin_Up_Time			0x0003   089   089   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   141
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   084   060   030	Pre-fail  Always	   -	   291598868
  9 Power_On_Hours		  0x0032   081   081   000	Old_age   Always	   -	   17018
10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   139
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   087   087   000	Old_age   Always	   -	   13
188 Command_Timeout		 0x0032   100   099   000	Old_age   Always	   -	   42950328335
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   071   006   045	Old_age   Always   In_the_past 29 (13 48 36 17 0)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   97
193 Load_Cycle_Count		0x0032   097   097   000	Old_age   Always	   -	   7238
194 Temperature_Celsius	 0x0022   029   094   000	Old_age   Always	   -	   29 (0 12 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   16896 (112 174 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   1117758192
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   3262294286

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 17017		 -
# 2  Short offline	   Completed without error	   00%	 17007		 -
# 3  Extended offline	Completed without error	   00%	 16936		 -
# 4  Short offline	   Completed without error	   00%	 16925		 -
# 5  Short offline	   Completed without error	   00%	 16901		 -
# 6  Short offline	   Completed without error	   00%	 16877		 -
# 7  Short offline	   Completed without error	   00%	 16853		 -
# 8  Short offline	   Completed without error	   00%	 16829		 -
# 9  Short offline	   Completed without error	   00%	 16805		 -
#10  Short offline	   Completed without error	   00%	 16781		 -
#11  Extended offline	Completed without error	   00%	 16768		 -
#12  Short offline	   Completed without error	   00%	 16757		 -
#13  Short offline	   Completed without error	   00%	 16733		 -
#14  Short offline	   Completed without error	   00%	 16709		 -
#15  Short offline	   Completed without error	   00%	 16685		 -
#16  Short offline	   Completed without error	   00%	 16661		 -
#17  Short offline	   Completed without error	   00%	 16637		 -
#18  Short offline	   Completed without error	   00%	 16613		 -
#19  Extended offline	Completed without error	   00%	 16600		 -
#20  Short offline	   Completed without error	   00%	 16589		 -
#21  Short offline	   Completed without error	   00%	 16565		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@PigZilla] ~# smartctl -a /dev/da3
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Desktop HDD.15
Device Model:	 ST4000DM000-1F2168
Serial Number:	W300C91L
LU WWN Device Id: 5 000c50 06977b8b5
Firmware Version: CC52
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:	 512 bytes logical, 4096 bytes physical
Rotation Rate:	5900 rpm
Form Factor:	  3.5 inches
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Mar  6 15:29:03 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  612) seconds.
Offline data collection
capabilities:			(0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	(   1) minutes.
Extended self-test routine
recommended polling time:	( 538) minutes.
Conveyance self-test routine
recommended polling time:	(   2) minutes.
SCT capabilities:		  (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   116   099   006	Pre-fail  Always	   -	   114027928
  3 Spin_Up_Time			0x0003   091   091   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   75
  5 Reallocated_Sector_Ct   0x0033   100   100   010	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   083   060   030	Pre-fail  Always	   -	   209423797
  9 Power_On_Hours		  0x0032   083   083   000	Old_age   Always	   -	   15274
10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   75
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   100   100   000	Old_age   Always	   -	   0
188 Command_Timeout		 0x0032   100   100   000	Old_age   Always	   -	   0 0 0
189 High_Fly_Writes		 0x003a   096   096   000	Old_age   Always	   -	   4
190 Airflow_Temperature_Cel 0x0022   068   060   045	Old_age   Always	   -	   32 (Min/Max 21/36)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   56
193 Load_Cycle_Count		0x0032   097   097   000	Old_age   Always	   -	   6511
194 Temperature_Celsius	 0x0022   032   040   000	Old_age   Always	   -	   32 (0 15 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   15118h+37m+29.016s
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   31673525307
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   122554540929

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 15273		 -
# 2  Short offline	   Completed without error	   00%	 15263		 -
# 3  Extended offline	Completed without error	   00%	 15191		 -
# 4  Short offline	   Completed without error	   00%	 15180		 -
# 5  Short offline	   Completed without error	   00%	 15156		 -
# 6  Short offline	   Completed without error	   00%	 15132		 -
# 7  Short offline	   Completed without error	   00%	 15109		 -
# 8  Short offline	   Completed without error	   00%	 15085		 -
# 9  Short offline	   Completed without error	   00%	 15061		 -
#10  Short offline	   Completed without error	   00%	 15037		 -
#11  Extended offline	Completed without error	   00%	 15024		 -
#12  Short offline	   Completed without error	   00%	 15013		 -
#13  Short offline	   Completed without error	   00%	 14989		 -
#14  Short offline	   Completed without error	   00%	 14965		 -
#15  Short offline	   Completed without error	   00%	 14941		 -
#16  Short offline	   Completed without error	   00%	 14917		 -
#17  Short offline	   Completed without error	   00%	 14893		 -
#18  Short offline	   Completed without error	   00%	 14869		 -
#19  Extended offline	Completed without error	   00%	 14856		 -
#20  Short offline	   Completed without error	   00%	 14845		 -
#21  Short offline	   Completed without error	   00%	 14821		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

[root@PigZilla] ~# smartctl -a /dev/da7
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Seagate Barracuda XT
Device Model:	 ST4000DX000-1C5160
Serial Number:	Z1Z01AG8
LU WWN Device Id: 5 000c50 035fc669b
Firmware Version: CC42
User Capacity:	4,000,787,030,016 bytes [4.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:	Mon Mar  6 15:29:05 2017 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x82)	Offline data collection activity
					was completed without error.
					Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0)	The previous self-test routine completed
					without error or no self-test has ever 
					been run.
Total time to complete Offline 
data collection:		 (  592) seconds.
Offline data collection
capabilities:			(0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:			(0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:		(0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time:	(   1) minutes.
Extended self-test routine
recommended polling time:	( 517) minutes.
Conveyance self-test routine
recommended polling time:	(   2) minutes.
SCT capabilities:		  (0x1085)	SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x000f   113   099   006	Pre-fail  Always	   -	   51730197
  3 Spin_Up_Time			0x0003   089   089   000	Pre-fail  Always	   -	   0
  4 Start_Stop_Count		0x0032   100   100   020	Old_age   Always	   -	   153
  5 Reallocated_Sector_Ct   0x0033   100   100   036	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x000f   084   060   030	Pre-fail  Always	   -	   296197010
  9 Power_On_Hours		  0x0032   081   081   000	Old_age   Always	   -	   17093
10 Spin_Retry_Count		0x0013   100   100   097	Pre-fail  Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   020	Old_age   Always	   -	   142
183 Runtime_Bad_Block	   0x0032   100   100   000	Old_age   Always	   -	   0
184 End-to-End_Error		0x0032   100   100   099	Old_age   Always	   -	   0
187 Reported_Uncorrect	  0x0032   098   098   000	Old_age   Always	   -	   2
188 Command_Timeout		 0x0032   100   091   000	Old_age   Always	   -	   317832429644
189 High_Fly_Writes		 0x003a   100   100   000	Old_age   Always	   -	   0
190 Airflow_Temperature_Cel 0x0022   071   014   045	Old_age   Always   In_the_past 29 (1 240 36 18 0)
191 G-Sense_Error_Rate	  0x0032   100   100   000	Old_age   Always	   -	   0
192 Power-Off_Retract_Count 0x0032   100   100   000	Old_age   Always	   -	   96
193 Load_Cycle_Count		0x0032   097   097   000	Old_age   Always	   -	   7899
194 Temperature_Celsius	 0x0022   029   086   000	Old_age   Always	   -	   29 (0 13 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0010   100   100   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x003e   200   200   000	Old_age   Always	   -	   0
240 Head_Flying_Hours	   0x0000   100   253   000	Old_age   Offline	  -	   16928 (10 140 0)
241 Total_LBAs_Written	  0x0000   100   253   000	Old_age   Offline	  -	   1644627993
242 Total_LBAs_Read		 0x0000   100   253   000	Old_age   Offline	  -	   2435339339

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 17085		 -
# 2  Short offline	   Completed without error	   00%	 17034		 -
# 3  Extended offline	Completed without error	   00%	 17010		 -
# 4  Short offline	   Completed without error	   00%	 16999		 -
# 5  Short offline	   Completed without error	   00%	 16975		 -
# 6  Short offline	   Completed without error	   00%	 16951		 -
# 7  Short offline	   Completed without error	   00%	 16927		 -
# 8  Short offline	   Completed without error	   00%	 16903		 -
# 9  Short offline	   Completed without error	   00%	 16879		 -
#10  Short offline	   Completed without error	   00%	 16855		 -
#11  Extended offline	Completed without error	   00%	 16842		 -
#12  Short offline	   Completed without error	   00%	 16831		 -
#13  Short offline	   Completed without error	   00%	 16807		 -
#14  Short offline	   Completed without error	   00%	 16783		 -
#15  Short offline	   Completed without error	   00%	 16759		 -
#16  Short offline	   Completed without error	   00%	 16735		 -
#17  Short offline	   Completed without error	   00%	 16711		 -
#18  Short offline	   Completed without error	   00%	 16687		 -
#19  Extended offline	Completed without error	   00%	 16674		 -
#20  Short offline	   Completed without error	   00%	 16663		 -
#21  Short offline	   Completed without error	   00%	 16639		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Robert Trevellyan · Mar 6, 2017

ada1 and da7 have both overheated in the past, so their reliability may be compromised.

joeschmuck · Mar 8, 2017

Are you still running non-ECC RAM? And is it the same RAM?

I'd go ahead and clear the scrub errors again and then watch it. I don't believe this to be a hard drive issue. This could be a power supply or RAM issue. It's a good thing that you have a resilient pool.

mmolinari · Mar 13, 2017

Robert Trevellyan said:
ada1 and da7 have both overheated in the past, so their reliability may be compromised.

Thanks for noticing this. I'm aware some of these drives were a little mistreated by the previous owner, but I got them for free and can't really complain. After reading your message I researched how to interpret this:

Code:

ID# ATTRIBUTE_NAME		 FLAG	 VALUE WORST THRESH TYPE	 UPDATED WHEN_FAILED RAW_VALUE
190 Airflow_Temperature_Cel 0x0022 071 006 045	Old_age Always In_the_past 29 (13 48 36 17 0)

and it turns out the 006 in the WORST column encodes the maximum temperature the drive ever saw. It's read as (100 - value) °C, which means the drive recorded a temperature of 94 °C (201 °F)!
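
As a quick sanity check of that arithmetic, here is a minimal Python sketch. It assumes the (100 - value) convention described above applies to attribute 190 on these Seagate drives; the function names are made up for illustration and are not part of smartctl or any library.

```python
# Hypothetical helpers: decode the normalized VALUE/WORST columns of
# SMART attribute 190 (Airflow_Temperature_Cel), assuming the drive
# reports them as 100 minus the temperature in degrees Celsius.

def normalized_to_celsius(normalized: int) -> int:
    """Convert a normalized attribute-190 value to degrees Celsius."""
    return 100 - normalized


def celsius_to_fahrenheit(celsius: float) -> float:
    """Standard Celsius to Fahrenheit conversion."""
    return celsius * 9 / 5 + 32


# Numbers taken from the attribute line quoted above.
current_value = 71   # VALUE column
worst_value = 6      # WORST column

current_c = normalized_to_celsius(current_value)
worst_c = normalized_to_celsius(worst_value)

print(f"current: {current_c} C")
print(f"hottest recorded: {worst_c} C / {celsius_to_fahrenheit(worst_c):.0f} F")
```

The current VALUE of 071 decodes to 29 C, which matches the 29 at the start of the RAW_VALUE column, so the convention at least looks consistent for this drive.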

The other drive which failed at the same time is from another batch though, and should be fine, so I believe this problem in particular was caused by something different.

joeschmuck said:
Are you still running non-ECC RAM? And is it the same RAM?

I'd go ahead and clear the scrub errors again and then watch it. I don't believe this to be a hard drive issue. This could be a power supply or RAM issue. It's a good thing that you have a resilient pool.

The scrubs were run on a new PC, so it's not the same RAM (though it's still non-ECC). Besides the drives, I only reused the 9207-8I and the SATA cables, but I mixed the cables up. So far the pool has been running fine on the new PC, although I have only run it for a few days.

I ran a few more tests on the old server, with a spare hard drive. So far I've run:
- 4 days of Memtest86+
- 1 day of CPU Burn-In
- 3 runs of read/write badblocks, with a 300GB drive, using the SATA ports the drives were connected to when the first event happened

No errors were reported. I think I’ll reconnect the drives to the old server, and see what happens. I’m also looking for a new power supply.

Thanks!
Marco

Robert Trevellyan · Mar 13, 2017

mmolinari said:
4 days of Memtest86+

It doesn't matter how many days of memtest you run without error. The issue is that a random bit flip can occur at any time, and ECC RAM will either correct it or halt the system. With non-ECC RAM, the error simply goes undetected. Combine that with the fact that data can live for an indefinite period in ARC, and you're asking for trouble.
