Need help with pool errors

Ryan Howard

Cadet
Joined
Dec 22, 2015
Messages
6
I've been having a weird problem with my pool becoming degraded, and I'm trying to get to the bottom of it. It was running fine for many months. Luckily I didn't have anything important on the disks. At night, I always get some errors emailed to me. The pool first goes offline, then becomes degraded.

My FreeNAS server consists of a Supermicro X8DTN+ with 64 GB of ECC RAM, twelve 2 TB WD Black drives, an M1015 that handles 8 of the drives, and the remaining 4 drives plugged into the mobo SATA ports. The pool has 2 vdevs with 6 drives each, in RAIDZ2.

The disks that show up as removed are always the ones plugged into the mobo ports.
Here is a screenshot of the output of zpool status -v, and glabel status. http://i.imgur.com/bTWOWgo.png
The SMART output of the offending disks looks fine; they pass all tests. I need to get this back up and running. Right now I'm thinking about buying a new backplane with a SAS expander so I can plug all the drives in the M1015. Any thoughts?
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Right now I'm thinking about buying a new backplane with a SAS expander so I can plug all the drives in the M1015. Any thoughts?
If you swap drives between the M1015 and the mobo ports, do the errors move with the drives or stay with the mobo ports? This will tell you if moving all the drives to the M1015 is likely to be beneficial.
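The logic of the swap test can be written down mechanically. Here is a minimal sketch in Python, assuming you record which drive serial and which port showed errors before the swap, and which (serial, port) pairs show errors afterwards. The function name, serials and port labels are made up for illustration.

```python
def classify_swap_test(bad_serial, bad_port, errors_after):
    """Interpret a drive-swap test.

    bad_serial / bad_port: the drive and the port that showed errors before the swap.
    errors_after: set of (serial, port) pairs that show errors after the swap.
    """
    follows_drive = any(serial == bad_serial for serial, _ in errors_after)
    stays_on_port = any(port == bad_port for _, port in errors_after)
    if follows_drive and not stays_on_port:
        return "drive"
    if stays_on_port and not follows_drive:
        return "port"
    if follows_drive and stays_on_port:
        return "inconclusive"
    return "no errors reproduced"


# Hypothetical example: drive "SER-A" had errors on mobo port "sata2", then was
# moved to the M1015 while drive "SER-B" took its place on "sata2".
print(classify_swap_test("SER-A", "sata2", {("SER-B", "sata2")}))  # prints: port
```

A result of "port" would suggest that moving everything to the M1015 is likely to help; "drive" would point at the disk itself.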
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Also, have you eliminated any possibility of a power problem?

Best thing would be to post a lot more information to give people a better chance of helping.
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Are you doing something funny with partitioning your drives? Why does it seem like you are using the same disk twice, ada1p1 and ada1p2?

Verify that your disks are connected properly, and provide hardware information and your FreeNAS version.
 

Ryan Howard

Cadet
Joined
Dec 22, 2015
Messages
6
Also, have you eliminated any possibility of a power problem?

Best thing would be to post a lot more information to give people a better chance of helping.
The server is running two 800 W PSUs. All the Molex power connections on the backplane are populated. The server is plugged into an APC Smart-UPS 2200. The server did not experience a sudden loss of power, and was always shut down correctly.
I do not believe it is a power problem.

Are you doing something funny with partitioning your drives? Why does it seem like you are using the same disk twice, ada1p1 and ada1p2?
I wasn't doing any partitioning. I do see what you're talking about though. I just set up a volume in the GUI with 2 RAIDZ2 vdevs containing 6 drives each, and that's what I got. How can I fix it?

I apologize for not listing enough hardware info. I'm new and thought I had provided enough. I'll try to fill in the gaps.

I suspected there was something wrong with the mobo SATA ports, so I added another HBA to the system today. It's an IBM N2115 that I flashed to an LSI SAS 9207 with P20 IT firmware per this suggestion - IBM N2115 HBA. I'm also running an M1015 flashed to IT mode, also on P20 firmware. My backplane (Supermicro BPN-SAS-826A) does not have a SAS expander, so two SFF-8087 cables go to the M1015, and one goes to the N2115. All the drives are seated properly in the backplane.
I'm running FreeNAS-9.10.2-U1 (86c7ef5)
Supermicro X8DTN+ motherboard
64 GB PC3-10600R ECC RAM
2x Intel Xeon L5630 CPUs
12x WD2003FYYS HDDs
Supermicro BPN-SAS-826A backplane

Please let me know if I need to provide any more info. Keep in mind, this server was running fine for 6 months.

After adding the new HBA and testing with some data, I'm getting new errors. The disks seem to be staying online, but I'm getting new read/write errors. Here are the errors from the email I got.

Code:
Device: /dev/da11 [SAT], Read SMART Self-Test Log Failed
The volume Storage (ZFS) state is DEGRADED: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
Device: /dev/da11 [SAT], not capable of SMART self-check
Device: /dev/da11 [SAT], failed to read SMART Attribute Data
Device: /dev/da8 [SAT], not capable of SMART self-check
Device: /dev/da8 [SAT], Read SMART Error Log Failed
Device: /dev/da8 [SAT], Read SMART Self-Test Log Failed
Device: /dev/da11 [SAT], Read SMART Error Log Failed
Device: /dev/da8 [SAT], failed to read SMART Attribute Data


Here's the new zpool status -v and glabel status output: 9NstW5I.png


da0-7 are attached to the M1015.
da8-11 are attached to the N2115.

Is this a SMART passthrough problem with the new HBA?
Also not sure why it seems like the disks have partitions.
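To see whether the alerts cluster on one controller, the email lines can be grouped by HBA. This is a rough sketch in Python, assuming the da0-7 / da8-11 split described above and the alert line format shown in the email. The helper names are made up for illustration.

```python
import re
from collections import defaultdict


def controller_for(device):
    """Map a device name like 'da8' to its HBA, using the split from this post."""
    n = int(device[2:])
    if 0 <= n <= 7:
        return "M1015"
    if 8 <= n <= 11:
        return "N2115"
    return "unknown"


# Matches alert lines such as: Device: /dev/da8 [SAT], Read SMART Error Log Failed
ALERT_RE = re.compile(r"Device: /dev/(da\d+) \[SAT\], (.+)")


def group_alerts(text):
    """Return {controller: sorted list of devices} seen in SMART alert lines."""
    grouped = defaultdict(set)
    for line in text.splitlines():
        m = ALERT_RE.match(line.strip())
        if m:
            grouped[controller_for(m.group(1))].add(m.group(1))
    return {ctrl: sorted(devs) for ctrl, devs in grouped.items()}


alerts = """\
Device: /dev/da11 [SAT], Read SMART Self-Test Log Failed
Device: /dev/da8 [SAT], not capable of SMART self-check
Device: /dev/da8 [SAT], Read SMART Error Log Failed
Device: /dev/da11 [SAT], failed to read SMART Attribute Data
"""
print(group_alerts(alerts))
```

For the email above, both affected devices (da8 and da11) land under the N2115 and none under the M1015.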
 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
The SMART output of the offending disks looks fine
Can we see that?
Also not sure why it seems like the disks have partitions.
That's just the FreeNAS way. By default it makes a swap partition on each disk, and the main ZFS data partition ... and, uh, another partition, I guess.
Device: /dev/da11 [SAT], not capable of SMART self-check
Device: /dev/da8 [SAT], not capable of SMART self-check
This is undesirable.

My gut feeling is that da11 is failing, but there may be other issues too. Maybe a backplane or a cable problem?
 

Ryan Howard

Cadet
Joined
Dec 22, 2015
Messages
6
Here's some SMART data for the disks with errors. I will run an extended test tonight.

Code:
smartctl -a /dev/da8
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01346454
LU WWN Device Id: 5 0014ee 0ad4a3d4e
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Tue Jan 24 17:13:47 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(30180) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 307) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   21
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   6816
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   231
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   062   062   000	Old_age   Always	   -	   28057
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   230
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   201
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   123   110   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.




Code:
smartctl -a /dev/da9
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4-GP
Device Model:	 WDC WD2003FYPS-27Y2B0
Serial Number:	WD-WCAVY5793783
LU WWN Device Id: 5 0014ee 2afb9ac4f
Firmware Version: 04.05G11
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Tue Jan 24 17:09:23 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
										was completed without error.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(42180) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 480) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   3
  3 Spin_Up_Time			0x0027   253   237   021	Pre-fail  Always	   -	   7033
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   302
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   072   072   000	Old_age   Always	   -	   20925
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   300
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   265
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   36
194 Temperature_Celsius	 0x0022   125   105   000	Old_age   Always	   -	   27
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Aborted by host			   90%	 19663		 -
# 2  Extended offline	Completed without error	   00%	 19641		 -
# 3  Short offline	   Completed without error	   00%	 13444		 -
# 4  Short offline	   Completed without error	   00%	  3066		 -
# 5  Short offline	   Completed without error	   00%	  3066		 -
# 6  Short offline	   Completed without error	   00%	  3065		 -
# 7  Short offline	   Completed without error	   00%	  3051		 -
# 8  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.


Code:
smartctl -a /dev/da10
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01433291
LU WWN Device Id: 5 0014ee 0029fded0
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Tue Jan 24 17:10:08 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(29100) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 296) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   5
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   7500
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   191
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   059   059   000	Old_age   Always	   -	   30091
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   190
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   161
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   121   106   000	Old_age   Always	   -	   31
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Interrupted (host reset)	  90%	 28639		 -
# 2  Extended offline	Completed without error	   00%	 23134		 -
# 3  Short offline	   Completed without error	   00%	 22761		 -
# 4  Short offline	   Completed without error	   00%	 22760		 -
# 5  Short offline	   Completed without error	   00%	 22759		 -
# 6  Short offline	   Completed without error	   00%	 22758		 -
# 7  Short offline	   Completed without error	   00%	 22757		 -
# 8  Short offline	   Completed without error	   00%	 22756		 -
# 9  Short offline	   Completed without error	   00%	 22755		 -
#10  Short offline	   Completed without error	   00%	 22754		 -
#11  Short offline	   Completed without error	   00%	 22753		 -
#12  Short offline	   Completed without error	   00%	 22752		 -
#13  Short offline	   Completed without error	   00%	 22751		 -
#14  Short offline	   Completed without error	   00%	 22750		 -
#15  Short offline	   Completed without error	   00%	 22749		 -
#16  Short offline	   Completed without error	   00%	 22748		 -
#17  Short offline	   Completed without error	   00%	 22747		 -
#18  Short offline	   Completed without error	   00%	 22746		 -
#19  Short offline	   Completed without error	   00%	 22745		 -
#20  Short offline	   Completed without error	   00%	 22744		 -
#21  Short offline	   Completed without error	   00%	 22743		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.




Code:
smartctl -a /dev/da11
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01401004
LU WWN Device Id: 5 0014ee 057f4f3ce
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Tue Jan 24 17:11:10 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(28560) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 291) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   3
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   6133
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   578
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   062   062   000	Old_age   Always	   -	   27942
10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   577
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   548
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   123   108   000	Old_age   Always	   -	   29
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline	   Completed without error	   00%	 26687		 -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
To me, there's nothing obviously wrong with the drives based on the SMART data. You do need to set up a proper SMART testing schedule, but that won't solve your current problems. I guess I'd be looking elsewhere, assuming the current tests come up clear.
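For anyone who wants to repeat that check without reading every table by eye, here is a small Python sketch that scans `smartctl -a` text for non-zero raw values on a few attributes. The choice of attributes (5, 197, 198 for media problems, 199 for link or cable problems) is a common rule of thumb and an assumption here, not an official threshold list.

```python
# Attribute IDs often watched for media or link trouble (rule of thumb only).
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}


def nonzero_watched(smart_text):
    """Return {attribute_name: raw_value} for watched attributes with raw value > 0."""
    flagged = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have 10 whitespace-separated fields.
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in WATCHED:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                flagged[fields[1]] = int(raw)
    return flagged


# Rows copied from the da11 output above (whitespace normalised).
sample = """\
  5 Reallocated_Sector_Ct   0x0033   200   200   140  Pre-fail  Always   -   0
197 Current_Pending_Sector  0x0032   200   200   000  Old_age   Always   -   0
198 Offline_Uncorrectable   0x0030   200   200   000  Old_age   Offline  -   0
199 UDMA_CRC_Error_Count    0x0032   200   200   000  Old_age   Always   -   0
"""
print(nonzero_watched(sample))  # prints: {}
```

An empty result matches the reading above: none of these counters has moved on da11.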
 

Ryan Howard

Cadet
Joined
Dec 22, 2015
Messages
6
I had a regular schedule, but since I've been having this problem, the box has spent more time powered off than on while I've been diagnosing it.
 

Ryan Howard

Cadet
Joined
Dec 22, 2015
Messages
6
Here are the extended test results. Does anyone see anything wrong?

Code:
smartctl -a /dev/da8
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4-GP
Device Model:	 WDC WD2003FYPS-27Y2B0
Serial Number:	WD-WCAVY5793783
LU WWN Device Id: 5 0014ee 2afb9ac4f
Firmware Version: 04.05G11
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	5400 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Sun Jan 29 17:27:47 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(42180) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 480) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303d) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   3
  3 Spin_Up_Time			0x0027   253   237   021	Pre-fail  Always	   -	   4675
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   309
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   100   253   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   072   072   000	Old_age   Always	   -	   21043
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   307
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   272
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   36
194 Temperature_Celsius	 0x0022   124   105   000	Old_age   Always	   -	   28
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 20932		 -
# 2  Extended offline	Aborted by host			   90%	 19663		 -
# 3  Extended offline	Completed without error	   00%	 19641		 -
# 4  Short offline	   Completed without error	   00%	 13444		 -
# 5  Short offline	   Completed without error	   00%	  3066		 -
# 6  Short offline	   Completed without error	   00%	  3066		 -
# 7  Short offline	   Completed without error	   00%	  3065		 -
# 8  Short offline	   Completed without error	   00%	  3051		 -
# 9  Short offline	   Completed without error	   00%		 0		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
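One thing worth noting when comparing with the earlier output: the serial reported for /dev/da8 here (WD-WCAVY5793783) was reported for /dev/da9 on Jan 24, so the daN names are not a stable way to identify a disk between runs. Here is a minimal Python sketch for tracking drives by serial number instead. The function names are made up for illustration, and the inputs are trimmed to the one field that matters.

```python
import re


def serial_of(smartctl_output):
    """Extract the 'Serial Number:' field from `smartctl -a` text, or None."""
    m = re.search(r"^Serial Number:\s*(\S+)", smartctl_output, re.MULTILINE)
    return m.group(1) if m else None


def renamed_drives(before, after):
    """before/after: {device_name: smartctl_text}.

    Return {serial: (old_device, new_device)} for drives whose device name changed.
    """
    old = {serial_of(text): dev for dev, text in before.items()}
    new = {serial_of(text): dev for dev, text in after.items()}
    return {
        serial: (old[serial], new[serial])
        for serial in old
        if serial is not None and serial in new and old[serial] != new[serial]
    }


# Serial numbers as they appear in the Jan 24 and Jan 29 outputs in this thread.
jan24 = {"da8": "Serial Number:\tWD-WMAY01346454", "da9": "Serial Number:\tWD-WCAVY5793783"}
jan29 = {"da8": "Serial Number:\tWD-WCAVY5793783", "da9": "Serial Number:\tWD-WMAY01346454"}
print(renamed_drives(jan24, jan29))
```

Matching error reports to serial numbers rather than to da8 or da11 avoids blaming the wrong disk after the numbering shifts.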



Code:
smartctl -a /dev/da9
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01346454
LU WWN Device Id: 5 0014ee 0ad4a3d4e
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Sun Jan 29 17:28:31 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(30180) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 307) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   21
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   4033
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   240
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   062   062   000	Old_age   Always	   -	   28176
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   239
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   210
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   122   110   000	Old_age   Always	   -	   30
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 28062		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Code:
smartctl -a /dev/da10
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01433291
LU WWN Device Id: 5 0014ee 0029fded0
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Sun Jan 29 17:29:05 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(29100) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 296) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   5
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   4883
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   200
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   059   059   000	Old_age   Always	   -	   30210
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   199
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   170
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   120   106   000	Old_age   Always	   -	   32
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 30096		 -
# 2  Extended offline	Interrupted (host reset)	  90%	 28639		 -
# 3  Extended offline	Completed without error	   00%	 23134		 -
# 4  Short offline	   Completed without error	   00%	 22761		 -
# 5  Short offline	   Completed without error	   00%	 22760		 -
# 6  Short offline	   Completed without error	   00%	 22759		 -
# 7  Short offline	   Completed without error	   00%	 22758		 -
# 8  Short offline	   Completed without error	   00%	 22757		 -
# 9  Short offline	   Completed without error	   00%	 22756		 -
#10  Short offline	   Completed without error	   00%	 22755		 -
#11  Short offline	   Completed without error	   00%	 22754		 -
#12  Short offline	   Completed without error	   00%	 22753		 -
#13  Short offline	   Completed without error	   00%	 22752		 -
#14  Short offline	   Completed without error	   00%	 22751		 -
#15  Short offline	   Completed without error	   00%	 22750		 -
#16  Short offline	   Completed without error	   00%	 22749		 -
#17  Short offline	   Completed without error	   00%	 22748		 -
#18  Short offline	   Completed without error	   00%	 22747		 -
#19  Short offline	   Completed without error	   00%	 22746		 -
#20  Short offline	   Completed without error	   00%	 22745		 -
#21  Short offline	   Completed without error	   00%	 22744		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.



Code:
smartctl -a /dev/da11
smartctl 6.5 2016-05-07 r4318 [FreeBSD 10.3-STABLE amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:	 Western Digital RE4
Device Model:	 WDC WD2003FYYS-02W0B0
Serial Number:	WD-WMAY01401004
LU WWN Device Id: 5 0014ee 057f4f3ce
Firmware Version: 01.01D01
User Capacity:	2,000,398,934,016 bytes [2.00 TB]
Sector Size:	  512 bytes logical/physical
Rotation Rate:	7200 rpm
Device is:		In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:	Sun Jan 29 17:29:35 2017 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x84) Offline data collection activity
										was suspended by an interrupting command from host.
										Auto Offline Data Collection: Enabled.
Self-test execution status:	  (   0) The previous self-test routine completed
										without error or no self-test has ever
										been run.
Total time to complete Offline
data collection:				(28560) seconds.
Offline data collection
capabilities:					(0x7b) SMART execute Offline immediate.
										Auto Offline data collection on/off support.
										Suspend Offline collection upon new
										command.
										Offline surface scan supported.
										Self-test supported.
										Conveyance Self-test supported.
										Selective Self-test supported.
SMART capabilities:			(0x0003) Saves SMART data before entering
										power-saving mode.
										Supports SMART auto save timer.
Error logging capability:		(0x01) Error logging supported.
										General Purpose Logging supported.
Short self-test routine
recommended polling time:		(   2) minutes.
Extended self-test routine
recommended polling time:		( 291) minutes.
Conveyance self-test routine
recommended polling time:		(   5) minutes.
SCT capabilities:			  (0x303f) SCT Status supported.
										SCT Error Recovery Control supported.
										SCT Feature Control supported.
										SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME		  FLAG	 VALUE WORST THRESH TYPE	  UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate	 0x002f   200   200   051	Pre-fail  Always	   -	   3
  3 Spin_Up_Time			0x0027   253   253   021	Pre-fail  Always	   -	   3616
  4 Start_Stop_Count		0x0032   100   100   000	Old_age   Always	   -	   601
  5 Reallocated_Sector_Ct   0x0033   200   200   140	Pre-fail  Always	   -	   0
  7 Seek_Error_Rate		 0x002e   200   200   000	Old_age   Always	   -	   0
  9 Power_On_Hours		  0x0032   062   062   000	Old_age   Always	   -	   28060
 10 Spin_Retry_Count		0x0032   100   100   000	Old_age   Always	   -	   0
 11 Calibration_Retry_Count 0x0032   100   100   000	Old_age   Always	   -	   0
 12 Power_Cycle_Count	   0x0032   100   100   000	Old_age   Always	   -	   600
192 Power-Off_Retract_Count 0x0032   200   200   000	Old_age   Always	   -	   571
193 Load_Cycle_Count		0x0032   200   200   000	Old_age   Always	   -	   29
194 Temperature_Celsius	 0x0022   122   108   000	Old_age   Always	   -	   30
196 Reallocated_Event_Count 0x0032   200   200   000	Old_age   Always	   -	   0
197 Current_Pending_Sector  0x0032   200   200   000	Old_age   Always	   -	   0
198 Offline_Uncorrectable   0x0030   200   200   000	Old_age   Offline	  -	   0
199 UDMA_CRC_Error_Count	0x0032   200   200   000	Old_age   Always	   -	   0
200 Multi_Zone_Error_Rate   0x0008   200   200   000	Old_age   Offline	  -	   1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description	Status				  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline	Completed without error	   00%	 27946		 -
# 2  Short offline	   Completed without error	   00%	 26687		 -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
	1		0		0  Not_testing
	2		0		0  Not_testing
	3		0		0  Not_testing
	4		0		0  Not_testing
	5		0		0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

 

Robert Trevellyan

Pony Wrangler
Joined
May 16, 2014
Messages
3,778
Here's the extended tests.
That's not how SMART tests work. You ran extended tests, and they passed. That's all a self-test does: it either passes or fails. The smartctl -a output is not the test result itself; it is a dump of the SMART attributes and logs, which the drive maintains continuously on its own whether or not a test has been run. As before, there's no smoking gun here that I can see.
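If it helps to compare twelve drives without reading every dump by eye, here is a minimal Python sketch, assuming the text of smartctl -a has already been captured into a string. The function name summarize, the SAMPLE text (trimmed from the da10 output above), and the watch list of attribute IDs (5, 197, 198, 199) are my own choices for illustration; smartctl itself defines none of them. It only separates the two things discussed above: the raw attribute counters and the pass/fail entries in the self-test log.

```python
import re

# Assumed watch list: attributes often checked first for media or cabling
# trouble. This selection is illustrative, not something smartctl prescribes.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
    199: "UDMA_CRC_Error_Count",
}

# Matches self-test log rows such as:
# "# 2  Extended offline  Interrupted (host reset)  90%  28639  -"
TEST_RE = re.compile(
    r"^#\s*(\d+)\s+(\w+) offline\s+(.+?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$"
)


def summarize(report: str) -> dict:
    """Pull watched raw attribute values and self-test results from smartctl -a text."""
    raw_values = {}
    tests = []
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry a hex flag in column 3.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attr_id = int(fields[0])
            if attr_id in WATCHED and fields[9].isdigit():
                raw_values[WATCHED[attr_id]] = int(fields[9])
            continue
        match = TEST_RE.match(line)
        if match:
            # Keep (test type, status) for each logged self-test.
            tests.append((match.group(2), match.group(3)))
    return {
        "nonzero_attributes": {k: v for k, v in raw_values.items() if v != 0},
        "unclean_tests": [
            t for t in tests if not t[1].startswith("Completed without error")
        ],
        "tests_seen": len(tests),
    }


# Trimmed from the da10 output posted above (whitespace normalized).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
# 1  Extended offline    Completed without error       00%     30096         -
# 2  Extended offline    Interrupted (host reset)      90%     28639         -
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

An "Interrupted (host reset)" entry only means the test did not run to completion, so it is listed separately rather than treated as a failure. With every watched counter at zero and the latest extended test completing without error, this agrees with reading the output by hand: nothing here implicates the drive itself.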
 