Poor performance for the hardware I have

Status
Not open for further replies.

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
I have a single share on my FreeNAS setup, shared over AFP. If only one person is writing to it, performance isn't too bad, but with any more than that, everyone accessing the share sees very slow transfers in both directions. Even getting a directory listing is painfully slow. CPU and memory usage look fine; it just seems like disk access is really slow. I don't even know where to begin troubleshooting. Can anyone help me at least get an idea of what's wrong?

Specs:
6-core 3.1 GHz CPU
32 GB RAM
LSI 9211 controller
12x 2 TB drives (SATA II) in a single ZFS volume configured as RAID-Z2
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
That's a large array to have in a single vdev, and my guess is that you're being limited by the slowest disk's maximum throughput. You could dramatically improve the situation by striping your pool across two RAIDZ2 vdevs -- but you'd have to start over to do this, and you'd lose 4 TB of storage (though you would then have four parity disks).
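
For illustration only, a minimal sketch of what a two-vdev RAIDZ2 pool looks like at creation time. The pool name (tank) and device names (da0 through da11) are placeholders, and creating a pool this way destroys whatever is on those disks, so it only applies if you rebuild from a backup:

Code:
# Pool striped across two 6-disk RAIDZ2 vdevs; check your actual device names with: camcontrol devlist
zpool create tank \
    raidz2 da0 da1 da2 da3 da4 da5 \
    raidz2 da6 da7 da8 da9 da10 da11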
 

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
I've got the array fairly full, and unfortunately I don't have anywhere else to put the data. I noticed that one of the hard drives has an activity light that is nearly solid while the others are not as busy. Is it possible there is a problem with that single drive, and if so, is there a way to test just that drive for errors?
 

c4rp3d13m

Dabbler
Joined
Nov 18, 2012
Messages
12
I have 12x 1 TB drives in a RAIDZ3 setup with at least 6 datasets, and man, it is slow. I figured that since I'm making multiple datasets anyway, why not split the 12 drives into a 6-drive set and a 5-drive set plus 1 spare? That should speed up my access, turning my RAIDZ3 into RAIDZ + 1 spare.
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
It's possible, though IMO that's not the cause. You can run a long SMART test on the drive, and you can look at the current status and iostat to see whether that drive has issues:

smartctl -x /dev/daXX (where XX is the drive you wish to test)
smartctl -t long /dev/daXX (this test could take a long time -- maybe 8 hours, depending on the drive)
smartctl -x /dev/daXX
zpool status
zpool iostat -v


Or you could just replace the drive and let it resilver.
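
If you do go the replacement route, the rough CLI sequence looks like the sketch below. The pool name (tank) and device (da3) are placeholders, and the FreeNAS GUI volume manager is the usual way to do this -- treat it only as an outline of what happens underneath:

Code:
zpool status tank          # confirm which device is the suspect one
zpool offline tank da3     # take the failing disk offline
# ...shut down and swap the physical drive if hot-swap isn't supported...
zpool replace tank da3     # resilver onto the new disk in the same slot
zpool status tank          # watch resilver progress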

Also, you should be aware that once the array starts to fill up -- some say around 80% full -- performance will start to suffer.
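
To check how full the pool actually is (tank is a placeholder for your pool name):

Code:
zpool list tank    # the CAP column shows the percentage of pool capacity in use
zfs list tank      # USED and AVAIL as ZFS reports them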
 

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
Would I be better off converting to UFS at some point? Or is 12 drives just too much no matter what I do?
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
I have 12x 1 TB drives in a RAIDZ3 setup with at least 6 datasets, and man, it is slow. I figured that since I'm making multiple datasets anyway, why not split the 12 drives into a 6-drive set and a 5-drive set plus 1 spare? That should speed up my access, turning my RAIDZ3 into RAIDZ + 1 spare.

Two vdevs should speed up your access. The vdevs want to be the same size, so that leaves you with 6-6 or 5-5. With RAIDZ3, that doesn't leave you with much storage space, so you might consider mirroring and striping the drives instead, which would be very fast.
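
For illustration, a striped-mirror ("RAID 10"-style) layout at pool creation looks like the sketch below. Again, the pool name and device names are placeholders, and this assumes rebuilding from scratch:

Code:
# Six 2-way mirror vdevs striped together -- fast, but only half the raw capacity is usable
zpool create tank \
    mirror da0 da1  mirror da2 da3  mirror da4 da5 \
    mirror da6 da7  mirror da8 da9  mirror da10 da11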
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Would I be better off converting to UFS at some point? Or is 12 drives just too much no matter what I do?

In any parity RAID array, IO is pretty much limited to the speed of the slowest drive, so I would expect the gain from a change to UFS to be negligible. The ways to speed up ZFS pools are striping multiple vdevs (think RAID 50 or 60) or using mirroring and striping (think RAID 10). RAID 10 is extremely fast and doesn't have the same IO issues, but it takes more drives to get the same amount of space. In round numbers, your 12 drives would yield something less than 4 TB of storage in RAID 10 and something less than 8 TB in RAID 60. My guess is that you have something under 10 TB now.

If you're space-constrained, that's not going to work unless you move to larger drives.
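
For reference, the generic arithmetic behind these layout comparisons (N drives of size S; this ignores ZFS metadata overhead, and the exact figures quoted above may assume different mirror widths or drive sizes):

Code:
# Approximate usable capacity for N drives of size S:
#   striped 2-way mirrors (RAID 10-style):         (N / 2) * S
#   two (N/2)-disk RAIDZ2 vdevs (RAID 60-style):   (N - 4) * S
#   single N-disk RAIDZ2 (the current layout):     (N - 2) * S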
 

bollar

Patron
Joined
Oct 28, 2012
Messages
411
Can I continue to use the array during the SMART tests, or will I need to unmount it first?

You can continue to use it -- if the array is busy, the test will just take longer.
 

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
I did a quick test and here are the results. I see errors but don't really know how bad they are:


Code:
smartctl 5.43 2012-06-30 r3573 [FreeBSD 8.3-RELEASE-p4 amd64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format)
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA0704087
LU WWN Device Id: 5 0014ee 60066ea6c
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue Nov 20 16:28:40 2012 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Disabled
APM feature is:   Unavailable
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85)	Offline data collection activity
					was aborted by an interrupting command from host.
					Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 117)	The previous self-test completed having
					the read element of the test failed.
Total time to complete Offline 
data collection: 		(38100) seconds.
Offline data collection
capabilities: 			 (0x7b) SMART execute Offline immediate.
					Auto Offline data collection on/off support.
					Suspend Offline collection upon new
					command.
					Offline surface scan supported.
					Self-test supported.
					Conveyance Self-test supported.
					Selective Self-test supported.
SMART capabilities:            (0x0003)	Saves SMART data before entering
					power-saving mode.
					Supports SMART auto save timer.
Error logging capability:        (0x01)	Error logging supported.
					General Purpose Logging supported.
Short self-test routine 
recommended polling time: 	 (   2) minutes.
Extended self-test routine
recommended polling time: 	 ( 367) minutes.
Conveyance self-test routine
recommended polling time: 	 (   5) minutes.
SCT capabilities: 	       (0x3035)	SCT Status supported.
					SCT Feature Control supported.
					SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   187   187   051    -    5387
  3 Spin_Up_Time            POS--K   174   167   021    -    6300
  4 Start_Stop_Count        -O--CK   092   092   000    -    8146
  5 Reallocated_Sector_Ct   PO--CK   190   190   140    -    203
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   079   079   000    -    15563
 10 Spin_Retry_Count        -O--CK   100   100   000    -    0
 11 Calibration_Retry_Count -O--CK   100   100   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    126
192 Power-Off_Retract_Count -O--CK   200   200   000    -    48
193 Load_Cycle_Count        -O--CK   091   091   000    -    328145
194 Temperature_Celsius     -O---K   115   110   000    -    35
196 Reallocated_Event_Count -O--CK   083   083   000    -    117
197 Current_Pending_Sector  -O--CK   199   199   000    -    511
198 Offline_Uncorrectable   ----CK   200   200   000    -    195
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   193   192   000    -    2112
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
GP/S  Log at address 0x00 has    1 sectors [Log Directory]
SMART Log at address 0x01 has    1 sectors [Summary SMART error log]
SMART Log at address 0x02 has    5 sectors [Comprehensive SMART error log]
GP    Log at address 0x03 has    6 sectors [Ext. Comprehensive SMART error log]
SMART Log at address 0x06 has    1 sectors [SMART self-test log]
GP    Log at address 0x07 has    1 sectors [Extended self-test log]
SMART Log at address 0x09 has    1 sectors [Selective self-test log]
GP    Log at address 0x10 has    1 sectors [NCQ Command Error log]
GP    Log at address 0x11 has    1 sectors [SATA Phy Event Counters]
GP/S  Log at address 0x80 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x81 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x82 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x83 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x84 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x85 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x86 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x87 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x88 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x89 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x8f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x90 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x91 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x92 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x93 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x94 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x95 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x96 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x97 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x98 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x99 has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9a has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9b has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9c has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9d has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9e has   16 sectors [Host vendor specific log]
GP/S  Log at address 0x9f has   16 sectors [Host vendor specific log]
GP/S  Log at address 0xa0 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa1 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa2 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa3 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa4 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa5 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa6 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa7 has   16 sectors [Device vendor specific log]
GP/S  Log at address 0xa8 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xa9 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xaa has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xab has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xac has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xad has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xae has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xaf has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb0 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb1 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb2 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb3 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb4 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb5 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb6 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xb7 has    1 sectors [Device vendor specific log]
GP/S  Log at address 0xc0 has    1 sectors [Device vendor specific log]
GP    Log at address 0xc1 has   93 sectors [Device vendor specific log]
GP/S  Log at address 0xe0 has    1 sectors [SCT Command/Status]
GP/S  Log at address 0xe1 has    1 sectors [SCT Data Transfer]

SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 2233 (device log contains only the most recent 24 errors)
	CR     = Command Register
	FEATR  = Features Register
	COUNT  = Count (was: Sector Count) Register
	LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
	LH     = LBA High (was: Cylinder High) Register    ]   LBA
	LM     = LBA Mid (was: Cylinder Low) Register      ] Register
	LL     = LBA Low (was: Sector Number) Register     ]
	DV     = Device (was: Device/Head) Register
	DC     = Device Control Register
	ER     = Error register
	ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2233 [0] occurred at disk power-on lifetime: 15412 hours (642 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 6f 78 ba e0 40 00  Error: UNC at LBA = 0x6f78bae0 = 1870183136

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 19 00 20 00 00 6f 78 bc f8 40 00  8d+14:18:05.854  READ FPDMA QUEUED
  60 00 81 00 18 00 00 6f 78 bc 77 40 00  8d+14:18:05.854  READ FPDMA QUEUED
  60 00 1a 00 10 00 00 6f 78 bc 5d 40 00  8d+14:18:05.854  READ FPDMA QUEUED
  60 00 b4 00 08 00 00 6f 78 bb a9 40 00  8d+14:18:05.854  READ FPDMA QUEUED
  60 00 e8 00 00 00 00 6f 78 ba c1 40 00  8d+14:18:05.854  READ FPDMA QUEUED

Error 2232 [23] occurred at disk power-on lifetime: 15412 hours (642 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 6f 78 ba e0 40 00  Error: UNC at LBA = 0x6f78bae0 = 1870183136

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 19 00 20 00 00 6f 78 bc f8 40 00  8d+14:18:01.658  READ FPDMA QUEUED
  60 00 81 00 18 00 00 6f 78 bc 77 40 00  8d+14:18:01.658  READ FPDMA QUEUED
  60 00 1a 00 10 00 00 6f 78 bc 5d 40 00  8d+14:18:01.658  READ FPDMA QUEUED
  60 00 b4 00 08 00 00 6f 78 bb a9 40 00  8d+14:18:01.658  READ FPDMA QUEUED
  60 00 e8 00 00 00 00 6f 78 ba c1 40 00  8d+14:18:01.658  READ FPDMA QUEUED

Error 2231 [22] occurred at disk power-on lifetime: 15412 hours (642 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 6f 78 ba e0 40 00  Error: UNC at LBA = 0x6f78bae0 = 1870183136

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 19 00 20 00 00 6f 78 bc f8 40 00  8d+14:17:57.662  READ FPDMA QUEUED
  60 00 81 00 18 00 00 6f 78 bc 77 40 00  8d+14:17:57.662  READ FPDMA QUEUED
  60 00 1a 00 10 00 00 6f 78 bc 5d 40 00  8d+14:17:57.662  READ FPDMA QUEUED
  60 00 b4 00 08 00 00 6f 78 bb a9 40 00  8d+14:17:57.662  READ FPDMA QUEUED
  60 00 e8 00 00 00 00 6f 78 ba c1 40 00  8d+14:17:57.662  READ FPDMA QUEUED

Error 2230 [21] occurred at disk power-on lifetime: 15412 hours (642 days + 4 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 6f 78 ba e0 40 00  Error: UNC at LBA = 0x6f78bae0 = 1870183136

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 19 00 18 00 00 6f 78 bc f8 40 00  8d+14:17:53.358  READ FPDMA QUEUED
  60 00 81 00 10 00 00 6f 78 bc 77 40 00  8d+14:17:51.929  READ FPDMA QUEUED
  60 00 1a 00 08 00 00 6f 78 bc 5d 40 00  8d+14:17:51.919  READ FPDMA QUEUED
  60 00 b4 00 00 00 00 6f 78 bb a9 40 00  8d+14:17:51.919  READ FPDMA QUEUED
  60 00 e8 00 28 00 00 6f 78 ba c1 40 00  8d+14:17:51.919  READ FPDMA QUEUED

Error 2229 [20] occurred at disk power-on lifetime: 15405 hours (641 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 dc 0c 0f 3f 40 00  Error: UNC at LBA = 0xdc0c0f3f = 3691777855

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 1a 00 08 00 00 dc 0c 22 ac 40 00  8d+07:49:03.743  READ FPDMA QUEUED
  60 00 1a 00 00 00 00 dc 0c 0f 3f 40 00  8d+07:49:03.742  READ FPDMA QUEUED
  60 00 19 00 00 00 00 dc 0c 0e 53 40 00  8d+07:49:03.732  READ FPDMA QUEUED
  60 00 1a 00 00 00 00 dc 0c 0e 39 40 00  8d+07:49:03.729  READ FPDMA QUEUED
  60 00 1a 00 00 00 00 dc 0c 0e 1f 40 00  8d+07:49:03.727  READ FPDMA QUEUED

Error 2228 [19] occurred at disk power-on lifetime: 15405 hours (641 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 db 3b f7 f8 40 00  Error: UNC at LBA = 0xdb3bf7f8 = 3678140408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 1a 00 18 00 00 db 3b fb 25 40 00  8d+07:26:01.243  READ FPDMA QUEUED
  60 00 1a 00 28 00 00 db 3b fb 0b 40 00  8d+07:26:01.240  READ FPDMA QUEUED
  60 00 e8 00 18 00 00 db 3b fa 23 40 00  8d+07:26:01.240  READ FPDMA QUEUED
  60 00 e7 00 20 00 00 db 3b f9 3c 40 00  8d+07:26:01.239  READ FPDMA QUEUED
  60 00 e8 00 18 00 00 db 3b f8 54 40 00  8d+07:26:01.239  READ FPDMA QUEUED

Error 2227 [18] occurred at disk power-on lifetime: 15405 hours (641 days + 21 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 db 3b f7 f8 40 00  Error: UNC at LBA = 0xdb3bf7f8 = 3678140408

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 1a 00 10 00 00 db 3b f8 20 40 00  8d+07:25:57.318  READ FPDMA QUEUED
  60 00 19 00 08 00 00 db 3b f8 07 40 00  8d+07:25:57.314  READ FPDMA QUEUED
  60 00 1a 00 00 00 00 db 3b f7 ed 40 00  8d+07:25:57.310  READ FPDMA QUEUED
  60 00 1a 00 08 00 00 db 3b f7 d3 40 00  8d+07:25:57.306  READ FPDMA QUEUED
  60 00 01 00 00 00 00 da 20 c4 7d 40 00  8d+07:25:57.303  READ FPDMA QUEUED

Error 2226 [17] occurred at disk power-on lifetime: 15404 hours (641 days + 20 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 db c8 90 38 40 00  Error: UNC at LBA = 0xdbc89038 = 3687354424

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 00 19 00 10 00 00 db c8 90 4b 40 00  8d+07:06:26.786  READ FPDMA QUEUED
  60 00 01 00 08 00 00 db 21 6a 9f 40 00  8d+07:06:26.782  READ FPDMA QUEUED
  60 00 1a 00 00 00 00 db c8 90 30 40 00  8d+07:06:26.782  READ FPDMA QUEUED
  60 00 68 00 20 00 00 db c8 8f c8 40 00  8d+07:06:26.778  READ FPDMA QUEUED
  60 00 01 00 18 00 00 db 26 28 fe 40 00  8d+07:06:26.778  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       50%     15563         1373147720

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       258 (0x0102)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    35 Celsius
Power Cycle Min/Max Temperature:     27/36 Celsius
Lifetime    Min/Max Temperature:     27/40 Celsius
Under/Over Temperature Limit Count:   0/0
SCT Temperature History Version:     2
Temperature Sampling Period:         1 minute
Temperature Logging Interval:        1 minute
Min/Max recommended Temperature:      0/60 Celsius
Min/Max Temperature Limit:           -41/85 Celsius
Temperature History Size (Index):    478 (397)

Index    Estimated Time   Temperature Celsius
 398    2012-11-20 08:31    35  ****************
 ...    ..( 74 skipped).    ..  ****************
 473    2012-11-20 09:46    35  ****************
 474    2012-11-20 09:47    34  ***************
 475    2012-11-20 09:48    35  ****************
 ...    ..(  6 skipped).    ..  ****************
   4    2012-11-20 09:55    35  ****************
   5    2012-11-20 09:56    34  ***************
   6    2012-11-20 09:57    34  ***************
   7    2012-11-20 09:58    33  **************
 ...    ..(  2 skipped).    ..  **************
  10    2012-11-20 10:01    33  **************
  11    2012-11-20 10:02    34  ***************
 ...    ..(  7 skipped).    ..  ***************
  19    2012-11-20 10:10    34  ***************
  20    2012-11-20 10:11    35  ****************
 ...    ..(120 skipped).    ..  ****************
 141    2012-11-20 12:12    35  ****************
 142    2012-11-20 12:13    34  ***************
 ...    ..(  2 skipped).    ..  ***************
 145    2012-11-20 12:16    34  ***************
 146    2012-11-20 12:17    33  **************
 147    2012-11-20 12:18    34  ***************
 ...    ..( 19 skipped).    ..  ***************
 167    2012-11-20 12:38    34  ***************
 168    2012-11-20 12:39    35  ****************
 169    2012-11-20 12:40    35  ****************
 170    2012-11-20 12:41    34  ***************
 171    2012-11-20 12:42    35  ****************
 ...    ..( 10 skipped).    ..  ****************
 182    2012-11-20 12:53    35  ****************
 183    2012-11-20 12:54    34  ***************
 184    2012-11-20 12:55    35  ****************
 185    2012-11-20 12:56    35  ****************
 186    2012-11-20 12:57    33  **************
 187    2012-11-20 12:58    34  ***************
 188    2012-11-20 12:59    34  ***************
 189    2012-11-20 13:00    35  ****************
 ...    ..(  4 skipped).    ..  ****************
 194    2012-11-20 13:05    35  ****************
 195    2012-11-20 13:06    34  ***************
 ...    ..( 19 skipped).    ..  ***************
 215    2012-11-20 13:26    34  ***************
 216    2012-11-20 13:27    33  **************
 217    2012-11-20 13:28    34  ***************
 ...    ..(  7 skipped).    ..  ***************
 225    2012-11-20 13:36    34  ***************
 226    2012-11-20 13:37    33  **************
 227    2012-11-20 13:38    34  ***************
 ...    ..(  2 skipped).    ..  ***************
 230    2012-11-20 13:41    34  ***************
 231    2012-11-20 13:42    33  **************
 232    2012-11-20 13:43    34  ***************
 233    2012-11-20 13:44    35  ****************
 234    2012-11-20 13:45    34  ***************
 235    2012-11-20 13:46    35  ****************
 236    2012-11-20 13:47    35  ****************
 237    2012-11-20 13:48    34  ***************
 ...    ..( 31 skipped).    ..  ***************
 269    2012-11-20 14:20    34  ***************
 270    2012-11-20 14:21    33  **************
 271    2012-11-20 14:22    35  ****************
 272    2012-11-20 14:23    35  ****************
 273    2012-11-20 14:24    34  ***************
 274    2012-11-20 14:25    35  ****************
 ...    ..(  5 skipped).    ..  ****************
 280    2012-11-20 14:31    35  ****************
 281    2012-11-20 14:32    34  ***************
 282    2012-11-20 14:33    35  ****************
 ...    ..( 20 skipped).    ..  ****************
 303    2012-11-20 14:54    35  ****************
 304    2012-11-20 14:55    34  ***************
 305    2012-11-20 14:56    34  ***************
 306    2012-11-20 14:57    35  ****************
 ...    ..( 12 skipped).    ..  ****************
 319    2012-11-20 15:10    35  ****************
 320    2012-11-20 15:11    34  ***************
 321    2012-11-20 15:12    35  ****************
 322    2012-11-20 15:13    35  ****************
 323    2012-11-20 15:14    35  ****************
 324    2012-11-20 15:15    36  *****************
 ...    ..(  5 skipped).    ..  *****************
 330    2012-11-20 15:21    36  *****************
 331    2012-11-20 15:22    35  ****************
 332    2012-11-20 15:23    36  *****************
 333    2012-11-20 15:24    36  *****************
 334    2012-11-20 15:25    35  ****************
 ...    ..( 41 skipped).    ..  ****************
 376    2012-11-20 16:07    35  ****************
 377    2012-11-20 16:08    36  *****************
 378    2012-11-20 16:09    35  ****************
 379    2012-11-20 16:10    35  ****************
 380    2012-11-20 16:11    36  *****************
 381    2012-11-20 16:12    35  ****************
 382    2012-11-20 16:13    35  ****************
 383    2012-11-20 16:14    36  *****************
 384    2012-11-20 16:15    35  ****************
 385    2012-11-20 16:16    36  *****************
 386    2012-11-20 16:17    35  ****************
 387    2012-11-20 16:18    35  ****************
 388    2012-11-20 16:19    36  *****************
 389    2012-11-20 16:20    35  ****************
 ...    ..(  2 skipped).    ..  ****************
 392    2012-11-20 16:23    35  ****************
 393    2012-11-20 16:24    36  *****************
 394    2012-11-20 16:25    35  ****************
 ...    ..(  2 skipped).    ..  ****************
 397    2012-11-20 16:28    35  ****************

Warning: device does not support SCT Error Recovery Control command
SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  2            0  Command failed due to ICRC error
0x0002  2            0  R_ERR response for data FIS
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0005  2            0  R_ERR response for non-data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x000b  2            0  CRC errors within host-to-device FIS
0x8000  4      1289723  Vendor specific


I also see this a lot (just a random sampling):

Code:
Nov 19 21:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 21:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 19 21:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 665 Currently unreadable (pending) sectors
Nov 19 21:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 19 21:56:17 nas smartd[2583]: Device: /dev/da2 [SAT], 505 Currently unreadable (pending) sectors
Nov 19 21:56:17 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 19 21:56:18 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 21:56:18 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 665 Currently unreadable (pending) sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da2 [SAT], 505 Currently unreadable (pending) sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 22:26:17 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 19 22:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 665 Currently unreadable (pending) sectors
Nov 19 22:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 19 22:56:18 nas smartd[2583]: Device: /dev/da2 [SAT], 509 Currently unreadable (pending) sectors (changed +4)
Nov 19 22:56:18 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 19 22:56:18 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 22:56:18 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 19 23:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors (changed +1)
Nov 19 23:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 19 23:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors (changed +2)
Nov 19 23:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 19 23:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 23:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 19 23:56:18 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors
Nov 19 23:56:18 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 19 23:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors
Nov 19 23:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 19 23:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 19 23:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 20 00:26:18 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors
Nov 20 00:26:18 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 20 00:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors
Nov 20 00:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 20 00:26:19 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 20 00:26:19 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 20 00:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors
Nov 20 00:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 20 00:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors
Nov 20 00:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 20 00:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 20 00:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 20 01:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors
Nov 20 01:26:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 20 01:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors
Nov 20 01:26:18 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 20 01:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 20 01:26:18 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
Nov 20 01:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 666 Currently unreadable (pending) sectors
Nov 20 01:56:17 nas smartd[2583]: Device: /dev/da0 [SAT], 8 Offline uncorrectable sectors
Nov 20 01:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 511 Currently unreadable (pending) sectors
Nov 20 01:56:19 nas smartd[2583]: Device: /dev/da2 [SAT], 195 Offline uncorrectable sectors
Nov 20 01:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 13 Currently unreadable (pending) sectors
Nov 20 01:56:19 nas smartd[2583]: Device: /dev/da3 [SAT], 2 Offline uncorrectable sectors
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I would look at replacing disks 0, 2, and 3 immediately. All three of those disks are going to fail soon. If you don't have a backup, you'd BETTER make one... ASAP.
 

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
Sounds good, I'm getting some ordered now. Do you think this is part of the performance issue, or is it unrelated?
Is there an acceptable level of unreadable sectors, or should I replace a drive as soon as I see one?
Also, is there a guide somewhere for the proper procedure to replace a drive in a ZFS array?
 

bhilgenkamp

Dabbler
Joined
Oct 1, 2012
Messages
16
And one more question - I'm looking at Western Digital drives. Should I try to get drives with TLER? I'm seeing conflicting posts about whether or not this is needed in a software RAID/ZFS setup.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Sounds good, I'm getting some ordered now. Do you think this is part of the performance issue, or is it unrelated?
Is there an acceptable level of unreadable sectors, or should I replace a drive as soon as I see one?
Also, is there a guide somewhere for the proper procedure to replace a drive in a ZFS array?

It is very likely related to your performance issue. However, until you have those crappy drives replaced, you won't know how much of an improvement to expect.

The manual provides excellent instructions for replacing a failing drive (section 6.3.11). Do realize that each resilver/scrub stresses the remaining disks and risks failing another one, especially the ones that are already showing signs of failure. If you have important data on the array, I'd recommend backing it up before you start this whole process. You have a RAIDZ2, which only allows for 2 failures, but you have 3 disks that are going bad. I always err on the side of caution and assume that my hardware does NOT support hot-pluggable disks, so I do the applicable shutdowns per the manual. I've seen issues with some controllers not having full support for hot-plugging disks in FreeBSD.

The resilvering will probably kill your server performance since you have so many failing disks. Just let it finish, and with a little luck you'll be back in business with no lost data. If you were planning to do an upgrade, now would be the time to consider it: assuming all of the drives are about the same age, I'd expect more of them to fail in the future. Just something to consider. It would suck to replace half your array over the next 6 months and then decide you want to upgrade to 3 TB drives.

Typically "zero" is best for those errors, but "not increasing significantly" is also acceptable. I have 16 drives in a pool and I get about 1 bad sector every 4 months from a random drive; I'm not that concerned. In your case, you have accumulated a large number of errors in just a few hours (Holy shit! I think that's a new world record!). If I were in your shoes, I'd stop worrying about performance and start hoping your data is safe. After you've replaced those 3 drives, THEN re-examine the performance.
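
For reference only -- once the replacements and resilvers are done, a scrub is a reasonable way to confirm the pool is healthy again. The pool name (tank) is a placeholder, and a scrub can also be started from the FreeNAS GUI:

Code:
zpool scrub tank       # read and verify every block against its checksums
zpool status -v tank   # shows scrub progress and lists any files with unrecoverable errors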
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
And one more question - I'm looking at Western Digital drives. Should I try to get drives with TLER? I'm seeing conflicting posts about whether or not this is needed in a software RAID/ZFS setup.

Just noticed you made this post at the same time I did. TLER is preferred, but will mean you are buying VERY expensive drives. If you are already using commercial drives I'd just stick with those for consistency.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
TLER being needed/not needed has nothing to do with software RAID. TLER bounds the drive's response time. Hardware RAID controllers will sometimes decide a very slow drive is dead (probably an accurate forecast) and drop it out of an array; TLER reduces the chance of that happening. Most software RAID systems don't do that, but if your array grinds to a halt, you may have many problems with your NAS/SAN clients. TLER can help there.

However, in either case, a disk responding so slowly that TLER is even an issue is probably a sign that the disk needs looking at or (better yet) replacing. So it's like seat belts in a car: great to have, but you never want to actually need them.
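
For what it's worth, on drives that support it you can query and set the ERC (TLER) timeout from FreeBSD with smartctl; the output earlier in this thread shows these particular WD Greens don't support the command. Timeout values are in tenths of a second, and da0 is just an example device:

Code:
smartctl -l scterc /dev/da0         # show the current SCT Error Recovery Control settings
smartctl -l scterc,70,70 /dev/da0   # set read/write recovery timeouts to 7.0 seconds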
 

Stephens

Patron
Joined
Jun 19, 2012
Messages
496
And one more question - I'm looking at Western Digital drives. Should I try to get drives with TLER? I'm seeing conflicting posts about whether or not this is needed in a software RAID/ZFS setup.

Get the WD Red drives, which are usually only marginally more expensive than the WD Greens, and they're made for NAS use. TLER is helpful whether you use hardware or software RAID, but it's more critical with hardware RAID. If you're using RAIDZ1 or RAIDZ2, you can set TLER on the Red drives and let them go ahead and fail marginal sectors, letting the recovery/parity information recreate the data. IMO that's better than a drive sitting there for 5 minutes trying to recover a marginal/bad sector. If you're buying drives, just make life simple on yourself and get the WD Reds unless they're much more expensive than the Greens.
 