Many Seagate 600GB 15K SAS Failures (ST3600057SS)

Status
Not open for further replies.

Steven Sedory

Explorer
Joined
Apr 7, 2014
Messages
96
I have a production box that I use for the SAN of a 2012 R2 Hyper-V cluster. It is made up of 15 15K 600GB SAS drives (Seagate Cheetah ST3600057SS). I bought Manufacturer Certified Refurbished drives, as I could get them for almost $100 less. It may simply be that I got what I paid for...

Anyhow, over the past 10 months that the box has been in production, I've had seven or eight of the drives fail, which I've RMA'd. The last two had the SMART FAILURE PREDICTION THRESHOLD EXCEEDED error, so I just RMA'd those today.

The drives seem to be running at a drive temp of 47°C, with the ambient room temp at 25°C (this is apparently within the healthy limits; please see below, from the product manual: http://www.seagate.com/staticfiles/support/disc/manuals/enterprise/cheetah/15K.7/SAS/100516226d.pdf).


"The maximum allowable continuous or sustained HDA case temperature for the rated Annualized Failure
Rate (AFR) is 122°F (50°C) The maximum allowable
HDA case temperature is
60°C. Occasional excur sions of HDA case temperatures above 122°F (50°C) or below 41°F (5°C) may occur without impact to the specified AFR. Continual or sustained operation at HDA case temperatures outside these limits may degrade AFR."


The last two drives that I RMA'd were at 41 and 42°C.

Below is the SMART data for the two drives before I removed them:

[root@san1] ~# smartctl -a /dev/da2
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: 000B
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5005d58270b
Serial number: 6SL1XJ4F0000N2027PB4
Device type: disk
Transport protocol: SAS
Local Time is: Tue Nov 18 14:24:19 2014 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED [asc=5d, ascq=0]

Current Drive Temperature: 41 C
Drive Trip Temperature: 68 C

Elements in grown defect list: 164

Vendor (Seagate) cache information
Blocks sent to initiator = 495170415
Blocks received from initiator = 1215535082
Blocks read from cache and sent to initiator = 105525624
Number of read and write commands whose size <= segment size = 11127391
Number of read and write commands whose size > segment size = 12

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 5242.23
number of minutes until next internal SMART test = 52

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 7754326 0 0 7754326 7754327 253.527 1
write: 0 0 0 0 0 625.068 0
verify: 585 45 0 630 784 0.000 21

Non-medium error count: 2


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 5241 - [- - -]
# 2 Background short Completed - 5240 - [- - -]
# 3 Background short Completed - 5239 - [- - -]
# 4 Background short Completed - 5238 - [- - -]
# 5 Background short Completed - 5237 - [- - -]
# 6 Background short Completed - 5236 - [- - -]
# 7 Background short Completed - 5235 - [- - -]
# 8 Background short Completed - 5234 - [- - -]
# 9 Background short Completed - 5233 - [- - -]
#10 Background short Completed - 5232 - [- - -]
#11 Background short Completed - 5231 - [- - -]
#12 Background short Completed - 5230 - [- - -]
#13 Background short Completed - 5229 - [- - -]
#14 Background short Completed - 5228 - [- - -]
#15 Background short Completed - 5227 - [- - -]
#16 Background short Completed - 5226 - [- - -]
#17 Background short Completed - 5225 - [- - -]
#18 Background short Completed - 5224 - [- - -]
#19 Background short Completed - 5223 - [- - -]
#20 Background short Completed - 5222 - [- - -]
Long (extended) Self Test duration: 6400 seconds [106.7 minutes]

[root@san1] ~# smartctl -a /dev/da4
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: 000B
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5005d586f27
Serial number: 3SL1AMXH00009108K8KE
Device type: disk
Transport protocol: SAS
Local Time is: Tue Nov 18 14:24:24 2014 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED [asc=5d, ascq=0]

Current Drive Temperature: 42 C
Drive Trip Temperature: 68 C

Elements in grown defect list: 2037

Vendor (Seagate) cache information
Blocks sent to initiator = 9449
Blocks received from initiator = 202
Blocks read from cache and sent to initiator = 5947
Number of read and write commands whose size <= segment size = 62
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 5242.70
number of minutes until next internal SMART test = 52

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 104 0 0 104 104 0.005 0
write: 0 0 0 0 0 0.000 0

Non-medium error count: 0


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
No self-tests have been logged


Any ideas?
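In case it helps with triage, here's a rough sketch of pulling the two interesting fields out of `smartctl -a` output (the health status and the grown-defect count); the field layout is assumed to be exactly what's in the dumps above, and the device sweep at the bottom is just an example:

```shell
# parse_health: extract the SMART health status and grown-defect count
# from `smartctl -a` output for a SAS drive. Sketch only -- assumes the
# smartctl 6.2 SAS output format shown in the dumps above.
parse_health() {
    awk -F': *' '
        /^SMART Health Status:/           { h = $2 }
        /^Elements in grown defect list:/ { d = $2 }
        END { printf "%s | grown defects: %s\n", h, d }
    '
}

# Example sweep on FreeBSD (da* device names assumed):
#   for dev in /dev/da[0-9]*; do
#       printf '%s: ' "$dev"; smartctl -a "$dev" | parse_health
#   done
```

A grown-defect list that keeps growing between sweeps is usually the earliest warning, well before the prediction threshold actually trips.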
 

SweetAndLow

Sweet'NASty
Joined
Nov 6, 2013
Messages
6,421
Drives are way too hot. They shouldn't go over 40°C at max load.
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Is that temperature really too hot for a 15,000RPM drive?

Cooler is better. Power supply could be an issue, also vibration.

The other possibility is that this is normal for (or at least solely the fault of) the drives you are using.

What were the failure symptoms of the 7 or 8 that didn't give you SMART warnings?
 

Steven Sedory

Explorer
Joined
Apr 7, 2014
Messages
96
Is that temperature really too hot for a 15,000RPM drive?

Cooler is better. Power supply could be an issue, also vibration.

The other possibility is that this is normal for (or at least solely the fault of) the drives you are using.

What were the failure symptoms of the 7 or 8 that didn't give you SMART warnings?

The issue with about four of the previously failed drives was mechanical. Audible clicking and such.

Below is the Critical Alarm alert that FreeNAS sent me when two others failed about a month ago. You'll notice that as one of the spares took over in mirror-1, it also failed with "too many errors":

FAULT: **********************************************************************
FAULT: Appliance : storage (OS v3.1.5, NMS v31-5-0)
FAULT: Machine SIG : 86597K9ML
FAULT: Primary MAC : 0:30:48:c8:72:dc
FAULT: Time : Mon Oct 6 09:49:08 2014
FAULT: Trigger : nms-fmacheck
FAULT: Fault Type : ALARM
FAULT: Fault ID : 5
FAULT: Fault Count : 1
FAULT: Severity : CRITICAL
FAULT: Description : FMA Module: zfs-diagnosis, UUID:
FAULT: : 8e10e781-23b4-e84d-f33b-8bda592c652f
FAULT: **********************************************************************

!
! For more detais on this trigger click on link below:
! http://192.168.77.200/data/runners?selected_runner=nms-fmacheck
!


List of faulty resources:
--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 06 09:48:31 b2aa6658-aaed-e4ea-e4d9-af5511c56668 ZFS-8000-GH Major

Host : myhost
Platform : X8DTN Chassis_id : 1234567890
Product_sn :

Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=vol0/vdev=83acd2c81e78ac6e
faulted but still in service Problem in : zfs://pool=vol0/vdev=83acd2c81e78ac6e
faulted but still in service

Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://nexenta.com/msg/ZFS-8000-GH for more information.

Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------

Host : myhost
Platform : X8DTN Chassis_id : 1234567890
Product_sn :

Fault class : fault.fs.zfs.vdev.timeout
Affects : zfs://pool=vol0/vdev=4711f130867b0240
faulted and taken out of service Problem in : zfs://pool=vol0/vdev=4711f130867b0240
not present

Description : A Solaris Fault Manager component generated a diagnosis for which
no message summary exists. Refer to
http://nexenta.com/msg/FMD-8000-11 for more information.

Response : The diagnosis has been saved in the fault log for examination by
Sun.

Impact : The fault log will need to be manually examined using fmdump(1M)
in order to determine if any human response is required.

Action : Use fmdump -v -u <EVENT-ID> to view the diagnosis result. Run
pkgchk -n SUNWfmd to ensure that fault management software is
installed properly.

--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------

Host : myhost
Platform : X8DTN Chassis_id : 1234567890
Product_sn :

Fault class : fault.fs.zfs.vdev.timeout
Affects : zfs://pool=vol0/vdev=d2a89124bf6f2590
out of service, but associated components no longer faulty Problem in : zfs://pool=vol0/vdev=d2a89124bf6f2590
not present

Description : A Solaris Fault Manager component generated a diagnosis for which
no message summary exists. Refer to
http://nexenta.com/msg/FMD-8000-11 for more information.

Response : The diagnosis has been saved in the fault log for examination by
Sun.

Impact : The fault log will need to be manually examined using fmdump(1M)
in order to determine if any human response is required.

Action : Use fmdump -v -u <EVENT-ID> to view the diagnosis result. Run
pkgchk -n SUNWfmd to ensure that fault management software is
installed properly.

--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------
Oct 06 09:48:31 f4b2bcc4-ecc2-47ed-8b19-ec3bfbc3bd0b ZFS-8000-GH Major

Host : myhost
Platform : X8DTN Chassis_id : 1234567890
Product_sn :

Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=vol0/vdev=6c7b3f7fadb717e8
faulted but still in service Problem in : zfs://pool=vol0/vdev=6c7b3f7fadb717e8
faulted but still in service

Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://nexenta.com/msg/ZFS-8000-GH for more information.

Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Run 'zpool status -x' and replace the bad device.

--------------- ------------------------------------ -------------- ---------
TIME CACHE-ID MSG-ID SEVERITY
--------------- ------------------------------------ -------------- ---------

Host : myhost
Platform : X8DTN Chassis_id : 1234567890
Product_sn :

Fault class : fault.fs.zfs.vdev.checksum
Affects : zfs://pool=vol0/vdev=fd6a4eda8756f480
ok and in service
Problem in : zfs://pool=vol0/vdev=fd6a4eda8756f480
repair attempted

Description : The number of checksum errors associated with a ZFS device
exceeded acceptable levels. Refer to
http://nexenta.com/msg/ZFS-8000-GH for more information.

Response : The device has been marked as degraded. An attempt
will be made to activate a hot spare if available.

Impact : Fault tolerance of the pool may be compromised.

Action : Run 'zpool status -x' and replace the bad device.


List of faulty resources:
=========: Event Details :========
FMA EVENT: PROBLEM-IN: zfs://pool=vol0/vdev=6c7b3f7fadb717e8
FMA EVENT: AFFECTS: zfs://pool=vol0/vdev=6c7b3f7fadb717e8
FMA EVENT: VOLUME: vol0

pool: vol0
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Mon Oct 6 09:48:35 2014
20.8M scanned out of 93.5G at 647K/s, 42h4m to go
3.42M resilvered, 0.02% done
config:

NAME STATE READ WRITE CKSUM
vol0 DEGRADED 0 0 0
mirror-0 ONLINE 0 0 0
c0t5000C5005D579DABd0 ONLINE 0 0 0
c0t5000C5005D57E2C3d0 ONLINE 0 0 0
mirror-1 DEGRADED 0 0 0
spare-0 DEGRADED 0 0 7
c0t5000C5005D5A89A3d0 DEGRADED 0 0 9 too many errors
c0t5000C5005D581D6Bd0 DEGRADED 0 0 9 too many errors
c0t5000C5005D587E83d0 ONLINE 0 0 0 (resilvering)
c0t5000C5005D5823B3d0 ONLINE 0 0 0
mirror-2 ONLINE 0 0 0
c0t5000C500322BE8CFd0 ONLINE 0 0 0
c0t5000C5005D58270Bd0 ONLINE 0 0 0
mirror-3 ONLINE 0 0 0
c0t5000C5005D586BCFd0 ONLINE 0 0 0
c0t5000C5005D586CF3d0 ONLINE 0 0 0
mirror-4 ONLINE 0 0 0
c0t5000C5005D586F27d0 ONLINE 0 0 0
c0t5000C5005D586FBBd0 ONLINE 0 0 0
mirror-5 ONLINE 0 0 0
c0t5000C5005D58717Bd0 ONLINE 0 0 0
c0t5000C5005D5871FFd0 ONLINE 0 0 0
spares
c0t5000C5005D587E83d0 INUSE currently in use
c0t5000C5005D581D6Bd0 INUSE currently in use

errors: No known data errors
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Yeah, 40°C is the top of the "recommended" range based on Google's white paper on hard drives.

I've owned 15kRPM hard drives. They draw a lot of power (relatively speaking) and they are EXTREMELY hot, requiring LOTS of cooling. They run hot and die hard.

In today's day and age, 15kRPM drives have been replaced by SSDs, which have far better I/O performance than 15k drives and none of the downsides. ;)
 

rs225

Guru
Joined
Jun 28, 2014
Messages
878
Checksum errors? That is a sign of trouble.

Run a memory test. If that is fine, good. Make sure you are doing monthly scrubs. Start trying to figure out what components are common to the drives which have/had checksum errors. While it could possibly have been problems with those drives, it is more likely something else: power, cables, backplanes, enclosures, etc.

Beyond that, your high failure rate (excluding the checksum errors) is probably just inherent to those drives.
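To go with the scrub advice, here's a rough sketch of spotting the devices with nonzero checksum counters in `zpool status` output (the NAME/STATE/READ/WRITE/CKSUM column layout and cNtWWNdN device names are assumed from the alert pasted earlier in the thread):

```shell
# cksum_offenders: print vdev members whose CKSUM column is nonzero.
# Sketch only -- assumes the NAME STATE READ WRITE CKSUM column layout
# and cNtWWNdN device names shown in this thread's zpool status output.
cksum_offenders() {
    awk '$1 ~ /^c[0-9]+t/ && $5+0 > 0 { print $1, "CKSUM=" $5 }'
}

# Example use, plus a monthly scrub (e.g. from cron):
#   zpool status vol0 | cksum_offenders
#   zpool scrub vol0
```

Tracking which physical slots/backplane lanes the offenders share is the quickest way to separate "bad drives" from "bad path."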
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Yeah, 40°C is the top of the "recommended" range based on Google's white paper on hard drives.

I've owned 15kRPM hard drives. They draw a lot of power (relatively speaking) and they are EXTREMELY hot, requiring LOTS of cooling. They run hot and die hard.

In today's day and age, 15kRPM drives have been replaced by SSDs, which have far better I/O performance than 15k drives and none of the downsides. ;)
15kRPM drives are good for:
  • Testing your hearing's upper frequency range when they're spinning up.
  • Sucking lots of power.
  • Running hot.
  • Making you regret not buying SSDs.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
15kRPM drives are good for:
  • Testing your hearing's upper frequency range when they're spinning up.
  • Sucking lots of power.
  • Running hot.
  • Making you regret not buying SSDs.

LOL! So true!
 

Steven Sedory

Explorer
Joined
Apr 7, 2014
Messages
96
Haha. Thanks guys. The upside is that all the RMAs have been free, but that will be coming to an end soon.

I was able to redirect the cooling in my cabinet, which helped quite a bit. All drives have cooled down a bit, but a few still peak as high as 51°C (one is there right now).

I will definitely move to SSDs soon. Any suggestions? Do I need SLC to get the same data integrity? Am I out of line asking these questions on a post about 15k drives, lol?
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
The best bet should be Intel S3700 SSDs, but they aren't SAS, so if you're using dual controllers to serve that cluster, those won't work. It's important to get SSDs with a capacitor, to survive power loss without data loss (or to have a fully redundant power solution with multiple feeds via different UPSes - and even then the internal power distributor could fail...).

In terms of SAS SSDs, the Toshiba PX02SM is nice on price, though probably slower than the S3700. The 800GB size seems to be quite good bang per buck, whereas using more, smaller SSDs for IOPS might be another design to pursue. On top of that, you might want to buy a new 2.5" chassis; those usually pack 24 disks in 2U.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Yeah, 40°C is the top of the "recommended" range based on Google's white paper on hard drives.


In today's day and age, 15kRPM drives have been replaced by SSDs, which have far better I/O performance than 15k drives and none of the downsides. ;)

I wouldn't say SSDs don't have any downsides: you still need wear leveling, and buggy firmware could wipe your data clean.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Haha. Thanks guys. The upside is that all the RMAs have been free, but that will be coming to an end soon.

I was able to redirect the cooling in my cabinet, which helped quite a bit. All drives have cooled down a bit, but a few still peak as high as 51°C (one is there right now).

I will definitely move to SSDs soon. Any suggestions? Do I need SLC to get the same data integrity? Am I out of line asking these questions on a post about 15k drives, lol?

Your SAS controller, if equipped with SFF-8087, should be compatible with SATA; you just need an SFF-8087 to 4x SATA breakout cable. This is what I did on mine.
 

Ericloewe

Server Wrangler
Moderator
Joined
Feb 15, 2014
Messages
20,194
I wouldn't say SSDs don't have any downsides: you still need wear leveling, and buggy firmware could wipe your data clean.

Well, buggy HDD firmware can lose all your data, as well. I think we're past the "dangerous firmware is the norm" era.
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
Well, buggy HDD firmware can lose all your data, as well. I think we're past the "dangerous firmware is the norm" era.

Not the norm, but SSD evolution, like any hardware development, goes through stages similar to software development:
new technologies, new standards, new protocols, new hardware, new firmware, new bugs.
It is a vicious cycle.
 

anodos

Sambassador
iXsystems
Joined
Mar 6, 2014
Messages
9,554
Not the norm, but SSD evolution, like any hardware development, goes through stages similar to software development:
new technologies, new standards, new protocols, new hardware, new firmware, new bugs.
It is a vicious cycle.
And of course OEMs trying to find new ways to cut corners.
 

Steven Sedory

Explorer
Joined
Apr 7, 2014
Messages
96
Well, I just got another one complaining about imminent failure... Is there anything telling/revealing here about what's causing these continued failures?

[root@san1] ~# smartctl -a /dev/da3
smartctl 6.2 2013-07-26 r3841 [FreeBSD 9.2-RELEASE-p12 amd64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST3600057SS
Revision: 000B
User Capacity: 600,127,266,816 bytes [600 GB]
Logical block size: 512 bytes
Rotation Rate: 15000 rpm
Form Factor: 3.5 inches
Logical Unit id: 0x5000c5005d579dab
Serial number: 6SL173050000N14615TE
Device type: disk
Transport protocol: SAS
Local Time is: Sat Nov 22 23:49:06 2014 PST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: FAILURE PREDICTION THRESHOLD EXCEEDED [asc=5d, ascq=0]

Current Drive Temperature: 45 C
Drive Trip Temperature: 68 C

Elements in grown defect list: 492

Vendor (Seagate) cache information
Blocks sent to initiator = 7785009
Blocks received from initiator = 138701224
Blocks read from cache and sent to initiator = 199793
Number of read and write commands whose size <= segment size = 1393082
Number of read and write commands whose size > segment size = 0

Vendor (Seagate/Hitachi) factory information
number of hours powered up = 5347.75
number of minutes until next internal SMART test = 25

Error counter log:
Errors Corrected by Total Correction Gigabytes Total
ECC rereads/ errors algorithm processed uncorrected
fast | delayed rewrites corrected invocations [10^9 bytes] errors
read: 857910 1 0 857911 857911 3.986 0
write: 0 0 2 2 2 73.318 0

Non-medium error count: 6007


[GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
SMART Self-test log
Num Test Status segment LifeTime LBA_first_err [SK ASC ASQ]
Description number (hours)
# 1 Background short Completed - 5326 - [- - -]
# 2 Background short Completed - 5302 - [- - -]
# 3 Background short Completed - 5274 - [- - -]
Long (extended) Self Test duration: 6400 seconds [106.7 minutes]
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
Did you have to RMA an already-exchanged HDD? If not, you might need to swap them all, drive by drive; if they fail again, I'd look for a different model.
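For the drive-by-drive swap, the usual ZFS sequence is replace-and-wait, one disk at a time. A rough sketch (pool and device names are taken from the status output earlier in the thread; the commands are echoed rather than executed, and `zpool wait` only exists on newer builds, so on this vintage you'd poll `zpool status` instead):

```shell
# replace_one: replace a single disk and wait for the resilver to finish
# before touching the next one. Sketch only -- commands are echoed, not
# run; drop the 'echo' to actually execute them on your pool.
replace_one() {
    old=$1
    new=$2
    echo "zpool replace vol0 $old $new"
    echo "zpool wait -t resilver vol0"   # older builds: poll 'zpool status'
}

# Hypothetical example using device names from this thread:
replace_one c0t5000C5005D5A89A3d0 c0t5000C5005D587E83d0
```

Replacing one mirror member at a time keeps redundancy intact in every vdev while the swap is in progress.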
 

Sir.Robin

Guru
Joined
Apr 14, 2012
Messages
554
My NAS02 has old Seagate Barracuda ES drives in it. They've been running at 45°C+ for the last couple of years; 56,000 hours of spinning time now.
I did lose one a month ago... predictive failure. But still, it had over 55,000 hours on it :)
 