Seagate ST3000DM001 - An Adventure, or something like that...

Status
Not open for further replies.

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I would just like to share my experience with my 13 Seagate ST3000DM001 3TB SATA drives. In a word - shitty.

Informational preface:
  • the maximum temperature EVER reported by SMART has been 41C on one drive and 40C on some others, with averages around 34C.
    • The 41C temperature on one drive was reached when I had all 18 spinning drives in my system being scrubbed at the same time. This no longer happens as I fixed the schedule...
  • scrubs run every 21 days for the Raid Z1 backup pool and every 28 days for the Raid Z2 production pool, and can no longer overlap
  • Case has 6 80mm fans and good airflow
  • server runs 24/7/365 in a room temperature environment
I have had these drives for exactly 2 years as they are just now out of warranty. During these 2 years, I have replaced 7 of 13 drives for the following reasons (at a cost to me of $12 each for advanced replacement services):
  • 2 for failure to spin up after a system shutdown (both in the last month when upgrading to 9.3)
  • 5 for SMART errors (SMART still passed but Offline Uncorrectable Errors were increasing, scrubs did have to fix some errors at times)
I realize these drives are "consumer" grade, but 2 of 13 drives failed completely in 2 years. That is a 7.7% annualized failure rate. This agrees closely with a study from Backblaze, which had a MUCH larger sample size. If you add in the 5 drives that did not fail completely but were on their way to serious issues, the ST3000DM001 had a 26.9% annualized issue rate in my system. Yikes.
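
For anyone who wants to sanity-check those percentages, here is a minimal sketch of the arithmetic. The drive counts and the 2-year window come from above; treating every drive as having run the full 2 years is my simplifying assumption.

Code:
# Annualized rate = events / (drives x years of exposure).
# Assumes all 13 drives ran the full 2 years (a simplification).
drives = 13
years = 2
complete_failures = 2      # would not spin up
early_smart_failures = 5   # replaced for growing SMART errors

afr_failed = complete_failures / (drives * years)
afr_issues = (complete_failures + early_smart_failures) / (drives * years)

print(f"annualized failure rate: {afr_failed:.1%}")   # ~7.7%
print(f"annualized issue rate:   {afr_issues:.1%}")   # ~26.9%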

The really AWESOME part is that one of the refurbished warranty drives started throwing SMART errors within DAYS of being resilvered into my pool. This fact likely screws with the above statistics somehow, but my brain is not in the mood to tackle that animal at this time.

Bottom line, if you are going to build a reliable NAS box, it is my opinion that you are best to steer clear of the ST3000DM001 drives. They may be really good $/GB, but they are not the best bang for the buck long term.

Finally, using Raid Z2 and a proper backup strategy has certainly reduced my anxiety due to these failures. Thanks to the FreeNAS team for providing a totally cool way for me to keep my data safe!

Cheers,
 

solarisguy

Guru
Joined
Apr 4, 2014
Messages
1,125
I tend to pay a small additional premium to get a new hard drive and not a refurbished one. Not really because of getting a previously enjoyed one :), but due to the replacements coming with a different warranty...

P.S.
Thank you for sharing the experience!
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I have a total of 24 of the ST3000DM001 drives: 2 NAS boxes, each with 2 vdevs of 6 in Z2.

I've had fairly good luck with them. The drives seem to have a higher infant mortality, but after the first couple of months they seem ok.

That being said, as I need to replace them, I'll go with something different. Probably WD Reds.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Well, I now have 3 more 3TB Seagate drives that are showing the early signs of failure... FML. Now that they are off warranty, I will need to buy new ones. I will likely go with the 4TB version as replacements, and over time the pool will get to be 25% bigger. I will stick with the same drive type as in my backup pool, and then I can buy larger drives for that pool over time (4TB to 6TB is a possibility) so it can grow and I can re-use the backup drives in the production pool.

I do like ZFS and FreeNAS very much but there is always something going down that needs to be addressed :) I guess if it were easy, it would not be as much fun.

Cheers,
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Code:
/dev/da6
smartctl 6.3 2014-07-26 r3976 [FreeBSD 9.3-RELEASE-p10 amd64] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST3000DM001-1CH166
Serial Number:    W1F1EEWN
LU WWN Device Id: 5 000c50 05ca2d959
Firmware Version: CC43
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Apr  3 05:00:03 2015 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Power mode is:    ACTIVE or IDLE

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate     0x000f   089   089   006    Pre-fail  Always       -       176156032
3 Spin_Up_Time            0x0003   092   091   000    Pre-fail  Always       -       0
4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       83
5 Reallocated_Sector_Ct   0x0033   099   099   036    Pre-fail  Always       -       1480
7 Seek_Error_Rate         0x000f   062   056   030    Pre-fail  Always       -       1877627727570
9 Power_On_Hours          0x0032   076   076   000    Old_age   Always       -       21028
10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       83
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   001   001   000    Old_age   Always       -       797
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       1 1 1
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   066   059   045    Old_age   Always       -       34 (Min/Max 31/40)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       83
193 Load_Cycle_Count        0x0032   039   039   000    Old_age   Always       -       122550
194 Temperature_Celsius     0x0022   034   041   000    Old_age   Always       -       34 (0 17 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       88
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       88
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19242h+42m+40.784s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       26176501697
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       333037650998

SMART Error Log Version: 1
ATA Error Count: 3
    CR = Command Register [HEX]
    FR = Features Register [HEX]
    SC = Sector Count Register [HEX]
    SN = Sector Number Register [HEX]
    CL = Cylinder Low Register [HEX]
    CH = Cylinder High Register [HEX]
    DH = Device/Head Register [HEX]
    DC = Device Command Register [HEX]
    ER = Error register [HEX]
    ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 3 occurred at disk power-on lifetime: 20894 hours (870 days + 14 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 00 c0 ff ff ff 4f 00  22d+04:03:32.165  READ FPDMA QUEUED
60 00 40 ff ff ff 4f 00  22d+04:03:32.159  READ FPDMA QUEUED
60 00 c0 ff ff ff 4f 00  22d+04:03:32.151  READ FPDMA QUEUED
60 00 40 ff ff ff 4f 00  22d+04:03:32.150  READ FPDMA QUEUED
60 00 40 ff ff ff 4f 00  22d+04:03:32.150  READ FPDMA QUEUED

Error 2 occurred at disk power-on lifetime: 20631 hours (859 days + 15 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 00 80 ff ff ff 4f 00  11d+05:34:05.169  READ FPDMA QUEUED
60 00 c0 ff ff ff 4f 00  11d+05:34:05.166  READ FPDMA QUEUED
60 00 40 ff ff ff 4f 00  11d+05:34:05.162  READ FPDMA QUEUED
60 00 00 ff ff ff 4f 00  11d+05:34:05.153  READ FPDMA QUEUED
60 00 80 ff ff ff 4f 00  11d+05:34:05.144  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 20582 hours (857 days + 14 hours)
When the command that caused the error occurred, the device was doing SMART Offline or Self-test.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 00 80 ff ff ff 4f 00   9d+04:28:06.680  READ FPDMA QUEUED
60 00 c0 ff ff ff 4f 00   9d+04:28:06.674  READ FPDMA QUEUED
60 00 40 ff ff ff 4f 00   9d+04:28:06.673  READ FPDMA QUEUED
60 00 80 ff ff ff 4f 00   9d+04:28:06.668  READ FPDMA QUEUED
60 00 c0 ff ff ff 4f 00   9d+04:28:06.659  READ FPDMA QUEUED


How can SMART pass a drive with this many issues...
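
Worth noting: the overall self-assessment only goes to FAILED when a normalized VALUE drops below its vendor THRESH, and the scary attributes above (Reported_Uncorrect, Current_Pending_Sector, Offline_Uncorrectable) all carry a THRESH of 000, so their raw counts can climb indefinitely without ever tripping it. A rough sketch of watching the raw values directly instead; the device path and attribute list are just illustrative choices:

Code:
# Flag drives whose raw counts for a few "bad news" SMART attributes are
# non-zero, regardless of the overall PASSED/FAILED verdict.
import subprocess

WATCH = {
    "5": "Reallocated_Sector_Ct",
    "187": "Reported_Uncorrect",
    "188": "Command_Timeout",
    "197": "Current_Pending_Sector",
    "198": "Offline_Uncorrectable",
}

def check(dev):
    out = subprocess.run(["smartctl", "-A", dev],
                         capture_output=True, text=True).stdout
    for line in out.splitlines():
        fields = line.split()
        if fields and fields[0] in WATCH:
            raw = fields[9]        # first token of the RAW_VALUE column
            if raw != "0":         # anything non-zero is worth a look
                print(f"{dev}: {WATCH[fields[0]]} = {raw}")

check("/dev/da6")  # example device path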

Oh well, drive has been replaced. Took 7 hours to resilver which is pretty good given the size of the array.

The other 2 problem drives seem to be stable at 8 offline uncorrectable errors. I have ordered 2 new 4TB drives JIC.

Gotta say that ZFS rocks though!!

Cheers,
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
During these 2 years, I have replaced 7 of 13 drives for the following reasons

Yeah, some of their 1.5TB drives had awesome failure characteristics as well. I am not impressed with the 3TB drives either.
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
They're desktop drives. Designed for 8x5 and light usage. Not for a ZFS array hammering away.

If I may note, there was that batch of WD Reds delivered with the Greens firmware -> high LCC and hundreds of RMA'd drives... and nobody would complain about Greens dying in a NAS environment. And then there were the HGST 2.5" drives, also with firmware issues, hundreds of them dropping like flies in less than 2 months.

In contrast: I got 60x 4TB Seagate Hybrid HDDs spinning in an environment for which they're designed. Client workloads, not 24x7 in a server with ZFS hammering away. Not a single failure yet.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
I know that the desktop drives are not "rated" for NAS use, but there has been a bunch of data that suggests the enterprise drives are not worth the money. And for me, my system does not hammer away 24x7 as it serves my HTPC content. I know I watch more TV than I should, but I don't watch 24 hours a day :)

I also do not get why vendors try to say that drives are only good for an 8 drive array or some such. What does that matter?

Anyway, $160 for a 4TB desktop drive that will give me 2 years (under warranty) and then some. A true enterprise drive is twice the price and there is no data that supports that it will last twice as long as a desktop drive in a demanding environment (although some have better warranties). If it does die after 2 years, I can buy a larger drive for the same money then and grow the array (what I am doing now). In my opinion, it is cheaper to use desktop drives with Z2 or Z3 (or mirroring) than it is to use enterprise drives over the long run. As for performance, I do not need that and with a properly architected ZFS system (system memory, ZPOOL type and size, L2ARC and ZIL SSDs) the spinning disk performance is not as critical. And if IOPS requirements are very high, then use 2.5" fast spindles and lots of them (or SSDs).

Backblaze has some good stats on drive types and I think the 3TB Seagate model I bought was just a lemon. Time will tell if the 4TB drives are any better for my use.

Just resilvering the first 4TB drive into my production array (made up of 3TB drives) now. Data is being written to the 4TB drive at ~80MB/s, similar to the faster-spinning 3TB drives, so I am happy with that.

Cheers,
 

marbus90

Guru
Joined
Aug 2, 2014
Messages
818
So you aren't aware of the WD Red, HGST NAS and Seagate NAS series? The markup is quite marginal, especially for the Reds.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I know that the desktop drives are not "rated" for NAS use, but there has been a bunch of data that suggests the enterprise drives are not worth the money.

They may not be. Even in a much heavier use model than yours. I was one of the people who introduced consumer-grade SATA drives to the Usenet business, where drives get hammered until they diiiiiiie. MUCH heavier random access patterns than Backblaze. Time and time again, it was apparent that cheap was the better option since redundancy could make up for failures.

And for me, my system does not hammer away 24x7 as it serves my HTPC content. I know I watch more TV than I should, but I don't watch 24 hours a day :)

I also do not get why vendors try to say that drives are only good for an 8 drive array or some such. What does that matter?

Because it'd really suck to be Seagate if all your enterprise customers started buying cheap consumer drives.

Anyways, they do allegedly have some better vibration protection in the enterprise class drives, and so the Red or NAS class drives are targeted at smaller arrays. I've seen nothing that convinces me this is significant.

Anyway, $160 for a 4TB desktop drive that will give me 2 years (under warranty) and then some. A true enterprise drive is twice the price and there is no data that supports that it will last twice as long as a desktop drive in a demanding environment (although some have better warranties). If it does die after 2 years, I can buy a larger drive for the same money then and grow the array (what I am doing now). In my opinion, it is cheaper to use desktop drives with Z2 or Z3 (or mirroring) than it is to use enterprise drives over the long run.

Correct. The "I" in RAID was originally "inexpensive" - for a reason. You can maybe squeeze a little more reliability by using a better class of drives, but my choice is to increase redundancy instead. If I can get two consumer grade drives for the price of a single enterprise drive, that's almost always a win.

As for performance, I do not need that and with a properly architected ZFS system (system memory, ZPOOL type and size, L2ARC and ZIL SSDs) the spinning disk performance is not as critical. And if IOPS requirements are very high, then use 2.5" fast spindles and lots of them (or SSDs).

2.5" consumer drives will probably not be as fast as their 3.5" counterparts - especially since many of them are variants on laptop drives. However, their density is substantially greater and power consumption lower. That said, the VM storage server I'm working on will have two dozen 2.5" drives.

Backblaze has some good stats on drive types and I think the 3TB Seagate model I bought was just a lemon.

The Backblaze stats are commonly understood not to be "good" for values of "good" beyond "we didn't intentionally harm bits while writing this blog." You can walk away from their numbers with some specific inferences, but hard to generalize them. I could just as easily tell you that I've seen a 50% eventual failure rate on the 3TB'ers (true!) while the 4TB'ers seem to be much more resilient, but just as with Backblaze, my data suffers some defects too.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
So you aren't aware of the WD Red, HGST NAS and Seagate NAS series? The markup is quite marginal, especially for the Reds.

I am aware of those drives, but my opinion is that they are not worth the small premium. Especially because those drives come with the recommendation of only being good for up to an 8-disk array. Marketing hubbub IMHO.
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
2.5" consumer drives will probably not be as fast as their 3.5" counterparts - especially since many of them are variants on laptop drives. However, their density is substantially greater and power consumption lower. That said, the VM storage server I'm working on will have two dozen 2.5" drives.

I agree, I would not use consumer (laptop) 2.5" drives for fast applications. If speed is a real need, use the 10K or 15K SAS drives. Given the 2.5" drives are lower capacity, you need more of them and that allows for better speed through increased spindles (as long as they are configured appropriately).

The Backblaze stats are commonly understood not to be "good" for values of "good" beyond "we didn't intentionally harm bits while writing this blog." You can walk away from their numbers with some specific inferences, but hard to generalize them. I could just as easily tell you that I've seen a 50% eventual failure rate on the 3TB'ers (true!) while the 4TB'ers seem to be much more resilient, but just as with Backblaze, my data suffers some defects too.

There are three kinds of lies: lies, damn lies and statistics. :)

Correlation is never causation. Backblaze data agrees with my own experience as well as other reports I have seen on other forums. Thanks to ZFS (and an OK backup strategy), I really don't much care about the failures from a data perspective. More of an annoyance. Here is hoping the 4TB drives are more reliable.

Cheers,
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
I agree, I would not use consumer (laptop) 2.5" drives for fast applications. If speed is a real need, use the 10K or 15K SAS drives. Given the 2.5" drives are lower capacity, you need more of them and that allows for better speed through increased spindles (as long as they are configured appropriately).

Also increased redundancy. Some of this gets really bizarre as you plug away at the numbers.

Consider that you want a machine for VM storage, which means mirrors. A two-way mirror loses redundancy if any drive fails, so three-way mirrors are kind of the minimum standard for high reliability. And let's concede that some warm spares need to be available, which reduces slots available.

In a 2U 12 3.5" chassis, you can fit 10 x 6TB drives (one warm spare, plus slots for up to two more). That's three vdevs of 6TB each, three wide mirrors, for 18TB, at $260/drive, -> $2600, or ~$144/TB

In a 2U 24 2.5" chassis, you can fit 22 x 2TB drives (one warm spare, plus slots for up to two more). That's seven vdevs of 2TB each, three wide mirrors, for 14TB, at $99/drive, -> $2079, or ~$149/TB

Of course, the 21 drives in the 2.5" pool are each individually slower than the 9 drives in the 3.5" pool, but in aggregate there's a lot more of them. Better for IOPS.
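
A little sketch of the same $/TB arithmetic, so it's easy to re-run with other drive prices. I count every purchased drive including the warm spare, which is why the 2.5" figure lands a few dollars above the one quoted above.

Code:
# Rough $/TB comparison for the two chassis options above.
# Counts every purchased drive, warm spare included (a choice).
def dollars_per_tb(drives_bought, price_each, vdevs, tb_per_vdev):
    usable_tb = vdevs * tb_per_vdev
    cost = drives_bought * price_each
    return cost, usable_tb, cost / usable_tb

options = {
    '2U 12x 3.5" bays, 6TB drives, 3-way mirrors': (10, 260, 3, 6),
    '2U 24x 2.5" bays, 2TB drives, 3-way mirrors': (22, 99, 7, 2),
}
for label, args in options.items():
    cost, tb, rate = dollars_per_tb(*args)
    print(f"{label}: ${cost} for {tb}TB usable -> ${rate:.0f}/TB")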
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Makes my brain hurt.

Last place I worked, we bought an N-Series SAN from IBM: $25,000 for 5TB of usable space with redundant controllers, and over $200K for a 3-host VMware cluster with some Cisco 3750 switches, UPS and racks. Just floors me what some companies charge because they offer "support" for their equipment.

At home, I built a 33TB+ NAS for about $5,000 (started off small, has grown to this size, 20TB in production, 13TB backup). I know the loading is different, but the irony is, my NAS has never failed and the SAN from IBM did (bad controller that was not properly configured by IBM). That caused a wee bit of consternation as it ran a plant SCADA system... :) Blame that one on piss-poor commissioning; we also found that all the servers and switches were plugged into only 1 of the 2 UPSs and that the UPSs would not charge when on generator due to the voltage tolerance settings... Prior to leaving that job, we un-fooked that turd. Good ole big blue...

I wish Oracle was less lame and ZFS was used more often. The price of some big-name storage solutions is just plain ridiculous (we were quoted ~$40K to add 20TB to an existing SAN by IBM recently). And snapshots are AWESOME.

Oh well, got a little off topic there.

Over 50% resilvered already since this morning. Will replace the second 3TB drive that is starting to be silly later today. That will be three 3TB drives replaced in 2 weeks.

Cheers,
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
Just floors me what some companies charge because they offer "support" for their equipment.

Just floors me what some companies will pay for equipment. I get all twitchy-eyed just thinking about signing off on capex, and then I find out other companies are paying many times what we are...

nerd-rage.jpg
 

Apollo

Wizard
Joined
Jun 13, 2013
Messages
1,458
If only nobody would buy from them, then they would drop prices like crazy.
 

stuartsjg

Dabbler
Joined
Apr 14, 2015
Messages
18
I'm going through this "enterprise" or "NAS" grade decision right now. For the last year I've had a 4x2TB RAID5 array on a Dell PERC using Seagate ST32000542AS drives installed inside a desktop PC, serving as my main data drive (SSD C), and they have been faultless and all SMART tests pass.

This has been fine, and the controller was very robust at not losing a drop of data even when I accidentally booted the PC with 2 out of 4 drives disconnected! Reading on here, I'm not sure if ZFS, despite its benefits, would recover from such a silly mistake.

This set of Seagate drives is likely to end up in my FreeNAS machine (RAIDZ1 via the mobo onboard SATA), but I feel I'll only trust them with non-critical data (TV, movies, music). It's backed up, so if I have an issue it's no problem.

For the irreplaceable data, although it's backed up to Carbonite, Livedrive and external hard drives, I am looking to do a mirror array. It's this mirror array I'm likely to use a higher-grade drive for.

My background is electrical engineering, and the company I work at builds life-critical equipment for the offshore oil and gas industry. When designing PCBs and the like, we will often end up paying perhaps 10x the price for a resistor, capacitor, etc. that comes with a MIL-spec or other certification. The difference is perhaps £0.02 vs £0.20, and for our volume the price makes no odds. Build 100,000 TVs for supermarket customers and this would be unacceptable.

Now, the 10x more expensive component is probably exactly the same as the cheaper one; however, it has an MTBF or FIT (failures in time) figure which, after a lot of number crunching with every other part in your product, gives you a PFD (probability of failure on demand). This can then be used to certify a SIL (safety integrity level) figure. This lets your client know whether they need single, double, triple, etc. redundancy for their application, which will have a SIL target dictated by a safety or national standards agency. This certification is expensive, which is why the SCADA system you describe would have cost so much. All this doesn't (yet...) take into account programming or configuration, so as with anything, a bad setup makes good equipment useless.

With the NAS and enterprise drives, they all vouch for a lower error rate and higher MTBF. These can often mean orders of magnitude of difference in the overall chance of there being data loss. This will be determined by binning of platters, heads, PCBs, motors and the like, as well as by perhaps selecting a higher-grade component at the design stage. They may also have more QA checks at the factory, all of which permit the manufacturer to vouch for certain levels of integrity.
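
For what it's worth, the datasheet figures relate to each other simply: FIT is failures per 10^9 device-hours, MTBF is its reciprocal, and under the usual constant-failure-rate assumption the annualized failure rate follows from an exponential model. A small sketch, with MTBF inputs that are purely illustrative rather than from any particular datasheet:

Code:
import math

HOURS_PER_YEAR = 8766  # 365.25 days

def fit_from_mtbf(mtbf_hours):
    """FIT = expected failures per 10^9 device-hours."""
    return 1e9 / mtbf_hours

def afr_from_mtbf(mtbf_hours):
    """Annualized failure rate under a constant failure rate (exponential) model."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# Illustrative datasheet-style figures, not tied to any specific model:
for label, mtbf in [("desktop-class", 750_000), ("NAS/enterprise-class", 1_200_000)]:
    print(f"{label}: {fit_from_mtbf(mtbf):.0f} FIT, AFR ~ {afr_from_mtbf(mtbf):.2%}")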

This great calculator lets you see the difference a better drive may have on your system. As with any probability figure, the absolute value will vary with real life but relative to another figure is often sensible.

My 4 x 2TB RAID5 array using these Seagate ST32000542AS desktop drives gives an MTTDL of 83,300 years. Even if I had a data center of 10,000 arrays like this, chances are I would lose data every 8 years or so. For home usage, this statistic is likely to be acceptable.

The same array with Seagate NAS ST2000VN000 drives gives me 213,000 years. Again, a huge number that is impractical in everyday usage, but you could deduce you are roughly 2.6 times less likely to suffer data loss.
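
For anyone curious what sits behind calculators like that, the classic single-parity (RAID5) approximation is MTTDL ~= MTTF^2 / (N * (N-1) * MTTR): data loss needs a second drive to die inside the rebuild window of the first. Real calculators usually also fold in unrecoverable read errors during rebuild, so this sketch will not reproduce the 83,300 or 213,000 year figures exactly; the MTTF and rebuild-time inputs are assumptions for illustration.

Code:
# Classic RAID5 MTTDL approximation: data loss requires a second drive
# failure while the array is rebuilding from the first.
#   MTTDL ~= MTTF^2 / (N * (N - 1) * MTTR)
# (Ignores unrecoverable read errors during rebuild, which real
#  calculators typically include.)
HOURS_PER_YEAR = 8766

def mttdl_raid5_years(n_drives, mttf_hours, mttr_hours):
    return mttf_hours ** 2 / (n_drives * (n_drives - 1) * mttr_hours) / HOURS_PER_YEAR

# Illustrative inputs: 4-drive RAID5, 24-hour rebuild, two MTTF classes.
for label, mttf in [("desktop-class", 750_000), ("NAS-class", 1_200_000)]:
    print(f"{label}: ~{mttdl_raid5_years(4, mttf, 24):,.0f} years MTTDL")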

Failure rates are like the term "a 1-in-50 year storm". It could happen tomorrow, or it could happen in 49 years 264 days time.
 

jgreco

Resident Grinch
Joined
May 29, 2011
Messages
18,680
This great calculator lets you see the difference a better drive may have on your system. As with any probability figure, the absolute value will vary with real life but relative to another figure is often sensible.

My 4 x 2TB RAID5 array using these Seagate ST32000542AS desktop drives gives an MTTDL of 83,300 years. Even if I had a data center of 10,000 arrays like this, chances are I would lose data every 8 years or so. For home usage, this statistic is likely to be acceptable.

The problem is that we're already quite aware that this isn't based in reality.

In reality, the likelihood of losing one of the three remaining drives in a 4-drive RAID5 set is fairly high, and I'm not even talking single-bit loss, I'm talking large-scale-to-total drive failure, which renders the thing incapable of retrieving some large amount of useful data. With no redundancy at that point, it is certainly going to hit you in a data center far more often than what you suggest.

As with so many numbers, the statistical numbers that manufacturers like to quote are not all that useful to those of us in the trenches who are actually responsible for protecting the data. We know failures are more likely than manufacturers like to suggest. "MTBF" is a wonderfully deceptive concept, and is nicely discussed by some companies such as Digi, which is a pretty good explanation. I fall under the "MTBF is not accepted by all customers (some regard it as a meaningless number, both absolutely and relatively)" category. Most of the other reliability statistics quoted to us are similarly useless.

mod note- edited for offsite link update, thanks @pro lamer
 
Last edited:

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Reliability engineering is so much fun.

The biggest thing to remember is that redundancy is the best way to decrease failures. Two 99% available systems in parallel give a system availability of 99.99%. Add one more in parallel and you get 99.9999%.
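
That follows from assuming the units fail independently: the combined system is only down when every unit is down at once. A one-liner to play with (99% is just the example figure from above):

Code:
# Availability of N independent units in parallel: the system is down
# only if every unit is down at the same time.
def parallel_availability(unit_availability, n_units):
    return 1 - (1 - unit_availability) ** n_units

for n in (1, 2, 3):
    print(n, f"{parallel_availability(0.99, n):.6f}")  # 0.99, 0.9999, 0.999999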

I know availability and data loss are different (controller failure does not mean data loss, but does mean accessibility loss). Still, the principle of redundancy holds true when setting up storage solutions. The more redundancy the better, but at some point there is a diminishing return on investment.

My opinion is that if "enterprise" grade drives are twice the price, then I would rather buy a few more desktop drives and take advantage of the increased redundancy rather than expect the enterprise drives to be individually more reliable.

Example for illustration:

Let's assume:
  • cheap component is 80% reliable in a Raid Z3 11 drive system
  • enterprise component is 85% reliable in a Raid Z2 10 drive system (using silly numbers so the decimals are not so small) and is twice the price
Given the above assumptions, we then have two k-out-of-n systems: the cheap pool is 11-need-8 and the enterprise pool is 10-need-8. Results are:
  • cheap system reliability of 83.8861%
  • enterprise system reliability of 82.0196%
I know these numbers are not the right absolute values for hard drives, but it does show that increased redundancy can help overcome the individual component reliability issues faced by cheaper components. This example shows that by adding 1 additional cheap drive, the pool can be at least as reliable at just over half the enterprise cost.

Now, let's apply this to mirrored pools of size 2-need-1 for enterprise and size 3-need-1 for cheap (for each vdev):
  • cheap system reliability of 99.200%
  • "enterprise" system reliability of 97.775%
Again, 150% drive count allows the cheaper and less reliable drives to offer a significantly better system reliability at ~75% of the cost of the enterprise solution.

None of the above focuses on performance at all. I am not saying the enterprise drives do not have their place but I will not use them in my home system :)

The above numbers were generated using "simple" reliability theory and Excel (I hate all things to do with probability but I have to deal with this crap from time to time at work...).
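
The same numbers drop out of the standard k-out-of-n binomial reliability formula, so here is a small sketch that reproduces them without Excel (the 80%/85% per-drive figures are the illustrative assumptions from above):

Code:
from math import comb

def k_of_n_reliability(n, k, p):
    """Probability that at least k of n independent drives (each working
    with probability p) are still working: sum of binomial terms."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# RAID-Z3 11-need-8 with 80% drives vs RAID-Z2 10-need-8 with 85% drives
print(f"cheap Z3 pool:      {k_of_n_reliability(11, 8, 0.80):.4%}")  # ~83.8861%
print(f"enterprise Z2 pool: {k_of_n_reliability(10, 8, 0.85):.4%}")  # ~82.0196%

# Per-vdev mirrors: 3-need-1 cheap vs 2-need-1 enterprise
print(f"cheap 3-way mirror:      {k_of_n_reliability(3, 1, 0.80):.3%}")  # 99.200%
print(f"enterprise 2-way mirror: {k_of_n_reliability(2, 1, 0.85):.3%}")  # 97.750%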

Any feedback would be great; if I got this wrong, I had better think hard about my chosen career as a maintenance engineer!! And yeah, probability is voodoo magic and to be taken with a grain of salt... but what else have we got?

Cheers,
 

Scharbag

Guru
Joined
Feb 1, 2012
Messages
620
Just floors me what some companies will pay for equipment. I get all twitchy-eyed just thinking about signing off on capex, and then I find out other companies are paying many times what we are...

nerd-rage.jpg

The biggest battle we faced when purchasing IT equipment was that we could never get staff positions to support the server-side hardware. So we could never do stuff ourselves, and that made us rely on IBM etc. Their support is not worth the paper it is printed on IMHO, even though they charge a TON of money for anything they sell.

The other issue was standards. Every business unit had their own internal support level staff (mice and KB replacers) and they all did their own thing. Little islands of money, all similar but not the same. And finally, the data was never given a value on a balance sheet. They lost 3 months of payroll data one time and were not worried because they had paper copies... FML. One of the reasons I moved on.
 