Intel passed power-loss-protected SSD tests


cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Check this out:

http://hardware.slashdot.org/story/...protected-ssds-tested-only-intel-s3500-passes

This was almost an impossible task: after months of searching the shortlist was very short indeed. There was only one drive that survived the torturing: the Intel S3500. After more than 6,500 power-cycles over several days of heavy sustained random writes, not a single byte of data was lost. Crucial M4: failed. Toshiba THNSNH060GCS: failed. Innodisk 3MP SATA Slim: failed. OCZ: failed hard. Only the end-of-lifed Intel 320 and its newer replacement, the S3500, survived unscathed. The conclusion: if you care about data even when power could be unreliable, only buy Intel SSDs.


That is freakin' AMAZING!
 

Michael Wulff Nielsen

Contributor
Joined
Oct 3, 2013
Messages
182
Incredible, I would have thought that most SSDs would survive that kind of torture. Makes me happy to have an Intel SSD in my Mac.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I've been buying only Intel drives since the G2 days. Not because I'm an Intel guy, but because when I started reading about how the technology works internally I felt that Intel had it together the most. They've had some hiccups with firmware problems, but by far Intel's drives seem to be the best designed. I have no regrets about my decision either. I've been responsible for about 15 Intel SSDs since the G2s hit the market; none have failed and none show a remaining lifespan below 90% (I think they are all above 95%, but I don't keep close tabs on them). I'd say that's pretty good considering many of the drives are 4 years old.

Even today, I've been semi-blindly buying Intel SSDs. When I read reviews I just don't see other companies doing the same kind of exhaustive testing of SandForce controllers that Intel does (partly because they don't have the budget for that kind of testing), and I've taken the stance that if their drives worked well in the past, they are likely to continue to make good products.

There's no doubt that for your standard desktop it's probably safe to use some of the other brands. Samsung, Crucial, and Kingston come to mind.

But I like to stick with what I know, and I've always felt my data was safely stored with Intel. So far, it has been.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
The only SSD I've had fail on me was an OCZ Vertex. And it failed in the exact same way that paper described. Not seen by the OS, nothing. Just like there was no drive connected at all.

I still have 4 of the original JMicron SSDs that suffered from the 2-3 random writes per second problem. Other than being slow, they still work very well.

I have 2 Intel X25-M 160GB G2s from 2009 or so that have always been in a RAID 0 array. Used primarily as a Windows system drive, and now being used as a scratch / download drive. SMART info still shows 99% SSD life left.

Also have 5 SandForce-based 240-gig SSDs from various manufacturers (SanDisk Extreme II / Corsair Force GT, I think), and they have all worked fine.

Funny that my only failed SSD was an OCZ. And about 6 months after its warranty was up, too. It actually died while I was using the laptop it was in. All of a sudden the C: drive disappeared. Totally gone. D: (optical drive) still showed up, and Windows was still running; it just couldn't access C:. Rebooted the laptop, and the drive never came back. I don't remember which firmware was on the drive. It was probably upgraded beyond the 1.6 that the article recommended.
 

Knowltey

Patron
Joined
Jul 21, 2013
Messages
430
Just got an Intel SSD for Christmas, so good to know. Granted, I have the computer it's in and my NAS on a medical-grade UPS, so it's probably not even an issue to be concerned about.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
My OCZ that I bought looong ago (an OCZ Summit) apparently doesn't like being power cycled. My files on it (all were backed up) got trashed... verified by md5 sums. So it's being replaced by an Intel. Can't say I'm too terribly surprised. They've always been somewhat flaky for me. I bought the OCZ before Intel had the G2.
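For anyone wanting to do the same kind of check, here is a minimal Python sketch of comparing current checksums against previously recorded ones. It assumes a hypothetical md5sum-style manifest file ("manifest.md5") was saved before the corruption happened:

import hashlib
from pathlib import Path

def md5_of(path, chunk_size=1 << 20):
    # Hash the file in chunks so large files don't need to fit in RAM.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# "manifest.md5" is a hypothetical file of "checksum  path" lines,
# as produced by md5sum, recorded before the drive was power-cycled.
for line in Path("manifest.md5").read_text().splitlines():
    expected, _, path = line.partition("  ")
    if path and md5_of(path) != expected:
        print("corrupted:", path)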

It was cheap and I wanted to see if SSDs really were the next best thing to hardware RAM drives. At the time I had over $2500 worth of RAM in a few RAM drives in a giant RAID 0. Booting up and using the system was ultra fast. But the cost was hard to swallow. Clearly wasn't for those that want to build a fast server on the cheap. ;)
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Here's the current version... looks like the author is doing some updates to it since it's drawn people's attention.


Revision History
Published: 27 Dec 2013 - first published
Updated : 28 Dec 2013 - add TODO list of Samsung 840 and Crucial M500
Updated : 29 Dec 2013 - add Editor's note after slashdot article
Updated : 1 Jan 2014 - Add stec-inc S230 SATA Slim

Editor's note 29Dec2013

Thank you to everyone for the input on the Slashdot story.
The additional drives suggested for consideration are extremely useful, but
they will have to go through the same process of cost-benefit - followed only
then by reliability - analysis that the other drives went through, with the
additional handicap that the Intel S3500 has already "won" and been selected
for live deployment.

Which brings me to a keen point that is difficult to express when there
are 275 slashdot comments to contend with. The belief that Intel paid
for this report comes through loud and clear. Those who believe that
are severely mistaken. Let's look at it again.

Statement of fact: The S3500 SSD happens to be the sole drive which
a) is cost-effective
b) passed all the extreme tests
c) is within budget
d) was clearly marked in the online marketing as "having power loss protection"
e) is not end-of-life

So let us be absolutely clear:

Fact: the Intel S3500 was the only drive which matched the requirements

That it did so comprehensively despite the extreme nature of
the testing, which lasted several days whilst all other drives failed within
minutes, is the real key point of this report.

However that point - that success - is itself also completely irrelevant
beside the fact that the testing itself provided the company that commissioned
the work with an amazingly high level of confidence in "an SSD" despite their
complete paranoia which had driven them to commission the testing in the first
place. To make that clear:

The company doesn't care about Intel: they care about a reliable drive

If there were other drives that had passed or were known about or could have
been found, they would have been added to the list already.

Analysis of SSD Reliability during power-outages

This report was originally commissioned due to the remote deployment of
over 200 32gb OCZ SSDs resulting in severe data corruption in over 50%
of the units. The recovery costs were far in excess of the costs saved
by purchasing the cheaper OCZ units. They were replaced rapidly over a
period of years by Intel SSD 320s, where, despite remote deployment of
over 500 units there have only ever been three unrecoverable failures.

However, the Intel 320 SSD has reached end-of-life, so a replacement was
sought. Due to paranoia over the OCZs an in-depth analysis was requested.
Around the time that the paranoia was hitting, a report had come out
on slashdot, covering power-related corruption.
It made sense therefore to attempt to replicate that report, as it was
believed that the data corruption of the OCZs was related to power loss.

This report therefore covers the drives selected and the testing that was
carried out. We follow up with a conclusion (summary: if you care about
power loss don't buy anything other than Intel SSDs - end of story) and
some interesting twists.

Picking drives for testing

The scenario for deployment is one where huge amounts of data simply are
not required. An 8gb drive would be able to store 1 month's worth of sensor
data, as well as have room for a 1.5gb OS deployment. A 16gb drive stores
over two months. Bizarrely, except in the Industrial arena the focus
is on constant increases in data storage capacity rather than data
reliability. The fact that shrinking geometries automatically result
in higher susceptibility to data corruption is left for another time,
however.

Additionally, due to the aforementioned paranoia and assumptions that the
data loss was occurring due to loss of power, the requirements to have
"Power Loss Protection" were made mandatory. Power Loss Protection is
usually found in Industrial and Server Grade SSDs, which are typically
more expensive.

So, finding a low-cost, low-capacity, reliable SSD reported to have
"Power Loss Protection" proved... challenging. After an exhaustive search,
the following candidates were found:

Crucial M4 128gb
The unpronounceable Toshiba THNSNH060GCS 60gb
The new Intel S3500
The Innodisk 3MP Sata Slim (8gb and 16gb)



The Innodisk units came in around £30, whilst all the other drives came
in at between £60 and £90. Also added to the testing was the original
32gb Vertex OCZ and the Intel 320.

Test procedure

The original report at the FAST conference was quite hard to replicate:
the report is a summary rather than containing detailed procedures or
source code. A best effort was made and then extended.

* OS-based test. The first test devised was to boot up a full OS and to power-cycle it using a mains timer. This test turned out to be completely lame, except for its negative results proving that simply switching power on and off was not the root cause of problems.
* OS-based huge parallel writes. The second test was to write huge numbers of files and subdirectories in parallel. Thousands of directories and millions of small files as well as one large one were copied, sync'd then deleted using 64 parallel processes. Power was not pulled during this test. (A sketch of this test is given after the list.)
* Direct disk writing. This test was closer to the original FAST report, except simplified in some ways and extended in others.
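As an illustration only (this is not the author's actual torture program), a minimal Python sketch of the second test might look like the following; the mount point, worker count and file counts are assumptions:

import hashlib
import os
from multiprocessing import Pool

TARGET = "/mnt/ssd-under-test"   # assumed mount point of the drive under test
WORKERS = 64                     # the report used 64 parallel processes
FILES_PER_DIR = 1000             # illustrative; the real runs used far more

def worker(worker_id):
    # Each worker creates its own directory full of small random files,
    # syncs, reads everything back to verify, then deletes it all.
    d = os.path.join(TARGET, "dir%02d" % worker_id)
    os.makedirs(d, exist_ok=True)
    sums = {}
    for i in range(FILES_PER_DIR):
        data = os.urandom(4096)
        path = os.path.join(d, "f%05d" % i)
        with open(path, "wb") as f:
            f.write(data)
        sums[path] = hashlib.md5(data).hexdigest()
    os.sync()                    # flush everything out to the drive
    bad = 0
    for path, expected in sums.items():
        with open(path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != expected:
                bad += 1
        os.remove(path)
    return bad

if __name__ == "__main__":
    with Pool(WORKERS) as pool:
        print("corrupted files:", sum(pool.map(worker, range(WORKERS))))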



Crucial M4

The Crucial M4 was tested with an early prototype version of the SSD
torture program. It was power-cycled approximately 1,900 times over a
48 hour period. Data was randomly written, sync'd and then read back,
whilst power-cycling was done on a random basis between 8 and 25 seconds
through the read-sync-write cycle. Every 30 seconds the geometry was
checked and a smartctl report obtained.
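A minimal sketch of that write-sync-read-verify loop, with a smartctl snapshot every 30 seconds, could look like the Python below. The device path is an assumption, the power-cycling itself was done externally (a mains timer) so it is not modelled, and running this destroys whatever is on the device:

import os
import random
import subprocess
import time

DEV = "/dev/sdX"        # assumed: the SSD under test (requires root)
BLOCK = 1 << 20         # 1 MiB test blocks
SPAN = 8 << 30          # restrict the test to the first 8 GiB of the device

def cycle(fd):
    # Write a random block at a random offset, sync, read it back, compare.
    offset = random.randrange(0, SPAN, BLOCK)
    data = os.urandom(BLOCK)
    os.pwrite(fd, data, offset)
    os.fsync(fd)
    return os.pread(fd, BLOCK, offset) == data

fd = os.open(DEV, os.O_RDWR)
last_report = 0.0
try:
    while True:
        if not cycle(fd):
            print("verify FAILED at", time.ctime())
        if time.time() - last_report > 30:
            subprocess.run(["smartctl", "-a", DEV])   # periodic SMART report
            last_report = time.time()
finally:
    os.close(fd)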

After approximately 1600 power-cycles, the Crucial M4's SMART report showed
over 20,000 CRC errors. Within 1900 power-cycles, that number had jumped
to 40,000 CRC errors and had been joined by serious LBA errors.

Conclusion: epic fail. Not fit for purpose: returned under warranty.

Toshiba THNSNH060GCS 60gb

This drive turned out to be a little more interesting. It passed the OS-based
parallel writes test with flying colours. Running for over 20 minutes, several
million files and directories were created and deleted. In between each run
no filesystem corruption was observed.

Then came the direct-disk writing. It turns out that if the write speed is
kept below around 20mbytes/sec, the Toshiba THNSNH060GCS is perfectly capable
of retaining data integrity even when power is being pulled, even when there
are 64 parallel threads all writing at the same time.

However when the write speed exceeds a certain threshold, all bets are off.
At higher write speeds, data loss when power is pulled is only a matter
of time (minutes).
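A minimal sketch of how the write rate might be held under that threshold while still syncing each chunk is shown below; the file path, sizes and target rate are assumptions, and it is single-threaded for simplicity:

import os
import time

PATH = "/mnt/ssd-under-test/throttle.bin"   # assumed test file on the drive
TARGET_MBPS = 15                            # stay under the ~20 mbytes/sec threshold
CHUNK = 1 << 20                             # 1 MiB per write
TOTAL = 1 << 30                             # write 1 GiB in total

with open(PATH, "wb") as f:
    start = time.monotonic()
    written = 0
    while written < TOTAL:
        f.write(os.urandom(CHUNK))
        f.flush()
        os.fsync(f.fileno())                # push the chunk to the drive
        written += CHUNK
        # Sleep just long enough to hold the average rate at TARGET_MBPS.
        target_elapsed = written / (TARGET_MBPS * 1024 * 1024)
        sleep_for = target_elapsed - (time.monotonic() - start)
        if sleep_for > 0:
            time.sleep(sleep_for)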

We conclude from this that the Toshiba THNSNH060GCS does have power-loss
protection circuitry and firmware, but that the internal power reservoir
(presumably supercapacitors) simply isn't large enough to cover saving the
entire outstanding cache of writes.

Conclusion: close, but no banana.

Innodisk 3MP Sata Slim

There were high hopes for these drives, based on the form-factor and low cost.
However, unfortunately they turned out to have rather interesting firmware
issues.

The observed write-then-read speeds (a write followed by a verify step)
turned out to be adversely affected by the number of parallel writes. If
there were no parallel writes (only one thread) then it was possible to
write and then read at least 18 mbytes per second (i.e. the data was written
at probably 30mbytes/sec then read at probably 45mbytes/sec, except that
the timer was started at the beginning of the write and stopped at the end
of the read). This speed was sustained.
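For clarity, a minimal sketch of that measurement (clock started before the write, stopped after the verify read). The path and size are assumptions, and a real run would also need to drop or bypass the page cache so the read actually hits the drive:

import os
import time

PATH = "/mnt/ssd-under-test/wtr.bin"   # assumed test file on the drive
SIZE = 64 << 20                        # 64 MiB per measurement

data = os.urandom(SIZE)
t0 = time.monotonic()

with open(PATH, "wb") as f:            # write phase
    f.write(data)
    f.flush()
    os.fsync(f.fileno())

with open(PATH, "rb") as f:            # verify (read) phase
    ok = f.read() == data

elapsed = time.monotonic() - t0
print("write-then-read: %.1f mbytes/sec, verified=%s"
      % (SIZE / elapsed / 1e6, ok))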

However, if there were even just two parallel write-read threads, the speed
was sustained for approximately 15 seconds and then dropped down to 1 (one!)
mbyte/sec. The more threads were introduced, the less time it took for
the write-then-read speed to drop to a crawl.

Paradoxically, if the torture program was suspended temporarily even for
a duration of a few seconds, then when it was resumed the speed would shoot
back up to 18 mbytes / sec and then once again plummet.

We conclude from this that either the CPU on the Innodisk SATA Slim or the
algorithms being used are just too slow to deal with parallel writes. There
is clearly a RAM cache which is being filled up: the speed of writing to the
NAND itself is not an issue (because if it was, then single-threaded writes
would be slow as well). So it really is a firmware / CPU issue: when the
cache is full of random parallel data, the firmware / CPU goes into meltdown,
cannot cope, and the write speed suffers as a result.

To Innodisk's credit, they actually responded and were given a copy of
the SSD torture program and instructions on how to replicate the issue.
It will be interesting to see how they solve this one: updates will be
provided.

Conclusion: wait and see.

OCZ Vertex 32gb

This was also interesting. The OS-based test (which was ordered to be run,
despite reservations that it would be ineffective) showed absolutely ZERO
data corruption. Let's repeat that. When picking one of the worst
drives with the worst smartctl report ever seen that was still functional
from a batch with over 50% failure rates and using it to install an OS and
then leaving it to power-cycle over 100 times there was ZERO data
corruption.

What we can conclude from this is that power-loss had absolutely nothing to
do with the data-loss. What it was then necessary to do was to devise a
test which would show where the problem actually was. This test was the
"OS-based huge parallel writes" test. Running this test for a mere 5 minutes
(bear in mind that there was no power-cycling) resulted in immediate data
corruption.

Further investigation was therefore warranted. OCZ (before they went into
liquidation) had been advising - without explanation - to upgrade the firmware.
After working out how this can be done on GNU/Linux systems, and after
observing in passing that the firmware upgrade system was using syslinux
and FreeDOS, the firmware was "downgraded" to Revision 1.6.

The exact same OCZ - with an incredible array of failures, CRC errors,
lost sectors as reported by smartctl - when downgraded to firmware Revision
1.6 - then showed ZERO data corruption when the exact same OS-based
parallel write testing was carried out.

Which is fascinating in itself.

Further investigation then dug up an interesting nugget: it turns out that
OCZ apparently had been warned by Sandforce not to enable a switch in
the firmware which would result in "increased speed". OCZ, in their desperate
attempt to remain "king of the speed wars" ignored the advice that doing so
would result in data corruption. The results correlate with this advice:
at higher speeds, data corruption is guaranteed to occur.

The hypothesis here is that at higher speeds there is a bug in the firmware
which results in the data being written incorrectly. What was not determined
was whether that data was simply... not written at all or whether it
was written in the wrong place. Given that out of the 50% failed drives a
number of them actually could not be seen on the SATA bus at all, it seems
likely that at high speeds, OCZs with the faulty firmware are actually capable
of overwriting their own firmware! However, actually demonstrating this
is beyond the scope of the tests carried out, not least because it would
require wiping an entire drive, carrying out some parallel writes, then
checking the entire drive to see where the writes actually ended up.
This test may be added to the suite at a later date.

Once the firmware was downgraded to Revision 1.6, the drive-level testing
was carried out (there was no point doing so while the drive's firmware could
not even maintain data integrity when power was provided). Surprisingly,
the drive fared pretty well. Sustained random speed levels were good, but
data was lost intermittently when power was pulled, especially
(like the Toshiba) at higher speeds.

Conclusion: buy cheap, flash firmware to 1.6 if power-loss not important

Intel 320 and S3500

As already hinted at, these drives simply could not be made to fail, no matter
what was thrown at them. The S3500 was power-cycled some 6,500 times for
several days: several terabytes of random data were written and read from that
drive. Not a single byte of data was lost. Despite even the reads being
interrupted, there was not a single time - not even once - when the S3500
failed to verify the data that had been written.

The only strange behaviour observed was that the write-then-read cycle
speeds tended to fluctuate, sustaining around 25 to 30mbytes of write-then-read
speed continuously for several minutes then dropping after 10 or so minutes
to 20 or even 12 mbytes / sec for one (and only one) write-read cycle.
The only plausible explanation for this is some housekeeping going on
in the firmware, which would take up CPU cycles for short durations.

Conclusion: don't buy anything other than Intel SSDs

Conclusion

Right now, there is only one reliable SSD manufacturer: Intel.
That really is the end of the discussion. It would appear that Intel is
the only manufacturer of SSDs that provides sufficiently large on-board
temporary power (probably in the form of supercapacitors) to cover writing
back the entire cache when power is pulled, even when the on-board cache
is completely full.

The Toshiba drives have some power-loss protection, but it's not
enough to cover an entire cache. The Innodisk team have tried hard: their
datasheet shows that they are also providing power-loss protection as well
as detecting when power and current drop to unsustainable levels.
Given how difficult it is to even find out whether manufacturers provide this
kind of capability at all, it is worth giving Innodisk credit for
at least making that information publicly accessible.

The OCZ Management deserve everything that's happened to OCZ. They should
have listened to Sandforce: the history of SSDs would have been a radically
different story. The sad thing is that when the firmware is downgraded,
the drives are no worse than any other consumer-grade SSD.

The Crucial M4 is probably okay for general use, as are all the other drives
(except the Innodisk until they fix the firmware issues to get the sustained
write speeds back). And so, if it's possible to buy them cheap, and
power-loss is not an issue, getting hold of second-hand OCZ Vertex drives
and downgrading the firmware would not be that bad an option.

However, if data integrity is really important, even when power could be
pulled at any time, then there really is absolutely no question: get an
Intel SSD. It's as simple as that.

Future

On the TODO list is to write that test which wipes the drive, carries out
random writes, then checks the entire drive to see if the writes went in
the correct places. On the face of it this seems such an obvious thing
that drives should do, but the OCZ Vertex drives show that it's an
assumption that cannot be made.
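A minimal sketch of what such a test might look like: stamp each written block with its own block index, then scan the whole device and check that every block either still holds the wipe pattern or carries the index it claims to be at. The device path is an assumption, and the sketch wipes the drive:

import os
import struct

DEV = "/dev/sdX"                 # assumed device under test (requires root)
BLOCK = 4096

def stamp(idx):
    # A block whose first 8 bytes encode the block index it was written to.
    return struct.pack("<Q", idx) + b"\xAA" * (BLOCK - 8)

fd = os.open(DEV, os.O_RDWR)
blocks = os.lseek(fd, 0, os.SEEK_END) // BLOCK

# Phase 1: wipe with zeros, then write stamped blocks at chosen indices.
for idx in range(blocks):
    os.pwrite(fd, b"\x00" * BLOCK, idx * BLOCK)
written = set(range(0, blocks, 1024))          # every 1024th block
for idx in written:
    os.pwrite(fd, stamp(idx), idx * BLOCK)
os.fsync(fd)

# Phase 2: scan the whole device and check that data is where it should be.
misplaced = 0
for idx in range(blocks):
    expected = stamp(idx) if idx in written else b"\x00" * BLOCK
    if os.pread(fd, BLOCK, idx * BLOCK) != expected:
        misplaced += 1
print("blocks not where they should be:", misplaced)
os.close(fd)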

The Innodisk drives are ones to watch: the price and tiny size make it well
worth continuing to work with Innodisk to see if they can solve the problem of
parallel-write-cache overload.

Other drives may prove to be as good as the Intel S3500, however they were
not tested during this research because other drives were either way outside
of the budget, or it was impossible to find out from even exhaustive Internet
searches as well as speaking to suppliers whether the other potential
candidates had any form of power-loss protection.

If anyone would like to find out if a particular make or model of drive is
reliable under extreme torturing and power-interruption, contact
lkcl@lkcl.net: a contract can be arranged
and this report updated.

Lastly, it is worth noting that this testing was only carried out for
a maximum of a few days sustained writing. The long-term viability
obviously has not been tested. However, given that deployment of over
500 Intel 320 SSDs has been carried out and only 3 failures observed
over several years, it would be reasonable to conclude that Intel S3500s
could be trusted long-term as well, bearing in mind - as a cautionary
tale - that smaller geometries mean more unreliability for the firmware
to contend with.

TODO Updated: 28th Dec 2013

Thank you to everyone who's recommended drives since this report was published.
The initial investigation is basically over: the Intel S3500 was top of the
list as it was the only one that passed. However, based on unit cost it could
well be the case that the investigation is reopened.

Recommended drives for consideration at a later date:
* Samsung 840
* Crucial M500 (first Crucial drive with power-loss capacitors)
* Intel 540 series (which are apparently made differently from S3500 and 320s)
* stec-inc S230 SATA Slim

Recommended tests:
* Use new linux kernel 3.8 "cmd flush disable" option to check data integrity
* "Power brown-outs" (reducing current intermittently) as an advanced test
 

russnas

Contributor
Joined
May 31, 2013
Messages
113
I had a feeling Intel was more reliable in this field.

I've had 4 OCZ SSDs, purchased because they were cheap. A Vertex 2 and some other variation failed just after 15 months (with a 3-year warranty); I got replacements and sold them, along with a Vertex 3, as they were only good for PC game data. One I opened had been repaired before.

The Samsung EVO drives have great performance, but I have more glitchy issues with them than I do with Intel.

I have never used them with FreeNAS, but Intel would be my first choice over other consumer drives.
 

D4nthr4x

Explorer
Joined
Feb 28, 2014
Messages
95
The only Intel drives that were true failures were the ones that were using third-party controllers. It doesn't look like they tested Samsung drives, or the PNY XLR drives with the high-endurance NAND (not that I think they are better, but they are advertised as more rugged). I also wonder about enterprise-level drives.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Here's some more info on a particular SSD test being run.

http://hardware.slashdot.org/story/...ment-writes-one-petabyte-to-six-consumer-ssds

Note that someone made a comment in the discussion that I think is worth mentioning:


SSDs can be easily forced to do a whole erase/write cycle just by writing single bytes into the wrong sector.
There is no need to waste bus bandwidth with a petabyte of data.
The problem was never the amount of the information.
The problem was always the IO pattern which might accelerate the wear of the media.

This is absolutely true! An SSD that can survive a PB of sequential writes is not necessarily 10x better than an SSD that can survive 100TB of random writes. So do keep that in mind if you read the article (and do read the first article from last year if you haven't).
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
The point I get from these articles is that the amount of data should never be a problem in normal use. 600TB is more than enough, even if you have a stupid app writing single-byte thingies around the disk. I'm not sure how normal use could amplify enough to equal the abuse of 600TB during the lifetime.
And if you have such usage or IO requirements you should probably use redundancy anyway.

Although I don't particularly like the design of bricking themselves at EOL.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
Well, let's take a hypothetical "nasty" case. We write exactly 512 bytes at a time, all the time. Each write will more than likely consume 4k (or whatever your cluster size is). So you will see an 8x magnification in the amount of wear-and-tear on your memory, as you only write 512 bytes out of every 4k, but the memory cells are either used, unused (but allocated), or erased. In other words, you will need to write 8x less data to wear out the drive. Now let's make things even worse....

What if it's actually less than 512 bytes? If we use single-byte writes like you mentioned above, that's a 4096x increase in the amount of wear-and-tear over the quantity of data written.

If the cluster size isn't 4k but 8k, and it is still 512 bytes per write, you're going to see a 16x increase in the wear and tear.
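The arithmetic here is just cluster size divided by write size; a quick sketch:

# Wear amplification if every host write dirties a whole cluster.
def wear_amplification(write_bytes, cluster_bytes):
    return cluster_bytes / write_bytes

for write, cluster in [(512, 4096), (1, 4096), (512, 8192)]:
    print("%d-byte writes, %dK clusters: %.0fx"
          % (write, cluster // 1024, wear_amplification(write, cluster)))
# 512-byte writes, 4K clusters: 8x
# 1-byte writes, 4K clusters: 4096x
# 512-byte writes, 8K clusters: 16x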

If you do a lot of deliberate garbage collection or force TRIM commands out-of-band, you will also force the SSD to consolidate space. In these 2 situations you'll perform zero writes, but internally data will be moved around. If you were one of those idiots that just had to do a GC every time you booted your machine when those 2nd-gen SSDs hit the market, you definitely put excessive wear and tear on your SSDs. I always laughed because they'd do daily GC on bootup, and then wonder why they had to replace their SSD 3 times in a year.

Intel's Toolbox recommends weekly, but I've run into more people than not that do it daily at 3am or something like that. Not smart at all, and they're foolish for doing that if they care about drive life. When I've asked them what they *think* the benefit is, they never know. They just know that they want daily. When I ask them "what makes you think you know more than Intel if you can't even tell me what it does?" they are often dumbfounded and realize they are foolish for their thought process. ;) Critical thinking skills are disturbingly lacking in the IT industry.

I don't know about you but I change the oil in my BMW every 100,000 miles because I know that's when I want to pay for it. See the problem?

Anyway, Intel's TRIM in the Toolbox simply forces the OS to re-issue TRIM for all space that *should* be free and *should* be trimmed but may not have been for various reasons (for example, you delete 50GB of data and then immediately shut down your machine). Generally this is self-correcting even if you never use the Toolbox, but there may be a small performance penalty for this. When I say small I'm talking <5% from what some people have claimed. It is not necessary to run it daily as it's meant to catch anything that *might* not be trimmed but *should* have been.

Too many people do things, don't have a clue what it means, but know that they want it with no actual data to back up their desire.
 

no_connection

Patron
Joined
Dec 15, 2013
Messages
480
Still, it would take a stupid amount of 512-byte writes, and I see no normal usage that would generate it. Even with 100x amplification you still need to write 16GB of strange data a day for a year to get to 600TB of "raw" data usage. And if I didn't get the math wrong, that's around 380 IO/sec.
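For what it's worth, a quick check of that math (assuming 16 GiB/day of 512-byte writes and 100x amplification):

daily_bytes = 16 * 2**30     # 16 GiB of small writes per day
per_write = 512              # bytes per write
amplification = 100          # assumed wear amplification

iops = daily_bytes / per_write / 86400
raw_tb_per_year = daily_bytes * 365 * amplification / 1e12
print("%.0f writes/sec, about %.0f TB of raw wear per year"
      % (iops, raw_tb_per_year))
# ~388 writes/sec and ~630 TB/year, in the same ballpark as the figures above.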
I am more worried about firmware bugs or bad construction than wearing the thing out.
 

cyberjock

Inactive Account
Joined
Mar 25, 2012
Messages
19,526
I am more worried about firmware bugs or bad construction than wearing the thing out.

I agree 100%. My Intel G2 160GB that I bought in 2010 is currently estimated by SSDLife Pro to wear out around 2123 or something.

The bottom line is that if you are just using your SSD in a desktop application it will pretty much outlive your need for that particular size. If you do things that are extremely write-heavy (think torrents, video editing, huge game installs and uninstalls several times a week) you "might" run into problems after 3-5 years. And I say "might" with a very, very large margin.

I had concerns with SSDs back in 2008-2010, and with OCZ's entire product line, but overall the reality is that today's SSDs should last so long you will want to throw out that "small 250GB SSD" in 2020 because it will be too small for Windows 12, but you aren't going to wear it out.
 

titan_rw

Guru
Joined
Sep 1, 2012
Messages
586
I agree 100%. My Intel G2 160GB that I bought in 2010 is currently estimated by SSDLife Pro to wear out around 2123 or something.

Exactly.

I bought two X25-Ms in 2009 or so. They've always been hidden behind a RAID controller (RAID 0), so they've never had GC, TRIM, or anything SSD-specific done to them. They were my primary Windows drive for 2-3 years; now they're a scratch drive for downloads, VM storage, etc.

The SMART attribute "SSD life left" or whatever it's called? Still at 99%. I'm waiting for it to drop to 98% at some point.

I know the SMART attribute does move, because I have an X25-V that is sitting at 92% because it was 'abused' for a while doing some unnecessary writes.

And yes, I'm with Cyberjock on the OCZ thing. I've had a lot of SSDs over the years, from the first-gen JMicron-based SSDs that had the horrible random write performance all the way to the new TLC NAND stuff from Samsung. The only SSD I ever had fail was an OCZ Vertex. And it was like 3 months after the 3-year warranty was up.
 

Yatti420

Wizard
Joined
Aug 12, 2012
Messages
1,437
I've only ever seen OCZ SSDs hard fail. One of the original OCZ SSDs.

Sent from my SGH-I257M using Tapatalk 2
 