Discussion thread: How reliable are SSDs?

vbimport

#1

Our user ChristineBCW said in the chat this:

I am not happy with SSD reliability. We’d put out a year’s worth - 200? and had maybe 40 die.

Which makes me wonder what experiences of others are when it comes to SSDs?


#2

Almost all of the major brands have had some issues, even Intel. Crucial had a killer flaw that would stop their M4 drives after a set number of hours (later fixed through a firmware update). But the SandForce SSDs, especially those using the SF-2281 controller, seem to have had the most complaints. From discussions I’ve seen around the net, the latest rounds of firmware updates really helped those SandForce-based drives, but lots of folks still don’t trust them.

As for my personal experience, a sample size of one is hardly worth talking about. :bigsmile: So far so good with my Crucial M4, and yes it has the updated firmware straight from the factory.


#3

This was a Friday lunch-chat session, and a few wise ideas were bounced around that often had disastrous-APPEARING consequences to bothered customers.

For example, we all want MORE space. MORE capacity. Me, too.

I hate buying 40GB or 60GB SSDs. Just too puny.

However, this small size does enforce a user-level discipline of OS & Program Files Only.

Data? We don’ need no steekin’ data!

The great tendency of SSD Failures seems to be On Startup. Like lightbulbs, they “pop” when power’s first applied. If they’re left on, maybe they’d have died at the same usage moment, too - but so far, all of our SSD failures have been ‘at power-up’.

The customer reaction to this is almost window-jumping suicide time. “My computer’s dead! Everything’s ruined! I’ve lost my entire business!”

Uh. No. I can reload Windows and the few programs, and poof, up and running. NO DATA LOSS.

Because the SSD is soooo puny that you don’t dare save any picture or song on it… good user discipline pays off - not one bit o’ user data is lost.

And cloning the Boot Drive with a second identical cheapo 40-60GB SSD is an easy weekly chore. Or leave it Mirrored - what’s the chance that both SSDs blow up on the same power-up? (Wait wait - don’t answer that!)

The Small SSD size is a very nice User Discipline enforcement Whip, I think. Cloning off the drive occasionally - weekly? Monthly? Just once? - also saves beaucoup re-load time.

For those customers of ours (none of whom have found this forum, thankfully!) who can’t discipline themselves, “You didn’t think your data was important enough to back it up and now your actions have proven it! It’s gone! Poof! Enjoy the consequences of your choices.”

I’m almost certain Charles Darwin would have had a law about these species.


#4

Am I right?

Have folks ‘lost’ their SSDs at power-up times? Have any of them died while in use? We haven’t had any of those - not that we believe. We think our customers have turned on computers only to find the boot drives dead.

But they MIGHT have left them on overnight… some of them were never certain.


#5

For any meaningful discussion, you should differentiate hardware reliability from software/driver issues. Dealing with anecdotal reports, that’s near impossible.

Most folks involved with consumer electronics will tell you that maybe 50-75% of all returns, complaints or “failures” are directly related to user error. Wrong driver, wrong software, wrong setup, etc., etc., etc.

With SSDs you add a couple more layers of OS drivers, MB BIOS, not to mention questionable MB support for uber-fast drives.

Anyhow, Vertex-3 here with over 5000 hrs on it and no issues that were not related to software or firmware.

Let’s not forget the platter drives fail pretty regular too. Do we hold SSD to a different standard?


#6

No, but we’ve put out 200 boot-drives on SSDs in about 18 months, and had 40 die. We don’t have anywhere near that in mechanical drives. Two or three out of thousands of HDDs in 2-3 years.
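To put those numbers side by side, here is a rough Python sketch. The SSD figures (40 of 200 in about 18 months) come from the post above; the HDD fleet size of 2000 and the 30-month window are assumptions standing in for "thousands" and "2-3 years". It also assumes a constant monthly failure probability and that all drives were in service for the whole period, which overstates exposure since the drives went out gradually.

```python
def failure_rate(failed, deployed):
    """Fraction of deployed drives that failed."""
    return failed / deployed


def annualized(rate, months):
    """Convert a cumulative failure rate over `months` into a yearly rate,
    assuming a constant failure probability per month."""
    survival_per_month = (1.0 - rate) ** (1.0 / months)
    return 1.0 - survival_per_month ** 12


# SSD numbers quoted in the thread: 40 dead out of 200 in about 18 months.
ssd_rate = failure_rate(40, 200)
ssd_annual = annualized(ssd_rate, 18)

# HDD numbers: "two or three out of thousands in 2-3 years".
# ASSUMPTION: 3 failures, a fleet of 2000, over 30 months.
hdd_rate = failure_rate(3, 2000)
hdd_annual = annualized(hdd_rate, 30)

print(f"SSD: {ssd_rate:.1%} cumulative, roughly {ssd_annual:.1%} per year")
print(f"HDD: {hdd_rate:.2%} cumulative, roughly {hdd_annual:.2%} per year")
```

Even with generous assumptions for the HDD side, the gap is a couple of orders of magnitude, which is why the mix of SSD models matters so much here.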

Yes, our SSD install base is much smaller, but it’s on a pretty stable set of MBs, too. The one thing that changes - week to week - is the SSD model number. If we’ve purchased 30 from a wholesaler in a month, and go back the next month, we’ll invariably get a different model number.

We realize part of this is because they’re a hot commodity for manufacturers and so many ‘assemblers’ are buying up last-gen NAND chips from ?? Whoever.

We haven’t been able to figure out if these are circuit board failures, power, NAND, yet - nothing seems to be consistent. Just “poof”.

I’ve been told that this wasn’t an uncommon phenomenon in the world of storage - that new gens of HDDs would always fail more often in the first year after that generation’s introduction. Seems logical.


#7

It’s true that SSDs tend to just die without warning, and most appear to fail when first powered on. However, failure rates are exaggerated to a large extent.

I closely followed the SandForce SF-2281 BSOD issue, and from what I saw, anyone who had a BSOD and happened to have an SF-2281-based SSD automatically blamed the SSD for the BSOD. I also followed the manufacturers’ support forums during this period, and most of these problems turned out not to be the SSD, but in actual fact a system stability problem or a driver-based problem.

Most people still blame only SandForce for this problem, but I will stick to the facts here on how it was fixed, at least for the vast majority of SandForce users.

Fact 1
99% of these problems only manifested themselves on Intel-based chipsets; AMD chipset users were almost untouched.

Fact 2
Several Intel chipset “SATA option ROMS” were updated, not only once, but it was more or less a monthly event. The same applied to the RST driver versions. This cured about half of the remaining problems.

Fact 3
Intel chipsets from P55 to Z68 had a power management bug, which still remains unfixed.

Fact 4
SandForce released several firmware updates to address the problem, with varying success. The big firmware fix for the BSOD problem came just after Intel launched the 520 series, which was also accompanied by another SATA option ROM update. Coincidence?

Moving on
Here are some bugs with SSDs that I can name off the top of my head; most have been fixed by firmware updates.

Indilinx Barefoot
Data corruption when the volume is nearly full. (Fixed with firmware 1.70)

Intel G1 and G2, bricked when a security password was applied (fixed)
Intel 320 series 8MB bug which reduced the volume to 8MB (fixed)

SandForce SF12xx sleep bug which sets the drive into panic mode, requiring the SSD to be replaced (still unresolved)

Crucial M4 5184-hour bug: once the SSD clocked up 5184 power-on hours, it would become unresponsive (fixed with a firmware update).
Crucial C300 and M4 LPM issues, which can cause the SSD to just drop out (for the most part fixed, but Crucial still recommends that LPM (Link Power Management) is disabled).

Samsung 830. When the partition scheme is GPT, the Magician software corrupts the volume. (Fix: use MBR instead.) :slight_smile:

To finish and get some things into perspective.
Here is a list of the SSDs I have, starting with the oldest first.
OCZ Core V2 60GB (JMicron) No problems.
OCZ Apex 120GB (JMicron) I have two of them and one failed. The other has no problems
Intel G1 80GB No problems
OCZ Vertex 120GB (Indilinx Barefoot) no problems
OCZ Agility 120GB (Indilinx Barefoot) No problems
Crucial C300 128GB (Marvell) no problems
OCZ Vertex 2 100GB (SandForce SF12xx) no problems
OCZ RevoDrive X2 240GB (4X SandForce SF12xx on a PCIe2 x4 card) no problems
OCZ Vertex 3 240GB (SandForce SF2281) no problems
OCZ Octane 512GB (Indilinx Everest) no problems on SATA3 but drops out of SATA2 on Z68 chipset.
Intel 520 240GB (SandForce SF2281) no problems on Z77 chipset, but drops out when connected as a spare on Z68 chipset.
OCZ Vertex 4 512GB (Indilinx Everest 2) Pre-production sample was bricked by trying to flash production firmware, ooops :slight_smile: The replacement has had no problems since the buggy 1.4 firmware was updated.
SanDisk Extreme 120GB (SandForce SF2281) no problems


#8

Dee, thanks so much for this. I’ve sent this back to the office and maybe one of our ‘experts’ (cough cough) can get a scope on his experiences and let me know. I’m uncertain what products we’ve installed - I assume there will be 10 or a dozen varieties, each with several units.

Harley, in the CHAT session, brought up his “I clone instead of Mirror” approach to his SSD, and he reinforced my notion that a disciplined user will lash his Data to other HDDs so, if there’s a death in the SSD family, he’s got a quick replacement handy.

Your note also helps me understand why our SSD folks have been paying attention to firmware updates so frequently. It’s like the early days of CD Burners haggling with CD Blanks, I suppose.


#9

Dee, one other question - have you observed a Death-While-Operating? The ones I’ve seen have all been on power-up. I assume this “D-W-O” will be quick? Or maybe will there be some DELAYED WRITE warning message then BSOD?


#10

Backups of all data should always be done, regardless of whether it’s an SSD or HDD.

@Christine
I’ve only had one SSD actually fail, the OCZ Apex. It was working one day then didn’t work the next. The PC was off overnight.

Have you ever witnessed more than one SSD fail on one of these PCs?


#11

Dee, I’ve only seen the “refusal to start up” failures. Our customers have sometimes waffled on whether they left an SSD-PC up and running overnight, and most of them can’t remember what their first steps were - with certainty.

“I came in, the screen wasn’t on, so I turned the computer on. Or I thought I did. Did I turn off a dead computer, and then turn it back on?”

Those “first thing in the morning” kind of absentminded answers.

Some have sworn that they left ‘that’ computer on all the time, and then it was dead when they came in the next morning, meaning it probably was a “death during operations” of some sort.

Lightning strikes? Power surges? Hand of God? Terrorist flying errant amps into that one computer-cable?!!

On the plus side, I think all of them RMA’d just fine, although I’ve heard the complaint “I asked if they could describe what went wrong but never heard a word from them.” (I’d find out we’d RMA’d back to our distributor, who has Shipping Clerks, not Tech’s. du-uh… even MY head coulda figgered that one out!)

I have no idea which is the more likely element to ‘die’ - a circuit board? An electrical connector? One chip or another? Basically, the question may be the eternal chicken and egg variety: “Circuit board or Cache or NAND?” That would be interesting to discover, though.


#12

About two years ago I had an OCZ Agility 60GB (Indilinx Barefoot) fail while in use. I left the PC for an hour and came back with an “Unbootable Boot Disk” message, i.e. when a boot drive drops out, Windows will BSOD and automatically reboot. The only warning sign at the time was a BSOD a few days earlier with a message about the boot volume being unexpectedly dismounted.

The only other SSD failure in use I came across was a friend’s OCZ Vertex 2 120GB. He said that while using his laptop, it stopped responding, so he forced the power off and the SSD did not appear during boot. However, it is unclear whether Windows stopped responding due to the SSD failing or whether the laptop hung for some other reason, with the SSD coincidentally failing when powered off and on. He said that his laptop had never hung while in use before and that he’d had the SSD for almost a year.

For protection, I have Windows 7 Backup scheduled to make a weekly image to a HDD. Should an SSD fail (such as the above incident), I just put in a spare HDD, run the Windows 7 recovery DVD and am back up & running within an hour. As others have mentioned about data, I have my Documents, Pictures, etc. folders mapped to a HDD. Once I get the replacement SSD, I either rerun another recovery or just mirror the SSD over (e.g. with Linux ddrescue.)
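For anyone who wants to script the "image it and check it" step, here is a minimal Python sketch of a block-by-block copy followed by a checksum comparison. It is only an illustration: the function names and the 1 MiB chunk size are my own choices, it is not how Windows 7 Backup or ddrescue work internally, and unlike ddrescue it makes no attempt to retry or skip unreadable sectors.

```python
import hashlib

CHUNK = 1024 * 1024  # read 1 MiB at a time (arbitrary choice)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def clone_image(src, dst):
    """Copy src to dst chunk by chunk, then verify both checksums match."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for block in iter(lambda: fin.read(CHUNK), b""):
            fout.write(block)
    return sha256_of(src) == sha256_of(dst)


# Hypothetical usage (paths are placeholders):
# ok = clone_image("ssd_backup.img", "/mnt/hdd/ssd_backup_copy.img")
```

Reading a raw device instead of an image file generally needs administrator rights, and the source should not be in use while it is copied.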

The risk of an SSD failing doesn’t bother me and I already have an SSD in my Netbook and family laptop. The postage for sending an SSD back is €2 to €4 depending on the country, and with the last SSD I RMA’d with OCZ, I had the replacement within a week. On the other hand, I would not rely on an SSD for backing up data to, but then again SSDs are too expensive to use as a backup medium.


#13

The OCZ SSD importer/distributor in South Korea has had some higher-than-normal return rates. But it’s a much smaller market (than Europe and North America) and OCZ never had serious market share here (nobody can compete against Samsung in its domestic market). None of the other SSD makers/importers had significant return/replacement rates, excluding those even smaller than OCZ (and don’t forget OCZ itself remains a startup with tiny capital and R&D capacity). Even the OCZ Agility and Vertex 2 and 3 series had far less than 5%.

Return or replacement rates are not defect rates. Defects are far fewer than replacements. Replacements are far fewer than returns. The machines need to be shipped back to OCZ for further checks, and that costs a lot of money and time. Supermicro can afford some things sometimes, but it used to be that a Supermicro importer, or a Lian-Li importer for that matter, typically had $200 or $500 profits for every motherboard it sold. Technicians cost a lot, and it’s usually the kind of people far less experienced and knowledgeable than, well, me, or anyone among the CDFreaks reviewers, that interact with the users and customers or the retailers or the resellers or wholesalers. You cannot expect OCZ employees to know fully well about malfunctioning bugs or specific chip defects, either. They simply don’t care because they were not hired to do too much. So it is rational to reach and talk directly to the engineers responsible for chip manufacturing and firmware coding, and discuss in depth the issues widely circulated and reasonably suspected.

Drives are replaced 1:1 on site, no questions asked, if that’s the sensible way to do ‘service’ or ‘care’. Most of the returned drives are of course fully functional, and it happens with all kinds of drives: CD, DVD, Blu-ray, HDD, SSD, USB… and with all kinds of manufacturers: Samsung, Lite-On IT, OCZ, LG, Mtron, Seagate, Western Digital, and even the notorious IBM during its worst days of Deathstars. Retailers and manufacturers accept and replace those returns, or RMAs, to gain favor and support among consumers, websites, reporters, governments, and so on.

What AMD and IBM do in the US market may be and may look just and honest and honorable and nice. The same companies respond differently in some other markets, and many times in the exact opposite way, and not everything is AMD’s or IBM’s fault, but a distributor’s or importer’s or a retailer’s, or a local government’s. Since the US consumers are most powerful, posting the most, and so on, Samsung and Seagate, OCZ and Intel, Toyota and Sony… all had to provide the fastest and the most expensive services to them, whether it was because of faulty engines, or tires, or chips, or batteries, or even pixels. That led not to the megadeath of IBM’s hard disk drives, but to its business division being sold to Hitachi of Japan. So OCZ was sold and if there are much-higher-than-normal numbers of complaints against Seagate’s SSDs, it will be sold as well.

One of the reasons so many SSDs are returned is user prejudice against SSDs. They simply don’t know, or don’t care, or are told by someone that it’s because of their SSDs, and there are enough cases in the real world where technicians and customer support personnel actually provide standard replies: “Never use SSD. It’s unreliable.” That’s similar to what the cable guys tell home users, especially the young girls living alone: “Cable is superior to FTTH. Your Internet is slow because your computers are poorly working, or slow.” Both are plain lies in most cases. Almost any computer equipped with the latest SSDs and RAM-based drives can handle 100Mbps or 1Gbps downloads. Cable on such computers often remains the most outdated component, and cable (in South Korea) inherently can fail literally thousands of times more frequently than FTTH.

Another good thing about living in a dense area like Seoul is virtually everything I actually buy and use, and every importer, distributor, reseller, service center, etc. is within bus and subway rides that cost $1 or a little over $1. So I can overhear and listen to what other customers tell the engineers, or the girls in uniform, and what the representatives and managers of importers and distributors tell me or reporters, or reviewers. Trade secrets exist, but generally known insights and publicly distributed statistics are valid enough to trust.


#14

I’d love to spend a year there, yes, absolutely. Maybe I could practice my Laser-Eyes look on poor technicians! “Give me what I want!!” I’d wear heels, too!


#15

[QUOTE=ChristineBCW;2646126]I’d love to spend a year there, yes, absolutely. Maybe I could practice my Laser-Eyes look on poor technicians! “Give me what I want!!” I’d wear heels, too![/QUOTE]

On top of heels, you are gonna look two heads taller than some, or most, of them. :slight_smile:


#16

Oh, I don’t know - we were watching the South Korean volleyball team at the Olympics, and there were some pretty tall women doing smashes and blocks. Great looking women, too.


#17

[QUOTE=ChristineBCW;2646173]Oh, I don’t know - we were watching the South Korean volleyball team at the Olympics, and there were some pretty tall women doing smashes and blocks. Great looking women, too.[/QUOTE]

Average female height of South Korea in 2012 is under 160 cm. :slight_smile:

But the South Korean Olympic team will do well.


#18

Volleyball, like badminton, still depends on the ball hitting the floor. And great short defenders can wear out tall hitters faster than the other way around. Those hitters get tired in long rallies, invariably hitting a ball long. Three jumps in a row, and that third one is so often ‘out’, or else becomes a dink that can feed a quick kill in reply, or a great, patient block to stuff it back.

Badminton helped me with patience and vision in all of my court sports because the birdy travels slower as it travels farther. Allowing myself to take “one more step” and do sweeping off-the-deck hits gave me a chance to play with better foot-work positions. That was always my problem in volleyball.

Once Hubby proved to me that VB was a game of the feet, not hands, I had a lot more confidence in front and back rows. I still don’t bend well - or fast enough, so my digging is not a great thing, nor is returning serves. Ah well… “I can trip the other players with my big feet! Doesn’t that count for something?” Yeah, until they start wearing hobnail boots!

Anyway… thanks for all the input on the SSD reliability experiences.


#19

What is causing those SSDs to become unreliable? HDDs wear down due to mechanical problems, don’t they?


#20

SSDs, like all other devices based on solid state chips, can wear down physically over time. Chips are, at a very basic level, tiny switches flipping on and off at a certain voltage. The chips can eventually age to a point where the tiny voltages are no longer within the design specifications, and at this point the affected silicon is no longer good.

With flash modules, different voltages represent different storage states. The more writes, the more wear on the modules, and the lower the chance that the memory module can successfully store the voltage, which means the lower the chance the data is effectively written. Over a long period of time, the voltages will no longer represent the intended storage state, and there you have corruption. Now, with SSDs, this is avoided by having spare area on the drive for defect management (also present in HDDs), and wear leveling exists partially for this reason. Also, certain types of flash modules, due to the way they are designed to cram more data in the same amount of space, are more susceptible to physical wear; the chips have smaller tolerances for being out of spec, and must be handled a bit more delicately so no single chip is overworked. (Quite a bit of this is analogous to the way hard drives and/or RAM modules work.)
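A toy Python simulation can show why wear leveling matters. This is a deliberately simplified sketch, not how any real controller works: the block count, the write count, and the "always pick the least-worn block" policy are all assumptions chosen for illustration.

```python
def simulate(num_blocks, writes, hot_blocks, wear_leveling):
    """Return per-block erase counts after `writes` block rewrites.

    Without wear leveling, the workload keeps hitting the same few
    'hot' blocks. With it, every write is redirected to the block
    that currently has the fewest erases.
    """
    erase_counts = [0] * num_blocks
    for i in range(writes):
        if wear_leveling:
            target = erase_counts.index(min(erase_counts))
        else:
            target = i % hot_blocks
        erase_counts[target] += 1
    return erase_counts


naive = simulate(num_blocks=100, writes=10_000, hot_blocks=5, wear_leveling=False)
leveled = simulate(num_blocks=100, writes=10_000, hot_blocks=5, wear_leveling=True)

print("max erases without wear leveling:", max(naive))
print("max erases with wear leveling:   ", max(leveled))
```

The same total number of writes lands on the drive either way; spreading them out just keeps any one block from hitting its limit long before the rest.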

People can hasten physical failure by performing large numbers of benchmarks, which eat up the finite number of write (program/erase) cycles available for normal use. Wear leveling is effective in normal circumstances, and benchmarks are not normal circumstances.
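As a back-of-the-envelope illustration of the benchmark point, the Python sketch below estimates how long the write budget lasts under light and heavy use. Every number in it is an assumption picked for the example (a 120GB drive, 3000 program/erase cycles per cell, a write amplification of 2, and 10GB versus 500GB of host writes per day), and it assumes perfect wear leveling.

```python
def estimated_lifetime_years(capacity_gb, pe_cycles,
                             host_writes_gb_per_day, write_amplification):
    """Rough endurance estimate assuming perfectly even wear.

    Total NAND write budget = capacity * rated P/E cycles.
    Actual NAND writes per day = host writes * write amplification.
    """
    total_nand_writes_gb = capacity_gb * pe_cycles
    nand_writes_per_day = host_writes_gb_per_day * write_amplification
    return total_nand_writes_gb / nand_writes_per_day / 365


# ASSUMED figures, for illustration only.
typical = estimated_lifetime_years(120, 3000, 10, 2.0)
benchmark_heavy = estimated_lifetime_years(120, 3000, 500, 2.0)

print(f"light desktop use:      about {typical:.0f} years of write budget")
print(f"constant benchmarking:  about {benchmark_heavy:.1f} years of write budget")
```

The point is the ratio rather than the absolute numbers: writing fifty times as much per day shrinks the write budget fifty-fold.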

Another problem might be firmware bugs or other general software bugs. These plague traditional hard drives, too, resulting in undetected drives, drives negotiating bad connection speeds, drives with wrongly reported capacities, or drives that seem to go offline during normal operation, to name a few problems. This is quite a prominent problem in comparison to physical failure, but it is still not an epidemic.

Now, good hardware design typically delays mechanical failures until well into the life of the hardware, and good software design avoids firmware bugs, but testing cannot replicate every condition and catch every error in a reasonable manner. Such is the way of the world. The majority of users are fine, while folks on the bleeding edge or in unusual conditions may find themselves facing problems. You just have to be aware of what you’re purchasing and plan to use it appropriately. :slight_smile: