SSD power failure test / support

vbimport

#1

A friend of mine wondered whether the power failure feature of SSDs really works and how well. That immediately made me wonder if we can somehow test that.

I have no clue how we could “safely” test a power failure, or what criteria we would use to say a drive handles it well or badly, but I do think it could be something many people are interested in.

Any ideas whether we can do this and if so, how?


#2

Hi,

I think that to do this we need to verify that the drive has not lost any writes that were in flight when the power failure occurred.
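A minimal sketch of that kind of check, in Python. It is only an illustration under assumptions: the record size, the file names and the helper names (`make_record`, `write_records`, `find_lost`) are all made up here, and the acknowledgement log is assumed to live on a different device or machine that stays powered. Each record is logged as acknowledged only after `os.fsync` returns, so after the power cut every logged sequence number should still be readable from the drive under test.

```python
import os
import struct

RECORD_SIZE = 4096  # assumed record size, one record per 4 KiB


def make_record(seq: int) -> bytes:
    """Build a fixed-size record that can be regenerated from its sequence number."""
    header = struct.pack("<Q", seq)
    return header + bytes([seq % 256]) * (RECORD_SIZE - len(header))


def write_records(path: str, ack_log: str, count: int) -> None:
    """Write records in order; log a sequence number only after fsync returns."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | getattr(os, "O_BINARY", 0)
    fd = os.open(path, flags)
    try:
        with open(ack_log, "w") as log:
            for seq in range(count):
                os.write(fd, make_record(seq))
                os.fsync(fd)          # ask the OS to flush this write to the drive
                log.write(f"{seq}\n")
                log.flush()           # the log is assumed to survive the power cut
    finally:
        os.close(fd)


def find_lost(path: str, last_acked: int) -> list[int]:
    """Return acknowledged sequence numbers that are missing or corrupt."""
    lost = []
    with open(path, "rb") as f:
        for seq in range(last_acked + 1):
            f.seek(seq * RECORD_SIZE)
            if f.read(RECORD_SIZE) != make_record(seq):
                lost.append(seq)
    return lost
```

After the drive is powered back up, `find_lost(path, last_acked)` with the last number in the log should return an empty list if no acknowledged write was lost. Whether the drive actually honours the flush request is exactly what is being tested.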

I suspect that there is a way to do this using the Oakgate unit - I’ll do some research.

It is largely enterprise drives that have ‘power failure’ support.

Regds, JR


#3

Not impossible to do this, but quite hard to do with authority.
As Jeremy says, this will more or less be restricted to enterprise class drives, as very few consumer models use supercaps to prevent data loss during a power outage.

There are some pitfalls to doing this with authority.
The supercaps or ‘power failure caps’ are only designed to write what is in the SSD’s cache to NAND should a power failure occur.

A power outage would take down the whole PC, so only data in the SSD’s cache would be written to NAND. It won’t prevent data loss due to Windows caching policies.

When you shut down or restart a PC normally, Windows flushes its cache buffer and updates things like the paging file and any temporary data held in the Windows cache.

During a power outage, the SSD may well do its thing and write any data in its cache to NAND, but it won’t be able to write any data in the Windows cache or update the paging file.

So you could have data loss even though the SSD and its supercaps have done their job perfectly. The pitfall is: how can you tell which part lost the data, the SSD’s power failure protection or the Windows caching policy?
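To make the two layers concrete, here is a toy model in Python. It is purely illustrative: the class and method names are invented, and real caches are far more complicated. It only shows that a supercap can save what has already reached the SSD’s cache, while anything still sitting in the OS cache is gone either way.

```python
from dataclasses import dataclass, field


@dataclass
class Stack:
    """Toy model: writes pass through OS cache -> SSD cache -> NAND."""
    has_supercap: bool
    os_cache: list = field(default_factory=list)
    ssd_cache: list = field(default_factory=list)
    nand: list = field(default_factory=list)

    def write(self, item):
        self.os_cache.append(item)            # application write lands in the OS cache

    def os_flush(self):
        self.ssd_cache.extend(self.os_cache)  # OS hands its cached data to the drive
        self.os_cache.clear()

    def power_failure(self):
        self.os_cache.clear()                 # RAM contents vanish with the PC
        if self.has_supercap:
            self.nand.extend(self.ssd_cache)  # capacitors buy time to drain the cache
        self.ssd_cache.clear()


s = Stack(has_supercap=True)
s.write("a")
s.os_flush()
s.write("b")         # still in the OS cache when the power goes
s.power_failure()
print(s.nand)
```

In this model "b" is lost even though the drive did everything right, which is the ambiguity described above.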


#4

Yes, I suspect it is very very difficult.

Some layman’s thoughts -

An OS will assume that a write operation has been completed regardless of whether it has been cached by an SSD or not.

I think most OSes adopt a strategy of effectively rolling back incomplete ‘units of work’ when they next come up following a catastrophic failure. My thinking on the definition of a ‘unit of work’ here is: the completion of a write and the completion of updating any metadata that is required as a result of the write.

Crudely, I imagine that when the OS comes up following a failure it simply assumes that the then current metadata (data about data) that it finds on the drive is correct. So, for example, a write to an LBA that wasn’t recorded in the metadata is logically lost (even if it has actually been written).
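A toy illustration of that idea in Python. This is not how any real file system is implemented; `ToyVolume` and its methods are hypothetical names used only to show a block that is physically written but logically lost because the metadata never recorded it.

```python
class ToyVolume:
    """Toy volume: data blocks plus metadata saying which LBA belongs to which file."""

    def __init__(self):
        self.blocks = {}     # lba -> bytes physically on the media
        self.metadata = {}   # file name -> lba, as recorded on the media

    def write_data(self, lba, data):
        self.blocks[lba] = data

    def commit_metadata(self, name, lba):
        self.metadata[name] = lba

    def recover(self):
        """After a crash, only data referenced by the metadata is visible."""
        return {
            name: self.blocks[lba]
            for name, lba in self.metadata.items()
            if lba in self.blocks
        }


vol = ToyVolume()
vol.write_data(10, b"first")
vol.commit_metadata("a.txt", 10)   # complete unit of work
vol.write_data(11, b"second")      # power fails before the metadata update
print(vol.recover())
```

Block 11 is on the media, but after recovery nothing points to it, so from the OS’s point of view the write never happened.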

So, any units of work that are cached (or partially cached) by the SSD at a point of failure will effectively be lost solely because of the SSD.

Most probably this is all nonsense but I hope at least it goes towards illustrating how difficult it may be to detect if a drive has or hasn’t flushed its cache in the event of a power failure.

I’m wondering if the only way this can be tested is at a very low level (e.g. something that traces the firmware’s operation in the event of a catastrophic external power failure).

I’ll ask OakGate. Maybe they know a way but I’m beginning to think not.

Regds, JR


#5

According to my limited understanding of how the Windows file system works, I think you’re pretty much near bang on, Jeremy.

The point I was making, again based on my limited knowledge of how the Windows file system works, was that I think we would have to remove Windows from the equation and try to do this test at a low level, where Windows can’t have an effect on the result.

I found this document. It makes quite interesting reading.


#6

This would be my suggestion using Windows, although not as realistic as pulling the plug to the PC:

First, go into Device Manager and disable write caching on the SSD. Then:

1. Prepare a folder with a large number of small files (e.g. 5000 x 1MB files) on a RAM drive or fast read source.
2. Create an empty folder on the SSD as the target.
3. Set up the SSD so it can be quickly unplugged, such as in a hot-swappable bay if possible.
4. Start the file copy of this folder using a method that logs what is copied.
5. Disconnect the SSD before the copy completes.
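For the first step, the file set could be generated with a short script. The following Python sketch is one possible way, not a required one: the function name, the file naming scheme and the manifest format are assumptions. It fills each file with random bytes and records a SHA-256 hash per file, so copies can later be checked even if the source is no longer available.

```python
import hashlib
import os


def make_fileset(folder: str, manifest: str,
                 count: int = 5000, size: int = 1024 * 1024) -> None:
    """Create `count` random files of `size` bytes and a SHA-256 manifest."""
    os.makedirs(folder, exist_ok=True)
    with open(manifest, "w") as out:
        for i in range(count):
            name = f"file_{i:05d}.bin"
            data = os.urandom(size)   # random contents, so no two files are alike
            with open(os.path.join(folder, name), "wb") as f:
                f.write(data)
            out.write(f"{hashlib.sha256(data).hexdigest()}  {name}\n")
```

For example, `make_fileset(r"w:\fileset", r"w:\manifest.sha256")` would create the 5000 x 1MB set on the source drive, with the manifest kept outside the folder so it is not part of the copy.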

The following is a command line example that logs while copying:

xcopy w:\fileset\* x:\fileset\ /j >w:\log.txt

This assumes ‘w:’ is the drive with the source file set and ‘x:’ is the SSD being tested. ‘/j’ specifies not to buffer so that each file is read and written simultaneously without read caching.

xcopy shows the current file name that is being copied, so the last file in the log.txt file will be the point where the SSD was disconnected. In theory, every file leading up to that file should have been copied successfully.

As a test, I would suggest using a file compare utility to compare every copied file with the original. This can be done with the command line as follows, assuming the same drive letters:

for %f in (x:\fileset\*) do fc /b w:\fileset\%~nxf x:\fileset\%~nxf >>w:\complog.txt

Once this completes, the complog.txt file will have a list of all the files compared. The last file should obviously show a lengthy series of hex byte mismatches. However, if any files leading up to the last also show mismatches, then those files were not written successfully. If any files towards the end of the copy log.txt file are missing from the SSD, then this is also a clear sign that cached data was lost, in this case file entries in the file system.
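The same check could also be scripted so that the result is a short summary instead of a long fc dump. A rough Python sketch, assuming the list of file names has already been pulled out of log.txt (the function name and the result layout are made up for illustration):

```python
import filecmp
import os


def compare_sets(source: str, target: str, copied: list[str]) -> dict:
    """Classify each file named in the copy log as ok, mismatched or missing."""
    result = {"ok": [], "mismatch": [], "missing": []}
    for name in copied:
        src = os.path.join(source, name)
        dst = os.path.join(target, name)
        if not os.path.exists(dst):
            result["missing"].append(name)       # logged as copied, not on the SSD
        elif filecmp.cmp(src, dst, shallow=False):
            result["ok"].append(name)            # byte-for-byte identical
        else:
            result["mismatch"].append(name)      # present, but contents differ
    return result
```

With this, anything other than the very last logged file showing up under "mismatch" or "missing" would point to data that was reported as written but did not survive the disconnect.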

Edit: I didn’t see Dee’s reply until I posted and never thought about the complexity involved.


#7

My first thought was: what if ONLY the SSD’s power fails? But I guess that’s never a real-world scenario. Either the entire system loses power, which is what you both describe above, or the SSD has a fault…

Nevertheless, this is a feature they’re selling and it would be nice to find out how well it works and how useless/not useless it is.

BTW, someone did something on this already (and it made Slashdot and Hacker News ;) which proves how interesting the subject appears to be): http://lkcl.net/reports/ssd_analysis.html