CD/DVD-ROM identification method?

Not quite sure where this question might belong, so I’ll ask it here.

I’m looking for some kind of method or algorithm to identify a CD- or DVD-ROM with a fairly high degree of probability. I’m not trying to identify the medium itself, but looking for a way to say ‘yes, I’m sure I’ve already catalogued the contents of this CD/DVD before’ in something like 1 or 2 seconds. I imagine that someone must already have tried to develop something like that.

In short, I’m looking for something like the CDDB algorithm, but for CD- and DVD-ROMs instead of CD-DA, and with fewer false positives.

An MD5 hash of the contents (with certain restrictions) would probably be ideal, except that it can’t be computed in 1–2 seconds.

Is there anything like this out there?

[QUOTE=athulin;1969923]I’m looking for some kind of method or algorithm to identify a CD- or DVD-ROM with a fairly high degree of probability … Is there anything like this out there?[/QUOTE]

I doubt that is possible the way you think.
The media type is always identified first by the drive; the content is a separate matter.

>I doubt that is possible the way you think.

I have no doubt that it is possible – I’m simply trying to find out if anyone has done something like this before, and whether there already is something reasonably well accepted. That is my main question.

For instance, CDDB1 works simply by taking the number of tracks on the CD-DA and their lengths, chewing on that a bit, and spitting out a number.
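For illustration, the classic CDDB1 disc ID can be sketched roughly like this (track start times given in seconds; real implementations work from frame offsets and differ in small details, so treat this as an approximation, not a reference implementation):

```python
# Rough sketch of the classic CDDB1 disc-ID computation.
# Inputs: the start time (in seconds) of each audio track, plus the lead-out.

def digit_sum(n: int) -> int:
    """Sum of the decimal digits of n (CDDB's per-track checksum)."""
    s = 0
    while n > 0:
        s += n % 10
        n //= 10
    return s

def cddb1_disc_id(track_starts: list[int], leadout: int) -> int:
    """32-bit ID: checksum byte | total playing time in seconds | track count."""
    checksum = sum(digit_sum(t) for t in track_starts)
    total_seconds = leadout - track_starts[0]
    return ((checksum % 0xFF) << 24) | (total_seconds << 8) | len(track_starts)

# Example: three tracks starting at 2 s, 150 s and 300 s, lead-out at 400 s.
print(f"{cddb1_disc_id([2, 150, 300], 400):08x}")  # -> 0b018e03
```

The weakness the thread is circling around is visible right in the formula: two different discs with the same track layout collide, because no audio data is ever read.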

The absolutely simplest thing to use for an ISO9660 CD-ROM would be the Volume Identifier as key. But as there are already quite a number of collisions (especially among old CD-ROMs), a more extensive scheme is needed, such as taking the primary volume descriptor and MD5-hashing it. A still more complex scheme would be to append the directory structure it references and MD5-hash that as well.

Each of these [B]could[/B] be used as a ‘thumbprint’ to identify the contents of the CD – ranging from the awful (Volume Identifier alone) to what, at first sight at least, seems to work better.
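As a sketch of the middle scheme (assuming the disc or image can be read as a plain seekable file, and that the primary volume descriptor sits at its standard location, sector 16 of a 2048-byte-sector ISO9660 volume):

```python
import hashlib

SECTOR_SIZE = 2048
PVD_SECTOR = 16  # standard ISO9660 primary volume descriptor location

def pvd_thumbprint(f) -> str:
    """MD5 of the primary volume descriptor block of an ISO9660 image.

    `f` is any seekable binary file object (an .iso file, or the raw
    device on platforms that expose it as one).
    """
    f.seek(PVD_SECTOR * SECTOR_SIZE)
    pvd = f.read(SECTOR_SIZE)
    if len(pvd) < SECTOR_SIZE:
        raise ValueError("image too small to contain a volume descriptor")
    return hashlib.md5(pvd).hexdigest()

# Usage (hypothetical path):
# with open("disc.iso", "rb") as f:
#     print(pvd_thumbprint(f))
```

Since only one 2048-byte sector is read, this easily fits the 1–2 second budget; the open question is whether the PVD alone varies enough between discs.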

Moved to ‘Optical Storage Technical Discussions’.

One could create a numerical string that contains # of files, # of dirs, size of disc in bytes, # of sectors, time used on the disc, leadout lba, etc. All of which can be read in seconds.

Example:

[CODE]
FILES   DIRS  LEADOUT LBA
50      8     2289248

|-----||----||--------|
00000500000080002289248
[/CODE]

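Such a packed fingerprint is trivial to build; a minimal sketch (the field widths – 7 digits for files, 6 for dirs, 10 for the lead-out LBA – are chosen to match the example above and are not any established format):

```python
def disc_fingerprint(n_files: int, n_dirs: int, leadout_lba: int) -> str:
    """Pack disc metrics into a fixed-width numeric string.

    Field widths (7/6/10) are illustrative, not a standard.
    """
    return f"{n_files:07d}{n_dirs:06d}{leadout_lba:010d}"

print(disc_fingerprint(50, 8, 2289248))  # -> 00000500000080002289248
```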
[QUOTE=Nakor_;2013779]One could create a numerical string that contains # of files, # of dirs, size of disc in bytes, # of sectors, time used on the disc, leadout lba, etc. [/QUOTE]

True, but it ‘feels’ as if that captures more form than content. And that’s what ultimately made CDDB1 fail – I can easily imagine that, say, two successive releases of some software development kit have the same number of files and dirs and the same total size, and differ only in individual file sizes and file contents.
Any differences between these two will be found in the volume descriptors, i.e. the system area of the CD-ROM.

What I’ve experimented with so far is to create an MD5 hash from all the ISO9660 volume descriptor blocks, followed by at most 20 further evenly spaced blocks counted from block 16 (that is, with LBA = 16 + n*32, or something like that).
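A sketch of that sampling scheme (assuming a seekable image or device with 2048-byte sectors; the stride of 32 and the cap of 20 sample blocks follow the description above, but they are the poster’s experimental parameters, not a standard):

```python
import hashlib

SECTOR_SIZE = 2048

def sampled_md5(f, start_lba: int = 16, stride: int = 32,
                max_samples: int = 20) -> str:
    """MD5 over up to `max_samples` sectors spaced `stride` apart,
    starting at `start_lba` (sector 16 is where ISO9660 volume
    descriptors begin)."""
    h = hashlib.md5()
    for n in range(max_samples):
        f.seek((start_lba + n * stride) * SECTOR_SIZE)
        block = f.read(SECTOR_SIZE)
        if len(block) < SECTOR_SIZE:
            break  # ran off the end of the image
        h.update(block)
    return h.hexdigest()
```

At most 20 short seek-and-read operations keeps this within the 1–2 second target even on slow drives, at the cost of missing any change that happens to fall entirely between the sampled blocks.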

It doesn’t capture file structure, which I would like to include.

I would also like to include the media size and recording method (is it a CD-ROM, a CD-R, etc.) to ensure that simple copies could be recognised as such.

But I’ve realised that it’s going to be well-nigh impossible to test it properly, so it’s rather on the back-burner for now.