
I want to create a fast way to detect whether a file has probably changed or is probably still the same. For near certainty I would use an existing hash algorithm, e.g. SHA256. However, the files are expected to be huge video files of several GB, so calculating the SHA256 hash could take a long time, especially over the network.

Therefore I want to combine several cheaper techniques:

  • file size: if the file size has changed, the content has changed (sure)
  • head / tail hash
  • random hash

The latter 2 are part of my question:

My guess would be that in the header there are things like:

  • frame rates (e.g. Videos)
  • resolution (e.g. Videos, Images)
  • (file) length (e.g. in frames, pixels etc.)
  • last change date (e.g. Word documents, not specifically Videos)

The reasons I consider checking the tail are:

  • MP3 has the tag information there
  • EXIF adds custom data at the end if I'm right

Random hashes would select e.g. 126 regions at random positions in the file, each with a fixed length of e.g. 64 kB, and create a hash over them. Of course I would remember the offsets for later comparison. All in all I would use (1+126+1)*64 kB of data for my hash, so I need to read only 8 MB instead of several GB to get the hash.
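A minimal sketch of this quick hash in Python, under a few assumptions that are not part of the question: the names `quick_hash` and `sample_offsets` are hypothetical, SHA-256 is used only as a convenient digest over the sampled bytes, and the offsets are derived from a PRNG seeded with the file size instead of being stored, so that two files of the same size are always sampled at the same positions. Files no larger than the sampled amount are hashed in full.

```python
import hashlib
import os
import random

CHUNK = 64 * 1024   # 64 kB per region, as in the question
REGIONS = 126       # random regions in addition to head and tail


def sample_offsets(size, chunk=CHUNK, regions=REGIONS):
    """Head, tail and pseudo-random offsets, derived only from the file size."""
    rng = random.Random(size)          # assumption: same size -> same offsets
    last = size - chunk                # offset of the tail region
    middle = {int(rng.random() * last) for _ in range(regions)}
    return sorted({0, last} | middle)


def quick_hash(path, chunk=CHUNK, regions=REGIONS):
    """Return (size, hex digest) computed from sampled regions of the file."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(str(size).encode())       # the file size is part of the fingerprint
    with open(path, "rb") as f:
        if size <= chunk * (regions + 2):
            h.update(f.read())         # small file: hash everything
        else:
            for offset in sample_offsets(size, chunk, regions):
                f.seek(offset)
                h.update(f.read(chunk))
    return size, h.hexdigest()
```

Sampled regions may overlap, so slightly less than 8 MB can end up being read. Sorting the offsets keeps the reads in ascending order, which is usually the friendlier access pattern on a slow network share.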

Maybe it's more a Math question now, but: how likely is it to detect a change using the combination of file size, head, tail and random data to generate this quick hash sum?
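One way to put a rough number on this is a simple sampling model. It is only an assumption, not a property of real video files: suppose an edit leaves the file size unchanged and alters a fraction p of the file's 64 kB regions, spread uniformly, and the n sampled regions are chosen independently and uniformly at random. Then every sample misses the change with probability (1 - p)^n. The helper name below is hypothetical.

```python
def detection_probability(changed_fraction, samples=126):
    """Chance that at least one of `samples` random regions hits a changed one.

    Assumes changed regions are spread uniformly and samples are independent.
    """
    return 1.0 - (1.0 - changed_fraction) ** samples


# If 1% of the regions differ, 126 samples catch the change in about 72% of cases.
print(round(detection_probability(0.01), 2))  # about 0.72
```

Under this model, a change confined to a tiny part of the file is likely to be missed, while an edit that re-encodes the stream touches most regions and is caught almost surely, usually by the size or head check alone.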

I assume that the files are always legal files. There's no benefit in manipulating single bytes. The user would use a normal video editing tool to change the files.

UPDATE: I unaccepted this answer which came from Crypto.StackExchange. I agree that my proposal is not cryptographic and not intended to be secure. I also agree that CRCing a file is fast, but in my case I really need a hash - I'll explain why:

  • My application is expected to save bookmarks in videos. My database is expected to save the video hash and the bookmarks.
  • Users sometimes move or rename files. My program will notice that a file no longer exists, but will not delete the bookmarks from the database. Instead, when the same video is (accidentally) played again, I want to recognize that it's (probably) the same file.
  • Users are expected to save files on network drives (NAS) and stream videos. These are dumb storage devices. I cannot install a server component. And they might be quite slow, so I really don't want the full hash. Calculating a full hash on a 3 GB file takes at least 5 minutes @ 10 MB/s, no matter how fast the hashing algorithm is.
  • If the user has edited the file, I somehow hope that the hash won't match any more, because otherwise I would display wrong bookmarks.

I'd be ok with a ~80% chance of having the correct bookmarks. How many hash pieces should I put together and where in the file would that be?
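For the "how many pieces" part, a rough estimate is possible under a simplifying assumption that may not hold for real edits: an edit keeps the file size and alters a fraction p of the 64 kB regions, spread uniformly, and the sampled regions are independent and uniformly random. The smallest n with 1 - (1 - p)^n at or above the target then follows from a logarithm. The function name is hypothetical.

```python
import math


def samples_needed(changed_fraction, target=0.8):
    """Smallest n such that 1 - (1 - changed_fraction)**n >= target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - changed_fraction))


# With 1% of the regions changed, an 80% detection chance needs 161 samples.
print(samples_needed(0.01, 0.8))  # 161
```

At 64 kB per region, 161 samples amount to roughly 10 MB of reads. The smaller the edit one wants to catch, the faster the required sample count grows.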

Thomas Weller
  • As long as malicious tampering or file corruption is not a concern, there's no need for any of this. Just use a specialized program to interpret the media file's headers, which should contain the streams' encoding/tagging dates and sizes. You can hash the media information for easy comparison. –  Dec 13 '13 at 17:45
  • Also, most operating systems keep a 'last modified date' available for each file. If you don't have to worry about malicious tampering (that last modified date can generally be set by someone), you can just look at that, and not bother with any file contents at all. – poncho Dec 13 '13 at 19:23
  • EXIF or MP3 tags are almost useless for detecting changes: many manipulation programs do not touch them, so they retain their previous contents. For example, EXIF may well retain the original picture. –  Dec 13 '13 at 20:17
  • @e-sushi: I agree. The only thing that may be interesting to note from the perspective of cryptography is that CPU power generally grows faster than I/O or storage. Also, CPUs are gaining more cryptographic capabilities. This means that there are fewer and fewer uses where "classical" non-cryptographic hashing techniques are worthwhile, compared with just using proper cryptographic hashes even when there appears to be only a small chance of foul play. –  Dec 13 '13 at 20:40
  • Going by “I assume that the files are always legal files”, I guess you aren't looking for any security? In this case you're on the wrong site. [cs.se] should be a better help. The answers you've had here are irrelevant if you don't want security, so if this is the case I would suggest to repost on [cs.se] and clarify that point in your reposted question. – Gilles 'SO- stop being evil' Dec 13 '13 at 21:44
  • 1) The actual hash calculation will usually be cheap compared to the IO. MD5 will detect all non-malicious changes and is pretty fast, especially if you parallelize it. You'd need a RAID of SSDs or something similarly fast to exceed its speed. 2) For local files the OS can often tell you if it changed. Not just the last-change date, there are some specialized APIs as well. – CodesInChaos Dec 14 '13 at 20:59
  • There has been a proposal to close this as opinion-based. I don't see how a direct question about the probability of some event is a matter of opinion. – David Richerby May 27 '14 at 16:06
  • I too find this a very good question, though it might be impossible to answer. But then again, with an accepted error margin of 20% I'm guessing it is possible. I could use a solution as well, but I would not accept error margins greater than 1%, and speed is paramount. – Johny Skovdal Dec 11 '17 at 15:59