Fast hashing: combination of different techniques to identify changes in a file?

Question

I want a fast way to detect whether a file is probably still the same. For near-100% certainty I would use an existing hash algorithm, e.g. SHA256. However, the files are expected to be huge video files of several GB, so calculating the SHA256 hash could take some time, especially over the network.

Therefore I want to combine different other techniques:

- file size: if the file size has changed, the content has changed (sure)
- head / tail hash
- random hash

The latter 2 are part of my question:

My guess would be that in the header there are things like:

- frame rates (e.g. Videos)
- resolution (e.g. Videos, Images)
- (file) length (e.g. in frames, pixels etc.)
- last change date (e.g. Word documents, not specifically Videos)

Why I consider checking the tail is:

- MP3 has the tag information there
- EXIF adds custom data at the end if I'm right

Random hashes would select e.g. 126 regions at random positions in the file with a specific length, e.g. 64 kB and create a hash for them. Of course I remember the offsets for later comparison. All in all I would use (1+126+1)*64 kB of data for my hash, so I need to read only 8 MB instead of several GB to get the hash.
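The scheme described above could be sketched roughly as follows. This is only an illustration under assumptions: the function names `pick_offsets` and `quick_hash` are made up here, SHA-256 is used to combine the pieces (any hash would do), and the file size is mixed into the result so that a size change alone alters it.

```python
import hashlib
import os
import random

CHUNK = 64 * 1024  # 64 kB per sampled region, as in the question


def pick_offsets(size, count=126, seed=None):
    """Choose random region offsets once; they must be remembered for later."""
    if size <= CHUNK:
        return []
    rng = random.Random(seed)
    return sorted(rng.randrange(0, size - CHUNK + 1) for _ in range(count))


def quick_hash(path, offsets):
    """Hash file size + head + tail + the regions at the remembered offsets."""
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))  # a size change alone changes the result
    with open(path, "rb") as f:
        # Head at offset 0, then the sampled regions, then the tail.
        for off in [0, *offsets, max(size - CHUNK, 0)]:
            f.seek(off)
            h.update(f.read(CHUNK))
    return h.hexdigest()
```

With 126 offsets this reads (1+126+1)*64 kB = 8 MB per file. The offsets have to be stored next to the hash, because a comparison is only meaningful when the same regions are read again.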

Maybe it's more a Math question now, but: how likely is it to detect a change using the combination of file size, head, tail and random data to generate this quick hash sum?
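A rough way to put numbers on this, under assumptions that are not part of the question: the k random regions are drawn independently and uniformly, and f is the fraction of possible region positions whose 64 kB window overlaps at least one changed byte. File size, head and tail are ignored here, so this is a lower bound for the combined check.

```latex
% Probability that at least one of k random regions hits a change:
P(\text{detect}) = 1 - (1 - f)^{k}

% Example: k = 126 random regions, f = 0.01
P(\text{detect}) = 1 - 0.99^{126} \approx 0.72

% Fraction needed for an 80% detection chance with k = 126:
f \ge 1 - 0.2^{1/126} \approx 0.0127
```

Under this model, an edit that touches a little over 1% of the file is caught about 80% of the time by the random regions alone, while an edit confined to a few frames is very likely missed unless it also changes the size, head or tail.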

I assume that the files are always legal files. There's no benefit in manipulating single bytes. The user would use a normal video editing tool to change the files.

UPDATE: I unaccepted this answer which came from Crypto.StackExchange. I agree that my proposal is not cryptographic and not intended to be secure. I also agree that CRCing a file is fast, but in my case I really need a hash - I'll explain why:

My application is expected to save bookmarks in videos. My database is expected to save the video hash and the bookmarks.
Users sometimes move or rename files. My program will notice that a file no longer exists, but will not delete the bookmarks from the database. Instead, when the same video is (accidentally) played again, I want to recognize that it's (probably) the same file.
Users are expected to save files on network drives (NAS) and stream videos. Those are dumb storages. I cannot install a server component. And they might be quite slow, so I really don't want the full hash. Calculating a full hash on a 3 GB file takes at least 5 minutes @ 10 MB/s, no matter how fast the hashing algorithm is.
If the user has edited the file, I somehow hope that the hash won't match any more, because otherwise I would display wrong bookmarks.

I'd be ok with a ~80% chance of having the correct bookmarks. How many hash pieces should I put together and where in the file would that be?

As long as malicious tampering or file corruption is not a concern, there's no need for any of this. Just use a specialized program to interpret the media file's headers, which should contain the streams' encoding/tagging dates and sizes. You can hash the media information for easy comparison. — , Dec 13 '13 at 17:45
Also, most operating systems keep a 'last modified date' available for each file. If you don't have to worry about malicious tampering (that last modified date can generally be set by someone), you can just look at that, and not bother with any file contents at all. — poncho, Dec 13 '13 at 19:23
EXIF or MP3tag are almost useless for detecting changes: many manipulation programs are unable to touch these, so they retain their previous contents. For example, EXIF may well retain the original picture. — , Dec 13 '13 at 20:17
@e-sushi: I agree. The only thing that may be interesting to note from the perspective of cryptography is that CPU power generally grows faster than I/O or storage. CPUs also get more cryptographic capabilities. This means that there are fewer and fewer uses where it is worth using "classical" non-cryptographic hashing techniques instead of just going with proper cryptographic hashes, even when there appears to be only a small chance of foul play. — , Dec 13 '13 at 20:40
Going by “I assume that the files are always legal files”, I guess you aren't looking for any security? In this case you're on the wrong site. [cs.se] should be a better help. The answers you've had here are irrelevant if you don't want security, so if this is the case I would suggest to repost on [cs.se] and clarify that point in your reposted question. — Gilles 'SO- stop being evil', Dec 13 '13 at 21:44
There has been a proposal to close this as opinion-based. I don't see how a direct question about the probability of some event is a matter of opinion. — David Richerby, May 27 '14 at 16:06
I too find this a very good question, though it might be impossible. But again, with an accepted error margin of 20% I'm guessing it is possible. I could use a solution as well, but I would not accept error margins greater than 1%, and speed is paramount. — Johny Skovdal, Dec 11 '17 at 15:59

score 8 · Accepted Answer · edited Apr 13 '17 at 12:48

There are two sides to your coin:

if you want to do it secure, you will need to use a cryptographically secure hash like SHA256 (crypto-hashes are meant to be fast, but tend to be a bit slow due to security constraints),
things like CRCs are definitely quicker, but will never be able to offer the same kind of security (especially when we’re talking about .

Option 1: CRCs — Doing it quickly at the price of security:

If you're just after the detection of changes, go for a checksum instead of a hash. That's what checksums were made for: quickly detecting changes in a file or data-stream. But keep in mind that CRC was designed to detect transmission errors, not malicious action!

Practically, CRC32 is the most obvious candidate (but even an additive CRC8 would do the job if you only want to detect if something has changed and don't expect anything else than that from the CRC.)

Option 2: Beyond CRCs — Doing it rather quickly while enhancing change-detection:

Other valid options (looking at @poncho's comment) are indeed to simply check the last-mod timestamp.

Or you combine both (to prevent bottlenecks), as this pseudo-code shows:

if(LastMod != knownLastMod) { CreateNewCRCandCompare(FileName, knownCRC) };
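In Python, the pseudo-code above might look like the following sketch. The names `crc32_of_file` and `has_changed` are illustrative only, and the stored values `known_mtime` and `known_crc` are assumed to come from wherever you keep them (e.g. your database).

```python
import os
import zlib


def crc32_of_file(path, block_size=1024 * 1024):
    """Stream the file through zlib.crc32 so it never sits in memory whole."""
    crc = 0
    with open(path, "rb") as f:
        while block := f.read(block_size):
            crc = zlib.crc32(block, crc)
    return crc


def has_changed(path, known_mtime, known_crc):
    """Only pay for the CRC when the modification time differs."""
    if os.path.getmtime(path) == known_mtime:
        return False
    return crc32_of_file(path) != known_crc
```

Note that this trusts the timestamp: a file whose content changed but whose modification time was restored is reported as unchanged.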

But does this offer any real security? No. Same goes for your…

Why I consider checking the tail is:
- MP3 has the tag information there
- EXIF adds custom data at the end if I'm right

Again, it depends on how much security you expect. You have to realize that an adversary will surely manipulate the file to keep (or copy-and-paste) any old ID3 and EXIF data… as anyone (with appropriate RW file-access rights) can modify that. Same goes for the Last-Modification timestamp, frame rates, resolution, last change date, and even the (file) length. Depending on that “additional” and “modifiable” data — which can be modified and removed by anyone with enough file-access rights — would introduce a security flaw.

But you do expect security, don't you? After all, that's the reason why you're thinking about all this in the first place. Well, then there is no way around using crypto-secure hashes…

Option 3: Cryptographically Secure Hashes — Doing it securely at the price of speed:

If you expect real security, you will have to rely on hashing; to be more precise: cryptographically secure hashing (using a hash which is not known to produce collisions). It takes time (a few milliseconds per MB) but it's worth it.

My 2 (personal) cents:

Try to live with the fact that hashing costs time and hash the whole files with a cryptographically secure hash. Because, when stuff starts hitting the fan… you're better off being slow, instead of being sorry.

EDIT based on your EDIT…

If cryptographic security isn’t your main focus, you could look at MD5 or SHA1. Both MD5 and SHA1 are “cryptographically broken” because collisions have been found… yet for the change-detection purposes you describe (especially after your EDIT), the likelihood of hitting such a collision should be small enough.

Looking at everything again (including your EDIT), I personally would most probably use MD5, because it offers a usable collision resistance (for change-detection purposes) while still being fast enough to completely hash multi-gigabyte files.
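For reference, hashing a complete multi-gigabyte file does not require loading it into memory. A minimal sketch using Python's `hashlib` (the function name `hash_file` is made up for illustration, and MD5 may be unavailable on systems that restrict it, in which case another algorithm name can be passed):

```python
import hashlib


def hash_file(path, algorithm="md5", block_size=1024 * 1024):
    """Hash a whole file in fixed-size blocks and return the hex digest."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read block by block until read() returns an empty bytes object.
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.hexdigest()
```

On a slow network share the run time is dominated by reading the file, not by the choice of algorithm.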

If that still doesn’t satisfy you in a “speed” sense or if your hardware resources are really that limited, you have to try to balance collision-resistance/change-detection with speed. Meaning…

Take the individual timestamp, the individual filename, and hash the header (length depends on media type and used file format) as well as a good chunk from the middle and a good chunk of the tail (= file end). Combine those 5 and you should be able to roughly filter out most

I'd be ok with a ~80% chance of having the correct bookmarks. How many hash pieces should I put together and where in the file would that be?

That’s more of a personal opinion, as it depends on a whole truckload of details (media type, file format, available resources, expected change-detection ratio, file similarity, etc.), so you will have to balance that out yourself depending on your personal expectations, your implementations, and local results due to hardware and/or software bottlenecks.

Let me try to provide you with some guidance nevertheless:

If hashing the complete file isn’t an option for whatever reasons, I would – at least – take: the header (and maybe a few KBs more), a good chunk from the middle (at least the size of the “header & co.” part), and a good chunk from the file end (again, at least the size of the “header & co.” part).

The more resources you can invest (or are willing to invest), the more chunks you can take and/or the bigger those chunks can be. If you think your resources/feel/whatever still offers room for more, increase the size of the chunks you hash and/or increase the number of chunks you hash.

Increasing the number of chunks is easy: all you need to do is take care of an equal distribution (divide the file size accordingly and extract same-size chunks from equally spaced positions over the whole file length).
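A minimal sketch of such equally spaced sampling, assuming a made-up function name `spaced_hash`, SHA-256 as the combining hash (MD5 would work the same way), and the file size mixed in as an extra input:

```python
import hashlib
import os


def spaced_hash(path, chunks=16, chunk_size=64 * 1024):
    """Hash `chunks` same-size pieces taken at equal spacing over the file."""
    chunks = max(chunks, 2)  # need at least a head and a tail chunk
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))
    with open(path, "rb") as f:
        if size <= chunks * chunk_size:
            h.update(f.read())  # small file: just hash everything
        else:
            # First chunk starts at 0 (header), last one ends near the tail.
            step = (size - chunk_size) // (chunks - 1)
            for i in range(chunks):
                f.seek(i * step)
                h.update(f.read(chunk_size))
    return h.hexdigest()
```

The positions depend only on the file size, so nothing besides the hash has to be stored. An edit that falls entirely between two sampled chunks and leaves the size unchanged is not detected.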

And if you’re asking yourself “Why equally distributed and not random chunk-positions?”, let me simply note that picking random chunk positions might practically render your change-detection efforts void, since it incorporates the risk of skipping some important parts of the media where you would normally detect the changes you are aiming to detect. Choosing an equal distribution is – simply said – more neutral.

I wouldn't use CRC32, too big chance of failure even without malicious attacks. Crypto is pretty fast. You should get 1GB/s on a single core with a standard hash. If you weaken it a bit 3GB/s should be possible. It's almost certain that IO is more expensive than hashing. — CodesInChaos, Dec 14 '13 at 21:05
@CodesInChaos I agree. That's why my closing words advise to go for a cryptographically secure hash. — e-sushi, Dec 14 '13 at 22:12
Carter-Wegman hashes and other universal hashes could help. These have the speed of a wide CRC, and the security of hashes, assuming a key remains unknown to the attacker and is not reused. See this answer for references. — fgrieu, Dec 15 '13 at 11:48
@fgrieu But wouldn't that - in OP's situation - mean OP would need an individual key per file? Seems a bit impractical to me. Especially since it would introduce the need for key management etc. just to verify potential file modifications. — e-sushi, Dec 15 '13 at 19:44
@e-suschi: if there is some unique file identifier (like, a path), a master key and HMAC is all it takes to get a unique key per file. That said, if the adversary gets read access to the key, she can make a forgery, when she can't with a regular hash of the file and read-only access. — fgrieu, Dec 16 '13 at 06:07
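The key derivation mentioned in the last comment could be sketched like this. It is an illustration only: `per_file_key` is a made-up name, HMAC-SHA256 is one possible choice of HMAC, and the universal hash that would then consume the derived key is not shown.

```python
import hashlib
import hmac


def per_file_key(master_key: bytes, file_id: str) -> bytes:
    """Derive a distinct key per file from one master key and a unique file id."""
    return hmac.new(master_key, file_id.encode("utf-8"), hashlib.sha256).digest()
```

Only the master key has to be stored; the per-file keys can be recomputed on demand from the file identifier (for example its path).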

score 5 · Answer 2 · answered Dec 13 '13 at 20:02

Shortcuts

If you have multiple files and you want to detect changes to files, use file size and last modification timestamp.
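As a small illustration, both values come from a single `os.stat` call, so no file content is read at all. The name `stat_signature` is hypothetical.

```python
import os


def stat_signature(path):
    """Return (size in bytes, modification time in nanoseconds) for a file."""
    st = os.stat(path)
    return (st.st_size, st.st_mtime_ns)
```

Comparing the stored tuple with a fresh one tells you whether the size or timestamp changed. It cannot recognize a renamed or copied file whose timestamp was not preserved.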

It is possible that the operating system you use provides facilities to detect file changes; for example, Linux can notify you of changes to directories (inotify).

Full file processing

If you need to read the actual contents of files to check whether they have changed, go with an actual cryptographic hash. CRC has significant potential of giving a false negative. SHA-256 can be quite good, but actually, SHA-512 is faster on many modern platforms.

If you have many CPU cores, it could be useful to calculate different hashes for different parts of the file, or to use a hash tree to parallelize the processing.
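A minimal two-level sketch of that idea, under assumptions: the names `tree_hash` and `_hash_segment` and the 64 MiB segment size are made up, and threads are used on the expectation that `hashlib` releases the interpreter lock while hashing large buffers. The result is not the same value as a plain SHA-256 over the whole file.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

SEGMENT = 64 * 1024 * 1024  # hypothetical segment size: 64 MiB


def _hash_segment(path, offset, length, block_size=1024 * 1024):
    """SHA-256 of the bytes in [offset, offset + length) of the file."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        f.seek(offset)
        remaining = length
        while remaining > 0:
            block = f.read(min(block_size, remaining))
            if not block:
                break  # reached end of file inside the last segment
            h.update(block)
            remaining -= len(block)
    return h.digest()


def tree_hash(path, segment=SEGMENT, workers=4):
    """Hash segments in parallel, then hash the concatenated segment digests."""
    size = os.path.getsize(path)
    offsets = range(0, size, segment)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        leaves = list(pool.map(lambda off: _hash_segment(path, off, segment), offsets))
    root = hashlib.sha256()
    for leaf in leaves:
        root.update(leaf)
    return root.hexdigest()
```

This only helps when hashing is the bottleneck. On a slow disk or network share the parallel reads will not make the data arrive any faster.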

The reason for suggesting a proper hash is that once you read the actual file data, the cryptographic processing will not be the dominant cost; other, slower things will be, typically disk I/O or sending and receiving network packets.

Note: For (at least) small files it is also possible to store entire file contents, and do comparison of the contents instead of hash.

Note 2: If you are very tight on storage, a CRC or a truncated cryptographic hash could be a good choice. CRC32 takes 4 bytes per file, and SHA-256 takes 32 bytes. Small tags of 4 bytes cannot protect against malicious attempts to hide edits.

Partial file processing

In most cases I would recommend using just the full file processing.

Maybe it's more a Math question now, but: how likely is it to detect a change using the combination of file size, head, tail and random data to generate this quick hash sum?

For picture files it is common to make small edits, like removing red eye, adding a mustache or horns, etc. In JPG format, such edits would occasionally not affect the file size (with an editing program that is able to change a JPG by recompressing only the altered areas) or any of the other attributes you mention.

File modification time would usually be affected though.

Considering video files: many video formats generate a constant bit rate. For a constant-bit-rate file, if some frames in the middle are altered, the change will not show up in file size, head or tail either. Removing or adding frames will almost always result in a difference in size.

So I see it as entirely possible that a file gets changed without the change being detected.

It is very hard to estimate the probability that edits are detected with this scheme, but there are common usage scenarios for videos and images in which edits are not properly detected.

Yes, small edits on PNG or WAV files have a large chance to be missed if only some chunks are processed. — galinette, Oct 15 '16 at 11:22