There are two sides to your coin:
- if you want to do it securely, you will need a cryptographically secure hash like SHA-256 (crypto hashes are meant to be fast, but still tend to be a bit slower due to their security constraints),
- things like CRCs are definitely quicker, but will never be able to offer the same kind of security (especially when we’re talking about intentional manipulation).
Option 1: CRCs — Doing it quickly at the price of security:
If you're just after the detection of changes, go for a checksum instead of a hash. That's what checksums were made for: quickly detecting changes in a file or data stream. But keep in mind that CRC was designed to detect transmission errors, not malicious modification!
Practically, CRC32 is the most obvious candidate (but even a CRC-8 or a simple additive checksum would do the job if you only want to detect whether something has changed and expect nothing more than that from it).
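As a rough illustration, here is a minimal Python sketch (the helper name `file_crc32` and the chunk size are my own choices, not anything prescribed) that streams a file through zlib's CRC-32:

```python
import zlib

def file_crc32(path, chunk_size=64 * 1024):
    """Compute the CRC-32 of a file by streaming it in chunks."""
    crc = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF  # force an unsigned 32-bit result
```

Comparing `file_crc32(path)` against a previously stored value tells you whether the content changed, and nothing more than that.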
Option 2: Beyond CRCs — Doing it rather quickly while enhancing change-detection:
Another valid option (looking at @poncho's comment) is indeed to simply check the last-modified timestamp.
Or you combine both (to prevent bottlenecks), as this pseudo-code shows:
if (LastMod != knownLastMod) { CreateNewCRCandCompare(FileName, knownCRC); }
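In real code that could look roughly like the following Python sketch, assuming the `file_crc32` helper from the sketch above and that `known_mtime`/`known_crc` are whatever you stored when you created the bookmark:

```python
import os

def has_changed(path, known_mtime, known_crc):
    """Only fall back to the (slower) CRC when the timestamp differs."""
    mtime = os.path.getmtime(path)
    if mtime == known_mtime:
        return False  # timestamp unchanged: assume the content is unchanged too
    return file_crc32(path) != known_crc
```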
But does this offer any real security? No. Same goes for your…
> Why I consider checking the tail is:
>
> - MP3 has the tag information there
> - EXIF adds custom data at the end if I'm right
Again, it depends on how much security you expect. You have to realize that an adversary will surely manipulate the file while keeping (or copying and pasting) any old ID3 and EXIF data, since anyone with the appropriate read/write file-access rights can modify it. The same goes for the last-modification timestamp, frame rates, resolution, and even the file length. Relying on such “additional” and “modifiable” data, which can be altered or removed by anyone with sufficient file-access rights, would introduce a security flaw.
But you do expect security, don't you? After all, that's the reason why you're thinking about all this in the first place. Well, then there is no way around using crypto-secure hashes…
Option 3: Cryptographically Secure Hashes — Doing it securely at the price of speed:
If you expect real security, you will have to rely on hashing; to be more precise: cryptographically secure hashing (using a hash which is not known to produce collisions). It takes time (roughly on the order of milliseconds per megabyte, depending on hardware and algorithm), but it's worth it.
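For completeness, a short Python sketch of full-file hashing via `hashlib`; the helper name `hash_file` is mine, and the algorithm is left as a parameter so you can swap it out later:

```python
import hashlib

def hash_file(path, algo="sha256", chunk_size=1 << 20):
    """Hash a whole file in 1 MiB chunks and return the hex digest."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing `hash_file(path)` against a stored digest is the cryptographically secure variant of the checks above.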
My 2 (personal) cents:
Try to live with the fact that hashing costs time and hash the whole files with a cryptographically secure hash. Because, when stuff starts hitting the fan… you're better off being slow, instead of being sorry.
EDIT based on your EDIT…
If cryptographic security isn’t your main focus, you could look at MD5 or SHA-1. Both MD5 and SHA-1 are “cryptographically broken” because collisions have been found… yet for the change-detection purposes you describe (especially after your EDIT), the likelihood of hitting such a collision by accident should be small enough to ignore.
Looking at everything again (including your EDIT), I personally would most probably use MD5, because it offers a usable collision resistance (for change-detection purposes) while still being fast enough to completely hash multi-gigabyte files.
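With a helper like the `hash_file` sketch above, that switch is a one-liner, e.g. `hash_file(path, algo="md5")` instead of the SHA-256 default.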
If that still doesn’t satisfy you in a “speed” sense or if your hardware resources are really that limited, you have to try to balance collision-resistance/change-detection with speed. Meaning…
Take the individual timestamp, the individual filename, and hash the header (the length depends on the media type and the file format used) as well as a good chunk from the middle and a good chunk of the tail (= file end). Combine those five and you should be able to catch most changes without having to hash complete files.
> I'd be ok with a ~80% chance of having the correct bookmarks. How many hash pieces should I put together and where in the file would that be?
That’s more of a personal opinion, as it depends on a whole truckload of details (media type, file format, available resources, expected change-detection ratio, file similarity, etc.), so you will have to balance that out yourself depending on your personal expectations, your implementation, and local results due to hardware and/or software bottlenecks.
Let me try to provide you with some guidance nevertheless:
If hashing the complete file isn’t an option for whatever reasons, I would – at least – take: the header (and maybe a few KBs more), a good chunk from the middle (at least the size of the “header & co.” part), and a good chunk from the file end (again, at least the size of the “header & co.” part).
The more resources you can invest (or are willing to invest), the more chunks you can take and/or the bigger those chunks can be. If you think your resources still offer room for more, increase the size of the chunks you hash and/or increase the number of chunks you hash.
Increasing the number of chunks is easy: all you need to do is take care of an equal distribution (by dividing the file size accordingly, resulting in same-size chunks extracted from equally spaced positions over the whole file length).
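To make that concrete, here is a rough Python sketch of such a partial hash; the function name, the default chunk size and the default number of chunks are assumptions you would tune yourself, and it folds in the file name, size and timestamp as suggested above:

```python
import hashlib
import os

def sampled_digest(path, num_chunks=5, chunk_size=64 * 1024, algo="sha256"):
    """Hash the file name, size, mtime and a few equally spaced chunks
    instead of the whole file (num_chunks should be >= 2)."""
    size = os.path.getsize(path)
    h = hashlib.new(algo)
    # Fold in the cheap metadata first: file name, size, last-modification time.
    h.update(os.path.basename(path).encode("utf-8"))
    h.update(str(size).encode())
    h.update(str(os.path.getmtime(path)).encode())
    with open(path, "rb") as f:
        if size <= num_chunks * chunk_size:
            h.update(f.read())  # small file: just hash everything
        else:
            # Equally spaced offsets: first chunk covers the header, last one the tail.
            step = (size - chunk_size) // (num_chunks - 1)
            for i in range(num_chunks):
                f.seek(i * step)
                h.update(f.read(chunk_size))
    return h.hexdigest()
```

Keep in mind that this only buys speed: anyone can still modify the parts you never read, so it belongs in the “change detection” category, not the “security” one.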
And if you’re asking yourself “Why equally distributed and not random chunk positions?”, let me simply note that picking random chunk positions might practically render your change-detection efforts void, since it incorporates the risk of skipping exactly those parts of the media where you would normally detect the changes you are aiming to detect. Choosing an equal distribution is, simply said, more neutral.