Real-world hash collision risk for finite input; uniqueness only, no data risk

Question

I've got a comma-delimited list of identifiers that I need to use for uniqueness. MSSQL limits uniqueness to 1700 bytes, which, according to my sampling, doesn't appear to be sufficient.

Hashes are the obvious solution.

The identifiers are completely meaningless: they're lookups to locations of data. So there is NOTHING sensitive to care about if a rainbow table were to be used, unless you care about "2,4,8,23" or whatever combination shows up in the data.

Also, since I know it's a comma-delimited list of numbers, I can limit input to [0-9,] and ensure that nothing else will ever exist. no Unicode or hidden character nonsense.

So... given all that...

Yes, I know some hashes can have calculable collisions.

but for my purpose, is there a real risk of hash collision, at any hash (even MD5)?

fgrieu · Answer 1 · 2022-06-17T11:20:43.460

for my purpose, is there a real risk of hash collision, at any hash (even MD5)?

That depends on:

- The number of possible inputs, and the width of the hash. For $2^s$ inputs chosen independently of the hash, and a $w$-bit hash, the probability of collision is¹ $$p\lessapprox2^{2s-w-1}$$ when $2s<w-5$. So for $s=48$ (>30000 entries for each living human) and $w=128$ (MD5), the probability of collision is $p\approx2^{-33}$ (1 chance in 8 billion, about the probability that a randomly selected living human is you).
- Whether adversaries actively try to create collisions. If they do, or if in doubt:
  - $s=48$ is way too low! In fact, facing adversaries able to choose messages (in full, or in part with knowledge of the rest), $s$ must be defined not by how many things we hash, but by how many things adversaries can hash. We are talking $s\approx53$ if adversaries use a single GPU for one year. If we trust that source, bitcoin mining contributes to the ruin of our ecosystem at a rate of $2^{93.5}$ hashes per year, using specialized ASICs, thus we should use $s>94$ if we assume comparable waste can occur against our system.
  - It's unwise to use a broken hash, such as MD5 or SHA-1, though the restriction in character set at the input mitigates existing better-than-brute-force attacks to a sizable degree.
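As a rough numeric check of the bound above, here is a minimal Python sketch. The function names `collision_log2_bound` and `min_hash_bits` are made up for illustration. The code only restates the approximation $p\lessapprox2^{2s-w-1}$ together with its validity condition $2s<w-5$, and it assumes inputs chosen independently of the hash (no adversary).

```python
import math


def collision_log2_bound(s: float, w: int) -> float:
    """Return log2 of the approximate upper bound p <~ 2^(2s - w - 1).

    s: log2 of the number of hashed inputs (hypothetical parameter name)
    w: hash width in bits
    """
    if not 2 * s < w - 5:
        raise ValueError("approximation is only stated for 2s < w - 5")
    return 2 * s - w - 1


def min_hash_bits(s: float, log2_p_max: float) -> int:
    """Smallest width w such that 2^(2s - w - 1) <= 2^log2_p_max.

    Requires log2_p_max < -6 so that the result also satisfies 2s < w - 5.
    """
    if not log2_p_max < -6:
        raise ValueError("target probability too large for this approximation")
    return math.ceil(2 * s - 1 - log2_p_max)


if __name__ == "__main__":
    # The answer's example: 2^48 inputs, 128-bit hash (MD5 width).
    print(collision_log2_bound(48, 128))
    # Width needed to keep the same 2^48 inputs below p = 2^-64.
    print(min_hash_bits(48, -64))
```

With adversaries in the picture, $s$ would have to reflect how many hashes they can compute rather than how many rows are stored, so the same functions would be called with a much larger `s`.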

To be on the safe side, you can use SHA-256 or the typically faster² SHA-512/256 ($w=256$). If space is an issue, there's SHA-512/224 ($w=224$), which limits each hash to 28 bytes. See FIPS 180-4.

If speed matters, there's BLAKE2 or BLAKE3, which are competitive with MD5 on speed. It's OK to truncate such a hash to save space, within the limits of the above formula.
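A minimal sketch of what a shortened hash key for the identifier list could look like in Python, using BLAKE2b from the standard `hashlib` module. The function name `list_key` and the 16-byte default are assumptions for illustration only. Note that BLAKE2's `digest_size` parameter produces a shorter digest natively; the result is not a prefix of the 64-byte digest, but it serves the same space-saving purpose as truncation.

```python
import hashlib


def list_key(id_list: str, digest_bytes: int = 16) -> bytes:
    """Hash a comma-delimited identifier list to a fixed-size key.

    digest_bytes * 8 plays the role of w in the collision bound,
    so the default of 16 bytes corresponds to w = 128.
    """
    data = id_list.encode("ascii")  # the input is stated to be [0-9,] only
    return hashlib.blake2b(data, digest_size=digest_bytes).digest()


if __name__ == "__main__":
    key = list_key("2,4,8,23")
    print(len(key), key.hex())
```

The resulting fixed-size value is what would go into the unique column in place of the full list. A 16-byte key matches MD5's storage size, and a 28- or 32-byte key leaves more margin under the formula.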

since I know it's a comma delimited list of numbers, I can limit input to [0-9,] and ensure that nothing else will ever exist. No unicode or hidden character nonsense.

When using an unbroken hash, such considerations are unnecessary.

¹ For a derivation, see my Birthday problem for cryptographic hashing, 101, "assuming $n\ll\sqrt k$", "additionally assuming large $n$". In that source $n=2^s$ and $k=2^w$, thus $p\lessapprox{\frac{n^2}{2k}}$ yields our $p\lessapprox2^{2s-w-1}$.

² For messages larger than 55 bytes, and without hardware assistance, SHA-512 is often faster than SHA-256 on 64-bit CPUs, because it makes good use of 64-bit words.
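The 55-byte threshold in footnote ² lines up with how padding works: SHA-256 uses 64-byte blocks and appends one `0x80` byte plus an 8-byte length field, while SHA-512 uses 128-byte blocks with a 16-byte length field. The sketch below only counts compression-function calls under those padding rules; the function names are hypothetical, and it does not measure actual speed.

```python
def compression_calls(msg_len: int, block_bytes: int, length_field_bytes: int) -> int:
    """Number of blocks processed for a message of msg_len bytes.

    Padding adds one 0x80 byte and the length field, then rounds up
    to a whole number of blocks.
    """
    padded = msg_len + 1 + length_field_bytes
    return -(-padded // block_bytes)  # ceiling division


def sha256_blocks(msg_len: int) -> int:
    return compression_calls(msg_len, block_bytes=64, length_field_bytes=8)


def sha512_blocks(msg_len: int) -> int:
    return compression_calls(msg_len, block_bytes=128, length_field_bytes=16)


if __name__ == "__main__":
    for n in (55, 56, 111, 112, 1700):
        print(n, sha256_blocks(n), sha512_blocks(n))
```

For messages up to 55 bytes both functions need a single block, so SHA-256's cheaper compression function can win. Beyond that, SHA-512 processes roughly half as many blocks.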

i get that "using SHA-256/SHA-512 is a simple guarantee, so why not use that"... the REASON for not using "more advanced" hash algos: first the size of the hash, second the speed... if the older hashes are sufficient, they're a ton faster to execute, and notably smaller (MD5 vs SHA512 is 16 vs 64 bytes)... yes i'm trying to minmax on performance, but I also need data assurances. — Scott Brickey, Jun 16 '22 at 12:39
@Scott Brickey: in cryptography, we assume adversaries trying to mess with the system, here trying to create collisions by crafting the messages. If you can safely assume there's no adversary, and want speed, then a 128-bit or even 96-bit CRC might be enough. E.g. if you process less than $2^{40}$ entries (over a million million) with a 96-bit CRC, $p≈2^{2\cdot40-96-1}=2^{-17}<1/130.000$ that there's a collision, which might be acceptable. But then that's rather off-topic, and I won't treat that in the answer. — fgrieu, Jun 16 '22 at 13:06
