For data shorter than 32 bytes, is it certain that no collision exists?

Question

I'm not asking whether a collision can be found, only whether one exists at all.

The point is that the input is shorter than the hash output (I'm talking about keccak256), so there are more possible hash values than possible inputs, and each input could in principle have its own hash value.

So is that the case? Or, for example, can two different 16-byte numbers share the same hash?

Is "Just how surjective is a cryptographic hash like SHA-1?" useful? — Paul Uszak, Oct 15 '19 at 02:30
I'm having trouble following your question. Are you asking whether, for some particular hash function with a 32-byte codomain such as SHA-256 or SHAKE128-256, there are any collisions on (say) 16-byte inputs? — Squeamish Ossifrage, Oct 15 '19 at 02:57
@user2284570 Please note that although noting the exact hash is useful for us to answer, answers may still be valid even if they are written for a different hash function or for a more generic construct. — Maarten Bodewes, Oct 15 '19 at 14:55

score 2 · Answer 1 · answered Oct 15 '19 at 03:16

I think what you're asking is:

We know there certainly exist two 33-byte strings $m \ne m'$ such that $H(m) = H(m')$, when $H$ is (say) SHAKE128-256, because there are only $2^{256}$ distinct possible outputs of $H$ and $2^{264}$ distinct possible 33-byte messages.

Do there exist two <32-byte strings, say 16-byte strings, $m \ne m'$ such that $H(m) = H(m')$?

Suppose the output of the hash function $H$ is $h$ bits long, and suppose the input is $t$ bits long. If we model $H$ as a uniform random function, then each output for distinct messages $m_1, m_2, \dotsc, m_n$ is an independent random variable $H(m_1), H(m_2), \dotsc, H(m_n)$ with uniform distribution on the $h$-bit strings. In this case, we are wondering whether there is any collision under $H$ in $n = 2^t$ distinct messages.

By the birthday paradox, the probability of a collision grows quadratically with the number of messages: it is at most $n^2\!/2^h = 2^{2t}\!/2^h$. Of course, this bound is not very helpful if $t \geq h/2$, but the probability of a collision rapidly converges to $1$ as $t$ exceeds $h/2$. For 16-byte inputs and a 256-bit output we are exactly at $t = h/2$: there are about $2^{255}$ pairs of inputs, each colliding with probability $2^{-256}$, so the probability that at least one colliding pair exists is about $1 - e^{-1/2} \approx 0.39$. So the odds are a little worse than 50-50 that there are two 16-byte inputs that collide under (say) SHAKE128-256, but unless you find some amazing cryptanalysis of SHA-3, we'll never know what those two inputs are, if they exist at all. The same goes for any other unbroken hash function with the same codomain, like SHA-256.
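To put numbers on the estimate, here is a minimal sketch under the same uniform-random-function model. The function name `collision_probability` is made up for this illustration, and the formula is the usual Poisson approximation $1 - e^{-\text{pairs}/2^h}$, not an exact value.

```python
import math

def collision_probability(t_bits: int, h_bits: int) -> float:
    """Approximate P(at least one collision) among all 2**t_bits inputs
    of a uniform random function with h_bits of output (Poisson approximation)."""
    n = 2 ** t_bits
    pairs = n * (n - 1) // 2            # number of distinct input pairs
    expected = pairs / 2 ** h_bits      # expected number of colliding pairs
    return -math.expm1(-expected)       # 1 - exp(-expected), accurate for tiny values

if __name__ == "__main__":
    print(collision_probability(128, 256))  # 16-byte inputs: about 0.393
    print(collision_probability(64, 256))   # 8-byte inputs: negligible
    print(collision_probability(264, 256))  # 33-byte inputs: 1.0 (pigeonhole)
```

Python's arbitrary-precision integers keep the pair count exact, and only the final division is done in floating point.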

I was talking about keccak256, since there are more hash values than possible values for the input data. — user2284570, Oct 15 '19 at 11:38
OK. I'm guessing that by ‘keccak256’ you mean some hash function based on a Keccak sponge with a 256-bit output, and while there's no standard defining ‘keccak256’ per se, it doesn't really matter for the sake of the question as long as it isn't wildly divergent from (e.g.) SHAKE128-256 or SHA3-256. — Squeamish Ossifrage, Oct 15 '19 at 13:59

rvalue · Answer 2 · 2019-10-15T03:03:56.070

Think of an abstract hash function as a black-box mapping from an arbitrary input to a finite output.

Since the number of possible inputs is infinite, if you assume the function approximates a stable (because it's a hash), random (because it's cryptographic) mapping from input to output, each output value must have an infinite number of colliding inputs (infinite input space / finite output space).

They're just randomly distributed, on average about $2^b$ inputs apart (which is far), where $b$ is the number of output bits. Note that for this to be true, the size of the input doesn't matter; just the fact that it is a function which takes any input and produces a bounded output.
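The counting argument can be tried at toy scale. This is a minimal sketch, assuming a hypothetical 8-bit hash `tiny_hash` that keeps only the first byte of SHA3-256 (chosen purely for illustration). With 256 possible outputs, any 257 distinct inputs must contain a collision.

```python
import hashlib

def tiny_hash(data: bytes) -> int:
    """Hypothetical 8-bit hash: the first byte of SHA3-256 (illustration only)."""
    return hashlib.sha3_256(data).digest()[0]

def first_collision():
    """Return two distinct 2-byte inputs with the same tiny_hash value."""
    seen = {}
    for i in range(257):                 # 257 inputs, only 256 possible outputs
        m = i.to_bytes(2, "big")
        h = tiny_hash(m)
        if h in seen:
            return seen[h], m
        seen[h] = m
    raise AssertionError("unreachable: 257 inputs cannot map to 256 distinct outputs")

if __name__ == "__main__":
    a, b = first_collision()
    print(a.hex(), b.hex(), tiny_hash(a) == tiny_hash(b))
```

In practice the loop stops long before the 257th input, which is the birthday effect at work.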

A perfect* hash function might have a 1:1 mapping for inputs of exactly the output length. With any practical hash function, once the number of input values becomes significant, the "birthday problem" should be noted: while the chance of any particular pair of items colliding is small, the number of pairs that might collide grows as $n(n-1)/2$, i.e. quadratically in the number of items $n$, so the probability of some collision approaches 1 once $n$ goes well beyond the square root of the number of possible outputs. In your example, 16 bytes means $n = 2^{128}$ inputs, which is a lot of $n$.
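The borderline case, where the input has half as many bits as the output, can be explored empirically at reduced size. The sketch below is an illustration under stated assumptions: it truncates SHA3-256 to 2 bytes, feeds it every 1-byte input, and uses a 4-byte salt prefix (a device invented here) to get many independent-looking "hash functions". Under the random-function model, the fraction of salts for which a collision exists should be near $1 - e^{-1/2} \approx 0.39$.

```python
import hashlib

def has_collision(salt: bytes, t_bytes: int = 1, h_bytes: int = 2) -> bool:
    """True if two distinct t_bytes-long inputs collide under the salted,
    truncated hash (SHA3-256 cut down to h_bytes; illustration only)."""
    seen = set()
    for x in range(256 ** t_bytes):
        msg = salt + x.to_bytes(t_bytes, "big")
        digest = hashlib.sha3_256(msg).digest()[:h_bytes]
        if digest in seen:
            return True
        seen.add(digest)
    return False

def collision_rate(trials: int = 1000) -> float:
    """Fraction of salts for which some pair of inputs collides."""
    hits = sum(has_collision(i.to_bytes(4, "big")) for i in range(trials))
    return hits / trials

if __name__ == "__main__":
    print(collision_rate())
```

Neither "always" nor "never" is the right answer at this size: some salts give a collision-free mapping and some do not, which is the situation for 16-byte inputs and a 32-byte output.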

If the input is shorter than the output, where does the extra output space come from? Since we have only one input, the answer must be derived from the input or internal to the function (constant).

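In practice, hash functions do not treat a short input as a bare number: the padding rule, which is part of the specification of each primitive, makes 0101 and 0000 0101 distinct messages. Sponge constructions such as Keccak do this by appending padding bits after the message, and Merkle–Damgård designs such as SHA-256 additionally encode the input length in the padding.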
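As a small illustration of the padding point, the sketch below applies SHA-256-style padding (a `0x80` byte, zero bytes, then the 64-bit big-endian bit length). The helper `sha256_pad` is written here for illustration, and reading "0101" and "0000 0101" as the byte strings `01 01` and `00 00 01 01` is an assumption.

```python
import hashlib

def sha256_pad(message: bytes) -> bytes:
    """SHA-256-style padding: 0x80, zero fill, then 64-bit big-endian bit length."""
    bit_len = len(message) * 8
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)   # fill up to 56 mod 64 bytes
    padded += bit_len.to_bytes(8, "big")
    return padded

short = bytes.fromhex("0101")        # assumed reading of "0101"
longer = bytes.fromhex("00000101")   # assumed reading of "0000 0101"

if __name__ == "__main__":
    # The padded blocks differ both in content and in the encoded length.
    print(sha256_pad(short).hex())
    print(sha256_pad(longer).hex())
    # The digests differ as well, for SHA-256 and for SHA3-256.
    print(hashlib.sha256(short).digest() != hashlib.sha256(longer).digest())
    print(hashlib.sha3_256(short).digest() != hashlib.sha3_256(longer).digest())
```

Both inputs fit in a single 64-byte block after padding, yet the blocks are different, so the compression function never sees them as the same message.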

*yes; for this purpose, the identity f(x) = x is a "perfect" hash function (no collisions), but has some confidentiality flaws.

For your information; $\LaTeX$ / MathJax is enabled on our site. — kelalaka, Oct 15 '19 at 07:57
