
Suppose I had a hash function that produced a 256-bit output for any given input (such as SHA3-256). Now suppose I did an exhaustive mapping from every possible 512-bit input to its 256-bit output, and stored each output in a table along with the original 512-bit input.

Now, if I later wanted to find the original 512-bit input based on a 256-bit hash, what is the probability of finding a collision rather than the original input?

It seems that in the case of an exhaustive mapping from input to output, the risk of collision gets "unacceptable" very fast. Would I be correct to assume that in this case there exist $2^{512}$ inputs, but the hash function can only supply $\approx 2^{256}$ distinct outputs (minus unnecessary collisions caused by imperfections in the hash function), so there must exist $\approx 2^{256}$ collisions?

  • $2^{256}/2^{512}$ is not $0.5$, but rather $1/2^{256}$. – SleuthEye Jun 01 '16 at 21:57
  • Just a question: assume that due to some quantum theory, using multiple dimensions and other sci-fi stuff, you could store each of the $2^{512}$ input/output pairs on 1 bit (i.e. 8 per byte). How many terabytes of data would you need? :) – Biv Jun 01 '16 at 22:03
  • @Biv hehe, right, but it is a theoretical question that works just as well with a 4-bit output and 8-bit input, which would be easily storable :) – Daniel Valland Jun 01 '16 at 22:05

2 Answers


As pointed out by @kodlu, given a 512-bit input sequence ($2^{512}$ possibilities) and a hash function which spreads the resulting hashes out approximately evenly, you would wind up with approximately $\frac{2^{512}}{2^{256}} = 2^{256}$ input sequences stored in your table for any given 256-bit hash (provided you could physically store such a large table).

Of course, only one of those $2^{256}$ sequences would be the actual original input, so the probability of picking the correct original sequence by choosing one of those $2^{256}$ table entries uniformly at random would be:

$$ P_{\mbox{original}} \approx \frac{1}{2^{256}} \approx 8\times 10^{-78} $$

Correspondingly, the probability that you'd pick a collision would be:

$$ \begin{align} P_{\mbox{collision}} &= 1 - P_{\mbox{original}}\\ &\approx 1 - \frac{1}{2^{256}} \\ &\approx 1 \end{align} $$

In other words, it would be nearly impossible to obtain the original sequence, and one would almost surely obtain a collision by randomly choosing a sequence from your large exhaustive mapping. This assumes you do not have additional information that would allow you to choose one of the sequences from your table in a better way than simple random selection.
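To see this at toy scale (along the lines of the 4-bit-output, 8-bit-input idea from the comments), here is a quick sketch in Python; the choice of the low 4 bits of SHA-256 as the stand-in hash is an arbitrary assumption for illustration:

```python
import hashlib
from collections import defaultdict

# Toy version of the exhaustive table: map every 8-bit input to a
# 4-bit "hash" (here, the low 4 bits of SHA-256 -- an arbitrary
# stand-in) and bucket the inputs by their output.
table = defaultdict(list)
for x in range(2**8):
    h = hashlib.sha256(bytes([x])).digest()[-1] & 0x0F
    table[h].append(x)

sizes = [len(bucket) for bucket in table.values()]
avg = sum(sizes) / len(sizes)

# With 2^8 inputs and 2^4 outputs, each bucket holds about
# 2^8 / 2^4 = 16 preimages, so a random pick from the right bucket
# recovers the true input with probability only about 1/16.
print(f"distinct outputs: {len(table)}, average bucket size: {avg:.1f}")
```

Scaled up to 512-bit inputs and 256-bit outputs, the same counting gives the $1/2^{256}$ figure above.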

SleuthEye

If you're tossing $n=2^{512}$ balls into $m=2^{256}$ bins, as a model of a good hash function, then $n=m^2$ and the average load will be $n/m=m$, so the typical output will have $2^{256}$-fold collisions.

See the paper here (last case of Theorem 1), whereby in your case the maximum load won't be too far away from the average load in a multiplicative sense.

What about the output with the least number of collisions? The probability that a fixed bin is missed is $$ (1-1/m)^n\approx \exp(-n/m)=\exp(-m), $$ so the expected number of missed bins, approximately $m \exp(-m)$, is essentially zero.

A Poisson approximation can show that, with probability almost one, the minimum-load bin has at least an $$O(m/\log m)\text{-fold}\tag{1}$$ collision. So, many collisions will occur even for the least popular output.

Edit: To put this in the context of @SleuthEye's answer: even if the actual input led to the least popular output, your probability of finding the actual input would still be something like 1 in $2^{248}$, by the estimate in (1).
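A small simulation of the balls-into-bins model makes this concrete; $m=2^8$ bins and $n=m^2$ balls stand in for the $2^{256}$ outputs and $2^{512}$ inputs (the scaled-down parameters are illustrative assumptions, not from the answer itself):

```python
import random

random.seed(1)  # for reproducibility of this sketch

m = 2**8        # bins: stand-in for the 2^256 possible outputs
n = m * m       # balls: stand-in for the 2^512 possible inputs

# Throw each ball into a uniformly random bin, modeling a good hash.
loads = [0] * m
for _ in range(n):
    loads[random.randrange(m)] += 1

# The average load is n/m = m = 256; the min and max loads stay within
# a few standard deviations (~sqrt(m)) of it, so every output -- even
# the least popular one -- collides many times over.
print(min(loads), sum(loads) // m, max(loads))
```

At full scale the concentration is even tighter relative to the mean, which is why even the least-loaded bin still carries on the order of $m/\log m$ collisions.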

kodlu