7

If we start with a set of possible input values, apply the MD5 algorithm to every element of this set, and keep only the unique results (i.e., discard collisions), we are left with a set that is no larger, and possibly smaller. Consider this small piece of pseudocode:

Set<String> inputs = ALL_UNIQUE_INPUTS;
while (inputs.size() > 1) {
    // A fresh set per round: duplicate digests (collisions) are dropped automatically.
    Set<String> newInputs = new HashSet<>();
    for (String input : inputs) {
        newInputs.add(md5(input));
    }
    inputs = newInputs;
}

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

Furthermore, can we somehow determine how many iterations this would take for a given input space?

Note: I am aware that you should not attempt to increase the computational complexity of password hashing in this manner, but use salts, reapply those and other neat tricks to make the process more computationally complex whilst not increasing the chance of collisions.

CodesInChaos

4 Answers

8

After about $2^{n/2}=2^{64}$ iterations (with $n=128$, the MD5 output size in bits) an input will enter a cycle (of length approximately $2^{n/2}=2^{64}$). If inputs haven't collided by the time they enter a cycle, they never will.

If you have more than about $2^{n/4}=2^{32}$ inputs, you'll get collisions after $2^{n/2}=2^{64}$ iterations, as per the birthday problem. But of course you won't have the patience to wait that long.
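A scaled-down version of this estimate can be tried directly. The following is a minimal sketch, assuming MD5 truncated to $4k$ bits as a toy hash, $2^k$ starting values and $2^{2k}$ rounds; the function names `truncated_md5` and `iterate_set` are hypothetical, and the result says nothing definitive about full-size MD5.

```python
import hashlib


def truncated_md5(value, bits):
    """Top `bits` bits of MD5 over the fixed-width big-endian encoding of `value`."""
    width = (bits + 7) // 8
    digest = hashlib.md5(value.to_bytes(width, "big")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def iterate_set(start, bits, rounds):
    """Hash every element `rounds` times, forgetting old values each round.

    Returns the size of the set after each round; the size drops only
    when two elements collide in the same round.
    """
    sizes = []
    current = set(start)
    for _ in range(rounds):
        current = {truncated_md5(v, bits) for v in current}
        sizes.append(len(current))
    return sizes


if __name__ == "__main__":
    k = 4                      # toy parameter, chosen so the run is instant
    bits = 4 * k               # hash width plays the role of n
    start = set(range(2 ** k))  # about 2^(n/4) inputs
    sizes = iterate_set(start, bits, 2 ** (2 * k))  # about 2^(n/2) rounds
    print(f"{bits}-bit hash: {len(start)} inputs, final set size {sizes[-1]}")
```

Repeating the run with different starting sets (for example `range(c, c + 2 ** k)` for several offsets `c`) gives a rough idea of how often at least one collision appears at these parameters.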

CodesInChaos
  • The $2^{32}$ in the second paragraph in the current answer is at least short on justification. On the first iteration, odds of collisions are about $2^{-65}$. I believe these odds slightly decrease with the number of iterations, and I have the feeling that for $2^{32}$ inputs and $2^{64}$ iterations, odds that there has been a collision are significantly less than 50%, perhaps vanishingly small. I feel that it is more likely that some cycles have converged, but to different phases, thus have not collided, and will never. – fgrieu Jan 09 '17 at 13:36
  • @fgrieu I didn't work out the small constant factors. But the rough numbers should be correct. The typical cycle size is $2^{64}$ and there are only a small number of very hairy cycles. So I approximate that there are $2^{64}$ possible phases, with each input getting a phase at random. So once you have about $2^{32}$ inputs, you get colliding phases, per the birthday problem. – CodesInChaos Jan 09 '17 at 14:16
  • I'm OK that there are about $2^{64}$ points in a typical cycle; and that each input gets a random phase on *its* cycle; and thus that we are near the collision threshold *if* most of the $2^{32}$ starting points share the same cycle. But why would that later condition hold? [fixed] – fgrieu Jan 09 '17 at 15:49
  • 2
    @fgrieu Because almost all of the space is covered by very few cycles and the forest of hairs leading into them. – CodesInChaos Jan 09 '17 at 15:53
  • @fgrieu Quick and dirty code to confirm my prediction that you get collision after about $2^{n/4}$ inputs. – CodesInChaos Jan 09 '17 at 17:04
  • @CodeInchaos: made an independent check, I get 45% ±5% odds that there is at least one collision for $4k$-bit hash, $2^k$ starting values, and $2^{2k}$ chained hashes, quite independent of $k\ge4$. This confirms your point. – fgrieu Jan 12 '17 at 11:05
3

This question has been addressed here quite a few times, just phrased differently, for example in Cycles in SHA256.

Another answer here made the following statement:

Hashes have a fixed size output. After one round, all your new inputs will be equally sized, being the size of the hash output. An ideal hash function operating on such inputs will be bijective and you will constantly just be rearranging your inputs with none of them ever colliding.

An ideal hash function in this sense is quite similar to a perfect hash function. However, this does not fit cryptographic hash functions, where the "ideal" version is a random oracle or a truly random function (with the specified domain), for which collisions can happen.

Exactly this question was already addressed here:

But to come back to the question: Cryptographic hashes are designed to be as close to random functions as possible. In an answer to the first linked question, fgrieu drew a really nice visualization here.

A few of the key points of what to expect:

  • The graph is probably disconnected
  • The graph contains cycles of different lengths
  • It might contain fixed points (cycles of length 1)
  • It also contains nodes, which lead to a cycle but are not part of it.
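These properties can be observed on a toy example. The following is a minimal sketch, assuming MD5 truncated to a few bits as a stand-in for a random function; the helper names `toy_hash` and `cycle_lengths` are hypothetical. It enumerates the whole functional graph and lists every cycle length, so fixed points show up as cycles of length 1 and each cycle corresponds to one connected component.

```python
import hashlib


def toy_hash(value, bits):
    """Top `bits` bits of MD5 over the fixed-width big-endian encoding of `value`."""
    width = (bits + 7) // 8
    digest = hashlib.md5(value.to_bytes(width, "big")).digest()
    return int.from_bytes(digest, "big") >> (128 - bits)


def cycle_lengths(successor):
    """Return the length of every cycle in the graph v -> successor[v]."""
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = [UNSEEN] * len(successor)
    lengths = []
    for start in range(len(successor)):
        if state[start] != UNSEEN:
            continue
        path = []
        v = start
        # Walk forward until we reach a node we have seen before.
        while state[v] == UNSEEN:
            state[v] = ON_PATH
            path.append(v)
            v = successor[v]
        # Hitting a node on the current path closes a new cycle;
        # hitting a DONE node means we joined an already known component.
        if state[v] == ON_PATH:
            lengths.append(len(path) - path.index(v))
        for u in path:
            state[u] = DONE
    return lengths


if __name__ == "__main__":
    bits = 12  # small enough to enumerate all 2^12 nodes
    successor = [toy_hash(v, bits) for v in range(2 ** bits)]
    lengths = cycle_lengths(successor)
    print("cycle lengths:", sorted(lengths))
    print("nodes on a cycle:", sum(lengths), "of", len(successor))
```

Nodes that are not counted in any cycle are the tails leading into one.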

So to answer the initial questions:

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

No. An iteration can decrease the size of the set, but with a cryptographic hash that is unlikely. The set size only decreases in case of a collision, which is really unlikely (and surely does not happen in "most iterations").

Considering the second part of the question (the set reaching size 1): that could happen, but it is unlikely. If we remember the graph of the random function, then all elements of the original set have to

  • be in the same connected subgraph, and
  • end in a fixed point instead of a longer cycle; or alternatively they all have to collide, so that only one element is left in the cycle.

Furthermore, can we somehow determine how many iterations this would take for a given input space?

Well, for practical purposes with MD5: much, much too long. The cycles can have any length within the graph, and you have to save all previous steps to actually notice that you are in a cycle. With a graph of $2^{128}$ nodes, you would have to estimate the number of nodes in cycles, and then estimate how many values you need to store to be able to determine that you are in a cycle. It is quite likely that you need close to $2^{128}$ steps anyway.

tylo
  • An interesting aspect of the question, not addressed elsewhere that I could find, is that at each iteration the old hashes are forgotten. The problem is thus not if the cycles converge; rather, it is if two cycles that converge will do so with the same phase, such that there will be a collision. That's non-trivial. – fgrieu Jan 09 '17 at 13:30
1

Your set of inputs will not necessarily reach a single input, and for any well designed cryptographic hash function, it won't.

Hashes have a fixed size output. After one round, all your new inputs will be equally sized, being the size of the hash output. An ideal hash function operating on such inputs will be bijective and you will constantly just be rearranging your inputs with none of them ever colliding.

Evaluating the difference between an ideal hash function and the actual MD5 algorithm in this scenario is many PhDs' worth of research.

  • 4
    "An ideal hash function operating on such inputs will be bijective" A hash is no permutation. – CodesInChaos Aug 07 '16 at 09:32
  • 1
    A hash mapping 128 bit values to 128 bit hash values should be a permutation. Since there are exactly as many possible inputs and outputs, no two inputs should be hashed to the same output. –  Aug 08 '16 at 08:24
  • @CodesInChaos: why is not having collisions for every input, for a hypothetical ideal hash function, impossible? –  Aug 08 '16 at 16:45
  • 1
    @whatsisname Consider an input of 2^128+1 values mapped to 2^128 possible outputs. There exists at least one collision, per the pigeonhole principle. Now remove one of the inputs. Unless it's part of a single colliding pair (unlikely, unless you chose it specifically), the reduced input set will still contain a collision. ---- If the hash function is ideal, the number of inputs mapping to each output is Poisson distributed with an expected value of 1. – CodesInChaos Aug 08 '16 at 18:27
  • @CodesInChaos: the 2^128+1 values won't happen, as I already specified I have a maximum of 2^128 inputs, because after the first round, every input into the next cycle is equal to the output size of the hash. The first round is the only opportunity for collisions because only the first round has variable sized input. –  Aug 08 '16 at 18:54
  • 1
    @whatsisname Consider the 2^128 outputs. Then you sequentially map each input to an output. The first input won't cause a collision. The second one lands on an already used output with probability 1/2^128. The third with probability 2/2^128 etc. By the time you reached 2^64 inputs the cumulative probability of a collision has reached about 50%. By the time you're done, a collision is virtually certain. Even if you just consider the last input. If you had no collision so far, it only has a 1 in 2^128 chance of landing on the sole unoccupied output. – CodesInChaos Aug 08 '16 at 18:59
  • I think when we are dealing with a hypothetical, ideal hash with perfect collision avoidance, that 1 in 2^128 chance of landing on the sole unoccupied slot would reasonably be a 100% chance. Obviously that is not the case for a real-world hash algorithm. –  Aug 08 '16 at 19:03
  • 6
    @whatsisname For a cryptographer, an ideal hash is as close to a random oracle as possible, and not a perfect hash over the 128 bit strings. – CodesInChaos Aug 08 '16 at 20:07
  • @whatisname A hypothetical, ideal hash function is not bijective. For each input, it should select one of its possible outputs at random. Each of these events should be independent of one-another — in fact, an $n$-bit hash function that's bijective on $n$-bit strings would be considered broken. An attacker can feed it $> 2^\frac{n}{2}$ inputs and check for a collision. As you feed it more inputs, you can distinguish it from a random oracle more than 50% of the time by simply testing for a collision. – Stephen Touset Aug 10 '16 at 20:21
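The birthday estimate used in the comments above can be evaluated numerically. The following is a minimal sketch, assuming the hash behaves like a uniformly random function and using the standard approximation $1-e^{-k(k-1)/(2N)}$ for $k$ inputs and $N$ possible outputs; the function name `collision_probability` is hypothetical.

```python
import math


def collision_probability(num_inputs, output_bits):
    """Approximate chance of at least one collision among `num_inputs`
    uniformly random `output_bits`-bit values (birthday approximation)."""
    num_outputs = 2 ** output_bits
    exponent = num_inputs * (num_inputs - 1) / (2 * num_outputs)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-exponent)


if __name__ == "__main__":
    for log2_inputs in (32, 64):
        p = collision_probability(2 ** log2_inputs, 128)
        print(f"2^{log2_inputs} inputs, 128-bit output: p = {p:.3g}")
```

Under this approximation, $2^{64}$ inputs into $2^{128}$ outputs give a probability of roughly 0.39, the same order of magnitude as the "about 50%" quoted in the comment, while $2^{32}$ inputs give about $2^{-65}$ for a single round.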
1

I believe that most iterations of the while loop will decrease the size of inputs, until it reaches 1. Is this true?

No, not for input that has no pre-computed collisions.

Finding collisions is supposed to be a hard problem for secure hash functions. For SHA-1 none have been published yet at the time of writing, and that's a hash function that is considered pretty weak.

Furthermore, can we somehow determine how many iterations this would take for a given input space?

The chance of an accidental collision over a small input space is very low. The chance of finding a single one is high for a set of $2^{64}$ and probably still significant for $2^{32}$ elements.

In other words, you'll wait forever. Again, this holds for input that has no pre-computed collisions.

Note: I am aware that you should not attempt to increase the computational complexity of password hashing in this manner, but use salts, reapply those and other neat tricks to make the process more computationally complex whilst not increasing the chance of collisions.

I don't know about the other neat tricks, but for MD5 it's easy to find collisions because MD5 is broken. Anybody could create a set of distinct inputs in such a way that the while loop would immediately terminate (once it gets past that, it will probably never end).

Better use a secure hash such as one of the SHA-2 or SHA-3 variants.

Maarten Bodewes