How does a birthday attack on a hashing algorithm work?

Question

A "normal", brute-force attack on a cryptographic hashing algorithm $H$ should have a complexity of about $2^{n}$ for a hash algorithm with an output length of $n$ bits.

That means it takes about $2^n$ tries on average (each try succeeds with probability $2^{-n}$) to find a colliding message $y$ for a given message $x$ so that $H(y) = H(x)$ while $y \ne x$.

However, a birthday attack (i.e. both $x$ and $y$ can be selected arbitrarily, but $H(x) = H(y)$ is of course still required) is supposed to be much faster, and take only about $2^{n/2}$ tries to find a collision.

I see how that works out in terms of time complexity - since we are now looking at all pairs of guesses made so far, of which there are about $q^2/2$ after $q$ guesses, the probability of a collision now grows quadratically instead of linearly in the number of guesses.

But shouldn't we also need space to store all of the already tried $x$ and $y$? If we don't, how can we even compare all the pairs of hashes enumerated so far? That seems to make such a birthday attack practically impossible, since storing $2^{n/2}$ hash values takes a lot of space, not even thinking about the problem of accessing all that storage in constant time.
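To make the storage concern concrete, here is a minimal sketch of the table-based birthday attack. It assumes a toy hash, SHA-256 truncated to 24 bits, so that a collision turns up quickly; the names `H`, `nbytes` and `birthday_with_table` are illustrative and not taken from any of the cited sources.

```python
import hashlib
from itertools import count


def H(msg: bytes, nbytes: int = 3) -> bytes:
    """Toy hash (assumption): SHA-256 truncated to nbytes, i.e. n = 8 * nbytes bits."""
    return hashlib.sha256(msg).digest()[:nbytes]


def birthday_with_table(nbytes: int = 3):
    """Hash distinct messages until two of them share a digest."""
    seen = {}  # digest -> message; this table is the memory cost in question
    for counter in count():
        msg = counter.to_bytes(8, "big")  # distinct inputs by construction
        digest = H(msg, nbytes)
        if digest in seen:
            # Return both messages and the number of entries stored so far.
            return seen[digest], msg, len(seen)
        seen[digest] = msg


x, y, stored = birthday_with_table()
print(x != y, H(x) == H(y), stored)
```

The table grows by one entry per try, so for an $n$-bit hash it would be expected to hold on the order of $2^{n/2}$ digests by the time a collision appears.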

Update: I've found http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdf, which seems to say that the attack is possible with constant space. However, the proof and explanation given are way above my head.

The classic text on the subject is Parallel Collision Search with Cryptanalytic Applications — fgrieu, Jul 20 '12 at 21:32

Ilmari Karonen · Accepted Answer · 2012-07-23T09:57:11.577

The method described in the link you cited is based on Floyd's cycle finding algorithm, also known as "the tortoise and the hare" algorithm. This is a general-purpose algorithm for detecting cycles in iterated maps, which I will first describe below.

Specifically, consider the sequence $(x_i)$ defined by $x_i = H(x_{i-1})$ for some map $H$ and some initial value $x_0$. If this sequence is cyclic, then we have some integers $j > 0$ and $k > 0$ such that $x_j = x_{j+k}$, and thus also $x_i = x_{i+nk}$ for all integers $i \ge j$ and $n \ge 0$.

In particular, this holds for $i = nk$ and any integer $n$ satisfying $n \ge j/k$, yielding $x_i = x_{2i}$. Thus, if the sequence $(x_i)$ is cyclic, there exists an integer $i > 0$ such that $x_i = x_{2i}$. Conversely, the existence of such an integer also clearly implies that the sequence must be cyclic (with a period that evenly divides $i$).

It thus follows that, to detect cycles in the sequence $(x_i)$, it suffices to check whether $x_i = x_{2i}$ for any positive integer $i$. We can do this iteratively using constant space by keeping track of two elements of the sequence, $y = x_i$ and $z = x_{2i}$ (both initialized to $x_0$ at iteration $i = 0$), and, on each successive iteration, updating them as $y \gets H(y)$ and $z \gets H(H(z))$.
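The detection loop just described can be sketched as follows. This is only an illustration under stated assumptions: the map $H$ is taken to be SHA-256 truncated to 16 bits, and the names `H`, `nbytes` and `floyd_meeting_index` are made up for the example.

```python
import hashlib


def H(x: bytes, nbytes: int = 2) -> bytes:
    """Toy map (assumption): SHA-256 truncated to nbytes."""
    return hashlib.sha256(x).digest()[:nbytes]


def floyd_meeting_index(x0: bytes, nbytes: int = 2):
    """Return the first i > 0 with x_i == x_{2i}, together with x_i."""
    y = z = x0  # y tracks x_i, z tracks x_{2i}
    i = 0
    while True:
        y = H(y, nbytes)             # tortoise: one step
        z = H(H(z, nbytes), nbytes)  # hare: two steps
        i += 1
        if y == z:
            return i, y


i, xi = floyd_meeting_index(b"seed")
print(i, xi.hex())
```

Only `y`, `z` and the counter `i` are kept between iterations, which is where the constant-space property comes from. The loop always terminates here because the truncated hash has a finite range, so the sequence must eventually repeat.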

If we find such an $i$, we'll know that the sequence $(x_i)$ is cyclic. Now, we have two possibilities: either the initial value $x_0$ is part of the cycle, or it is not. In the latter case, we have $x_0 \ne x_i$ but $$H^{(i)}(x_0) = x_i = x_{2i} = H^{(i)}(x_i),$$ and we've thus found a collision in the $i$-fold iterated map $H^{(i)}$, which in turn implies the existence of a collision in the underlying map $H$.

All that remains to be done is locating the underlying collision. To accomplish that, we rewind the iteration back $i$ steps, so that $y = x_0$ and $z = x_i$, and advance them one step at a time, this time as $y \gets H(y)$ and $z \gets H(z)$ so that, at iteration $j$, we'll always have $y = x_j$ and $z = x_{i+j}$. Since we know that $x_0 \ne x_i$ and $x_i = x_{2i}$, it follows that there must be some $j$ between $0$ and $i-1$ such that $x_j \ne x_{i+j}$ but $x_{j+1} = x_{i+j+1}$, and thus $H(x_j) = H(x_{i+j})$. When we find that $j$ — i.e. when we find the first $y$ and $z$ such that $H(y) = H(z)$ — we stop; that's the collision we've been looking for.
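Putting the detection and location phases together gives something like the sketch below. It again assumes a toy hash (SHA-256 truncated to 24 bits); `H`, `nbytes` and `find_collision` are hypothetical names chosen for the example.

```python
import hashlib


def H(x: bytes, nbytes: int = 3) -> bytes:
    """Toy hash (assumption): SHA-256 truncated to nbytes."""
    return hashlib.sha256(x).digest()[:nbytes]


def find_collision(x0: bytes, nbytes: int = 3):
    """Return (y, z) with y != z and H(y) == H(z), or None if x0 lies on the cycle."""
    # Phase 1: find i with x_i == x_{2i}.
    y = z = x0
    while True:
        y = H(y, nbytes)
        z = H(H(z, nbytes), nbytes)
        if y == z:
            break
    if z == x0:
        return None  # x0 is part of the cycle; no collision to locate

    # Phase 2: restart y at x_0, keep z at x_i (== x_{2i}), step both by one.
    y = x0
    hy, hz = H(y, nbytes), H(z, nbytes)
    while hy != hz:
        y, z = hy, hz  # y != z is preserved because hy != hz
        hy, hz = H(y, nbytes), H(z, nbytes)
    return y, z


a, b = find_collision(b"seed")
print(a != b, H(a) == H(b))
```

In this sketch the starting value `b"seed"` is four bytes long while every digest is three bytes, so $x_0$ can never lie on the cycle and the `None` branch is not taken.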

This algorithm only requires space for storing a fixed number of values: $x_0$, $y$ and $z$. How much time does it take? Well, if $j$ and $k$ are the lowest positive integers satisfying $x_j = x_{j+k}$, then Floyd's cycle finding algorithm will take $i = k \lceil j/k \rceil < k(j/k + 1) = j + k$ steps (each involving three evaluations of $H$) to detect the cycle, and then $j$ further steps (each involving two evaluations of $H$) to locate the collision, for a total of up to $5j + 3k \le 5(j+k)$ evaluations of $H$.
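The evaluation count can be checked empirically. The sketch below instruments a toy map (SHA-256 truncated to 16 bits, an assumption made only for this example) with a call counter, finds $j$ and $k$ by brute force using a table, and compares the measured number of evaluations with $3i + 2j$ and the bound $5j + 3k$. All names are illustrative.

```python
import hashlib

calls = 0  # number of evaluations of H


def H(x: bytes, nbytes: int = 2) -> bytes:
    """Toy map (assumption): SHA-256 truncated to nbytes, with a call counter."""
    global calls
    calls += 1
    return hashlib.sha256(x).digest()[:nbytes]


def tail_and_cycle(x0: bytes):
    """Brute-force j and k with a table (used only to check the formula)."""
    first_seen = {}
    x, index = x0, 0
    while x not in first_seen:
        first_seen[x] = index
        x, index = H(x), index + 1
    j = first_seen[x]
    return j, index - j


def floyd_collision(x0: bytes):
    """Constant-space collision search; assumes x0 is not on the cycle."""
    y = z = x0
    while True:
        y, z = H(y), H(H(z))  # three evaluations per detection step
        if y == z:
            break
    y = x0
    hy, hz = H(y), H(z)
    while hy != hz:
        y, z = hy, hz
        hy, hz = H(y), H(z)  # two evaluations per location step
    return y, z


x0 = b"seed"  # 4 bytes, so it cannot equal any 2-byte digest
j, k = tail_and_cycle(x0)
calls = 0  # do not count the brute-force pass
a, b = floyd_collision(x0)
i = k * -(-j // k)  # k * ceil(j / k) in exact integer arithmetic
print(calls, 3 * i + 2 * j, 5 * j + 3 * k)
```

The first two printed numbers should agree, and both should stay below the third.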

Now, if $H$ is a random function on an $m$-element set, then, by the birthday paradox, the expected number of steps $\mathbb E[j+k]$ before the first collision is $O(\sqrt{m})$. Thus, the expected runtime of the collision-finding algorithm described above is also $O(\sqrt{m})$.
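The $O(\sqrt{m})$ behaviour is easy to observe in a small simulation. The sketch below samples a random function lazily (each newly visited point gets a fresh uniform image) and averages the number of distinct values seen before the first repeat, i.e. tail length plus cycle length. For a random function this average is known to be roughly $\sqrt{\pi m / 2} \approx 1.25\sqrt{m}$. The function name, seed and parameters are arbitrary choices for the example.

```python
import random
from math import sqrt


def rho_length(m: int, rng: random.Random) -> int:
    """Distinct values visited before the first repeat under a random function on range(m)."""
    seen = set()
    x = 0
    while x not in seen:
        seen.add(x)
        # x has not been visited before, so its image is a fresh uniform draw.
        x = rng.randrange(m)
    return len(seen)


rng = random.Random(12345)
m, trials = 10_000, 300
average = sum(rho_length(m, rng) for _ in range(trials)) / trials
ratio = average / sqrt(m)
print(round(ratio, 2))  # expected to land near sqrt(pi / 2), about 1.25
```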

If the first of the two possibilities (initial value $x_0$ is part of the cycle) occurs, the algorithm fails. This can be detected cheaply, by checking whether $x_0=x_i$ once $x_i=x_{2i}$ has been detected. Should this occur, we have few options beyond trying another $x_0$ at random. Fortunately, this has low probability, $O(1/\sqrt{m})$. Argument: in the event that we get a collision after exactly $i$ iterations, all the $x_j$ with $0\le j<i$ are equally likely to be the cycling point; hence the odds of the first of the two possibilities are $1/i$. — fgrieu, Jul 24 '12 at 11:24
