A "normal", brute-force attack on a cryptographic hashing algorithm $H$ should have a complexity of about $2^{n}$ for a hash algorithm with an output length of $n$ bits.

That means it takes about $2^{n}$ tries on average to find a colliding message $y$ for a given message $x$ so that $H(y) = H(x)$ while $y \ne x$.

However, a birthday attack (i.e. both $x$ and $y$ can be selected arbitrarily, but $H(x) = H(y)$ is of course still required) is supposed to be much faster, and take only about $2^{n/2}$ tries to find a collision.

I see how that works out in terms of time complexity: since we are now looking at all pairs of guesses made so far, of which there are about $k^2/2$ after $k$ guesses, the probability of a collision now grows quadratically instead of linearly in the number of guesses.
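To make the quadratic growth concrete, here is a minimal Python sketch. It assumes $H$ behaves like a uniformly random function with $2^n$ possible outputs; the function names `collision_probability` and `collision_probability_approx` are made up for this illustration. It compares the exact probability of at least one collision among $k$ guesses with the usual approximation $1 - e^{-k(k-1)/2^{n+1}}$.

```python
import math


def collision_probability(k: int, n_bits: int) -> float:
    """Exact chance that k uniform random n-bit hashes contain a repeat."""
    m = 2 ** n_bits
    p_no_collision = 1.0
    for t in range(k):
        # The (t+1)-th hash must avoid the t distinct values seen so far.
        p_no_collision *= (m - t) / m
    return 1.0 - p_no_collision


def collision_probability_approx(k: int, n_bits: int) -> float:
    """Approximation based on the k(k-1)/2 pairs of guesses."""
    m = 2 ** n_bits
    return 1.0 - math.exp(-k * (k - 1) / (2 * m))


if __name__ == "__main__":
    n_bits = 16
    for k in (16, 64, 256, 1024):
        print(k, collision_probability(k, n_bits),
              collision_probability_approx(k, n_bits))
```

For $n = 16$ and $k = 2^{8}$ both values come out at roughly 0.39, so about $2^{n/2}$ guesses already give a sizeable chance of a collision.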

But shouldn't we also need space to store all of the already tried $x$ and $y$? If we don't, how can we even compare all the pairs of hashes enumerated so far? That seems to make such a birthday attack practically impossible, since storing $2^{n/2}$ hash values takes a lot of memory, not to mention the problem of accessing all that storage in constant time.
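For reference, the storage-hungry variant described above can be sketched as follows. This is only an illustration under stated assumptions: the toy hash `H` is SHA-256 truncated to 3 bytes (so $n = 24$), and the helper name `birthday_with_table` is hypothetical. Every hash computed so far is kept in a dictionary, which is exactly the memory cost in question.

```python
import hashlib
from itertools import count


def H(data: bytes, nbytes: int = 3) -> bytes:
    """Toy hash (assumption): SHA-256 truncated to nbytes bytes."""
    return hashlib.sha256(data).digest()[:nbytes]


def birthday_with_table(nbytes: int = 3):
    """Find x != y with H(x) == H(y) by remembering every hash seen so far."""
    seen = {}  # hash value -> message that produced it
    for i in count():
        msg = i.to_bytes(8, "big")  # distinct messages 0, 1, 2, ...
        h = H(msg, nbytes)
        if h in seen:
            # Return both messages and how many hashes had to be stored.
            return seen[h], msg, len(seen)
        seen[h] = msg


x, y, stored = birthday_with_table()
print(x.hex(), y.hex(), "stored hashes:", stored)
```

The number of stored hashes should be on the order of $2^{n/2}$, which is harmless for a 24-bit toy hash but not for a real one.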

Update: I've found http://www.cs.umd.edu/~jkatz/imc/hash-erratum.pdf, which seems to say that the attack is possible with constant space. However, the proof and explanation given are way above my head.

lxgr

1 Answer

The method described in the link you cited is based on Floyd's cycle finding algorithm, also known as "the tortoise and the hare" algorithm. This is a general-purpose algorithm for detecting cycles in iterated maps, which I will first describe below.

Specifically, consider the sequence $(x_i)$ defined by $x_i = H(x_{i-1})$ for some map $H$ and some initial value $x_0$. If this sequence is cyclic, then we have some integers $j > 0$ and $k > 0$ such that $x_j = x_{j+k}$, and thus also $x_i = x_{i+nk}$ for all integers $i \ge j$ and $n \ge 0$.

In particular, this holds for $i = nk$ and any integer $n$ satisfying $n \ge j/k$, yielding $x_i = x_{2i}$. Thus, if the sequence $(x_i)$ is cyclic, there exists an integer $i > 0$ such that $x_i = x_{2i}$. Conversely, the existence of such an integer also clearly implies that the sequence must be cyclic (with a period that evenly divides $i$).

It thus follows that, to detect cycles in the sequence $(x_i)$, it suffices to check whether $x_i = x_{2i}$ for any positive integer $i$. We can do this iteratively using constant space by keeping track of two elements of the sequence, $y = x_i$ and $z = x_{2i}$ (both initialized to $x_0$ at iteration $i = 0$), and, on each successive iteration, updating them as $y \gets H(y)$ and $z \gets H(H(z))$.
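The detection loop just described might look like this in Python. It is a sketch under assumptions: `H` is a toy map (SHA-256 truncated to 3 bytes, so the sequence lives in a set of $2^{24}$ values and must eventually cycle), and `floyd_meeting_index` is a name invented here.

```python
import hashlib


def H(x: bytes) -> bytes:
    """Toy map (assumption): SHA-256 truncated to 3 bytes."""
    return hashlib.sha256(x).digest()[:3]


def floyd_meeting_index(x0: bytes):
    """Return the first i > 0 with x_i == x_{2i}, together with x_i."""
    y = H(x0)      # tortoise: y = x_1
    z = H(H(x0))   # hare:     z = x_2
    i = 1
    while y != z:
        y = H(y)       # y = x_{i+1}
        z = H(H(z))    # z = x_{2(i+1)}
        i += 1
    return i, y


i, x_i = floyd_meeting_index(b"\x00\x00\x00")
print("x_i == x_2i first holds at i =", i)
```

Only `y`, `z` and the counter `i` are kept, regardless of how long the loop runs.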

If we find such an $i$, we'll know that the sequence $(x_i)$ is cyclic. Now, we have two possibilities: either the initial value $x_0$ is part of the cycle, or it is not. In the latter case, we have $x_0 \ne x_i$ but $$H^{(i)}(x_0) = x_i = x_{2i} = H^{(i)}(x_i),$$ and we've thus found a collision in the $i$-fold iterated map $H^{(i)}$, which in turn implies the existence of a collision in the underlying map $H$.

All that remains to be done is locating the underlying collision. To accomplish that, we rewind the iteration back $i$ steps, so that $y = x_0$ and $z = x_i$, and advance them one step at a time, this time as $y \gets H(y)$ and $z \gets H(z)$ so that, at iteration $j$, we'll always have $y = x_j$ and $z = x_{i+j}$. Since we know that $x_0 \ne x_i$ and $x_i = x_{2i}$, it follows that there must be some $j$ between $0$ and $i-1$ such that $x_j \ne x_{i+j}$ but $x_{j+1} = x_{i+j+1}$, and thus $H(x_j) = H(x_{i+j})$. When we find that $j$ — i.e. when we find the first $y$ and $z$ such that $H(y) = H(z)$ — we stop; that's the collision we've been looking for.
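Putting both phases together, a minimal sketch could look like the following. As before, `H` is an assumed toy map (SHA-256 truncated to 3 bytes) and `find_collision` is a hypothetical name. The function returns `None` when $x_0$ turns out to lie on the cycle, since the argument above does not apply in that case.

```python
import hashlib


def H(x: bytes) -> bytes:
    """Toy map (assumption): SHA-256 truncated to 3 bytes."""
    return hashlib.sha256(x).digest()[:3]


def find_collision(x0: bytes):
    """Return (y, z) with y != z and H(y) == H(z), or None if x0 is on the cycle."""
    # Phase 1: find i with x_i == x_{2i}.
    y, z = H(x0), H(H(x0))
    while y != z:
        y, z = H(y), H(H(z))
    if y == x0:
        return None  # x_0 == x_i: x0 is part of the cycle

    # Phase 2: restart with y = x_0 and z = x_i, one step at a time.
    y, z = x0, y
    while True:
        hy, hz = H(y), H(z)
        if hy == hz:
            return y, z  # first pair with equal images: the collision
        y, z = hy, hz


print(find_collision(b"\x00\x00\x00"))
```

Because `y != z` holds at the start of phase 2 and is preserved whenever the images differ, the returned pair always consists of two distinct inputs.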

This algorithm only requires space for storing a fixed number of values: $x_0$, $y$ and $z$. How much time does it take? Well, if $j$ and $k$ are the lowest positive integers satisfying $x_j = x_{j+k}$, then Floyd's cycle finding algorithm will take $i = k \lceil j/k \rceil < k(j/k + 1) = j + k$ steps (each involving three evaluations of $H$) to detect the cycle, and then at most $j$ further steps (each involving two evaluations of $H$) to locate the collision, for a total of up to $3i + 2j < 5j + 3k \le 5(j+k)$ evaluations of $H$.
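As a small sanity check of this count, consider a hand-made map (an assumption chosen purely for illustration) with a tail $0 \to 1 \to 2 \to 3 \to 4$ leading into the cycle $5 \to 6 \to 7 \to 5$, so that $j = 5$ and $k = 3$. The sketch below counts calls to `H` while running both phases from $x_0 = 0$.

```python
# Hypothetical toy map: tail 0..4, then the 3-cycle 5 -> 6 -> 7 -> 5.
STEP = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 5}
calls = 0


def H(x: int) -> int:
    """Look up the next element and count the evaluation."""
    global calls
    calls += 1
    return STEP[x]


def collide(x0: int):
    """Run detection and location; return the meeting index i and the collision."""
    y, z, i = H(x0), H(H(x0)), 1
    while y != z:
        y, z, i = H(y), H(H(z)), i + 1   # three evaluations per step
    y, z = x0, y
    while True:
        hy, hz = H(y), H(z)              # two evaluations per step
        if hy == hz:
            return i, (y, z)
        y, z = hy, hz


i, pair = collide(0)
j, k = 5, 3
print(i, pair, calls, 5 * j + 3 * k)  # 6 (4, 7) 28 34
```

Here $i = k \lceil j/k \rceil = 6$, the collision is $H(4) = H(7) = 5$, and the 28 evaluations stay below the bound $5j + 3k = 34$.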

Now, if $H$ is a random function on an $m$-element set, then, by the birthday paradox, the expected number of steps $\mathbb E[j+k]$ before the first collision is $O(\sqrt{m})$. Thus, the expected runtime of the collision-finding algorithm described above is also $O(\sqrt{m})$.
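The $O(\sqrt{m})$ behaviour can be checked empirically with a short simulation. This sketch rests on one assumption worth spelling out: for a random function, every element visited for the first time has a fresh, uniformly random image, so the walk can be simulated by drawing a new random value at each step. The names `rho_length` and `mean_rho_length` are invented here, and the comparison value $\sqrt{\pi m / 2}$ is the standard asymptotic estimate for the expected number of distinct elements before the first repeat.

```python
import math
import random


def rho_length(m: int, rng: random.Random) -> int:
    """Number of distinct x_i visited before the sequence first repeats (j + k)."""
    seen = set()
    x = rng.randrange(m)  # x_0
    steps = 0
    while x not in seen:
        seen.add(x)
        x = rng.randrange(m)  # image of a not-yet-evaluated point is uniform
        steps += 1
    return steps


def mean_rho_length(m: int, trials: int, seed: int = 1) -> float:
    """Average rho_length over several independent simulated random functions."""
    rng = random.Random(seed)
    return sum(rho_length(m, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    for m in (10**3, 10**4, 10**5):
        print(m, mean_rho_length(m, 400), math.sqrt(math.pi * m / 2))
```

Note that the simulation itself uses a set for bookkeeping; that is only to measure $j + k$, not part of the constant-space attack.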

Ilmari Karonen
If the first of the two possibilities (initial value $x_0$ is part of the cycle) occurs, the algorithm fails. This can be detected cheaply, by checking if $x_0=x_i$ after it is detected that $x_i=x_{2i}$. Should this occur, we have few options beyond trying another $x_0$, chosen at random. Fortunately, this has low odds $O(1/\sqrt{m})$. Argument: in the event that we get a collision after exactly $i$ iterations, all the $x_j$ with $0\le j<i$ are equally likely to be the cycling point; hence the odds of the first of the two possibilities are $1/i$. – fgrieu Jul 24 '12 at 11:24
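The check and retry suggested in this comment can be sketched as below. The helper names `find_collision` and `find_collision_with_restarts` and the tiny example map are assumptions for illustration; the map has a tail $0 \to \dots \to 4$ feeding the cycle $5 \to 6 \to 7 \to 5$, so starting at 5, 6 or 7 triggers the failure case.

```python
import random


def find_collision(H, x0):
    """Return (y, z) with y != z and H(y) == H(z), or None if x0 is on the cycle."""
    y, z = H(x0), H(H(x0))
    while y != z:
        y, z = H(y), H(H(z))
    if y == x0:          # x_0 == x_i: the cheap failure check
        return None
    y, z = x0, y
    while True:
        hy, hz = H(y), H(z)
        if hy == hz:
            return y, z
        y, z = hy, hz


def find_collision_with_restarts(H, domain, rng):
    """Retry with fresh random starting points until a collision is found.

    Caveat: this loops forever if H has no collisions at all (a permutation).
    """
    while True:
        result = find_collision(H, rng.choice(domain))
        if result is not None:
            return result


# Hypothetical toy map: tail 0..4, then the 3-cycle 5 -> 6 -> 7 -> 5.
STEP = {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 5}
H = STEP.__getitem__

print(find_collision(H, 5))  # None: the start lies on the cycle
print(find_collision_with_restarts(H, list(range(8)), random.Random(0)))
```

In this toy map a large share of starting points lie on the cycle, so retries are common; for a random function on a large set they should be rare, as argued above.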