1

About birthday attack, book Cryptography Engineering says:

In general, if an element can take on N different values, then you can expect the first collision after choosing about $\sqrt{N}$ random elements. We're leaving out the exact details here, but $\sqrt{N}$ is fairly close. For the birthday paradox, we have N = 365 and $\sqrt{N} \approx 19$. The number of people required before the chance of a duplicate birthday exceeds 50% is in fact 23, but $\sqrt{N}$ is close enough for our purposes and is the approximation that cryptographers often use.

One way of looking at this is that if you choose $k$ elements, then there are $k(k - 1)/2$ pairs of elements, each of which has a $1/N$ chance of being a pair of equal values. So the chance of finding a collision is close to $k(k - 1)/2N$. When $k = \sqrt{N}$, this chance is close to 50 % .

and wikipedia says:

As an example, consider the scenario in which a teacher with a class of 30 students (n = 30) asks for everybody's birthday (for simplicity, ignore leap years) to determine whether any two students have the same birthday (corresponding to a hash collision as described further). Intuitively, this chance may seem small. Counter-intuitively, the probability that at least one student has the same birthday as any other student on any day is around 70% (for n = 30), from the formula ${\displaystyle 1-{\frac {365!}{(365-n)!\cdot 365^{n}}}}$.

which can be rephrased in terms of the language in Cryptography Engineering:

$$1 - \frac{N!}{(N-k)! * N^k}$$

Is it supposed to equal to the following from Cryptography Engineering:

$$ (k(k-1))/(2N) $$

Why?

forest
  • 15,253
  • 2
  • 48
  • 103
Tim
  • 151
  • 4

1 Answers1

2

The question asks how we go from $\displaystyle p=1 - \frac{N!}{(N-k)!\,N^k}$ for the probability of collision of $k$ uniformly random values among $N$, to the approximation $\displaystyle p\approx\frac{k(k-1)}{2N}$ (which assumes $k$ is small compared to $\sqrt N$ ).

First we get back to $\displaystyle1-p=\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}$, which is how $p$ was determined in the first place. Then we take the logarithm on both sides and use that $u>0,v>0\implies\ln(u\,v)=\ln(u)+\ln(v)$ to get $$\displaystyle\ln(1-p)=\sum_{j=0}^{k-1}{\ln\left(1-\frac j N\right)}$$

For small $|x|$, it hold $\ln(1+x)\approx x$. Applying this to $x=p$ on the left side and $\displaystyle x=\frac j N$ on the right side, we get $\displaystyle p\approx\sum_{j=0}^{k-1}\frac j N$. We rewrite this as $\displaystyle p\approx\frac 1 N\sum_{j=0}^{k-1}j$.

Now we use that the sum of non-negative integers less than $k$ is $\displaystyle\frac{k\,(k-1)}2$ and get the desired $\displaystyle p\approx\frac{k(k-1)}{2N}$.

Without proof: this approximation of $p$ is always by excess. It's off by less than $+28\%$ when $k\le\sqrt N$, less than $+14\%$ when $k\le\sqrt{2N}$, less than $+7\%$ when $k\le2\sqrt N$.


Most of the error is from the approximation $\ln(1-p)\approx-p$. A much better approximation is: $$p\approx1-e^{-\frac{k(k-1)}{2N}}$$ which assumes only $k\ll N$ rather than $k\ll\sqrt N$. However beware that this alternate formula is numerically instable for small $p$.


In comment it's additionally asked

How shall I understand (this formula) from $1/N$ for each pair? Are the pairs each having two equal values disjoint events? Which part in its derivation is approximation?

One easy way to derive the probability $p$ that there is a collision among $k$ uniform values among $N$ (for $0\le k\le N$) is as the complement of the probability that there is not collision.

For a fixed $N$, define $q_k$ as the probability that there is no collision after $k$ values. Obviously $q_0=q_1=1$. And for $k\ge2$, $q_k$ is the probability that there was no collision among the first $k-1$ values (that is, $q_{k-1}$), time the probability that there is no collision between the $k-1$ previous values and the last drawn one, which is $\displaystyle\frac{N-k}N$ (justification there a exactly $N-k$ values among $N$ that do not leak to collision for the last value drawn).

It follows $\displaystyle q_k=q_{k-1}\left(1-\frac k N\right)$, thus $\displaystyle q_k=\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}$, thus $$p=1-\prod_{j=0}^{k-1}{\left(1-\frac j N\right)}=1-\frac{N!}{(N-k)!\,N^k}$$

This is exact. See first two sections of this answer for the derivation of approximations.


One source justified the approximation as:

One way of looking at this is that if you choose $k$ elements, then there are $k(k−1)/2$ pairs of elements, each of which has a $1/N$ chance of being a pair of equal values.

This hand-waving argument does not yield a mathematically exact derivation of $\displaystyle p=\frac{k(k-1)}{2N}$, since the events are not disjoint. As long as $p$ is small, we can get away with it, but that gets grossly wrong when $k>\sqrt{2N}$.

When $k = \sqrt{N}$, this chance is close to 50%.

That's true if 39.3% is close to 50%.

fgrieu
  • 140,762
  • 12
  • 307
  • 587
  • Thanks. My comment at https://crypto.stackexchange.com/questions/99160/are-these-both-the-probability-of-collision-in-birthday-attack/99176#comment214408_99160 asked for different questions – Tim Mar 19 '22 at 16:12
  • If you take a look at the first quote in my post, the book doesn't derive the probability the way your added to your reply – Tim Mar 19 '22 at 16:30
  • 1
    "since the events are not independent." Additivity of probabilities of several events depends on the events being disjoint not independent – Tim Mar 19 '22 at 16:59
  • The percentages need links... – kelalaka Mar 20 '22 at 11:46
  • @kelalaka: my evidence for +28% is only numerical (hence the without proof): I plotted $\frac{\left(1-\frac{{k^2}!}{(k^2-k)!,k^{2k}}\right),2k^2}{k(k-1)}-1$; same for +14% +7%. – fgrieu Mar 20 '22 at 12:56
  • Thanks. (1) Does your "First we get back to ... get the desired p≈k(k−1)/2N" justify (k(k−1))/(2N) as a reasonable approximation? (2) Does "This hand-waving argument does not yield a mathematically exact derivation of p=k(k−1)/2N, since the events are not disjoint. As long as p is small, we can get away with it, but that gets grossly wrong when k>sqrt(2N). When k=sqrt(N), this chance is close to 50%. That's true if 39.3% is close to 50%" mean that k(k−1)/2N is a bad approximation? (3) if you don't mean contradiction in (1) (2), could you explain, edit, or paraphrase what you actually mean? – Tim Mar 20 '22 at 13:44
  • (1) More precisely, it implies $k(k−1)/(2N)$ is a valid approximation for $k$ or $p$ lower than some threshold. (2) It means that $k(k−1)/(2N)$ is poorly justified by the argument given, comes without a statement of when it applies, and is a terrible approximation for $k$ above some thresold. As explained, it's already significantly off (by +24%) when applied to $k=\sqrt N$ and $N$ of cryptographic interest. (3) There's no contradiction. – fgrieu Mar 20 '22 at 14:01