
Theorem: Choose $Q$ natural numbers independently and uniformly at random from the set $\{1,2, \dots, M\}.$ The probability of getting at least one collision is

$$P_C(Q) = 1 - \frac{M - (Q - 1)}{M} P_{\neg C}(Q-1).$$

Notation: By $P_C$, I mean the probability of getting a collision. By $P_{\neg C}$ I mean the probability of not getting a collision.

Remark: This is the birthday problem.

Remark: So $P_C(Q)$ is computed through its complement. I state the theorem this way because its induction proof follows this recursive form directly.
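
As a minimal sketch (not part of the theorem itself), the recursion can be evaluated exactly with rational arithmetic. The function names `p_no_collision` and `p_collision` are illustrative, and starting the recursion from $P_{\neg C}(0) = 1$ is an assumption made here.

```python
from fractions import Fraction


def p_no_collision(q: int, m: int) -> Fraction:
    """Exact P_notC(q), using P_notC(k) = (m - (k - 1)) / m * P_notC(k - 1)."""
    p = Fraction(1)  # assumed base case: no draws, no collision
    for k in range(1, q + 1):
        # The k-th draw must avoid the k - 1 distinct values already drawn.
        p *= Fraction(m - (k - 1), m)
    return p


def p_collision(q: int, m: int) -> Fraction:
    """Exact P_C(q) as the complement of P_notC(q)."""
    return 1 - p_no_collision(q, m)


if __name__ == "__main__":
    # Classic birthday setting: 23 people, 365 days.
    print(float(p_collision(23, 365)))
```

Rational arithmetic keeps the result exact, at the cost of large numerators and denominators when $Q$ grows.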

Theorem: $$P_C(Q) \approx 1 - e^{-\tfrac{(Q-1)Q}{2M}}.$$

Proof: We know $$e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \frac{x^4}{4!} - \ldots$$

If we keep only the first two terms of this expansion, we get $e^{-x} \approx 1 - x$, which is accurate when $x$ is small. Then

\begin{align} P_{\neg C}(Q)&= \prod_{i=1}^{Q-1} \left(1 - \dfrac{i}{M}\right)\\ &\approx \prod_{i=1}^{Q-1} e^{-i/M} \\ &= e^{-1/M} e^{-2/M} \dotsc\ e^{-(Q - 1)/M} \\ &= e^{-\sum_{i=1}^{Q-1} i/M} \\ &= e^{-\dfrac{1}{M} (Q-1)Q/2}\\ &= e^{-\dfrac{(Q-1)Q}{2M}}, \end{align}

so $P_C(Q) = 1 - P_{\neg C}(Q) \approx 1 - e^{-\tfrac{(Q-1)Q}{2M}},$ as desired.
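
For a concrete $Q$ and $M$, one way to see how the estimate behaves is to compute both sides numerically. The sketch below is only an illustration: the function names are hypothetical, and the "exact" value is a floating-point product, so it is exact only up to rounding.

```python
import math


def p_collision_exact(q: int, m: int) -> float:
    """1 - prod_{i=1}^{q-1} (1 - i/m), evaluated in floating point."""
    p_no_collision = 1.0
    for i in range(1, q):
        p_no_collision *= 1.0 - i / m
    return 1.0 - p_no_collision


def p_collision_approx(q: int, m: int) -> float:
    """The estimate 1 - exp(-(q - 1) q / (2 m)) from the theorem."""
    return 1.0 - math.exp(-(q - 1) * q / (2 * m))


if __name__ == "__main__":
    q, m = 23, 365
    exact = p_collision_exact(q, m)
    approx = p_collision_approx(q, m)
    print(f"exact={exact:.6f} approx={approx:.6f} diff={exact - approx:.6f}")
```

Since $1 - x \leq e^{-x}$, the estimate of $P_{\neg C}(Q)$ is never below the true value, so the estimate of $P_C(Q)$ is never above it.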

Question: How can I get at least a rough idea of how far this estimate is from the true value when I use it to compute the probability of a collision in a concrete case?

Squeamish Ossifrage
user45491
  • Are you asking because you want to know the precise error of the approximation or because you want to know if the equation is good enough for casual estimation purposes for certain Q/M? – Future Security Jan 07 '19 at 19:45
  • I'd like to know the precise error of the approximation. (What's Q/M?) – user45491 Jan 07 '19 at 20:50
  • The precise error can (probably) only be calculated by calculating exact probabilities for each combination of set size and number of samples. It's only easy to calculate exact probabilities for very small set sizes. The math stack exchange can better address the birthday paradox and combinatoric aspects of this problem. And maybe if they can't think of any method they might be able to think of a way to find bounds on the error. – Future Security Jan 07 '19 at 21:36

2 Answers


Clearly, for $x\in (0,1)$, which is our case, $$ e^{-x}-(1-x)< x^2/2, $$ and thus the relative multiplicative error between the estimate and the actual answer satisfies $$ \frac{\widehat{P}_{\neg C}(Q)}{P_{\neg C}(Q)}<\prod_{i=1}^{Q} \frac{(i/M)^2}{2(1-(i/M))}\leq 2^{-Q} M^{-2Q} \frac{Q(Q+1)(2Q+1)}{6}\frac{1}{M+1-(Q+1)/2}, $$ by using the sum of the first $Q$ squares in the numerator and the arithmetic-geometric mean inequality in the denominator. So $$ \frac{Q^3/3}{ 2^Q M^{2Q}(M-Q/2)} $$ is a good approximation to the error.

kodlu
  • Yes. Watch that the expression derived is the relative error on $P_{\neg C}(Q)$, not on $P_C(Q)$. Note: when applying $P_C(Q) \approx 1 - e^{-\tfrac{(Q-1)Q}{2M}}$ with $Q\ll\sqrt M$ using a computer with fixed-width arithmetic (including most spreadsheets), we can encounter a numerical stability issue because the second term of the subtraction is very close to $1$. We can avoid this by using a function giving $e^x-1$, often called expm1(), with $P_C(Q)=-\mathtt{expm1}\left(-\tfrac{(Q-1)Q}{2M}\right)$. Another option is using $P_C(Q) \approx \dfrac{(Q-1)Q}{2M}$ when $Q\ll\sqrt M$. – fgrieu Jan 08 '19 at 07:52
  • I couldn't follow the first inequality --- where the left side has a hat and the right side the product. You lost me with "sum of first $Q$ squares on the numerator". Can you clarify that? Thank you! – user45491 Feb 16 '19 at 23:22
  • The hat term is the estimated probability, divided by the actual probability – kodlu Feb 17 '19 at 00:53
  • Take out $M^{-2Q}$ on the right hand side. Then the numerator is the product of the first $Q$ natural numbers. $1\times 4\times \cdots \times Q$ which is known to be $Q(Q+1)(2Q+1)/6.$ – kodlu Feb 17 '19 at 00:55
  • I'm actually still on the first inequality, you explained the second. I don't know where you took $\prod_{i=1}^Q \frac{(i/M)^2}{2(1 - (i/M))}$ from. – user45491 Feb 18 '19 at 00:04
  • Termwise division; that is what relative error means. I divide the error of each term by the corresponding term of your product, which gives $(e^{-x}-(1-x))/(1-x)<x^2/(2(1-x))$, and then use the right-hand side with $x=i/M.$ – kodlu Feb 18 '19 at 02:27
  • This answer is too condensed to be read at once. If you expand it as in the comments, it will be more educational. – kelalaka Feb 22 '21 at 19:37
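
The numerical-stability point raised in the comments above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the parameters $Q = 2^{16}$, $M = 2^{128}$ are an assumed example of $Q \ll \sqrt M$, not values from the thread. `math.exp` and `math.expm1` are from the Python standard library.

```python
import math


def p_collision_naive(q: int, m: int) -> float:
    """1 - exp(-x): loses all precision when x is tiny."""
    return 1.0 - math.exp(-(q - 1) * q / (2 * m))


def p_collision_expm1(q: int, m: int) -> float:
    """-expm1(-x): same quantity, but accurate for tiny x."""
    return -math.expm1(-(q - 1) * q / (2 * m))


def p_collision_linear(q: int, m: int) -> float:
    """First-order estimate (q - 1) q / (2 m), for q much smaller than sqrt(m)."""
    return (q - 1) * q / (2 * m)


if __name__ == "__main__":
    q, m = 2**16, 2**128  # assumed example values
    print(p_collision_naive(q, m))
    print(p_collision_expm1(q, m))
    print(p_collision_linear(q, m))
```

With these values the exponent is far below the double-precision machine epsilon, so `math.exp` rounds to exactly `1.0` and the naive subtraction returns zero, while the other two forms agree closely.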

kodlu provided an approximation to the error term, but you might also be interested in firm bounds on the collision probability, which you can get without diving into the higher-order terms of the Taylor expansion of $e^{-x}$. How?

  1. You are guaranteed that $1 - x \leq e^{-x}$, for all $x$.

    Proof. Let $f(x) = e^{-x} + x - 1$; the claim is that $f(x)$ is nonnegative everywhere. $f'(x) = 1 - e^{-x}$ is zero only at $x = 0$, so the only possible extreme point is $x = 0$ where $f$ is zero; at, e.g., $x = 1$ and $x = -1$, $f(x)$ is positive, so $f$ is positive on both sides of $x = 0$ and nonnegative everywhere.

Consequently, you can set $1 - i/M \leq e^{-i/M}$ and thus $P_{\lnot C}(Q) \leq e^{-Q (Q - 1)/(2M)}$.

  2. You are also guaranteed that $e^{-2x} \leq 1 - x$, as long as $0 \leq x \leq 1/2$.

    Proof. Let $g(x) = 1 - x - e^{-2x}$; the claim is that $g(x)$ is nonnegative for all $x \in [0, 1/2]$. $g'(x) = 2 e^{-2x} - 1$ is zero only at $x = \frac 1 2 \log 2 \in [0, 1/2]$, so $g$ can have only one extreme point, where it is positive, since $g(\frac 1 2 \log 2) = 1 - \frac 1 2 \log 2 - 1/2 > 0$; at the endpoints, $g(0) = 0$ and $g(1/2) = 1/2 - 1/e$, $g$ is nonnegative, so it is nonnegative on the whole interval.

Consequently, if $Q < M/2$, you can set $1 - i/M \geq e^{-2i/M}$ and thus $P_{\lnot C}(Q) \geq e^{-Q(Q - 1)/M}$.

Putting the inequalities together, if $Q (Q - 1) < M/8$, we have $$1 - \frac{Q (Q - 1)}{M} \leq e^{-Q(Q - 1)/M} \leq P_{\lnot C}(Q) \leq e^{-Q(Q - 1)/(2M)} \leq 1 - \frac{Q (Q - 1)}{4M},$$ or equivalently $$\frac{Q (Q - 1)}{4M} \leq P_C(Q) \leq \frac{Q (Q - 1)}{M}.$$
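
As a quick sanity check of the final pair of bounds, here is a small sketch that compares them with the exact collision probability in rational arithmetic. The function names and the test values of $Q$ and $M$ are assumptions chosen for illustration.

```python
from fractions import Fraction


def p_collision_exact(q: int, m: int) -> Fraction:
    """Exact P_C(q) = 1 - prod_{i=1}^{q-1} (1 - i/m)."""
    p_no_collision = Fraction(1)
    for i in range(1, q):
        p_no_collision *= Fraction(m - i, m)
    return 1 - p_no_collision


def collision_bounds(q: int, m: int) -> tuple[Fraction, Fraction]:
    """Return (lower, upper) = (q(q-1)/(4m), q(q-1)/m).

    Raises ValueError unless q(q-1) < m/8, the condition stated above.
    """
    if 8 * q * (q - 1) >= m:
        raise ValueError("bounds are only claimed for q(q-1) < m/8")
    x = Fraction(q * (q - 1), m)
    return x / 4, x


if __name__ == "__main__":
    m = 10**6  # assumed example value
    for q in (2, 10, 100, 350):
        lower, upper = collision_bounds(q, m)
        exact = p_collision_exact(q, m)
        print(q, float(lower), float(exact), float(upper))
```

The bounds differ by a factor of four, so they pin down the order of magnitude of $P_C(Q)$ rather than its precise value.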

Squeamish Ossifrage