
The following problem arises in the analysis of Bloom filters.

Consider $m$ bins and $N=nk$ balls placed uniformly and independently at random into the bins. A query chooses $k$ bins uniformly and independently at random and asks if they are all non-empty. The main question is as follows.

What is the probability that all $k$ bins in the query are non-empty?

It is assumed that this probability will be a function of $k$, $m$ and $n$.
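For concreteness, here is a minimal Monte Carlo sketch of the experiment described above (the function names and the parameter values at the end are illustrative, not part of the question):

```python
import random

def trial(m, n, k, rng=random):
    """One experiment: throw N = n*k balls into m bins uniformly at
    random, then query k bins (chosen with replacement); return True
    iff every queried bin is non-empty."""
    occupied = [False] * m
    for _ in range(n * k):
        occupied[rng.randrange(m)] = True
    return all(occupied[rng.randrange(m)] for _ in range(k))

def estimate(m, n, k, trials=100_000):
    """Monte Carlo estimate of Pr(all k queried bins are non-empty)."""
    return sum(trial(m, n, k) for _ in range(trials)) / trials

print(estimate(100, 10, 7))  # e.g. m = 100, n = 10, k = 7
```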

The second question is:

For what value of $k$ (which is a function of $m$ and $n$) is this probability minimized?

The standard version of the analysis, taught the world over and reproduced in the Wikipedia page linked above, contains a "now the magic occurs" step which ignores the non-independence of the bins. It gives $k \approx \frac{m}{n} \ln{2}$ as the answer to the second question.
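The minimizer that this standard approximation predicts is easy to check numerically; a small sketch, with illustrative parameter values:

```python
import math

def approx_prob(k, m, n):
    """The independence-assuming approximation (1 - e^{-kn/m})^k."""
    return (1 - math.exp(-k * n / m)) ** k

m, n = 1024, 128
best_k = min(range(1, 50), key=lambda k: approx_prob(k, m, n))
print(best_k, (m / n) * math.log(2))  # best integer k vs. (m/n) ln 2 ~ 5.55
```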

Is there a clean and rigorous way of doing this analysis correctly?

This is a repost of this one, which hasn't had any good answers yet.

Clarification.

The difficulty arises because the probability that all $k$ bins are non-empty is not $(1-(1-\frac{1}{m})^{kn})^k$. The reason is that the probability that one of the $k$ (not necessarily distinct) queried bins is non-empty depends on whether the other queries return non-empty bins.

An approximate answer or good bounds will suffice.

3 Answers


(Note: The analysis in this answer is for all $k$ bins we query being distinct. This is not what we care about in the case of Bloom filters, so I've left another answer with the right analysis. The stuff in this post is still useful, so I'm not deleting this answer yet.)

To avoid the incorrect independence assumption (though it turns out to be a good approximation), we can do the analysis in the following way.

Let $E_i$, for $1 \le i \le k$, be the event that the $i$th bin (out of the $k$ bins we're interested in) is empty after the $N = nk$ balls have been inserted. For $E_i$ to happen, each of the $N$ balls must have gone to one of the $m - 1$ bins other than bin $i$. So $$\Pr(E_i) = \left(1 - \frac1m\right)^N.$$ What about $\Pr(E_i \cap E_j)$? For both of those bins to be empty, all $N$ balls must have gone to the $m - 2$ bins other than bin $i$ and bin $j$, so $$\Pr(E_i \cap E_j) = \left(1 - \frac2m\right)^N.$$ Similarly, the intersection of any $r$ of the $E_i$s has probability $\left(1 - \frac{r}m\right)^N$.

Finally, by the inclusion-exclusion principle, we have

$$\begin{align} &\phantom{=}\Pr(\text{all $k$ bins are non-empty}) \\ &= 1 - \Pr(\text{at least one bin is empty}) \\ &= 1 - \Pr\left(\bigcup_{i=1}^k E_i\right) \\ &= 1 - \sum_{i}\Pr(E_i) + \sum_{i<j}\Pr(E_i \cap E_j) - \sum_{i<j<l}\Pr(E_i \cap E_j \cap E_l) + \dots \\ &= 1 - k\left(1 - \frac1m\right)^N + \binom{k}{2}\left(1 - \frac2m\right)^N - \binom{k}{3}\left(1 - \frac3m\right)^N + \dots \\ &= \sum_{r = 0}^k \binom{k}{r}(-1)^r\left(1 - \frac{r}m\right)^N \end{align}$$ as the exact probability.
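This alternating sum is numerically delicate in floating point, so here is a sketch that evaluates it in exact rational arithmetic (function name and test values are illustrative):

```python
from fractions import Fraction
from math import comb

def exact_prob(k, m, n):
    """Exact Pr(all k *distinct* queried bins are non-empty), via the
    inclusion-exclusion sum above, in exact rational arithmetic."""
    N = n * k
    return sum((-1) ** r * comb(k, r) * Fraction(m - r, m) ** N
               for r in range(k + 1))

print(float(exact_prob(7, 100, 10)))
```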


From this we can see that $\displaystyle (1 - e^{-N/m})^k$ is indeed a good approximation (optimizing this gives $k = \frac{m}{n}\ln 2$ as the best value of $k$), as it's equal to (by the binomial theorem; compare with the exact expression above): $$\sum_{r = 0}^k \binom{k}{r}(-1)^r e^{-Nr/m}.$$

The two are close, term-for-term, because for large $N$, $\left(1 + \frac{x}{N}\right)^N \approx e^x,$ and so $\left(1 - \frac{r}m\right)^N = \left(1 + \frac{-Nr/m}{N}\right)^N \approx e^{-Nr/m}.$ Concretely, $\left(1 - \frac{r}m\right)^N$ and $e^{-Nr/m}$ are, respectively, $$1 - N\frac{r}{m} + \binom{N}{2}\left(\frac{r}{m}\right)^2 - \binom{N}{3}\left(\frac{r}{m}\right)^3 + \dots$$ and $$1 - N\frac{r}{m} + \frac{N^2}{2!}\left(\frac{r}{m}\right)^2 - \frac{N^3}{3!}\left(\frac{r}{m}\right)^3 + \dots.$$
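A quick numerical comparison of the corresponding terms, for illustrative sizes:

```python
import math

m, n, k = 1024, 128, 6   # illustrative sizes
N = n * k
for r in range(1, k + 1):
    exact = (1 - r / m) ** N
    approx = math.exp(-N * r / m)
    print(r, exact, approx)   # the two columns agree closely for large m
```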

ShreevatsaR
  • Is it possible to give bounds on how far $e^{-nk/m}$ is from being correct? –  May 27 '13 at 17:39
  • @felix: Sorry, I guess I don't have precise numerical bounds after all, but you can see from the expressions above that for large $N$ they are very close. – ShreevatsaR May 27 '13 at 18:37
  • @felix: Another issue is that it's not clear what to give the bounds in terms of: $n$, without assuming anything about $m$ and $k$? Or in terms of $m$? Or $N$? – ShreevatsaR May 27 '13 at 19:15
  • There seems to be an answer on pages 27-32 of http://www.cs.nthu.edu.tw/~wkhon/random12/lecture/lecture15.pdf but I can't quite follow it. Does it make sense to you? –  May 27 '13 at 21:06
  • @felix: It makes sense except for the $er^{1/2}$ factor he introduces for the "exact case" versus "Poisson case"... anyway even if we understand that, he's still doing the same independence assumption, so it doesn't help. – ShreevatsaR May 28 '13 at 13:16
  • Ah yes. How about pages 110 and 111 of http://books.google.co.uk/books?id=0bAYl6d7hvkC&lpg=PA111&ots=onP4AtJavN&dq=bloom%20filters%20independent%20chernoff%20bound%20corollary&pg=PA110#v=onepage&q=bloom%20filters%20independent%20chernoff%20bound%20corollary&f=false ? –  May 28 '13 at 16:42
  • @felix: That's very cool; it makes sense now! It's all rigorous finally... I'll update this answer in a couple of days. Thanks for the reference... the techniques developed there (pages 100 to 104) are very useful in these general situations, for bounding the error we incur by assuming independence. – ShreevatsaR May 28 '13 at 22:44
  • Thanks! This $e \sqrt{N}$ factor looks potentially very large. How is it we get to ignore it? –  May 29 '13 at 18:57
  • @felix: I guess the understanding is that when $k$ is large, it's not large compared to the other factor. This analysis is one that's supposed to help with rare events. But you're right, in small instances (say $m=1024$, $n=128$, and $k=6$?), it may be very large. But then for small instances we can do the probability analysis directly and exactly I guess. BTW, the analysis they do in the book, which I guess is what is reproduced in the slides (and now makes sense), is slightly different; they're using this $e\sqrt{N}$ factor for a different term. So the false positive rate is without it. – ShreevatsaR May 29 '13 at 19:37
  • @felix: Oh BTW, they also prove that you can replace the $e\sqrt{N}$ with $2$ when the probability of the event is either monotonically increasing or decreasing in the total number of balls, as is the case here. – ShreevatsaR May 29 '13 at 19:38
  • @felix: I had missed the fact that the $k$ bins we query need not be distinct. The original analysis (as given on Wikipedia etc.) is closer to being correct. I've left a different answer below for it... I'll move the relevant stuff from this answer over there eventually, and delete this one. Sorry for the (my) confusion! – ShreevatsaR May 29 '13 at 20:32

(Note: This is the correct analysis for the $k$ queries being not necessarily distinct, but I've posted this as a separate answer because my previous one was already too long and messy.)

After the $N = nk$ balls are put in the $m$ bins, suppose a fraction $q$ of the bins are empty. When we make the $k$ queries, the probability that a particular query encounters a non-empty bin is exactly $(1 - q)$, as the query might encounter any of the $m$ bins with equal probability. Further, even though the bins are not independent, the $k$ queries we make are completely independent, so the probability that they all encounter non-empty bins is exactly $(1-q)^k$. The final answer is obtained by considering all possible values of $q$, weighted by their probability: the probability of all $k$ queries seeing a non-empty bin is $$\sum_{\alpha} \Pr(q = \alpha)(1-\alpha)^k,$$ where the sum is over the values $\alpha \in [0, 1]$ that $q$ can take, namely values of the form $\frac{r}{m}$ for $0 \le r \le m$. No approximation here.
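This decomposition also suggests a direct way to estimate the answer by simulation: sample $q$, then average $(1-q)^k$. A minimal sketch (names and parameter values are illustrative):

```python
import random

def via_empty_fraction(m, n, k, trials=10_000, rng=random):
    """Estimate sum_alpha Pr(q = alpha) (1 - alpha)^k by simulation:
    throw the N = n*k balls, record the empty fraction q, and average
    (1 - q)^k (valid because the k queries are independent given q)."""
    total = 0.0
    for _ in range(trials):
        occupied = [False] * m
        for _ in range(n * k):
            occupied[rng.randrange(m)] = True
        q = occupied.count(False) / m
        total += (1 - q) ** k
    return total / trials

print(via_empty_fraction(100, 10, 7))
```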

The question of independence of the bins comes up when we try to say what the fraction $q$ might be. The probability of a particular bin (say bin $i$) being empty is $p = \left(1 - \frac1m\right)^N$, since for this bin to be empty, each of the $N$ balls must have gone to one of the $(m - 1)$ bins other than this one. This probability $p$ is also the expected value of the fraction $q$.
(To prove this rigorously: define the indicator variable $X_i$ to be $1$ if bin $i$ is empty, and $0$ otherwise. The total number of empty bins $X = X_1 + X_2 + \dots + X_m$ has expected value $E[X] = E[X_1 + \dots + X_m] = E[X_1] + \dots + E[X_m] = mE[X_1]$ by linearity of expectation, so the expected value of the fraction $q$ (which is the same as $X/m$) is $E[q] = E[X/m] = E[X]/m = E[X_1] = p.$)

The actual value of $q$ can vary, and be different from its expected value $p$. However, the probability of its being far from $p$ is very low: as we move away from $p$, the probability of $q$ taking that value falls off exponentially. As $p \approx e^{-N/m}$, this also means that the probability of $q$ being far from $e^{-N/m}$ is also very low. (We still need to prove this.)

So the probability of all $k$ queries seeing non-empty bins, which is $\sum_{\alpha} \Pr(q = \alpha)(1-\alpha)^k$, is effectively the same as $\left(1 - e^{-N/m} \right)^k$, as $q$ takes on a value around $e^{-N/m}$ with overwhelmingly high probability.


Left to prove: that $q$ is unlikely to be far from $e^{-N/m}$. It's a bit cumbersome to analyze the fraction $q$, as it depends on all the bins, and they are not independent. One approach is to bound the error caused by the independence assumption. Mitzenmacher and Upfal, in their book Probability and Computing: Randomized Algorithms and Probabilistic Analysis, give an elegant technique for doing precisely this.

Let us focus on a particular bin $i$ of the $m$ bins. Consider the number of balls in bin $i$, after $N$ balls have been dropped independently and uniformly at random into the $m$ bins. For each bin $i$, this number (call it $X_i$) follows the binomial distribution: the probability that the bin has $r$ balls is
$\Pr[X_i = r] = \binom{N}{r}\left(\frac1m\right)^r\left(1-\frac1m\right)^{N-r}.$ This is approximately a Poisson distribution with parameter $\lambda = \frac{N}{m}$, or in other words $\Pr[X_i = r] \approx e^{-\lambda}(\lambda^r / r!)$.
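A quick check of how close the two distributions are, for illustrative values of $m$ and $N$:

```python
from math import comb, exp, factorial

m, N = 1024, 768          # illustrative: lambda = N/m = 0.75
lam = N / m
for r in range(5):
    binomial = comb(N, r) * (1 / m) ** r * (1 - 1 / m) ** (N - r)
    poisson = exp(-lam) * lam ** r / factorial(r)
    print(r, binomial, poisson)   # the binomial and Poisson pmfs nearly match
```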

Motivated by this, for the $m$ bins all viewed together, they introduce what they call the Poisson approximation. Consider $m$ bins with the number of balls in each bin $i$ (call this number $Y_i$) as being independently distributed, following a Poisson distribution with parameter $\lambda = \frac{N}{m}$. Of course, under this distribution, the total number of balls across the $m$ bins could vary, though it's indeed $N$ in expectation. However, it is a surprisingly straightforward exercise to prove that

  1. The distribution of $(Y_1, \dots, Y_m)$ when conditioned on $\sum_{i=1}^{m} Y_i = N$ is the same as the distribution of $(X_1, \dots, X_m)$.
  2. Any event that takes place with some probability $p$ in the "Poisson approximation scenario" takes place with probability at most $pe\sqrt{N}$ in the "exact scenario".

This inequality is proved by showing that for any nonnegative function $f(x_1, \dots, x_m)$ (such as the indicator function of an event), we have $$\begin{align} E[f(Y_1, \dots, Y_m)] &\ge E\left[f(Y_1, \dots, Y_m) \,\Big|\, \sum_i Y_i = N\right]\Pr\left(\sum_i Y_i = N\right) \\ &= E[f(X_1, \dots, X_m)] \Pr\left(\sum_i Y_i = N\right) \\ &\ge E[f(X_1, \dots, X_m)] \cdot \frac{1}{e\sqrt{N}}, \end{align}$$ since $\Pr(\sum_i Y_i = N) = e^{-N} (N^N / N!)$ and $N! \le e\sqrt{N} (N/e)^N$.
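The probability $\Pr(\sum_i Y_i = N) = e^{-N} N^N / N!$ and the $1/(e\sqrt{N})$ lower bound on it can be checked numerically; a small sketch (the sum of $m$ independent Poisson($N/m$) variables is Poisson($N$)):

```python
from math import exp, lgamma, log, sqrt

for N in (10, 100, 1000):
    # log of e^{-N} N^N / N!, using lgamma(N+1) = log(N!)
    log_p = -N + N * log(N) - lgamma(N + 1)
    print(N, exp(log_p), 1 / (exp(1) * sqrt(N)))  # exact vs. the 1/(e sqrt N) bound
```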

So, using this together with a Chernoff bound, they prove something like(?) $\Pr(|q - p| \ge \epsilon) \le 2e\sqrt{m}\,e^{-m\epsilon^2 / 3p}$.

In fact, they prove using martingales and the Azuma-Hoeffding inequality (12.5.3, p. 308) that $\Pr(|q - p| \ge \frac{\lambda}{m}) \le 2\exp(-2\lambda^2/m)$.

They even have an exercise (12.19, p. 313) showing that $\Pr(|q - p| \ge \frac{\lambda}{m}) \le 2\exp(-\lambda^2(2m - 1) / (m^2 - p^2m^2))$.

ShreevatsaR

The probability that a given bit is not set in the Bloom filter after $nk$ trials is $(1-\frac 1 m)^{nk}\approx e^{-\frac{nk}{m}}$ when $m$ is large. The probability of $k$ filter collisions (a false positive) is then $(1-e^{-\frac{nk}{m}})^{k} = (1-e^{-\frac{n}{m}k})^{k}=f(k,m,n)$. This function attains its minimum exactly at $k = \frac m n \ln(2)$ when $\frac n m$ is fixed.
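For completeness, the standard calculus behind that last claim (with $m$, $n$ fixed and $k$ treated as continuous): substituting $p = e^{-kn/m}$, so that $k = -\frac{m}{n}\ln p$, gives $$\ln f = k\ln(1-p) = -\frac{m}{n}\,\ln p\,\ln(1-p).$$ The product $\ln p \,\ln(1-p)$ is symmetric under $p \leftrightarrow 1-p$ and attains its maximum on $(0,1)$ at $p = \frac12$, so $f$ is minimized when $e^{-kn/m} = \frac12$, i.e. at $k = \frac{m}{n}\ln 2$.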

gukoff
  • The problem is that this is not true, as the events that the bits are set are not independent. One option would be to give a rigorous argument for why it is approximately true, or to give bounds, but this is the point of the question. –  May 25 '13 at 17:37
  • You'd better pick independent hash functions for your filter. Otherwise it is impossible to predict any probability, if we don't know anything about your choice of functions. – gukoff May 25 '13 at 17:49
  • The problem is that even with independent hash functions the probability of the bits being set is not independent. The abstraction in terms of balls and bins that I have given encapsulates the problem that needs to be solved I believe. –  May 25 '13 at 17:55
  • Everything is fine, the values are independent everywhere I use them. Note just that these $k$ bins are chosen with possible repetition and independently. – gukoff May 25 '13 at 18:31
  • I added a clarification to the question. The point is that when you do the $k$ queries, the probability of one of them seeing a non-empty bin is dependent on whether the others have. So you can't just take $(1-(1-1/m)^{kn})$ and raise it to the power $k$. That is where the error is. –  May 25 '13 at 18:34
  • Amm... Sure I can :D They are all independent, just like the hash-functions. You don't change the filter's table after each query, you do just $k$ independent checks. – gukoff May 25 '13 at 18:41
  • That is wrong, I am afraid. As an extreme case, consider just two bins and set $k=10$, say. If you check both bins and see they are both non-empty, then you know that all other checks will certainly return non-empty. However, if you check one bin and see it is empty, then you know the probability of each future check finding a non-empty bin is exactly a half. The results of the bin checks are not independent. –  May 25 '13 at 18:52