More computationally optimal way to solve probability of N or more empty buckets given B buckets and A balls

Question

Problem

What is the probability of observing N or more empty buckets given B buckets and A balls, if you throw the balls into any of the buckets with equal probability.

Simulations

Python

from random import random as r
TRIALS = 10**5
BALLS = 10
BUCKETS = 12
MIN_EMPTY = 5
success = 0
for _ in range(TRIALS):
    buckets = [0 for _ in range(BUCKETS)]
    for _ in range(BALLS):
        buckets[int(r() * float(BUCKETS))] += 1
    if buckets.count(0) >= MIN_EMPTY:
        success += 1
ans = (success / float(TRIALS))
print ans

R

TRIALS = 10**5
BALLS = 10
BUCKETS = 12
MIN_EMPTY = 5
success=0
for(i in 1:TRIALS){
  b=rep(1,BUCKETS)
  b[sample(1:BUCKETS,BALLS,replace=T)]=0
  if(sum(b) >= MIN_EMPTY){
    success = success + 1
  }
}
success/TRIALS

Computationally difficult numerical solution

Using the equations in this previous question, What's the probability that there's at least one ball in every bin if 2n balls are placed into n bins?, I put together a numerical solution for the probability.

$${\sum_{k=N}^{B-1} (B-k)! S(A,B-k) {B \choose k} \over B^A}$$

This solution produces very large numbers internally for even moderate numbers of Buckets and Balls. Is there a different solution that is friendlier for computers to compute across a wider range of values?

Other solutions

I also came across this problem referred to as "the occupancy problem". See Probability of finding $m$ or more cells empty and http://probabilityandstats.wordpress.com/2010/04/04/a-formula-for-the-occupancy-problem/, although buyer beware, I have not yet been able to get the solution provided on that blog post to give me the answer I get from simulation, and my implementation of the above numerical solution.

Thanks!

Something cannot be «more optimal» than something else: what you mean is «better» :-) — Mariano Suárez-Álvarez, Sep 01 '14 at 17:32
Should the question therefore be ported to "English Language & Usage" :-) — Dale M, Sep 04 '14 at 01:28

Dale M · Answer 1 · 2014-09-03T07:15:20.413

1

The number of balls in each bucket will follow a multinomial distribution, and the probability that any given bucket holds $x$ balls is:

$$Pr(X=x)=\frac{a!}{x!(a−x)!}\frac{(B−1)^{a−x}}{B^a}$$

Without loss of generality, arrange the buckets in ascending order of the number of balls they contain.

So, the probability that the first bucket contains 0 balls (and that there are $N \le 1$ empty buckets is:

$$Pr(N\le 1)=\frac{(B-1)^a}{B^a}$$

From this we can say that the probability that there are $N=0$ empty buckets is:

$$Pr(N=0)=1-\frac{(B-1)^a}{B^a}$$

If there is 1 empty bucket, we now have a multinomial distribution of $a$ balls in $B-1$ buckets. Rinse and repeat.

$$\begin{align} Pr(N\le 2)&=\frac{(B-1)^a}{B^a}\frac{(B-2)^a}{(B-1)^a}\\ &=\frac{(B-2)^a}{B^a} \end{align}$$

A pattern emerges ...

In general:

$$Pr(N\le n)=\frac{(B-n)^a}{B^a}$$

Simulation Code

Module Module1
Dim r As New Random

Function BallsIntoBins(Bins As Integer, Balls As Integer) As Integer()
    Dim retVal() As Integer
    ReDim retVal(Bins - 1)

    For i = 0 To Balls - 1
        Dim thisBin = r.Next(0, Bins)
        retVal(thisBin) += 1
    Next
    Return retVal
End Function

Sub Main()
    Dim Bins As Integer = 10
    Dim Balls As Integer = 12
    Dim Trials As Integer = 1000000
    Dim Empties() As Integer
    ReDim Empties(Bins)

    For i = 1 To Trials
        Dim thisTrial = BallsIntoBins(Bins, Balls)
        Dim empty = thisTrial.Where(Function(b) b = 0).Count
        Empties(empty) += 1
    Next

    Console.WriteLine("{0,7}: {1,9}, {2,9}", "Empties", "=", ">=")
    For i = 0 To Bins
        Console.WriteLine("{0,7}: {1,9:p}, {2,9:p}", i, Empties(i) / Trials, Empties.Skip(i).Sum() / Trials)
    Next
    Console.ReadKey()
End Sub

End Module

With 10 bins and 12 balls we get

Empties:         =,        >=
      0:    0.61 %,  100.00 %
      1:    8.06 %,   99.39 %
      2:   28.88 %,   91.33 %
      3:   37.89 %,   62.45 %
      4:   20.04 %,   24.55 %
      5:    4.21 %,    4.51 %
      6:    0.30 %,    0.31 %
      7:    0.01 %,    0.01 %
      8:    0.00 %,    0.00 %
      9:    0.00 %,    0.00 %
     10:    0.00 %,    0.00 %

edited Sep 03 '14 at 07:15

answered Aug 29 '14 at 06:26

Dale M

2,813

This is not correct. I am digging into why but try the above simulation for B=10 N=5 A=12, just how it is there (you can copy paste that into a python shell), and you get an answer somewhere in the 0.69 ballpark, while your solution gives 0.004. – John St. John Aug 30 '14 at 10:04
I agree that the solution presented is not correct - I will think about it some more - I believe that the first equation (which I took from somewhere) is the problem. There is also something wrong with your algorithm - The expectation of 5+ empty buckets is about 0.08 based on 1,000,000 trials. – Dale M Sep 01 '14 at 02:47
Slight error in my code - the actual p is about 0.045 – Dale M Sep 01 '14 at 04:32
I just added in another simulation written in R. I still get p ~ 0.69 for the above numbers. It seems correct to me as well, 0.045 seems very low for all the ways that balls could land in the same buckets and leave 5,6,...11 buckets empty. Can I see your code? – John St. John Sep 01 '14 at 17:31
OK added to answer – Dale M Sep 03 '14 at 07:02
I got it! Your code is correct, you just have balls and bins switched relative to me. I have 10 balls, 12 bins and 5 empties. – John St. John Sep 04 '14 at 01:04
My code also gives 0.045 when you set balls to 12 and bins to 10. There are still issues in your original equation though. If you fill in your equation assuming 10 balls and 12 bins and 5 or more empties, your equation gives ((12-5)10)/float(1210) =~ 0.0045 and I bet both of our simulations will give something like 0.69 probability. – John St. John Sep 04 '14 at 01:18
Glad we agree on our simulations. I agree that the answer is wrong and your question is nagging the hell out of me! Still thinking ... – Dale M Sep 04 '14 at 01:27

Dale M · Answer 2 · 2014-09-04T07:06:17.847

Given that my initial approach led only to heartache and despair I have been thinking of another approach to the problem, which may still lead to heartache and despair but we shall see ...

Throwing balls into bins can be thought of as a Time-homogeneous Markov chain with a finite state space. Specifically, there are $n$ states representing how many of the $N$ bins have at least one ball in them.

Starting in state $1$ (where the first ball has already been thrown) and throwing in $1$ more ball moves us to state $1$ with $p_{1,1}=\frac{1}{N}$ and to state $2$ with $p_{1,2}=1-\frac{1}{N}$.

In general, from state $k$ ($k$ bins with $\ge 1$ ball), $p_{k,k}=\frac{k}{N}$ and $p_{k,k+1}=1-\frac{k}{N}$.

So the transition matrix $P$ is:

$$P=\begin{pmatrix} p_{i,j} \end{pmatrix} \begin{cases} \frac{i}{N}, & j=i\\ 1-\frac{i}{N}, & j=i+1\\ 0, & \text{otherwise} \end{cases}$$

After $b$ balls, the probability of any $k$ non-empty bins is the solution of:

$$Q=\begin{pmatrix} 1\\ 0\\ \vdots\\ 0\\ \end{pmatrix} P^{b-1}$$

As a triangular matrix, the eigenvalues are the diagonals, $\frac{i}{N}$ for $1\le i\le N$. Since there are $N$ of them and this is an $N\times N$ matrix, the matrix is diagonalisable. So we need to find $A$ and $D$ such that $P=ADA^{-1}$ because $P^{b-1}=AD^{b-1}A^{-1}$ and since D is diagonal the calculation is trivial.

Unfortunately, the last time I worked with matrices is 20+ years ago so I leave the rest to you.