Coupon collector without replacement

Question

Given a urn containing balls of $k$ distinct color such that $i$th color has $c_i$ number of balls. How many balls in expectation one needs to sample in order to to get at least one ball of each color. Problem looks very similar to coupon collector(each color being coupon) but without replacement.
I am able to solve 2 color case. For example if urn contain r red balls and b blue balls then:
Let S be the event that first ball is red and we have $$E[X]=P(S)E[X|S]+P(\bar S)E[X|\bar S]$$ Now $E[X|S]=1+(\frac{r-1}{b+1}+1)$, where $\frac{r-1}{b+1}$ is the expected number of red ball one needs to sample to get blue ball from bag that contatin $r-1$ red balls and $b$ blue balls. Similarly $E[X|\bar S]=1+(\frac{b-1}{r+1}+1)$, using these we can get $E[X]$.
But I don't know how to generalize this to k color case. Any hints?

If you give your 2-colour answer, people will be able to point how to continue to the multicoulour one. — Joce, Mar 05 '17 at 13:31
Hint: Apply the indicator method. And see if you can calculate the probability that in a draw of $n$ balls you do not have a ball of the $i$th color. — awkward, Mar 05 '17 at 16:08
@awkward Thanks for the hint. So I have following now, probability that in a draw of $n$ balls we don't have ball of the $i$th color is $\frac{m-c_i \choose n }{m \choose n}$, where $m=\sum_i c_i$ . Using inclusion exclusion we can find probability that at least one color is missing in draw of n ball, 1 minus that will give us probability that no color is missing in n draws. Now we can apply regular expectation formula to get the result(looks ugly). — user1131274, Mar 05 '17 at 17:43
@awkward If we set indicator random variable $X_i=1$ if the draw of i balls miss some color and $0$ otherwise then $X=1+\sum_{i=1}^m X_i$ give us number of balls one needs to sample to see all color. Now using $EX_i=P(\text{at least one color is missing})$ from above we will get your result. Does this look right? — user1131274, Mar 05 '17 at 17:44
@user1131274 with your definition of $X_i$, maybe you should consider $\sum (1-X_i)$. That's the number of colors in the sample. — awkward, Mar 06 '17 at 11:39

score 2 · Answer 1 · edited Apr 13 '17 at 12:21

We use the notation from the following MSE link with $m$ for the number of trials and $n$ for the number of different types of coupons. We treat the special case where there are $j$ instances of each type and we are sampling without replacement.

We ask about the probability of obtaining the distribution

$$\prod_{q=1}^n C_q^{\alpha_q}$$

where $\alpha_q$ says we have that many instances of type $C_q.$ We obtain

$$\frac{(nj-\sum_{q=1}^n \alpha_q)!}{(nj)!} \prod_{q=1}^n \frac{j!}{(j-\alpha_q)!}.$$

Therefore the sequences according to probability of length $m$ of $n$ types of coupons without replacement and a maximum of $j$ coupons of each type are given by

$$m! [z^m] \left(\sum_{k=0}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n} = m! [z^m] (1+z)^{nj} = {nj\choose m} \times m!.$$

We then obtain from first principles the formula

$$P[T=m] = \frac{1}{m!} {nj\choose m}^{-1} \times n \times j \times (m-1)! [z^{m-1}] \left(\sum_{k=1}^j \frac{j!}{(j-k)!}\frac{z^k}{k!}\right)^{n-1} \\ = nj \times \frac{1}{m} {nj\choose m}^{-1} [z^{m-1}] \left(-1 + (1+z)^j\right)^{n-1}.$$

This becomes

$$nj \times \frac{1}{m} {nj\choose m}^{-1} [z^{m-1}] \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} (1+z)^{qj} \\ = {nj-1\choose m-1}^{-1} \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {qj\choose m-1}.$$

Observe that

$${qj\choose m-1} {nj-1\choose m-1}^{-1} = \frac{(qj)! \times (nj-1-(m-1))!}{(qj-(m-1))! \times (nj-1)!} \\ = {nj-1\choose qj}^{-1} {nj-1-(m-1)\choose qj-(m-1)}.$$

We record for the probabilities the formula

$$\bbox[5px,border:2px solid #00A000]{ P[T=m] = \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {nj-1\choose qj}^{-1} {nj-1-(m-1)\choose nj-1-qj}.}$$

Start by verifying that this is a probability distribution. We obtain for the sum in $m$

$$\sum_{m=n}^{(n-1)j+1} {nj-1-(m-1)\choose nj-1-qj} \\ = [z^{nj-1-qj}] \sum_{m=n}^{(n-1)j+1} (1+z)^{nj-1-(m-1)} \\ = [z^{nj-1-qj}] \sum_{m=j-1}^{n(j-1)} (1+z)^m = [z^{nj-qj}] ((1+z)^{n(j-1)+1} - (1+z)^{j-1}).$$

We have $nj-qj\ge j$ so only the first term contributes and we obtain

$$\sum_m P[T=m] = \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {nj-1\choose qj}^{-1} {n(j-1)+1\choose nj-qj} \\ = \frac{n(j-1)+1}{j} \sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q} {nj-1\choose nj-qj-1}^{-1} {n(j-1)\choose nj-qj-1}$$

We get for the rightmost pair of binomial coefficients

$$\frac{(n(j-1))! \times (qj)!}{(nj-1)! \times (qj+1-n)!} = {nj-1\choose n-1}^{-1} {qj\choose n-1}$$

which yields for the sum

$$\frac{n(j-1)+1}{j} {nj-1\choose n-1}^{-1} \sum_{q=0}^{n-1} {n-1\choose q} \frac{(-1)^{n-1-q}}{n-q} {qj\choose n-1} \\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} \sum_{q=0}^{n-1} {n\choose q} (-1)^{n-1-q} {qj\choose n-1} \\ = \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} {nj\choose n-1} \\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} \sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} {qj\choose n-1} \\ = \frac{n(j-1)+1}{nj}\frac{nj}{nj+1-n} \\ + \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} [z^{n-1}] \sum_{q=0}^{n} {n\choose q} (-1)^{n-1-q} (1+z)^{qj} \\ = 1 - \frac{n(j-1)+1}{nj} {nj-1\choose n-1}^{-1} [z^{n-1}] (1-(1+z)^j)^n$$

Now observe that $[z^{n-1}] (1-(1+z)^j)^n = 0$ hence everything simplifies to $$1$$ and we have a probability distribution.

Continuing with the expectation we have the following closed form:

$$\bbox[5px,border:2px solid #00A000]{ \mathrm{E}[T] = \sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {nj-1\choose qj}^{-1} \sum_{m=n}^{(n-1)j+1} m {nj-1-(m-1)\choose nj-1-qj}.}$$

By means of plotting strategy let us examine some of these. Here are the first few for eight types of coupons starting at $j=1:$

$$8,{\frac {76627}{6435}},{\frac {76801}{5434}},{\frac {7473667}{480675}}, {\frac {1318429}{79794}},\ldots$$

and here is the initial segment for ten types of coupons:

$$10,{\frac {707825}{46189}},{\frac {7008811}{380380}}, {\frac {266299459}{13042315}},{\frac {182251913}{8360638}}, {\frac {748880445829}{32831263465}},\ldots$$

Careful inspection of these values reveals that we cannot hope for additional simplification when $j\ge 2$ because if it were possible it would have appeared in these sample values. We do see however that the case $j=1$ should be possible, the value being $n$ (we always finish after $n$ draws if there is only one instance of each coupon).

We now do this calculation, which is trivial, but nontheless a useful sanity check, starting with

$$\sum_{q=0}^{n-1} {n-1\choose q} (-1)^{n-1-q} {n-1\choose q}^{-1} \sum_{m=n}^{n} m {n-1-(m-1)\choose n-1-q} \\ = \sum_{q=0}^{n-1} (-1)^{n-1-q} \times n {n-1-(n-1)\choose n-1-q} \\ = (-1)^{n-1-(n-1)} \times n \times {0\choose n-1-(n-1)} = n.$$

It certainly seems like a worthwhile challenge to prove that the closed form for $\mathrm{E}[T]$ is $n H_n$ in the limit, which is confirmed by the numerical evidence.

We did verify the formula for the expectation in software, as follows. It really is quite remarkable that the output from this program is in excellent agreement with the closed form on all values that were tested.

#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
#include <time.h>
#include <string.h>

int main(int argc, char **argv)
{
  int n = 6 , j = 3, trials = 1000; 

  if(argc >= 2){
    n = atoi(argv[1]);
  }

  if(argc >= 3){
    j = atoi(argv[2]);
  }

  if(argc >= 4){
    trials = atoi(argv[3]);
  }

  assert(1 <= n);
  assert(1 <= j);
  assert(1 <= trials);

  srand48(time(NULL));
  long long data = 0;

  for(int tind = 0; tind < trials; tind++){
    int src[n*j];

    for(int cind = 0; cind < n*j; cind++)
      src[cind] = cind/j;

    int seen = 0; int steps = 0; 
    int dist[n];

    for(int cind = 0; cind < n; cind++)
      dist[cind] = 0;


    while(seen < n){
      int cpidx = drand48() * (double)(n*j-steps);
      int coupon = src[cpidx];

      for(int cind=cpidx; cind < n*j-steps-1; cind++)
        src[cind] = src[cind+1];

      steps++;

      if(dist[coupon] == 0)
        seen++;
      dist[coupon]++;
    }

    data += steps;
  }

  long double expt = (long double)data/(long double)trials;
  printf("[n = %d, j = %d, trials = %d]: %Le\n", 
         n, j, trials, expt);

  exit(0);
}

score 2 · Answer 2 · answered Dec 09 '21 at 01:41

I will use $n_i$ in place of $c_i$ for the number of balls of the $i^{th}$ color, and let $n=n_1+\dots+n_k$ be the total number of balls.

For each $i\in \{1,\dots,k\}$, let $T_i$ be the number of draws it takes to see the $i^{th}$ color. The variable you are interested in is $$ T=\max(T_1,\dots,T_k), $$ which is the number of draws it takes to get all colors.

Note that $E[T_i]=(n+1)/(n_i+1)$. This is by the same logic for the two color example in your post. If you imagine lining up all the balls in the order they are drawn, then the $n_i$ balls of color $i$ will divide the other balls into $n_i+1$ sections, each with an expected length of $(n-n_i)/(n_i+1)$, to which you must add one.

By the same logic, for any subset of colors $S\subseteq \{1,\dots,k\}$, $$ E\left[\min_{i\in S} T_i\right]=\frac{n+1}{\left(\sum_{i\in S}n_i\right)+1} $$ The other piece we need is this equation which relates the max of $k$ numbers to the min's of their subsets: $$ \max(x_1,\dots,x_k)=\sum_{S\subseteq \{1,\dots,k\}, S\neq \varnothing}(-1)^{|S|-1}\min_{i\in S} x_i $$ valid for real numbers $x_1,\dots,x_k$. Applying this with $x_1=T_1,\dots,x_k=T_k$, we conclude $$ \boxed{E[T] =\sum_{S\subseteq \{1,\dots,k\}, S\neq \varnothing}(-1)^{|S|-1}\frac{n+1}{\left(\sum_{i\in S}n_i\right)+1}} $$

In the special case where there are the same number $c$ of each color of ball, this simplifies to $$ E[T]=\sum_{i=1}^k (-1)^{i-1} \binom{k}{i} \frac{ck+1}{ci+1} $$

Coupon collector without replacement

2 Answers2

Linked