3

There are $n$ coupons in a collection. A collector has the ability to purchase a coupon, but can't choose the coupon they purchase. Instead, the coupon is revealed to be coupon $i$ with probability $p_i=\frac 1 n$. Let $N$ be the number of coupons they'll need to collect before they have at least one coupon of each type. Find the expected value and variance of $N$. Bonus: generalize to the case where the probability of collecting the $j$th coupon is $p_j$ with $\sum\limits_{j=1}^n p_j=1$.

I recently came across this problem and came up with, or unearthed, various methods to solve it. I'm intending this page as a wiki with various solutions. I'll be posting all the solutions I'm aware of (four so far) over time.


EDIT: As mentioned in the comments, this question differs from the one it is marked a duplicate of since (for one thing) it includes an expression for the variance and covers the general case where the coupons have unequal probabilities. Calculating the variance in that general case has not been covered anywhere on the site apart from an earlier post by me, which this one intends to consolidate along with other approaches to the problem.


EDIT: Paper on the solutions on this page submitted to ArXiv: http://arxiv.org/abs/2003.04720

Greg Martin
  • 78,820
Rohit Pandey
  • 6,803
  • This is just the standard Coupon Collector's problem, yes? – lulu Nov 28 '19 at 01:51
  • Yup. Just wanted to start a wiki page on it since I dug into it very deeply recently (by my standards anyway). – Rohit Pandey Nov 28 '19 at 01:58
  • 1
    See https://math.stackexchange.com/questions/28905/expected-time-to-roll-all-1-through-6-on-a-die and the 68 pages linked to it as well as the 184 tagged pages. – Henry Nov 28 '19 at 02:11
  • That link is certainly relevant and helpful. But once I post all my answers, you will see what I have in mind for this one. – Rohit Pandey Nov 28 '19 at 02:15
  • 2
    To the downvoters - have patience, let me post the answers first. Besides, the link posted in the comments which this is a supposed duplicate of doesn't have the variance. – Rohit Pandey Nov 28 '19 at 02:27
  • It's admirable that you're working this out in depth for yourself, and it is a very good exercise to do. The question is what is the improvement to the site from posting this exercise. The downvotes appear to register someone's opinion that there is not really any improvement. If any of your solutions is particularly novel (not found on Wikipedia nor hinted at in any of the answers of the linked questions), it might have been a good idea to lead with that one. As it is, we've seen the first answer many times already. – David K Nov 28 '19 at 02:39
  • 1
    Right, I'm starting with the ones that are shortest to write down. That's why I request the downvoters to have some patience. I guarantee, there will be solutions (particularly for the variance) that are novel and not available readily on the internet. Also, the other answers with more complex methods will refer to the simpler one for reference and verification. – Rohit Pandey Nov 28 '19 at 02:43
  • 2
    Would the downvoters care to elaborate? The question asks for approaches to calculate variance which have not previously been covered on this website and are not readily available on the internet/ textbooks. – Rohit Pandey Nov 28 '19 at 08:36
  • @Henry none of them have an expression for the variance in the general case of unequal probabilities. – Rohit Pandey Nov 28 '19 at 12:21
  • @RohitPandey perhaps not since you generalised the question 5 hours ago, but for example the variance of equal probability coupons is dealt with in https://math.stackexchange.com/questions/377077/roling-a-dice-until-we-have-all-the-numbers-variance while the more general case is in your 2017 post https://math.stackexchange.com/questions/3439096/coupon-collectors-problem-variance-calculation-missing-a-term – Henry Nov 28 '19 at 14:30
  • 1
    My post was from 2019 and I would have serious memory issues if I didn't remember it. The whole point of this page was to consolidate all those disparate posts into a wiki. – Rohit Pandey Nov 28 '19 at 16:54
  • Thanks for this question and the answers! I just linked to this collection for the second time. Also a good idea to turn it into a paper. – joriki Apr 06 '20 at 08:05
  • Thanks @joriki. I already have a paper on this: https://arxiv.org/abs/2003.04720. Wanted to include in the paper your answer: https://math.stackexchange.com/questions/379525/probability-distribution-in-the-coupon-collectors-problem and add you as a co-author or otherwise acknowledge in some way I took it from you. Is that okay? Also, on my TODO list is to see if we can get the variance using that approach as well :) – Rohit Pandey Apr 08 '20 at 03:12

4 Answers

5

A3: Using the Poisson process to magically concoct independent random variables. This is the most powerful of all approaches since it's the only one that allows us to solve for both the mean and variance (and higher moments as well) of the coupon collector's problem in the general case of coupons having unequal probabilities.

The other approaches either work for all moments but only the special case of equal probabilities (A1 and A2) or for the general case of unequal probabilities but only the mean (A4).

A question about this was asked and answered by me earlier: Coupon collectors problem: variance calculation missing a term. This post is a consolidation.


In example 5.17 of the book, Introduction to probability models by Sheldon Ross, the Coupon collector's problem is tackled for the general case where the probability of drawing coupon $j$ is given by $p_j$ and of course, $\sum\limits_j p_j = 1$.

Now, he imagines that the collector collects coupons in accordance with a Poisson process with rate $\lambda=1$. Furthermore, every coupon that arrives is of type $j$ with probability $p_j$.

Now, he defines $X_j$ as the first time a coupon of type $j$ is observed; the type-$j$ coupons then arrive in accordance with a Poisson process with rate $p_j$. We're interested in $X$, the time it takes to collect all coupons (eventually, we'll convert back to $N$, the number of coupons to be collected). So we get:

$$X = \max_{1\leq j \leq n}X_j$$

Note that if we denote $N_j$ as the number of coupons to be collected before the first coupon of type $j$ is seen, we also have for the number needed to collect all coupons, $N$:

$$N = \max_{1\leq j \leq n}N_j \tag{0}$$

This equation is less useful since the $N_j$ are not independent. It can still be used to get the mean (see answer A4), but trying to get the variance with this approach gets considerably more challenging due to this dependence of the underlying random variables.

But, the incredible fact that the $X_j$ are independent (discussion on that here), allows us to get:

$$F_X(t) = P(X<t) = P(X_j<t \; \forall \; j) = \prod\limits_{j=1}^{n}(1-e^{-p_j t})\tag{1}$$

Mean

Now, Ross uses the expression: $E(X) = \int\limits_0^\infty S_X(t)dt$, where $S_X(t)$ is the survival function to get:

$$E(X) = \int\limits_{0}^{\infty}\left(1-\prod\limits_{j=1}^{n}(1-e^{-p_j t})\right) dt $$

$$= \sum\limits_j\frac {1}{p_j} - \sum\limits_{i<j}\frac {1}{p_i+p_j} + \dots +(-1)^{n-1} \frac{1}{p_1+\dots+p_n}\tag{2}$$
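As a quick numerical sanity check (my own addition, not part of Ross's treatment), equation (2) can be verified against direct numerical integration of the survival-function integral above. The function names below are mine; this is a sketch, not a definitive implementation:

```python
import math
from itertools import combinations

def mean_inclusion_exclusion(p):
    """E(X) from equation (2): alternating sum over nonempty subsets of coupons."""
    n = len(p)
    total = 0.0
    for r in range(1, n + 1):
        for subset in combinations(p, r):
            total += (-1) ** (r - 1) / sum(subset)
    return total

def mean_by_integration(p, t_max=200.0, steps=20_000):
    """E(X) by trapezoidal integration of the survival function 1 - F_X(t)."""
    dt = t_max / steps
    total = 0.0
    for i in range(steps + 1):
        t = i * dt
        surv = 1.0 - math.prod(1.0 - math.exp(-pj * t) for pj in p)
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * surv * dt
    return total

p = [0.5, 0.3, 0.2]
print(mean_inclusion_exclusion(p))  # agrees with mean_by_integration(p)
```

For the equal-probability case $p_j = \frac 1 n$, both routines reduce to $n\sum_{k=1}^n \frac 1 k$, which provides a second check.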

For our case here we have: $p_j=\frac{1}{n} \forall j$

Substituting in the equation above we get:

$$E(X) = n\sum\limits_{k=1}^{n}(-1)^{k-1} \frac{n \choose k}{k}$$

From here we get as before:

$$E(X) = n\sum\limits_{k=1}^n \frac{1}{k}$$

Further, Ross shows that $E(N)=E(X)$ using the law of total expectation.

First, he notes that

$$E(X|N=k)=kE(T_i)$$

where the $T_i$ are the inter-arrival times of the coupon arrivals. Since these are assumed to be exponential with rate 1,

$$E(X|N)=N\tag{3}$$

Taking expectations on both sides and using the law of total expectation we get:

$$E(X)=E(N)$$

Variance

This approach can easily be extended to find $V(N)$, the variance (not covered by Ross). We can use the following expression to get $E(X^2)$:

$$E(X^2) = \int\limits_0^\infty 2tP(X>t)dt = \int\limits_0^\infty 2t\left(1-\prod\limits_{j=1}^n(1-e^{-p_j t})\right)dt$$

Using the fact that $\int\limits_0^\infty te^{-pt}\,dt=\frac{1}{p^2}$ and the same algebra as for $E(X)$ we get:

$$\frac{E(X^2)}{2} = \sum \frac {1} {p_j^2} -\sum_{i<j} \frac{1}{(p_i+p_j)^2}+\dots +(-1)^{n-1}\frac{1}{(p_1+\dots+p_n)^2} $$

Now, let's consider the special case where all coupons have an equal probability of being selected. In other words, $p_j=\frac 1 n \; \forall \; j$.

We get:

$$\frac{E(X^2)}{2} = n^2\left(\sum\limits_{k=1}^n (-1)^{k-1}\frac{n\choose k}{k^2}\right)$$

Per my answer to the question here, this summation yields:

$$E(X^2) = 2n^2\left( \sum_{j=1}^n\sum_{k=1}^j\frac{1}{jk}\right)\tag{4}$$
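The binomial identity behind this step (the alternating sum over $\frac{n\choose k}{k^2}$ equals the double harmonic sum) can be checked numerically. A quick sketch of my own, with hypothetical helper names:

```python
from math import comb

def alternating_sum(n):
    # left-hand side: sum_{k=1}^n (-1)^(k-1) C(n,k) / k^2
    return sum((-1) ** (k - 1) * comb(n, k) / k**2 for k in range(1, n + 1))

def double_harmonic(n):
    # right-hand side: sum_{j=1}^n (1/j) * H_j = sum_{1 <= k <= j <= n} 1/(jk)
    return sum(sum(1.0 / (j * k) for k in range(1, j + 1)) for j in range(1, n + 1))

for n in range(1, 15):
    assert abs(alternating_sum(n) - double_harmonic(n)) < 1e-9
```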

As a side-note, the binomial identity arising from the calculation of the second moment can be generalized: $$\sum_{k=1}^n(-1)^{k-1}\frac{n\choose k}{k^r}=\sum_{i_1<i_2<\dots <i_r}\frac{1}{i_1 i_2 \dots i_r}$$ See here.

Equation (4) has given us $E(X^2)$ but remember that we're interested in finding $E(N^2)$.

Using the law of total variance we get:

$$V(X)=E(V(X|N))+V(E(X|N))$$

So per equation (3) we have:

$$V(X)=E(V(X|N))+V(N)\tag{5}$$

Now,

$$V(X|N)=NV(T_i)$$

And since $T_i \sim Exp(1)$, we have $V(T_i)=1$ meaning, $V(X|N)=N$.

Substituting into (5),

$$V(X)=E(N)+V(N)$$

Since $V(X)=E(X^2)-E(X)^2$ and $E(X)=E(N)$, this gives:

$$V(N)=E(X^2)-E(N)-E(N)^2$$

where equation (4) gives $E(X^2)$ and $E(N)=n\sum_{k=1}^n \frac{1}{k}$, as shown multiple times on this page. This is consistent with equation (5) of A2.
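The whole chain for the general unequal-probability case can be verified numerically (my own check, not part of Ross's text): compute $E(N)$ and $E(X^2)$ by inclusion-exclusion over subsets of coupons, form $V(N)=E(X^2)-E(N)-E(N)^2$, and compare the equal-probability case against the closed form of A1. A short Python sketch:

```python
from itertools import combinations

def moments(p):
    """E(N) and V(N) for unequal coupon probabilities via equation (2)
    and its second-moment analogue."""
    n = len(p)
    ex = ex2_half = 0.0
    for r in range(1, n + 1):
        for subset in combinations(p, r):
            s = sum(subset)
            sign = (-1) ** (r - 1)
            ex += sign / s            # equation (2): E(X) = E(N)
            ex2_half += sign / s**2   # E(X^2)/2, term by term
    ex2 = 2.0 * ex2_half
    var_n = ex2 - ex - ex**2          # V(N) = E(X^2) - E(N) - E(N)^2
    return ex, var_n

# sanity check: equal probabilities recover the closed forms of A1
n = 6
mean, var = moments([1.0 / n] * n)
harmonic = sum(1.0 / k for k in range(1, n + 1))
assert abs(mean - n * harmonic) < 1e-9
assert abs(var - (n**2 * sum(1.0 / k**2 for k in range(1, n + 1)) - n * harmonic)) < 1e-9
```

The subset loop is exponential in the number of coupon types, so this is only practical for small $n$; it is meant as a verification, not an efficient algorithm.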

Rohit Pandey
  • 6,803
  • Dear Dr. Pandey, I really appreciate your careful exposition ! I would like to cite your result on the variance in the general (not necessarily uniformly distributed coupons) case, in my work. How may I do so ? Have you a preprint on arXiv ? Thank you :) – Simon Feb 25 '20 at 18:06
  • Thanks Simon. I don't have an arxiv for this yet. Can you give me until Monday to create one and share? – Rohit Pandey Feb 26 '20 at 19:22
  • Thank you very much for your reply @Rohit Pandey. That would be fantastic, but please take your time if you need to ! I am a little while away from submitting my work for publication. – Simon Feb 26 '20 at 22:08
  • @Simon - just completed a first draft and submitted to ArXiv: https://arxiv.org/submit/3078260/view – Rohit Pandey Mar 08 '20 at 06:16
  • Thank you very much for letting me know - I'll be very interested to read it. This link is just for you, the one who submitted the article. When it is approved, please could you post its public link ? – Simon Mar 09 '20 at 12:12
  • I think it was approved now. Here is the link: http://arxiv.org/abs/2003.04720. Let me know if you can see it now. – Rohit Pandey Mar 11 '20 at 07:11
  • 1
    I can ! Thank you very much ! I look forward to reading it properly asap. – Simon Mar 11 '20 at 15:36
3

A4: Using maximum of minimums identity


Let $N_j$ be the number of coupons to be collected before we see the first coupon of type $j$ and $N$ the number of coupons until all are collected. We have:

$$N = \max_{1\leq j \leq n}N_j$$

This is equation (0) of answer A3 and in conjunction with maximum of minimums identity we get:

$$N = \sum_j N_j - \sum_{1\leq j < k\leq n} \min(N_j, N_k) + \sum_{1\leq j < k < i \leq n} \min(N_j, N_k, N_i) - \dots \tag{1}$$

and the fact that the minimum over any subset $S$ of the $N_j$'s is a geometric random variable with parameter $p=\sum\limits_{j\in S} p_j$ leads to equation (2) of A3. From there, we can substitute $p_j=\frac 1 n \; \forall \; j$ to get:

$$E(N) = n\sum\limits_{k=1}^n \frac 1 k$$
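Since each term of the expansion is the expectation of a geometric minimum, taking expectations of equation (1) reduces $E(N)$ to an alternating sum over nonempty subsets of coupons. A short Python sketch of this computation (my own illustration; the function name is hypothetical):

```python
from itertools import combinations

def expected_n(p):
    """E(N) from the maximum-minimums identity (1):
    E[min over subset S of the N_j] = 1 / sum_{j in S} p_j, since the
    minimum is geometric with parameter sum_{j in S} p_j."""
    n = len(p)
    return sum(
        (-1) ** (r - 1) / sum(subset)
        for r in range(1, n + 1)
        for subset in combinations(p, r)
    )

# equal-probability case recovers n * H_n
n = 5
assert abs(expected_n([1.0 / n] * n) - n * sum(1.0 / k for k in range(1, n + 1))) < 1e-9
```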

Note that it's not easy to get the variance, $V(N)$ with this approach because the terms in equation (1) are not independent.

Rohit Pandey
  • 6,803
2

A2: Using a recurrence


Consider a state where the collector has $m$ coupons in his collection. Let $T_m$ be the number of coupons needed to complete the collection. If the total coupons he needs to collect to complete the collection is $N$, we then have:

$$N = T_0$$

Now, we could observe that (as pointed out by @DavidK in the comments):

$$N_m = T_{m}-T_{m+1}$$

and summing over all $m$ (and noting that $T_n=0$) leads us to:

$$T_0 = \sum_m N_m$$

and this leads to the approach in A1 which makes the problem much easier to solve.

Alternately, we can continue working with the $T_m$'s and construct a recurrence.

Consider what happens when the collector has $m$ coupons and he collects one more. With probability $\frac{m}{n}$, he fails to add a new coupon and is back to where he started, making no progress. Let $I\left(\frac{m}{n}\right)$ be a Bernoulli random variable with $p=\frac{m}{n}$. We then have the expression:

$$T_m = 1+I\left(\frac{m}{n}\right)T_m'+\left(1-I\left(\frac{m}{n}\right)\right)T_{m+1}\tag{1}$$

Where $T_m'$ is independent of $T_{m+1}$ and identically distributed with $T_m$. Taking expectations on both sides,

$$E(T_m) = 1+ \frac{m}{n}E(T_m)+\frac{n-m}{n}E(T_{m+1})$$

$$E(T_m)\left(1-\frac{m}{n}\right) = 1+ \left(1-\frac{m}{n}\right)E(T_{m+1})$$

$$E(T_m)-E(T_{m+1}) = \frac{n}{n-m}\tag{2}$$ As noted before, the L.H.S is simply $E(N_m)$ as defined in A1. In general we have, $$\sum\limits_{m=k}^{n-1}E(T_m)-\sum\limits_{m=k}^{n-1}E(T_{m+1}) = \sum\limits_{m=k}^{n-1}\frac{n}{n-m}$$

Noting that $T_n=0$ we have, $$E(T_k)=\sum\limits_{m=k}^{n-1}\frac{n}{n-m}=n\sum\limits_{j=1}^{n-k}\frac{1}{j}$$ Relabeling $k=n-m$,

$$E(T_{n-m}) = n\sum\limits_{k=1}^{m}\frac{1}{k}\tag{3}$$

We're interested in $T_0$, so let's substitute $m=n$ in equation (3).

$$E(T_0) = n \sum\limits_{k=1}^{n}\frac{1}{k}$$
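Recurrence (2) can also be solved numerically by stepping backward from $E(T_n)=0$, which gives a quick check of the closed form above. A minimal Python sketch of my own (the function name is hypothetical):

```python
def expected_T(n):
    """Solve recurrence (2) backward from T_n = 0:
    E(T_m) = E(T_{m+1}) + n/(n-m)."""
    e = [0.0] * (n + 1)  # e[m] holds E(T_m), with e[n] = 0
    for m in range(n - 1, -1, -1):
        e[m] = e[m + 1] + n / (n - m)
    return e

n = 10
e = expected_T(n)
harmonic = sum(1.0 / k for k in range(1, n + 1))
assert abs(e[0] - n * harmonic) < 1e-9  # E(T_0) = n * H_n
```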


Now, let's try and find the variance, $V(N)=V(T_0)$. Let's square both sides of equation (1). To make the algebra easier, let's re-arrange and note that $I(\frac{m}{n})(1-I(\frac{m}{n}))=I(\frac{m}{n})-I(\frac{m}{n})^2=0$.

$$=>(T_m-1)^2 = I\left(\frac{m}{n}\right)^2 T_m'^2+(1+I\left(\frac{m}{n}\right)^2-2I\left(\frac{m}{n}\right))T_{m+1}^2$$

Now, note the following property of Bernoulli random variables: $I(\frac{m}{n})^2=I(\frac{m}{n})$. This means:

$$T_m^2-2T_m+1 = I\left(\frac{m}{n}\right) T_m'^2+(1-I\left(\frac{m}{n}\right))T_{m+1}^2$$

We have to be careful here to note which random variables are i.i.d. and which are identical. See here: How to square equations involving random variables.

Taking expectation and doing some algebra gives us,

$$\left(1-\frac{m}{n}\right)E(T_m^2)=2E(T_m)+\left(1-\frac{m}{n}\right)E(T_{m+1}^2)-1$$

$$=>E(T_m^2)-E(T_{m+1}^2)=2E(T_m)\frac{n}{n-m}-\frac{n}{n-m}$$

$$=>\sum\limits_{m=0}^{n-1}E(T_m^2)-\sum\limits_{m=0}^{n-1}E(T_{m+1}^2)=\sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

$$=> E(T_0^2)-E(T_n^2)=\sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

But, $T_n=0$ and from equation (3), $E(T_m)=n \sum\limits_{k=1}^{n-m}\frac 1 k$. So we get:

$$E(T_0^2) = \sum\limits_{m=0}^{n-1}2E(T_m)\frac{n}{n-m}-\sum\limits_{m=0}^{n-1}\frac{n}{n-m}$$

$$=>E(T_0^2) = 2n^2 \sum\limits_{m=0}^{n-1}\frac{1}{n-m}\sum\limits_{k=1}^{n-m}\frac{1}{k} -n\sum\limits_{m=0}^{n-1}\frac{1}{n-m}$$ Now, change variables $j=n-m$

$$=>E(T_0^2) = 2n^2 \sum\limits_{j=1}^{n}\frac{1}{j}\sum\limits_{k=1}^{j}\frac{1}{k} -n\sum\limits_{j=1}^{n}\frac{1}{j}$$

$$=>E(T_0^2) = 2n^2\sum\limits_{1 \leq k \leq j \leq n} \frac{1}{jk}-E(T_0)\tag{4}$$

This can easily be rearranged to give the variance:

$$V(T_0) = 2n^2\sum\limits_{1 \leq k \leq j \leq n} \frac{1}{jk}-E(T_0)-E(T_0)^2\tag{5}$$

Comparing equation (5) above with equation (2) of A1 we get the easily verifiable identity:

$$2\sum_{1\leq j\leq k \leq n} \frac{1}{jk}=\sum\limits_{i=1}^n\frac{1}{i^2}+\left(\sum\limits_{i=1}^n\frac{1}{i}\right)^2$$
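Both recurrences can be run backward together to compute $V(T_0)$ numerically and check it against equation (2) of A1 (equivalently, the identity above). A short Python sketch of my own (the function name is hypothetical):

```python
def variance_T0(n):
    """Run the first- and second-moment recurrences backward from
    T_n = 0 and return V(T_0) = E(T_0^2) - E(T_0)^2."""
    e1 = [0.0] * (n + 1)  # e1[m] = E(T_m)
    e2 = [0.0] * (n + 1)  # e2[m] = E(T_m^2)
    for m in range(n - 1, -1, -1):
        c = n / (n - m)
        e1[m] = e1[m + 1] + c              # recurrence for the mean
        e2[m] = e2[m + 1] + 2 * e1[m] * c - c  # recurrence for the 2nd moment
    return e2[0] - e1[0] ** 2

n = 8
closed_form = (n**2 * sum(1 / k**2 for k in range(1, n + 1))
               - n * sum(1 / k for k in range(1, n + 1)))
assert abs(variance_T0(n) - closed_form) < 1e-9
```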

Rohit Pandey
  • 6,803
  • You inverted the probability $\frac mn$ in $(1).$ To fix this you need to rework everything starting from that point forward. By the way the last equation isn't true; on the left you have an incorrect answer and on the right you have the answer we already know is correct. – David K Nov 28 '19 at 03:07
  • Patience, patience :) – Rohit Pandey Nov 28 '19 at 03:12
  • You didn't want to be alerted about errors in the answer? – David K Nov 28 '19 at 03:13
  • Ok, sure. But wait 15 minutes from when you first see the answer. I'm going to do re-reads. – Rohit Pandey Nov 28 '19 at 03:15
  • 1
    Or you could just check your work before you post it. :-) – David K Nov 28 '19 at 03:16
  • 1
    Note that a way to (partly) unify this method with the one you posted first is that in each case you are computing the expected number of coupons to get from $m$ unique coupons to $m+1.$ You've just notated that random value two different ways: here it's $T_{m+1}-T_m$, there it was $N_m.$ By relating the two you could get to do the series manipulation just once instead of twice. – David K Nov 28 '19 at 03:19
  • @DavidK - yeah, but then we just get $E(T_0)=\sum N_j$ which doesn't add too much to approach 1 at all. – Rohit Pandey Nov 28 '19 at 05:10
  • Yes, that's the point, a substantial part of this answer is just retreading the other answer with different notation. – David K Nov 28 '19 at 05:20
  • Sure, that's easy to see in hindsight and your comment highlighting this is valuable. But when I first saw this question, I used the approach from this answer and the sum of geometrics approach didn't occur to me. The algebra here is considerably more involved, especially for the variance (which I'll be filling out) and comparison of the result with the sum of geometrics approach leads to a nice identity. – Rohit Pandey Nov 28 '19 at 05:24
  • 2
    One kind of expects to use the benefit of hindsight when compiling a collection of proofs. There is new material here, up to equation $(2)$ and in the variance section, so having already done the $\sum N_j$ part doesn't detract much. – David K Nov 28 '19 at 05:33
  • Sure, let me add a note highlighting the connection of this approach with the sum of geometrics approach. In the variance section of this approach, there is also an important lesson regarding squaring equations with random variables which I would have not learnt if I knew the sum of geometrics approach. – Rohit Pandey Nov 28 '19 at 05:39
1

A1: Using a sum of geometric random variables


Consider the state where the collector has already collected $m$ coupons. How many more coupons does he need to collect to get to $m+1$? Let this be represented by the random variable $N_m$. Then, if the total number of coupons needed is $N$, we have:

$$N = \sum\limits_{m=0}^{n-1} N_m\tag{1}$$

Every coupon collected from here is like a coin toss where with probability $\frac m n$, the collector hits a coupon he already has and makes no progress. With probability $\frac{n-m}{n}$, he collects a new coupon. So, this becomes a geometric random variable with $p=\frac{n-m}{n}$. We know that a geometric random variable has a mean $\frac{1}{p}$ and variance $\frac{1-p}{p^2}$. Hence,

$$E(N_m)=\frac{n}{n-m}$$

Taking expectation of equation (1) and substituting we have:

$$E(N) = \sum\limits_{m=0}^{n-1} E(N_m) = \sum\limits_{m=0}^{n-1} \frac{n}{n-m}=n \sum\limits_{m=0}^{n-1} \frac{1}{n-m}$$

Substituting $k=n-m$ we get:

$$E(N) = n \sum\limits_{k=1}^n \frac{1}{k}$$

Similarly, since the $N_m$ are independent, the variance $V(N)=\sum_{m=0}^{n-1}V(N_m)$ can be calculated:

$$V(N) = n^2\sum\limits_{i=1}^n \frac{1}{i^2}-n\sum\limits_{k=1}^n \frac{1}{k}\tag{2}$$


Note: This solution only works when the coupons have equal probability.
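A Monte Carlo simulation (my own illustration, assuming equal probabilities) provides a final check of equations (1) and (2): simulate many collection runs and compare the sample mean and variance with the closed forms. Because the estimates are random, agreement is only approximate:

```python
import random

def collect_all(n, rng):
    """Simulate one run: draw uniform coupons until all n types are seen."""
    seen, draws = set(), 0
    while len(seen) < n:
        seen.add(rng.randrange(n))
        draws += 1
    return draws

rng = random.Random(0)  # fixed seed for reproducibility
n, trials = 10, 20_000
samples = [collect_all(n, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((x - mean) ** 2 for x in samples) / (trials - 1)

expected_mean = n * sum(1 / k for k in range(1, n + 1))
expected_var = n**2 * sum(1 / k**2 for k in range(1, n + 1)) - expected_mean
print(mean, expected_mean)  # the two should be close
print(var, expected_var)
```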

Rohit Pandey
  • 6,803