5

The Coupon Collector's Problem (CCP) is very useful in many applications. However, the "default" CCP is relatively simple: suppose you have an urn containing $n$ pairwise different balls, each drawn with equal probability $1/n$. You draw one ball at a time with replacement until you have seen each of the $n$ balls at least once. The expected number of draws needed overall is given by \begin{align} \mathbb{E}[X] = \sum_{i=1}^n \mathbb{E}[X_i] = nH_n \end{align} where $H_n$ is the $n$th harmonic number and $\mathbb{E}$ denotes the expected value. Here the random variable $X$ is the number of draws you have to make in order to see all $n$ balls at least once, and $X_i$ denotes the additional number of draws needed to get from $i-1$ distinct balls to $i$ distinct balls drawn.
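To make this concrete, here is a minimal Python sketch (my own, not from any of the cited sources; the function names `simulate_ccp` and `expected_draws` are made up) that checks the closed form $nH_n$ against a Monte Carlo simulation:

```python
import random

def simulate_ccp(n, trials=10_000):
    """Monte Carlo estimate of the expected number of draws in the default CCP."""
    total = 0
    for _ in range(trials):
        seen = set()
        draws = 0
        while len(seen) < n:
            seen.add(random.randrange(n))  # each of the n balls has probability 1/n
            draws += 1
        total += draws
    return total / trials

def expected_draws(n):
    """Closed form n * H_n, with H_n the n-th harmonic number."""
    return n * sum(1 / k for k in range(1, n + 1))

n = 10
print(simulate_ccp(n))    # Monte Carlo estimate
print(expected_draws(n))  # 10 * H_10 ≈ 29.29
```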

Now consider an advanced CCP question: how does the formula change if, per draw, you take a packet of $p\geq 1$ pairwise different balls instead of only one ball as in the default CCP? In other words: given an urn containing $n$ balls, how many balls do I need to draw in order to see all $n$ balls, when each draw takes $p\geq 1$ pairwise different balls out of the urn? The packets are drawn with replacement, so all balls within one packet are different, but different packets can contain the same balls.
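For reference, here is a minimal Monte Carlo sketch of this packet variant (again my own code with a made-up name, `simulate_packet_ccp`; it assumes each packet is a uniformly random $p$-subset of the $n$ balls):

```python
import random

def simulate_packet_ccp(n, p, trials=10_000):
    """Monte Carlo estimate of the expected number of packets needed to see all
    n balls, where each packet is a uniformly random p-subset of the n balls
    (no repeats inside a packet, packets drawn independently of each other)."""
    total = 0
    balls = range(n)
    for _ in range(trials):
        seen = set()
        packets = 0
        while len(seen) < n:
            seen.update(random.sample(balls, p))  # one packet of p distinct balls
            packets += 1
        total += packets
    return total / trials

n, p = 10, 3
packets = simulate_packet_ccp(n, p)
print(packets)      # expected number of packets
print(p * packets)  # expected number of individual balls drawn
```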

This paper gives an answer at the top of page 20, and this German lecture also gives an answer on slide 229, 14.7b). A third answer, which is at the same time very intuitive, is given on the German Wikipedia in the subsection "Päckchen".

Now two questions arise.

  1. Why do the answers in the paper and the lecture differ? If you plug in some numbers, you get different results for numbers above 1000.
  2. How do I get from these solutions to the one given on Wikipedia? To me it seems like an approximation of the real value, since it is very fast to compute compared to the "scientific" answers, and the result is always close to the results of the other computations.

Since I am interested in understanding the formula on Wikipedia, can anyone explain how it is derived or give some insight?

CAR
  • 63
  • In what part of the German Wikipedia article is the case $m<n$ treated? – Justpassingby Feb 11 '16 at 13:39
  • @Justpassingby Thanks for pointing this out. It is a second extension to the CCP. I clarified this in the question to be more precise. Mainly I am eager to learn the theory behind the formula in the article on Wikipedia, i.e. the first question. The answer to the second question is just nice to have. – CAR Feb 11 '16 at 14:56
  • The German Wikipedia seems to be wrong. As you say, it may be an approximation, but it doesn't say so. – joriki Feb 12 '16 at 11:55
  • @joriki Thank you for your comment. Running simulations yields that the approximation on Wikipedia is quite ok. So apart from the explanation given on Wikipedia, there should be some detailed proof somewhere. Do you have any evidence for the formula "being wrong"? – CAR Feb 12 '16 at 12:57

1 Answer

3

The German Wikipedia formula is indeed wrong.

It's hard to figure out why someone comes up with a wrong solution. However, one can think of an experiment (different from the CCP) for which the formula would give the right answer.

Say we have an urn with $n$ balls numbered from 1 to $n$. Now we draw one ball at a time with replacement, until we have seen every number at least once. This is the CCP with $p = 1$. If we have already seen $k$ distinct numbers, the expected number of further draws needed to get the $(k+1)$st distinct number is $\frac{n}{n-k}$. Therefore, the expectation of the total number of necessary draws is $$ \sum_{k=0}^{n-1} \frac{n}{n-k}. $$

Now let's change the setting a little bit. We start again from scratch and draw one ball at a time, basically with replacement. But any time we get a previously unseen number, we do not replace it. As soon as we have collected $p$ distinct new numbers this way, we put all of them back. Then we keep drawing balls with replacement (from all $n$ balls again), again holding back any previously unseen number, until we have seen another $p$ distinct numbers; then we again return all $p$ balls to the urn, and so on. This is a mixture of sampling with and without replacement. You can view it as a series of "with replacement" episodes, where episode $k$ lasts until you get the $k$th distinct ball, and where the number of balls in the urn during episode $k$ is $n-((k-1)\mod p)$, including $n-(k-1)$ previously unseen balls. Therefore, the expected duration of the $k$th episode is $$\frac{n-((k-1)\mod p)}{n-(k-1)},$$ and thus the expected total number of necessary draws is

$$\sum_{k=1}^n \frac{n-((k-1)\mod p)}{n-(k-1)} = \sum_{k=0}^{n-1} \frac{n-(k\mod p)}{n-k}.$$

This is the Wikipedia formula you mentioned.
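As a sanity check on this derivation, here is a small simulation sketch (my own, with made-up function names) of the mixed experiment described above, compared against the sum $\sum_{k=0}^{n-1}\frac{n-(k\mod p)}{n-k}$:

```python
import random

def episode_experiment(n, p, trials=10_000):
    """Simulate the mixed sampling experiment described above: draw one ball at a
    time; a previously unseen ball is kept out of the urn until p new balls have
    accumulated, at which point all p of them are put back."""
    total = 0
    for _ in range(trials):
        distinct = 0  # distinct numbers seen so far
        held_out = 0  # new balls currently kept out of the urn (0 <= held_out < p)
        draws = 0
        while distinct < n:
            draws += 1
            # the urn holds n - held_out balls, n - distinct of which are unseen
            if random.random() < (n - distinct) / (n - held_out):
                distinct += 1
                held_out += 1
                if held_out == p:  # a full packet of new balls: put them back
                    held_out = 0
        total += draws
    return total / trials

def wikipedia_formula(n, p):
    """The sum from above: sum_{k=0}^{n-1} (n - (k mod p)) / (n - k)."""
    return sum((n - (k % p)) / (n - k) for k in range(n))

n, p = 10, 3
print(episode_experiment(n, p))  # Monte Carlo estimate of the experiment
print(wikipedia_formula(n, p))   # closed-form sum, ≈ 27.03 for n=10, p=3
```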

Note that in the CCP with $p$ coupons in each package, we draw $p$ coupons without replacement, then replace them, draw another $p$ coupons without replacement, and so on. So in some sense this is also a series of draws in which the number of balls in the urn varies between $n$ and $n-(p-1)$. This similarity seems to have fooled the Wikipedia author.

If $p$ is small and $n$ is large, the CCP with $p > 1$ coupons may be approximated by the CCP with $p=1$, and in this case the experiment described above is equal to the CCP. Therefore the (wrong) Wikipedia formula cannot be way off in this case. But I suspect the discrepancies are larger if $p$ is large (or $n$ is small).
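To probe this claim numerically (this check is mine, not part of the original answer), one could compare the formula with a direct Monte Carlo simulation of the packet CCP, reusing the hypothetical `simulate_packet_ccp` and `wikipedia_formula` helpers sketched earlier and assuming the formula is meant to count individual coupons rather than packets:

```python
# Reuses the hypothetical helpers sketched earlier in this thread:
# simulate_packet_ccp(n, p) and wikipedia_formula(n, p).
for n, p in [(10, 2), (10, 5), (20, 10), (100, 10)]:
    mc_coupons = p * simulate_packet_ccp(n, p)  # coupons bought in whole packets
    wiki = wikipedia_formula(n, p)
    print(f"n={n:>3}, p={p:>2}: simulation ≈ {mc_coupons:.2f}, formula = {wiki:.2f}")
```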

H.K.
  • 141
  • Just to clarify it for me, using the example $n=10$, $p=3$. On the first draw I get a new ball, also on the second and third (since I don't replace), meaning $X_i^j$ ($i$: overall (equals your $k$), $j$: packet) is then $X_1^1=1,X_1^2=1,X_1^3=1$. After the first package the balls are returned and we have the chance of $7/10$ that we draw a new ball, then $6/9,5/8$. Then we would get $X_2^1=10/7,X_2^2=9/6,X_2^3=8/5$. And so on ($X_3^1=10/4,X_3^2=9/3,X_3^3=8/2$, $X_4^1=10/1$, rest does not matter). Since this seems to be the strategy you explained, can you please help me see the mistake made here? – CAR Feb 13 '16 at 14:07
  • Could you clarify what you mean by $X_i^j$? In my setting, there are no packets; we just draw the balls with replacement (one at a time) until we see a new number; that's what I called an episode. The length of an episode is therefore geometrically distributed (and the expected length is just the inverse of the probability). In the CCP setting, when you think of packets, there is no geometric distribution anymore. (continued below) – H.K. Feb 14 '16 at 13:09
  • With the first coupon in the second packet, the probability is 7/10 to get a new number, but with the second coupon the chance is not simply 6/9: it is 6/9 if the first coupon was a new number, and 7/9 if the first coupon was a previously seen number. In any case, the inverse of this probability has no meaning as an expectation, since we do not keep drawing with replacement until we get a new number. – H.K. Feb 14 '16 at 13:09
  • First, thank you very much for your comments. With $X_i^j$ I denoted the additional number of draws one has to make in order to get from $j-1$ different balls to $j$ different balls drawn for a packet $i$. So in my case I am choosing just random subsets (packets) from a large amount of possible numbers (balls in the urn), i.e. $P \subseteq_R N, |P|=p, |N|=\{1,\ldots,n\}$. The question is, how often do I have to select a random subset $P$ until I have seen each number of the large set $N$ at least once. Or should I use another approach instead of the CCP one? – CAR Feb 14 '16 at 14:07
  • For the previous comment (an edit was not possible anymore): $N = \{1,\ldots,n\}$, $|N|=n$. This is the original problem and why I came to the Wikipedia page (which seems to be wrong): how many subsets $P$ do I have to draw randomly in order to get each number of $N$? Maybe you have a good hint for me, or a link? – CAR Feb 14 '16 at 14:15
  • As a further note: I think I got your point now. So the probability to get a new ball for the first three ($p=3$) would be $10/10, 9/9, 8/8$. The fourth ball then has a probability of $7/10$. But the fifth then depends on whether the fourth one was new or had already been drawn, i.e., $6/9$ (if the fourth was new) or $7/9$ (if the fourth had already been drawn earlier). It seems Wikipedia only takes the "best option" into account there. – CAR Feb 14 '16 at 14:56