Coupon collector problem with $k$ distinct coupon sets to complete

Question

In the standard coupon collector problem we have an urn with $n$ different coupons, from which coupons are being collected, equally likely, with replacement. Simple analysis shows that the expected number of draws needed to collect all coupons is asymptotically $\Theta(n\log n)$.

Consider the variant of the problem in which there are $k$ sets of $n$ distinct coupons each. What is the expected number of draws needed to complete at least one series of $n$ coupons? More precisely, let $\mu$ be the uniform distribution over the set $\{x_{i,j}\}_{i \in [k], j\in[n]}$, where $[n] = \{1, \ldots, n\}$. What is the expected (asymptotic) number of draws from $\mu$ needed to obtain all members of the set $\{x_{i,j}\}_{j\in[n]}$ for at least one $i \in [k]$?
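To make the setup concrete, here is a minimal Monte Carlo sketch of the process described above. The function names, the trial count and the seed are illustrative assumptions, not part of the question; drawing the set index and the coupon index independently is one way to sample uniformly from the $kn$ coupons.

```python
import random


def draws_until_one_set_complete(n, k, rng):
    """Simulate one run: draw uniformly from the k*n coupons x_{i,j}
    until all n coupons of some set i have been seen."""
    seen = [set() for _ in range(k)]  # seen[i] = coupon indices collected in set i
    draws = 0
    while True:
        draws += 1
        i = rng.randrange(k)  # which set the coupon belongs to
        j = rng.randrange(n)  # which coupon within that set
        seen[i].add(j)
        if len(seen[i]) == n:
            return draws


def estimate_expected_draws(n, k, trials=10_000, seed=0):
    """Average the stopping time over independent simulated runs."""
    rng = random.Random(seed)
    total = sum(draws_until_one_set_complete(n, k, rng) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    print(estimate_expected_draws(10, 2))
```

Such an estimate is only a sanity check for small $n$ and $k$; it says nothing about the asymptotics asked for.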

score 2 · Answer 1 · edited Apr 13 '17 at 12:20

This can be solved using inclusion-exclusion. There are $\binom kj$ ways to choose $j$ particular sets to finish, and the probability to have completed all $j$ of them is the probability to have completed a standard coupon collection with $jn$ coupons while drawing from $kn$ coupon types. Since the expected number of draws is the sum of the non-completion probabilities over all times, it satisfies the same inclusion-exclusion relation as the probabilities. Drawing from $kn$ coupon types while collecting $jn$ increases the expected number of draws by a factor $\frac kj$. Thus the desired expectation is

\begin{align} \sum_{j=1}^k(-1)^{j-1}\binom kj\frac kjjnH_{jn} &=kn\sum_{j=1}^k(-1)^{j-1}\binom kjH_{jn} \\ &=kn\sum_{j=1}^k(-1)^{j-1}\binom kj\left(\log j+\log n+\gamma+\frac1{2jn}\right)+O\left(\frac kn\right)\\ &=kn\left(\log n+\gamma\right)+\frac12kH_k+kn\sum_{j=1}^k(-1)^{j-1}\binom kj\log j+O\left(\frac kn\right)\\ &=knH_n+\frac12kH_k-\frac k2+kn\sum_{j=1}^k(-1)^{j-1}\binom kj\log j+O\left(\frac kn\right)\;, \end{align}

where the last step uses $\log n+\gamma=H_n-\frac1{2n}+O\left(\frac1{n^2}\right)$. For the example $n=10$, $k=2$ used in Tad's answer, this yields the approximation

$$ 20\left(H_{10}-\log2\right)+H_2-1\approx45.2164\;, $$

close to the exact value $45.2039$ given there.

The remaining sum is treated in the question "Proof $\sum\limits_{k=1}^n \binom{n}{k}(-1)^k \log k = \log \log n + \gamma +\frac{\gamma}{\log n} +O\left(\frac1{\log^2 n}\right)$"; substituting that expansion leads to

$$ kn\left(H_n-\log\log k-\gamma-\frac\gamma{\log k}+\frac{\pi^2+6\gamma^2}{12\log^2k}\right)+\frac12kH_k-\frac k2+O\left(\frac{kn}{\log^3k}\right)\;. $$
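The exact inclusion-exclusion expression $kn\sum_{j=1}^k(-1)^{j-1}\binom kjH_{jn}$ from the first line of the derivation can be evaluated with rational arithmetic. The following is a small sketch; the function names are hypothetical and it is only practical for modest $n$ and $k$.

```python
from fractions import Fraction
from math import comb


def harmonic(m):
    """Exact m-th harmonic number H_m as a Fraction."""
    return sum(Fraction(1, i) for i in range(1, m + 1))


def expected_draws_exact(n, k):
    """Evaluate k*n * sum_{j=1..k} (-1)^(j-1) * C(k, j) * H_{j*n} exactly."""
    total = sum(
        (-1) ** (j - 1) * comb(k, j) * harmonic(j * n)
        for j in range(1, k + 1)
    )
    return k * n * total


if __name__ == "__main__":
    value = expected_draws_exact(10, 2)
    print(value, float(value))
```

For $n=10$, $k=2$ this evaluates to $526157011/11639628$, the exact value quoted in Tad's answer, and for $k=1$ it reduces to the standard $nH_n$.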

@joriki: Do you have a reference for this result? In particular, I would be interested in the variance of this quantity also. — Sandeep Silwal, Oct 21 '19 at 01:53

score 1 · Answer 2 · answered Aug 31 '15 at 04:33

(Partial answer but too long for a comment.) For $k=2$, the expected number of draws required is $2n \ln 2n - 2cn + o(n)$, where $c=\ln4-\gamma\approx 0.809$ (i.e. you have to collect almost all the coupons). What's happening here is that, as you get close to completing one set of coupons, you tend very strongly to make more progress on the other set. This effect is strongest for $k=2$ and becomes weaker as $k$ increases. As we'll see, the analysis for $k=2$ degenerates in a way which makes it much easier than the general case, but the method should still apply in principle. The following is a sketch:

We have a random walk in an acyclic ranked directed graph. A node is a pair $(a,b)$ indicating we have $a$ coupons remaining to collect in the first set, and $b$ coupons in the second set. We start at $(n,n)$, and move from $(a,b)$ to either $(a-1,b)$ or $(a,b-1)$ with probability $a/(a+b)$ or $b/(a+b)$ respectively. The rank of a node is $a+b$; each move decreases the rank by one. The expected time (number of coupons required) to take one step from node $(a,b)$ is $2n/(a+b)$ (note it depends only on the rank, not the direction). In particular, all paths from the start down to a given rank $a+b$ have the same expected delay, namely $$2n\left(\frac1{2n}+\frac1{2n-1}+\cdots+\frac1{a+b+1}\right) =2n(H(2n)-H(a+b)),$$ where $H(k)$ is the $k$-th harmonic number.
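The random walk just described can also be evaluated directly by a recursion over the nodes $(a,b)$: the expected remaining time is the expected time for one step, $2n/(a+b)$, plus the weighted expected remaining times of the two successors. This is a brute-force check rather than the path-counting argument of this answer, and the function names are made up for illustration.

```python
from fractions import Fraction
from functools import lru_cache


def expected_draws_two_sets(n):
    """Expected number of draws until one of two n-coupon sets is complete,
    via the walk on nodes (a, b) = coupons still missing in each set."""
    total = 2 * n  # number of distinct coupons overall

    @lru_cache(maxsize=None)
    def remaining(a, b):
        if a == 0 or b == 0:
            return Fraction(0)  # one set is already complete
        rank = a + b
        step = Fraction(total, rank)  # expected draws until a new coupon appears
        return (
            step
            + Fraction(a, rank) * remaining(a - 1, b)
            + Fraction(b, rank) * remaining(a, b - 1)
        )

    return remaining(n, n)


if __name__ == "__main__":
    print(float(expected_draws_two_sets(10)))
```

The recursion visits $O(n^2)$ nodes, so it is fine for small $n$ but gives no asymptotic information by itself.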

What we're interested in is the probability of seeing a transition $(a,1)\to(a,0)$ for various $a>0$; that requires knowing the number of paths from $(n,n)$ to $(a,b)$ (which is $\binom{2n-a-b}{n-a}$) and their probabilities (the probability of a path is the product of the edge probabilities.) It's easy to see that all paths from $(n,n)$ to $(a,b)$ have the same probability $Pr((n,n)\leadsto(a,b))$ (exercise: compute it!). What degenerates in the $k=2$ case is that $Pr((n,n)\leadsto(a,0))$ is independent of $a$, so we can effectively ignore all the path probabilities, and just count paths.
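For readers who want to check the exercise, one way to work it out from the edge probabilities above: along any path from $(n,n)$ to $(a,b)$ the denominators are the ranks $2n, 2n-1, \ldots, a+b+1$, while the numerators are $n, n-1, \ldots, a+1$ for the steps in the first coordinate and $n, n-1, \ldots, b+1$ for the steps in the second, in some order. Hence
$$Pr((n,n)\leadsto(a,b))=\frac{n!}{a!}\cdot\frac{n!}{b!}\cdot\frac{(a+b)!}{(2n)!},$$
and setting $b=0$ gives $(n!)^2/(2n)!=1/\binom{2n}{n}$, which indeed does not depend on $a$.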

We find that the expected number of coupons drawn before the first completed set is $$\frac{2n}{\binom{2n-1}n}\sum_{i=0}^{n-1}\binom{n-1+i}{i}(H(2n)-H(n-i)).$$ After much cajoling, Mathematica tells us this is asymptotically $$2n\left(H(2n)-\ln4\right)+O(1)=2n(\ln 2n -(\ln4-\gamma))+O(1),$$ where the final term (haven't worked it out) appears to be less than $1$.

For example, with $n=10$ the exact expected value is $526157011/11639628\approx45.2039$ and $20(H(20)-\ln4)\approx44.2289$; a Monte Carlo run of $1000$ trials gave $45.09$.
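The numbers in this example can be reproduced with a short script. This is a sketch with hypothetical function names: it evaluates the binomial sum above exactly and computes the leading term $2n(H(2n)-\ln4)$ in floating point.

```python
from fractions import Fraction
from math import comb, log


def harmonic(m):
    """Exact m-th harmonic number H(m) as a Fraction."""
    return sum(Fraction(1, i) for i in range(1, m + 1))


def two_set_sum_exact(n):
    """2n / C(2n-1, n) * sum_{i=0..n-1} C(n-1+i, i) * (H(2n) - H(n-i))."""
    h2n = harmonic(2 * n)
    s = sum(comb(n - 1 + i, i) * (h2n - harmonic(n - i)) for i in range(n))
    return Fraction(2 * n, comb(2 * n - 1, n)) * s


def two_set_leading_term(n):
    """Leading asymptotic term 2n * (H(2n) - ln 4)."""
    return 2 * n * (float(harmonic(2 * n)) - log(4))


if __name__ == "__main__":
    exact = two_set_sum_exact(10)
    print(exact, float(exact), two_set_leading_term(10))
```

The gap between the two values for $n=10$ is about $0.975$, in line with the remark that the remaining term appears to be less than $1$.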

Thanks @Tad! What would be the most interesting to me is the asymptotical behavior of the problem for large values of $k$. I conjecture that for every constant $1/10 <\alpha < 10$, if $k = n^\alpha$, then the expected number of draws needed should still be $\Theta(n\log n)$. — Anonymous, Sep 01 '15 at 15:39
Really? You're not going to get through one set of coupons without looking through at least some of most of the other sets; the asymptotic behavior really must depend on $k$. Do you mean $\Theta(nk\log nk)$? — Tad, Sep 02 '15 at 02:14
yes, I meant $\Theta(nk \log nk)$, thanks. (I had in mind $\sqrt n$ sets of $\sqrt n$ elements.) — Anonymous, Sep 02 '15 at 14:06
I dug up this question and found an expression for arbitrary $k$. — joriki, Jun 09 '16 at 03:11
