Consider the well-known coupon collector problem with $n$ distinct coupons. If I collect one coupon per day, drawn uniformly at random, the expected time to collect all of them is $n \cdot H_n$ days, where $H_n$ is the $n$-th harmonic number, i.e. $H_n$ days per coupon on average. For example, if the coupons are the positive integers up to 12, one possible sequence of coupons could look like
r1=[5 2 12 9 3 2 1 8 5 6 7 11 11 11 3 4 2 9 3 5 8 12 8 3 10]
Note that in this example it took me 25 days to collect them all, which is “kind of” close to $12 \cdot H_{12}\approx 37.24$, with an average per coupon of $t_1=|r_1|/n=25/12\approx 2.08$.
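To make the setup concrete, here is a minimal Python sketch that computes $n \cdot H_n$ exactly and draws one random collection. The helper names `harmonic` and `collect_all` and the seed are my own choices for illustration, and I am assuming coupons are drawn uniformly and independently.

```python
import random
from fractions import Fraction

def harmonic(n):
    """Return H_n = 1 + 1/2 + ... + 1/n as an exact fraction."""
    return sum(Fraction(1, k) for k in range(1, n + 1))

def collect_all(n, rng):
    """Draw coupons uniformly from 1..n until every value has appeared."""
    seen, draws = set(), []
    while len(seen) < n:
        c = rng.randint(1, n)   # randint includes both endpoints
        draws.append(c)
        seen.add(c)
    return draws

n = 12
print(float(n * harmonic(n)))   # expected number of days, about 37.24

rng = random.Random(0)          # arbitrary seed, only for reproducibility
r1 = collect_all(n, rng)
print(len(r1), len(r1) / n)     # one realisation of |r_1| and t_1
```

A single run can land far from the expectation, as the 25-day example shows.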
Once I have obtained all of the coupons, I look at which ones I have obtained exactly once. I eliminate this set $k_1$ of coupons and look at how long it took me to collect the remaining $n-|k_1|$ coupons. Note that in this process I always eliminate the last coupon to appear (it necessarily appears exactly once), and possibly more.
In the previous example, the coupons that appear once are
k1=[1 4 6 7 10]
which are the ones I will remove. The ones that stay are
b1=[2 3 5 8 9 11 12]
Deleting the removed coupons from the initial sequence and keeping the shortest prefix that contains all the remaining values, I obtain:
r2=[5 2 12 9 3 2 8 5 11]
I would expect this to have taken me $H_{n-|k_1|}$ days per coupon on average. Here $|k_1|=5$, so this would be $H_7\approx 2.59$. This is not that close to the actual time it took me, which is $t_2=|r_2|/(n-|k_1|)=9/7\approx 1.29$ in our example.
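The elimination step above can be sketched in Python as follows. The function name `eliminate_once` is hypothetical, and the code follows my reading of the procedure: drop the coupons seen exactly once, then cut the sequence as soon as every remaining coupon has appeared.

```python
from collections import Counter

def eliminate_once(seq):
    """Return (removed coupons, kept coupons, shortened sequence) for one round."""
    counts = Counter(seq)
    removed = sorted(c for c, m in counts.items() if m == 1)
    kept = sorted(c for c, m in counts.items() if m > 1)
    # Walk through seq, skipping removed coupons, until all kept coupons have appeared.
    seen, shortened = set(), []
    for c in seq:
        if len(seen) == len(kept):
            break
        if counts[c] > 1:
            shortened.append(c)
            seen.add(c)
    return removed, kept, shortened

r1 = [5, 2, 12, 9, 3, 2, 1, 8, 5, 6, 7, 11, 11, 11, 3, 4, 2, 9, 3, 5, 8, 12, 8, 3, 10]
k1, b1, r2 = eliminate_once(r1)
print(k1)                            # [1, 4, 6, 7, 10]
print(b1)                            # [2, 3, 5, 8, 9, 11, 12]
print(r2)                            # [5, 2, 12, 9, 3, 2, 8, 5, 11]
print(round(len(r2) / len(b1), 2))   # 1.29
```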
I will now check, in the second iteration, which are the coupons that I have collected exactly once, and remove them again from the sample. The coupons that appear only once are
k2=[3 8 9 11 12]
I remove them and obtain the shortest list that contains all elements in $[1:12]\setminus \{k_1 \cup k_2\}$. This gives me
r3=[5 2]
Again, I would expect this to have taken me $H_{n-|k_1|-|k_2|}$ days per coupon on average. Here $|k_1|=5$ and $|k_2|=5$, which gives $H_2=1.5$. This is also more than what it actually took me, which was only $t_3=|r_3|/(n-|k_1|-|k_2|)=2/2=1$ day on average.
In the end, I am interested in the average time it took me to collect each coupon, weighted over the $r$ sequences; call this $T_n$. In our example, it took me 2.08 days on average per coupon in $r_1$, and this is multiplied by the number of elements that appeared only once in $r_1$, i.e. $|k_1|=5$. Then it took me 1.29 days on average per coupon in $r_2$, and this is multiplied by $|k_2|=5$. Finally, it took me 1 day on average per coupon in $r_3$, and this is multiplied by $|k_3|=2$. We then divide the sum by $n$. In our example,
$$T_{12}=(5\cdot 2.08+5 \cdot 1.29+2\cdot 1)/12\approx 1.57$$
Thus, $T_n$ can be defined more generally as:
$$T_n=\left[\frac{1}{n}\right]\left[\frac{|k_1||r_1|}{n}+\frac{|k_2||r_2|}{n-|k_1|}+\frac{|k_3||r_3|}{n-|k_1|-|k_2|}+\ldots\right]$$
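As a check on this definition, here is a small Python sketch that evaluates $T_n$ for a given completed sequence. The function name `t_statistic` is hypothetical, and the loop encodes my understanding of the rounds: in each round, weight $|r_i|$ divided by the number of coupons still in play by $|k_i|$, then shorten the sequence.

```python
from collections import Counter

def t_statistic(seq):
    """Compute T_n for a completed collection seq (a list containing every coupon)."""
    n = len(set(seq))
    total = 0.0
    while seq:
        counts = Counter(seq)
        m = len(counts)                                          # coupons still in play
        singles = {c for c, cnt in counts.items() if cnt == 1}   # the set k_i
        total += len(singles) * len(seq) / m                     # |k_i| * |r_i| / m
        # Build r_{i+1}: drop k_i, then stop once all survivors have appeared.
        survivors = m - len(singles)
        seen, nxt = set(), []
        for c in seq:
            if len(seen) == survivors:
                break
            if c not in singles:
                nxt.append(c)
                seen.add(c)
        seq = nxt
    return total / n

r1 = [5, 2, 12, 9, 3, 2, 1, 8, 5, 6, 7, 11, 11, 11, 3, 4, 2, 9, 3, 5, 8, 12, 8, 3, 10]
print(round(t_statistic(r1), 2))   # 1.57
```

Since the weights $|k_i|/n$ sum to 1 and every per-coupon time is at least 1, this shows that $T_n \ge 1$ always.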
I am interested in the approximate expected value of $T_n$, particularly in its asymptotic behavior and upper/lower bounds.
My guess was that one could approximate $T_n$ by applying the initial approximation $H_n$ recursively, assuming that exactly one coupon is removed in each round. This gives $\frac{H_n + H_{n-1}+ \ldots + H_1}{n}$, which grows like $H_n$ asymptotically. But, as the example above shows, this approximation is not that good, because the harmonic-number estimate gets worse and worse with each round.
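For reference, the heuristic average above has a closed form. Swapping the order of summation in $\sum_{k=1}^{n} H_k=\sum_{k=1}^{n}\sum_{j=1}^{k}\frac{1}{j}$ gives

$$\sum_{k=1}^{n} H_k=\sum_{j=1}^{n}\frac{n-j+1}{j}=(n+1)H_n-n, \qquad\text{so}\qquad \frac{1}{n}\sum_{k=1}^{n} H_k=\left(1+\frac{1}{n}\right)H_n-1 .$$

For $n=500$, where $H_{500}\approx 6.79$, this heuristic evaluates to about $5.8$.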
In simulations, I have found the following values for $T_n$:
Value of $n$ | Simulated $T_n$ |
---|---|
25 | 2.1325 |
50 | 2.3378 |
100 | 2.4772 |
200 | 2.6004 |
500 | 2.7388 |
If it helps, note that the expected size of $k_1$ is $H_n$; see Section 4 in Myers, Amy N., and Herbert S. Wilf. "Some new aspects of the coupon collector's problem." SIAM Review 48.3 (2006): 549-565.
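A Monte Carlo sketch along the following lines can be used to estimate both $E[T_n]$ and $E|k_1|$. This is not necessarily the code behind the table above: the function names, the seed, and the number of trials are my own assumptions, and the estimates will fluctuate from run to run.

```python
import random
from collections import Counter

def collect_all(n, rng):
    """Draw coupons uniformly from 0..n-1 until every value has appeared."""
    seen, draws = set(), []
    while len(seen) < n:
        c = rng.randrange(n)
        draws.append(c)
        seen.add(c)
    return draws

def singles_and_t(seq, n):
    """Return (|k_1|, T_n) for one completed collection."""
    total, first_k = 0.0, None
    while seq:
        counts = Counter(seq)
        m = len(counts)                                          # coupons still in play
        singles = {c for c, cnt in counts.items() if cnt == 1}   # the set k_i
        if first_k is None:
            first_k = len(singles)
        total += len(singles) * len(seq) / m
        # Next round: drop k_i and stop once all survivors have appeared.
        survivors = m - len(singles)
        seen, nxt = set(), []
        for c in seq:
            if len(seen) == survivors:
                break
            if c not in singles:
                nxt.append(c)
                seen.add(c)
        seq = nxt
    return first_k, total / n

def estimate(n, trials, seed=0):
    """Average |k_1| and T_n over independent collections."""
    rng = random.Random(seed)
    k_sum = t_sum = 0.0
    for _ in range(trials):
        k, t = singles_and_t(collect_all(n, rng), n)
        k_sum += k
        t_sum += t
    return k_sum / trials, t_sum / trials

n = 25
mean_k1, mean_T = estimate(n, trials=2000)
h_n = sum(1 / j for j in range(1, n + 1))
print(mean_k1, h_n)   # sample mean of |k_1| next to H_n
print(mean_T)         # sample mean of T_n
```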