
Consider the well-known coupon collector problem with $n$ distinct coupons. If I collect one coupon per day, drawn uniformly at random, it takes $n \cdot H_n$ days in expectation to collect all of them, where $H_n$ is the $n$-th harmonic number. That is $H_n$ days per coupon on average. For example, if the coupons are the positive integers up to 12, one possible sequence of coupons is

r1=[5     2    12     9     3     2     1     8     5     6     7    11    11    11     3     4     2     9     3     5     8    12     8     3    10]

Note that in this example it took me 25 days to collect them all, which is “kind of” close to $12 \cdot H_{12}\approx 37.2$, with an average per coupon of $t_1=|r_1|/n=25/12\approx 2.08$.

Once I have obtained all of the coupons, I look at which ones I obtained exactly once. I eliminate this set $k_1$ of coupons and look at how long it took me to collect the remaining $n-|k_1|$ coupons. Note that this always eliminates the last coupon to appear, and possibly more.

In the previous example, the coupons that appear once are

k1=[1     4     6     7    10]

These are the ones I remove. The ones that stay are

b1=[2     3     5     8     9    11    12]

Looking at how long it took in the initial sequence to collect all the coupons I have not removed (with the removed coupons deleted from the sequence), I obtain:

r2=[5     2    12     9     3     2     8     5    11]

Should I expect this to have taken $H_{n-|k_1|}$ days per coupon on average? Here $|k_1|=5$, so this would be $H_7\approx 2.59$. This is not that close to the actual average time per coupon, which is $t_2=|r_2|/(n-|k_1|)=9/7\approx 1.29$ in our example.

In the second iteration, I again check which coupons I have collected exactly once, this time within $r_2$, and remove them from the sample. The coupons that appear only once are

k2=[3     8     9    11    12]

I remove them and obtain the shortest list that contains all elements of $[1:12]\setminus (k_1 \cup k_2)$. This gives me

r3=[5     2]

Again, should I expect this to have taken $H_{n-|k_1|-|k_2|}$ days per coupon on average? Here $|k_1|=5$ and $|k_2|=5$, which gives $H_2=1.5$. This is also more than what it actually took me, which was only $t_3=|r_3|/(n-|k_1|-|k_2|)=2/2=1$ day on average.

In the end, I am interested in the average time it took me to collect each coupon, weighted over the sequences $r_1, r_2, \ldots$; call this $T_n$. In our example, it took me 2.08 days on average to get each coupon in $r_1$, and this is multiplied by the number of coupons that appeared only once in $r_1$, i.e. $|k_1|=5$. Then it took me 1.29 days on average to get each of the coupons in $r_2$, and this is multiplied by $|k_2|=5$. Finally, it took me 1 day on average to collect each of the coupons in $r_3$, and this is multiplied by $|k_3|=2$. We then divide the total by $n$. In our example,

$$T_{12}=(5\cdot 2.08+5 \cdot 1.29+2\cdot 1)/12\approx 1.57$$

Thus, $T_n$ can be defined more generally as:

$$T_n=\left[\frac{1}{n}\right]\left[\frac{|k_1||r_1|}{n}+\frac{|k_2||r_2|}{n-|k_1|}+\frac{|k_3||r_3|}{n-|k_1|-|k_2|}+\ldots\right]$$
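To make the definition concrete, here is a minimal Python sketch that computes $T_n$ for a given draw sequence. It assumes, as in the example, that each $r_i$ is the shortest prefix of the previous sequence that contains every surviving coupon, with eliminated coupons deleted. The names `shortest_cover` and `t_value` are hypothetical helpers introduced only for illustration.

```python
from collections import Counter
from fractions import Fraction


def shortest_cover(seq, wanted):
    """Drop coupons not in `wanted`, then cut at the first point where all of `wanted` has appeared."""
    out, seen = [], set()
    for c in seq:
        if c in wanted:
            out.append(c)
            seen.add(c)
            if len(seen) == len(wanted):
                break
    return out


def t_value(seq, n):
    """Return T_n (as an exact fraction) for a sequence over the coupons 1..n."""
    remaining = set(range(1, n + 1))
    current = shortest_cover(seq, remaining)  # this is r_1
    if set(current) != remaining:
        raise ValueError("sequence does not contain every coupon")
    total = Fraction(0)
    while remaining:
        counts = Counter(current)
        # k_i: surviving coupons seen exactly once in r_i
        singles = {c for c in remaining if counts[c] == 1}
        # term |k_i| * |r_i| / (number of coupons still in play)
        total += Fraction(len(singles) * len(current), len(remaining))
        remaining -= singles
        current = shortest_cover(current, remaining)  # this is r_{i+1}
    return total / n


r1 = [5, 2, 12, 9, 3, 2, 1, 8, 5, 6, 7, 11, 11, 11, 3, 4, 2, 9, 3, 5, 8, 12, 8, 3, 10]
print(t_value(r1, 12), float(t_value(r1, 12)))
```

For the example sequence this returns $1583/1008\approx 1.5704$, in agreement with the value $1.57$ computed by hand. Each round removes at least the last coupon to appear, so the loop terminates.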

I am interested in the approximate expected value of $T_n$, particularly in its asymptotic behavior and upper/lower bounds.

My guess was that one could approximate $T_n$ by applying the initial approximation $H_n$ recursively, assuming that in each round we remove exactly one coupon. This gives $\frac{H_n + H_{n-1}+ \ldots + H_1}{n}$, which is asymptotically equivalent to $H_n$. But, as the example shows, this approximation is not that good: the estimate $H_m$ for a round with $m$ remaining coupons overestimates the observed per-coupon time, and the mismatch persists from round to round.
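For reference, this guess can be written in closed form using the standard identity $\sum_{k=1}^{n} H_k=(n+1)H_n-n$:

$$\frac{1}{n}\sum_{k=1}^{n} H_k=\frac{(n+1)H_n-n}{n}=H_n-1+\frac{H_n}{n}\sim \ln n .$$

For $n=12$ this evaluates to about $2.36$, noticeably larger than the $1.57$ observed in the example.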

In simulations, I have found the following values for $T_n$:

| Value of $n$ | Simulated $T_n$ |
|---|---|
| 25 | 2.1325 |
| 50 | 2.3378 |
| 100 | 2.4772 |
| 200 | 2.6004 |
| 500 | 2.7388 |
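A simulation of this kind could be sketched as follows. This is not necessarily the code that produced the numbers above. It assumes coupons are drawn uniformly at random from $1,\ldots,n$, and the function names, the number of trials and the seed are arbitrary choices made for illustration.

```python
import random
from collections import Counter


def draw_until_complete(n, rng):
    """Draw uniform coupons from 1..n until every coupon has appeared at least once."""
    seq, seen = [], set()
    while len(seen) < n:
        c = rng.randrange(1, n + 1)
        seq.append(c)
        seen.add(c)
    return seq


def t_value(seq, n):
    """T_n for a sequence that ends exactly when the collection is completed."""
    remaining = set(range(1, n + 1))
    current = list(seq)
    total = 0.0
    while remaining:
        counts = Counter(current)
        singles = {c for c in remaining if counts[c] == 1}  # k_i
        total += len(singles) * len(current) / len(remaining)
        remaining -= singles
        # Build r_{i+1}: keep surviving coupons, stop once all of them have appeared.
        seen, nxt = set(), []
        for c in current:
            if c in remaining:
                nxt.append(c)
                seen.add(c)
                if len(seen) == len(remaining):
                    break
        current = nxt
    return total / n


def estimate_T(n, trials=2000, seed=0):
    """Monte Carlo estimate of E[T_n] (the trial count is an arbitrary choice)."""
    rng = random.Random(seed)
    return sum(t_value(draw_until_complete(n, rng), n) for _ in range(trials)) / trials


if __name__ == "__main__":
    for n in (25, 50, 100):
        print(n, round(estimate_T(n), 4))
```

Since every $r_i$ is at least as long as the number of coupons still in play, each sample satisfies $T_n\ge 1$, which is a cheap sanity check on the output.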

If it helps, note that the expected size of $k_1$ is $H_n$; see Section 4 in Myers, Amy N., and Herbert S. Wilf. "Some new aspects of the coupon collector's problem." SIAM Review 48.3 (2006): 549-565.

  • From simulations up to $n=2^{16}$, the expected value seems to grow very slightly more than linearly; at most $n\log\log n$ and perhaps even just $n\log\log\log n$. – joriki Mar 30 '24 at 09:09
  • Thank you. I got similar results with simulations, and I know it doesn't converge to a constant. But what escapes me is to prove that the expected value must be larger than some (very small) function of $\log(n)$. – fox Mar 30 '24 at 10:33
  • You should include those efforts in the question. – joriki Mar 30 '24 at 10:57
  • About your edit on the expected size of $k_1$: This is also treated in this question. As usual, the expected size is much easier to obtain directly (as in that question) than using the full distribution (as in the paper you linked to). – joriki Mar 30 '24 at 11:01
