
I was just reading a blog post about determining how many purchases are needed to complete a set of N unique items, given that you know in advance how many items are in the set.

As a modification of this problem, suppose you do not know how many unique items N are in the set (but we can assume each item appears with the same frequency). Your only way to estimate N is to repeatedly sample ("buy") a box and open it to reveal which item is inside. At some point you will buy a repeat, but when that happens is random. For example, let N = 26 and let each letter a-z represent an item type (| marks the end of a trial, when a repeat is encountered):

a, f, i, l, o, f |

c, q, i, e, s, t, l, r, c |

As soon as a repeated purchase is made, the trial ends. I think this could be done easily in a computer simulation, but I'm wondering how difficult an analytical solution is, or whether one even exists. My guess is that this is a "classic" problem that has been worked out, but my Google search prowess failed to turn up any results.

Specifically, is there a way to estimate N(n) with some confidence level X% after n = 1, 2, 3, etc. trials?

  • I'm pretty sure you could set up a maximum likelihood estimator for N as a function of the length of the sequence before the first repeat. When you say n=1, 2, 3 trials are you suggesting starting fresh each time (so in your example, n=2) or would you instead just keep running until you see n repeats? – ConMan Oct 27 '15 at 22:06
  • related: http://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library (the example here is a special case) – Henry Oct 27 '15 at 23:35

1 Answer


The probability, if there are $k$ distinct items, that the $n$th purchase is the first duplicate is $\dfrac{k!\,(n-1)}{(k-n+1)!\,k^n}$ for $2 \le n \le k+1$.
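As a quick sanity check on this formula, here is a minimal Python sketch that evaluates it with exact fractions and confirms the probabilities over $2 \le n \le k+1$ add up to $1$. The function name `first_duplicate_pmf` is just an illustrative choice, not something from the original post.

```python
from fractions import Fraction
from math import factorial


def first_duplicate_pmf(k: int, n: int) -> Fraction:
    """P(the n-th purchase is the first duplicate) with k equally likely items.

    Implements k! (n-1) / ((k-n+1)! k^n) for 2 <= n <= k+1, and 0 otherwise.
    """
    if not 2 <= n <= k + 1:
        return Fraction(0)
    return Fraction(factorial(k) * (n - 1), factorial(k - n + 1) * k**n)


k = 26
# The first duplicate must occur somewhere between purchase 2 and purchase k+1,
# so these probabilities should sum to exactly 1.
total = sum(first_duplicate_pmf(k, n) for n in range(2, k + 2))
print(total)
```

Exact fractions avoid any floating-point doubt about whether the sum is really $1$; for large $k$ a float version would be faster.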

If a duplicate first turns up on the $n$th purchase, it seems empirically that an estimate of about $\hat{k}\approx \dfrac{n^2}{2}-\dfrac{5n}{6}$ distinct items in the population is reasonable.
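One way to get a feel for this approximation is to compare it with a brute-force maximum likelihood estimate, i.e. the $k$ that maximises the probability above for the observed $n$. This is only a sketch under my own assumptions: it is not necessarily how the approximation was obtained, the names `likelihood` and `mle_k` are made up for illustration, and the search range up to $n^2$ is an assumed bound chosen to sit comfortably above $n^2/2$.

```python
from fractions import Fraction
from math import factorial


def likelihood(k: int, n: int) -> Fraction:
    """Probability that the first duplicate is purchase n, given k distinct items."""
    if k < n - 1:
        return Fraction(0)  # n-1 distinct items cannot be drawn from fewer than n-1
    return Fraction(factorial(k) * (n - 1), factorial(k - n + 1) * k**n)


def mle_k(n: int) -> int:
    """Brute-force maximum likelihood k for n >= 2 (assumed search range: n-1 .. n^2)."""
    candidates = range(max(1, n - 1), n * n + 1)
    return max(candidates, key=lambda k: likelihood(k, n))


# Print n, the brute-force MLE, and the approximation n^2/2 - 5n/6 side by side.
for n in (3, 4, 10, 14):
    print(n, mle_k(n), round(n**2 / 2 - 5 * n / 6, 2))
```

For small cases the two can be checked by hand: with $n=3$ the likelihood is $2(k-1)/k^2$, which is largest at $k=2$, and with $n=4$ it is $3(k-1)(k-2)/k^3$, which is largest at $k=5$.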

But the uncertainty would be huge. If there really had been $k=26$ distinct items, then the probability of the first duplicate turning up on any of the $3$rd through to $14$th purchases would be slightly less than $95\%$, and in those cases your estimate of $k$ might have been anywhere from $2$ distinct items through to $86$ distinct items.
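The numbers in this paragraph can be reproduced with a short Python sketch, using the probability formula and the approximate estimate $n^2/2 - 5n/6$ from above. The helper names `first_duplicate_pmf` and `estimate_k` are illustrative assumptions, not part of the original answer.

```python
from math import factorial


def first_duplicate_pmf(k: int, n: int) -> float:
    """P(the n-th purchase is the first duplicate) with k equally likely items."""
    if not 2 <= n <= k + 1:
        return 0.0
    return factorial(k) * (n - 1) / (factorial(k - n + 1) * k**n)


def estimate_k(n: int) -> float:
    """The empirical estimate n^2/2 - 5n/6 quoted in the answer."""
    return n**2 / 2 - 5 * n / 6


k = 26
# Probability that the first duplicate lands on purchases 3 through 14 inclusive.
coverage = sum(first_duplicate_pmf(k, n) for n in range(3, 15))
print(f"P(3 <= n <= 14 | k = {k}) = {coverage:.4f}")
print(f"estimate at n = 3:  {estimate_k(3):.1f}")
print(f"estimate at n = 14: {estimate_k(14):.1f}")
```

The spread between the two printed estimates is the point: a single trial leaves the estimate of $k$ uncertain by well over an order of magnitude.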

Henry