Suppose I have a bag containing an unknown number N of plastic chips. The chips are labeled so that they are distinguishable; no two chips have the same label.
I draw a chip from the bag, observe its label, and put it back. After I've done this some number of times, I would like to estimate N.
As an example, suppose I draw and replace 50 times. 41 distinct chips come up; 7 come up twice and one comes up three times. What can I conclude about N? Obviously $N\ge 41$, but also it is extremely unlikely that N is as large as 1000; if it were, we would expect far fewer duplicates.
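To make the "far fewer duplicates" intuition concrete, here is a minimal sketch that assumes every chip is equally likely on each draw (an assumption the setup suggests but does not state). The function name `expected_distinct` is made up for illustration. It computes the expected number of distinct chips seen in $n$ draws from a bag of $N$ chips.

```python
def expected_distinct(N, n):
    """Expected number of distinct chips in n draws with replacement from N chips.

    Assumes uniform, independent draws (an assumption, not given in the post).
    """
    # A given chip is missed by all n draws with probability (1 - 1/N)**n,
    # so it is seen at least once with the complementary probability.
    return N * (1 - (1 - 1 / N) ** n)


# Compare a few candidate bag sizes against the 41 distinct chips observed.
for N in (41, 100, 1000):
    print(N, round(expected_distinct(N, 50), 1))
```

With $N=1000$ this gives roughly 49 distinct chips in 50 draws, so seeing only 41 would be surprising.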
This is very similar to the German tank problem. The crucial difference is that in the German tank problem the chips are labeled with consecutive numbers $1\ldots N$, so the labels themselves give us information about the population size; here they do not.
Another very similar problem is the ecological problem of estimating a population size. Mark and recapture is a simple and effective method, but it's not quite what I'm after here. For mark-and-recapture we'd want to sample the chips twice and compare the two samples. If, say, the second half of the draws included 6 chips we had already seen in the first half, we might guess that $N$ is around 100.
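For reference, a guess of about 100 is consistent with the usual Lincoln-Petersen estimate, under the assumption (not stated above) that each half of the 50 draws contributes about 25 distinct chips. The symbols $n_1$, $n_2$ (sizes of the two samples) and $m$ (number of chips seen in both) are introduced here only for illustration:

$$\hat N \approx \frac{n_1 n_2}{m} = \frac{25 \cdot 25}{6} \approx 104.$$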
Specifically, what I want is a method:
- The input is a vector of how many chips appeared $k$ times for each $k$. In the example above the vector is $\langle 33, 7, 1, 0, 0, \ldots\rangle$. We also have a probability $p$, say $p=0.95$, representing the degree of confidence we want to have in our answer.
- The output is an interval $[ N_{\text{min}}, N_{\text{max}}]$ such that $N$ lies in this interval with probability close to $p$, and which also contains the $N$ that is most likely.
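To illustrate the kind of method I mean, here is one possible sketch; it is not a reference answer. It assumes uniform draws, under which the likelihood of $N$ depends on the data only through the number of draws $n$ and the number of distinct chips $d$, and is proportional to $N!/((N-d)!\,N^n)$. It also assumes a flat prior on $N$ truncated at a hypothetical cap `n_cap`. Both are my assumptions, and the names `log_likelihood` and `estimate_population` are made up for this example.

```python
import math


def log_likelihood(N, n, d):
    """Log of N! / ((N-d)! * N**n), the N-dependent part of
    P(d distinct chips in n uniform draws from N chips)."""
    if N < d:
        return -math.inf
    return math.lgamma(N + 1) - math.lgamma(N - d + 1) - n * math.log(N)


def estimate_population(counts, p=0.95, n_cap=100_000):
    """Return (most likely N, N_min, N_max).

    counts[k-1] is the number of chips that appeared exactly k times.
    Assumes a flat prior on d..n_cap (an assumption); the result is only
    meaningful when there are at least two more draws than distinct chips.
    """
    d = sum(counts)                                        # distinct chips
    n = sum(k * c for k, c in enumerate(counts, start=1))  # total draws
    if d == 0:
        raise ValueError("need at least one observed chip")

    logs = [log_likelihood(N, n, d) for N in range(d, n_cap + 1)]
    peak = max(logs)
    weights = [math.exp(v - peak) for v in logs]  # unnormalised posterior
    total = sum(weights)

    # Highest-density set: take values of N in order of decreasing
    # posterior weight until the accumulated mass reaches p.
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    mass, chosen = 0.0, []
    for i in order:
        chosen.append(i)
        mass += weights[i] / total
        if mass >= p:
            break

    # Report the smallest and largest N in the chosen set.
    return d + order[0], d + min(chosen), d + max(chosen)


# The example from the question: 33 chips once, 7 twice, 1 three times.
print(estimate_population([33, 7, 1], p=0.95))
```

Because the interval is built outward from the most likely $N$, it contains that value by construction. A different prior, or a frequentist confidence interval, would give different endpoints.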
I am sure this has been studied extensively and I would be glad for a reference to the literature.