
Suppose I have a bag containing an unknown number N of plastic chips. The chips are labeled so that they are distinguishable; no two chips have the same label.

I draw a chip from the bag, observe its label, and put it back. After I've done this some number of times, I would like to estimate N.

As an example, suppose I draw and replace 50 times. 41 distinct chips come up; 7 come up twice and one comes up three times. What can I conclude about N? Obviously $N \ge 41$, but also it is extremely unlikely that N is as large as 1000; if it were, we would expect far fewer duplicates.
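To make the "far fewer duplicates" remark concrete, here is a minimal sketch that computes the expected number of distinct chips in $n$ draws from a bag of $N$ chips, namely $N\,(1-(1-1/N)^n)$. It assumes every draw is uniform and independent; the function name `expected_distinct` is only illustrative.

```python
def expected_distinct(N, n):
    """Expected number of distinct chips seen in n draws with replacement
    from N equally likely chips (assumes uniform, independent draws)."""
    # Each chip is missed on all n draws with probability (1 - 1/N)**n.
    return N * (1.0 - (1.0 - 1.0 / N) ** n)


# With N = 1000 and 50 draws we would expect nearly every draw to be new.
print(round(expected_distinct(1000, 50), 1))  # about 48.8
```

That is roughly one repeated draw on average when $N = 1000$, against the 9 repeated draws (50 draws, 41 distinct chips) in the example.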

This is very similar to the German tank problem. The crucial difference is that in the German tank problem the chips are labeled with consecutive numbers $1\ldots N$, so the labels themselves give us information about the population size; here they do not.

Another very similar problem is the ecological problem of estimating a population size. Mark and recapture is a simple and effective method, but it's not quite what I'm after here. For mark-and-recapture we'd want to sample the chips twice, and compare the two samples. If, say, the second half of the draws included 6 chips we had already seen in the first half, we might guess that $N$ is around 100.
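The "around 100" guess can be reproduced with the usual mark-and-recapture (Lincoln-Petersen) ratio $N \approx n_1 n_2 / m$. The sketch below assumes, purely for illustration, that each half of the 50 draws yields about 25 chips and ignores duplicates within a half; the function name and those sample sizes are my assumptions, not part of the problem statement.

```python
def lincoln_petersen(n_first, n_second, n_recaptured):
    """Lincoln-Petersen estimate: N is roughly n_first * n_second / n_recaptured."""
    if n_recaptured == 0:
        raise ValueError("no recaptures: the estimate is unbounded")
    return n_first * n_second / n_recaptured


# Hypothetical split: 25 chips in each half, 6 of them seen in both halves.
print(round(lincoln_petersen(25, 25, 6)))  # 104
```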

Specifically, what I want is a method:

  • The input is a vector of how many chips appeared $k$ times for each $k$. In the example above the vector is $\langle 33, 7, 1, 0, 0, \ldots\rangle$. We also have a probability $p$, say $p=0.95$ representing the degree of confidence we want to have in our answer.
  • The output is an interval $[ N_{\text{min}}, N_{\text{max}}]$ such that $N$ lies in this interval with probability close to $p$, and which also contains the $N$ that is most likely.
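One possible shape for such a method, offered as a rough sketch rather than a definitive answer: under uniform, independent draws, the likelihood of $N$ depends on the data only through the number of distinct chips $d$ and the number of draws $n$, and is proportional to $\frac{N!}{(N-d)!\,N^{n}}$. The code below scans candidate values of $N$, takes the maximizer as the most likely $N$, and keeps every $N$ whose log-likelihood is within `drop` of the maximum. The default `drop=1.92` is the usual asymptotic chi-square cutoff for roughly 95%, which is only an approximation here and is not a Bayesian probability statement about $N$. The names `summarize`, `log_likelihood`, `likelihood_interval` and the search cap `n_cap` are all illustrative assumptions.

```python
import math


def summarize(counts):
    """counts[k-1] is the number of chips that appeared exactly k times."""
    distinct = sum(counts)
    draws = sum(k * c for k, c in enumerate(counts, start=1))
    return distinct, draws


def log_likelihood(N, distinct, draws):
    # log of N! / ((N - distinct)! * N**draws); factors free of N are dropped.
    return (math.lgamma(N + 1) - math.lgamma(N - distinct + 1)
            - draws * math.log(N))


def likelihood_interval(counts, drop=1.92, n_cap=100_000):
    """Return (N_min, N_hat, N_max) from a likelihood-ratio cutoff.

    drop=1.92 is an assumed asymptotic ~95% cutoff; n_cap is an arbitrary
    upper limit on the search and truncates the interval if it is reached.
    """
    distinct, draws = summarize(counts)
    if distinct == draws:
        raise ValueError("no repeats: the likelihood keeps increasing with N")
    candidates = range(distinct, n_cap + 1)
    ll = [log_likelihood(N, distinct, draws) for N in candidates]
    best = max(ll)
    n_hat = candidates[ll.index(best)]
    inside = [N for N, value in zip(candidates, ll) if value >= best - drop]
    return inside[0], n_hat, inside[-1]


# The example above: 33 chips seen once, 7 seen twice, 1 seen three times.
n_min, n_hat, n_max = likelihood_interval([33, 7, 1])
print(n_min, n_hat, n_max)
```

By construction the interval contains the most likely $N$. A different cutoff, or a Bayesian treatment with an explicit prior on $N$, would give a somewhat different interval.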

I am sure this has been studied extensively and I would be glad for a reference to the literature.

MJD
  • See the related questions https://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library, https://math.stackexchange.com/questions/4197635/guessing-number-of-colors-of-beads-in-an-urn, https://math.stackexchange.com/questions/760664/i-pull-17-balls-out-of-a-bag-and-there-are-13-distinct-colors-in-the-sample, and https://math.stackexchange.com/questions/122699/how-to-estimate-the-number-of-articles-on-wikipedia-using-the-random-article-f – Henry Oct 20 '21 at 15:38
  • Thanks, I will close as duplicate once I look over the answers you pointed to and find the most satisfactory one. – MJD Oct 20 '21 at 15:54
  • I think you will find that the maximum likelihood estimate is $119$ for your particular example of $41$ unique chips from $50$ draws, but there are other methods too. A confidence interval is harder but might range from under $75$ to over $230$, depending on how you do it. – Henry Oct 20 '21 at 21:26
