
I have a bag filled with balls of different colors. My goal is to determine the number of distinct colors in the bag, but I am limited to taking a small sample. In a sample of $N$ balls, I see $X$ different colors. What is the expected number of different colors in the bag?

Some assumptions which need to be made:

  • The bag is large enough that the probability of drawing a certain color does not depend on how many balls we have already drawn. (Effectively, we are drawing with replacement.)
  • There is an equal number of each color in the bag.

For example, let's say that I draw $N=17$ balls out of the bag, and I see $X=13$ distinct colors. What is a good estimate for the number of colors in the bag?

So far, I have made little progress towards answering this on my own. I have tried to invert the solution to the coupon collector's problem (so as to solve for the number of colors rather than the number of trials), but I got stuck because it involves the harmonic numbers.

PhiNotPi
  • Wouldn't this depend on how many balls you have in the bag? – user88595 Apr 19 '14 at 16:59
  • Would it be allowed to take many small samples, or are we restricted to just the one? – Marc Apr 19 '14 at 17:30
  • @Marc I would prefer a solution with a single sample, if feasible. What would be the difference between multiple small samples and one large sample? – PhiNotPi Apr 19 '14 at 18:09
  • @user88595 The size of the bag could serve as an upper bound (and thus change the estimate slightly), but I think that would only be important when the sample is a significant portion of the bag. I made the simplifying assumption that the bag was large enough relative to the sample so that those effects would be negligible. – PhiNotPi Apr 19 '14 at 18:18
  • Similar to http://math.stackexchange.com/questions/615464/how-many-books-are-in-a-library/ – Henry Apr 19 '14 at 21:13

1 Answer


If the sample is size $n$, the likelihood you see $X=x$ different colours when there are $Y=y$ possible colours is proportional to $$\dfrac{y!}{(y-x)! y^n}.$$
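As a sanity check on this proportionality, here is a minimal Python sketch. The function names `distinct_pmf` and `kernel` are illustrative and not part of the answer. It computes the exact distribution of the number of distinct colours by stepping through the draws one at a time (assuming equally likely colours and drawing with replacement, as in the question), and compares it with the expression above.

```python
from fractions import Fraction
from math import factorial


def distinct_pmf(n, y):
    """Exact P(X = k), k = 0..n, after n draws with replacement
    from y equally likely colours (illustrative helper)."""
    probs = [Fraction(1)] + [Fraction(0)] * n  # before any draw: 0 colours seen
    for _ in range(n):
        nxt = [Fraction(0)] * (n + 1)
        for k, p in enumerate(probs):
            if p == 0:
                continue
            nxt[k] += p * Fraction(k, y)              # draw repeats a seen colour
            if k < n:
                nxt[k + 1] += p * Fraction(y - k, y)  # draw shows a new colour
        probs = nxt
    return probs


def kernel(n, x, y):
    """The y-dependent factor y! / ((y - x)! * y**n)."""
    return Fraction(factorial(y), factorial(y - x) * y**n)


# The constant of proportionality does not depend on y, so ratios must agree.
n, x = 17, 13
exact_ratio = distinct_pmf(n, 28)[x] / distinct_pmf(n, 40)[x]
kernel_ratio = kernel(n, x, 28) / kernel(n, x, 40)
print(exact_ratio == kernel_ratio)
```

The constant that is dropped depends only on $n$ and $x$ (it is the Stirling number of the second kind $S(n,x)$), which is why it plays no role when comparing different values of $y$.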

So given $X=x$, the maximum likelihood estimate for $Y$ is the positive real solution to $y(y-1)^n -(y-x) y^n=0$ rounded down, or in other words the largest integer $y$ for which the left-hand side is non-negative. (Note that if $x=n$ the maximum likelihood estimate is infinite.)
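One way to see where this condition comes from is to compare the likelihood at consecutive values of $y$. Writing $L(y)$ for the expression $\dfrac{y!}{(y-x)!\,y^n}$ (the name $L$ is introduced here only for this sketch):

$$\frac{L(y)}{L(y-1)} = \frac{y!}{(y-x)!\,y^n}\cdot\frac{(y-1-x)!\,(y-1)^n}{(y-1)!} = \frac{y\,(y-1)^n}{(y-x)\,y^n}.$$

So $L(y)\ge L(y-1)$ exactly when $y(y-1)^n-(y-x)\,y^n\ge 0$, and the likelihood is maximised at the last integer $y$ for which this holds.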

In this particular case with $n=17$ and $X=13$, the maximum likelihood estimate for $Y$ is $28$.
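This value can be reproduced with a short search. The following is a sketch in Python using exact integer arithmetic; the function name `mle_colours` is hypothetical. It starts at $y=x$ and keeps increasing $y$ for as long as the next value is at least as likely.

```python
def mle_colours(n, x):
    """Largest integer y with y*(y-1)**n - (y-x)*y**n >= 0
    (illustrative helper; requires 1 <= x < n)."""
    if not 1 <= x < n:
        raise ValueError("need 1 <= x < n; for x == n the estimate is unbounded")
    y = x  # there cannot be fewer colours than were observed
    # Step up while the likelihood at y + 1 is at least that at y.
    while (y + 1) * y**n >= (y + 1 - x) * (y + 1) ** n:
        y += 1
    return y


print(mle_colours(17, 13))  # 28
```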

The uncertainty is considerable: any value for $Y$ from $15$ through to $91$ would have a likelihood for seeing $13$ unique colours from a sample of $17$ more than a tenth of the likelihood resulting from $Y=28$.
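The range can be checked numerically. This is a minimal sketch, assuming the same likelihood expression as above and working with logarithms to avoid huge factorials; the names `log_kernel` and `likelihood_range` and the search cap `y_max` are illustrative choices, not part of the answer.

```python
from math import lgamma, log


def log_kernel(n, x, y):
    """log of y! / ((y - x)! * y**n), via log-gamma."""
    return lgamma(y + 1) - lgamma(y - x + 1) - n * log(y)


def likelihood_range(n, x, y_hat, ratio=0.1, y_max=10_000):
    """Smallest and largest y in [x, y_max] whose likelihood exceeds
    `ratio` times the likelihood at y_hat (illustrative helper)."""
    threshold = log_kernel(n, x, y_hat) + log(ratio)
    ys = [y for y in range(x, y_max + 1) if log_kernel(n, x, y) > threshold]
    return min(ys), max(ys)


print(likelihood_range(17, 13, 28))  # (15, 91)
```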

Henry