12

There's an urn with an infinite number of balls, from which one can draw as many times as one wants. The balls have $n$ different colours, and the probability that a ball has a certain colour is $1/n$. For instance, if there are two colours, red and green, there's a 50% chance of a ball being red. The problem is that the number of colours is unknown.

Given that a person has made $m$ draws, of which there are $m_1$ occurrences of the colour $c_1$, $m_2$ occurrences of the colour $c_2$, $\ldots$, $m_n$ occurrences of colour $c_n$ (so that $m = m_1+m_2+\cdots+m_n$, not all necessarily non-zero), how can one estimate the total number of colours $n$?

If there's only one colour ($n=1$), one would expect to draw $c_1c_1c_1c_1\cdots$, but after a finite number of draws, one would still only have a probability that there's only one colour, and never be certain. But intuitively, one should be able to exclude large $n$s with high probability, so how would $\mathbb{P}(n\,|\,\underbrace{c_1,c_1,c_1,\ldots,c_1}_{m})$ be distributed, and what would the general case look like?

$$ \mathbb{P}(n\,|\,\underbrace{c_1,\ldots,c_1}_{m_1},\underbrace{c_2,\ldots,c_2}_{m_2},\ldots,\underbrace{c_n,\ldots,c_n}_{m_n})\quad\sim\quad? $$

There's a similar question here; however, that question doesn't assume that each colour is of equal probability.
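To make the setup concrete, here's a minimal simulation sketch (Python; the hidden number of colours and the number of draws are just illustrative values):

```python
# Minimal simulation of the setup; n_true and m are illustrative values.
import random
from collections import Counter

n_true = 4                                  # hidden number of colours
m = 10                                      # number of draws
draws = [random.randrange(n_true) for _ in range(m)]
print(Counter(draws))                       # e.g. Counter({0: 4, 2: 3, 1: 2, 3: 1})
```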

Frank Vel
  • I assume you don't have a prior distribution on the number of colors? – kimchi lover Dec 25 '17 at 16:51
  • @kimchilover No knowledge of that distribution. If it's helpful it can be assumed to be uniform too, but it is really unknown. – Frank Vel Dec 25 '17 at 17:02
  • Just a comment on the problem, but if $n$ is equal to $10^{1000}$ (for example) and you sample with a reasonable finite $m$, you will have a really hard time coming up with some kind of estimate of $n$ since all of the colors will, with probability approaching 1, be different. Turning it into a Stopping Problem would be interesting. – MikeY Jan 02 '18 at 18:20
  • @FrankVel: I've marked this as a duplicate of an older question (I don't know whether you automatically get notified of that.) You might want to have a look at my answer there as well. – joriki May 04 '20 at 20:08

5 Answers

3

Note: This is not a complete answer, but it explores some special cases in depth and then puts forward a conjecture for the general case.

First, an infinite number of balls in the urn only makes the question of what is and what is not well-defined quite complicated. Since you are assuming equal probabilities for all colours, I assume the "infinite balls" is just to ensure that the probabilities don't change after drawing one. But that can be achieved by just putting the drawn ball back (and mixing the urn) before drawing again (usually called drawing with replacement).

So I'll instead look at the following problem, which should be equivalent to the one you meant (it is certainly equivalent to my interpretation of your description):

Given an urn containing an unknown positive number of balls, all of which have a different colour, assume that on drawing $m$ balls we get $m_1$ balls of colour $c_1$, $m_2$ balls of colour $c_2$ and so on. What is the probability of the urn containing exactly $n$ balls?

The best way to approach this is probably to first assume that there is a large but finite upper bound $N$ for the possible number of colours (so we don't run into problems with infinities), and then consider the limit $N\to\infty$.

Now the problem as stated is under-specified, as you don't say anything about the prior probability of the number of colours (that is, the probabilities to assign when $m=0$). It might be that it is rather unlikely that there are too many colours, because whoever made those balls would have had to pay more to make more balls. Or it might be that it is unlikely that there are very few colours because the maker wants to show off his paint-making abilities.

But given that we don't have any information about it, let's just take as prior probability that $n$ is uniformly distributed between $1$ and $N$ (that is, before any draws, each possible value of $n$ has the same probability $1/N$).

Thus before the first draw, the expected number of colours is $\langle n\rangle = N/2$ and thus diverges as $N\to\infty$. Note also that $P(n)\to 0$ for each individual $n$.

I'm also assuming that you know nothing about the order of colours (that is, if $c_1$ is green and $c_2$ is orange, and an orange ball is drawn, you cannot conclude that there is also a green ball; it is not the case that when $c_2$ has been drawn, $c_1$ must exist, too).

Thus the first draw tells you absolutely nothing about the number of colours. All it tells you is that one of the colours is the colour you see. So all possible numbers still have equal probability, and $\langle n\rangle = N/2 \to \infty$ while $P(n)\to 0$ for all $n$.

Now the second draw is where it starts to get interesting, as here we can have two possible results: Either you draw the same colour again, or you draw another colour.

If there are $n$ colours in total, drawing the same colour again has probability $1/n$. Therefore we can use Bayes' rule to update our probabilities to $$P(n|m_1=2) = \frac{1/n}{\sum_{j=1}^{N}{1/j}}$$ Now the sum in the denominator diverges for $N\to\infty$, and thus we still have $P(n)\to 0$ for all $n$. The expectation value is then easily calculated to be $$\langle n\rangle = \frac{N}{\sum_{j=1}^{N}{1/j}}$$ where the numerator grows linearly while the denominator grows only logarithmically in $N$; thus for $N\to\infty$ we still get $\langle n\rangle \to \infty$.

On the other hand, if on the second draw you get another colour, all you've done is to exclude the possibility of there just being one colour, but all remaining options remain equally probable, so you'll have $$P(n|m_1=1,m_2=1) = \cases{ 0 & if $n=1$\\ \frac{1}{N-1} & otherwise }$$ It is quite obvious that also in this case, $P(n)\to 0$ and $\langle n\rangle\to\infty$ for $N\to\infty$.

Now let's look at the third draw. Let's first consider the case that on the first two draws we've got the same colour, and now the third draw again gives the same colour. In this case our prior probability for the third draw is $P(n|m_1=2)$ from above. Note that the denominator is independent of $n$ and thus cancels out in Bayes' rule. So again using the fact that the probability with $n$ colours to draw the same colour again is $1/n$, we get for the new probability: $$P(n|m_1=3) = \frac{1/n^2}{\sum_{j=1}^{N}{1/j^2}}$$ Now we get something new: The denominator converges for $N\to\infty$, as it is the series for $\zeta(2) = \pi^2/6$. Therefore the probabilities no longer vanish in the large $N$ limit; rather you get $$\lim_{N\to\infty} P(n|m_1=3) = \frac{6}{\pi^2 n^2}$$ For example, after drawing the same colour three times, the probability of that being the only colour is already $6/\pi^2 \approx 0.61$. However, the expectation value $\langle n\rangle$ still diverges.

We can see that this is easily extended to an arbitrary number of draws: If you always get the same colour for $m_1=m$ draws, then the probability of having $n$ colours total is, in the limit of large $N$, $$P(n|m_1=m) = \frac{1}{n^{m-1}\zeta(m-1)}$$ and the expectation value is $$\langle n\rangle = \frac{\zeta(m-2)}{\zeta(m-1)}$$ which is also finite for $m\ge 4$. Indeed, already for $m=4$ we get $\langle n\rangle \approx 1.37$.
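As a quick numerical sanity check of these limits, one can do the finite-$N$ updating directly; the following sketch (with an illustrative cutoff $N$) reproduces $1/\zeta(2) = 6/\pi^2 \approx 0.61$ and $\zeta(2)/\zeta(3) \approx 1.37$:

```python
# Finite-N Bayesian updating for m draws of the same colour, uniform
# prior on {1,...,N}; the posterior is proportional to n^-(m-1) since
# the first draw carries no information. N is an illustrative cutoff.
import math

N = 10**6

def same_colour_posterior(m):
    w = [n ** -(m - 1) for n in range(1, N + 1)]
    z = sum(w)
    mean = sum(n * wn for n, wn in zip(range(1, N + 1), w)) / z
    return w[0] / z, mean            # ( P(n=1 | m equal draws), <n> )

p1, mean3 = same_colour_posterior(3)
print(p1, 6 / math.pi ** 2)          # both ~0.6079, i.e. 1/zeta(2) with zeta(2) = pi^2/6
print(mean3)                         # ~8.75 here, and it keeps growing with N

_, mean4 = same_colour_posterior(4)
print(mean4)                         # ~1.368 = zeta(2)/zeta(3): finite from m = 4 on
```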

Going back to the case of three draws, the next case is where two of the draws were equal, but the third was different. As the order of draws should not matter, we may as well look at the case where the first two draws were different, but the third gave one of the two (it doesn't matter which one). This case is a bit easier to handle, as the prior probability for the third draw is still uniform on $\{2,\ldots,N\}$, with the case $n=1$ simply excluded. The probability to draw one of the two colours previously drawn is, of course, $2/n$ (there are $2$ colours out of $n$), so the probability distribution is $$P(n|m_1=2,m_2=1) = \cases{ 0 & for $n=1$\\ \frac{2/n}{\sum_{j=2}^N 2/j} & otherwise}$$ Note that you can actually cancel out the $2$ in the numerator.

For $N\to\infty$ the denominator again diverges; indeed, after cancelling out the $2$ it is just the harmonic series with the first term missing.

From a conceptual point of view, it seems obvious that the only thing that should matter for the probability distribution is whether the newly drawn value is one that was already drawn previously, or a new one. Also, the order should obviously not matter.

Therefore generalizing the results above, I conjecture the following general formula:

Conjecture: If in the first $m$ draws $k$ different colours were drawn, the probability distribution is $$P(n|m,k) = \cases{ 0 & if $n<k$\\ \frac{1}{n^{m-1}\sum_{j=k}^N 1/j^{m-1}} & otherwise}$$ and the corresponding expectation value is $$\langle n\rangle = \frac{\sum_{j=k}^N 1/j^{m-2}}{\sum_{j=k}^N 1/j^{m-1}}$$ Introducing the "truncated zeta function" $$\zeta_k(m) = \sum_{j=k}^{\infty} \frac{1}{j^m} = \zeta(m) - \sum_{j=1}^{k-1} \frac{1}{j^m}$$ the limiting case can then be written as $$P(n|m,k) = \cases{ 0 & for $n<k$\\ \frac{1}{n^{m-1}\zeta_k(m-1)} & otherwise}$$ and $$\langle n\rangle = \frac{\zeta_k(m-2)}{\zeta_k(m-1)}$$
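The conjectured formulas are easy to evaluate numerically; this sketch (with illustrative $m$, $k$ and cutoffs) shows the expectation value converging to $\zeta_k(m-2)/\zeta_k(m-1)$ as $N$ grows:

```python
# Truncated power sums for the conjecture: P(n | m, k) ~ n^-(m-1) on
# n >= k, so <n> is a ratio of two such sums. m, k and the cutoffs N
# are illustrative.
def expected_colours(m, k, N):
    num = sum(j ** -(m - 2) for j in range(k, N + 1))
    den = sum(j ** -(m - 1) for j in range(k, N + 1))
    return num / den

for N in (10, 1000, 100000):
    print(N, expected_colours(m=6, k=2, N=N))  # -> zeta_2(4)/zeta_2(5) ~ 2.23
```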

celtschk
  • Your double use of $n$ in your statement of the finite problem may be an error. Is there a typo? – jdods Jan 02 '18 at 21:17
  • @jdods: Indeed, I had used $n$ several times where it didn't belong; I fixed those now; hopefully I didn't miss any. Thank you for noting. – celtschk Jan 02 '18 at 21:23
  • It's in the boxed statement of the problem at the top: $n$ is stated to be the actual number of balls, and then the question is said to be what is the probability of it being $n$ balls. Thus $P(n)=1$ (surely). I'm sure there must be a typo. – jdods Jan 02 '18 at 21:29
  • @jdods: No, that is not a typo. The number of balls, $n$, is a random variable of the model. If you want to have a more frequentist account, imagine there are $N$ urns, one containing $1$ ball, one containing $2$ balls, and so on, urn $n$ containing $n$ balls. Then he randomly selects one of the urns and uses it for the subsequent drawings. The probability $P(n)$ of taking the urn with exactly $n$ balls is definitely not always $1$. – celtschk Jan 02 '18 at 23:10
  • What does "exactly $n$" mean for random variable $n$? – jdods Jan 03 '18 at 01:22
  • @jdods: If I roll a die once, then I might roll at least $3$ (that is, I got $3$, $4$, $5$ or $6$), more than $3$ (that is, $4$, $5$ or $6$), … or I may just roll exactly $3$ (that is, $3$ and nothing else). Yes, technically, "exactly" is redundant here. – celtschk Jan 03 '18 at 12:28
  • But $3$ isn't a random variable, it's a constant, so "exactly $3$" makes sense. If $n$ is, say, geometrically distributed, what does it mean for the number of balls to be exactly $n$? The "probability that it is geometrically distributed"? If you delete the first $n$ in the boxed statement it is ok in my opinion. But I won't press the matter any further. If I am just misunderstanding something, I apologize for wasting your time. – jdods Jan 03 '18 at 14:07
  • @jdods: OK, maybe my terminology in the previous comment was not exactly right. Anyway, the sentence "What is the probability of the urn containing exactly $n$ balls?" is the analog to "What is the probability of the die showing the result $n$" except that the random variable "roll outcome" is replaced by the random variable "number of balls in the urn". – celtschk Jan 03 '18 at 19:40
  • @jdods: Anyway, I've now removed the first $n$. While I don't see why it's a problem, it certainly isn't necessary. – celtschk Jan 03 '18 at 19:49
  • @celtschk: Hi – I marked this question as a duplicate of this older one. Interesting that we chose the same example to work out (getting the same result three times). You used the wrong value of $\zeta(2)$, though; it's $\frac{\pi^2}6$, not $\frac\pi6$. – joriki May 04 '20 at 20:04
2

Without a proper prior on the number of colors in the urn, one has to rely on some ad-hoc recipe.

The maximum likelihood estimate is, I think, always the number of distinct colors actually seen.

I would look at the taper to 0 of the histogram of the number of times each color is seen. If the rarest color was seen 100 times, I'd guess you've seen them all, but if the rarest colors occurred 3, 1, and 1 times apiece, I'd bet there were a few more colors to be seen, and would concoct a formula based on the shape of the taper.
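For illustration, such a count-of-counts histogram could be computed like this (a sketch with made-up draws):

```python
# Count-of-counts "taper" for a made-up sample of draws.
from collections import Counter

draws = ["red", "green", "red", "blue", "red", "green", "red", "mauve"]
per_colour = Counter(draws)            # how often each colour was seen
taper = Counter(per_colour.values())   # histogram of those counts

print(per_colour)   # red: 4, green: 2, blue: 1, mauve: 1
print(taper)        # {1: 2, ...}: two singleton colours, so more colours
                    # are probably still unseen
```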

kimchi lover
  • I am not sure that the taper really matters. Conditioned on the number of observed colours, the distribution of observations does not depend on $n$ anymore, so we should be able to extract any relevant information only from the number of observed colours, and the number of trials. – D. Thomine Jan 03 '18 at 21:18
  • It's possible you are right, and my answer is based on a partial bet-hedging disbelief in the stated model. – kimchi lover Jan 04 '18 at 12:54
1

I think celtschk has the right answer; this is just an explicit proof of their conjecture:

I define the number of colours we see as $d$ and $\boldsymbol{m} \triangleq [m_1, m_2, ..., m_d]$.

Hence, given an indexing for the colours, we have $$\boldsymbol{m}| n \sim \text{Multinomial}(m, n^{-1} \boldsymbol{1})$$

However, we require at least one colour in order to decide which colour will be the first index. As celtschk states:

Thus the first draw tells you absolutely nothing about the number of colours. All it tells you is that one of the colours is the colour you see. So you still have all possible numbers to be of equal probability

So for our purposes the distribution is actually,

$$m_1-1, m_2, ..., m_d| n \sim \text{Multinomial}((m-1), n^{-1} \boldsymbol{1})$$

as the first draw merely tells us which colour to denote as "$c_1$".

For notational simplicity, let us redefine the variables to exclude the first trial, i.e., replace $m_1$ by $m_1-1$, $m$ by $m-1$, and so on.

We have a prior of $p(n) \propto 1$.

So the joint likelihood is, $$p(\boldsymbol{m}, n) = \frac{m!}{m_1!m_2!\cdots m_d!}\prod_{i=1}^d n^{-m_i}\times 1 = \frac{m!}{m_1!m_2!\cdots m_d!} n^{-m}$$

We require $n \geq d$ (or, more accurately, if we see $d$ colours then the probability of there being $n < d$ colours is zero).

We use celtschk's notation of $$\zeta_d(m) \triangleq \sum_{j=d}^\infty \frac{1}{j^m}$$

And so, summing the joint likelihood over $n$ gives, $$p(\boldsymbol{m})= \sum_{n=d}^\infty \frac{m!}{m_1!m_2!\cdots m_d!} n^{-m} = \frac{m!}{m_1!m_2!\cdots m_d!} \zeta_d(m)$$ And hence the posterior for $n$ is, $$p(n|\boldsymbol{m}) = \frac{\frac{m!}{m_1!m_2!\cdots m_d!} n^{-m}}{\frac{m!}{m_1!m_2!\cdots m_d!} \zeta_d(m)} = \frac{1}{n^m\zeta_d(m)}$$

for $n \geq d$ (and 0 otherwise)

And hence our expectation is $$\langle n\rangle = \sum_{n=d}^\infty n \frac{n^{-m}}{\zeta_d(m)} = \frac{\zeta_d(m-1)}{\zeta_d(m)}$$

Note this is identical to celtschk's conjecture when we recall that we redefined our variables so that our "$m$" actually indicates $m-1$ trials.
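As a usage sketch (my own; the observed counts are illustrative), the posterior mean can be computed directly by truncating the zeta sums:

```python
# Posterior mean <n> = zeta_d(m-1)/zeta_d(m), where d is the number of
# distinct colours seen and m the number of draws minus one (the first
# draw is discounted as above). The zeta sums are truncated numerically;
# the counts below are illustrative. Needs at least 4 draws in total,
# otherwise the expectation diverges.
def posterior_mean(counts, terms=10**5):
    d = len(counts)                    # distinct colours seen
    m = sum(counts) - 1                # draws, excluding the first
    num = sum(j ** -(m - 1) for j in range(d, d + terms))
    den = sum(j ** -m for j in range(d, d + terms))
    return num / den

print(posterior_mean([3, 3, 2]))       # 3 colours in 8 draws -> <n> ~ 3.2
```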

1

Suppose you have $n$ colors, you make $m$ draws, and you see a total of $q$ different colors, $0 \le q \le n$.

That corresponds to a word of length $m$, made from the alphabet $\{1,2, \cdots ,n\}$, and composed of $q$ different characters.

The total number of words of length $m$ from the alphabet $\{1,2, \cdots ,n\}$ , thus with up to $n$ different characters, is $n^m$.

The number of words of length $m$ with exactly $q$ different characters taken from the alphabet $\{1,2, \cdots ,n\}$
will be $$ \bbox[lightyellow] { N(n,m,q) = \left\{ \matrix{ m \cr q \cr} \right\}n^{\;\underline {\,q\,} } = q!\left\{ \matrix{ m \cr q \cr} \right\}\left( \matrix{ n \cr q \cr} \right) }$$ with $$ \sum\limits_{0\, \le \,q\,\left( { \le \,\min \left( {n,m} \right)} \right)} {N(n,m,q)} = \sum\limits_{0\, \le \,q\,\left( { \le \,\min \left( {n,m} \right)} \right)} {\left\{ \matrix{ m \cr q \cr} \right\}n^{\;\underline {\,q\,} } } = n^{\,m} $$

We can in fact construct the word according to the following scheme $$ \bbox[lightyellow] { \eqalign{ & \left\{ {1,2, \cdots ,m} \right\} \cr & \quad \Downarrow {\rm surjection}\quad q!\left\{ \matrix{ m \cr q \cr} \right\} \cr & \left\{ {c_{\,1} ,c_{\,2} , \cdots ,c_{\,q} } \right\} \cr & \quad \Downarrow {\rm subset}\quad \left( \matrix{ n \cr q \cr} \right) \cr & \left\{ {1,2, \cdots ,n} \right\} \cr} }$$ refer, for instance, to this and to this other post.

Having established that, we can write $$ \bbox[lightyellow] { \eqalign{ & \left\{ \matrix{ m \cr q \cr} \right\}{{n^{\;\underline {\,q\,} } } \over {n^{\,m} }} = P\left( {\left[ {C = q} \right]\;\,\left| {\;\left[ {T = n} \right] \wedge \left[ {L = m} \right]} \right.} \right) = \cr & = {{P\left( {\left[ {C = q} \right] \wedge \left[ {T = n} \right]\;\,\left| {\;\left[ {L = m} \right]} \right.} \right)} \over {P\left( {\left[ {T = n} \right]\;\,\left| {\;\left[ {L = m} \right]} \right.} \right)}} \cr} }$$ where $C,T,L$ are random variables indicating the number of distinct characters appearing in the word, the number of characters of the alphabet (the total number of characters), and the length of the word.

Then, assuming a suitable a priori probability for $$ P\left( {\left[ {T = n} \right]\;\,\left| {\;\left[ {L = m} \right]} \right.} \right) $$ we have all the necessary premises to exploit Bayesian inference and the tools of hypothesis testing, possibly expanding the Bayes formula to involve the $m$ parameter as well.

For instance, assuming a uniform distribution, let's say for $1 \le T \le N$, we get $$ P\left( {\left[ {C = q} \right] \wedge \left[ {T = n} \right]\;\,\left| {\;\left[ {L = m} \right]} \right.} \right) = {1 \over N}\left\{ \matrix{ m \cr q \cr} \right\}{{n^{\;\underline {\,q\,} } } \over {n^{\,m} }} $$ which, putting e.g. $N=6$ and $m=4$, returns this table for $P\left( {\left[ {C = q} \right] \wedge \left[ {T = n} \right]\;\,\left| {\;\left[ {L = m} \right]} \right.} \right) $ $$ \bbox[lightyellow] { \matrix{ {n\backslash q} & 1 & 2 & 3 & 4 \cr 1 & {{1 \over 6}} & 0 & 0 & 0 \cr 2 & {{1 \over {48}}} & {{7 \over {48}}} & 0 & 0 \cr 3 & {{1 \over {162}}} & {{7 \over {81}}} & {{2 \over {27}}} & 0 \cr 4 & {{1 \over {384}}} & {{7 \over {128}}} & {{3 \over {32}}} & {{1 \over {64}}} \cr 5 & {{1 \over {750}}} & {{{14} \over {375}}} & {{{12} \over {125}}} & {{4 \over {125}}} \cr 6 & {{1 \over {1296}}} & {{{35} \over {1296}}} & {{5 \over {54}}} & {{5 \over {108}}} \cr } }$$ That means that if we have observed $q=3$ different characters, the most probable case is that the word comes from an alphabet of $5$ characters, with probability $$ p = {{12} \over {125}}\;\;\mathop /\limits_{} \;\;\left( {{2 \over {27}} + {3 \over {32}} + {{12} \over {125}} + {5 \over {54}}} \right) = {{12} \over {125}}{{12000} \over {4277}} = {{1152} \over {4277}} \approx 0.27 $$
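The table and the final probability can be reproduced with a short computation (a sketch using the standard recurrence for the Stirling numbers of the second kind):

```python
# Recomputing the table: Stirling numbers of the second kind via the
# standard recurrence, falling factorials, uniform prior 1/N, N = 6, m = 4.
from fractions import Fraction
from functools import lru_cache

@lru_cache(maxsize=None)
def stirling2(m, q):
    # S(0,0) = 1; S(m,0) = 0 for m > 0; S(m,q) = q*S(m-1,q) + S(m-1,q-1)
    if m == 0 and q == 0:
        return 1
    if m == 0 or q == 0:
        return 0
    return q * stirling2(m - 1, q) + stirling2(m - 1, q - 1)

def falling(n, q):
    # falling factorial n * (n-1) * ... * (n-q+1)
    out = 1
    for i in range(q):
        out *= n - i
    return out

N, m = 6, 4
table = {(n, q): Fraction(stirling2(m, q) * falling(n, q), N * n ** m)
         for n in range(1, N + 1) for q in range(1, m + 1)}

print(table[(5, 3)])                              # 12/125
col_q3 = sum(table[(n, 3)] for n in range(1, N + 1))
print(table[(5, 3)] / col_q3)                     # 1152/4277, about 0.27
```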

G Cab
0

Here I apply basic statistical analysis.

What you want here is an estimate of the number of colors $n_c$ from the sampled $m_i$. Let $m=\sum_{i=1}^{c_s}{m_i}$ be the number of balls taken from the urn, where $c_s$ is the number of colors seen. The estimated probability of each color is $m_i/m$. But you want to estimate the number of colors, which is one over the probability observed for each of the $c_s$ colors. So each seen color $i$ (with $i$ going from $1$ to $c_s$) yields a sample estimate $m/m_i$ of the number of colors. By assumption all colors are equally likely, so these estimates should all converge to the same average value. You therefore first calculate the average of the number-of-colors estimates:

$\langle n_c\rangle=\frac{1}{c_s}\sum_{i=1}^{c_s}{\frac{m}{m_i}}$

and you estimate the error by calculating the variance:

$\sigma^2=\frac{1}{c_s}\sum_{i=1}^{c_s}\left(\frac{m}{m_i}-\langle n_c\rangle\right)^2$

and there you have it! The estimate for the number of colors is $\langle n_c\rangle\pm\sigma/2$ (if your sampling had a Gaussian distribution, the $1/2$ could be adjusted to give a 95% interval or so, but the distribution is closer to a Poisson distribution, which could be explored further).

Example: Let's say we have the following distribution of 20 balls taken from the urn, with the colors labelled from 1 to 4:

{4, 2, 3, 4, 4, 3, 2, 4, 2, 3, 2, 4, 2, 4, 1, 4, 4, 4, 3, 1}

If we tally the results

{{4, 9}, {2, 5}, {3, 4}, {1, 2}} meaning we get 9 balls of color 4, 5 balls of color 2 and so forth...

We get the sampling for the number of colors ($20/m_i$):

{{4., 2.22222}, {2., 4.}, {3., 5.}, {1., 10.}}

We get an average number of colors of 5.30556 with an error $\sigma/2=1.44358$, so from 20 balls the estimate is roughly 4 to 7 possible colors.
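The numbers above can be reproduced with a few lines (a sketch using the same made-up sample of draws):

```python
# Reproducing the worked example above.
from collections import Counter

draws = [4, 2, 3, 4, 4, 3, 2, 4, 2, 3, 2, 4, 2, 4, 1, 4, 4, 4, 3, 1]
counts = Counter(draws)                          # {4: 9, 2: 5, 3: 4, 1: 2}
m = len(draws)                                   # 20

estimates = [m / mi for mi in counts.values()]   # 20/9, 4, 5, 10
mean = sum(estimates) / len(estimates)
var = sum((e - mean) ** 2 for e in estimates) / len(estimates)

print(mean, (var ** 0.5) / 2)                    # ~5.3056 and ~1.4436
```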

Gwanguy