There are $n$ numbers in a set. $k$ of these numbers have a specific property; the task is to learn how many of them do, i.e. to determine the number $k$.

The learning is done by repeated experiments. Each experiment consists of randomly selecting numbers from the set, one at a time, until a number that does not have the property is selected. Within an experiment, each number can be selected at most once.

Each new experiment starts from the full set again.

For example, if numbers 1, 2, 3 have the property but the number 4 does not, then "1, 4" and "2, 3, 4" and "4" are all examples of valid experiments, but "4, 1" and "3, 3, 4" are not.

What is the probability that after $m$ experiments all $k$ numbers with the property have been seen, i.e. the number $k$ is learned?

I tried to reason like this (assuming that learning each number is independent):

  • The probability for each single specific number to be observed in a single experiment is $p = \frac{1}{n - k + 1}$.
  • The probability that a number is observed within $m$ experiments is $1 - (1-p)^m$.
  • The probability that all $k$ numbers are observed is $\prod_{i=1}^{k}{(1 - (1-p_i)^m)} = (1 - (1-p)^m)^k$.

However, this does not match the numbers I got from running simulations on specific cases.

For example, for $n=8$ and $k=4$ and number of experiments $m=10$, the formula gives probability 0.6348597233188475, but simulations give around 0.660309.
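Both numbers are easy to reproduce. Below is a minimal sketch in Python (function names, the labelling of the set, and the trial count are my own choices): the first function evaluates the independence-based formula above, and the second runs a Monte Carlo simulation of the experiments as described.

```python
import random

def independence_formula(n, k, m):
    # The independence-based formula from the question:
    # each property number is seen in one experiment with p = 1/(n-k+1).
    p = 1 / (n - k + 1)
    return (1 - (1 - p) ** m) ** k

def simulate(n, k, m, trials=100_000, seed=1):
    # Monte Carlo estimate: numbers 0..k-1 have the property, k..n-1 do not.
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        seen = set()
        for _ in range(m):
            # Each experiment draws without replacement from the full set
            # until the first number lacking the property.
            for x in rng.sample(range(n), n):
                if x >= k:
                    break
                seen.add(x)
        if len(seen) == k:
            hits += 1
    return hits / trials

print(independence_formula(8, 4, 10))  # ≈ 0.63486, the value from the formula
print(simulate(8, 4, 10))              # ≈ 0.66, which does not match
```

The gap between the two outputs is the discrepancy described above.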

kfx
  • What numbers did you get from running simulations? – Brian Tung Sep 28 '15 at 17:29
  • Are successive experiments selecting numbers that have already been chosen in a previous experiment? If not, it seems like the number of experiments must be either $n-k$ (if the last number chosen does not have the property) or $n-k+1$ (if it does). After all, each experiment ends either on a single number lacking the property, or the exhaustion of the $n$ numbers. – Brian Tung Sep 28 '15 at 17:31
  • No, the numbers are not removed. And there is always guaranteed to be at least one number in the set with that property. – kfx Sep 28 '15 at 18:38
  • So the extractions are with replacement? – leonbloy Sep 28 '15 at 19:02
  • There are $k$ events in play, namely that each property-having number is observed in the $m$ experiments. Presumably the error in your calculation is assuming that those $k$ events are independent. You can probably do the case $n=3,k=2$ by hand to verify this. – Greg Martin Sep 28 '15 at 19:09
  • @leonbloy Clarified in the question. – kfx Sep 28 '15 at 19:11
  • I doubt that this can get a simple solution. Computed numerically (not by simulation, but by recursive exact equation) I get $p=0.66068061766$ – leonbloy Sep 28 '15 at 19:51
  • In case you are interested: https://ideone.com/ViBK7G – leonbloy Sep 28 '15 at 19:58
  • Why do you call all k-members selected at least once "k learnt"? You won't know that there aren't more unless you exhausted all elements in one of the experiments. – A.S. Sep 28 '15 at 23:24

1 Answer


The identity of the elements that lack the property is irrelevant; they serve merely to produce a probability $1-k/n$ of ending the experiment at each step. Thus we are effectively drawing from the $k$ property-having numbers, and the number $X_m$ of numbers drawn in $m$ experiments has a negative binomial distribution $X_m\sim \text{NB}(m,k/n)$ with

$$P(X_m=j)=\binom{j+m-1}j\left(\frac kn\right)^j\left(1-\frac kn\right)^m\;.$$

Conditional on $X_m=j$, the probability that all $k$ numbers have been observed is given by the probability distribution in the coupon collector's problem:

$$ \def\stir#1#2{\left\{#1\atop#2\right\}} P(\text{done after $m$ experiments}\mid X_m=j)=\frac{k!}{k^j}\stir jk=\frac1{k^j}\sum_{l=0}^k(-1)^{k-l}\binom kll^j\;, $$

where $\stir jk$ is a Stirling number of the second kind that counts the number of partitions of $j$ labeled elements into $k$ non-empty unlabeled subsets.
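As a sanity check, the inclusion–exclusion form of this conditional probability is straightforward to evaluate directly; here is a small sketch (the function name is mine):

```python
from math import comb

def coverage_prob(j, k):
    # P(j independent uniform draws from k labelled coupons hit all k),
    # i.e. (k!/k^j) * S(j, k), evaluated via inclusion-exclusion.
    return sum((-1) ** (k - l) * comb(k, l) * l ** j
               for l in range(k + 1)) / k ** j

# With k = 2 coupons this reduces to 1 - 2^(1-j); e.g. j = 3 draws:
print(coverage_prob(3, 2))  # 0.75
```

As expected, the probability is 0 whenever $j<k$ and approaches 1 as $j$ grows.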

The probability you want is obtained by summing over $j$:

\begin{align}
P(\text{done after $m$ experiments})&=\sum_{j=0}^\infty P(X_m=j)P(\text{done after $m$ experiments}\mid X_m=j)\\
&=\sum_{j=0}^\infty\binom{j+m-1}j\left(\frac kn\right)^j\left(1-\frac kn\right)^m\frac1{k^j}\sum_{l=0}^k(-1)^{k-l}\binom kll^j\\
&=\left(1-\frac kn\right)^m\sum_{l=0}^k(-1)^{k-l}\binom kl\sum_{j=0}^\infty\binom{j+m-1}j\left(\frac ln\right)^j\\
&=\left(1-\frac kn\right)^m\sum_{l=0}^k(-1)^{k-l}\binom kl\left(1-\frac ln\right)^{-m}\\
&=\sum_{l=0}^k(-1)^{k-l}\binom kl\left(\frac{n-k}{n-l}\right)^m\;.
\end{align}

I don't see how to simplify this further. The value that you simulated for $n=8$, $k=4$ and $m=10$ is

$$ \sum_{l=0}^4(-1)^l\binom4l\left(\frac4{8-l}\right)^{10}=\frac{36733580223913986742043}{55599603260670000000000}\approx0.66068061766\;, $$

as already calculated by leonbloy.
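For completeness, the closed form can be evaluated exactly with rational arithmetic; a short sketch (the function name is mine):

```python
from fractions import Fraction
from math import comb

def prob_all_seen(n, k, m):
    # P(all k property numbers observed after m experiments), computed
    # exactly as sum_l (-1)^(k-l) * C(k,l) * ((n-k)/(n-l))^m.
    return sum((-1) ** (k - l) * comb(k, l) * Fraction(n - k, n - l) ** m
               for l in range(k + 1))

p = prob_all_seen(8, 4, 10)
print(p, float(p))  # ≈ 0.66068061766, matching the value above
```

Using `Fraction` avoids floating-point error and reproduces the exact rational value quoted above.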

joriki