
Here is my problem and subsequent attempted proof:

Assume we have balls of $n$ different colors, with $k$ balls of each color, for a total of $nk$ balls. We want to draw from this set without replacement until we have seen at least one ball of each color.

Let $D_m$ denote the number of distinct colored balls seen after $m$ draws from the defined set. Therefore, $D_m$ has the same probability distribution as the number of draws with replacement until each color is drawn. When sampling with replacement, we have $m \approx n \log n$ (see the coupon collector problem), so this will serve as an upper bound on $\mathbb{E}[D_m]$. Furthermore, if $k = 1$ then we have $n$ different balls and $n$ balls total, so we must draw each one to see each type.

The probability of drawing any given ball within $m$ draws is $$1 - \left(1 - \frac{1}{nk}\right)^m$$ Therefore, the expectation of $D_m$ is $$\mathbb{E}[D_m] = nk \left[ 1 - \left(1 - \frac{1}{nk}\right)^m\right] \rightarrow nk \left(1 - e^{-m/nk}\right) \text{ for $n \gg 1$}$$ Now let $m = c \cdot n \log n$ for some $c \in (0,1)$. This simplifies our expectation to $$\mathbb{E}[D_m] = nk \left(1-e^{-c \log n/k}\right) < nk\left(1-e^{-\log n/k}\right)$$ We now consider two cases, depending on the size of $k$ relative to $\log n$:

$\underline{\textit{Case I:}}$ $\lim_{n \rightarrow \infty}\frac{k}{\log n} = 0$ $$\lim_{n \rightarrow \infty}\mathbb{E}[D_m] = nk(1-0) = nk$$

$\underline{\textit{Case II:}}$ $\lim_{n \rightarrow \infty}\frac{k}{\log n} = \infty$

In this case, since $n \log n < nk$ we have by the sample-with-replacement bound that $$\mathbb{E}[D_m] = n \log n < nk$$ Therefore, if $k < \log n$ for all $n \in \mathbb{Z}$, in expectation we will draw every ball to see each type, whereas when $k > \log n$ we need not.
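For reference, here is a minimal Monte Carlo sketch (arbitrary parameter values, Python) of the kind of numerical check behind these claims:

```python
import math
import random

def draws_to_see_all_colors(n, k):
    """Shuffle nk balls (k of each of n colors) and count draws until all colors appear."""
    balls = [color for color in range(n) for _ in range(k)]
    random.shuffle(balls)
    seen = set()
    for draws, color in enumerate(balls, start=1):
        seen.add(color)
        if len(seen) == n:
            return draws

def mean_draws(n, k, trials=2000):
    return sum(draws_to_see_all_colors(n, k) for _ in range(trials)) / trials

for n, k in [(50, 1), (50, 5), (50, 50)]:
    print(n, k, mean_draws(n, k), n * math.log(n))  # compare estimate to n log n
```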

My question concerns the validity of the approach, namely whether two of my "hand-waving" steps are valid: (1) imposing the $n \log n$ upper bound on the expectation, and (2) the definition of $D_m$ as it relates to the sampling-with-replacement problem?

Doc Stories
  • What is it that you actually want, in the end? Your intro paragraph indicates you are interested in the expected number of draws to see all colors, but the rest of your question analyzes $D_m$, which is a completely different random variable. – Mike Earnest Dec 07 '21 at 19:53
  • I am looking to find the expected number of draws to see at least one of each color: in essence the "coupon collector problem" without replacement. – Doc Stories Dec 07 '21 at 20:18
  • To follow up: in numerical simulations, you require about $n \log n$ samples, in accordance with the sampling-with-replacement result (assuming $k$ is large enough). – Doc Stories Dec 07 '21 at 20:32
  • Thank you for clarifying. One last question, before I post my answer. Did you mean to define $D_m$ to be the number of distinct balls in $m$ draws (as you have written), or the number of distinct colors in $m$ draws? I would suspect you meant colors, as the number of distinct balls in $m$ draws is always exactly $m$. – Mike Earnest Dec 07 '21 at 20:35
  • Yes, sorry. I am only interested in distinct colors. Two balls of the same color are not of interest. – Doc Stories Dec 07 '21 at 20:37
  • I noticed you also asked this question on MO. I just wanted to say, I think my second answer does a better job of quantifying how far away $E[N]$ is from its limiting value of $nH_n \approx n(\log n+\gamma)$ than Fedor's answer, even though his methods are more sophisticated. Thanks for sharing this problem, I had a lot of fun with it :^) – Mike Earnest Dec 09 '21 at 22:44
  • Yes, I agree your method is substantially more elegant! It's an interesting problem, and I was mainly curious why this "coupon collector without replacement" problem seems to have gone unexplored. – Doc Stories Dec 11 '21 at 20:53

2 Answers


Mistakes in your analysis

To make things clear, let $N$ be the number of draws it takes to see all colors, so $E[N]$ is your quantity of interest.

Therefore, $D_m$ has the same probability distribution as the number of draws with replacement until each color is drawn.

There are two things wrong with this:

  • First, a unit mismatch. For $D_m$, the number of draws is fixed, and the number of distinct colors seen is random. This is not the same as having the number of colors be fixed, and the number of draws to reach that target being random. Therefore, $D_m$ does not have the same distribution as $N$.

    • What you can probably say is that $E[N]\approx n\log n$.
  • Your definition of $D_m$ still fits in a "without-replacement" paradigm. You could say that it is approximately with-replacement.

The probability of drawing any given ball within $m$ draws is $1-(1-{1\over nk})^m$. Therefore, the expectation of $D_m$ is $\mathbb{E}[D_m] = nk \left[ 1 - \left(1 - \frac{1}{nk}\right)^m\right] \rightarrow nk \left(1 - e^{-m/nk}\right) \text{ for $n \gg 1$}$

  • This calculation assumes drawing with replacement, which is only approximately correct. Approximations should always be written with $\approx$.

  • Even granting the with-replacement assumption, the approximation is computed incorrectly. The probability that a given color is present after $m$ draws with replacement would be $$ 1-(1-\tfrac1{\color{red}{n}})^m \approx 1-e^{-m/n} $$ Then, linearity of expectation would imply $$ E[D_m]\approx n(1-e^{-m/n}) $$ The probability without replacement is $$ 1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}, $$ so the exact expression is $$ E[D_m]=n\left(1-\binom{(n-1)k}{m}\Big/\binom{nk}{m}\right) $$
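For concreteness, a minimal sketch (arbitrary parameters) comparing the exact without-replacement expression with the with-replacement approximation:

```python
from math import comb, exp

def E_Dm_exact(n, k, m):
    # exact: n * (1 - C((n-1)k, m) / C(nk, m)), sampling without replacement
    return n * (1 - comb((n - 1) * k, m) / comb(n * k, m))

def E_Dm_with_replacement(n, m):
    # with-replacement approximation: n * (1 - e^(-m/n))
    return n * (1 - exp(-m / n))

for n, k, m in [(20, 5, 30), (100, 10, 200)]:
    print(n, k, m, E_Dm_exact(n, k, m), E_Dm_with_replacement(n, m))
```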

Therefore, if $k < \log n$ for all $n \in \mathbb{Z}$, in expectation we will draw every ball to see each type, whereas when $k > \log n$ we need not.

The problem here is that the language is imprecise. What does "in expectation we will draw every ball to see each type" mean? In order to find $E[N]$, you will need a different method.


For each $i\in \{1,\dots,n\}$, let $A_{i,m}$ be the event that you do not see color number $i$ in the first $m$ draws. Since the set of balls in the first $m$ draws is uniformly distributed over all $\binom{nk}{m}$ subsets of that size, $$ P(A_{1,m}\cap A_{2,m}\cap\cdots \cap A_{j,m})=\binom{(n-j)k}{m}\Big/\binom {nk}m $$

Now, letting $B_m$ be the event that some color is missing after $m$ draws, we can use the above result, along with the principle of inclusion exclusion, to conclude $$ P(B_m)=\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$ Warning: For this formula to work, we need to use the following convention: $\binom{a}{b}$ is defined to be zero when $b>a$ or $b$ is negative. This means we do not have to worry about summation bounds.

Finally, since $N>m$ precisely when some color is still missing after $m$ draws, $P(N>m)=P(B_m)$, and $$ E[N]=\sum_{m=0}^{\infty}P(N>m)=\sum_{m=0}^{nk} P(B_m)= \sum_{m=0}^{nk}\sum_{j=1}^n(-1)^{j-1}\binom{n}{j}\binom{(n-j)k}{m}\Big/\binom {nk}m $$
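A direct transcription of this into code, as a minimal sketch (note that Python's `math.comb` already returns $0$ when the lower index exceeds the upper, matching the convention above):

```python
from math import comb

def expected_draws(n, k):
    """E[N] via the inclusion-exclusion double sum; comb(a, b) is 0 when b > a."""
    total = 0.0
    for m in range(n * k + 1):
        denom = comb(n * k, m)
        P_Bm = sum((-1) ** (j - 1) * comb(n, j) * comb((n - j) * k, m)
                   for j in range(1, n + 1)) / denom
        total += P_Bm
    return total

# sanity check: with k = 1 every ball must be drawn, so E[N] should equal n
for n in [3, 5, 8]:
    print(n, expected_draws(n, 1))
```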

Mike Earnest
  • Thank you for the extensive feedback; I agree with most of the above. However, I am a little confused by your resulting $E[N]$ formula: in the case where $k=1$ we should need exactly $n$ samples to see each type ($E[N] = n$), but I do not think this solution yields that result. Ultimately, I want to say that we need $k > \log n$ in order to see one of each color before exhausting the set ($E[N] < nk$). – Doc Stories Dec 07 '21 at 21:51
  • @DocStories "I don't think this solution yields that result." What is your reason for this opinion? Can you find a single value of $n$ such that my formula with $k=1$ does not evaluate to $n$? – Mike Earnest Dec 07 '21 at 22:53
  • It seems you edited your solution to adjust some of the summation bounds, so my issue may have been resolved; it was with the summation in $P(B_m)$. Why does it run from $1$ to $n$? In calculating $E[N]$ with $k = 1$, when $n - j < m$, isn't the binomial coefficient invalid? – Doc Stories Dec 08 '21 at 00:19
  • I think it should read "defined to be zero when $b > a$", but yes, this would work. – Doc Stories Dec 08 '21 at 01:10
  • @DocStories By the way, that is a default convention that everyone agrees on. $\binom{a}{b}$ is defined as the number of $b$-element subsets of an $a$-element set. When $b>a$, there are zero such subsets, so $\binom{a}b=0$. – Mike Earnest Dec 08 '21 at 01:16
  • @DocStories By the way, the complicated formula I gave, though exactly correct, does not easily yield an asymptotic approximation. However, at least you can use that formula to computationally check how large $k$ needs to be in order for $E[N]$ to be close to its limiting value of $n(1+1/2+\dots+1/n)$. My initial exploration suggests that the error between exact and approximate decreases like $O(1/k)$ as $k\to\infty$. – Mike Earnest Dec 08 '21 at 01:31
  • Right. That's why I was trying to manipulate the easier-to-work-with $E[D_m]$ to give an approximation, which helps to give the $n \log n$ result. – Doc Stories Dec 08 '21 at 02:04
  • Maybe some loose approximation like Stirling with my original two cases would help? – Doc Stories Dec 08 '21 at 02:18

Let $H_n=\frac11+\frac12+\dots+\frac1n$ be the $n^{\text{th}}$ harmonic number, and let $$ H_{n,p}=\sum_{i=1}^n\frac1{i^p} $$ be the $(n,p)^{\text{th}}$ generalized harmonic number. Recall $H_n\approx \log n+\gamma$, and $H_{n,2}\approx \zeta(2)-\int_{n+1}^\infty x^{-2}\,dx=\zeta(2)-\frac{1}{n+1}$. That is, for large $n$, $H_{n,2}$ is approximately constant.
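Both approximations are easy to verify numerically; a minimal sketch:

```python
from math import log, pi

GAMMA = 0.5772156649015329  # Euler–Mascheroni constant

for n in (10, 100, 1000):
    H1 = sum(1 / i for i in range(1, n + 1))      # H_n
    H2 = sum(1 / i**2 for i in range(1, n + 1))   # H_{n,2}
    # H_n vs log n + gamma, and H_{n,2} vs zeta(2) - 1/(n+1)
    print(n, H1, log(n) + GAMMA, H2, pi**2 / 6 - 1 / (n + 1))
```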

Answer: The expected number of balls drawn has the following first-order asymptotic expansion: $$ nH_n -\frac1k\left[n\left(\frac{(H_n)^2+H_{n,2}}2\right)-H_n\right] +O\left({1\over k^2}\right)\qquad \text{as }k\to\infty $$ The dominant term is $nH_n$, while the second term is approximately $\frac{n(\log n)^2}{2k}$ when $n$ is large. This shows that as long as $k\gg n (\log n)^2$, the expected value is very close to that of the coupon collector problem with replacement.

Proof: It turns out there is a very simple expression for the expected number of balls drawn: $$ E[\text{# balls drawn}] =\sum_{i=1}^n (-1)^{i-1}\binom{n}{i} \frac{nk+1}{ik+1} \tag 1 $$

I gave a proof here: Coupon collector without replacement. In order to estimate how large we need to make $k$ for this to be close to the limiting value $nH_n$, let us expand the fraction $\frac{nk+1}{ik+1}$ into negative powers of $k$: $$ \begin{align} \frac{nk+1}{ik+1} &=(nk+1)\frac{(ik)^{-1}}{1+(ik)^{-1}} \\&=(nk+1)\left[\frac1{ik}-\frac1{i^2k^2}+\frac1{i^3k^3}-\dots\right] \\&= \frac{n}{i}+\frac1k\left(\frac1i-\frac{n}{i^2}\right)+O\left(\frac1{k^2}\right) \quad \text{as $k\to\infty$}\tag2 \end{align} $$ Combining $(1)$ and $(2)$, the expected number of draws is $$ n\left(\color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i}\right) + \frac1k\left[\left(\color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i}\right) - n\left(\color{#207}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1{i^2}}\right)\right] + O\left(\frac1{k^2}\right) $$ Therefore, in order to complete the proof, it suffices to prove the combinatorial summations $$ \begin{align} \color{#a43}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1i} &= H_n \tag3 \\ \color{#207}{\sum_{i=1}^n(-1)^{i-1}\binom{n}i\frac1{i^2}} &= \frac{(H_n)^2+H_{n,2}}2\tag4 \end{align} $$ Wikipedia has a proof of $(3)$, and you can prove $(4)$ by similar methods.
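To see the expansion in action, here is a minimal numeric sketch comparing formula $(1)$ (evaluated in exact rational arithmetic, to avoid cancellation in the alternating sum) with the two-term expansion:

```python
from fractions import Fraction
from math import comb

def E_N_exact(n, k):
    # formula (1); exact rationals dodge floating-point cancellation
    s = sum((-1) ** (i - 1) * comb(n, i) * Fraction(n * k + 1, i * k + 1)
            for i in range(1, n + 1))
    return float(s)

def E_N_two_term(n, k):
    # n*H_n - (1/k) * [n*((H_n)^2 + H_{n,2})/2 - H_n]
    H1 = sum(Fraction(1, i) for i in range(1, n + 1))
    H2 = sum(Fraction(1, i * i) for i in range(1, n + 1))
    return float(n * H1 - (n * (H1**2 + H2) / 2 - H1) / k)

for k in (10, 100, 1000):
    print(k, E_N_exact(20, k), E_N_two_term(20, k))
```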

Mike Earnest