
I'm working on a variant of the birthday problem that I haven't found discussed on this site.

Suppose the sequence $(X_n)$ of independent random variables takes values uniformly in $\{ 1,\dots,N \}$. Let $F_{N} = \min\{ m: X_m = X_k \text{ for some } k<m \}$ be the first time that a match is observed.

I want to know what can be said about $E(F_N)$ as $N \to \infty$.

It's easy to see that $$P(F_N = k) = \frac{N}{N} \frac{N-1}{N} \cdots \frac{N - (k-2)}{N} \cdot \frac{k-1}{N}.$$

Hence, $$E(F_N) = \sum_{k=2}^{N+1} k \Big[\frac{N}{N} \frac{N-1}{N} \cdots \frac{N - (k-2)}{N} \cdot \frac{k-1}{N} \Big]. $$
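As a numerical sanity check (my own sketch, not part of the original question; the helper name `expected_first_match` is mine), the exact sum above can be evaluated directly with rational arithmetic:

```python
from fractions import Fraction

def expected_first_match(N):
    """Exact E(F_N) = sum_{k=2}^{N+1} k * P(F_N = k) via the formula above."""
    total = Fraction(0)
    no_match = Fraction(1)  # P(first k-1 draws are all distinct)
    for k in range(2, N + 2):
        p_k = no_match * Fraction(k - 1, N)   # P(F_N = k)
        total += k * p_k
        no_match *= Fraction(N - (k - 1), N)  # extend to k distinct draws
    return total

print(float(expected_first_match(365)))  # ≈ 24.617, the classic birthday value
```

For $N=365$ this recovers the familiar expected number of people needed for a birthday coincidence, about $24.6$.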

Any suggestions about where to go from here?

aduh
  • Note that $$P(F_N \geqslant k+2) = \prod_{i=1}^{k}\left(1-\frac{i}N\right)$$ and that the RHS is of order $$e^{-k^2/(2N)}$$ hence, if $k_N^2\ll N\ll j_N^2$, then $$P(F_N\geqslant k_N)\to1\qquad P(F_N\geqslant j_N)\to0$$ In this sense, $$F_N=\Theta(\sqrt{N})$$ and one can guess that the same asymptotics holds for $E(F_N)$. – Did Sep 25 '16 at 22:43
  • Remark: to get the first asymptotic Did stated, note that $\log \prod_{i=1}^k (1-i/N) = \sum_{i=1}^k \log(1-i/N) \approx \sum_{i=1}^k -i/N = -\frac{k(k+1)}{2N} \approx \frac{-k^2}{2N}$. Then exponentiate everywhere. These approximations are reasonable in the region $1 \ll k \ll N$. – Ian Sep 25 '16 at 22:58
  • Maybe one can follow @Did's idea to obtain $E(F_N)\sim c \sqrt N$ with some explicit positive constant $c$. – Sungjin Kim Sep 25 '16 at 23:12
  • Conjecture: $$\lim_{N\to\infty}\frac{E(F_N)}{\sqrt{N}}=\sqrt{\frac\pi2}$$ – Did Sep 25 '16 at 23:15
  • @Did An empirical calculation and bold extrapolation suggest $1.2533$ might be close, and this is indeed $\sqrt{\frac\pi2}$ rounded – Henry Sep 25 '16 at 23:22
  • Sketch of a possible proof of Did's conjecture: $E[F_N]=\sum_{k=1}^{N+1} P(F_N \geq k)=2+\sum_{k=1}^{N-1} P(F_N \geq k+2) \sim 2+\sum_{k=1}^{N-1} e^{-k^2/2N}$. (The last step requires proof.) Finally if we look at $\frac{1}{\sqrt{N}} \sum_{k=1}^N e^{-k^2/2N}$, we may consider dividing the interval $[0,\sqrt{N}]$ into $N$ subintervals of length $\frac{1}{\sqrt{N}}$. The desired sum is then a rectangle rule where $x_k=k/\sqrt{N}$ so that $x_k^2/2=k^2/2N$. – Ian Sep 25 '16 at 23:34
  • So we would hope that this sum would behave like $\int_0^{\sqrt{N}} e^{-x^2/2} dx$ which of course converges to $\frac{\sqrt{2 \pi}}{2} = \sqrt{\pi/2}$ as Did conjectured. This rectangle rule step still requires proof, because of the growth of the domain of integration, but I suspect that proof is not really so difficult: instead of trying to argue that you are approximating $\int_0^{\sqrt{N}}$, instead throw in an additional term so that it "looks like" you are approximating $\int_0^\infty$ and control the tail using standard techniques. – Ian Sep 25 '16 at 23:34
  • The expectation is the ratio of OEIS A063170 and OEIS A000312, and in the "Formula" section of the former N-E. Fahssi gives the same asymptotic as @Did – Henry Sep 25 '16 at 23:34
  • @Did Would a similar result hold for, say, the time of the second match $F_N^{(2)}$, the third match $F_N^{(3)}$, etc.? – Sungjin Kim Sep 26 '16 at 20:59
  • @i707107 It seems that, for every fixed $n$, setting $F_N^{(0)}=0$, the random vector $$\left(\frac{F_N^{(k)}-F_N^{(k-1)}}{\sqrt{N}}\right)_{1\leqslant k\leqslant n}$$ converges in distribution to a continuous nonnegative random vector with joint PDF $$x_1x_2\cdots x_n\,e^{-(x_1+x_2+\cdots+x_n)^2/2}$$ This suggests that each $(F_N^{(n)}-F_N^{(n-1)})/\sqrt{N}$ converges in distribution to a random variable with PDF $$xe^{-x^2/2}$$ and that, for every fixed $n$, $$\lim_{N\to\infty}\frac{E(F_N^{(n)})}{\sqrt{N}}=n\,\sqrt{\frac\pi2}$$ – Did Sep 27 '16 at 05:53
  • @Did I calculated the joint PDF as $N\rightarrow\infty$. I am not sure if the expression you have, and that I have are equivalent. – Sungjin Kim Sep 28 '16 at 04:40
  • @Did It seems that $x_2$ needs to be replaced by $x_1+x_2$, $\ldots$ , $x_n$ needs to be replaced by $x_1+x_2+\cdots+x_n$. – Sungjin Kim Sep 28 '16 at 05:07
  • @i707107 Indeed, I stand corrected. Then $$\lim_{N\to\infty}\frac{E(F_N^{(k)})}{\sqrt{N}}=m_k$$ with $$m_k=\frac1{2^{k-1}(k-1)!}\int_0^\infty x^{2k}e^{-x^2/2}dx=\frac{(2k-1)!!}{2^{k-1}(k-1)!}\sqrt{\frac\pi2}=\frac{k}{2^{2k-1}}{2k\choose k}\sqrt{\frac\pi2}$$ – Did Sep 28 '16 at 06:23
  • ...and, again unless I am mistaken, $$\lim_{k\to\infty}\frac{m_k}{\sqrt{k}}=2\sqrt2$$ – Did Sep 28 '16 at 06:30
  • @Did It is interesting that $m_k$ grows like $c\sqrt k$, which makes sense: as $k$ increases there are more previous values of $X_i$ available to match, so later matches occur more quickly. By the way, when I computed it, I got $\sqrt 2$. – Sungjin Kim Sep 28 '16 at 15:23
  • @i707107 Again a mistake? I must be tired... :-) – Did Sep 28 '16 at 15:33
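Did's conjecture $E(F_N)/\sqrt N\to\sqrt{\pi/2}\approx1.2533$ is easy to probe by simulation (a rough Monte Carlo sketch of mine; the seed and sample sizes are arbitrary choices):

```python
import math
import random

def first_match_time(N, rng):
    """Draw uniformly from an N-element set until a repeat; return the draw index."""
    seen = set()
    m = 0
    while True:
        m += 1
        x = rng.randrange(N)
        if x in seen:
            return m
        seen.add(x)

rng = random.Random(0)
N, trials = 10_000, 20_000
mean = sum(first_match_time(N, rng) for _ in range(trials)) / trials
# The ratio should be close to sqrt(pi/2) = 1.2533..., up to Monte Carlo noise
# and an O(1/sqrt(N)) finite-size correction.
print(mean / math.sqrt(N), math.sqrt(math.pi / 2))
```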

2 Answers


The probability that the first match occurs on the $k^\text{th}$ trial is
$$
\begin{align}
&\overbrace{\frac nn\frac{n-1}n\cdots\frac{n-k+2}n}^{\text{no match in $k-1$ trials}}-\overbrace{\frac nn\frac{n-1}n\cdots\frac{n-k+1}n}^{\text{no match in $k$ trials}}\\
&=\frac{n!}{n^{k-1}(n-k+1)!}-\frac{n!}{n^k(n-k)!}\\
&=\frac{n!}{n^{k-1}(n-k+1)!}-\frac{n!}{n^{k-1}(n-k+1)!}\frac{n-k+1}n\\
&=\frac{n!\,(k-1)}{n^k(n-k+1)!}\tag{1}
\end{align}
$$
Therefore, since the first match must occur by trial $n+1$, the expected value is
$$
\begin{align}
E(F_n)
&=\sum_{k=0}^{n+1}\frac{n!\,k(k-1)}{n^k(n-k+1)!}\\
&=\frac{n!}{n^{n+1}}\sum_{k=0}^{n+1}\frac{k(k-1)}{(n-k+1)!}n^{n-k+1}\\
&=\frac{n!}{n^{n+1}}\sum_{k=0}^{n-1}\frac{(n-k+1)(n-k)}{k!}n^k\\
&=\frac{n!}{n^{n+1}}\sum_{k=0}^{n-1}\frac{n(n+1)-2kn+k(k-1)}{k!}n^k\\
&=\frac{(n+1)!}{n^n}\sum_{k=0}^{n-1}\frac{n^k}{k!}-\frac{2\,n!}{n^{n-1}}\sum_{k=0}^{n-2}\frac{n^k}{k!}+\frac{n!}{n^{n-1}}\sum_{k=0}^{n-3}\frac{n^k}{k!}\\
&=\frac{n!}{n^n}\sum_{k=0}^n\frac{n^k}{k!}\tag{2}
\end{align}
$$
(the third line substitutes $k\mapsto n-k+1$ and drops the vanishing terms; in the last step, writing each truncated sum in terms of $\sum_{k=0}^n\frac{n^k}{k!}$, the boundary terms cancel). Applying equation $(11)$ from this answer and Stirling's Approximation gives the expected value as
$$
\bbox[5px,border:2px solid #C0A000]{E(F_n)=\frac12\sqrt{2\pi n}+\frac23+O\left(\frac1{\sqrt{n}}\right)}\tag{3}
$$
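As a numerical cross-check (my own sketch, not part of the original answer): the exact expectation equals $\frac{n!}{n^n}\sum_{k=0}^n\frac{n^k}{k!}$, i.e. the ratio of OEIS A063170 to $n^n$ (A000312) mentioned in the comments, and the asymptotic $(3)$ closes in as $n$ grows:

```python
import math
from fractions import Fraction

def closed_form(n):
    """E(F_n) = (n!/n^n) * sum_{k=0}^n n^k/k!, computed exactly."""
    s = sum(Fraction(n**k, math.factorial(k)) for k in range(n + 1))
    return Fraction(math.factorial(n), n**n) * s

for n in (2, 10, 100):
    exact = float(closed_form(n))
    approx = 0.5 * math.sqrt(2 * math.pi * n) + 2 / 3
    print(n, exact, approx)  # the two columns converge as n grows
```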


Extended Asymptotics

Extending the computation we did for $(3)$, we get $$ E(F_n) =\sqrt{2\pi n}\left(\frac12+\frac1{24n}+\frac1{576n^2}\right) +\left(\frac23-\frac4{135n}+\frac8{2835n^2}\right) +O\left(\frac1{n^{5/2}}\right)\tag{4} $$

robjohn

This is an elaboration of @Did's comments on the $n$-th match.

Fix $K>0$. Denote by $F_N^{(n)}$ the time of the $n$-th match and set $F_N^{(0)}=0$. For any fixed $n\geq 1$, we first find the joint distribution: for positive integers $k_1, \ldots , k_n $ with $0<k_1<k_2<\cdots <k_n\leq N$ and $k_n \leq K\sqrt N $, $$ \begin{align} P &(F_N^{(1)} =k_1, \ldots , F_N^{(n)} =k_n ) \\ &=\frac{k_1-1}N\prod_{i=1}^{k_1-2}\left(1-\frac{i}N\right) \frac{k_2-2}N\prod_{i=k_1-1}^{k_2-3}\left(1-\frac iN\right)\cdots \frac{k_n-n}N\prod_{i=k_{n-1}-(n-1) }^{k_n-(n+1)}\left(1-\frac iN\right)\\ &=\frac{k_1-1}N \cdots \frac{k_n-n}N \prod_{i=1}^{k_n-(n+1)}\left(1-\frac iN\right)\\ &=\frac{k_1-1}N \cdots \frac{k_n-n}N \exp\left( -\frac{(k_n-(n+1))^2}{2N}+O(N^{-1/2})\right). \end{align} $$ This gives $$ P (F_N^{(1)} =k_1, \ldots , F_N^{(n)} =k_n ) = \frac{k_1-1}{\sqrt N } \cdots \frac{k_n- n}{\sqrt N}\exp\left( -\frac12 \left(\frac{ k_n-(n+1) }{\sqrt N}\right)^2+O(N^{-1/2})\right) \frac 1{(\sqrt N)^n} . $$ Now fix $0\leq x_1, \ldots , x_n\leq K$ and sum this over $k_i\leq x_i \sqrt N$; as $N\rightarrow\infty$, $$ P\left( \frac{F_N^{(1)}}{\sqrt N}\leq x_1, \ldots , \frac{F_N^{(n)}}{\sqrt N} \leq x_n\right) \rightarrow \int_{0\leq t_1\leq \cdots \leq t_n, \ \forall i, t_i\leq x_i} t_1 \cdots t_n \exp \left(-\frac12 t_n^2\right) dV. $$ (Think of this as summing the probabilities over boxes with side length $1/\sqrt N$; the Dominated Convergence Theorem suffices to justify this limit.)

Thus, the random vector $\left(\frac{F_N^{(1)}}{\sqrt N}, \ldots , \frac{F_N^{(n)}}{\sqrt N}\right)$ converges in distribution to a continuous random vector with PDF $$ f(t_1,\ldots , t_n) = t_1 \cdots t_n \exp\left(-\frac 12 t_n^2\right) \mathbf{1}_{0\leq t_1 \leq \cdots \leq t_n}. $$

The question was originally about the expectation in the case $n=1$. The above calculation suggests that, as $N\rightarrow\infty$, $$ \mathbf{E}\left(\frac{F_N^{(1)}}{\sqrt N} \right)\rightarrow \int_0^{\infty} t_1^2 \exp\left(-\frac12 t_1^2\right) dt_1 = \sqrt{\frac{\pi}2}. $$

But we still need to control the contribution from $k_n > K\sqrt N$ (for $n=1$, from $k_1 > K\sqrt N$). To do this, we use $$ \log(1-x) \leq -x. $$ Then for $K\sqrt N < k_n$, $$ \frac{k_n}{\sqrt N}P (F_N^{(1)} =k_1, \ldots , F_N^{(n)} =k_n ) \leq \frac{k_1\cdots k_{n-1}k_n^2}{(\sqrt N)^{n+1}} \exp\left( -\frac12 \left(\frac{ k_n-n-1 }{\sqrt N}\right)^2\right)\frac1{(\sqrt N)^n}. $$ Again by the Dominated Convergence Theorem, the right side, summed over $0\leq k_1\leq \cdots \leq k_n$ with $k_n > K\sqrt N$, converges as $N\rightarrow\infty$ to $$ \int_K^{\infty} \int_0^{t_n} \cdots \int_0^{t_2} t_1\cdots t_{n-1}t_n^2 \exp\left(-\frac12 t_n^2 \right) dt_1\cdots dt_n. $$ This can be made arbitrarily small by taking $K$ sufficiently large, which shows that the suggested calculation is valid. We now have $$ \mathbf{E}\left(\frac{F_N^{(n)}}{\sqrt N}\right) \rightarrow \int_0^{\infty} \int_0^{t_n} \cdots \int_0^{t_2} t_1 \cdots t_{n-1}t_n^2 \exp\left(-\frac 12 t_n^2 \right) dt_1\cdots dt_n. $$ This integral is exactly what @Did computed in the last comments to the question: $$ \frac1{2^{n-1}(n-1)!} \int_0^{\infty} t_n^{2n} \exp\left(-\frac 12 t_n^2 \right) dt_n=\frac{(2n-1)!!}{2^{n-1}(n-1)!} \sqrt{\frac{\pi}2}\sim \sqrt{2n}. $$
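The limits $m_n=\frac{(2n-1)!!}{2^{n-1}(n-1)!}\sqrt{\pi/2}$ can also be probed by simulation (a rough sketch of mine; `match_times`, the seed, and the sample sizes are my own choices):

```python
import math
import random

def match_times(N, n, rng):
    """Times of the first n matches in an i.i.d. uniform sequence on N values."""
    seen, times, m = set(), [], 0
    while len(times) < n:
        m += 1
        x = rng.randrange(N)
        if x in seen:
            times.append(m)  # draw m equals some earlier draw: a match
        else:
            seen.add(x)
    return times

rng = random.Random(1)
N, n, trials = 10_000, 3, 5_000
means = [0.0] * n
for _ in range(trials):
    for i, t in enumerate(match_times(N, n, rng)):
        means[i] += t / (trials * math.sqrt(N))
# m_k = (2k-1)!!/(2^{k-1}(k-1)!) * sqrt(pi/2) = k/2^{2k-1} * C(2k,k) * sqrt(pi/2)
limits = [k * math.comb(2 * k, k) / 2 ** (2 * k - 1) * math.sqrt(math.pi / 2)
          for k in range(1, n + 1)]
print(means)   # empirical E(F_N^{(k)}) / sqrt(N)
print(limits)  # 1.2533..., 1.8800..., 2.3500...
```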

Sungjin Kim
  • Thank you! This is helpful, but yes, the question is about expectation and it seems there are still some details to fill in there. Also, I appreciate the generalization but would accept a detailed answer to the case $F_N$. – aduh Sep 28 '16 at 13:54
  • I'm sure I should know this but can you tell me where $$\prod_{i=1}^{k_n - (n+1)} \left(1- \frac iN\right) = \exp \left(- \frac{ (k_n - (n+1))^2}{2N} + O(N^{-2})\right)$$ comes from? – aduh Sep 28 '16 at 13:57
  • I am editing the answer to include the details. About the product: there is a $-$ sign inside the exponential, and the idea is discussed in the comments above; it uses the Maclaurin series of $\log (1-x)$. – Sungjin Kim Sep 28 '16 at 13:58
  • Edited, thanks. I must have missed that in the comments. I'll look again more carefully. Thanks again! – aduh Sep 28 '16 at 13:59