
Let's say I've been given a sentence $\mathcal{S}$ of $n$ words, and I have a vocabulary $\mathcal{M}$ of $m$ words. If I generate $n$ words by picking successively at random from $\mathcal{M}$, what is the expected number of positions where the generated words agree with those of $\mathcal{S}$?

  1. If I sample with replacement
  2. If I sample without replacement
wabbit

2 Answers


I've come up with the following expressions and wanted to check whether they are correct.

\begin{gather} \sum_{\ell=1}^n \ell \binom{n}{\ell} \left(\frac{1}{m}\right)^{\ell} \left( \frac{m-1}{m} \right)^{n-\ell} \tag{with replacement} \\ \sum_{\ell=1}^n \ell \binom{n}{\ell} \prod_{j=0}^{\ell-1} \frac{1}{m-j} \prod_{i=0}^{n-\ell-1} \frac{m-\ell-i-1}{m-\ell-i} \tag{without replacement} \end{gather}
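One way to sanity-check these sums is to evaluate them exactly for small $n$ and $m$ and compare against the value $n/m$ obtained in the other answer via linearity of expectation. Here is a sketch in Python using exact rational arithmetic (the function names are mine):

```python
from fractions import Fraction
from math import comb

def with_replacement_sum(n, m):
    """Evaluate the first expression exactly."""
    return sum(Fraction(l * comb(n, l)) * Fraction(1, m) ** l
               * Fraction(m - 1, m) ** (n - l)
               for l in range(1, n + 1))

def without_replacement_sum(n, m):
    """Evaluate the second expression exactly."""
    total = Fraction(0)
    for l in range(1, n + 1):
        term = Fraction(l * comb(n, l))
        for j in range(l):              # product over 1/(m-j)
            term *= Fraction(1, m - j)
        for i in range(n - l):          # product over (m-l-i-1)/(m-l-i)
            term *= Fraction(m - l - i - 1, m - l - i)
        total += term
    return total

print(with_replacement_sum(3, 3))     # 1, agrees with n/m
print(without_replacement_sum(2, 3))  # 2/3, agrees with n/m
print(without_replacement_sum(3, 3))  # 1/2, but n/m = 1
```

In the cases I tried, the first sum always equals $n/m$ (it is just the mean of a Binomial$(n, 1/m)$), while the second does not in general (e.g. $n=m=3$ gives $1/2$ rather than $1$), so the without-replacement expression seems to need revisiting.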

kennytm
wabbit

Each word has a probability of $\frac 1m$ of matching. Hence the expected number of matches is $\frac nm$. As expectation is linear (with no assumption on independence), this is the answer regardless of whether you replace or not.

Just as an illustration: Suppose $m=2=n$. Let's suppose the words are $A,B$ and that the given sentence $S$ is $AB$.

With replacement: the random sentence can be $\{AA,AB,BA,BB\}$ each with probability $\frac 14$. The respective match scores are $\{1, 2,0,1\}$ so the expected number of matches is $\frac 14\times \left(1+2+0+1\right)=1$.

Without replacement: the random sentence can be $\{AB,BA\}$ each with probability $\frac 12$. The respective match scores are $\{2,0\}$ so the expected number of matches is $\frac 12\times \left(2+0\right)=1$.
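Both cases can also be checked empirically. The small simulation below (a sketch; the function name and parameters are mine) estimates the expected match count under each sampling scheme so it can be compared with $n/m$:

```python
import random

def avg_matches(n, m, replace, trials=100_000, seed=0):
    """Estimate the expected number of positional matches by simulation."""
    rng = random.Random(seed)
    sentence = list(range(n))   # by symmetry, take the target words to be 0..n-1
    vocab = list(range(m))
    total = 0
    for _ in range(trials):
        if replace:
            guess = rng.choices(vocab, k=n)   # sample with replacement
        else:
            guess = rng.sample(vocab, n)      # sample without replacement
        total += sum(g == s for g, s in zip(guess, sentence))
    return total / trials

print(avg_matches(4, 10, replace=True))   # close to n/m = 0.4
print(avg_matches(4, 10, replace=False))  # also close to 0.4
```

Both estimates land near $n/m$, as the linearity argument predicts.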

lulu
  • Thanks for the simple explanation. Can you elaborate "As expectation is linear (with no assumption on independence), this is the answer regardless of whether you replace or not." – wabbit Sep 15 '16 at 13:13
  • @HrishikeshGanu Linearity of Expectation is a wonderful thing. Let $X_i$ be the indicator variable for the $i^{th}$ generated word. Thus $X_i=1$ if the $i^{th}$ generated word matches the one in $S$, and $X_i=0$ if it doesn't. Then the expected number of matches is $E=E[\sum X_i]=\sum E[X_i]=np$ where $p$ is the probability that a given word matches (so $p=\frac 1m$). At no point do I care whether or not the $X_i$ are independent variables. With replacement they are independent, without they are not. – lulu Sep 15 '16 at 13:20
  • Got it. That's what allows you to move the $E$ inside the $\sum$? And I guess you can't do the same thing for variance? – wabbit Sep 15 '16 at 13:33
  • No, you can't, though you can write the variance of $X$ as $E[X^2]-\left(E[X]\right)^2$ and, sometimes, that helps. – lulu Sep 15 '16 at 14:46
  • One question: you are assuming that $E[X_{1}]=E[X_{2}]$ if there are two words. However, if I generate the words sequentially, then the distribution of $X_{1}$ has $P(X_{1}=1)=1/m$ while that of $X_{2}$ has $P(X_{2}=1)=1/(m-1)$, since the number of options is reduced by 1 because words can't repeat. – wabbit Sep 15 '16 at 15:20
  • Nope. If you sample the entire sentence, the probability for each word to match (taken as a separate variable) is $\frac 1m$. Why should the second word have a greater chance of matching than the first? – lulu Sep 15 '16 at 15:33
  • Go through my explicit examples carefully. You'll see that the probability of any given word matching is independent of its location in the sentence. And that it is $\frac 1m$ regardless of replacement. Work the case of $n=2,m=3$ explicitly...both with and without replacement. – lulu Sep 15 '16 at 15:37
  • Sorry, I don't go into chat rooms. Just work a bunch of explicit examples...only way to get insight. – lulu Sep 15 '16 at 15:57
  • I can see it now. In the example that you gave, P(1st word is correct) $= 1/2$ and P(2nd word is correct) $=$ P(1st word $=B$)$\cdot 0$ + P(1st word $\neq B$)$\cdot$P(2nd word $=B$ | 1st word $\neq B$). Hence P(2nd word is correct) $= 0+(1/2)(1) = 1/2$. – wabbit Sep 15 '16 at 16:36
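Following lulu's suggestion to work small cases explicitly, the $n=2$, $m=3$ case can be enumerated exhaustively. The sketch below (function name is mine) computes the exact per-position match probability under each scheme:

```python
from fractions import Fraction
from itertools import permutations, product

def match_probs(n, m, replace):
    """Exact P(position i matches) for each i; the target sentence is (0, ..., n-1)."""
    if replace:
        outcomes = list(product(range(m), repeat=n))       # all m^n sentences
    else:
        outcomes = list(permutations(range(m), n))         # all m!/(m-n)! sentences
    total = len(outcomes)
    return [Fraction(sum(o[i] == i for o in outcomes), total) for i in range(n)]

print(match_probs(2, 3, replace=True))   # [Fraction(1, 3), Fraction(1, 3)]
print(match_probs(2, 3, replace=False))  # [Fraction(1, 3), Fraction(1, 3)]
```

Every position matches with probability $1/m$ under both schemes, which is exactly the symmetry that linearity of expectation exploits.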