If there is an online document with 100,000 completely unique words and every time you download it, 5% of the words are randomly deleted, how many times do you have to download it before you get all 100,000 words?
-
A keyword you might be interested in is the coupon collector's problem. In particular, this problem is discussed at https://math.stackexchange.com/questions/131664/coupon-collector-problem-with-batched-selections – Gareth Ma Jan 28 '21 at 15:29
-
How long is a piece of string? – wolfies Jan 28 '21 at 15:44
-
Can we solve a simpler case? Our message is $20$ words, and each time we download we get $19$ of them (in order). The second time we download, we could get the same result as the first time, so we have to go to a third download, and so on. If the second download is different, can we then reconstruct the whole thing? Usually, yes. But there are cases with repeated words where two different downloads are not enough to reconstruct. – GEdgar Jan 28 '21 at 16:14
-
@wolfies for simplicity, we can assume that all words are the same length. Let's say 10 letters. – ANam Jan 28 '21 at 16:15
-
As a guess, about $4.7$ downloads to get all the unique words. $3$ downloads leave you with an expected $12.5$ words unfound, $4$ downloads $0.625$ unfound, and $5$ downloads $0.03125$ unfound, so you are unlikely to finish by $3$, may finish by $4$, and may need one or more further downloads to finish. – Henry Jan 28 '21 at 16:58
-
The question does not ask the expected number of downloads. It asks how many downloads are needed to get all 100,000 words with certainty. – wolfies Jan 28 '21 at 17:10
-
@wolfies - I do not see "certainty" in the question. Simulation suggests that the expected number may be closer to $4.5$ than my guess and that in $10000$ attempts it never exceeded $7$ – Henry Jan 28 '21 at 17:14
-
Without context, it's more reasonable to read it as stated, which means with certainty rather than as an expected value. I agree that the fact that the problem contains probability makes it a little ambiguous, but the question itself, "how many times do you have to download it before you get all 100,000 words?", is pretty clear that it means with certainty. – cr001 Jan 28 '21 at 18:21
-
Yes, I want to add with certainty in there. That's an important aspect that I may have ignored. I apologize. – ANam Jan 28 '21 at 19:42
-
@ANam If you want certainty, you are not going to get it. There is a non-zero probability that even after 1 million downloads, you are missing a word. The property of randomness is that there is a lack of certainty. Even if I had a coin that flipped heads $99$% of the time, there's a non-zero chance I could get $1$ million tails in a row. – Sarvesh Ravichandran Iyer Jan 28 '21 at 20:50
1 Answer
Take a look at the first word. There is a 5% chance it is deleted the first time. The probability it is deleted on both the first and second download is $0.05^2$. The probability it is deleted on every one of $N$ downloads of the document is $0.05^N$. That is never equal to 0, no matter how big $N$ is. Thus, you can never be certain that even the first word of the document has been downloaded, let alone all the words in the document.
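To put numbers on this argument, here is a minimal Python sketch (Python is chosen for illustration; the answer's own code is Mathematica). It computes the chance that one given word is still missing after $N$ downloads, and uses the union bound $n \times 0.05^N$ as a conservative upper bound on the chance that any word is missing. The function names and the $10^{-6}$ tolerance are illustrative assumptions, not part of the original answer.

```python
def miss_probability(p_delete: float, downloads: int) -> float:
    """Chance that one particular word is deleted in every one of the downloads."""
    return p_delete ** downloads


def downloads_for_tolerance(n_words: int, p_delete: float, tolerance: float) -> int:
    """Smallest N with n_words * p_delete**N <= tolerance (union bound).

    Assumes 0 < p_delete < 1 and tolerance > 0; otherwise the loop never ends.
    """
    downloads = 0
    bound = float(n_words)
    while bound > tolerance:
        downloads += 1
        bound = n_words * miss_probability(p_delete, downloads)
    return downloads


# Positive for every finite N, so certainty is never reached.
print(miss_probability(0.05, 2))
# Downloads needed before the union bound falls below one in a million.
print(downloads_for_tolerance(100_000, 0.05, 1e-6))
```

The bound can be pushed below any positive tolerance by downloading more often, but it never reaches zero.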
There is a formula in the answer to this question.
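For reference, the formula can be read off the first Mathematica implementation (`P1`) given in this answer. Here $t$ is the number of downloads, $n$ the number of distinct words, and $m$ the number of words received per download. The name $P(t)$ for the probability that all $n$ words have been seen after $t$ downloads is a label chosen here and may differ from the notation in the linked answer.

```latex
P(t) = \sum_{j=0}^{n} (-1)^{j} \binom{n}{j}
       \left[ \frac{\binom{n-j}{m}}{\binom{n}{m}} \right]^{t}
```

The $j = 1$ ratio simplifies to $(n-m)/n$, which is the 5% per-download deletion chance when $m = 0.95\,n$.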
I implemented it in two different ways in Mathematica. The first is taken literally from the formula as written. The second is an attempt to make it more numerically stable. I assume $n$ is large and $m$ is a fraction (between 0 and 1) of $n$, so that $m = f \times n$. Then I take the log of the terms with the binomial coefficients, because those numbers are huge in this case, do the calculations on the log scale, and exponentiate back to get the correct term. I checked that both give almost the same answer in the case of that question, where $m$ and $n$ are relatively small ($n=100$ and $m=10$, which means $f=0.1$). The first function won't even run with numbers as large as those in this problem, i.e. $n=100000$.
(* Direct inclusion-exclusion sum; only practical for small n *)
P1[t_, n_, m_] :=
  Sum[(-1)^j Binomial[n, j] (Binomial[n - j, m]/Binomial[n, m])^t, {j, 0, n}]

(* Log-scale version with m = f n, intended for large n *)
P2[t_, n_, f_] :=
  N[1 + Sum[(-1)^j Exp[(j t Log[1 - f] + j Log[n] - LogGamma[1 + j]) +
      (-j + f j + j^2 - f j^2 - f j t + f j^2 t)/(2 (-1 + f) n)], {j, 1, n}]]
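For readers without Mathematica, the two functions can be sketched in Python as follows. This is a port made under assumptions, not the answer's own code: the names `p_exact` and `p_log_scale` are hypothetical, the exact version uses rational arithmetic instead of Mathematica's symbolic sum, and the early exit from the loop is an added shortcut. For small $t$ combined with large $n$, the individual terms overflow a float, so the sketch is only meant for the regime discussed here.

```python
from fractions import Fraction
from math import comb, exp, lgamma, log


def p_exact(t: int, n: int, m: int) -> float:
    """Inclusion-exclusion sum, evaluated exactly with rationals (small n only)."""
    total = Fraction(0)
    denom = comb(n, m)
    for j in range(n + 1):
        ratio = Fraction(comb(n - j, m), denom)  # comb gives 0 when n - j < m
        total += (-1) ** j * comb(n, j) * ratio ** t
    return float(total)


def p_log_scale(t: int, n: int, f: float) -> float:
    """Large-n approximation with m = f * n; each term is built on the log scale."""
    total = 1.0
    for j in range(1, n + 1):
        log_term = (
            j * t * log(1 - f) + j * log(n) - lgamma(1 + j)
            + (-j + f * j + j**2 - f * j**2 - f * j * t + f * j**2 * t)
            / (2 * (f - 1) * n)
        )
        if log_term < -745.0:
            # exp() would underflow to 0 here, and later terms are smaller still.
            break
        total += (-1) ** j * exp(log_term)
    return total


# Small case from the linked question (n = 100, m = 10, so f = 0.1);
# t = 60 is an arbitrary illustrative choice.
print(p_exact(60, 100, 10), p_log_scale(60, 100, 0.1))
```

Comparing the two printed values for the small case mirrors the consistency check described above; only `p_log_scale` is usable at $n = 100000$.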
Next, I found that for your question
P2[8,100000,0.95] is 0.999996
P2[5,100000,0.95] is 0.969233
P2[4,100000,0.95] is 0.535181
P2[3,100000,0.95] is 0.00000356
That is, you will almost never have the whole document after 3 download attempts, there is a 53.5% chance of success after 4 download attempts, and a 96.9% chance after 5 download attempts.
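These probabilities can also be checked empirically. The following Monte Carlo sketch is an illustration added here, not the method used in the answer; the function names, the seed, and the small demo parameters are assumptions. It assumes each download deletes exactly `n_deleted` distinct words chosen uniformly at random, with `0 <= n_deleted < n_words`.

```python
import random


def downloads_until_complete(n_words: int, n_deleted: int, rng: random.Random) -> int:
    """Simulate downloads until every word has appeared at least once."""
    # Words deleted in the first download are the ones still missing.
    missing = set(rng.sample(range(n_words), n_deleted))
    downloads = 1
    while missing:
        # A word stays missing only if this download deletes it again.
        missing &= set(rng.sample(range(n_words), n_deleted))
        downloads += 1
    return downloads


def success_rate(n_words: int, n_deleted: int, max_downloads: int,
                 trials: int, seed: int = 0) -> float:
    """Fraction of trials that finish within max_downloads downloads."""
    rng = random.Random(seed)
    hits = sum(
        downloads_until_complete(n_words, n_deleted, rng) <= max_downloads
        for _ in range(trials)
    )
    return hits / trials


# Scaled-down demo (2,000 words, 5% deleted per download) so it runs quickly.
print(success_rate(2_000, 100, 3, trials=200))
```

For the numbers in the question, `success_rate(100_000, 5_000, 4, trials)` is the relevant call. It is slower in pure Python, and its estimate should land near the 53.5% figure only up to sampling noise.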
Intuition: on average, each download captures 95% of the missing words. So after one download about 5,000 words are missing, after two downloads only about 250, and after 3 downloads about 12. The probability that you capture all those 12 words in the fourth download is $0.95^{12} \approx 0.54$. If you don't capture all of them, you will most likely have only 1 or 2 missing, and you are practically guaranteed to catch them in the next one or two downloads.
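The arithmetic behind this intuition can be sketched in a few lines of Python. The function names are illustrative, and `approx_success` is an added Poisson-style approximation that treats the missing words as roughly independent; it is not the exact formula used earlier in the answer.

```python
from math import exp


def expected_missing(n_words: int, p_delete: float, downloads: int) -> float:
    """Expected number of words not yet seen after the given number of downloads."""
    return n_words * p_delete ** downloads


def catch_all_probability(missing_count: int, p_keep: float) -> float:
    """Chance that one download contains every one of the missing words."""
    return p_keep ** missing_count


def approx_success(n_words: int, p_delete: float, downloads: int) -> float:
    """Poisson-style approximation: exp(-expected number of missing words)."""
    return exp(-expected_missing(n_words, p_delete, downloads))


for k in range(1, 6):
    print(k, expected_missing(100_000, 0.05, k), approx_success(100_000, 0.05, k))

# The answer's back-of-envelope step: about 12 words missing after 3 downloads.
print(catch_all_probability(12, 0.95))
```

The approximation tracks the `P2` values for 4 and 5 downloads closely, which is why the rough estimate of about 12 missing words already gives a figure near 54%.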

-
Ah, I agree, and that's a mistake on my part then. I understand that it will never be 0, but is there a way to calculate what the actual probability of getting all 100,000 words is after N trials? – ANam Jan 29 '21 at 14:02
-
Yes, but it will not be given by the answer to the "coupon collector problem with batched selections" question. The answer there has the formula for the probability of not having all $n$ coupons after $k$ selections, but that formula allows duplicates within a batch. In your problem, each download gives you 95,000 unique words, with no duplicates. The answer to this question applies to your question. – John L Jan 29 '21 at 15:19