Probability of a sequence appearing in a list of integers

Question

Let's say I have a list of 0's and 1's (e.g., 01100111000) where I know the length of the list and the number of 0's and 1's appearing in it. Is there a simple way to determine the probability that a sequence (say, 111) appears within it, given the length and the number of 0's and 1's?

Say the sequence has $6$ zeros and $6$ ones, as above. Are you asking, if a string is chosen uniformly at random from the $\binom{12}{6}$ strings with $6$ zeros and $6$ ones, what is the probability that it contains the substring $111?$ — saulspatz, Apr 02 '18 at 03:45
How long is the string? Do you allow more than $3$ ones in a row? What have you tried? — Remy, Apr 02 '18 at 03:48
@saulspatz exactly right. There are so many possible strings with 6 zeros and 6 ones, and I'm asking how many of those contain the substring 111. Obviously a string like 000000111111 is much "lower entropy" than one like 010101010101, but I'm not quite sure how to quantify that. — Adam, Apr 02 '18 at 03:57
For the specific case where the target substring is all $1$'s or all $0$'s and the number of $1$'s and $0$'s in the string is unknown, this is the same as asking for the probability that at least $k$ consecutive coin flips out of $n$ are heads. See this question for that case. For the more general case, where the target substring can contain both $0$'s and $1$'s, some adjustments will need to be made. — JMoravitz, Apr 02 '18 at 04:13

score 2 · Accepted Answer · answered Apr 02 '18 at 04:38

If you're just interested in the probability of getting at least three consecutive ones, then this problem is analogous to finding the probability that the length of the longest run of heads in $n$ coin tosses, say $\ell_n$, is at least a given number $m$, where each toss comes up heads independently with probability $p$. This probability is given by

$$\mathbb{P}(\ell_n \geq m)=\sum_{j=1}^{\lfloor n/m\rfloor} (-1)^{j+1}\left(p+\left({n-jm+1\over j}\right)(1-p)\right){n-jm\choose j-1}p^{jm}(1-p)^{j-1}.$$

Reference:

https://math.stackexchange.com/a/59749/325426

In our case, $n=11$, $m=3$, and I assume $p=0.5$

so we get

$$\mathbb{P}(\ell_n \geq 3)=\sum_{j=1}^{3} (-1)^{j+1}\left(0.5+\left({11-3j+1\over j}\right)(1-0.5)\right){11-3j\choose j-1}0.5^{3j}(1-0.5)^{j-1}.$$

By Wolfram Alpha, we get

$$\mathbb{P}(\ell_n \geq 3)\approx 0.5474$$
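If you'd rather not rely on Wolfram Alpha, the sum is easy to evaluate directly. Here is a minimal Python sketch. The function names `prob_longest_run_at_least` and `brute_force` are my own, and the brute-force check assumes a fair coin, so it simply enumerates all $2^n$ strings.

```python
from math import comb

def prob_longest_run_at_least(n: int, m: int, p: float) -> float:
    """Evaluate the sum above for P(longest run of heads in n tosses >= m)."""
    q = 1.0 - p
    total = 0.0
    for j in range(1, n // m + 1):
        sign = 1.0 if j % 2 == 1 else -1.0          # (-1)^(j+1)
        bracket = p + (n - j * m + 1) / j * q       # p + ((n - jm + 1) / j)(1 - p)
        total += sign * bracket * comb(n - j * m, j - 1) * p ** (j * m) * q ** (j - 1)
    return total

def brute_force(n: int, m: int) -> float:
    """Fair-coin check: fraction of all 2^n bit strings containing m consecutive ones."""
    target = "1" * m
    hits = sum(target in format(k, f"0{n}b") for k in range(2 ** n))
    return hits / 2 ** n

print(prob_longest_run_at_least(11, 3, 0.5))  # approximately 0.5474
print(brute_force(11, 3))                     # same value, by enumeration
```

Both calls give $1121/2048$, which rounds to the value above.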

R Simulation:

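n.trials <- 1000000
count <- 0
for (i in seq_len(n.trials)) {
  # Toss 11 fair coins and run-length encode the outcome
  coin <- sample(c("H", "T"), 11, replace = TRUE)
  coin.rle <- rle(coin)
  # Lengths of the runs of heads (empty if no heads came up)
  head.runs <- coin.rle$lengths[coin.rle$values == "H"]
  # Count the trial if the longest run of heads has length at least 3
  if (length(head.runs) > 0 && max(head.runs) >= 3) count <- count + 1
}
count / n.trials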

[1] 0.547234

So our empirical answer agrees with our analytical answer to about three decimal places.

score 1 · Answer 2 · answered Apr 02 '18 at 04:15

The problem is double counting. Let's just consider the example you gave. Since we know that there are $\binom{12}{6}$ strings in all, we just have to count the strings with three consecutive ones. There are $10$ positions for the first one in the substring and then $\binom{9}{3}$ places to put the other three ones, so this gives $10\binom{9}{3}$ strings. But there are strings with four consecutive ones, and we've counted them twice, so we have to subtract them. That gives $10\binom{9}{3}-9\binom{8}{2}.$ Now what about strings with five consecutive ones? We've added three substrings of length three, and subtracted two of length four, so we've only counted this string once, and no adjustment is needed. (This surprises me. Usually, when you use the principle of inclusion and exclusion, you add and subtract at every point, but if I've made a mistake I can't find it.) For a string with six consecutive ones, we have added four substrings of length three, and subtracted three of length four, so again we've counted it once and no adjustment is needed.

We're not done yet, however. It's possible to have two substrings of three ones that don't intersect at all as in $111011100000.$ I leave it to you to count those.
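If you want to check your count, the example is small enough to enumerate. Here is a minimal Python sketch. The helper `count_with_run` is a name I made up, and the value used for `two_blocks` is my own count rather than anything derived above: with all six ones used up by two separate blocks of $111$, the blocks must sit in two different gaps among the seven gaps around the six zeros.

```python
from itertools import combinations
from math import comb

def count_with_run(n_zeros: int, n_ones: int, target: str) -> int:
    """Count strings with the given numbers of 0's and 1's that contain target."""
    length = n_zeros + n_ones
    hits = 0
    for ones_at in combinations(range(length), n_ones):
        bits = ["0"] * length
        for i in ones_at:
            bits[i] = "1"
        if target in "".join(bits):
            hits += 1
    return hits

total = comb(12, 6)                          # all strings with 6 zeros and 6 ones
brute = count_with_run(6, 6, "111")          # direct enumeration
windows = 10 * comb(9, 3) - 9 * comb(8, 2)   # the count derived so far
two_blocks = comb(7, 2)                      # assumption: two separate 111 blocks
print(brute, windows - two_blocks, brute / total)
```

The enumeration and the adjusted count both come to $567$, so the probability in this example is $567/924 \approx 0.614$.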

So far as I know, there isn't a formula that will work for all substrings. For example, if the "target" string were $101$ instead of $111$, there is no four-bit string that can contain the substring twice, so the analysis will be different.
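Even without a closed formula, the count for any target can be computed mechanically. The sketch below is one way to do it, not a standard library routine: it counts the strings that avoid the target by adding one symbol at a time while tracking how much of the target has just been matched, then takes the complement. The names `prob_contains`, `step`, and `avoid` are hypothetical, and it assumes the string is chosen uniformly from all strings with the given numbers of zeros and ones.

```python
from functools import lru_cache
from math import comb

def prob_contains(n_zeros: int, n_ones: int, target: str) -> float:
    """P(a uniformly random string with these counts contains target)."""
    m = len(target)

    def step(state: int, ch: str) -> int:
        # Length of the longest prefix of target that is a suffix of
        # the matched part target[:state] followed by ch.
        s = target[:state] + ch
        for k in range(min(m, len(s)), 0, -1):
            if s.endswith(target[:k]):
                return k
        return 0

    @lru_cache(maxsize=None)
    def avoid(zeros: int, ones: int, state: int) -> int:
        # Ways to append the remaining symbols without ever completing target.
        if zeros == 0 and ones == 0:
            return 1
        ways = 0
        if zeros:
            nxt = step(state, "0")
            if nxt < m:
                ways += avoid(zeros - 1, ones, nxt)
        if ones:
            nxt = step(state, "1")
            if nxt < m:
                ways += avoid(zeros, ones - 1, nxt)
        return ways

    total = comb(n_zeros + n_ones, n_ones)
    return 1 - avoid(n_zeros, n_ones, 0) / total

print(prob_contains(6, 6, "111"))  # approximately 0.614
print(prob_contains(6, 6, "101"))
```

This handles overlapping targets like $101$ without any case analysis, at the cost of giving a number rather than a formula.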
