Consider that you have an urn containing red and blue balls. Let the total number of balls be $N$. Let the number of red balls be $r$ and the number of blue balls $b$. If you draw all the balls in the urn one by one, without replacement, you get a sequence of $N$ balls.

What is the probability that the longest streak of consecutive red balls has length greater than or equal to $k$? (That is, the resulting sequence contains a run $RRR\dots$ of consecutive red balls whose length is at least $k$.)
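To make the event concrete, here is a minimal brute-force sketch. It assumes that every arrangement of the $r$ red positions among the $N$ draws is equally likely, which holds for drawing without replacement. The function names are illustrative, and the enumeration is only feasible for tiny $N$.

```python
from fractions import Fraction
from itertools import combinations
from math import comb


def longest_red_run(red_positions, N):
    """Length of the longest run of consecutive red positions in 0..N-1."""
    reds = set(red_positions)
    best = cur = 0
    for i in range(N):
        cur = cur + 1 if i in reds else 0
        best = max(best, cur)
    return best


def prob_run_at_least_bruteforce(N, r, k):
    """Exact P(longest red run >= k) by enumerating all C(N, r) colourings."""
    hits = sum(
        1
        for pos in combinations(range(N), r)
        if longest_red_run(pos, N) >= k
    )
    return Fraction(hits, comb(N, r))
```

For example, with $N=4$, $r=2$, $k=2$ there are 6 arrangements, of which 3 have the two red balls adjacent.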

I came up with a recursive solution inspired by this answer. However, I'm not sure it is correct, or whether a simpler solution exists.

Denote by $P(N, k, r)$ the probability that the sequence contains a run of at least $k$ consecutive red balls, when you start with $N$ balls, $r$ of which are red. Then I find the following recursive relation (for $N>k$ and $r>k$):

$P(N, k, r) = \frac{1}{2}\left(P(N-1, k, r) + P(N-1, k, r-1)\right) + \frac{r \times \dots \times (r-k+1)}{N \times \dots \times (N-k+1)} \frac{N-r+k}{N-k} \left(1-P(N-k-1, k, r-k)\right)$

qdr
  • IMO, it's easier to think of this problem as how many ways can the balls be chosen so that $k$ red balls appear in a row. You can always divide by total number of arrangements to get probability. – Rushabh Mehta Jul 18 '19 at 17:26
  • If $E_{i}$ denotes the event that the balls on spots $i,i+1,\dots,i+k-1$ are red then to be found is $P\left(\bigcup_{i=1}^{N-k+1}E_{i}\right)$. This can be worked out with inclusion/exclusion. Does not look very promising, but might work for small numbers. – drhab Jul 19 '19 at 06:01
  • I'm interested in the regime where N is of the order of 1k-10k, r~N/10-N/2 and k~50-100. In that case I found it quite hard to enumerate all possibilities of $k$ red balls in a row since it's easy to double count given that there can be multiple subsequences of $k$ red balls in a row in one sequence. Not sure what you mean by inclusion/exclusion though. – qdr Jul 19 '19 at 09:49

2 Answers

Here is an attempt, inspired by drhab's comment.

Let $T(x,k)$ be the outcome "there are only red balls at the indices $[x;x+k-1]$ of the sequence". Let $P(T(x,k))$ be the probability of $T(x,k)$ occurring.

Through offset invariance, $P(T(x,k))$ is the same for all offsets $x$. Moreover, $P(T(0,k))$, which is the probability of drawing $k$ red balls in a row at the beginning of the sequence without replacement, has a closed form: \begin{equation} P(T(x,k)) \overset{\forall x \in [0;N-k]}{=} P(T(0,k)) = \frac{r}{N} \frac{r-1}{N-1} \dots \frac{r-k+1}{N-k+1} = \frac{\binom{r}{k}}{\binom{N}{k}} \end{equation}
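As a quick sanity check of the closed form above, the sketch below computes the product of the $k$ sequential draw probabilities and the binomial ratio separately, using exact rational arithmetic. The function names are illustrative, and $k \leq N$ is assumed.

```python
from fractions import Fraction
from math import comb


def p_block_product(N, r, k):
    """r/N * (r-1)/(N-1) * ... * (r-k+1)/(N-k+1), assuming k <= N."""
    p = Fraction(1)
    for i in range(k):
        p *= Fraction(r - i, N - i)
    return p


def p_block_binomial(N, r, k):
    """C(r, k) / C(N, k); math.comb returns 0 when k > r."""
    return Fraction(comb(r, k), comb(N, k))
```

Both forms give $1/30$ for $N=10$, $r=4$, $k=3$, and both give $0$ when $k > r$.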

What you are looking for is the probability $P(T(any,k))$ that a sequence of $k$ consecutive red balls occurs at any offset, but you don't want to double-count overlaps: \begin{equation} P(T(any,k)) = P\left(\bigcup_{0\leq x \leq N-k}{T(x,k)}\right) \end{equation}

This expression can be expanded using the inclusion-exclusion identity for the probability of a union of non-exclusive events:

\begin{equation} P(T(any,k)) = \sum_{1\leq n \leq N-k+1} (-1)^{n+1} \sum_{0\leq x_1 < x_2< \dots < x_n \leq N-k}P\left( \bigcap_{ x\in\{x_1,x_2, \dots, x_n\} }{T(x,k)} \right) \end{equation}

Now, we can use offset invariance again to simplify the inner term of the sum: \begin{equation} \DeclareMathOperator{\card}{card} P\left( \bigcap_{ x\in\{x_1,x_2, \dots, x_n\} }{T(x,k)} \right) = P\left(T\left(0,\card\left( \bigcup_{ x\in\{x_1,x_2, \dots, x_n\} }{[x;x+k-1]} \right)\right)\right) \end{equation} where $\card\left(\bigcup{[x;x+k-1]} \right)$ is the number of unique sequence indices included in the union of intervals $\bigcup{[x;x+k-1]}$.

Basic bounds: $k+n-1 \leq \card\left( \bigcup_{ x\in\{x_1,x_2, \dots, x_n\} }{[x;x+k-1]} \right) \leq n k$. The lower bound is reached when all intervals are maximally overlapping (each one is shifted by 1 after the previous). The upper bound is reached when no intervals are overlapping.

This approach therefore requires finding: \begin{equation} \card\left( \bigcup_{ x\in\{x_1,x_2, \dots, x_n\} }{[x;x+k-1]} \right) \quad \text{for} \quad 0\leq x_1 < x_2< \dots < x_n \leq N-k \end{equation} which can be efficiently computed explicitly by an algorithm.
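One possible way to compute this cardinality, sketched under the assumption that the offsets are distinct and non-empty: after sorting, the first interval contributes $k$ indices and each later interval contributes $\min(k, x_i - x_{i-1})$ new ones. The helper name `union_size` is illustrative.

```python
def union_size(offsets, k):
    """Number of distinct indices covered by the intervals [x, x+k-1].

    Assumes `offsets` is a non-empty collection of distinct integers.
    """
    xs = sorted(offsets)
    size = k  # the first interval is counted in full
    for prev, cur in zip(xs, xs[1:]):
        # A later interval adds only the indices not already covered
        # by its predecessor: at most k, and exactly the gap if they overlap.
        size += min(k, cur - prev)
    return size
```

With $k=3$, offsets $(0,1,2)$ give the lower bound $k+n-1=5$ and offsets $(0,3,6)$ give the upper bound $nk=9$.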

Optimizations

The heaviest operation is clearly the inner sum $\sum_{0\leq x_1 < x_2< \dots < x_n \leq N-k}$. However, many of its terms are equal (e.g. in the non-overlapping case, the term is always $P(T(0,nk))$) and can be grouped and counted once. Many terms are also zero, because the cardinality of the union of intervals exceeds $r$. Both properties simplify the sum.

The outer sum $\sum_{1\leq n \leq N-k+1} (-1)^{n+1}\dots$ can be truncated without loss to $1\leq n\leq r-k+1$: for larger $n$ the cardinality is at least $k+n-1 > r$, so every term is zero.

Finally, we can rewrite: \begin{equation} P(T(any,k)) = \sum_{1\leq n \leq r-k+1} (-1)^{n+1} \sum_{k+n-1\leq j \leq \min(nk,r)}D(n,j)P(T(0,j)) \end{equation} where $D(n,j)$ counts the number of terms of cardinality $j$ at outer iteration $n$. And if you can find an efficient way of computing $D(n,j)$, you have solved the problem in an efficient way.
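The formula above can be checked on tiny instances with the following sketch. It is not an efficient way of computing $D(n,j)$: the table is filled by brute-force enumeration of all offset subsets, which is only practical for very small $N$. The names `D_table` and `prob_inclusion_exclusion` are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb


def union_size(xs, k):
    """Indices covered by the intervals [x, x+k-1] for sorted distinct xs."""
    return k + sum(min(k, b - a) for a, b in zip(xs, xs[1:]))


def D_table(N, r, k):
    """Brute-force D(n, j): number of n-subsets of offsets whose union has size j."""
    table = Counter()
    for n in range(1, r - k + 2):  # terms with n > r-k+1 are all zero
        for xs in combinations(range(N - k + 1), n):  # sorted tuples
            j = union_size(xs, k)
            if j <= r:  # more than r forced-red positions is impossible
                table[(n, j)] += 1
    return table


def prob_inclusion_exclusion(N, r, k):
    """Sum of (-1)^(n+1) * D(n, j) * P(T(0, j)) over all (n, j)."""
    total = Fraction(0)
    for (n, j), count in D_table(N, r, k).items():
        total += (-1) ** (n + 1) * count * Fraction(comb(r, j), comb(N, j))
    return total
```

For $N=5$, $r=3$, $k=2$ the $n=1$ terms sum to $12/10$ and the $n=2$ terms to $-3/10$, giving $9/10$; only the arrangement with red balls at positions 0, 2, 4 avoids two adjacent reds.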

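As a side note, by the Bonferroni inequalities the partial sums of the outer sum $\sum (-1)^{n+1}\dots$ alternate between upper bounds (odd $n$) and lower bounds (even $n$) on the exact result, so the sum may be truncated once two consecutive partial sums agree to the desired precision. The steps are not guaranteed to shrink at every $n$, so the gap between consecutive partial sums is the quantity to check.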

dmp32

The generating function for binary (red-blue or 0-1) sequences that do not contain $k$ consecutive 1's, with $x$ marking each 0 and $y$ marking each 1, is

$$ \frac{1-y^k}{1-x-y+xy^k} $$

(Figure: two factory diagrams, left for $k=2$ and right for general $k$.)

Using "factory diagrams" to generate the sequences that contain no $k$ consecutive 1's, we get for $k=2$ (left diagram) $S = 1 + Sx + Syx$ for sequences that contain no 11 and are either empty or end in 0, and $T = Sy$ for those that end in 1. The generating function is then $S+T$.

Generally (right diagram), $S = 1 + Sx + Syx + Sy^2x + \dots + Sy^{k-1}x$ and $T = S(y + y^2 + \dots + y^{k-1})$.
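To connect these relations to the urn problem, here is a sketch under the assumption that $x$ marks a blue ball (0) and $y$ a red ball (1). Solving the relations for $S+T$ and expanding in powers of $x$ gives $S+T = \sum_{b\geq 0} x^b\,(1+y+\dots+y^{k-1})^{b+1}$, so the number of sequences with $b$ blue and $r$ red balls and no run of $k$ reds is the coefficient of $y^r$ in $(1+y+\dots+y^{k-1})^{b+1}$. Writing that factor as $(1-y^k)/(1-y)$ and expanding yields the alternating sum used below. The function names are illustrative, and $k \geq 1$ is assumed.

```python
from fractions import Fraction
from math import comb


def count_no_long_run(b, r, k):
    """Sequences of b blues and r reds with every red run shorter than k.

    Coefficient of y^r in ((1 - y^k) / (1 - y))^(b + 1):
    sum over i of (-1)^i * C(b+1, i) * C(r - i*k + b, b).
    """
    total = 0
    i = 0
    while i <= b + 1 and r - i * k >= 0:
        total += (-1) ** i * comb(b + 1, i) * comb(r - i * k + b, b)
        i += 1
    return total


def prob_run_at_least_gf(N, r, k):
    """P(longest red run >= k) = 1 - (#sequences with no such run) / C(N, r)."""
    b = N - r
    return 1 - Fraction(count_no_long_run(b, r, k), comb(N, r))
```

The sum has at most $\lfloor r/k \rfloor + 1$ terms, and Python integers are arbitrary precision, so this remains exact for large $N$; convert the resulting `Fraction` to a float only at the end.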

Boyku
  • I'm not sure I understand how you get the generating function. I guess the result should depend on the number of 0/1. Also what are $x$ and $y$ ? – qdr Jul 23 '19 at 14:11
  • Oh dear me, it is a long story. For example, the generating function for the set $\{ \lambda, 0, 1, 00, 11 \}$ is $1+x+y+x^2 +y^2$. Automata and grammars are just another way to depict this kind of structure. – Boyku Jul 24 '19 at 00:30