
I have a software test that fails from time to time because the string 123 appears in the output of SecureRandom.hex(32) (Ruby), which returns a 64-character string.
Each character is one of (A-F) or (0-9), so there are 16 possible symbols.

I experimented a little and noticed that this happens roughly 1,500 times per 100,000 calls to the function, so around 1.5%.
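
The observation above can be reproduced with a small simulation. This is only a sketch: it uses Python's seeded `random` module as a stand-in for Ruby's `SecureRandom.hex(32)`, and the function name `estimate_failure_rate` is hypothetical.

```python
import random


def estimate_failure_rate(trials: int, seed: int = 0) -> float:
    """Estimate how often a random 64-character hex string contains "123"."""
    rng = random.Random(seed)  # seeded PRNG so the run is repeatable (assumption: not a secure source)
    hits = 0
    for _ in range(trials):
        # 256 random bits rendered as 64 lowercase hex digits, zero-padded
        token = format(rng.getrandbits(256), "064x")
        if "123" in token:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print(estimate_failure_rate(100_000))
```

With 100,000 trials the estimate should land near the 1.5% figure, with sampling noise of a few hundredths of a percentage point.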

What is the formula for the exact probability of this test failing?

As a developer, I had to share a little bit of code: JS Fiddle.

Edit: Answer's WolframAlpha link

54392126209570846953192790832763093471349103670511891931049735583247364107/
3618502788666131106986593281521497120414687020801267626233049500247285301248
≈0.0150317
≈1.50317%

3 Answers


An exact answer, via Inclusion Exclusion Principle:

Let $A_i =$ the set of 64-long strings where "123" appears starting at the $i$th position ($1 \le i \le 62$). E.g. if "123" appears at both the 2nd and 40th positions in string $x$, then $x \in A_2$ and $x \in A_{40}$, i.e. $x \in A_2 \cap A_{40} $.

Then $S = \bigcup_i A_i=$ the set of strings where "123" appears at least once, and your desired probability is $|S|/16^{64}$.

The Inclusion Exclusion Principle gives:

$$|S| = |\bigcup A_i| = \sum_i |A_i| - \sum_{i<j} |A_i \cap A_j| + \sum_{i<j<k} |A_i \cap A_j \cap A_k| - \dots $$

Obviously $|A_i| = 16^{61}$ as the other 61 positions can be anything. So the first term is $\sum_i |A_i| = 62 \cdot 16^{61}$.

Now two occurrences of "123" cannot overlap (unlike e.g. if you're looking for "111" in which case they CAN overlap). So if $j - i < 3$, then $|A_i \cap A_j| = 0$. For the non-trivial case of $j - i \ge 3$, the other $64 - 3 - 3 = 58$ positions can be filled with anything, so $|A_i \cap A_j| = 16^{58}$.

Meanwhile, how many pairs $(i, j)$ are there such that $j - i \ge 3$? Imagine "merging" each "123" into a single new character "!", so the string now has length $64-2-2 = 60$ and contains two "!" characters. Choosing where the two "!" characters go among the 60 positions can be done in ${60 \choose 2}$ ways. Thus the 2nd term is:

$$ \sum_{i<j} |A_i \cap A_j| = {60 \choose 2} 16^{58}$$
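
As a sanity check on the "merging" argument, the number of non-overlapping pairs can be counted by brute force. This is a minimal sketch, and the variable names are illustrative only.

```python
from itertools import combinations
from math import comb

# Start positions of "123" in a 64-character string run from 1 to 62.
positions = range(1, 63)

# Count pairs (i, j) with i < j whose occurrences do not overlap (j - i >= 3).
non_overlapping_pairs = sum(
    1 for i, j in combinations(positions, 2) if j - i >= 3
)

print(non_overlapping_pairs, comb(60, 2))  # both are 1770
```

Of the ${62 \choose 2} = 1891$ pairs in total, the $61 + 60 = 121$ pairs with $j - i \in \{1, 2\}$ are excluded, leaving $1770 = {60 \choose 2}$.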

The $m$th term is similar and we have:

$$ \sum_{i_1 < i_2 < \cdots < i_m} |A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_m}| = {64 - 2m \choose m} 16^{64 - 3m} $$

So the final result is:

$$ |S| = \sum_{m=1}^{21} (-1)^{m+1} {64 - 2m \choose m} 16^{64 - 3m} $$
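
The sum is easy to evaluate exactly with arbitrary-precision integers. The following is a sketch, generalising 64 to a length $n$ with upper limit $\lfloor n/3 \rfloor$; the function name `count_containing_123` is hypothetical.

```python
from fractions import Fraction
from math import comb


def count_containing_123(n: int) -> int:
    """Number of length-n strings over 16 symbols that contain "123".

    Evaluates the alternating inclusion-exclusion sum term by term.
    """
    return sum(
        (-1) ** (m + 1) * comb(n - 2 * m, m) * 16 ** (n - 3 * m)
        for m in range(1, n // 3 + 1)
    )


# Exact probability for a 64-character hex string, as a reduced fraction.
probability = Fraction(count_containing_123(64), 16 ** 64)
print(probability)
print(float(probability))
```

The floating-point value should agree with the ≈0.0150317 quoted in the question. Small cases are easy to check by hand, e.g. `count_containing_123(4)` is 32.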

antkam

This answer is based upon the Goulden-Jackson Cluster Method.

We consider the set of words of length $n\geq 0$ built from the alphabet $\mathcal{V}=\{0,\ldots,9,A,\ldots,F\}$ and the set $B=\{123\}$ of bad words, which are not allowed to be part of the words we are looking for.

  • We derive a generating function $f(s)$ with the coefficient of $s^n$ being the number of these words of length $n$.

  • Since we are looking for the number of words which contain the bad word $123$, the resulting generating function is the generating function of all words minus $f(s)$ \begin{align*} &1+16s+16^2s^2+16^3s^3+\cdots-f(s)=\frac{1}{1-16s}-f(s) \end{align*}

According to the paper (p.7) the generating function $f(s)$ is \begin{align*} f(s)=\frac{1}{1-ds-\text{weight}(\mathcal{C})}\tag{1} \end{align*} with $d=|\mathcal{V}|=16$, the size of the alphabet and $\mathcal{C}$ the weight-numerator of bad words with \begin{align*} \text{weight}(\mathcal{C})=\text{weight}(\mathcal{C}[123]) \end{align*}

We calculate according to the paper \begin{align*} \text{weight}(\mathcal{C})=\text{weight}(\mathcal{C}[123])&=-s^3\tag{2}\\ \end{align*}

It follows from (1) and (2)

\begin{align*} f(s)=\frac{1}{1-16s+s^3}\\ \end{align*} and the generating function counting all strings which contain $123$ is \begin{align*} &\color{blue}{\frac{1}{1-16s}-\frac{1}{1-16s+s^3}}=s^3 + \color{blue}{32} s^4 + 768 s^5 + 16\,383 s^6 +\cdots\tag{3}\\ \end{align*}

The coefficients of the series were calculated with the help of Wolfram Alpha. We see for instance there are $\color{blue}{32}$ words of length $4$ which contain the bad word $123$. These are $$(0..9,A..F)123\qquad\text{ and }\qquad 123(0..9,A..F).$$
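
The series coefficients can also be obtained without a computer algebra system. The denominator $1-16s+s^3$ of $f(s)$ corresponds to the recurrence $a_n = 16a_{n-1} - a_{n-3}$ with $a_0 = 1$ and $a_n = 0$ for $n<0$, where $a_n$ counts the words of length $n$ avoiding $123$. The following is a sketch of that calculation; the function name `avoiding_counts` is hypothetical.

```python
def avoiding_counts(n_max: int) -> list[int]:
    """a[n] = number of length-n words over 16 symbols that avoid "123".

    Uses the recurrence a[n] = 16*a[n-1] - a[n-3] read off the denominator
    1 - 16s + s^3, with a[0] = 1 and a[n] = 0 for n < 0.
    """
    a = []
    for n in range(n_max + 1):
        if n == 0:
            a.append(1)
        else:
            a.append(16 * a[n - 1] - (a[n - 3] if n >= 3 else 0))
    return a


a = avoiding_counts(64)
# Words containing "123": all words minus the words that avoid it.
containing = [16 ** n - a[n] for n in range(len(a))]
print(containing[3:7])  # [1, 32, 768, 16383]
```

Dividing `containing[64]` by `16 ** 64` then gives the probability for a 64-character string.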

The coefficient of $s^{n}$

In fact we are interested in the coefficient of $s^{64}$. We calculate the coefficient of $s^n$ from (3). It is convenient to use the coefficient-of operator $[s^n]$ to denote the coefficient of $s^n$ of a series.

We obtain from (3) for $n\geq 0$

\begin{align*} \color{blue}{[s^n]}&\color{blue}{\left(\frac{1}{1-16s}-\frac{1}{1-16s+s^3}\right)}\\ &=[s^n]\sum_{m=0}^\infty 16^ms^m-[s^n]\sum_{m=0}^\infty s^m(16-s^2)^m\\ &=16^n-[s^n]\sum_{m=0}^\infty s^m\sum_{k=0}^m\binom{m}{k}(-1)^ks^{2k}16^{m-k}\\ &=16^n-\sum_{m=0}^n[s^{n-m}]\sum_{k=0}^m\binom{m}{k}(-1)^ks^{2k}16^{m-k}\\ &=16^n-\sum_{m=0}^n[s^m]\sum_{k=0}^{n-m}\binom{n-m}{k}(-1)^ks^{2k}16^{n-m-k}\\ &=16^n-\sum_{m=0}^{\lfloor n/2 \rfloor} [s^{2m}]\sum_{k=0}^{n-2m}\binom{n-2m}{k}(-1)^ks^{2k}16^{n-2m-k}\\ &=16^n-\sum_{m=0}^{\lfloor n/3 \rfloor}\binom{n-2m}{m}(-1)^m16^{n-3m}\\ &\,\,\color{blue}{=\sum_{m=1}^{\lfloor n/3 \rfloor}\binom{n-2m}{m}(-1)^{m+1}16^{n-3m}}\tag{4}\\ \end{align*}

Finally, from (4) we obtain the desired probability, in agreement with the answer of @antkam:

\begin{align*} \color{blue}{[s^{64}]}\color{blue}{\left(\frac{1}{1-16s}-\frac{1}{1-16s+s^3}\right)16^{-64}}&\color{blue}{=\sum_{m=1}^{21}\binom{64-2m}{m}(-1)^{m+1}16^{-3m}}\\ &\color{blue}{\simeq 0.01503\,16662\,40506} \end{align*}

which gives roughly $\color{blue}{1.5\%}$.

Markus Scheuer

Just a quick hint: total number of all possible strings is $16^{32}$, the number of strings that contain at least one occurrence of "$123$" is $32 \cdot 16^{29}=2\cdot 16^{30}$ ($32$ spots to put substring "$123$" and we need to fill remaining $29$ positions). The probability for truly random generation should be $$\frac{2\cdot 16^{30}}{16^{32}}=\frac{1}{128}$$

Vasili
    the string length is 64, there are 62 places to put "123", and this argument double-counts the cases where "123" appears twice, etc. – antkam Apr 20 '18 at 14:19