
We have two binary arrays $\vec{x}$ and $\vec{y}$ (filled with 1's and 0's), both of length $N$. We know that

$\sum_i x_i = N_x$ and

$\sum_i y_i = N_y$

Let $perm(\vec{x})$ denote a random permutation of the elements of an array. Thus define

$\vec{x}' = perm(\vec{x})$ and

$\vec{y}' = perm(\vec{y})$

I am interested in finding an analytic expression for the probability $P[C = c]$ of the number of randomly-intersecting elements, namely

$C = \sum_i x_i' y_i'$

If the exact expression does not have a closed form, a good approximation would also be helpful.
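For concreteness, the quantity $C$ is easy to estimate by straight Monte Carlo over random permutations. Here is a minimal sketch (Python with numpy; the function name and sizes are mine, not part of the problem statement):

```python
import numpy as np

def sample_overlap(N, Nx, Ny, n_samples=100_000, seed=0):
    """Draw Monte Carlo samples of C = sum_i x'_i y'_i under random permutations."""
    rng = np.random.default_rng(seed)
    x = np.zeros(N, dtype=int); x[:Nx] = 1   # binary array with Nx ones
    y = np.zeros(N, dtype=int); y[:Ny] = 1   # binary array with Ny ones
    # Permuting both arrays independently is equivalent in distribution to permuting just one.
    return np.array([rng.permutation(x) @ rng.permutation(y) for _ in range(n_samples)])
```

As noted in the comments below, such sampling converges slowly in the tails, which is what motivates looking for an exact expression.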

This problem comes from optics. I have two multichannel recordings, before and after I do something, and I want to test whether the number of channels co-active in both situations can be explained by the null hypothesis that the exact channels active at every moment in time are completely random.

My Attempt No 1:

The problem can be reformulated as follows: Assume there are two urns:

  • Urn $X$ has $N_x$ white and $N-N_x$ black balls
  • Urn $Y$ has $N_y$ white and $N-N_y$ black balls.

We draw one ball from each urn without replacement and check whether both balls are white, then repeat until all balls are drawn. We are interested in the probability of drawing a pair of white balls exactly $c$ times.

Now, if we relax the problem and allow draws with replacement, it is easy to see that $C \sim \mathrm{Bin}(N, p)$ is binomial with $p=\frac{N_x}{N} \cdot \frac{N_y}{N}$. Since the original problem requires us to draw without replacement, it seems that the answer might be some form of hypergeometric distribution. However, the standard hypergeometric distribution deals with only one urn; I need an extension that deals with matching two urns.
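Under that relaxation, the approximate pmf is straightforward to tabulate; a quick sketch (Python, scipy assumed available, sizes purely illustrative):

```python
import numpy as np
from scipy.stats import binom

N, Nx, Ny = 50, 20, 15                         # illustrative sizes, not from the problem
p = (Nx / N) * (Ny / N)                        # P(both drawn balls are white), with replacement
approx_pmf = binom.pmf(np.arange(N + 1), N, p)  # approximate P[C = c] for c = 0..N
```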

  • I'm not sure how large of a problem you're dealing with here. It might be feasible to do this via recursion:

    \begin{align} p(N,N_x,N_y,c) = \hphantom{+}&\frac{N_x}{N}\frac{N_y}{N}\,p(N-1,N_x-1,N_y-1,c-1) \\ +&\frac{N-N_x}{N}\frac{N_y}{N}\,p(N-1,N_x,N_y-1,c) \\ +&\frac{N_x}{N}\frac{N-N_y}{N}\,p(N-1,N_x-1,N_y,c) \\ +&\frac{N-N_x}{N}\frac{N-N_y}{N}\,p(N-1,N_x,N_y,c) \end{align}

    – Fimpellizzeri Jan 10 '20 at 15:21
  • Cool! I have in fact already written this recursion, and was a little sad that I could not use it to generate an analytical solution. However, I am actually OK with a numerical solution, as long as it is fast and exact. I tried random permutations, but they never converge at the tail, which is important for me. The problem size is $N=50$, so it should be doable by recursion. – Aleksejs Fomins Jan 10 '20 at 16:22
  • OK, I made a mistake in the last comment. I am actually interested in computing all possible probability distributions for two values of $N$, namely 132 and 2256. I have now used dynamic programming to calculate the recursion for $N=132$; it runs in about a minute and needs to store $\sim 7 \times 10^6$ intermediate non-zero probabilities. I'm a bit afraid to run the code on the bigger value :D – Aleksejs Fomins Jan 10 '20 at 19:32
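A minimal memoized sketch of the recursion from the comments above (Python; the boundary conditions and function name are my own, not part of the original discussion):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def p_recursive(N, Nx, Ny, c):
    """P[C = c] via the urn-pairing recursion (sketch; assumes 0 <= Nx, Ny <= N)."""
    if c < 0 or c > min(Nx, Ny):
        return 0.0
    if N == 0:
        return 1.0 if c == 0 else 0.0
    pwx, pwy = Nx / N, Ny / N   # P(white from urn X), P(white from urn Y) on this draw
    # Note: recursion depth grows with N; for N in the thousands, convert to an iterative DP table.
    return (pwx * pwy * p_recursive(N - 1, Nx - 1, Ny - 1, c - 1)
            + (1 - pwx) * pwy * p_recursive(N - 1, Nx, Ny - 1, c)
            + pwx * (1 - pwy) * p_recursive(N - 1, Nx - 1, Ny, c)
            + (1 - pwx) * (1 - pwy) * p_recursive(N - 1, Nx, Ny, c))
```

For example, `p_recursive(50, 20, 15, 5)` gives the exact probability for an $N = 50$ case of the kind mentioned above.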

1 Answer


Inspired by other questions and answers about the probability of a certain Hamming distance, for example here, I found the following formula:

$$P(N,N_x,N_y,c) = \frac{{N_y \choose c}{N - N_y \choose N_x - c}{N \choose N_y}}{{N \choose N_x}{N \choose N_y}} = \frac{{N_x \choose c}{N - N_x \choose N_y - c}{N \choose N_x}}{{N \choose N_x}{N \choose N_y}}$$

assuming ${n \choose k} = 0$ when $n \lt k$.

Note that if $N - N_y \ge N_x - c$ then $N - N_x \ge N_y - c$ and vice versa.

I tried it numerically and it holds for all cases with $N \le 10$.
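A brute-force check along these lines might look like this (Python; after cancelling the common ${N \choose N_y}$ factor the formula reduces to the usual hypergeometric ratio, and $\vec{y}$ can be held fixed by symmetry):

```python
from itertools import combinations
from math import comb

def p_closed(N, Nx, Ny, c):
    """Closed form from above, after cancelling the common C(N, Ny) factor."""
    if c < 0 or c > min(Nx, Ny):
        return 0.0
    return comb(Ny, c) * comb(N - Ny, Nx - c) / comb(N, Nx)

def p_bruteforce(N, Nx, Ny, c):
    """Exact P[C = c] by enumerating all placements of the ones of x against a fixed y."""
    y_ones = set(range(Ny))                  # by symmetry, y may be held fixed
    hits = sum(1 for x_ones in combinations(range(N), Nx)
               if len(y_ones.intersection(x_ones)) == c)
    return hits / comb(N, Nx)

# check every c at one small size
N, Nx, Ny = 10, 4, 6
assert all(abs(p_closed(N, Nx, Ny, c) - p_bruteforce(N, Nx, Ny, c)) < 1e-12
           for c in range(Nx + 1))
```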

The denominator is the number of all possible pairs of arrays $(\vec{x},\vec{y})$.

To build the numerator, we can think of choosing a pair $(\vec{x},\vec{y})$ satisfying the intersection requirement: ${N_y \choose c}$ counts the ways the ones of $\vec{y}$ can be used to form the intersection, ${N-N_y \choose N_x-c}$ counts the ways the zeroes of $\vec{y}$ can be assigned to the remaining $N_x-c$ ones of $\vec{x}$, and all of that is multiplied by ${N \choose N_y}$, the number of possible $\vec{y}$ arrays. Okay, maybe somebody can help to justify that better!

EDIT: additional explanation using generating functions.

We can apply generating functions to obtain the above formula, in the way explained in this answer.

Suppose we choose just one $\vec{y}$, and we may assume $y_i=1$ for $i=1, \dots, N_y$ and $y_i=0$ otherwise (the order is not important here). We have a system of two equations:

$$\begin{cases} x_1 + \ldots + x_{N_y} = c \\ x_1 + \ldots + x_N = N_x \\ \end{cases} $$

The coefficients of the first equation are $a_{1i}=y_i$ ($i=1,\dots,N$) and those of the second equation are $a_{2i}=1$ ($i=1,\dots,N$). The generating function is:

$$g(z_1,z_2)=\prod_{i=1}^{N}{\left(1+\prod_{j=1}^{2}{z_j^{a_{ji}}}\right)}=\left(1+z_1z_2\right)^{N_y}\left(1+z_2\right)^{N-N_y}=\left[\sum_{i=0}^{N_y}{{N_y \choose i}z_1^iz_2^i}\right]\left[\sum_{j=0}^{N-N_y}{{N-N_y \choose j}z_2^j}\right]=\sum_{i=0}^{N_y}{\sum_{j=0}^{N-N_y}{{N_y \choose i}{N-N_y \choose j}z_1^iz_2^{i+j}}}$$

and we need to get the coefficient of the term with $i = c$ and $i+j=N_x$ (and thus $j = N_x - c$) which is:

$$[z_1^{c}z_2^{N_x}]g(z_1,z_2)={N_y \choose c}{N-N_y \choose N_x-c}$$
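As a sanity check on the coefficient extraction, one can expand $g$ symbolically for small parameters (Python with sympy assumed available; the numbers are only an illustration):

```python
import sympy as sp
from math import comb

z1, z2 = sp.symbols('z1 z2')
N, Ny, Nx, c = 8, 3, 4, 2                                  # small illustrative values

g = sp.expand((1 + z1 * z2) ** Ny * (1 + z2) ** (N - Ny))  # the generating function above
coeff = g.coeff(z1, c).coeff(z2, Nx)                       # [z1^c z2^Nx] g(z1, z2)
assert coeff == comb(Ny, c) * comb(N - Ny, Nx - c)         # equals C(Ny,c) C(N-Ny,Nx-c) = 30
```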

I just saw now that the result is exactly the probability mass function of the hypergeometric distribution that you considered.

  • I'm gonna sit down and check now. If you are right, you may have just helped me push neuroscience a tiny little step forwards. – Aleksejs Fomins Jan 11 '20 at 14:54
  • It would be nice to read your paper when you're done, even if I cannot understand it. Later I will add an explanation with generating functions that I just worked out. – Fabius Wiesner Jan 11 '20 at 15:21
  • Hi again! It only took half a year, but we finally submitted. You can see the pre-print here https://www.biorxiv.org/content/10.1101/2020.07.08.193334v1.full.pdf The analysis is on page 8, the methods for shared connections is on page 23-25. The main conclusion of that section is that the brain networks of mice become more stable as they learn to do the task. Thanks again for your help :) – Aleksejs Fomins Jul 13 '20 at 09:10