1. Background:
The paper that introduced the approximate message passing (AMP) algorithm [Paper Link] states the following (I have slightly reworded the conclusion without changing its meaning):
Given a fixed vector $\mathbf{x}\in\mathbb{R}^N$ and a random measurement matrix $\mathbf{A}\in\mathbb{R}^{M\times N}$ with $M\ll N$, whose entries are i.i.d. with $\mathbf{A}_{ij}\sim\mathcal{N}(0,\frac{1}{M})$, the vector $[(\mathbf{I}_N-\mathbf{A}^\top \mathbf{A})\mathbf{x}]$ is a Gaussian vector whose entries have variance $\frac{\lVert \mathbf{x}\rVert_2^2}{M}$.
(The assumption on $\mathbf{A}$ is in Sec. I-A and the conclusion above is in Sec. I-D; see 5. Appendix for more details.)
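For intuition (this sanity check is my own, not from the paper), the mean is easy to compute: since the entries of $\mathbf{A}$ are i.i.d. $\mathcal{N}(0,\frac{1}{M})$,
$$\mathbb{E}\big[(\mathbf{A}^\top\mathbf{A})_{jj}\big]=\sum_{k=1}^{M}\mathbb{E}\big[\mathbf{A}_{kj}^2\big]=M\cdot\frac{1}{M}=1,\qquad \mathbb{E}\big[(\mathbf{A}^\top\mathbf{A})_{ij}\big]=0\ \ (i\neq j),$$
so $\mathbb{E}[\mathbf{A}^\top\mathbf{A}]=\mathbf{I}_N$ and $\mathbb{E}[(\mathbf{I}_N-\mathbf{A}^\top\mathbf{A})\mathbf{x}]=\mathbf{0}$; my questions below concern the full distribution, not just the mean.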
2. My Problems:
I am confused by the paper's statements about the conclusion in [1. Background], and I have three subproblems (the first two are coupled, and the second is more general):
$(2.1)$ How can the conclusion in [1. Background] be proved?
$(2.2)$ For a more general case, $\mathbf{A}_{ij}\sim\mathcal{N}(\mu,\sigma^2)$ with $\mu\in\mathbb{R}$ and $\sigma\in\mathbb{R}^+$: given $\rho\in\mathbb{R}$, will the vector $[(\mathbf{I}_N-\rho\mathbf{A}^\top \mathbf{A})\mathbf{x}]$ still be Gaussian? What are the distribution parameters (e.g. means and variances)?
$(2.3)$ When $\mathbf{A}$ is a random Bernoulli matrix, i.e., its entries are i.i.d. with $P(\mathbf{A}_{ij}=a)=P(\mathbf{A}_{ij}=b)=\frac{1}{2}$ for some $a<b$, will the vector $[(\mathbf{I}_N-\rho\mathbf{A}^\top \mathbf{A})\mathbf{x}]$ follow some particular distribution? What are the parameters (e.g. means and variances)? (An empirical sketch for $(2.2)$ and $(2.3)$ is given after this list.)
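For concreteness, this is how I would check $(2.2)$ and $(2.3)$ empirically; the values of $\mu$, $\sigma$, $\rho$, $a$, $b$, $M$, $N$ below are arbitrary example choices of mine, not from the paper:

import numpy as np

rng = np.random.default_rng(0)
m, n = 500, 5000          # example sizes with m << n
x = rng.random(n)
rho = 1.0 / m             # example scaling factor; arbitrary choice

# Case (2.2): general Gaussian entries A_ij ~ N(mu, sigma^2)
mu, sigma = 0.3, 2.0      # arbitrary example parameters
A = rng.normal(mu, sigma, size=(m, n))
e = x - rho * (A.T @ (A @ x))   # (I - rho * A'A) x
print('Gaussian case: mean = %f, std = %f' % (e.mean(), e.std()))

# Case (2.3): Bernoulli entries taking values a < b with probability 1/2 each
a, b = -1.0, 1.0          # arbitrary example values
B = rng.choice([a, b], size=(m, n))
e = x - rho * (B.T @ (B @ x))   # (I - rho * B'B) x
print('Bernoulli case: mean = %f, std = %f' % (e.mean(), e.std()))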
3. My Efforts:
$(3.1)$ I have convinced myself of the conclusion in [1. Background] by writing a program that randomly generates hundreds of vectors with various values of $M$ and $N$. The histograms are approximately Gaussian with matching variances.
$(3.2)$ I wrote out the expression for each element of the resulting vector (it is written out after this list), but it is too complicated for me. In particular, the elements of $\mathbf{A}^\top\mathbf{A}$ are difficult for me to analyse, since the product of two Gaussian variables is not Gaussian [Reference].
$(3.3)$ After a long struggle, I still do not know whether there is an efficient way to analyse subproblems $(2.1)$, $(2.2)$ and $(2.3)$.
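For reference, here is the expression from $(3.2)$: writing $\mathbf{a}_j\in\mathbb{R}^M$ for the $j$-th column of $\mathbf{A}$, the $i$-th entry of $\mathbf{e}=(\mathbf{I}_N-\mathbf{A}^\top\mathbf{A})\mathbf{x}$ is
$$e_i = x_i-\sum_{j=1}^{N}\langle\mathbf{a}_i,\mathbf{a}_j\rangle x_j=\big(1-\lVert\mathbf{a}_i\rVert_2^2\big)x_i-\sum_{j\neq i}\langle\mathbf{a}_i,\mathbf{a}_j\rangle x_j,\quad\text{where }\langle\mathbf{a}_i,\mathbf{a}_j\rangle=\sum_{k=1}^{M}\mathbf{A}_{ki}\mathbf{A}_{kj}.$$
Each inner product is a sum of products of Gaussian variables, and this is exactly where I get stuck.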
4. My Experiments:
Here is my Python code for experiments:
import numpy as np

for i in range(3):
    n = np.random.randint(2, 10000)
    m = np.random.randint(1, n)
    # Entries of A are i.i.d. N(0, 1/m)
    A = np.random.randn(m, n) * ((1 / m) ** 0.5)
    x = np.random.rand(n)
    e = np.matmul(np.eye(n) - np.matmul(A.T, A), x)  # (I - A'A)x
    print(np.linalg.norm(x) * ((1 / m) ** 0.5))      # predicted std: ||x||_2 / sqrt(m)
    print(np.std(e))                                 # empirical std of the entries of e
    print('===')
    try:
        from matplotlib import pyplot as plt
        import seaborn
        plt.cla()
        seaborn.histplot(e, bins=300)
        plt.title('mean = %f, std = %f' % (float(e.mean()), float(e.std())))
        plt.savefig('%d.png' % i, dpi=300)
    except ImportError:
        pass
The empirical standard deviation and the predicted one generally match (I have run the code multiple times; each block below shows the predicted standard deviation followed by the empirical one):
0.7619008975832263
0.7371794446157226
===
0.5792213637974852
0.5936062808535417
===
0.5991335956466841
0.6256026437096703
===
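To quantify "generally matched" beyond three runs, here is a short additional sketch (the number of trials and the size ranges are arbitrary choices of mine) that averages the ratio of the empirical to the predicted standard deviation; computing $\mathbf{x}-\mathbf{A}^\top(\mathbf{A}\mathbf{x})$ also avoids forming the $N\times N$ identity:

import numpy as np

ratios = []
for _ in range(100):
    n = np.random.randint(2, 2000)
    m = np.random.randint(1, n)
    A = np.random.randn(m, n) * ((1 / m) ** 0.5)  # i.i.d. N(0, 1/m) entries
    x = np.random.rand(n)
    e = x - A.T @ (A @ x)  # (I - A'A)x without forming the n x n identity
    ratios.append(np.std(e) / (np.linalg.norm(x) * ((1 / m) ** 0.5)))
print('mean ratio = %f, std of ratio = %f' % (np.mean(ratios), np.std(ratios)))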
5. Appendix:
Here are two fragments of the paper:
(from Sec. I-A)
(from Sec. I-D)
Additionally, this paper [Paper Link], published in IEEE Transactions on Image Processing in 2021, refers to this conclusion (around its Equation (6)):