
Matrices of the form $\bf A^TA$ (equivalently, positive semidefinite matrices; in particular, covariance matrices $\bf \Sigma$) are linked in practice to many operations in which data points are orthogonally projected:

  1. In ordinary least squares (OLS) regression, $\bf \color{blue}{X^TX}$ is part of the projection matrix $\bf P = X(\color{blue}{X^TX})^{−1}X^T$ that maps the "dependent variable" onto the column space of the model matrix (see the sketch after this list).

  2. In principal component analysis (PCA) the data is projected on the eigenvectors of the covariance matrix.

  3. In Gaussian processes, the covariance matrix transforms "white" random samples (with diagonal covariance) into correlated samples, which seems intuitively to correspond to a way of projecting.
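
To make item 1 concrete, here is a minimal NumPy sketch (my own illustration, not part of the original question; the matrix sizes and random data are arbitrary) checking that the hat matrix $\bf P = X(X^TX)^{-1}X^T$ is a symmetric, idempotent projector and that $\bf Py$ reproduces the OLS fitted values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))    # model matrix with full column rank
y = rng.normal(size=50)         # "dependent variable"

# Hat matrix P = X (X^T X)^{-1} X^T
P = X @ np.linalg.inv(X.T @ X) @ X.T

assert np.allclose(P, P.T)       # symmetric
assert np.allclose(P @ P, P)     # idempotent: P is a projector

# P y lands on the column space of X: it equals the OLS fitted values
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
assert np.allclose(P @ y, X @ beta_hat)
```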

But I am looking for a unifying explanation, a more general concept.

In this regard, I have come across the sentence, "It is as if the covariance matrix stored all possible projection variances in all directions," a statement seemingly supported by the fact that for a data cloud in $\mathbb R^n$, the variance of the projection of the points onto a unit vector $\bf u$ is given by $\bf u^T \Sigma u$.
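
That last fact is easy to check numerically. Below is a small NumPy sketch (again my own illustration, not from the question) that centers a random data cloud, picks a unit direction $\bf u$, and confirms that the sample variance of the projections equals $\bf u^T\Sigma u$ (with matching degrees of freedom).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))  # correlated cloud in R^4
Xc = X - X.mean(axis=0)                                   # center the cloud

Sigma = np.cov(Xc, rowvar=False)    # sample covariance matrix (ddof = 1)

u = rng.normal(size=4)
u /= np.linalg.norm(u)              # unit direction

proj = Xc @ u                       # scalar projections of the points onto u
assert np.allclose(proj.var(ddof=1), u @ Sigma @ u)
```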

So is there a way to unify all these inter-related properties into a single set of principles from which all the applications and geometric interpretations can be derived?

I believe that the unifying theme is related to the orthogonal diagonalization $\bf A^T A = U^T D U$ mentioned here, but I'd like to see this idea explained a bit further.


EXEGETICAL APPENDIX for novices:

It was far from self-evident, but after some help from Michael Hardy and @stewbasic, the answer by Étienne Bézout may be starting to click. So, like in the movie Memento, I'd better tattoo what I have got so far here in case it is blurry in the morning:

Concept One:

Block matrix multiplication:

\begin{align} A^\top A & = \begin{bmatrix} \vdots & \vdots & \vdots & \cdots & \vdots \\ a_1^\top & a_2^\top & a_3^\top & \cdots & a_{\color{blue}{\bf n}}^\top\\ \vdots & \vdots & \vdots & \cdots & \vdots\end{bmatrix} \begin{bmatrix} \cdots & a_1 & \cdots\\ \cdots & a_2 & \cdots \\ \cdots & a_3 & \cdots \\ & \vdots&\\ \cdots & a_{\color{blue}{\bf n}} & \cdots \end{bmatrix}\\ &= a_1^\top a_1 + a_2^\top a_2 + a_3^\top a_3 + \cdots+a_n^\top a_n\tag{1} \end{align}

where the $a_i$'s are the $[\color{blue}{1 \times \bf n}]$ row vectors forming the rows of $\bf A$ (so that the $a_i^\top$'s are the columns of $\bf A^\top$).


Concept Two:

The $\color{blue}{\bf n}$.

We have the same dimensions for the block matrix multiplication $\bf \underset{[\color{blue}{\bf n} \times \text{many rows}]}{\bf A^\top}\underset{[\text{many rows} \times \color{blue}{\bf n}]}{\bf A} =\large [\color{blue}{\bf n} \times \color{blue}{\bf n}] \small \text{ matrix}$, as for each individual summand $\bf a_i^\top a_i$ in Eq. 1 (one summand per row of $\bf A$).
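
A quick numerical check of Concepts One and Two (an illustrative sketch with arbitrary numbers, not part of the original post): for a generic "many rows" $\times\ n$ matrix, $\bf A^\top A$ equals the sum of the outer products $\bf a_i^\top a_i$ of its rows, and every summand is already $n \times n$.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 7, 3                     # "many rows" x n
A = rng.normal(size=(m, n))

# Eq. (1): A^T A is the sum of the outer products a_i^T a_i over the rows a_i
outer_sum = sum(np.outer(a_i, a_i) for a_i in A)   # each term is n x n

assert outer_sum.shape == (n, n)
assert np.allclose(A.T @ A, outer_sum)
```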


Concept Three:

$\bf a_i^\top a_i$ is deceptive because of the key definition: $\bf a_i$ is a row vector.

Because $\bf a_i$ was defined as a row vector, and the $\bf a_i$ vectors are normalized ($\vert a_i \vert =1$), $\bf a_i^\top a_i$ is really a matrix of the form $\bf XX^\top$, i.e. a rank-one orthogonal projection matrix. The answer additionally takes the $a_i$ to be linearly independent (check: "...are linearly independent"), but not orthonormal (not a requisite in the answer: "I'm no longer saying they are orthogonal") - $\color{red}{\text{Do these vectors actually need to be defined as orthonormal?}}$ Or can this constraint of orthonormality of the vectors $a_i$ be relaxed, or is it implicitly fulfilled by virtue of other considerations? Otherwise we have a rather specific $\bf A$ matrix, making the results less generalizable.


Concept Four:

A projection onto what?

Onto the subspace spanned by the columns of $\bf X$ (think of the OLS projection ${\bf X}\color{gray}{(X^\top X)^{-1}} {\bf X^\top}$). But what is $\bf X$ here? None other than $\bf a_i^\top$, and since $\bf a_i$ is a row vector, $\bf a_i^\top$ is a column vector (for which the gray factor is just the scalar $1$, because $\vert a_i\vert = 1$).

So we are doing orthogonal projections onto the lines spanned by the individual columns of $\bf A^\top$, which live in $\mathbb R^{\color{blue}{\bf n}}$.

I was hoping that the last sentence could have been, "... onto the column space of $\bf A$ ...".
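
A small sketch of Concept Four (my own illustration with arbitrary numbers): for a unit row vector $\bf a_i$, the matrix $\bf a_i^\top a_i$ sends any $\bf x$ to the line spanned by the column vector $\bf a_i^\top$, and the residual is orthogonal to that line.

```python
import numpy as np

rng = np.random.default_rng(3)
a = rng.normal(size=4)
a /= np.linalg.norm(a)          # a_i, a unit "row" vector

P_i = np.outer(a, a)            # a_i^T a_i: rank-one projector onto span{a_i^T}

x = rng.normal(size=4)
p = P_i @ x                     # the projection of x

assert np.allclose(P_i @ P_i, P_i)      # idempotent
assert np.allclose(P_i, P_i.T)          # symmetric (orthogonal projection)
assert np.allclose(p, (a @ x) * a)      # image lies on the line spanned by a_i^T
assert np.isclose(a @ (x - p), 0.0)     # residual is orthogonal to that line
```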


What are the implications?

  • $\bf A^T A$, being the matrix of all dot products of the columns of $\bf A$, encapsulates the "geometry" of $A$. If you have done some differential geometry, this is equivalent to the metric tensor $g_{ij}$ (http://mathworld.wolfram.com/MetricTensor.html). It is thus not strange that $\bf (A^T A)^{-1}$ plays an important role, as $g^{ij}$ does. – Jean Marie Sep 07 '16 at 22:58
  • Do you agree that $B:=\bf A^T A$ is made of dot products of columns of $A$? If these columns make an orthonormal basis, $B=I$, the identity matrix. The farther $B$ is from $I$, the more work there will be to orthonormalize the whole... – Jean Marie Sep 07 '16 at 23:06

2 Answers


Suppose we are given a matrix $\mathrm A$ that has full column rank. Its SVD is of the form

$$\mathrm A = \mathrm U \Sigma \mathrm V^T = \begin{bmatrix} \mathrm U_1 & \mathrm U_2\end{bmatrix} \begin{bmatrix} \hat\Sigma\\ \mathrm O\end{bmatrix} \mathrm V^T$$

where the zero matrix may be empty. Note that

$$\mathrm A \mathrm A^T = \mathrm U \Sigma \mathrm V^T \mathrm V \Sigma^T \mathrm U^T = \mathrm U \begin{bmatrix} \hat\Sigma^2 & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T$$

can only be a projection matrix if $\hat\Sigma = \mathrm I$. However,

$$\begin{array}{rl} \mathrm A (\mathrm A^T \mathrm A)^{-1} \mathrm A^T &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm U^T \mathrm U \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \Sigma^T \mathrm \Sigma \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T (\mathrm V \hat\Sigma^2 \mathrm V^T)^{-1} \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \mathrm V^T \mathrm V \hat\Sigma^{-2} \mathrm V^T \mathrm V \Sigma^T \mathrm U^T\\ &= \mathrm U \Sigma \hat\Sigma^{-2} \Sigma^T \mathrm U^T\\ &= \mathrm U \begin{bmatrix} \mathrm I & \mathrm O\\ \mathrm O & \mathrm O\end{bmatrix} \mathrm U^T = \mathrm U_1 \mathrm U_1^T\end{array}$$

is always a projection matrix.
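
The identities above are easy to verify numerically; here is a short NumPy sketch (an editorial illustration, not part of the answer) with a random tall matrix, which has full column rank with probability one.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(6, 3))          # tall, full column rank (almost surely)

U, s, Vt = np.linalg.svd(A)          # full SVD: U is 6 x 6
U1 = U[:, :3]                        # left singular vectors spanning col(A)

P = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P, U1 @ U1.T)     # A (A^T A)^{-1} A^T = U_1 U_1^T
assert np.allclose(P @ P, P)         # always a projection matrix

# A A^T, by contrast, is not idempotent here because the singular values are not all 1
AAt = A @ A.T
assert not np.allclose(AAt @ AAt, AAt)
```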

  • Would you mind giving some context to the expression $\begin{bmatrix} \hat\Sigma\\ \mathrm O\end{bmatrix}$? – Antoni Parellada Sep 08 '16 at 00:35
  • @AntoniParellada If $\mathrm A$ has full column rank, then it's either square or thin (it cannot be fat) and all its singular values are positive. Matrix $\hat\Sigma$ is square, diagonal and positive. – Rodrigo de Azevedo Sep 08 '16 at 00:37
  • What the notation signifies is then, for example, in the matrix $\tiny\begin{bmatrix} 1&6\\2&7\\3&8\\4&9\\5&10 \end{bmatrix}$ (full rank), upon diagonalizing you end up with the diagonal $\tiny\begin{bmatrix} 19.5&0\\0&1.8\\0&0\\0&0\\0&0 \end{bmatrix}$... The zeros at the bottom are the $\bf O$? – Antoni Parellada Sep 08 '16 at 00:44
  • @AntoniParellada I wouldn't call it diagonalizing, as the matrix is thin, not square. Note that the columns of $\mathrm U$ are the left singular vectors of $\mathrm A$ and that the columns of $\mathrm U_1$ are the left singular vectors of $\mathrm A$ that span the column space of $\mathrm A$. – Rodrigo de Azevedo Sep 08 '16 at 00:51
  • Scratch 'diagonalizing' - I meant doing the SVD decomposition to get the diagonal $\Sigma$ matrix. I was inquiring about notation - I think it's clear though that the $\mathrm O$ in $\begin{bmatrix} \hat\Sigma\\ \mathrm O\end{bmatrix}$ is meant to represent the rectangular form of $\Sigma$. – Antoni Parellada Sep 08 '16 at 13:10
  • @AntoniParellada Yes, $\mathrm O$ denotes the matrix of zeros below the square positive matrix $ \hat{\mathrm \Sigma}$. If $\Sigma$ is square, then $\mathrm O$ is an empty matrix, that is, it has zero rows. – Rodrigo de Azevedo Sep 08 '16 at 14:15
  • https://math.stackexchange.com/questions/1298261/difference-between-orthogonal-projection-and-least-squares-solution – dantopa Mar 10 '17 at 20:54
  • https://math.stackexchange.com/questions/2033896/least-squares-solutions-and-the-orthogonal-projector-onto-the-column-space/2180194#2180194 – dantopa Mar 10 '17 at 20:54

Using block matrix notation, we can write $$A = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \\ \end{bmatrix} $$ and $$A^T = \left[a_1^T a_2^T \dots a_n^T \right], $$ where $a_1,...,a_n$ are the rows of $A$.

Then $A^TA = a_1^Ta_1+\dots+a_n^Ta_n$, which is a sum of orthogonal projections on the directions $a_1^T,...,a_n^T$, if we also assume that $|a_1| = ... = |a_n| = 1$. If $A$ is invertible, then $a_1,...,a_n$ are linearly independent, so $A^TA$ can be seen as a sum of $n$ orthogonal projections on $n$ linearly independent directions in $\mathbb{R}^n.$
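
A numerical illustration of this decomposition (an editorial sketch with arbitrary random data, not part of the original answer): after normalizing the rows of $A$, each $a_i^Ta_i$ is a rank-one orthogonal projection matrix, and their sum reproduces $A^TA$; the sum itself is generally not a projection unless the rows are also orthonormal.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(4, 4))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # make every row a unit vector

terms = [np.outer(a, a) for a in A]             # a_i^T a_i for each row a_i

for P_i in terms:
    assert np.allclose(P_i @ P_i, P_i)          # each term is idempotent ...
    assert np.allclose(P_i, P_i.T)              # ... and symmetric: a projection
    assert np.linalg.matrix_rank(P_i) == 1      # of rank one

G = A.T @ A
assert np.allclose(G, sum(terms))               # the sum is exactly A^T A
# G itself is generally not idempotent unless the rows happen to be orthonormal
```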

This should probably be a comment, but obviously I couldn't fit the equations in that format.

Étienne Bézout
  • I see everything you are explaining up to "Of course...". – Antoni Parellada Sep 07 '16 at 23:10
  • @AntoniParellada Sorry, I was too quick reading your question. My comment/answer assumes that A is orthogonal. I'll change it. – Étienne Bézout Sep 07 '16 at 23:17
  • The fact that they are linearly independent, though, does not imply that they are orthogonal... I feel like there is a missing part in the story, which is probably obvious to you... – Antoni Parellada Sep 07 '16 at 23:25
  • @AntoniParellada I'm no longer saying they are orthogonal, only that the projections are orthogonal. – Étienne Bézout Sep 07 '16 at 23:33
  • Right, you are referring to the dot product of two vectors as the projection of either one of them along the unit vector in the direction of the other, correct? – Antoni Parellada Sep 07 '16 at 23:37
  • @AntoniParellada Well, I'm referring to the matrices of orthogonal projections. An orthogonal projection of course involves the scalar product. – Étienne Bézout Sep 08 '16 at 08:54
  • If you define $a_1,...,a_n$ as the column vectors of $A$ instead of the rows of $A$ you avoid $a_i^Ta_i$ coming across as an outer product. – Antoni Parellada Sep 08 '16 at 13:36
  • @AntoniParellada Ok, no problem. If I define $a_i$ as the columns of $A$, then the matrix $A^TA$ can not be written in the way above (unless $A$ is symmetric). Instead, the $ij$ entry of $A^TA$ will be the scalar product of $a_i$ and $a_j$. – Étienne Bézout Sep 08 '16 at 15:47
  • Isn't that what you want? The $ij$ entry of $A^TA$ to be the dot product of $a_i^T$ and $a_j$? – Antoni Parellada Sep 08 '16 at 15:57
  • @AntoniParellada Yes, you can do that, but then you won't get an expression like the one I gave in my answer. See Jean Marie's comment to your original post. – Étienne Bézout Sep 08 '16 at 18:29
  • I'm not sure then... The $a_i^Ta_j$ are outer products, resulting each in a separate matrix? – Antoni Parellada Sep 08 '16 at 18:37
  • @AntoniParellada If $a_i$ are the rows, then $a_i^Ta_j$ are matrices, yes. That's what I do in my answer. Note however that for the matrices, we have $i = j$. If $a_i$ are the columns, $a_i^Ta_j$ are the scalar products of $a_i$ with $a_j$. – Étienne Bézout Sep 08 '16 at 20:30
  • Still working through your answer. If you have a moment, please note the changes in the OP in reference to your answer. Did I get it right? – Antoni Parellada Sep 09 '16 at 03:56
  • @AntoniParellada Yes, it looks right. – Étienne Bézout Sep 09 '16 at 16:22
  • Then the $a_i$ factors should be orthonormal? – Antoni Parellada Sep 09 '16 at 16:24
  • @AntoniParellada No, they don't have to be orthogonal. That's why I removed it from my answer. I simply talk about orthogonal projections, which doesn't mean that the projections are orthogonal to each other, but rather that each projection is orthogonal. It is also possible to have a projection which is not orthogonal. – Étienne Bézout Sep 09 '16 at 16:30
  • I am stuck with a contradiction in "Concept Three" then - see original post (before part in red). – Antoni Parellada Sep 09 '16 at 16:40
  • @AntoniParellada I don't see any contradiction. I'm only assuming that the $a_i$ are unit vectors to get the expression $A^TA = a_1^Ta_1+...+a_n^Ta_n$. If they are also linearly independent, the above is a sum of $n$ projection matrices along $n$ linearly independent directions. – Étienne Bézout Sep 09 '16 at 16:43
  • I thought the beauty of it all was to get to an expression of the form $a_i^\top a_i$, which (and because of the very initial definition of $a_i$ as row vectors) would really correspond to a $\text{column vector } a_i, \text{dotted with }\text {row vector } a_i$, which is a projection matrix, provided the vectors are orthonormal. Where did I misunderstand you? – Antoni Parellada Sep 09 '16 at 16:49
  • The vectors being orthogonal or not does not have any importance for my answer. If $A$ is symmetric, then $a_i^T$ is equal to the $i$-th column of $A$, say $b_i$. Then you get an expression like $A^TA = b_1b_1^T+...+b_nb_n^T$. – Étienne Bézout Sep 09 '16 at 17:00
  • Thank you for your patience. I will keep the question open for a while, because I'm very interested in the answer, and any further elaboration on the implications of your answer, or potentially other contributions, would be welcome. I have researched it, and it doesn't seem like bounties help get more answers. If there are no other changes, I'll come back to accept your answer in a week or two. – Antoni Parellada Sep 09 '16 at 17:03
  • @AntoniParellada Ok, no problem. I might think about it a bit more and see if I come up with something. – Étienne Bézout Sep 09 '16 at 17:06