
Let $q$ be the multivariate normal distribution $\mathcal{N}(\mu, \Sigma)$ and let $x$ be a sample from $q$. Hence $x$ can be written as $$x = \mu + A\epsilon\,, \qquad \Sigma = AA^T\,, \qquad \epsilon \sim \mathcal{N}(0, I)\,,$$ where $I$ is the identity matrix. I am trying to compute $\nabla_{A}\log{q(x)}$.

Now, $$ \nabla_A \log{q(x)} = -\frac{1}{2}\nabla_A \log\det(AA^T) - \frac{1}{2}\nabla_A \epsilon^TA^T(AA^T)^{-1}A\epsilon$$
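
These two terms come from $\log q(x) = -\tfrac{1}{2}\log\det(AA^T) - \tfrac{1}{2}\epsilon^T A^T(AA^T)^{-1}A\epsilon - \tfrac{d}{2}\log 2\pi$, where the last term does not depend on $A$. Here is a minimal NumPy/SciPy sketch checking this decomposition; the dimensions, seed, and variable names are illustrative choices, not from the question.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)           # illustrative seed
d, k = 2, 3                              # A is d x k; AA^T must be invertible
A = rng.standard_normal((d, k))
mu = rng.standard_normal(d)
eps = rng.standard_normal(k)
x = mu + A @ eps                         # reparameterized sample

Sigma = A @ A.T
logq = multivariate_normal(mu, Sigma).logpdf(x)
quad = eps @ A.T @ np.linalg.solve(Sigma, A @ eps)   # eps^T A^T (AA^T)^{-1} A eps
decomp = (-0.5 * np.log(np.linalg.det(Sigma))
          - 0.5 * quad
          - 0.5 * d * np.log(2 * np.pi))
print(np.isclose(logq, decomp))          # True
```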

The first gradient evaluates to $-A^T(AA^T)^{-1}$ (with help from Stack Exchange answers). However, since I don't have formal training in graduate-level calculus (I am a CS student), I don't know how to evaluate the gradient of the second term. Can anybody help?

After reading up a bit on matrix calculus, here is my attempt.

Let $B = A^T(AA^T)^{-1}A$ and $B + \delta B = (A+\delta A)^T\big((A+\delta A)(A+\delta A)^T\big)^{-1}(A+\delta A)$. This implies $$AB = A$$ and $$(A+\delta A)(B + \delta B) = A+\delta A\,.$$ Expanding the last equation and keeping only first-order terms, we get $$A\,\delta B = \delta A\,(I-A^T(AA^T)^{-1}A)\,.$$

I am not sure how to proceed from here. Thanks.
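
One way to get unstuck without expanding the inverse symbolically is to probe the limit definition of the directional derivative numerically, as the comments below suggest. A minimal sketch, assuming NumPy; the sizes, seed, and direction $H$ are arbitrary illustrative choices:

```python
import numpy as np

def f(A, eps):
    """f(A) = eps^T A^T (A A^T)^{-1} A eps."""
    return eps @ A.T @ np.linalg.solve(A @ A.T, A @ eps)

rng = np.random.default_rng(1)
d, k = 2, 3
A = rng.standard_normal((d, k))
eps = rng.standard_normal(k)
H = rng.standard_normal((d, k))          # an arbitrary direction

# (f(A + tH) - f(A)) / t converges to the directional derivative as t -> 0
for t in [1e-2, 1e-4, 1e-6]:
    print((f(A + t * H, eps) - f(A, eps)) / t)
# Note: with d == k (square invertible A), f is constant, so the limit is 0.
```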

  • What happens when you follow the procedure explained in detail à propos your previous question? http://math.stackexchange.com/a/1797287/ – Did May 26 '16 at 11:03
  • I end up with infinitesimal $h$ inside the matrix inverse operation, something like $(AA^T + hVA^T + hAV^T + h^2 VV^T)^{-1}$. – user2808118 May 26 '16 at 11:10
  • Okay, I think I solved it. Thanks. I will post the answer. Based on the comments, I will know if it is correct. – user2808118 May 26 '16 at 11:19
  • No, I am stuck. I don't know how to expand the matrix-inverse operation. – user2808118 May 26 '16 at 11:26
  • $\nabla_A f(A) = \lim_{\epsilon \to 0} \frac{f(A+\epsilon A) - f(A)}{\epsilon}$ . let $f(A) = A^T (A A^T)^{-1} A$. what do you find for $\frac{f( A+\epsilon A)-f(A)}{\epsilon}$ ? – reuns May 26 '16 at 11:39
  • (Horrible sign error in the first version, here is the correct answer.) Recall that the gradient of a function $f$ at some point $A$ in a manifold $M$ is in fact a linear transformation defined on the tangent space $T_AM$ of $M$ at $A$. When $M$ is itself a vector space, say a space of matrices, $T_AM=M$, hence the gradient $\nabla f(A)$ is a linear function $L_A$ defined on the vector space $M$. Applying the method explained in the question mentioned in my first comment to the function $$f(A)=\epsilon^TA^T(AA^T)^{-1}A\epsilon,$$ one wants $$f(A+tH)=f(A)+tL_A(H)+o(t),$$ when $t\to0$, ... – Did May 26 '16 at 13:27
  • ... for every matrix $H$, hence $$L_A(H)=\epsilon^T(S_AH^T(AA^T)^{-1}A+A^T(AA^T)^{-1}HS_A)\epsilon,$$ where $S_A$ denotes the symmetric matrix $$S_A=I-A^T(AA^T)^{-1}A.$$ Note that when $A$ is invertible, $(AA^T)^{-1}=(A^T)^{-1}A^{-1}$, hence $S_A=0$ and $L_A=0$. – Did May 26 '16 at 13:27
  • @Did : tell him that this is $\nabla_H f(A) = \lim_{\epsilon \to 0} \frac{f(A+\epsilon H) - f(A)}{\epsilon}$, the directional derivative in the direction $H$, with $f(A) = A^T (A A^T)^{-1} A$, and that the Fréchet derivative at $A$ is the operator $H \mapsto \nabla_H f(A)$. – reuns May 28 '16 at 01:05
  • @user2808118 The answer you quickly accepted declares that they compute the gradient of $$g(A)=A^T(AA^T)^{-1}A\epsilon\epsilon^T,$$ not $$f(A)=\epsilon^TA^T(AA^T)^{-1}A\epsilon.$$ How come? – Did May 28 '16 at 05:37
  • For the directional derivative derived by you, the corresponding gradient (the matrix form of the linear transformation) is $2 (AA^T)^{-1} A \epsilon \epsilon^T S_A$, which is exactly the same as derived in the accepted answer. Note that the gradient of $g(A)$ would be a fourth-order tensor. – user2808118 May 28 '16 at 05:49
  • I think that your confusion stems from thinking of matrix multiplication as an inner product. The Frobenius inner product between two matrices $A$ and $B$ is just the sum of the entries of $C$, where $C$ is obtained by componentwise multiplication of the elements of $A$ and $B$. – user2808118 May 28 '16 at 06:19
  • No, the differential I computed is not the same as the one proposed in the answer you accepted. Hint: Matrices do not always commute. Now I understand why you saw fit to accept it, unfortunately your conviction that the formulas are equivalent is ill-founded, they are different, one is correct and the other is wrong. (Unrelated: Please use @, unless you do not want your comment to be read by those it is ostensibly addressed to.) – Did May 28 '16 at 15:18
  • @Did : I think I have found the source of confusion. Implicit in the context part of the question is the fact that $\epsilon$ is a vector. The transpose of the accepted answer will be the same as yours: $$2\langle S_A\epsilon \epsilon^TA^T(AA^T)^{-1}, H\rangle = 2\operatorname{tr}\big(S_A\epsilon \epsilon^TA^T(AA^T)^{-1}H \big) = 2\operatorname{tr}\big(\epsilon^TA^T(AA^T)^{-1}HS_A\epsilon \big) = 2\,\epsilon^TA^T(AA^T)^{-1}HS_A\epsilon\,,$$ which is the same as your reply. (Note that the two terms in your answer are equal.) A numerical check of this equivalence is sketched after these comments. – user2808118 May 28 '16 at 16:43
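
As a numerical referee for this discussion, the following sketch (arbitrary illustrative sizes and seed, assuming NumPy) checks that the differential $L_A(H)$ above matches a finite difference of $f$, and that it equals $\langle G, H\rangle$ for the matrix gradient $G = 2\,(AA^T)^{-1}A\,\epsilon\epsilon^T S_A$ quoted in the comments:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k = 2, 4
A = rng.standard_normal((d, k))
eps = rng.standard_normal(k)
H = rng.standard_normal((d, k))

P = np.linalg.inv(A @ A.T)               # (A A^T)^{-1}
S = np.eye(k) - A.T @ P @ A              # S_A = I - A^T (A A^T)^{-1} A

def f(A_):
    return eps @ A_.T @ np.linalg.solve(A_ @ A_.T, A_ @ eps)

# Did's differential L_A(H); its two terms are transposes of each other
L = eps @ (S @ H.T @ P @ A + A.T @ P @ H @ S) @ eps
# The matrix gradient G with L_A(H) = <G, H> = tr(G^T H)
G = 2 * P @ A @ np.outer(eps, eps) @ S

t = 1e-6
print((f(A + t * H) - f(A)) / t)         # numerical directional derivative
print(L)                                 # closed-form differential
print(np.trace(G.T @ H))                 # <G, H>; all three agree
```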

1 Answer


Note that $$A^T(AA^T)^{-1}=A^+$$ is the (right) pseudo-inverse of $A$ when $AA^T$ is invertible, so we can write the function in terms of the pseudo-inverse and the Frobenius ($:$) inner product $$f = ee^T:A^+A$$

Now we can borrow a result from Harville's "Matrix Algebra from a Statistician's Perspective", $$d(A^+A)=2\,{\rm sym}\big(A^+\,dA\,(I-A^+A)\big)\,,$$ to find the differential of the function $$\eqalign{ df &= 2\,ee^T:{\rm sym}\big(A^+\,dA\,(I-A^+A)\big) \cr &= 2\,ee^T:\big(A^+\,dA\,(I-A^+A)\big) \cr &= 2\,(A^+)^Tee^T(I-A^+A):dA \cr }$$ where the ${\rm sym}$ can be dropped because $ee^T$ is symmetric.

Since $df=\big(\frac{\partial f}{\partial A}:dA\big)$, the gradient is $$\eqalign{ \frac{\partial f}{\partial A} &= 2\,(A^+)^Tee^T(I-A^+A) \cr &= 2\,(AA^T)^{-1}A\,\,ee^T\Big(I-A^T(AA^T)^{-1}A\Big) \cr }$$
Update
I just noticed that you wrote the gradient of the first term as $$A^T(AA^T)^{-1}$$ whereas I would write it as $$(AA^T)^{-1}A\,,$$ so you are using a convention that is the transpose of my usual convention.

Which is fine, but you will need to use the transpose of the result above to be consistent with your previous derivation.

hans
  • First, this seems to apply to $$g(A)=A^T(AA^T)^{-1}A\epsilon\epsilon^T,$$ not to $$f(A)=\epsilon^TA^T(AA^T)^{-1}A\epsilon.$$ Second, are you saying that $$g(A+tH)=g(A)+2t\,(AA^T)^{-1}A\,\epsilon\epsilon^T\Big(I-A^T(AA^T)^{-1}A\Big)H+o(t),$$ when $t\to0$, for every fixed $H$? – Did May 28 '16 at 05:36
  • No, I guess what hans is saying is $$g(A+tH) = g(A) + 2t \langle (AA^T)^{-1}A\epsilon \epsilon^T S_A, H\rangle + o(t)$$ where $\langle\cdot,\cdot\rangle$ denotes the (Frobenius) inner product between matrices. – user2808118 May 28 '16 at 05:51
  • I mean $$f(A+tH) = f(A)+ ...$$ – user2808118 May 28 '16 at 05:59
  • Yes, that's right. I'm using the inner product of the matrices. – hans May 28 '16 at 19:13
  • @hans : Thanks for the update. I had realized that you are using the transpose convention. – user2808118 May 29 '16 at 07:04
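
Tying the thread together, this final sketch (arbitrary illustrative sizes, assuming NumPy) checks the answer's gradient $2\,(AA^T)^{-1}A\,ee^T(I-A^+A)$ entrywise against central differences; its transpose is the gradient in the question's convention, per the update above:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k = 2, 3
A = rng.standard_normal((d, k))
e = rng.standard_normal(k)               # the vector called epsilon/e above

def f(A_):
    """f(A) = e^T A^T (A A^T)^{-1} A e."""
    return e @ A_.T @ np.linalg.solve(A_ @ A_.T, A_ @ e)

# Closed form from the answer; A^+ = A^T (A A^T)^{-1} for full row rank
Ap = A.T @ np.linalg.inv(A @ A.T)
grad = 2 * np.linalg.inv(A @ A.T) @ A @ np.outer(e, e) @ (np.eye(k) - Ap @ A)

# Entrywise central differences over the d x k entries of A
num = np.zeros_like(A)
t = 1e-6
for i in range(d):
    for j in range(k):
        E = np.zeros_like(A)
        E[i, j] = t
        num[i, j] = (f(A + E) - f(A - E)) / (2 * t)

print(np.allclose(grad, num, atol=1e-5))  # True
```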