Let $f:\mathbb{R}^{n \times n} \to \mathbb{R}$ be defined as: $$ f(A)= x^T (A^2)^i y + v^T A^i w, $$ where $i \in \mathbb{N}$ and $x,y,v,w$ are some fixed column vectors. One can assume that $A$ is a symmetric matrix. I am interested in computing the gradient of $f$ with respect to $A$. The only rule that I know is: $$ \frac{\partial x^T (A^TA)y }{\partial A}=A(xy^T+yx^T). $$ Can anyone help me to find the derivative $\frac{\partial f(A)}{\partial A}$?
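For concreteness, here is a quick finite-difference check of that one rule (a minimal sketch assuming NumPy; the random test data is mine and not part of the problem):

```python
import numpy as np

# Finite-difference check of the rule d/dA [x^T (A^T A) y] = A(x y^T + y x^T).
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
x, y = rng.standard_normal(n), rng.standard_normal(n)

def f(A):
    return x @ A.T @ A @ y

eps = 1e-6
num = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        E = np.zeros((n, n)); E[a, b] = eps   # perturb only entry A_{ab}
        num[a, b] = (f(A + E) - f(A - E)) / (2 * eps)

exact = A @ (np.outer(x, y) + np.outer(y, x))
print(np.max(np.abs(num - exact)))  # ~1e-9: the stated rule checks out
```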
-
Shouldn't that rule give $2x^TAy$, as your matrix is out of place? – Triatticus Nov 28 '16 at 15:33
-
Comparing with my answer, I'd get your formula as $Axy^T + xy^T A$; since $(xy^T A)^T=A yx^T$, our answers disagree on whether the second term is transposed or not. I don't know which of us is right... – Semiclassical Nov 28 '16 at 16:39
-
Weirder: If I work in index notation, I get a different result for the $A$-derivative of $x^T(A^T A)y$ than $x^T A^2 y$. But that doesn't make sense, since $A^T=A$ here. – Semiclassical Nov 28 '16 at 16:54
1 Answer
Warning: Lots of tedious algebra below, so there's definitely room for errors. I'll see if I can validate this result by other means.
To make the notation less confusing, I'll write the exponent as $N$ rather than $i$ so that I can use lower-case Latin letters such as $i$ for indices. I will also adopt the Einstein convention, i.e. doubled indices are to be summed over.
First, let's translate $f(A)$ into a sum over indices. Inserting dummy indices for each of the matrix multiplications yields \begin{align} v^T A^N w &=v_k (A^N)_{kl}w_l \\ &=v_k A_{kj_1}A_{j_1j_2}\cdots A_{j_{N-1}l}w_l,\\\\ x^T (A^2)^N y &=x_k (A^2)^N_{kl}y_l\\ &=x_k(A^2)_{k j_1}(A^2)_{j_1j_2}\cdots(A^2)_{j_{N-1}l}y_l\\ &=x_kA_{ki_1}A_{i_1j_1}A_{j_1i_2}A_{i_2j_2}\cdots A_{j_{N-1}i_N}A_{i_N l}y_l. \end{align}
We can now differentiate with respect to a matrix element $A_{ab}$. We have $(\partial A_{ij}/\partial A_{ab})=\delta_{ia}\delta_{jb}$, so the linear term gives
\begin{align} \frac{\partial}{\partial A_{ab}}\left(v^T A^N w\right) &=v_k (\delta_{ka}\delta_{j_1b})A_{j_1j_2}\cdots A_{j_{N-1}l}w_l+\cdots+v_k A_{kj_1}A_{j_1j_2}\cdots (\delta_{j_{N-1}a}\delta_{lb})w_l \\ &=v_aA_{bj_1}\cdots A_{j_{N-1}l}w_l+\cdots+v_k A_{kj_1}A_{j_1j_2}\cdots A_{j_{N-2}a}w_b \\ &=(v^T)_a (A^{N-1} w)_b+(v^T A)_a(A^{N-2}w)_b+\cdots+(v^T A^{N-1})_a (w)_b\\ &=(v)_a (w^T A^{N-1})_b+(Av)_a(w^T A^{N-2})_b+\cdots+(A^{N-1}v)_a (w^T)_b. \end{align}
where in the last line I have both used $A^T=A$ and swapped column vectors with row vectors (and vice versa). Similarly, for the quadratic term (going directly to the result) we obtain
\begin{align} \frac{\partial}{\partial A_{ab}}\left(x^T (A^2)^N y\right) &=(x^T)_a (A^{2N-1}y)_b+(x^T A)_a (A^{2N-2}y)_b+\cdots +(x^T A^{2N-1})_a (y)_b\\ &=x_a (y^T A^{2N-1})_b+(Ax)_a (y^T A^{2N-2})_b+\cdots +(A^{2N-1}x)_a (y^T)_b. \end{align}
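Equivalently, both results are Leibniz-type sums: $$\frac{\partial}{\partial A}\left(x^T (A^2)^N y\right)=\sum_{m=0}^{2N-1}A^{m}\,xy^{T}A^{2N-1-m},\qquad \frac{\partial}{\partial A}\left(v^T A^N w\right)=\sum_{m=0}^{N-1}A^{m}\,vw^{T}A^{N-1-m}.$$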
Since $\left(\frac{\partial}{\partial A} f(A)\right)_{ab}=\frac{\partial}{\partial A_{ab}} f(A)$, we can combine these two terms and place them in matrix form: $$\boxed{\frac{\partial}{\partial A} f(A)=\left(x y^T A^{2N-1}+A x y^T A^{2N-2}+\cdots+A^{2N-1} xy^T\right)\\\hspace{2cm}+ \left( vw^T A^{N-1}+Avw^T A^{N-2}+\cdots+A^{N-1}vw^T \right).}$$
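One way to validate this result by other means is a finite-difference comparison against the boxed formula. Below is a minimal sketch assuming NumPy; the symmetric test matrix and the helper `grad` are mine, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 4, 3
S = rng.standard_normal((n, n))
A = (S + S.T) / 2                      # symmetric, as assumed in the question
x, y, v, w = (rng.standard_normal(n) for _ in range(4))

def f(A):
    # f(A) = x^T (A^2)^N y + v^T A^N w
    return x @ np.linalg.matrix_power(A, 2 * N) @ y \
         + v @ np.linalg.matrix_power(A, N) @ w

def grad(A):
    # Boxed formula: sum_m A^m x y^T A^{2N-1-m} + sum_m A^m v w^T A^{N-1-m}
    P = lambda k: np.linalg.matrix_power(A, k)
    G = sum(P(m) @ np.outer(x, y) @ P(2 * N - 1 - m) for m in range(2 * N))
    G = G + sum(P(m) @ np.outer(v, w) @ P(N - 1 - m) for m in range(N))
    return G

# Entrywise finite differences, perturbing A_{ab} alone, matching the
# convention (dA_{ij}/dA_{ab}) = delta_{ia} delta_{jb} used above.
eps = 1e-6
num = np.zeros((n, n))
for a in range(n):
    for b in range(n):
        E = np.zeros((n, n)); E[a, b] = eps
        num[a, b] = (f(A + E) - f(A - E)) / (2 * eps)

print(np.max(np.abs(num - grad(A))))   # small (~1e-8), confirming the formula
```

Note the perturbation $A+E$ is not symmetric, but since the base point $A$ is, the entrywise partials agree with the boxed formula evaluated there.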
