Let $X$ be a square matrix.

We know that $\frac {\partial \mathrm{tr}(X^TX)}{\partial X}$ is $2X$.

But what about $\frac {\partial \mathrm{tr}((X^TX)^2)}{\partial X}$, or even $\frac {\partial \mathrm{tr}((X^TX)^p)}{\partial X}$?

Is there a generalization?

Note that here $(X^TX)^2 = X^TXX^TX$, and similarly $(X^TX)^p$ denotes the $p$-fold matrix product of $X^TX$ with itself.

Shuchang

2 Answers

Meaning of the gradient

The first thing is to be clear about what is meant by the gradient of a matrix function. The gradient, viewed as a linear functional $f'(X)$ acting on elements $V$ of the underlying space, is unambiguously defined by the limit of finite differences, $$f'(X) \circ V = \lim_{s \rightarrow 0} \frac{1}{s}\left[f(X+sV) - f(X) \right].$$ However, if you view the gradient not as a function but rather as a vector $G(X)$, the elements of that vector depend on the inner product $\langle \cdot, \cdot \rangle$ of your space. $G(X)$ is the unique vector such that $$f'(X) \circ V = \langle G(X),V\rangle$$ for all $V$. If you change the inner product, the entries of your gradient vector will change, but in such a way that its action through the inner product stays the same.
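The defining relation can be checked numerically. The following is a minimal sketch, assuming NumPy and using the function $\mathrm{tr}(X^TX)$ from the question with its known gradient $2X$; the helper name `directional_derivative` and the central-difference stencil (in place of the one-sided limit) are illustrative choices, not part of the original answer.

```python
import numpy as np

def directional_derivative(f, X, V, s=1e-6):
    """Central-difference estimate of f'(X) applied to the direction V."""
    return (f(X + s * V) - f(X - s * V)) / (2 * s)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 3))
V = rng.standard_normal((3, 3))

f = lambda A: np.trace(A.T @ A)

lhs = directional_derivative(f, X, V)  # f'(X) applied to V
rhs = np.sum(2 * X * V)                # <G(X), V> with G(X) = 2X (Frobenius)
print(lhs, rhs)
```

The two printed numbers should agree up to rounding error for any direction `V`.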

The standard inner product on matrix spaces is variously called the Frobenius inner product, the vectorization inner product, and the Hilbert-Schmidt inner product of real matrices. These names refer to varying levels of generalization of the same thing: $$\langle A,B \rangle = \sum_{ij}A_{ij}B_{ij} =\mathrm{vec}(A)^T\mathrm{vec}(B) = \mathrm{tr}(A^TB).$$
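As a quick sanity check of the three equivalent forms, here is a small sketch assuming NumPy, with $\mathrm{vec}$ taken to be column stacking (`order="F"`); the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))

entrywise = np.sum(A * B)                               # sum_ij A_ij B_ij
vectorized = A.flatten(order="F") @ B.flatten(order="F")  # vec(A)^T vec(B)
trace_form = np.trace(A.T @ B)                          # tr(A^T B)

print(entrywise, vectorized, trace_form)
```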

You could have other inner products, but they would just be the same basic inner product with some symmetric positive definite "mass matrix" $M$ inserted into it, $$\langle A,B \rangle_M =\mathrm{vec}(A)^TM\mathrm{vec}(B).$$
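To illustrate how the gradient vector changes while its action stays fixed, here is a hedged sketch assuming NumPy. The mass matrix `M` is a made-up symmetric positive definite matrix, and the function is $\mathrm{tr}(X^TX)$ with Frobenius gradient $2X$. Requiring $\mathrm{vec}(G_M)^TM\,\mathrm{vec}(V) = \mathrm{vec}(G)^T\mathrm{vec}(V)$ for all $V$ gives $\mathrm{vec}(G_M) = M^{-1}\mathrm{vec}(G)$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 3
X = rng.standard_normal((n, n))
V = rng.standard_normal((n, n))

# Hypothetical mass matrix: any symmetric positive definite matrix would do.
B = rng.standard_normal((n * n, n * n))
M = B @ B.T + n * n * np.eye(n * n)

vec = lambda A: A.flatten(order="F")  # column-stacking vectorization

# Frobenius gradient of f(X) = tr(X^T X) is 2X.
g_frobenius = vec(2 * X)
# Gradient vector with respect to <.,.>_M solves M g_M = g_frobenius.
g_mass = np.linalg.solve(M, g_frobenius)

action_frobenius = g_frobenius @ vec(V)  # <G, V>
action_mass = g_mass @ M @ vec(V)        # <G_M, V>_M
print(action_frobenius, action_mass)
```

The entries of `g_mass` differ from those of `g_frobenius`, yet both reproduce the same directional derivative.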

Sometimes one defines the trace in terms of the inner product (rather than as a sum of entries), in which case the situation is automatically consistent such that $$\langle A,B \rangle_M =\mathrm{tr}_M(A^TB).$$


Derivative of $\mathrm{tr}\left((X^TX)^2\right)$

Here we use the Frobenius inner product for the gradient calculation, understanding that if the inner product changes, then the vectorized gradient changes by multiplication with the inverse of the mass matrix.

For notation, let's call the original overall function $f$, $$f(X):=\mathrm{tr}(X^TXX^TX).$$

Based on the linearity of the trace and the product rule for matrices, it is straightforward to evaluate the derivative of $f$ in any given direction $V$, $$f'(X)\circ V = \mathrm{tr}(V^TXX^TX + X^TVX^TX + X^TXV^TX + X^TXX^TV).$$

However, in this form we only have the action of $f'$ as a function. To get the elements, we need to somehow gather all of the $V$'s together so that $$f'(X) \circ V = \mathrm{tr}(V^T \mathrm{[something]}),$$

then whatever is left is the gradient in matrix form, ready to be applied. I.e., $$G(X) = \mathrm{[something]}$$ so that $$f'(X) \circ V = \langle G(X), V\rangle.$$

To gather all of the $V$'s into one place and transpose them as required, we're going to need two "moves". The trace is

  1. invariant under cyclic permutations, $\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA)$, and
  2. invariant under transposes $\mathrm{tr}(A) = \mathrm{tr}(A^T).$
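As a warm-up, the same gathering procedure applied to the simpler function $\mathrm{tr}(X^TX)$ recovers the result quoted in the question. Here the transpose move alone is enough: \begin{align} \lim_{s \rightarrow 0} \frac{1}{s}\left[\mathrm{tr}((X+sV)^T(X+sV)) - \mathrm{tr}(X^TX)\right] &= \mathrm{tr}(V^TX + X^TV) \\ &= \mathrm{tr}(V^TX) + \mathrm{tr}((X^TV)^T) \\ &= \mathrm{tr}(V^T[2X]), \end{align} so the gradient of $\mathrm{tr}(X^TX)$ is $2X$.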

Using these two moves yields, \begin{align} f'(X)\circ V &= \mathrm{tr}(V^TXX^TX + X^TVX^TX + X^TXV^TX + X^TXX^TV) \\ &= \mathrm{tr}(V^TXX^TX + X^TXX^TV + V^TXX^TX + X^TXX^TV) \\ &= \mathrm{tr}(V^TXX^TX + V^TXX^TX + V^TXX^TX + V^TXX^TX) \\ &= \mathrm{tr}(V^T [4XX^TX]), \end{align}

and so $$G(X) = 4XX^TX.$$
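The formula $4XX^TX$ can be compared against an entrywise finite-difference gradient. This is a sketch assuming NumPy; `numerical_gradient` is an illustrative helper, not a library function.

```python
import numpy as np

def numerical_gradient(f, X, h=1e-6):
    """Entrywise central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = 1.0  # perturb a single entry of X
        G[idx] = (f(X + h * E) - f(X - h * E)) / (2 * h)
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 3))

f = lambda A: np.trace(A.T @ A @ A.T @ A)

G_numeric = numerical_gradient(f, X)
G_formula = 4 * X @ X.T @ X
print(np.max(np.abs(G_numeric - G_formula)))
```

The printed maximum discrepancy should be small compared with the size of the gradient entries.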


Derivative of $\mathrm{tr}\left((X^TX)^p\right)$

For the general case, with $f(X):=\mathrm{tr}((X^TX)^p)$, we have \begin{align} f'(X) \circ V =& \mathrm{tr}\left(\sum_{k=1}^p (X^TX)^{k-1}(V^TX)(X^TX)^{p-k} + \sum_{k=1}^p (X^TX)^{k-1}(X^TV)(X^TX)^{p-k}\right) \\ =& \mathrm{tr}\left(\sum_{k=1}^p (V^TX)(X^TX)^{p-k}(X^TX)^{k-1} + \sum_{k=1}^p (X^TX)^{p-k}(X^TX)^{k-1}(X^TV)\right) \\ =& \mathrm{tr}\left(\sum_{k=1}^p 2 V^TX(X^TX)^{p-1}\right) \\ =& 2\mathrm{tr}\left( V^T[pX(X^TX)^{p-1}]\right), \end{align}

and so $$G(X) = 2pX(X^TX)^{p-1}.$$
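A hedged numerical check of $2pX(X^TX)^{p-1}$ for several values of $p$, assuming NumPy. It compares a central-difference directional derivative with $\langle G(X), V\rangle$ for a random direction $V$; the function names are illustrative.

```python
import numpy as np
from numpy.linalg import matrix_power, norm

def f(X, p):
    return np.trace(matrix_power(X.T @ X, p))

def gradient_formula(X, p):
    return 2 * p * X @ matrix_power(X.T @ X, p - 1)

def directional_derivative(X, V, p, s=1e-6):
    """Central-difference estimate of f'(X) applied to V."""
    return (f(X + s * V, p) - f(X - s * V, p)) / (2 * s)

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 3))
V = rng.standard_normal((3, 3))

scaled_errors = []
for p in range(1, 5):
    G = gradient_formula(X, p)
    numeric = directional_derivative(X, V, p)
    exact = np.sum(G * V)  # Frobenius inner product <G, V>
    # Scale the discrepancy by ||G|| ||V|| so it is comparable across p.
    scaled_errors.append(abs(numeric - exact) / (norm(G) * norm(V)))
print(scaled_errors)
```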


Generalization to smooth functions of $X^TX$

If you wanted to go further, you can use the result for $(X^TX)^p$ and linearity of the gradient to find the gradient of $\mathrm{tr}(q(X^TX))$ for any polynomial $q(x) = \sum_k c_kx^k$, yielding $$G(X) = \sum_k 2 c_k k X(X^TX)^{k-1} = 2Xq'(X^TX).$$

Then, using polynomial approximation, you can extend the result to $\mathrm{tr}(f(X^TX))$ for general smooth functions $f$, $$G(X) = 2Xf'(X^TX).$$
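As an illustration with a non-polynomial function, the sketch below takes $f = \exp$ and checks the gradient of $X \mapsto \mathrm{tr}(\exp(X^TX))$ against $2X\exp(X^TX)$, the form consistent with the power rule $2pX(X^TX)^{p-1}$ derived above. It assumes NumPy; the matrix function is evaluated through the eigendecomposition of the symmetric matrix $X^TX$, and the helper names are illustrative.

```python
import numpy as np
from numpy.linalg import norm

def sym_matrix_function(S, scalar_fn):
    """Apply scalar_fn to a symmetric matrix S via its eigendecomposition."""
    w, Q = np.linalg.eigh(S)
    return (Q * scalar_fn(w)) @ Q.T  # Q diag(scalar_fn(w)) Q^T

def f(X):
    # tr(exp(X^T X)) is the sum of exp over the eigenvalues of X^T X.
    return np.sum(np.exp(np.linalg.eigvalsh(X.T @ X)))

def numerical_gradient(f, X, h=1e-6):
    """Entrywise central-difference gradient."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = 1.0
        G[idx] = (f(X + h * E) - f(X - h * E)) / (2 * h)
    return G

rng = np.random.default_rng(5)
X = 0.5 * rng.standard_normal((3, 3))  # scaled down to keep exp moderate

G_formula = 2 * X @ sym_matrix_function(X.T @ X, np.exp)  # exp' = exp
G_numeric = numerical_gradient(f, X)
rel_err = norm(G_numeric - G_formula) / norm(G_formula)
print(rel_err)
```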

If you did this whole procedure with $X^p$ instead of $(X^TX)^p$, that is, for functions of the form $\mathrm{tr}(f(X))$, you would have come up with the theorem, $$G(X) = f'(X^T).$$
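For instance, with $f(x) = x^p$ the theorem gives the gradient of $\mathrm{tr}(X^p)$ as $p(X^T)^{p-1}$. A minimal check, assuming NumPy and the arbitrary choice $p = 3$:

```python
import numpy as np
from numpy.linalg import matrix_power

p = 3
rng = np.random.default_rng(6)
X = rng.standard_normal((3, 3))
V = rng.standard_normal((3, 3))

f = lambda A: np.trace(matrix_power(A, p))

# f'(X^T) with f(x) = x^p, i.e. p (X^T)^(p-1)
G_formula = p * matrix_power(X.T, p - 1)

s = 1e-6
numeric = (f(X + s * V) - f(X - s * V)) / (2 * s)  # central difference
exact = np.sum(G_formula * V)                      # <G(X), V>
print(numeric, exact)
```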


Finally, if you are working with matrix calculations and derivatives, the Matrix Cookbook is an excellent reference. If you want to dive deep into the theory behind this sort of thing, the notes Trace, Metric, and Reality: Notes on Abstract Linear Algebra are great.

Nick Alger
  • How about generalizing to $f(U)$ where $f$ is smooth and $U$ depends on $X$? Do we get $\frac{dU}{dX}f'(U^T)$? – Thomas Ahle Mar 16 '14 at 09:59
  • I think in the case $X^TX$ you lost a factor of $2$. For example, in the last second formula, it should be $G(X)=2Xf'(X^TX)$. – Xiang Yu Dec 20 '18 at 18:27

When $p=2$, \begin{align*} &\left[(X+\Delta X)^\top(X+\Delta X)\right]^2 - (X^\top X)^2\\ =&\left[(X+\color{red}{\Delta X})^\top(X+\color{green}{\Delta X})(X+\color{blue}{\Delta X})^\top(X+\color{orange}{\Delta X})\right] - (X^\top X)^2\\ =&\color{red}{\Delta X}^\top X(X^\top X) +X^\top \color{green}{\Delta X} (X^\top X) +(X^\top X)\color{blue}{\Delta X}^\top X +(X^\top X)X^\top \color{orange}{\Delta X}+O(\|\Delta X\|^2). \end{align*} Therefore, using the properties $\newcommand{\tr}{\operatorname{tr}}\tr(AB)=\tr(BA)$ and $\tr(A^\top)=\tr(A)$, we get $$ \tr\left\{\left[(X+\Delta X)^\top(X+\Delta X)\right]^2 - (X^\top X)^2\right\} = 4\tr\left(\Delta X^\top X(X^\top X)\right) +O(\|\Delta X\|^2), $$ and hence $\dfrac{\partial \tr\left((X^\top X)^2\right)}{\partial X} = 4 X(X^\top X)$. By a similar argument, one can deduce that $\dfrac{\partial \tr\left((X^\top X)^p\right)}{\partial X} = 2p X(X^\top X)^{p-1}$.

user1551