
The task is to prove that for any matrix $X$ and differentiable scalar function $f$, the following holds:

$$ \frac{\partial f(X^\top X)}{\partial X} = 2{X}\frac{\partial f(X^\top X)}{\partial (X^\top X)}. $$

Using the chain rule, I can already write:

$$ \frac{\partial f(X^\top X)}{\partial X} = \frac{\partial f(u)}{\partial X} = \frac{\partial f(u)}{\partial u}\frac{\partial u}{\partial X} = \frac{\partial f(X^\top X)}{\partial (X^\top X)}\frac{\partial X^\top X}{\partial X} $$

where $u$ is a function of $X$ defined by $u(X) = X^{\top}X$.

Now, I have difficulty proving that $\dfrac{\partial (X^{\top}X)}{\partial X} = 2X$.

I know from a previous question, as well as The Matrix Cookbook, that for a vector $\mathbf{x}$ the following holds for the squared Euclidean norm: $\dfrac{\partial \|\mathbf{x}\|_2^2}{\partial \mathbf{x}} = \dfrac{\partial (\mathbf{x}^{\top} \mathbf{x})}{\partial \mathbf{x}} = 2\mathbf{x}$. But how can I go from this to rigorously arguing the same for $\dfrac{\partial f(X^{\top} X)}{\partial X}$? Or is there an even simpler way to do it?
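For what it's worth, the identity does seem to check out numerically. Here is a minimal NumPy sketch (my own, not from any reference) that compares a finite-difference gradient of $f(X^\top X)$ against $2X\,\frac{\partial f(X^\top X)}{\partial (X^\top X)}$, using the arbitrary test function $f(Y)=\operatorname{tr}(Y^2)$, whose gradient $2Y$ is known in closed form:

```python
import numpy as np

# Arbitrary test function f(Y) = tr(Y @ Y); its matrix gradient w.r.t. Y is 2*Y.
f = lambda Y: np.trace(Y @ Y)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
Y = X.T @ X

G = 2 * Y                # gradient of f w.r.t. Y = X^T X (closed form for this f)
claimed = 2 * X @ G      # right-hand side of the identity to be proved

# Central finite-difference gradient of f(X^T X) w.r.t. each entry of X
eps = 1e-6
numeric = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        E = np.zeros_like(X)
        E[i, j] = eps
        numeric[i, j] = (f((X + E).T @ (X + E)) - f((X - E).T @ (X - E))) / (2 * eps)

print(np.allclose(numeric, claimed))   # expected: True
```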

Any hints would be appreciated! Thank you very much.

3 Answers


$ \def\LR#1{\left(#1\right)} \def\op#1{\operatorname{#1}} \def\trace#1{\op{Tr}\LR{#1}} \def\frob#1{\left\| #1 \right\|_F} \def\qiq{\quad\implies\quad} \def\p{\partial} \def\grad#1#2{\frac{\p #1}{\p #2}} \def\c#1{\color{red}{#1}} $The matrix inner product (denoted by a colon) is extremely useful in these sorts of problems.
It is essentially a product notation for the trace $$\eqalign{ A:B &= \sum_{i=1}^m\sum_{j=1}^n A_{ij}B_{ij} \;=\; \trace{A^TB} \\ A:A &= \frob{A}^2 \qquad \{ {\rm Frobenius\;norm} \} \\ }$$ This is also called the double-dot, or Frobenius product.
When applied to vectors $(n={\tt1})$ it reduces to an ordinary dot product.
Terms in a Frobenius product can be rearranged in many useful ways, e.g. $$\eqalign{ A:B &= B:A \\ A:B &= A^T:B^T \\ C:\LR{AB} &= \LR{CB^T}:A &= \LR{A^TC}:B \\ }$$ Define $\:Y=X^TX\:$ and assume that we know the gradient $\:G=\large{\grad fY}$.

Use $G$ to calculate the differential of the function, then perform a change of variables from $Y\to X,\;$ and recover the gradient with respect to $X$ $$\eqalign{ df &= G:dY \\ &= G:\LR{dX^TX+X^TdX} \\ &= 2\,G:\LR{X^TdX} \\ &= \LR{2XG}:dX \\ \grad fX &= 2XG \\ }$$ (The third line uses the rule $A:B=A^T:B^T$ together with the fact that $G$ is symmetric, since $Y=X^TX$ is.)
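As a sanity check, here is a small NumPy sketch of the same derivation (my addition, with $f(Y)=\operatorname{tr}(Y^2)$ as a stand-in test function, so $G=2Y$ in closed form): it verifies one of the rearrangement rules and confirms that $df \approx G:dY \approx (2XG):dX$ for a small perturbation $dX$.

```python
import numpy as np

frob = lambda A, B: np.trace(A.T @ B)    # the Frobenius product A : B

rng = np.random.default_rng(1)

# Rearrangement rule check: C:(AB) = (C B^T):A
A, B, C = (rng.standard_normal((3, 3)) for _ in range(3))
print(np.isclose(frob(C, A @ B), frob(C @ B.T, A)))     # expected: True

# Differential check with f(Y) = tr(Y @ Y), whose gradient is G = 2*Y
f = lambda Y: np.trace(Y @ Y)
X  = rng.standard_normal((5, 3))
dX = 1e-7 * rng.standard_normal((5, 3))  # small perturbation of X
Y  = X.T @ X
G  = 2 * Y

df = f((X + dX).T @ (X + dX)) - f(Y)     # actual change in f (first order in dX)
dY = dX.T @ X + X.T @ dX                 # differential of Y = X^T X
print(np.isclose(df, frob(G, dY), rtol=1e-4))           # df = G : dY
print(np.isclose(df, frob(2 * X @ G, dX), rtol=1e-4))   # df = (2XG) : dX
```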

lynn
  • 3,396

I'm not sure what the notation $\frac{\partial \alpha (X)}{\partial X}$ means, but you can write $f(X^\top X)=(f\circ g\circ h)(X)$ for $h(X):=(X^\top ,X)$ and $g(X,Y):=XY$. Then, using the Fréchet derivative, the chain rule gives, for every matrix $H$, $$ \begin{align*} \partial [f(X^\top X)]H&=(\partial f\circ g\circ h)(X)(\partial g\circ h)(X)\partial h(X)H\\ &=\partial f(X^\top X)\partial g(X^\top ,X)\partial h(X)H\\ &=\partial f(X^\top X)\partial g(X^\top ,X)(H^\top ,H)\\ &=\partial f(X^\top X)(X^\top H+H^\top X) \end{align*} $$
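A quick numerical illustration of this chain rule (my own sketch, using the hypothetical test function $f(Y)=\operatorname{tr}(Y^2)$, whose Fréchet derivative at $Y$ is the linear map $K\mapsto 2\operatorname{tr}(YK)$): the directional derivative of $f(X^\top X)$ along $H$ should match $\partial f(X^\top X)(X^\top H+H^\top X)$.

```python
import numpy as np

# Test function f(Y) = tr(Y @ Y); its Fréchet derivative at Y is K -> 2*tr(Y @ K)
f  = lambda Y: np.trace(Y @ Y)
Df = lambda Y, K: 2 * np.trace(Y @ K)

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
H = rng.standard_normal((5, 3))   # an arbitrary direction

# Chain-rule prediction for the directional derivative along H
predicted = Df(X.T @ X, X.T @ H + H.T @ X)

# Central finite-difference directional derivative
t = 1e-6
numeric = (f((X + t * H).T @ (X + t * H)) - f((X - t * H).T @ (X - t * H))) / (2 * t)

print(np.isclose(predicted, numeric))   # expected: True
```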

Hope it helps.

Masacroso
  • 30,417

Before proving this claim, it is important to clarify the notation. Given a function $f:M_{m\times n}(\Bbb C)\to \Bbb C$, we define its derivative $\frac{\partial f(X)}{\partial X}$ as another $m\times n$ matrix $D_f=[d_{i,j}]_{m\times n}$ such that $$ d_{i,j}=\frac{\partial f(X)}{\partial x_{i,j}}, $$ where the differentiation is performed w.r.t. the entries of $X=[x_{i,j}]_{m\times n}$. Now, defining $Y=X^TX=[y_{i,j}]_{n\times n}$, we have $y_{i,j}=\sum_{k=1}^m x_{k,i}x_{k,j}$, and we can write $$ \begin{align*} \frac{\partial f(X^TX)}{\partial x_{i,j}} &= \sum_{\mu=1}^n\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\frac{\partial y_{\mu,\nu}}{\partial x_{i,j}} \\ &= \sum_{\mu=1}^n\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^m x_{\kappa,\mu}x_{\kappa,\nu} \\ &= \sum_{\substack{\mu=1\\ \mu\ne j}}^n\sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{\mu,\nu}}\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^m x_{\kappa,\mu}x_{\kappa,\nu} \\ &\quad+ \sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^m x_{\kappa,j}x_{\kappa,\nu} \\ &\quad+ \sum_{\substack{\mu=1\\ \mu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{\mu,j}}\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^m x_{\kappa,\mu}x_{\kappa,j} \\ &\quad+ \frac{\partial f(X^TX)}{\partial y_{j,j}}\frac{\partial}{\partial x_{i,j}}\sum_{\kappa=1}^m x_{\kappa,j}^2 \\ &= 0 + \sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}x_{i,\nu} + \sum_{\substack{\mu=1\\ \mu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{\mu,j}}x_{i,\mu} + \frac{\partial f(X^TX)}{\partial y_{j,j}}2x_{i,j} \\ &= \sum_{\substack{\nu=1\\ \nu\ne j}}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}2x_{i,\nu} + \frac{\partial f(X^TX)}{\partial y_{j,j}}2x_{i,j} \\ &= \sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}2x_{i,\nu}. \end{align*} $$ The second-to-last equality merges the two single sums using the symmetry $y_{\mu,\nu}=y_{\nu,\mu}$, which gives $\frac{\partial f(X^TX)}{\partial y_{\mu,j}}=\frac{\partial f(X^TX)}{\partial y_{j,\mu}}$. From here, the $\frac{\partial f(X^TX)}{\partial y_{i,j}}$ are the entries of $\frac{\partial f(X^TX)}{\partial (X^TX)}$. Therefore, by the definition of matrix multiplication, the $\sum_{\nu=1}^n\frac{\partial f(X^TX)}{\partial y_{j,\nu}}2x_{i,\nu}$ are the entries of $2X\frac{\partial f(X^TX)}{\partial (X^TX)}$, and the proof is complete. $\blacksquare$
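To make the index bookkeeping concrete, here is a short NumPy sketch (my addition, again with the test function $f(Y)=\operatorname{tr}(Y^2)$, for which $\frac{\partial f}{\partial y_{\mu,\nu}}=2y_{\nu,\mu}$ in closed form) that builds $d_{i,j}=\sum_{\nu}\frac{\partial f}{\partial y_{j,\nu}}2x_{i,\nu}$ entry by entry and compares it with a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 5, 3
X = rng.standard_normal((m, n))
Y = X.T @ X

# Test function f(Y) = tr(Y @ Y); entrywise gradient df/dy_{mu,nu} = 2*y_{nu,mu}
f = lambda Y: np.trace(Y @ Y)
dfdY = 2 * Y.T

# Build d_{i,j} = sum_nu (df/dy_{j,nu}) * 2 * x_{i,nu}, exactly as in the final sum
D = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        D[i, j] = sum(dfdY[j, nu] * 2 * X[i, nu] for nu in range(n))

# Central finite-difference gradient of f(X^T X) w.r.t. each x_{i,j}
eps = 1e-6
numeric = np.zeros((m, n))
for i in range(m):
    for j in range(n):
        E = np.zeros((m, n))
        E[i, j] = eps
        numeric[i, j] = (f((X + E).T @ (X + E)) - f((X - E).T @ (X - E))) / (2 * eps)

print(np.allclose(D, numeric))   # expected: True
```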

Mostafa Ayaz
  • 31,924
  • Thank you very much for the detailed answer! Could you briefly explain what you are summing up from line 3 to line 6 of your calculation? I can't quite follow your indexing. Thanks in advance! – That1WasTaken May 14 '23 at 14:50
  • Well, the double summation is over $\mu$ and $\nu$ for $(\mu,\nu)\in\{1,2,\cdots ,n\}^2$ in the 3rd line. I split this double summation into four cases, where $\mu$ and $\nu$ are both unequal to $j$, only $\mu$ is equal to $j$, only $\nu$ is equal to $j$, or both $\mu$ and $\nu$ are equal to $j$. – Mostafa Ayaz May 14 '23 at 14:56
  • I see, thank you, but what is the reason for splitting it up into four cases? – That1WasTaken May 14 '23 at 16:17
  • The reason is that the derivative takes a different form in each of these four cases. – Mostafa Ayaz May 14 '23 at 16:57