I would like to write down the update rule for a set of neural-network parameters trained to minimize a loss function that I think is general enough to be instructive for others.
Let $\Phi \in \mathbb{R}^{l \times m \times n}$ be an $l \times m \times n$ tensor of learnable parameters and $\mathscr{L}(\Phi)$ be a scalar loss function of those parameters to be minimized:
$$\mathscr{L} = \beta\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{n}|\Phi_{i}^{\top}\Phi_{i} - \mathbb{I}_{\text{n}}|_{jk},$$
where $|\cdot|$ is the element-wise absolute value, $\beta$ is some scalar constant, $\Phi_{i}$ is an $l \times n$ matrix, and $\mathbb{I}_{\text{n}}$ is the $n \times n$ identity matrix. I would like to know the derivative of this loss with respect to an $l$-dimensional vector: $\frac{\partial \mathscr{L}}{\partial \Phi_{ab}}$, where $a$ and $b$ index the $m$ and $n$ dimensions of $\Phi$, respectively.
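For concreteness, here is a minimal NumPy sketch of the loss as I have written it above; the dimensions, the random seed, the value of `beta`, and the name `Phi` are arbitrary choices of mine, not from any particular codebase:

```python
import numpy as np

l, m, n = 5, 3, 4                       # arbitrary small dimensions
beta = 0.1                              # arbitrary scalar constant
rng = np.random.default_rng(0)
Phi = rng.standard_normal((l, m, n))    # learnable parameters, l x m x n

def loss(Phi, beta):
    """L = beta * sum_i sum_{j,k} |Phi_i^T Phi_i - I_n|_{jk}."""
    total = 0.0
    for i in range(Phi.shape[1]):
        X = Phi[:, i, :]                      # the l x n slice Phi_i
        M = X.T @ X - np.eye(Phi.shape[2])    # n x n residual
        total += np.abs(M).sum()              # element-wise absolute value, summed over j, k
    return beta * total

print(loss(Phi, beta))
```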
Following the chain rule described in chapter 18 of Matrix Differential Calculus by Magnus and Neudecker, I can use differentials to get most of the way there. Specifically, I can modify example 18.6a to let $F(X) = |X^{\top}X|$ for some $X \in \mathbb{R}^{l \times n}$, where again $|\cdot|$ is the element-wise absolute value, not the determinant. Then,
\begin{align} \text{d}F &= \text{d}|X^{\top}X| \\ &= \frac{X^{\top}X}{|X^{\top}X|} \odot \text{d}(X^{\top}X) \\ &= \frac{X^{\top}X}{|X^{\top}X|} \odot \left((\text{d}X)^{\top}X\right) + \frac{X^{\top}X}{|X^{\top}X|} \odot \left(X^{\top} \text{d}X\right) \\ &= 2\, \frac{X^{\top}X}{|X^{\top}X|} \odot \left(X^{\top}\text{d}X\right), \end{align}
where $\odot$ is the element-wise (Hadamard) product, and the last step relies on the sign factor being symmetric, so the two terms (which are transposes of each other) contribute equally once the entries are summed, as they are in $\mathscr{L}$.
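To convince myself that the factor of 2 in the last line really does survive once the entries are summed, I compared the implied matrix gradient of one term of the loss, $2X\,\mathrm{sign}(X^{\top}X-\mathbb{I}_{\text{n}})$ (the sign taken element-wise, playing the role of $\frac{\cdot}{|\cdot|}$), against central finite differences. This is my own sketch with arbitrary dimensions, and it assumes a generic $X$ for which no entry of $X^{\top}X-\mathbb{I}_{\text{n}}$ is exactly zero, since the absolute value is not differentiable there:

```python
import numpy as np

rng = np.random.default_rng(1)
l, n = 5, 4
X = rng.standard_normal((l, n))

def f(X):
    # one term of the loss with beta = 1: sum of |X^T X - I| over all entries
    return np.abs(X.T @ X - np.eye(X.shape[1])).sum()

# gradient implied by the differential: 2 * X @ sign(X^T X - I), sign element-wise
S = np.sign(X.T @ X - np.eye(n))
grad_claimed = 2.0 * X @ S

# central finite differences, one entry of X at a time
eps = 1e-6
grad_fd = np.zeros_like(X)
for p in range(l):
    for q in range(n):
        E = np.zeros_like(X)
        E[p, q] = eps
        grad_fd[p, q] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_claimed)))   # small (~1e-8) for generic X
```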
The book also provides an identification theorem for connecting differentials to derivatives: $$\text{d} \text{vec}F = A(X) \text{d} \text{vec}X \iff \frac{\partial\text{vec}F(X)}{\partial(\text{vec}X)^{\top}} = A(X),$$ where $\text{vec}$ is the matrix vectorization operator. I believe I can now use the chain rule to get close to my desired derivative if I set $F=|X^{\top}X-\mathbb{I}_{\text{n}}|$ and $X=\Phi_{i}$: \begin{align} \frac{\partial\mathscr{L}}{\partial(\text{vec}\Phi_{i})^{\top}} &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} \frac{\partial\text{vec}F}{\partial(\text{vec}\Phi_{i})^{\top}} \\ &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} 2 \frac{\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}}{|\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}|} \Phi_{i}^{\top} \end{align}
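As a sanity check on how the identification theorem works on the differentiable part of this problem, here is a numerical comparison for the simpler map $F(X)=X^{\top}X$. The commutation-matrix form of its Jacobian, $(\mathbb{I}_{n^{2}} + K_{nn})(\mathbb{I}_{\text{n}} \otimes X^{\top})$, is the standard result I believe the book is using; the code itself (the helper `commutation`, the dimensions, the vectorization convention) is my own sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
l, n = 5, 3
X = rng.standard_normal((l, n))

vec = lambda A: A.flatten(order="F")    # vec stacks columns

def commutation(n, m):
    """K such that K @ vec(A) = vec(A.T) for any A of shape (n, m)."""
    K = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            K[j + i * m, i + j * n] = 1.0
    return K

# Jacobian of vec(X^T X) w.r.t. (vec X)^T: (I_{n^2} + K_{nn}) (I_n kron X^T)
J_analytic = (np.eye(n * n) + commutation(n, n)) @ np.kron(np.eye(n), X.T)

# finite-difference Jacobian, perturbing one entry of vec(X) at a time
eps = 1e-6
J_fd = np.zeros((n * n, l * n))
for col in range(l * n):
    d = np.zeros(l * n)
    d[col] = eps
    dX = d.reshape((l, n), order="F")
    J_fd[:, col] = (vec((X + dX).T @ (X + dX)) - vec((X - dX).T @ (X - dX))) / (2 * eps)

print(np.max(np.abs(J_fd - J_analytic)))        # tiny, since F is quadratic in X
```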
I do not know how to get from this point to a partial derivative with respect to a single vector, $\Phi_{ab}$. I would guess that the derivatives of almost all of the terms in the sums in $\mathscr{L}$ with respect to $\Phi_{ab}$ will be zero. I think I can use this to my advantage, which I believe would mean multiplying the above derivative by $\delta_{ia}\delta_{jb}\delta_{kb}$, but this is where I am less sure.
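If that guess is right, only the $i=a$ slice contributes, and the derivative with respect to the column $\Phi_{ab}$ should be the $b$-th column of $2\beta\,\Phi_{a}\,\mathrm{sign}(\Phi_{a}^{\top}\Phi_{a}-\mathbb{I}_{\text{n}})$. Here is the finite-difference test I would use to check that guess (again my own sketch, reusing the loss from above with arbitrary dimensions and a generic random $\Phi$):

```python
import numpy as np

rng = np.random.default_rng(3)
l, m, n = 5, 3, 4
beta = 0.1
Phi = rng.standard_normal((l, m, n))
a, b = 1, 2                              # which l-dimensional column Phi_{ab} to differentiate

def loss(Phi):
    total = 0.0
    for i in range(Phi.shape[1]):
        X = Phi[:, i, :]
        total += np.abs(X.T @ X - np.eye(Phi.shape[2])).sum()
    return beta * total

# guessed gradient: only the i = a slice survives; take column b of 2*beta*Phi_a @ S_a
X = Phi[:, a, :]
S = np.sign(X.T @ X - np.eye(n))
grad_guess = 2.0 * beta * (X @ S)[:, b]

# central finite differences over the single column Phi[:, a, b]
eps = 1e-6
grad_fd = np.zeros(l)
for p in range(l):
    dPhi = np.zeros_like(Phi)
    dPhi[p, a, b] = eps
    grad_fd[p] = (loss(Phi + dPhi) - loss(Phi - dPhi)) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_guess)))   # small for generic Phi
```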
I also used this blog post as a resource. My question is very similar to this one, and also related to this one, this one, and this one, although I was not able to get to an answer from those posts.
$$\frac{\partial|X|}{\partial X}=\frac{X}{|X|}$$ – greg Jan 24 '20 at 15:29