I would like to write down the update rule for a set of neural-network parameters trained to minimize a loss function that I think is general enough to be instructive for others.
Let $\Phi \in \mathbb{R}^{l \times m \times n}$ be an $l \times m \times n$ tensor of learnable parameters and $\mathscr{L}(\Phi)$ be a scalar loss function of those parameters to be minimized:
$$\mathscr{L} = \beta\sum_{i=1}^{m}\sum_{j=1}^{n}\sum_{k=1}^{n}|\Phi_{i}^{\top}\Phi_{i} - \mathbb{I}_{\text{n}}|_{jk},$$
where $|\cdot|$ is the element-wise absolute value, $\beta$ is some scalar constant, $\Phi_{i}$ is an $l \times n$ matrix, and $\mathbb{I}_{\text{n}}$ is the $n \times n$ identity matrix. I would like to know the derivative of this loss with respect to an $l$-dimensional vector: $\frac{\partial \mathscr{L}}{\partial \Phi_{ab}}$, where $a$ and $b$ index the $m$ and $n$ dimensions of $\Phi$, respectively.
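For concreteness, here is a minimal NumPy sketch of the loss as I have written it above; the dimensions, the random seed, the value of `beta`, and the name `Phi` are arbitrary choices of mine, not from any particular codebase:

```python
import numpy as np

l, m, n = 5, 3, 4                       # arbitrary small dimensions
beta = 0.1                              # arbitrary scalar constant
rng = np.random.default_rng(0)
Phi = rng.standard_normal((l, m, n))    # learnable parameters, l x m x n

def loss(Phi, beta):
    """L = beta * sum_i sum_{j,k} |Phi_i^T Phi_i - I_n|_{jk}."""
    total = 0.0
    for i in range(Phi.shape[1]):
        X = Phi[:, i, :]                      # the l x n slice Phi_i
        M = X.T @ X - np.eye(Phi.shape[2])    # n x n residual
        total += np.abs(M).sum()              # element-wise absolute value, summed over j, k
    return beta * total

print(loss(Phi, beta))
```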
Following the chain rule described in chapter 18 of Matrix Differential Calculus by Magnus and Neudecker, I can use differentials to get most of the way there. Specifically, I can modify example 18.6a to let $F(X) = |X^{\top}X|$ for some $X \in \mathbb{R}^{l \times n}$, where again $|\cdot|$ is the element-wise absolute value, not the determinant. Then,
\begin{align} \text{d}F &= \text{d}|X^{\top}X| \\ &= \frac{X^{\top}X}{|X^{\top}X|} \odot \text{d}(X^{\top}X) \\ &= \frac{X^{\top}X}{|X^{\top}X|} \odot \left((\text{d}X)^{\top}X\right) + \frac{X^{\top}X}{|X^{\top}X|} \odot \left(X^{\top} \text{d}X\right) \\ &= 2\, \frac{X^{\top}X}{|X^{\top}X|} \odot \left(X^{\top}\text{d}X\right), \end{align}
where $\odot$ is the element-wise (Hadamard) product, and the last step relies on the sign factor being symmetric, so the two terms (which are transposes of each other) contribute equally once the entries are summed, as they are in $\mathscr{L}$.
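To convince myself that the factor of 2 in the last line really does survive once the entries are summed, I compared the implied matrix gradient of one term of the loss, $2X\,\mathrm{sign}(X^{\top}X-\mathbb{I}_{\text{n}})$ (the sign taken element-wise, playing the role of $\frac{\cdot}{|\cdot|}$), against central finite differences. This is my own sketch with arbitrary dimensions, and it assumes a generic $X$ for which no entry of $X^{\top}X-\mathbb{I}_{\text{n}}$ is exactly zero, since the absolute value is not differentiable there:

```python
import numpy as np

rng = np.random.default_rng(1)
l, n = 5, 4
X = rng.standard_normal((l, n))

def f(X):
    # one term of the loss with beta = 1: sum of |X^T X - I| over all entries
    return np.abs(X.T @ X - np.eye(X.shape[1])).sum()

# gradient implied by the differential: 2 * X @ sign(X^T X - I), sign element-wise
S = np.sign(X.T @ X - np.eye(n))
grad_claimed = 2.0 * X @ S

# central finite differences, one entry of X at a time
eps = 1e-6
grad_fd = np.zeros_like(X)
for p in range(l):
    for q in range(n):
        E = np.zeros_like(X)
        E[p, q] = eps
        grad_fd[p, q] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_claimed)))   # small (~1e-8) for generic X
```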
The book also provides an identification theorem for connecting differentials to derivatives: $$\text{d} \text{vec}F = A(X) \text{d} \text{vec}X \iff \frac{\partial\text{vec}F(X)}{\partial(\text{vec}X)^{\top}} = A(X),$$ where $\text{vec}$ is the matrix vectorization operator. I believe I can now use the chain rule to get close to my desired derivative if I set $F=|X^{\top}X-\mathbb{I}_{\text{n}}|$ and $X=\Phi_{i}$: \begin{align} \frac{\partial\mathscr{L}}{\partial(\text{vec}\Phi_{i})^{\top}} &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} \frac{\partial\text{vec}F}{\partial(\text{vec}\Phi_{i})^{\top}} \\ &= \frac{\partial\mathscr{L}}{\partial\text{vec}F} 2 \frac{\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}}{|\Phi_{i}^{\top}\Phi_{i}-\mathbb{I}_{\text{n}}|} \Phi_{i}^{\top} \end{align}
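As a sanity check on how the identification theorem works on the differentiable part of this problem, here is a numerical comparison for the simpler map $F(X)=X^{\top}X$. The commutation-matrix form of its Jacobian, $(\mathbb{I}_{n^{2}} + K_{nn})(\mathbb{I}_{\text{n}} \otimes X^{\top})$, is the standard result I believe the book is using; the code itself (the helper `commutation`, the dimensions, the vectorization convention) is my own sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
l, n = 5, 3
X = rng.standard_normal((l, n))

vec = lambda A: A.flatten(order="F")    # vec stacks columns

def commutation(n, m):
    """K such that K @ vec(A) = vec(A.T) for any A of shape (n, m)."""
    K = np.zeros((n * m, n * m))
    for i in range(n):
        for j in range(m):
            K[j + i * m, i + j * n] = 1.0
    return K

# Jacobian of vec(X^T X) w.r.t. (vec X)^T: (I_{n^2} + K_{nn}) (I_n kron X^T)
J_analytic = (np.eye(n * n) + commutation(n, n)) @ np.kron(np.eye(n), X.T)

# finite-difference Jacobian, perturbing one entry of vec(X) at a time
eps = 1e-6
J_fd = np.zeros((n * n, l * n))
for col in range(l * n):
    d = np.zeros(l * n)
    d[col] = eps
    dX = d.reshape((l, n), order="F")
    J_fd[:, col] = (vec((X + dX).T @ (X + dX)) - vec((X - dX).T @ (X - dX))) / (2 * eps)

print(np.max(np.abs(J_fd - J_analytic)))        # tiny, since F is quadratic in X
```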
I do not know how to get from this point to a partial derivative with respect to a single vector, $\Phi_{ab}$. I would guess that the derivatives of almost all of the terms in the sums in $\mathscr{L}$ with respect to $\Phi_{ab}$ will be zero. I think I can use this to my advantage, which I believe would mean multiplying the above derivative by $\delta_{ia}\delta_{jb}\delta_{kb}$, but this is where I am less sure.
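If that guess is right, only the $i=a$ slice contributes, and the derivative with respect to the column $\Phi_{ab}$ should be the $b$-th column of $2\beta\,\Phi_{a}\,\mathrm{sign}(\Phi_{a}^{\top}\Phi_{a}-\mathbb{I}_{\text{n}})$. Here is the finite-difference test I would use to check that guess (again my own sketch, reusing the loss from above with arbitrary dimensions and a generic random $\Phi$):

```python
import numpy as np

rng = np.random.default_rng(3)
l, m, n = 5, 3, 4
beta = 0.1
Phi = rng.standard_normal((l, m, n))
a, b = 1, 2                              # which l-dimensional column Phi_{ab} to differentiate

def loss(Phi):
    total = 0.0
    for i in range(Phi.shape[1]):
        X = Phi[:, i, :]
        total += np.abs(X.T @ X - np.eye(Phi.shape[2])).sum()
    return beta * total

# guessed gradient: only the i = a slice survives; take column b of 2*beta*Phi_a @ S_a
X = Phi[:, a, :]
S = np.sign(X.T @ X - np.eye(n))
grad_guess = 2.0 * beta * (X @ S)[:, b]

# central finite differences over the single column Phi[:, a, b]
eps = 1e-6
grad_fd = np.zeros(l)
for p in range(l):
    dPhi = np.zeros_like(Phi)
    dPhi[p, a, b] = eps
    grad_fd[p] = (loss(Phi + dPhi) - loss(Phi - dPhi)) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_guess)))   # small for generic Phi
```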
I also used this blog post as a resource. My question is very similar to this one, and also related to this one, this one, and this one, although I was not able to get to an answer from those posts.
$$\frac{\partial|X|}{\partial X}=\frac{X}{|X|}$$ – greg Jan 24 '20 at 15:29