
Consider $\|W^TW-I\|_*$, where $W\in R^{m\times n}$ with $n\leq m$, $I\in R^{n\times n}$ is the identity matrix, and $\|\cdot\|_*$ denotes the nuclear norm, also known as the trace norm.

Q1: $\|W^TW-I\|_*$ is often used as a regularization term in machine learning algorithms. Minimizing such a term requires its (sub-)gradient with respect to the matrix variable $W$. How can that be computed? I only know the (sub-)gradient of $\|W\|_*$, as answered in Derivative of the nuclear norm with respect to its argument.

Q2: This may be more complicated than Q1. The (sub-)gradient of $\|W\|_*$ involves a singular value decomposition, and Is there a nuclear norm approximation for stochastic gradient descent optimization? describes a way to make that efficient. I am afraid the solution to Q1 will also be computationally expensive. Is there any way to make it efficient?

olivia

1 Answer


For typing convenience, define the symmetric matrix $$\eqalign{ X &= W^TW-I \\ }$$ Write the nuclear norm in terms of this new variable.
Then find its differential and (sub)gradient wrt $W$. $$\eqalign{ \lambda &= \|X\|_* \\ &= {\rm Tr}\big((X^TX)^{1/2}\big) \\ &= \pm{\rm Tr}(X)\quad\Big({\rm choose\,sign\,such\,that\;}\lambda>0\Big) \\ d\lambda &= \pm{\rm Tr}(dX) \\ &= \pm{\rm Tr}(W^TdW+dW^TW) \\ &= \pm{\rm Tr}\Big((2W)^TdW\Big) \\ \frac{\partial \lambda}{\partial W} &= \pm 2W = \begin{cases} +2W &{\rm if\;Tr}(W^TW-I)>0 \\ -2W &{\rm otherwise} \end{cases} \\ }$$ If you're concerned about computational expense, this result is as inexpensive as one could reasonably hope to achieve.
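As a quick numerical check (this sketch is my own addition, not part of the original answer; the matrix sizes and the scaling of $W$ are arbitrary choices), the following NumPy snippet constructs a $W$ for which $X=W^TW-I$ is positive definite and compares the closed-form gradient $\pm 2W$ with a central finite-difference approximation of $\|W^TW-I\|_*$:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 8, 4
# Scale an orthonormal-column matrix so that X = W^T W - I = 3I is positive definite.
W = 2.0 * np.linalg.qr(rng.standard_normal((m, n)))[0]

def f(W):
    """Nuclear norm of W^T W - I."""
    return np.linalg.norm(W.T @ W - np.eye(W.shape[1]), ord='nuc')

# Closed-form gradient from the answer (valid here because X is definite).
X = W.T @ W - np.eye(n)
grad = np.sign(np.trace(X)) * 2.0 * W

# Central finite-difference approximation of the gradient.
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_fd[i, j] = (f(Wp) - f(Wm)) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))  # tiny (~1e-9), matching +/- 2W
```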

Update

The matrix sign function is defined such that $$\eqalign{ S &= \operatorname{sign}(X) = X(X^2)^{-1/2} \\ I &= S^2 \quad\implies S^{-1} = S \\ XS &= SX \\ }$$
For a symmetric matrix $${ (X^TX)^{1/2} = (X^2)^{1/2} = XS^{-1} = SX }$$ Therefore $S$ can be used to write the nuclear norm and its gradient as $$\eqalign{ \lambda &= \operatorname{Tr}(SX) \\ \frac{\partial \lambda}{\partial W} &= 2WS \\ }$$ The previous result is only valid when $S=\pm I$, i.e. when $X=W^TW-I$ is positive definite or negative definite.
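For the general case where $X$ is indefinite (but nonsingular), here is a small sketch of the updated formula, again my own illustration with arbitrarily chosen dimensions and singular values: compute $S=\operatorname{sign}(X)$ from the eigendecomposition of the symmetric matrix $X$, evaluate $\lambda=\operatorname{Tr}(SX)$ and the gradient $2WS$, and check both numerically.

```python
import numpy as np

rng = np.random.default_rng(1)

m, n = 8, 4
# Construct a W whose singular values straddle 1, so X = W^T W - I is
# indefinite with eigenvalues safely away from zero.
U = np.linalg.qr(rng.standard_normal((m, n)))[0]   # orthonormal columns
V = np.linalg.qr(rng.standard_normal((n, n)))[0]   # orthogonal
W = U @ np.diag([2.0, 1.5, 0.5, 0.25]) @ V.T

X = W.T @ W - np.eye(n)
evals, Q = np.linalg.eigh(X)            # symmetric eigendecomposition X = Q diag(evals) Q^T
S = Q @ np.diag(np.sign(evals)) @ Q.T   # matrix sign function sign(X)

lam = np.trace(S @ X)                   # Tr(SX) = sum of |eigenvalues| = ||X||_*
print(np.isclose(lam, np.linalg.norm(X, ord='nuc')))  # True

grad = 2.0 * W @ S                      # gradient of ||W^T W - I||_* wrt W

# Finite-difference check (valid as long as no eigenvalue of X is near zero).
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(m):
    for j in range(n):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        fp = np.linalg.norm(Wp.T @ Wp - np.eye(n), ord='nuc')
        fm = np.linalg.norm(Wm.T @ Wm - np.eye(n), ord='nuc')
        grad_fd[i, j] = (fp - fm) / (2 * eps)

print(np.max(np.abs(grad - grad_fd)))   # small, consistent with d||X||_*/dW = 2WS
```

Since $X$ is symmetric, the eigendecomposition here plays the role of the SVD, so the cost is one $n\times n$ symmetric eigensolve per gradient evaluation rather than an SVD of the full $m\times n$ matrix.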

greg