5

What would be the Sub Gradient of

$$ f \left( X \right) = {\left\| A X \right\|}_{2, 1} $$

Where $ X \in \mathbb{R}^{m \times n} $, $ {A} \in \mathbb{R}^{k \times m} $ and $ {\left\| Y \right\|}_{2, 1} = \sum_{j} \sqrt{ \sum_{i} {Y}_{i,j}^{2} } $.

What would be the Prox of:

$$ \operatorname{Prox}_{\lambda {\left\| \cdot \right\|}_{2,1}} \left( Y \right) = \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1}, \; X, Y \in {\mathbb{R}}^{m \times n} $$

Can either be generalized for $ {\left\| \cdot \right\|}_{q, p} $?

Royi
  • 8,711
  • $f$ is not differentiable. The $\ell_{2,1}$-norm is a separable sum, so it follows from the separable sum rule that evaluating the prox-operator of the $\ell_{2,1}$-norm reduces to evaluating the prox-operator of the 2-norm. – littleO Jul 29 '19 at 20:28
  • You're right. I was not accurate with this. Edited for the Sub Gradient. I meant something similar to https://math.stackexchange.com/questions/2035198. – Royi Jul 29 '19 at 20:31
  • There's a separable sum rule for subgradients and also a separable sum rule for prox-operators. Because the $\ell_{2,1}$-norm is a separable sum, we can use these separable sum rules to reduce these calculations to the corresponding calculations for the 2-norm. – littleO Jul 29 '19 at 20:34
  • This is what I thought about (decompose $ X $ into its columns). I thought someone would come up with a more elegant and simpler way. – Royi Jul 29 '19 at 20:40

2 Answers

6

In the following I will use MATLAB's notation, where the : operator selects a column of a matrix.

Sub Gradient of $ {L}_{2, 1} $ Mixed Norm

$$ f \left( X \right) = {\left\| A X \right\|}_{2, 1} = \sum_{i} {\left\| A {X}_{:, i} \right\|}_{2} $$

Now, for a vector $ x $, wherever $ A x \neq 0 $, the gradient is:

$$ \frac{\mathrm{d} {\left\| A x \right\|}_{2} }{\mathrm{d} x} = \frac{ {A}^{T} A x }{ {\left\| A x \right\|}_{2} } $$

Which implies:

$$\begin{align*} \frac{\mathrm{d} {\left\| A X \right\|}_{2, 1} }{\mathrm{d} X} & = \frac{\mathrm{d} \sum_{i} {\left\| A {X}_{:, i} \right\|}_{2} }{\mathrm{d} X} && \text{} \\ & = \sum_{i} \frac{\mathrm{d} {\left\| A {X}_{:, i} \right\|}_{2} }{\mathrm{d} {X}_{:, i}} \boldsymbol{e}_{i}^{T} && \text{Where $ \boldsymbol{e}_{i} $ is the standard $ i $-th basis vector} \\ & = \sum_{i} \frac{ {A}^{T} A {X}_{:, i} }{ {\left\| A {X}_{:, i} \right\|}_{2} } \boldsymbol{e}_{i}^{T} && \text{} \\ & = {A}^{T} A X D \end{align*}$$

Where

$$ D = \operatorname{diag} \left\{ {d}_{1}, {d}_{2}, \ldots, {d}_{n} \right\}, \; {d}_{i} = \begin{cases} 0 & \text{ if } {\left\| A {X}_{:, i} \right\|}_{2} = 0 \\ \frac{1}{{\left\| A {X}_{:, i} \right\|}_{2}} & \text{ if } {\left\| A {X}_{:, i} \right\|}_{2} \neq 0 \end{cases} $$

Remark
For a column with $ A {X}_{:, i} = 0 $ the Sub Gradient of $ {\left\| A \cdot \right\|}_{2} $ at that column is any vector of the form $ {A}^{T} u $ with $ {\left\| u \right\|}_{2} \leq 1 $. In the case above $ u $ was chosen to be the zero vector, which indeed has a norm less than or equal to 1.
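
A minimal MATLAB sketch of this Sub Gradient (not the repository code; the function name is my own, and it relies on implicit expansion, MATLAB R2016b or later):

```matlab
% Sub Gradient of f(X) = || A * X ||_{2, 1} evaluated at X.
% Columns with A * X(:, i) = 0 get the zero subgradient chosen in the Remark above.
function [ subGrad ] = SubGradientL21( A, X )
    AX       = A * X;
    colNorms = sqrt(sum(AX .^ 2, 1));           % 1 x n vector of || A * X(:, i) ||_2
    d        = zeros(1, size(X, 2));
    d(colNorms > 0) = 1 ./ colNorms(colNorms > 0);
    subGrad  = (A.' * AX) .* d;                 % Equivalent to A.' * A * X * diag(d)
end
```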

Prox of $ {L}_{2, 1} $ Mixed Norm

The problem is given by:

$$ \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} $$

Where $ X, Y \in \mathbb{R}^{m \times n} $.

Again, this can be decomposed into working on each column of $ X $ separately:

$$\begin{aligned} \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} & = \arg \min_{X} \sum_{i} \left( \frac{1}{2} {\left\| {X}_{:, i} - {Y}_{:, i} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, i} \right\|}_{2} \right) && \text{} \\ & = \arg \min_{X} \left( \frac{1}{2} {\left\| {X}_{:, 1} - {Y}_{:, 1} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, 1} \right\|}_{2} \right) && \\ & + \left( \frac{1}{2} {\left\| {X}_{:, 2} - {Y}_{:, 2} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, 2} \right\|}_{2} \right) && \\ & + \cdots && \\ & + \left( \frac{1}{2} {\left\| {X}_{:, n} - {Y}_{:, n} \right\|}_{2}^{2} + \lambda {\left\| {X}_{:, n} \right\|}_{2} \right) \end{aligned}$$

Each term in the brackets is an independent Prox problem of the $ {L}_{2} $ norm.
Hence the solution is given by:

$$ \hat{X} = \arg \min_{X} \frac{1}{2} {\left\| X - Y \right\|}_{F}^{2} + \lambda {\left\| X \right\|}_{2, 1} $$

Where $ \hat{X}_{:, i} = {Y}_{:, i} \left( 1 - \frac{\lambda}{\max \left( {\left\| {Y}_{:, i} \right\|}_{2} , \lambda \right)} \right) $, namely block soft thresholding applied to each column of $ Y $.
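
A minimal MATLAB sketch of this column-wise shrinkage (not the repository code; the function name is my own, and it relies on implicit expansion, MATLAB R2016b or later):

```matlab
% Prox of lambda * || . ||_{2, 1} at Y: shrink each column of Y towards zero.
function [ X ] = ProxL21( Y, lambda )
    colNorms = sqrt(sum(Y .^ 2, 1));                          % 1 x n vector of || Y(:, i) ||_2
    X        = Y .* (1 - lambda ./ max(colNorms, lambda));    % Columns with norm <= lambda become zero
end
```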

MATLAB Code

I implemented code which verifies the results against a Numerical Derivative (Finite Differences) and against CVX (as a reference for the Prox).
The full code is in my StackExchange Mathematics Q3307741 GitHub Repository.

Royi
  • 8,711
3

Let $x_j$ be the $j$th column of $X$. (So $\|X\|_{2,1} = \sum_{j=1}^n \| x_j \|_2$.) Then the separable sum rule for proximal operators tells us that $\text{prox}_{\lambda\| \cdot \|_{2,1}}(X)$ is the $m \times n$ matrix whose $j$th column is $\text{prox}_{\lambda\| \cdot \|_2}(x_j)$.
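
Explicitly, using the known closed form of the 2-norm prox (block soft thresholding), each column is

$$ \operatorname{prox}_{\lambda \left\| \cdot \right\|_{2}} \left( x_{j} \right) = \left( 1 - \frac{\lambda}{\max \left( \left\| x_{j} \right\|_{2}, \lambda \right)} \right) x_{j}, $$

which matches the column-wise formula in the other answer.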

littleO
  • 51,938