
Can somebody suggest how to solve the following optimization problem? \begin{equation*} F(\mathbf{w},\xi)= \begin{aligned} & \underset{\mathbf{w,\xi}}{\text{minimize}} & & \frac{1}{2}\|\mathbf{w}-\mathbf{w}_{t-1}\|^2_2 + \beta \|\mathbf{w}\|_1 +\alpha C\xi\\ & \text{subject to} & & 1-y\mathbf{w}^T\mathbf{x} \leq \xi, \quad \xi \geq 0 \end{aligned} \end{equation*}

An equivalent formulation is this: \begin{equation*} F(\mathbf{w},\xi)= \begin{aligned} & \underset{\mathbf{w,\xi}}{\text{minimize}} & & \frac{1}{2}\|\mathbf{w}-\mathbf{w}_{t-1}\|^2_2 +\alpha C\xi\\ & \text{subject to} & & 1-y\mathbf{w}^T\mathbf{x} \leq \xi, \quad \xi \geq 0\\ & & &\|\mathbf{w}\|_1 \leq t \end{aligned} \end{equation*} Edit: I need solution techniques that work in an online setting and are suitable for large $\mathbf{w},\mathbf{x}$, say in the millions.

  • You are missing a $\beta$ and a $t$ in the objective in the equivalent formulation. When you say solve, do you mean a dedicated, structure-exploiting fast solver, in contrast to the obvious approach of just writing it as a standard convex quadratic program and using standard methods for that? – Johan Löfberg Sep 24 '15 at 10:42
  • unless you mean there exists a $\beta$ such that equivalence holds, etc. – Johan Löfberg Sep 24 '15 at 10:46
  • as per Mark Schmidt's notes on LASSO, $t$ and $\beta$ are inversely related. By solve, I mean an approach to solving it when $\mathbf{w},\mathbf{x}$ are of huge dimension, say 1M. – CKM Sep 24 '15 at 10:55
  • my concern is the non-differentiability of the $\ell_1$ term. Proximal algorithms could be used, but the hinge-loss constraint prohibits their use, since they were originally proposed for unconstrained problems. I don't know if anybody is aware of using proximal algorithms under constraints. – CKM Sep 24 '15 at 11:01
  • What is the function $y$? – Michael Grant Sep 24 '15 at 15:17
  • @MichaelGrant $y$ is a scalar. I have edited that. – CKM Sep 24 '15 at 15:23
  • I would recommend a custom proximal minimizer that incorporates the $\ell_1$ and $\xi$ terms in the objective along with the constraints. It can probably be implemented efficiently. – Michael Grant Sep 24 '15 at 15:24
  • Actually, what does $w_{t-1}$ represent? It already looks very close to a proximal minimization form. – Michael Grant Sep 24 '15 at 15:26
  • Any insight into designing a custom proximal minimizer would be helpful. I am new to proximal methods and am going through Prof. Boyd's notes on proximal algorithms. – CKM Sep 24 '15 at 15:27
  • $w_{t-1}$ is the value of $w$ at step $t-1$ and can be considered known. – CKM Sep 24 '15 at 15:29
  • You really need to curate your notation. For example, the LHS of your problem is a function of $(w, \xi)$; the RHS is not. – dohmatob Sep 25 '15 at 07:35
  • You can (and should) replace $\alpha C$ with $\alpha$ and $yw^Tx$ with $w^Tx$. There are many irrelevant details in your problem which can be "normalized" away. – dohmatob Sep 25 '15 at 08:04

1 Answer


You can (and should) replace $\alpha C$ with $\alpha$ and $yw^Tx$ with $w^Tx$. There are many irrelevant details in your problem which can be "normalized" away.

Now, a bit of algebra reveals that it is optimal to set (indeed, for fixed $w$, the problem in $\xi$ is the minimization of a linear function with positive slope $\alpha$ over $\xi \geq \max(1-w^Tx, 0)$) \begin{eqnarray} \xi = (1 - w^Tx)_+ := \max(1-w^Tx, 0), \end{eqnarray} and then to minimize $E(w) := \frac{1}{2}\|w-w_{t-1}\|^2 + \beta\|w\|_1 + \alpha(1 - w^Tx)_+$ over $w$.
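For concreteness, here is the reduced objective as a short Python sketch (the function and argument names are mine, chosen for illustration):

```python
import numpy as np

def E(w, w_prev, x, alpha, beta):
    """Reduced objective after eliminating xi = (1 - w^T x)_+."""
    hinge = max(1.0 - w @ x, 0.0)
    return 0.5 * np.sum((w - w_prev) ** 2) + beta * np.sum(np.abs(w)) + alpha * hinge
```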

Now, $E(w)$ can be rewritten as \begin{eqnarray} E(w) = g(w) + f(Kw), \end{eqnarray} where $K := x^T \in \mathbb{R}^{1 \times n}$, $g(w) := \frac{1}{2}\|w-w_{t-1}\|^2 + \beta\|w\|_1$, and $f(z) := \alpha(1 - z)_+$.

Thus minimizing $E(w)$ is equivalent to solving the following minimax game \begin{eqnarray} \underset{z \in \mathbb{R}}{\text{maximize }}\underset{w \in \mathbb{R}^n}{\text{minimize }}\langle z, Kw\rangle + g(w) - f^*(z), \end{eqnarray} where $f^*(z) := \underset{s \in \mathbb{R}}{\text{max }}zs - f(s)$ is the convex conjugate of $f$ (compute it as an exercise). It's straightforward to compute the proximal operators of $g$ (this will be a shrinkage) and $f^*$ (this will be a projection onto a compact interval followed by a translation), and so you have all the ingredients (except minor details ...) needed to invoke the proximal primal-dual algorithms of Chambolle-Pock, for example.
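For what it's worth, the "shrinkage" for $g$ can be made explicit: completing the square in $\frac{1}{2}\|w - v\|^2 + \lambda g(w)$ shows that $\mathrm{prox}_{\lambda g}(v)$ is elementwise soft-thresholding of $(v + \lambda w_{t-1})/(1+\lambda)$ at level $\lambda\beta/(1+\lambda)$. A minimal sketch of this (my derivation; worth double-checking):

```python
import numpy as np

def prox_g(v, w_prev, beta, lam):
    """prox_{lam * g}(v) for g(w) = 0.5*||w - w_prev||^2 + beta*||w||_1.
    Completing the square reduces it to elementwise soft-thresholding."""
    center = (v + lam * w_prev) / (1.0 + lam)  # minimizer of the quadratic part
    level = lam * beta / (1.0 + lam)           # effective l1 threshold
    return np.sign(center) * np.maximum(np.abs(center) - level, 0.0)
```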

Finally, because $g$ is strongly convex, you will converge in $\mathcal{O}(\|x\|/\sqrt{\epsilon})$ iterations for a tolerance $\epsilon > 0$ on the duality gap.

Below are some details (useful for actually implementing the algorithms).

Computing $f^*$ and $\textrm{prox}_{\lambda f^*}$. Using basic properties of convex conjugates, we have \begin{eqnarray} (z)_+^* = i_{[0, 1]}(z) \implies (-z)_+^* = i_{[0,1]}(-z) = i_{[-1,0]}(z) \implies (1 - z)_+^* = z + i_{[-1,0]}(z), \end{eqnarray} and so $f^*(z) = \alpha(\frac{z}{\alpha}) + \alpha i_{[-1, 0]}(\frac{z}{\alpha}) = \begin{cases}z, &\mbox{ if} -\alpha \le z \le 0,\\+\infty, &\mbox{ otherwise.}\end{cases}$
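As a quick numerical sanity check on this closed form, one can compare it against a brute-force conjugate computed on a grid (a rough sketch, not part of the derivation; the grid bounds are arbitrary):

```python
import numpy as np

alpha = 2.0
f = lambda s: alpha * np.maximum(1.0 - s, 0.0)
s_grid = np.linspace(-100.0, 100.0, 400001)

for z in [-2.0, -1.5, -0.7, 0.0]:              # test points inside [-alpha, 0]
    numeric = np.max(z * s_grid - f(s_grid))   # sup_s z*s - f(s), on the grid
    print(z, numeric)                          # should print roughly z itself
# For z outside [-alpha, 0] the supremum diverges, i.e., f^*(z) = +infinity.
```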

A direct computation then yields that for any prox step $\lambda > 0$, we have \begin{eqnarray}\textrm{prox}_{\lambda f^*}(z) = \textrm{proj}_{[\lambda - \alpha, \lambda]}(z) - \lambda, \end{eqnarray} i.e., a projection onto a compact interval followed by a translation.
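Putting the pieces together, here is a minimal Chambolle-Pock sketch for this problem (step sizes, iteration count, and all names are my choices for illustration, not prescribed above):

```python
import numpy as np

def prox_f_star(z, alpha, lam):
    """prox_{lam * f^*}(z) = proj_{[lam - alpha, lam]}(z) - lam."""
    return np.clip(z, lam - alpha, lam) - lam

def prox_g(v, w_prev, beta, lam):
    """prox_{lam * g}(v): elementwise soft-thresholding (see the sketch above)."""
    center = (v + lam * w_prev) / (1.0 + lam)
    level = lam * beta / (1.0 + lam)
    return np.sign(center) * np.maximum(np.abs(center) - level, 0.0)

def solve(x, w_prev, alpha, beta, n_iter=500):
    """Chambolle-Pock for min_w g(w) + f(Kw), with K = x^T (so the dual z is a scalar)."""
    L = np.linalg.norm(x)         # ||K||, the operator norm of K = x^T
    tau = sigma = 1.0 / L         # satisfies tau * sigma * L^2 <= 1
    w = w_prev.copy()
    w_bar = w.copy()
    z = 0.0
    for _ in range(n_iter):
        z = prox_f_star(z + sigma * (x @ w_bar), alpha, sigma)   # dual ascent step
        w_next = prox_g(w - tau * z * x, w_prev, beta, tau)      # primal descent step
        w_bar = 2.0 * w_next - w                                 # extrapolation, theta = 1
        w = w_next
    return w
```

Note that each iteration costs one inner product, one scalar-vector multiply, and one soft-threshold, all $\mathcal{O}(n)$, so this should remain tractable even for $w, x$ with millions of entries, especially if $x$ is sparse.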

  • Looks cool. Let me see how fast and how well the proximal primal-dual algorithm applies. – CKM Sep 25 '15 at 08:48
  • I hope this helps. Let me know if you need more practical details. – dohmatob Sep 25 '15 at 13:18
  • why did you change the sign in $(1-z)_+^* = -z + i_{[-1,0]}(z)$? The earlier version was correct, as shown on slide 28 of http://www.mit.edu/~9.520/spring07/Classes/svmwithfenchel.pdf – CKM Oct 01 '15 at 04:12
  • The earlier stuff was a misprint. Indeed, you have the identity (easy to prove): $h^*(z + a) \equiv h^*(z) - a^Tz$. – dohmatob Oct 01 '15 at 07:05
  • what about the proof given in slide 28 linked in my previous comment? – CKM Oct 01 '15 at 07:12
  • I think the author made an error... – dohmatob Oct 01 '15 at 07:27
  • Learn more about properties of convex conjugates here: https://en.wikipedia.org/wiki/Convex_conjugate#Table_of_selected_convex_conjugates – dohmatob Oct 01 '15 at 07:33
  • Also, is the expression for $(1-z)_+^*$ valid for $z=y\mathbf{w}^T\mathbf{x}$? Here, our variable of interest is $\mathbf{w}$, i.e. $f(\mathbf{w})=(1-y\mathbf{w}^T\mathbf{x})_+$. Any hint will be useful. Thanks. – CKM Oct 01 '15 at 10:25
  • Upon checking, $h^*(z + a) \equiv h^*(z) - a^Tz$. Substituting $h(z+a)=(z+a)_+$, $z=-z$ and $a=1$, we get $(1-z)_+^* = (-z)_+^* + z = \iota_{[-1,0]}(z)+z$. – CKM Oct 01 '15 at 12:36
  • Hum, by the identity, I meant: if $f(z) \equiv h(z + a)$ then $f^*(z) \equiv h^*(z) - a^Tz$. This is an elementary fact (which can be shown in one line). Look at the second row of the table https://en.wikipedia.org/wiki/Convex_conjugate#Table_of_selected_convex_conjugates. It doesn't amount to just plugging values into the expression for $h^*(z)$. – dohmatob Oct 01 '15 at 13:11
  • Martin Wainwright's lecture notes contain a proof for the conjugate of the hinge loss. Please look carefully. – CKM Oct 01 '15 at 13:23
  • OK, it seems you're right. Such mistakes are typically made when you're expedient, saying to people "hey look, this and that are trivial, just do the math and you'll see". My bad. There is a story of Grothendieck mistaking 57 for a prime, because he never cared about computational details. Lack of coffee can be a serious problem too. Anyways, I hope you can reconstruct a solution to your problem from all of this. Good luck! – dohmatob Oct 01 '15 at 14:50
  • @dohmatob, Any thought on this - https://math.stackexchange.com/questions/2595199. – Royi Mar 13 '18 at 15:55
  • @Royi https://math.stackexchange.com/a/2689627/168758 – dohmatob Mar 13 '18 at 17:30