
I want to minimize the following cost function

$$J = \min_{\boldsymbol{\rm x}} \left\{\|A\boldsymbol{\rm x}-b\|_2^2 + \lambda\|\boldsymbol{\rm x}\|_2\right\}$$

It differs slightly from Tikhonov regularization in that the penalty term is not squared. Unlike Tikhonov regularization, which has an analytic solution, I was not able to find such a solution for $J$. Could someone give me a hint about which method I can use to solve for $\boldsymbol{\rm x}_\lambda$?

Royi
  • Any quasi-Newton / trust region method, for instance. You might lose some theoretical convergence speed because $\left\|\cdot\right\|_2$ is continuous but not smooth. (A more explicit hint for the quasi-Newton case: CG/BFGS/limited-memory BFGS.) – Max Feb 09 '18 at 22:55
  • @Max: I used CVX (http://cvxr.com/), which uses SeDuMi, to solve it, but unfortunately I have not figured out the details of the solver method. CVX gives me solutions that make sense. However, I am trying to write my own solver. By any chance, do you happen to use CVX/SeDuMi? – shashashamti2008 Feb 09 '18 at 23:08
  • Since CVX seems to be a MATLAB library, I guess you use MATLAB. Do you have the Optimization Toolbox? If yes, fminunc( @(x) sum((A*x-b).^2) + lambda*norm(x), x0 ) might already do the job: https://de.mathworks.com/help/optim/ug/fminunc.html Having said this, I would always recommend fmincon over fminunc, since it has more solvers implemented (such as L-BFGS) and you can simply omit any constraints: https://de.mathworks.com/help/optim/ug/fmincon.html And of course it is always a good idea to supply a gradient function so that finite differences are not necessary. – Max Feb 09 '18 at 23:20
  • Thanks Max. I do have MATLAB but it does not have the optimization toolbox. – shashashamti2008 Feb 09 '18 at 23:25
  • If the example code on http://cvxr.com/cvx/ works, all you need to do is omit the constraints and replace the objective function by sum((A*x-b).^2) + lambda*norm(x). – Max Feb 10 '18 at 09:28
  • I successfully used CVX and got excellent results. I want to know how SeDuMi minimizes $J$ and by which method as I want to write my own solver. – shashashamti2008 Feb 10 '18 at 09:30
  • The method seems to be from "An $O(\sqrt{n}L)$-iteration homogeneous and self-dual linear programming algorithm" (no free PDF on Google Scholar, but sci-hub.tv works) – Max Feb 10 '18 at 21:08

1 Answer


You can solve this problem easily using the proximal gradient method, or an accelerated proximal gradient method such as FISTA. These are powerful tools that anyone can learn to use. The proximal gradient method solves optimization problems of the form
$$ \tag{$\spadesuit$} \text{minimize} \quad f(x) + g(x), $$
where $f$ and $g$ are convex, $f$ is differentiable (with a Lipschitz continuous gradient), and $g$ is "simple" in the sense that its proximal operator can be evaluated efficiently. The proximal operator of $g$ is defined as follows:
$$ \text{prox}_{tg}(\hat x) = \arg \min_x \quad g(x) + \frac{1}{2t} \| x - \hat x \|_2^2. $$
The proximal operator of $g$ reduces the value of $g$ without straying too far from $\hat x$. The parameter $t > 0$ can be thought of as a "step size" that controls how far from $\hat x$ we are allowed to move. Note that $g$ is not required to be differentiable. (It is, however, required to be lower semicontinuous, a mild condition that is usually satisfied in practice.)
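
To make the definition concrete, here is a small NumPy check of a proximal operator (my own illustration, not part of the answer): the classic example $g(x) = \lambda \|x\|_1$ has a closed-form prox, the soft-thresholding operator.

```python
import numpy as np

def prox_l1(x_hat, t, lam):
    # Prox of g(x) = lam * ||x||_1 with step t: soft thresholding,
    # a standard example of a prox that can be evaluated in closed form.
    return np.sign(x_hat) * np.maximum(np.abs(x_hat) - t * lam, 0.0)

# Sanity check against the definition: prox_{tg}(x_hat) minimizes
# g(x) + ||x - x_hat||^2 / (2t), so no nearby point should do better.
x_hat = np.array([3.0, -0.5, 1.2])
t, lam = 0.5, 1.0
p = prox_l1(x_hat, t, lam)

def obj(x):
    return lam * np.abs(x).sum() + np.sum((x - x_hat) ** 2) / (2 * t)

rng = np.random.default_rng(0)
assert all(obj(p) <= obj(p + 0.01 * rng.standard_normal(3)) for _ in range(100))
```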

The proximal gradient method iteration is $$ x^{k+1} = \text{prox}_{tg}(x^k - t \nabla f(x^k)). $$ What is happening here is that, starting from the current location $x^k$, we take a gradient descent step to reduce the value of $f$, then we apply the prox-operator of $g$ to reduce the value of $g$. The step size $t > 0$ is required to satisfy $t \leq 1/L$, where $L$ is a Lipschitz constant for $\nabla f$. (There is also a line search version of the proximal gradient method that might converge faster. In the line search version the step size $t$ is chosen adaptively at each iteration.)
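
The iteration above is only a few lines in code. Here is a generic NumPy sketch (my own; the names `grad_f` and `prox_g` are placeholders for user-supplied functions), assuming a fixed step $t \leq 1/L$:

```python
import numpy as np

def proximal_gradient(grad_f, prox_g, x0, t, n_iter=500):
    # Iterates x^{k+1} = prox_{tg}(x^k - t * grad f(x^k)) with fixed step t.
    # prox_g(v, t) should evaluate prox_{tg}(v).
    x = x0
    for _ in range(n_iter):
        x = prox_g(x - t * grad_f(x), t)
    return x

# Tiny smoke test: with g = 0 (whose prox is the identity), the method
# reduces to gradient descent on f(x) = ||x - c||_2^2 and converges to c.
c = np.array([1.0, -2.0])
x_star = proximal_gradient(lambda x: 2.0 * (x - c), lambda v, t: v,
                           np.zeros(2), t=0.4, n_iter=200)
assert np.allclose(x_star, c)
```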

Your optimization problem has the form $f(x) + g(x)$ with $$ f(x) = \| Ax - b \|_2^2 $$ and $$ g(x) = \lambda \|x\|_2. $$ The gradient of $f$ can be evaluated using calculus: $$ \nabla f(x) = 2 A^T(Ax - b). $$ There is a nice formula for the proximal operator of the $\ell_2$-norm: $$ \text{prox}_{tg}(x) = \text{prox}_{t \lambda \| \cdot \|_2}(x) = x - P_B(x), $$ where $B$ is the $2$-norm ball of radius $t\lambda$ centered at the origin and $P_B(x)$ is the projection of $x$ onto $B$. (A formula for the proximal operator of a norm can be found, for example, in Vandenberghe's UCLA 236C notes; see chapter 8, "The proximal mapping", slide 8-3.)
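
In code, the gradient and the $\ell_2$-norm prox from the formulas above might look like this (a NumPy sketch with names of my own choosing; the closed form of $x - P_B(x)$ works out to a "block soft thresholding" that shrinks the whole vector toward the origin):

```python
import numpy as np

def grad_f(x, A, b):
    # Gradient of f(x) = ||Ax - b||_2^2.
    return 2.0 * A.T @ (A @ x - b)

def prox_l2norm(x_hat, t, lam):
    # Prox of g(x) = lam * ||x||_2: subtract the projection of x_hat onto
    # the 2-norm ball B of radius t*lam. If x_hat lies inside B, the
    # projection is x_hat itself and the prox is exactly zero.
    nrm = np.linalg.norm(x_hat)
    if nrm <= t * lam:
        return np.zeros_like(x_hat)
    return (1.0 - t * lam / nrm) * x_hat
```

Note that, unlike the componentwise $\ell_1$ soft threshold, this prox either zeros the entire vector or rescales it, which is why the (non-squared) $\ell_2$ penalty does not promote sparsity of individual entries.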

With these formulas, you are ready to implement the proximal gradient method to solve your optimization problem using just a few lines of code. There is also an accelerated version of the proximal gradient method known as FISTA which is equally simple and likely converges faster.
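
Putting the pieces together, a minimal FISTA sketch for this specific problem might look as follows (my own code, not from the answer; the fixed step $t = 1/L$ uses $L = 2\|A\|_2^2$, a Lipschitz constant for $\nabla f$):

```python
import numpy as np

def prox_l2(v, r):
    # Prox of r * ||.||_2: maps the ball of radius r to 0, otherwise
    # shrinks v toward the origin by r in norm.
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= r else (1.0 - r / nrm) * v

def fista(A, b, lam, n_iter=5000):
    # FISTA for: minimize ||Ax - b||_2^2 + lam * ||x||_2.
    t = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # t = 1/L, L = 2*||A||_2^2
    x = np.zeros(A.shape[1])
    y, s = x.copy(), 1.0
    for _ in range(n_iter):
        # Gradient step on f at the extrapolated point y, then prox of g.
        x_next = prox_l2(y - 2.0 * t * (A.T @ (A @ y - b)), t * lam)
        # Standard FISTA momentum update.
        s_next = (1.0 + np.sqrt(1.0 + 4.0 * s * s)) / 2.0
        y = x_next + ((s - 1.0) / s_next) * (x_next - x)
        x, s = x_next, s_next
    return x
```

When the computed solution $x^\star$ is nonzero, you can sanity-check it via the optimality condition $2A^T(Ax^\star - b) + \lambda x^\star / \|x^\star\|_2 \approx 0$.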

littleO
  • Nicely explained. Thank you for that. I did not know of this method. I am going to learn more about it and implement it in MATLAB. – shashashamti2008 Feb 10 '18 at 10:09
  • The calculation of the prox is given here: https://math.stackexchange.com/questions/1681658. – Royi Mar 16 '18 at 16:02