
I'm taking an ML course. Recently we were given an assignment where we were asked to implement least squares linear regression with $p$-norm regularization, where $1 \leq p \leq 2$: $p = 1$ gives lasso and $p = 2$ gives ridge. Now we are asked to implement a generalized solution, so it can solve for any value of $p$ between $1$ and $2$. Is that really possible? Currently I'm able to solve the ridge case using gradient descent with a constant step size, but I'm not getting how to approach the solution if it needs to be solved for any value of $p$ between $1$ and $2$.

The Objective Function:

$$ \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{p} $$

Where $ 1 \leq p \leq 2 $ is given.

    Welcome to MSE. Please use MathJax. – José Carlos Santos Aug 23 '17 at 15:07
    You mean you need to minimize the p norm of $Ax-b$? Sure it is possible, it is just harder because the basic tricks fail. But you can still use any convex optimization routine. – Ian Aug 23 '17 at 15:08
  • I meant I want to minimize the [least squared error] + regularization. For ridge I take the gradient of this function and a constant step size and iterate over a constant number of steps, but for $p=1$ the regularization term is $|w|$; lasso can be solved by proximal gradient descent. But is there a way I can come up with a solution that will solve for all values of $p$ between 1 and 2? – Arijit Aug 23 '17 at 15:14
  • I frankly would recommend a proximal gradient method in this instance. Building a prox function for a $p$-norm regularizer shouldn't be too difficult. – Michael Grant Aug 23 '17 at 18:03
  • Thanks. Any resource where I can study proximal gradient descent? – Arijit Aug 23 '17 at 20:08
  • Before writing an answer, I would like to understand what you mean by "ridge regression with p-norm". Is the problem of the form $ \min_{x \in \mathbb{R}^n} \left\| A x - b \right\|_{2}^{2} + \alpha \sum_{i=1}^n \left| x_i \right|^p $? – Alex Shtoff Aug 27 '17 at 08:41
  • Yes, I mean linear regression with regularization; that's what your equation says, and the value of $p$ should satisfy $1 \leq p \leq 2$. – Arijit Aug 27 '17 at 09:57

2 Answers


There are 2 similar problems:

Problem I

$$ \arg \min_{x} \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{p} $$

Problem II

$$ \arg \min_{x} \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{p}^{p} $$

Solution Problem I

The function is given by:

$$ f \left( x \right) = \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{p} $$

The derivative is given by:

$$ \frac{d}{d x} f \left( x \right) = {A}^{T} \left( A x - b \right) + \lambda p X x $$

Where the matrix $ X $ is a diagonal matrix given by:

$$ {X}_{ii} = \frac{ \left| {x}_{i} \right|^{p - 2} \left\| x \right\|_{p}^{1 - p} }{p} $$
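As a sanity check on this weight, the componentwise derivative of the regularizer (assuming $ {x}_{i} \neq 0 $) is:

$$ \frac{\partial}{\partial {x}_{i}} \left\| x \right\|_{p} = \frac{\partial}{\partial {x}_{i}} \Big( \sum_{j} \left| {x}_{j} \right|^{p} \Big)^{\frac{1}{p}} = \Big( \sum_{j} \left| {x}_{j} \right|^{p} \Big)^{\frac{1}{p} - 1} \left| {x}_{i} \right|^{p - 1} \operatorname{sign} \left( {x}_{i} \right) = \left\| x \right\|_{p}^{1 - p} \left| {x}_{i} \right|^{p - 2} {x}_{i} = p \, {X}_{ii} {x}_{i} $$

which matches the $ \lambda p X x $ term above.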

The derivative vanishes at:

$$ x = \left( {A}^{T} A + \lambda p X \right)^{-1} {A}^{T} b $$

Since $ X $ depends on $ x $, the problem can be solved using Iterative Reweighted Least Squares (IRLS):

$$ {x}^{k + 1} = \left( {A}^{T} A + \lambda p {X}^{k} \right)^{-1} {A}^{T} b $$

The Code:

%% Solution by Iterative Reweighted Least Squares (IRLS) - Problem I

hObjFun = @(vX) (0.5 * sum((mA * vX - vB) .^ 2)) + (paramLambda * norm(vX, paramP)); %<! Objective function
vObjVal = zeros([numIterations, 1]); %<! Objective value per iteration

mAA = mA.' * mA; %<! Precompute A' * A
vAb = mA.' * vB; %<! Precompute A' * b

vX          = mA \ vB; %<! Initialization by the Least Squares Solution
vObjVal(1)  = hObjFun(vX);

for ii = 2:numIterations

    % Reweighting matrix; this equals p * X from the formula above, so the
    % factor p cancels and does not appear explicitly in the update below
    mX = diag((sum(abs(vX) .^ paramP) .^ ((1 / paramP) - 1)) .* abs(vX) .^ (paramP - 2));

    % Solve the reweighted regularized normal equations
    vX = (mAA + (paramLambda * mX)) \ vAb;

    vObjVal(ii) = hObjFun(vX);
end
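The snippet assumes the data and parameters are already in the workspace. A minimal setup sketch (the dimensions and parameter values below are arbitrary placeholders, not taken from the original problem):

numRows         = 200;
numCols         = 50;
numIterations   = 50;
paramLambda     = 0.5;
paramP          = 1.5; %<! Any value in [1, 2]

mA = randn(numRows, numCols);
vB = randn(numRows, 1);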


Solution Problem II

The function is given by:

$$ f \left( x \right) = \frac{1}{2} \left\| A x - b \right\|_{2}^{2} + \lambda \left\| x \right\|_{p}^{p} $$

The derivative is given by:

$$ \frac{d}{d x} f \left( x \right) = {A}^{T} \left( A x - b \right) + \lambda p X x $$

Where the matrix $ X $ is a diagonal matrix given by:

$$ {X}_{ii} = \left| {x}_{i} \right|^{p - 2} $$
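The check here is immediate (again assuming $ {x}_{i} \neq 0 $):

$$ \frac{\partial}{\partial {x}_{i}} \left\| x \right\|_{p}^{p} = \frac{\partial}{\partial {x}_{i}} \sum_{j} \left| {x}_{j} \right|^{p} = p \left| {x}_{i} \right|^{p - 1} \operatorname{sign} \left( {x}_{i} \right) = p \left| {x}_{i} \right|^{p - 2} {x}_{i} = p \, {X}_{ii} {x}_{i} $$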

The derivative vanishes at:

$$ x = \left( {A}^{T} A + \lambda p X \right)^{-1} {A}^{T} b $$

Since $ X $ depends on $ x $, the problem can be solved using Iterative Reweighted Least Squares (IRLS):

$$ {x}^{k + 1} = \left( {A}^{T} A + \lambda p {X}^{k} \right)^{-1} {A}^{T} b $$

Where:

$$ {X}_{ii}^{k} = \left| {x}_{i}^{k} \right|^{p - 2} $$

The Code is given by:

%% Solution by Iterative Reweighted Least Squares (IRLS) - Problem II

hObjFun = @(vX) (0.5 * sum((mA * vX - vB) .^ 2)) + (paramLambda * sum(abs(vX) .^ paramP)); %<! Objective function
vObjVal = zeros([numIterations, 1]); %<! Objective value per iteration

mAA = mA.' * mA; %<! Precompute A' * A
vAb = mA.' * vB; %<! Precompute A' * b

vX          = mA \ vB; %<! Initialization by the Least Squares Solution
vObjVal(1)  = hObjFun(vX);

for ii = 2:numIterations

    % Reweighting matrix X^k with X_ii = |x_i|^(p - 2)
    mX = diag(abs(vX) .^ (paramP - 2));

    % Solve the reweighted regularized normal equations
    vX = (mAA + (paramLambda * paramP * mX)) \ vAb;

    vObjVal(ii) = hObjFun(vX);
end


The code is available (including validation by CVX) at my StackExchange Mathematics Q2403596 GitHub Repository.
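For reference, a CVX check of Problem I could look roughly like the following (a sketch only, assuming CVX is installed; this is not the repository's actual validation code):

%% Reference solution of Problem I via CVX (sketch)
numCols = size(mA, 2);

cvx_begin quiet
    variable vXRef(numCols)
    minimize( 0.5 * sum_square(mA * vXRef - vB) + paramLambda * norm(vXRef, paramP) )
cvx_end

norm(vX - vXRef) %<! Distance between the IRLS solution and the CVX reference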

– Royi

There are many ways of optimizing functions of this form. In the case that you have $1< p < \infty$, note that your function is differentiable, so you can use a simple stochastic gradient descent method (this can also be done in the $p=1$ case, but it gets a bit hairy, though it often converges nicely).

Additionally, for the step size in gradient descent, I would recommend a decreasing step size that is not summable, i.e. you want $\alpha_k \to 0$ but $\sum_k \alpha_k = \infty$ (for example, $\alpha_k = \alpha_0/k$), since this will speed up convergence and make the method less sensitive to hyperparameters.
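As an illustration of such a step-size schedule on the smooth variant that penalizes $\sum_i |x_i|^p$, here is a minimal (full-gradient) sketch in MATLAB, reusing the variable names from the earlier answer's snippets with placeholder values:

%% Gradient descent with diminishing step size alpha_k = alpha0 / k (sketch)
alpha0 = 1e-3;
vX     = zeros(size(mA, 2), 1);

for k = 1:numIterations
    % Gradient of 0.5 * ||A x - b||_2^2 + lambda * sum(|x_i|^p), valid for 1 < p <= 2
    vGrad = mA.' * (mA * vX - vB) + paramLambda * paramP * sign(vX) .* abs(vX) .^ (paramP - 1);
    vX    = vX - (alpha0 / k) * vGrad;
end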

If you're really excited about using the fact that it is convex, you can use a proximal gradient approach, which will work for all values of $1\le p \le \infty$: Boyd '13 is a nice reference, though you'll have to work out some of the duals of these functions (these should be quite straightforward, but it may be some extra work).

A Newton method approach will actually do quite well here, too, so long as your regularization parameter isn't too large.

  • Is the $p$ norm differentiable at $x=0$, for example $|x|^{3/2}$? – Arijit Aug 23 '17 at 20:07
  • Indeed! The $p$ norm is differentiable for $1<p<\infty$ as $|\varepsilon|^p/\varepsilon = \text{sgn}(\varepsilon)|\varepsilon|^{p-1} \to 0$ as $\varepsilon \to 0^\pm$. – Guillermo Angeris Aug 23 '17 at 20:13
  • It is not differentiable. In a Convex Optimization context it has a Sub Gradient. Yet it is not smooth. – Royi Aug 28 '17 at 14:32
  • Wait, I'm not sure what you mean. For any bounded $p>1$ the p-norm is definitely differentiable (e.g. $C^1$). For $p=1,\infty$ there is indeed only a sub-gradient. Note that I'm not claiming that it is smooth, I'm only claiming that it is once-differentiable. – Guillermo Angeris Aug 28 '17 at 14:35
  • I'm not sure what you mean by differentiable; even the $ {L}_{2} $ norm isn't differentiable. See: https://math.stackexchange.com/questions/310325, https://math.stackexchange.com/questions/1175277 and https://math.stackexchange.com/questions/1110911. – Royi Aug 28 '17 at 14:37
  • @Royi Ah, yes, my bad; I should've been more explicit. The $\ell^p$ norm is not differentiable (which I guess is what I've been stating all along, intending something else) but rather the sum of $p$-th powers is. For finite $p$, this traces out the same pareto-optimal frontier as the $\ell^p$ norm on the objective. – Guillermo Angeris Aug 28 '17 at 14:41
  • I don't understand what you mean. The $ \left\| x \right\|_{2} = \sqrt{ \sum_{i} {x}_{i}^{2} } $ is a sum yet it is not differentiable. You can replace $ 2 $ by any $ p \geq 1 $ and it will stay non-differentiable. – Royi Aug 28 '17 at 14:43
  • I'm talking about $\sum_i x_i^p$ not $\left(\sum_i x_i^p\right)^{1/p}$. – Guillermo Angeris Aug 28 '17 at 14:44