8

Let $\mathbf{x}=[x_1,\ldots,x_K]$. I have the following optimization problem:

\begin{array}{rl} \min \limits_{\mathbf{x}} & \| \mathbf{Ax}-\mathbf{b} \|^2 \\ \mbox{s.t.} & x_k\ge 0, \forall k \end{array}

I would appreciate help solving this problem.

Another thing: my main problem was \begin{array}{rl} \min \limits_{\mathbf{x}} & \| \mathbf{A'x}-\mathbf{b'} \|^2 \\ \mbox{s.t.} & x_k\ge 0, \forall k \\ & \mathbf{x}^T \mathbf{1}=1 \end{array} Then I transformed it into the first problem by including the equality constraint in the objective function. Is it fine to do so?

Royi
  • 8,711
tam
  • 571
  • Is $A$ a square matrix? – Crostul Jun 02 '15 at 19:09
  • No the matrices are not square – tam Jun 02 '15 at 19:11
  • I deleted what I wrote since I realized I forgot the inequalities $x_k \geq 0$. – Holonomia Jun 02 '15 at 19:17
  • 1
    What is wrong with setting up a Lagrangian? – Math-fun Jun 02 '15 at 19:38
  • @Crostul $\mathbf{A}$ is non negative. I don't know if this affects the problem.. – tam Jun 02 '15 at 19:48
  • if A is non-negative you need to include that as one of the constraints – Set Jun 02 '15 at 19:51
  • It can't be solved analytically. You need a quadratic programming engine. But once you have that, it's a piece of cake. – Michael Grant Jun 02 '15 at 20:40
  • By the way there's a name for this type of problem: "nonnegative least squares". Matlab has a function lsqnonneg that solves nonnegative least squares problem, but you can probably get better results using CVX or some proximal algorithm. – littleO Jun 03 '15 at 02:59
  • It is unclear to me that the NNLS module can handle the equality constraint. And I don't think that you can eliminate it, because then you will eliminate the non-negativity constraint on one variable as well. – Michael Grant Jun 03 '15 at 11:16

3 Answers

18

Your problem can be conveniently re-written as \begin{eqnarray} \underset{x \in \mathbb{R}^K}{\text{min }}f(x) + g(x), \end{eqnarray} where $f: x \mapsto \frac{1}{2}\|Ax-b\|^2$ and $g = i_{\mathbb{R}^K_+}$, the indicator function (in the convex-analytic sense) of the nonnegative orthant of $\mathbb{R}^K$. $f$ is smooth with Lipschitz gradient (the squared spectral norm $\|A\|^2$ is a possible Lipschitz constant), while $g$ has a simple proximal operator, $prox_g(x) := (x)_+$ (the orthogonal projector onto the aforementioned orthant). So, proximal methods like FISTA are your friend.

In your "main problem", the aforementioned orthant is simply replaced with the standard simplex. The projector onto this simplex, though inaccessible in closed form, can be computed very cheaply using (for example) the simple algorithm presented in section 3 of the paper http://www.magicbroom.info/Papers/DuchiShSiCh08.pdf.

The projection can be implemented in a few lines of Python:

import numpy as np


def proj_simplex(v, z=1.):
    """Projects v onto the simplex {x >= 0, x_1 + x_2 + ... + x_n = z}.

    The method is John Duchi's O(n log n) Algorithm 1.
    """
    u = np.sort(v)[::-1]  # sort v in decreasing order
    aux = (np.cumsum(u) - z) / np.arange(1., len(v) + 1.)
    # threshold at the last index where u exceeds aux, then clip at 0
    return np.maximum(v - aux[np.nonzero(u > aux)[0][-1]], 0.)

BTW, what is the proximal operator of an "arbitrary" convex function $g$?

Formally, \begin{eqnarray} prox_g(x) := \underset{p \in \mathbb{R}^K}{\text{argmin }}\frac{1}{2}\|p-x\|^2 + g(p). \end{eqnarray}

"Proximable" functions (i.e., functions for which the argmin problem in the definition above is easy to solve, for any point $x$) play just as important a role as differentiable functions. The proximal operator lets you make "implicit gradient steps". Indeed, one has the characterization \begin{eqnarray}p = prox_g(x)\text{ iff } x - p \in \partial g(p), \end{eqnarray} where \begin{eqnarray}\partial g(p) := \{u \in \mathbb{R}^K \mid g(q) \ge g(p) + \langle u, q - p\rangle\ \forall q \in \mathbb{R}^K\}\end{eqnarray} is the subdifferential of $g$ at $p$ (this reduces to the singleton $\{\nabla g(p)\}$ if $g$ is differentiable at $p$).

In your problem(s) above, the proximal operator happens to be a projection operator. In fact, for any closed convex subset $C \subseteq \mathbb{R}^K$, a little algebra reveals that \begin{eqnarray} prox_{i_C}(x) := \underset{p \in \mathbb{R}^K}{\text{argmin }}\frac{1}{2}\|p-x\|^2 + i_C(p) = \underset{p \in C}{\text{argmin }}\frac{1}{2}\|p-x\|^2 =: proj_C(x), \end{eqnarray} where $i_C$ is the indicator function of $C$, defined by $i_C(x) := 0$ if $x \in C$, and $+\infty$ otherwise. A less trivial example is the $\ell_1$-norm $\|.\|_1$, whose proximal operator (with parameter $\gamma > 0$) is the so-called soft-thresholding operator $prox_{\gamma\|.\|_1}(x) = soft_\gamma(x) = (v_k)_{1\le k \le K}$, where \begin{eqnarray} v_k := \left(1- \dfrac{\gamma}{|x_k|}\right)_+x_k. \end{eqnarray}
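As a concrete check of the soft-thresholding formula, here is a minimal NumPy sketch (the function name `soft_threshold` is mine, not from any library):

```python
import numpy as np


def soft_threshold(x, gamma):
    """Proximal operator of gamma * ||.||_1, applied componentwise.

    Equivalent to (1 - gamma/|x_k|)_+ * x_k, written in a form that
    avoids dividing by zero when x_k == 0.
    """
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)
```

Components with magnitude below $\gamma$ are set to zero, and the rest are shrunk towards zero by $\gamma$, which is exactly what the formula above prescribes.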

Proximal operators are a handy tool in modern convex analysis. They find great use in problems arising in signal processing, game theory, machine learning, etc. Here is a nice place to start learning about proximal operators and similar objects: http://arxiv.org/pdf/0912.3522.pdf.

Most importantly, "under mild conditions" one can show (see the previous reference) that a point $x^*$ minimizes $f + g$ iff

\begin{eqnarray} x^* = prox_{\gamma g}(x^* - \gamma \nabla f(x^*)), \forall \gamma > 0 \end{eqnarray}

Thus the minimizers of $f + g$ coincide with the fixed points of the operators $prox_{\gamma g}\circ(Id - \gamma \nabla f)$, $\gamma > 0$. This suggests the following algorithm, known as the forward-backward algorithm (Moreau; Lions and Mercier; P.-L. Combettes et al.) \begin{eqnarray} x^{(n+1)} = \underbrace{prox_{\gamma_n g}}_{\text{backward / prox step}}\underbrace{(x^{(n)} - \gamma_n \nabla f(x^{(n)}))}_{\text{forward / gradient step}}, \end{eqnarray}

for an appropriately chosen sequence of step-sizes $(\gamma_n)_{n \in \mathbb{N}}$.
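For the nonnegative least-squares problem above, this iteration is just projected gradient descent: a gradient step on $f$ followed by clipping at zero. A minimal sketch, assuming a constant step size $1/\|A\|^2$ (the function name is mine):

```python
import numpy as np


def nnls_forward_backward(A, b, n_iter=500):
    """Forward-backward iteration for min 0.5 * ||Ax - b||^2 s.t. x >= 0."""
    L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of grad f
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)          # forward / gradient step
        x = np.maximum(x - grad / L, 0.)  # backward / prox step
    return x
```

FISTA adds a momentum term on top of this basic scheme to accelerate convergence, but the two steps per iteration are the same.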

If $g$ is constant, so that it suffices to minimize $f$ alone, then the above iterates become \begin{eqnarray} x^{(n+1)} = x^{(n)} - \gamma_n \nabla f(x^{(n)}), \end{eqnarray}

and we recognize our old friend, the gradient descent algorithm, taught in high school.

N.B.: $(x)_+$ denotes the componentwise maximum of $x$ and $0$.

dohmatob
  • 9,535
  • Yes! A very nice alternative answer. – Michael Grant Jun 04 '15 at 06:31
  • please I have a question: why we need the projector of $g$ (instead of $g$ itself) ? – tam Jun 04 '15 at 13:38
  • @mat: I've expanded on my answer. I hope things are clearer now. – dohmatob Jun 04 '15 at 19:20
  • @mat, I think you should switch to this answer if possible. This really is the "modern" way to do it. – Michael Grant Jun 08 '15 at 13:43
  • @MichaelGrant as you suggest, I will switch to this answer. Please I have a question: in the third answer, why we cannot put the optimal $\mathbf{x}$ (for a fixed $\alpha$) in $\mathbf{L}$, then derive the expression with respect to $\alpha$ to get the optimal $\alpha$ and replace it in the optimal $\mathbf{x}$ ? thank you very much! – tam Jun 13 '15 at 09:57
  • Great answer. However I think the forward-backward method has been around a long time -- this paper by Tseng mentions that it was proposed (for monotone inclusion problems) in the 1979 Lions and Mercier paper. – littleO Jun 29 '15 at 07:04
  • 1
@littleO: Thanks for the catch. Corrected. Indeed, it seems P.-L. Combettes is not the lawful reference for the fb scheme (and I don't think he'd claim it). Recently I spoke with an old wise man who told me the fb ideas were known to Moreau himself, and he reckons the latter actually proposed a version of the scheme at one point in the 50s or 60s. Also, Lions and Mercier published some proofs, I think. He told me a similar anecdote about the proximal point algorithm, which is incorrectly attributed to Rockafellar. – dohmatob Jul 18 '15 at 14:26
  • Can you look at my question here http://math.stackexchange.com/questions/1949870/how-to-calculate-the-proximal-operator-for-l0-norm ? Thank you! – good2know Oct 01 '16 at 23:43
  • @dohmatob Thanks for this alternative answer. It looks very elegant, however I have to say the maths is a bit over my head. I'm a computer scientist and I know a fair bit about (unconstrained) LS optimization -- but nothing about proximal methods, indicator functions, Lipschitz gradients or simplex projectors. I came here because I need to solve a constrained LS problem. You appear to be very knowledgeable on the subject matter. Would you mind to provide some ELI5 to make your answer more accessible? – rsp1984 Jan 29 '19 at 11:11
Hi @rsp1984. Post your question and I'd be happy to help. About the answer above, I'd be happy to provide help for specific difficulties you may have in following the display. N.B.: What's "ELI5"? – dohmatob Jan 29 '19 at 13:17
  • @dohmatob Thanks. I don't have a question other than the original question. I need to solve an LS problem with non-negativity constraints. I found your answer very interesting -- but as I said I only understand small pieces of it. In general it would be good to get the 15000 feet view of what you are describing. It appears you can minimize f(x) + g(x) with a "proximal method" and the "magic" seems to be in g(x). However when you define g, the x is not being used on the right side of the equation and I also don't know what an indicator function is. ELI5 == "explain like I'm 5 (yrs old)" :) – rsp1984 Jan 29 '19 at 15:07
  • @dohmatob: I actually just found this paper (http://web.stanford.edu/~boyd/papers/pdf/prox_algs.pdf) which explains some of the "magic". Still would be good to get your explanations though. – rsp1984 Jan 29 '19 at 15:23
  • Expanded, the definition of $g$ means $g(x) = 0$ if all components of $x$ are nonnegative (i.e the constraints are met) and $g(x) = \infty$ otherwise. – dohmatob Jan 29 '19 at 15:24
  • Concerning explaining further, this can't be made any simpler though :). You need to pick up the notions of: indicator function, support function, convex conjugate, proximal operator. This notions are really easy to catch even with a basic math background. Don't let their names intimidate you. Also, the pdf you referenced above seems to be a good place to start fidling with those things. – dohmatob Jan 29 '19 at 15:26
  • Also $(z)_+$ is shorthand for $\max(0, z)$ (elementwise maximum). – dohmatob Jan 29 '19 at 15:28
  • @dohmatob I am working through the proximal algorithms notes and I find them relatively accessible. However I have great difficulty getting an intuitive understanding of what exactly the convex conjugate is and what it's good for. There's zero material out there that explains it well (as with so many other math topics). Of course it's f*(y) = sup(x'y - f(x)) ! But what does it mean? So I have a hyperplane with slope y and I subtract my f(x) from that and see what the supremum of the resulting thing is. But why? Where the intuition behind that? – rsp1984 Feb 02 '19 at 00:30
  • Why a Fenchel Conjugate ? https://people.ok.ubc.ca/bauschke/Research/68.pdf. First, it allows for a generalization of Hoelder's inequality (and therefore of the Cauchy-Schwarz inequality). This generalization if called the Fenchel-Young inequality: $f(x) + f^*(y) \ge x^Ty$. It allows us to generalize many fundamental ideas from linear programming: slackness, duality, etc. ... – dohmatob Feb 02 '19 at 07:27
  • @dohmatob: Ok, here's my personal ELI5 of convex conjugates: https://math.stackexchange.com/a/3097458/29851 Would you consider this an adequate explanation? – rsp1984 Feb 02 '19 at 17:05
OK, I haven't read everything, but it sounds good. Up-voted. In any case, note that different people understand different concepts via different paths. There is no canonical way to understanding. It's more or less a personal journey. – dohmatob Feb 02 '19 at 19:26
13

The problem as written has no analytic solution for general $A$, $b$. (Yes, indeed, there are exceptions. If the solution of the same problem with the inequalities removed happens to produce a positive solution, then you've solved the original problem. But in the far more likely event that the inequality-free solution has negative entries, there is no analytic solution.)

In order to find the solution, you need either a quadratic programming engine or a second-order cone programming engine.

But before you run such a solver, you have to convert the problem to the standard form the solver expects. For instance, the QP formulation of your problem is \begin{array}{ll} \text{minimize}_{x} & x^T A^TA x - 2b^T A x + b^T b \\ \text{subject to} & \vec{1}^T x = 1 \\ & x \geq 0 \end{array} The second-order cone version is a bit more complex: \begin{array}{ll} \text{minimize}_{t,x} & t \\ \text{subject to} & \|A x - b \|_2 \leq t \\ & \vec{1}^T x = 1 \\ & x \geq 0 \end{array} A modeling framework can help here, by eliminating the need to do this transformation yourself. For instance, the MATLAB package CVX can express your problem as follows:

cvx_begin
    variable x(n)
    minimize(sum_square(A*x-b))
    subject to
        sum(x) == 1
        x >= 0
cvx_end

Disclaimer: I wrote CVX. But you certainly don't have to use it. In MATLAB alone there is a wealth of other choices, including YALMIP and the QP solver in MATLAB's own Optimization Toolbox.
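Outside MATLAB, the same QP formulation can be handed to a general-purpose solver. Here is a sketch using SciPy's SLSQP method, which handles the equality constraint and the bounds directly (the function name `simplex_ls` is mine; `A` and `b` are assumed to be NumPy arrays):

```python
import numpy as np
from scipy.optimize import minimize


def simplex_ls(A, b):
    """Solve min ||Ax - b||^2  s.t.  sum(x) == 1, x >= 0 via SLSQP."""
    n = A.shape[1]
    res = minimize(
        fun=lambda x: np.sum((A @ x - b) ** 2),
        x0=np.full(n, 1.0 / n),                 # feasible starting point
        jac=lambda x: 2.0 * A.T @ (A @ x - b),  # exact gradient
        bounds=[(0.0, None)] * n,               # x >= 0
        constraints=[{"type": "eq", "fun": lambda x: x.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x
```

For problems of modest size this is serviceable; for large or repeated solves, a dedicated QP engine (as discussed above) will be faster and more robust.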

Michael Grant
  • 19,450
0

First, the Lagrangian function is \begin{align} \mathbf{L}(x,\alpha)=\Vert Ax-b \Vert^2 - \alpha^\mathsf{T} x \end{align} The gradient of the Lagrangian vanishes at the optimal point: \begin{align} \frac{\partial \mathbf{L}}{\partial x} = 2 A^\mathsf{T}(Ax-b)-\alpha=0\\ 2A^\mathsf{T}A x= \alpha+2A^\mathsf{T} b\\ \end{align} If $A^\mathsf{T}A$ is not invertible, we can add a small regularization term (i.e., replace $2A^\mathsf{T}A$ by $2A^\mathsf{T}A + \mu I$) or use the Moore-Penrose pseudoinverse: \begin{align} x=(2A^\mathsf{T}A)^\dagger \big(\alpha+2A^\mathsf{T} b\big)+x_0\\ \end{align} where $x_0$ is a vector in the null space of $A^\mathsf{T}A$ (it can be taken to be zero). The dual problem is \begin{align} \max_{\alpha\geq 0 } \min_{x} \mathbf{L}(x,\alpha) \end{align} whose objective you obtain by substituting the above $x$ into the Lagrangian. The result is a simple quadratic function of $\alpha$ with the constraint $\alpha \geq 0$, which can be solved using ordinary quadratic solvers.
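As a quick numerical sanity check of the stationarity equation (with hypothetical data of my choosing): setting $\alpha = 0$ drops the inequality constraints, so the formula should reproduce the ordinary least-squares solution when $A$ has full column rank.

```python
import numpy as np

# Hypothetical data: A has full column rank, so A^T A is invertible
A = np.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
b = np.array([1.0, 2.0, 3.0])

alpha = np.zeros(2)  # multipliers; alpha = 0 drops the constraints
# x = (2 A^T A)^+ (alpha + 2 A^T b), from the vanishing gradient
x = np.linalg.pinv(2 * A.T @ A) @ (alpha + 2 * A.T @ b)

# With alpha = 0 this coincides with the ordinary LS solution
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```

For the actual constrained optimum, of course, $\alpha$ must be obtained from the dual problem rather than set to zero.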

user85361
  • 845
  • This whole Lagrangian development is unnecessary. "Ordinary quadratic solvers" can handle the equality constraint just as readily as the non-negativity constraints on $x$. – Michael Grant Jun 02 '15 at 20:40
  • Yes. But may be mat wants to find another form of the problem for some reason. Otherwise the question is obvious. – user85361 Jun 02 '15 at 20:44
  • @MichaelGrant The question seems more analytical than practical. – callculus42 Jun 02 '15 at 20:45
  • 2
    @user85361, I don't think the original poster considers the question obvious at all. From my experience on this forum, people quite often look for closed-form/analytic solutions to problems that do not have one. The right thing to do is simply point that out and point to the relevant numerical solver. Constructing the Lagrangian in this case doesn't actually get you any closer to solving the problem. – Michael Grant Jun 02 '15 at 20:48
  • 1
    @Michael Grant, I don't have so much experience in this site as you do. I hope it'll help. – user85361 Jun 02 '15 at 20:54
  • 1
    Let me be clear, I certainly hope that you continue to contribute here! We can see from this answer that you have valuable knowledge. I would only suggest that ensuring you answer the question directly and completely is an important objective. you will see that for yourself when you craft a solid answer and the points come rolling in :-) – Michael Grant Jun 03 '15 at 11:14
  • @user85361, Could you explain the move from the Gradient of the Lagrangian and the equation with $ {x}_{0} $? Thank You. – Royi May 31 '16 at 18:00