10

The Unit Simplex is defined by:

$$ \mathcal{S} = \left\{ x \in \mathbb{R}^{n} \mid x \succeq 0, \, \boldsymbol{1}^{T} x = 1 \right\} $$

Orthogonal Projection onto the Unit Simplex is defined by:

$$ \begin{alignat*}{3} \arg \min_{x} & \quad & \frac{1}{2} \left\| x - y \right\|_{2}^{2} \\ \text{subject to} & \quad & x \succeq 0 \\ & \quad & \boldsymbol{1}^{T} x = 1 \end{alignat*} $$

How could one solve this convex optimization problem?

Royi
  • 8,711

3 Answers

7

The projection onto the simplex can be calculated as follows.
The Lagrangian in that case is given by:

$$ L \left( x, \mu \right) = \frac{1}{2} {\left\| x - y \right\|}^{2} + \mu \left( \boldsymbol{1}^{T} x - 1 \right) \tag1 $$

The trick is to leave the non-negativity constraint implicit.
Hence the Dual Function is given by:

\begin{align} g \left( \mu \right) & = \inf_{x \succeq 0} L \left( x, \mu \right) && \text{} \\ & = \inf_{x \succeq 0} \sum_{i = 1}^{n} \left( \frac{1}{2} { \left( {x}_{i} - {y}_{i} \right) }^{2} + \mu {x}_{i} \right) - \mu && \text{Component wise form} \tag2 \end{align}

Taking advantage of the component-wise form, the solution is given by:

\begin{align} x_i^\ast = \left(y_i-\mu\right)_+ \tag3 \end{align}

where the solution enforces the non-negativity constraint by projecting onto $ {\mathbb{R}}_{+} $.
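
To make the step from $(2)$ to $(3)$ explicit (this is only an expanded version of the argument above): the sum in $(2)$ is separable in the coordinates, so each $ {x}_{i} $ can be minimized on its own:

$$ {x}_{i}^{\ast} = \arg \min_{{x}_{i} \geq 0} \frac{1}{2} { \left( {x}_{i} - {y}_{i} \right) }^{2} + \mu {x}_{i} $$

The unconstrained minimizer of this scalar convex function is $ {x}_{i} = {y}_{i} - \mu $. Since the function is convex in the single variable $ {x}_{i} $, clipping the unconstrained minimizer to $ \left[ 0, \infty \right) $ gives the constrained minimizer, which is exactly $ { \left( {y}_{i} - \mu \right) }_{+} $.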

The solution is given by finding the $ \mu $ which satisfies the constraint (pay attention: since the above is an equality constraint, $ \mu $ can take any value and is not limited to being non-negative, as a $ \lambda $ for an inequality constraint would be).

The objective function (From the KKT) is given by:

\begin{align} 0 = h \left( \mu \right) = \sum_{i = 1}^{n} {x}_{i}^{\ast} - 1 & = \sum_{i = 1}^{n} { \left( {y}_{i} - \mu \right) }_{+} - 1 \tag4 \end{align}

The above is a piecewise linear function of $ \mu $.

Since the function is continuous yet not differentiable (due to its piecewise nature), theory says we must use derivative-free methods for root finding. One could use the Bisection Method, for instance.
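
A minimal bisection sketch in Python / NumPy (the bracketing interval and tolerance below are my own choices, not part of the original derivation):

```python
import numpy as np

def h(mu, y):
    """h(mu) = sum_i (y_i - mu)_+ - 1; continuous, decreasing, piecewise linear."""
    return np.maximum(y - mu, 0.0).sum() - 1.0

def project_simplex_bisection(y, tol=1e-12, max_iter=200):
    """Project y onto the unit simplex by bisection on the dual variable mu."""
    # Bracket the root: h(y.min() - 1) >= 0 and h(y.max()) = -1 < 0.
    lo, hi = y.min() - 1.0, y.max()
    for _ in range(max_iter):
        mu = 0.5 * (lo + hi)
        if h(mu, y) > 0.0:
            lo = mu
        else:
            hi = mu
        if hi - lo < tol:
            break
    mu = 0.5 * (lo + hi)
    return np.maximum(y - mu, 0.0)

y = np.random.randn(7)
x = project_simplex_bisection(y)
print(x.sum(), x.min())   # approximately 1 and >= 0
```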

The derivative of the function, where it exists, is given by:

$$ \begin{align} \frac{\mathrm{d} }{\mathrm{d} \mu} h \left( \mu \right) & = \frac{\mathrm{d} }{\mathrm{d} \mu} \sum_{i = 1}^{n} { \left( {y}_{i} - \mu \right) }_{+} \\ & = \sum_{i = 1}^{n} -{ \mathbf{1} }_{\left\{ {y}_{i} - \mu > 0 \right\}} \end{align} $$

In practice, it can be solved using Newton iteration (since landing exactly on a joint between two linear segments has almost zero probability).
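
A small Newton-style sketch along these lines (this is not the MATLAB code mentioned below; the initialization and stopping rule are my own choices). Because $ h $ is convex, decreasing and piecewise linear, starting from a point where $ h \left( \mu \right) \geq 0 $ makes the iterates increase monotonically toward the root, terminating after finitely many segment changes:

```python
import numpy as np

def project_simplex_newton(y, tol=1e-12, max_iter=100):
    """Project y onto the unit simplex via Newton iteration on h(mu)."""
    # Start at a point with h(mu) >= 0; (sum(y) - 1) / n is one such point
    # (and it is already the answer when all coordinates remain positive).
    mu = (y.sum() - 1.0) / y.size
    for _ in range(max_iter):
        active = y > mu                       # coordinates with y_i - mu > 0
        val = (y[active] - mu).sum() - 1.0    # h(mu)
        if val <= tol:                        # also guards active.sum() == 0 (then val = -1)
            break
        mu += val / active.sum()              # Newton step: mu - h(mu)/h'(mu), h'(mu) = -#active
    return np.maximum(y - mu, 0.0)
```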

Accurate / Exact Solution

If we look at the values of the function $ h \left( \mu \right) = \sum_{i = 1}^{n} { \left( {y}_{i} - \mu \right) }_{+} - 1 $, one can easily infer a method to calculate the exact solution:

[Figure: values of $ h \left( \mu \right) $ evaluated at the breakpoints $ \mu = {y}_{i} $ and at two extra points beyond the minimum and maximum of $ y $]

In the above, the parameter $ \mu $ took the values of the vector $ {y}_{i} $ with additional values at the edges (a value greater than the maximum of $ {y}_{i} $ and a value lower than the minimum of $ {y}_{i} $).
By iterating over these values one can easily track the 2 consecutive values between which the function changes sign, i.e., one side has value greater than $ 0 $ and the other lower than $ 0 $ (in case one of them is zero, it is the optimal value of $ \mu $). Since the function is linear on that segment and we have 2 points, we can infer the parameters of the model $ y = a x + b $. Then the optimal value is $ \hat{\mu} = - \frac{b}{a} $.
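
A possible rendering of this exact breakpoint search in Python / NumPy (my own sketch of the idea above, not the linked MATLAB code):

```python
import numpy as np

def project_simplex_breakpoints(y):
    """Exact projection onto the unit simplex by scanning the breakpoints of h."""
    # The breakpoints of h are the entries of y; prepend a value below the minimum
    # so that h is non-negative at the first evaluation point.
    mus = np.concatenate(([y.min() - 1.0], np.sort(y)))
    hv = np.array([np.maximum(y - mu, 0.0).sum() - 1.0 for mu in mus])
    k = np.flatnonzero(hv >= 0)[-1]           # last point with h >= 0 (h is non-increasing)
    # h is affine on [mus[k], mus[k + 1]] and changes sign there, so linear
    # interpolation between the two points recovers the root exactly.
    mu = mus[k] - hv[k] * (mus[k + 1] - mus[k]) / (hv[k + 1] - hv[k])
    return np.maximum(y - mu, 0.0)
```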

I wrote MATLAB code which implements the method with Newton Iteration at Mathematics StackExchange Question 2327504 - GitHub. I extended the method for the case $ \sum {x}_{i} = r, \; r > 0 $ (Pseudo Radius).
There is a test which compares the result to a reference calculated by CVX.
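
For readers without MATLAB / CVX, a rough CVXPY analogue of such a reference check might look as follows (the solver defaults and the random test vector are assumptions on my part):

```python
import cvxpy as cp
import numpy as np

np.random.seed(0)
y = np.random.randn(10)

x = cp.Variable(y.size)
objective = cp.Minimize(0.5 * cp.sum_squares(x - y))
constraints = [x >= 0, cp.sum(x) == 1]
cp.Problem(objective, constraints).solve()

# x.value can now be compared entry-wise against any of the projection
# routines sketched above (they should agree up to solver tolerance).
print(x.value, x.value.sum())
```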

Royi
  • 8,711
  • The constraint regarding $ \mu $ is given by $ h \left( \mu \right) = 0 $. I guess objective function isn't the most clear term :-). But in the case above it is the objective function of the Dual Function. – Royi Apr 03 '18 at 15:47
  • Yep. You got it right. – Royi Apr 04 '18 at 13:15
  • It seems to me that the minimization problem boils down to finding $\mu$ such that $h \left( \mu \right)= 0$ or $\sum_{i = 1}^{n} { \left( {y}_{i} - \mu \right) }_{+} - 1 = 0$, right? – Akira Feb 28 '20 at 08:35
  • Yep. Indeed this is the case from the above (KKT). – Royi Feb 28 '20 at 08:37
  • I'm sorry but your solution seems much simpler than the very complicated one proposed by Yunmei Chen and Xiaojing Ye. Am I missing something? – Akira Feb 28 '20 at 08:41
  • That's the point. It is much simpler. This is why I like it :-). – Royi Feb 28 '20 at 08:48
  • I also have a similar problem here. I write down KKT conditions and end up with an unsolved system. If you don't mind, please give any suggestion on it! Btw, I could not understand why people include the constant $1/2$ in the objective function :) – Akira Feb 28 '20 at 08:52
  • It seems your problem is exactly the same! The $ \frac{1}{2} $ factor assists with a cleaner expression for the derivative (It cancels the factor 2 from the derivative). – Royi Feb 28 '20 at 09:06
  • How can we solve $h(\mu) = 0$ using Newton's method if $h$ is not differentiable? I believe convergence proofs for Newton's method require $h$ to be differentiable, which is not true in this case because $h$ is piecewise linear. (Also if during Newton's method we hit one of the points where $h$ is not differentiable, Newton's method fails.) – littleO Apr 13 '20 at 04:30
  • @littleO, You're right. I was a little sloppy there. I will update the answer. – Royi Apr 13 '20 at 06:34
  • Thanks. Although using Newton's method here is conceptually simpler, which is nice, I think the specialized algorithm in Projection Onto a Simplex is strictly better -- it should be faster, it has a convergence guarantee, and it can still be implemented in a short amount of code. – littleO Apr 13 '20 at 08:10
  • @littleO, I agree. I just wanted to derive it on my own. Also wanted to be able to solve for any positive value of $ \sum_{i} {x}_{i} = r $. It was easier for me with the Dual approach. I actually implemented the algorithm on the paper and also extended it (Change of one line) to support any positive $ r $. – Royi Apr 13 '20 at 08:47
  • @littleO, I added a simple (Simpler?) method how to extract the exact analytic solution of $ \hat{\mu} $ from the approach I wrote above. Now I don't see any advantage to the version in the paper :-). – Royi Apr 14 '20 at 09:10
  • Can you derive Equation (3) from Equation (2) in details? Can you prove Equation (3) and (4) is indeed the unique minimizer? – Hans Mar 02 '22 at 09:40
  • @Hans, Going (2)->(3) is very simple. Just take the derivative with respect to $ {x}_{i} $ in the summed term and see. Regarding (3) and (4), they are the minimizer by the definition of a KKT point for a convex problem. – Royi Mar 02 '22 at 10:16
  • That is what is confusing if not wrong. Differentiating your Lagrangian to $x_i$ does not generate $(\cdot)_+$. I thought you meant to derive it by other means than KKT, as indicated by your Lagrangian without the non negativity inequalities terms along with your enigmatic statement "the trick is to leave non negativity constrain implicit." The KKT method on the contrary explicitly incorporates the non-negativity constraint into the Lagrangian. – Hans Mar 02 '22 at 21:28
  • To invoke KKT, you should put the derivation explicitly into the KKT form and cite a theorem guaranteeing the necessity and sufficiency of the KKT point being the (unique) minimizer. – Hans Mar 02 '22 at 21:28
  • @Hans, I am not sure I get you. The element wise problem in 2 is really easy to solve. The derivative will get you the idea. Incorporation of the constraint is easy. If you want to employ KKT you may. I think for that sub problem it is so clear it can be skipped for brevity. I don’t cite the theorem as it is clear for everyone dealing with such problems of Convex Optimization. This is not an answer in undergraduate test. – Royi Mar 03 '22 at 04:43
  • Then let us follow your idea to its logical end one thing at a time. The derivative is $\frac{\partial f}{\partial x_i}=x_i-y_i+\mu=0$ or the optimizer $x^\ast_i=y_i-\mu$. Where do you get the function $x^\ast_i=(y_i-\mu)_+$ with the extra $(\cdot)_+$? – Hans Mar 03 '22 at 06:36
  • The problem to solve is $ \frac{1}{2} {\left( {x}_{i} - {y}_{i} \right)}^{2} + \mu {x}_{i} \text{ subject to } {x}_{i} \geq 0 $. Try solving it, I'm pretty sure you can do it in a minute or so and get the result I have. You may build the KKT of it, and see the solution I suggested hold it all (The gradient vanishes). – Royi Mar 03 '22 at 07:06
  • @Royi: How do you reduce the $n$ dimensional problem to $1$ dimension? Are you saying $\inf_{x \succeq 0} \sum_{i = 1}^{n} \left( \frac{1}{2} { \left( {x}_{i} - {y}_{i} \right) }^{2} + \mu {x}_{i} \right) - \mu \Longleftrightarrow \inf_{x_i \ge 0} \frac{1}{2} { \left( {x}_{i} - {y}_{i} \right) }^{2} + \mu {x}_{i} - \mu, \; \forall i$? Why? – Hans Mar 03 '22 at 16:20
  • No response to my last question? This derivation is wrong. You have to either use the KKT theorem, or establish the equivalence between minimizing over the sum over all dimensions and that over each dimension. You have done neither. – Hans Mar 04 '22 at 19:27
  • @Hans, the answer and derivation are perfect. If you struggle to see the properties of the problem, such as being separable, I suggest you open a specific question about your concern. – Royi Mar 05 '22 at 04:52
  • Yes, exactly, you need to prove the separability. In any case, you have not properly proved/justified that the suggested solution is indeed the (unique) minimizer. This is not, at least a complete, solution. Again, if KKT theorem is invoked, the solution becomes trivial. – Hans Mar 05 '22 at 10:04
  • @Hans, feel free to open a question about the separability of the L1 and L2 norms. I think you have written all your remarks, I don’t think they require any change. Further discussion in a separate question. – Royi Mar 05 '22 at 11:31
3

The best algorithm to compute the exact solution to this problem can be found in Projection Onto A Simplex.
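
For reference, here is a short Python / NumPy sketch of the standard sort-and-threshold projection; I believe it is the same family of algorithm as in the linked reference, but treat it as a summary of the usual textbook form rather than the paper's exact pseudo-code:

```python
import numpy as np

def project_simplex_sort(y):
    """Sorting-based exact projection of y onto the unit simplex."""
    u = np.sort(y)[::-1]                                 # entries of y in decreasing order
    css = np.cumsum(u)
    j = np.arange(1, y.size + 1)
    rho = np.flatnonzero(u - (css - 1.0) / j > 0)[-1]    # last index kept active
    theta = (css[rho] - 1.0) / (rho + 1)                 # plays the role of mu above
    return np.maximum(y - theta, 0.0)
```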

Royi
  • 8,711
  • 1
    would be awesome if the key ideas can be summarized in the post. – Siong Thye Goh Mar 21 '18 at 02:23
  • @SiongThyeGoh, It is based on the same idea as my solution below. The main difference is that instead of utilizing the Newton Method to find the optimal $ \mu $ they use some other trick based on the sorted values of the vector $ y $. – Royi Mar 21 '18 at 13:15
  • This idea is much older than that paper. Late 80's, I believe, a JOTA paper if I remember. – copper.hat Feb 29 '20 at 06:25
  • I extracted exact solution from my approach above. – Royi Apr 14 '20 at 09:40
  • I think the algorithm from Condat's 2014 paper (linked to in another answer) might be better than this one. – littleO Apr 20 '20 at 04:21
3

The paper by Condat [1] presents a review and comparison of existing algorithms with a new proposal for projection onto the unit simplex. This paper lists the worst-case complexity and empirical complexity of those algorithms, and presents concise pseudo-code for all algorithms. In particular, the algorithm proposed by Condat takes $O(n)$ time in practice, whereas sorting-based methods take $O(n \log n)$ time in practice. I have implemented Condat's algorithm in the past, and can vouch for its speed relative to direct sorting-based approaches.

[1] Laurent Condat, Fast Projection onto the Simplex and the $\ell_1$ Ball.

EDIT: Condat has included C and MATLAB implementations of all the algorithms mentioned in his paper here: https://lcondat.github.io/software.html

  • Feel like posting your code for Condat's algorithm? I think it could be helpful for a lot of people. – littleO Apr 20 '20 at 04:16
  • @littleO Thanks for the suggestion. I've included a link to Condat's webpage which has codes for all the algorithms. My own code was in one of the early versions of Julia, which is almost surely outdated – madnessweasley Apr 20 '20 at 04:24
  • My code above is also a ‘log(n)’ approach. – Royi Apr 20 '20 at 06:44
  • 1
    @Royi I don't understand your explanation for the "accurate/exact solution", but each function evaluation in the bisection step takes $\approx n$ operations. Additionally, even evaluating $h(\mu)$ takes $\approx n$ operations, so I'm not sure if $O(\log n)$ is even possible – madnessweasley Apr 20 '20 at 06:52
  • I didn't mean the one utilizing Newton Method. Though in practice I think it will be faster than any method. At least it was much faster than the o(logn) methods from above. – Royi Apr 20 '20 at 07:04
  • @Royi Please check the paragraph preceding Section 3 of Condat's paper. He mentions that one of the algorithms compared against is akin to Newton's method for finding the root $\mu$ in your answer. Section 4 of the paper includes computational experiments in different regimes. – madnessweasley Apr 20 '20 at 10:03
  • As far as I can tell, the performance table doesn't compare pure Newton Iteration Method as in my code. By the way, I don't think that on his site you can find the algorithms as proposed in this paper. If you find, point me to is and I will compare it. – Royi Apr 20 '20 at 10:21
  • 2
    @Royi I guess the C code includes the results of the paper, and the MATLAB code only includes a simple sorting-based approach. You are probably correct that he hasn't compared against your exact approach (which might be faster). I'm sorry, I don't have the time to delve deeper into the numerical comparison. Condat seems to have made a fair (for me) numerical comparison with the literature, and because I had a favorable experience with his approach, I just wanted to share it with the community. – madnessweasley Apr 20 '20 at 10:31
  • Your answer is great. Hence I +1 it. I will have a look on his C code. Though I don't have C code for the Newton approach. Maybe a project for one day for comparison using MEX in MATLAB. – Royi Sep 25 '20 at 07:38