I was studying Stephen Boyd's textbook on convex optimization. It says the following:
The amplitude distribution of the optimal residual for the l1-norm approximation problem will tend to have more zero and very small residuals, compared to the l2-norm approximation solution. In contrast, the l2-norm solution will tend to have relatively fewer large residuals (since large residuals incur a much larger penalty in l2-norm approximation than in l1-norm approximation).
I understand why the second sentence holds -- obviously, the l2 norm places a much higher penalty on large residuals and hence would yield fewer of them. But I can't understand the first sentence. The l1 norm places a higher penalty than the l2 norm on residuals between 0 and 1 (for example, a residual of 0.1 costs 0.1 under the l1 norm but only 0.01 under the l2 norm), so it seems to me that the l2 norm should yield more small residuals. Can anybody explain why the l1 norm generates more zero and very small residuals than the l2 norm?
In fact, the two statements sound contradictory to each other: if the l2 norm generates fewer large residuals, it seems like it should also generate more small residuals than the l1 norm.
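To make the question concrete, here is a small numerical sketch (assuming numpy and cvxpy are available; the problem sizes, the random data, and the 1e-6 and 2.0 thresholds are arbitrary choices of mine, not from the book) that compares the residuals of the two approximation problems:

```python
# Sketch: solve an overdetermined approximation problem min ||A x - b||
# under the l1 and l2 norms and compare the residual distributions.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
m, n = 200, 30                      # overdetermined: many more residuals than variables
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# l2-norm (least-squares) solution
x2, *_ = np.linalg.lstsq(A, b, rcond=None)
r2 = A @ x2 - b

# l1-norm solution as a convex program
x1 = cp.Variable(n)
cp.Problem(cp.Minimize(cp.norm1(A @ x1 - b))).solve()
r1 = A @ x1.value - b

for name, r in [("l1", r1), ("l2", r2)]:
    print(f"{name}: #|r| < 1e-6: {np.sum(np.abs(r) < 1e-6):3d}, "
          f"#|r| > 2: {np.sum(np.abs(r) > 2):3d}, max |r|: {np.abs(r).max():.2f}")
```

The printed counts should show the pattern the book describes: the l1 solution has many residuals that are numerically zero but a larger maximum residual, while the l2 solution has almost no zero residuals but fewer large ones. That is exactly the behavior I would like to understand.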