3

Section 4.5 "Example: Linear Least Squares" of the textbook *Deep Learning* by Goodfellow, Bengio, and Courville says the following:

Suppose we want to find the value of $\mathbf{x}$ that minimizes

$$f(\mathbf{x}) = \dfrac{1}{2}||\mathbf{A} \mathbf{x} - \mathbf{b}||_2^2 \tag{4.21}$$

Specialized linear algebra algorithms can solve this problem efficiently; however, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work.

First, we need to obtain the gradient:

$$\nabla_{\mathbf{x}} f(\mathbf{x}) = \mathbf{A}^T (\mathbf{A}\mathbf{x} - \mathbf{b}) = \mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b} \tag{4.22}$$

We can then follow this gradient downhill, taking small steps. See algorithm 4.1 for details.


Algorithm 4.1 An algorithm to minimize $f(\mathbf{x}) = \dfrac{1}{2}||\mathbf{A} \mathbf{x} - \mathbf{b}||_2^2$ with respect to $\mathbf{x}$ using gradient descent, starting from an arbitrary value of $\mathbf{x}$.


Set the step size ($\epsilon$) and tolerance ($\delta$) to small, positive numbers.

while $||\mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b}||_2 > \delta$ do

$\ \ \ \mathbf{x} \leftarrow \mathbf{x} - \epsilon(\mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b})$

end while
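
As a concrete illustration of Algorithm 4.1, here is a minimal NumPy sketch (the function name, step size, tolerance, and iteration cap are my own choices, not from the book):

```python
import numpy as np

def least_squares_gd(A, b, eps=1e-3, delta=1e-6, max_iter=100_000):
    """Minimize f(x) = 0.5 * ||Ax - b||_2^2 by gradient descent (Algorithm 4.1)."""
    x = np.zeros(A.shape[1])                # arbitrary starting value of x
    for _ in range(max_iter):
        grad = A.T @ A @ x - A.T @ b        # gradient from equation 4.22
        if np.linalg.norm(grad) <= delta:   # while-loop stopping condition
            break
        x = x - eps * grad                  # take a small step downhill
    return x

# Quick check against a specialized solver on a small random problem
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)
print(np.allclose(least_squares_gd(A, b),
                  np.linalg.lstsq(A, b, rcond=None)[0], atol=1e-4))
```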


One can also solve this problem using Newton's method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton's method is exact, and the algorithm converges to the global minimum in a single step.
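
To spell out the single-step claim: the Hessian of $f$ is $\mathbf{A}^T\mathbf{A}$, so (assuming $\mathbf{A}^T\mathbf{A}$ is invertible) one Newton update from any starting point $\mathbf{x}_0$ gives

$$\mathbf{x}_1 = \mathbf{x}_0 - (\mathbf{A}^T\mathbf{A})^{-1}(\mathbf{A}^T\mathbf{A}\mathbf{x}_0 - \mathbf{A}^T\mathbf{b}) = (\mathbf{A}^T\mathbf{A})^{-1}\mathbf{A}^T\mathbf{b},$$

which is exactly the minimizer of equation 4.21.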

Now suppose we wish to minimize the same function, but subject to the constraint $\mathbf{x}^T \mathbf{x} \le 1$. To do so, we introduce the Lagrangian

$$L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} - 1). \tag{4.23}$$

We can now solve the problem

$$\min_{\mathbf{x}} \max_{\lambda, \lambda \ge 0} L(\mathbf{x}, \lambda). \tag{4.24}$$

The smallest-norm solution to the unconstrained least-squares problem may be found using the Moore-Penrose pseudoinverse: $\mathbf{x} = \mathbf{A}^+ \mathbf{b}$. If this point is feasible, then it is the solution to the constrained problem. Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to $\mathbf{x}$, we obtain the equation

$$\mathbf{A}^T \mathbf{A} \mathbf{x} - \mathbf{A}^T \mathbf{b} + 2 \lambda \mathbf{x} = 0 \tag{4.25}$$

This tells us that the solution will take the form

$$\mathbf{x} = (\mathbf{A}^T \mathbf{A} + 2 \lambda \mathbf{I})^{-1} \mathbf{A}^T \mathbf{b} \tag{4.26}$$

The magnitude of $\lambda$ must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on $\lambda$. To do so, observe

$$\dfrac{\partial}{\partial{\lambda}} L(\mathbf{x}, \lambda) = \mathbf{x}^T \mathbf{x} - 1 \tag{4.27}$$

When the norm of $\mathbf{x}$ exceeds $1$, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to $\lambda$, we increase $\lambda$. Because the coefficient on the $\mathbf{x}^T \mathbf{x}$ penalty has increased, solving the linear equation for $\mathbf{x}$ will now yield a solution with a smaller norm. The process of solving the linear equation and adjusting $\lambda$ continues until $\mathbf{x}$ has the correct norm and the derivative is $0$.
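
To make this procedure concrete, here is a rough NumPy sketch of the alternation the book describes: check feasibility of the pseudoinverse solution, and otherwise alternate between solving equation 4.26 for $\mathbf{x}$ and taking a gradient-ascent step on $\lambda$ using equation 4.27. The step size `eta`, the tolerance, the starting value of $\lambda$, and the function name are my own choices, not from the book:

```python
import numpy as np

def constrained_least_squares(A, b, eta=0.05, tol=1e-8, max_iter=100_000):
    """Minimize 0.5 * ||Ax - b||_2^2 subject to x^T x <= 1 (Lagrangian of eq. 4.23)."""
    x = np.linalg.pinv(A) @ b                 # smallest-norm unconstrained solution
    if x @ x <= 1:                            # feasible, so it solves the constrained problem
        return x, 0.0
    lam, n = 1.0, A.shape[1]
    for _ in range(max_iter):
        # Solve equation 4.26 for x at the current lambda
        x = np.linalg.solve(A.T @ A + 2 * lam * np.eye(n), A.T @ b)
        g = x @ x - 1                         # dL/dlambda, equation 4.27
        if abs(g) < tol:                      # x now has the correct norm (= 1)
            break
        lam = max(lam + eta * g, 0.0)         # gradient ascent on lambda, kept nonnegative
    return x, lam
```

With a small enough `eta` the iteration drives $\mathbf{x}^T\mathbf{x}$ toward $1$; if it oscillates, `eta` needs to be reduced.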

My questions here relate to the norm, and are similar to those that I asked here.

  1. At the beginning of this section, the authors reference the norm of $\mathbf{A} \mathbf{x} - \mathbf{b}$. However, at the end of the section, the norm of $\mathbf{x}$, rather than $\mathbf{A} \mathbf{x} - \mathbf{b}$, seemingly comes out of nowhere. Similar to my questions referenced above (in the other thread), where did the norm of $\mathbf{x}$ come from?

  2. My understanding is that the $\mathbf{x}^T \mathbf{x}$ "penalty" that the authors are referencing at the end here is the term $\lambda (\mathbf{x}^T \mathbf{x} - 1)$ in $L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} - 1)$; would that be correct? If so, then why does the coefficient on the $\mathbf{x}^T \mathbf{x}$ penalty increasing necessitate that solving the linear equation for $\mathbf{x}$ now yields a solution with a smaller norm?

  3. What is meant by "correct" norm in this last part?

I would greatly appreciate it if people would please take the time to clarify these points.

The Pointer
  • 4,182

2 Answers

2
  1. The norm of ${\bf x}$ does not come out of "nowhere." Instead, it is a component of the only term in $L({\bf x}, \lambda) = f({\bf x}) + \lambda({\bf x}^T{\bf x} -1)$ that depends upon $\lambda$. (In short, $\frac{d L({\bf x},\lambda)}{d \lambda} = \frac{d}{d \lambda} \lambda ({\bf x}^T {\bf x} -1)$.) The authors could have written out the full term, then taken the derivative with respect to $\lambda$, where you would then see that the first term is independent of $\lambda$, and hence its derivative vanishes. They just ignore it straightaway.

  2. You want to search for a large value of $\lambda$ so that $L({\bf x}, \lambda) = f({\bf x}) + \lambda({\bf x}^T{\bf x} -1)$ leads to a small value of $|{\bf x}|$; actually, a value that is close to $1$. (Note that they are multiplied together so a large value of $\lambda$ forces a small value of $|{\bf x}|$ and vice versa.) Imagine the limiting case in the other direction: Suppose you had a value of $\lambda$ so small it approached $0$. Then $|{\bf x}|$ could become large. (That is clearly undesirable.)

  3. Here "correct" simply means that the magnitude of ${\bf x}$ is as small as possible given the other constraints. You could substitute the term "solution ${\bf x}$" for "correct ${\bf x}$."

David G. Stork
  • Thanks for the answer. I don't understand 1.; can you please elaborate? As for 2., the equation you're referring to is $L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} - 1)$, right? – The Pointer Feb 01 '20 at 11:42
  • With regards to 2., finding a large value of $\lambda$ leads to a small value of the term $\lambda (\mathbf{x}^T \mathbf{x} - 1)$ in $L$, right? Not necessarily a small value of $|{\bf x}|$? – The Pointer Feb 01 '20 at 21:19
  • @ThePointer. No. A large value of $\lambda$ leads to a small value of $|{\bf x}|$ (actually, one close to $1$).... which is what we want. The product will be whatever value is determined by the data and estimation problem. Clear? – David G. Stork Feb 01 '20 at 22:29
  • Why is that the case? – The Pointer Feb 01 '20 at 23:26
  • Why is what the case? The optimal solution (optimal value for $L$) depends upon the data and model; as such, it can be whatever value it happens to be. If the optimal solution has $L = 1$, well, I can add data in special places to make $L = 2$. We seek a solution that is "simple," and that means small but finite $|{\bf x}|$, which we arbitrarily scale to $1$. Please: your questions are vague. Be precise so we can help. E.g.: "Why is that the case?"... why is WHAT the case? 1) product can be any value? 2) product is determined by data? 3) Large $\lambda$ means small $|{\bf x}|$?... – David G. Stork Feb 02 '20 at 02:06
  • What I meant was, if $\mathbf{x}^T \mathbf{x} = \vert\vert \mathbf{x} \vert \vert^2 \le 1$, then how is it the case that a large value of $\lambda$ leads to a small $\vert\vert \mathbf{x} \vert \vert^2$? – The Pointer Feb 02 '20 at 14:13
  • If the constraint corresponds to a given value, say $10$, then if $\lambda = 10$ then $|{\bf x}|^2 = 2$ (check that yourself); if instead $\lambda = 100$, then $|{\bf x}|^2 = 1.1$ (check that yourself). I don't think I can explain any clearer the elementary notion that if $x y = constant$ then $x$ large implies $y$ small, and vice versa. Think of the converse: Do you see why it is impossible for $\lambda = 100000$ and $|{\bf x}|^2 = 100000$ in this case? Or $\lambda = 0.0001$ and $|{\bf x}|^2 = 0.00001$? If not, well, I think I'll just admit defeat here and move on. – David G. Stork Feb 02 '20 at 16:36
  • Ahh, I understand now. The confusing thing was that the textbook refers to $\mathbf{x}^T \mathbf{x} \le 1$ as the constraint, but it seems that you're referring to the value of the term $\lambda (\mathbf{x}^T \mathbf{x} - 1)$ as the constraint? For instance, [...] – The Pointer Feb 03 '20 at 01:39
  • [...] you're saying that, if the constraint is $10$, then, if $\lambda = 10$, we have that $\vert\vert \mathbf{x} \vert\vert^2 = 2$: $$10(\mathbf{x}^T\mathbf{x} - 1) = 10 \Rightarrow \mathbf{x}^T\mathbf{x} = 2$$ But, as I said, this is putting a constraint on the term $\lambda (\mathbf{x}^T \mathbf{x} - 1)$ (the penalty) -- not $\vert\vert \mathbf{x} \vert\vert^2$ (what the textbook calls the constraint), right? So why does this discrepancy exist? – The Pointer Feb 03 '20 at 01:40
  • @ThePointer: I'm not putting a constraint on $L$!!! There is some value that is best. As I said: if the best value of the full constraint term happens to be $10$ (or whatever), then a large $\lambda$ implies a small $|{\bf x}|^2$. See page 610 of my book *Pattern classification*. I'm done here. Over and out. https://www.amazon.com/Pattern-Classification-Pt-1-Richard-Duda/dp/0471056693/ref=sr_1_1?keywords=Duda+hart+stork&qid=1580700405&sr=8-1 – David G. Stork Feb 03 '20 at 03:27
2
  1. The constraint $x^T x \leq 1$ implies that the norm of $x$ is at most $1$. That is, $x^Tx = ||x||^2$.

  2. Yes, you are correct. And if the norm of $x$ is greater than $1$, the $x^T x-1$ term is positive. The max over $\lambda$ is achieved at $\lambda = \infty$ with infinite value. Therefore, when you take the minimum over $x$, any solution will certainly satisfy the constraint $x^Tx\leq 1$, because otherwise, the expression is $+\infty$ and cannot be a minimum. (This is written out compactly just after this list.)

  3. The "correct" norm is the norm that satisfies the constraint, that is, $x^Tx = ||x||^2 \leq 1$.
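
A compact way to write out point 2 (my own restatement): the inner maximization evaluates to

$$\max_{\lambda \ge 0} L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \max_{\lambda \ge 0} \lambda(\mathbf{x}^T\mathbf{x} - 1) = \begin{cases} f(\mathbf{x}) & \text{if } \mathbf{x}^T\mathbf{x} \le 1, \\ +\infty & \text{otherwise,} \end{cases}$$

so the outer minimization over $\mathbf{x}$ can only return a feasible point, and among feasible points it simply minimizes $f(\mathbf{x})$.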

stochastic
  • 2,560
  • You mean $x^T x \le 1$ for 1. ? – The Pointer Feb 01 '20 at 10:01
  • @ThePointer yes, thank you for the correction – stochastic Feb 01 '20 at 16:20
  • I don't understand point 2.: Is it not said that the norm of $x$ must be $\le 1$? – The Pointer Feb 01 '20 at 21:22
  • @ThePointer for $x$ to be a solution, it has to have a norm of at most 1. The point of the optimization algorithm is to reach a solution that satisfies this constraint. So the penalty term kicks in when this constraint is not satisfied, i.e. when the norm is larger than 1 – stochastic Feb 02 '20 at 19:50
  • Ok, after studying David's answer, I think I understand what you mean by your second point. $\min_{\mathbf{x}} \max_{\lambda, \lambda \ge 0} L(\mathbf{x}, \lambda) = f(\mathbf{x}) + \lambda (\mathbf{x}^T \mathbf{x} - 1)$ is only achieved when $\mathbf{x}^T \mathbf{x} = \vert\vert \mathbf{x} \vert\vert^2$ is as small as possible, since that will result in the term $\lambda (\mathbf{x}^T \mathbf{x} - 1)$ being negative (since $\mathbf{x}^T \mathbf{x} \le 1$, and so $\mathbf{x}^T \mathbf{x} - 1$ will be negative), right? And this will make $L(\mathbf{x}, \lambda)$ as small as possible? – The Pointer Feb 03 '20 at 04:45
  • @ThePointer Yes, almost. Let me make a small correction to what you said: the min-max expression is achieved only when $x^Tx = ||x||^2$ is less than or equal to $1$. Here is why: the maximum over $\lambda$ is taken first for all $x$ and then we find the $x$ that minimizes the expression. If $x^Tx \leq 1$, the penalty term is nonpositive, and the maximum over $\lambda\geq 0$ is achieved at $\lambda = 0$, which completely gets rid of the penalty term. So that penalty term is there only when $x^Tx>1$. – stochastic Feb 03 '20 at 14:45
  • @ThePointer When $x^Tx >1$, $(x^Tx-1)$ is positive and the max over $\lambda$ is infinite (with infinite $\lambda$). Therefore, when you take the minimum over $x$, any solution will certainly satisfy the constraint $x^Tx\leq 1$, because otherwise, the expression is $+\infty$ and cannot be a minimum. – stochastic Feb 03 '20 at 14:45