
Consider the problem

Maximize $f(\mathbf{x})$ subject to $g(\mathbf{x})=c$

Using the method of Lagrange multipliers, I would set up a Lagrangian like

$$L = f(\mathbf{x})-\lambda (g(\mathbf{x})-c)$$

I would then compute $\frac{\partial L}{\partial x_1}$, $\frac{\partial L}{\partial x_2}$, and so on, set these derivatives equal to zero, and use the resulting equations to solve the problem.

My Question:

What is $L$? Why does it have this special property that when I take these derivatives and solve, I can suddenly find the solution to the optimization problem? I just sort of compute without really understanding what I am doing. Why does this method work? I have heard that $\lambda$ is the scalar necessary to make $\nabla g(\mathbf{x})$ equal to $\nabla f(\mathbf{x})$, and that this has something to do with the normal vectors being parallel. But I don't really see how this helps me understand the role of $L$.

4 Answers


The "Lagrangian function" $L$ is a purely formal device without any intuitive content, but it condenses the complex geometric data into a simple recipe.

The "method" in question relies on the following fact: Given a function $f:{\mathbb R}^n\to{\mathbb R}$ and a surface ("condition") $$S:\quad g(x)=c\ ,$$ the function $f$ cannot assume a conditional local extremum at the point $p\in S$ unless $$\nabla f(p)=\lambda\>\nabla g(p)\tag{1}$$ for some $\lambda\in{\mathbb R}$. Therefore a conditionally extremal point $p$ will come to the fore when we go through the suggested motions.

The reason for the principle $(1)$ is the following: The vector $\nabla g(p)$, supposed to be $\ne0$, is orthogonal to the tangential (hyper)plane $T_pS$. If $\nabla f(p)$ is not parallel to $\nabla g (p)$ then there are vectors $u\in T_pS$ such that $\nabla f(p)\cdot u\ne0$. This means that there are "allowed" directions to move out of $p$ for which $f$ increases, as well as "allowed" directions for which $f$ decreases. In such a case $f$ cannot have a conditional local extremum at $p\in S$.
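To make the recipe concrete, here is a small SymPy check on a toy example of my own choosing (not part of the answer): maximize $f(x,y)=x+y$ subject to $g(x,y)=x^2+y^2=1$. Condition $(1)$ together with the constraint pins down exactly the two points where the gradients are parallel.

```python
# Toy check of condition (1) with SymPy (example chosen for illustration):
# maximize f(x, y) = x + y subject to g(x, y) = x^2 + y^2 = 1.
import sympy as sp

x, y, lam = sp.symbols('x y lambda', real=True)
f = x + y
g = x**2 + y**2

# grad f = lam * grad g, together with the constraint g = 1
eqs = [sp.Eq(sp.diff(f, v), lam * sp.diff(g, v)) for v in (x, y)]
eqs.append(sp.Eq(g, 1))

sols = sp.solve(eqs, (x, y, lam), dict=True)
for s in sols:
    print(s[x], s[y], s[lam])
```

SymPy finds the two candidates $(\pm\sqrt2/2,\pm\sqrt2/2)$; the conditional maximum is at $(\sqrt2/2,\sqrt2/2)$, and $\lambda$ is precisely the ratio making $\nabla f=\lambda\nabla g$ there.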

  • +1 for "The "Lagrangian function" is a purely formal device" – Vim Jul 09 '15 at 00:52
  • I'll post another answer which is generally based on the same ideas as yours, but more "mathematically" phrased. – Vim Jul 09 '15 at 00:54
  • While it can surely be looked at as "purely formal", there are intuitive, physical means through which it was derived and initially made sense of by Lagrange. – obataku Jul 09 '15 at 04:37
  • @oldrinb For the multi-constraint case (as discussed in my answer), this intuition needs to be modified a little, I think. – Vim Jul 09 '15 at 06:13

Let's start with the following question. Suppose $$ \Sigma:=\left\{x\in\Bbb R^m \mid f(x)=0\in\Bbb R^q\right\},\qquad f\in\mathscr C^1(\Bbb R^m,\Bbb R^q),\quad 1\le q<m, $$ and there is a scalar field, the so-called "goal function", $\theta(x)\in\Bbb R$ on $\Sigma$. What we are going to do is seek $x_{*}$ such that $$\theta (x_*)=\sup_{x\in\Sigma}\theta(x)\quad\text{or}\quad\inf_{x\in\Sigma}\theta(x).$$ Now we apply the Implicit Function Theorem. We split $x\in\Bbb R^m$ into two parts $(\tilde{x},\hat{x})\in\Bbb R^p\times\Bbb R^q$ where $p+q=m$, and hence $f(\tilde{x},\hat{x})=0\in\Bbb R^q$. According to the theorem, if for all $x=(\tilde{x},\hat{x})\in \Sigma$ we have ($D$ denotes the Jacobian matrix) \begin{equation} \det (D_{\hat{x}}f)(x)=\det\frac{\partial f}{\partial \hat{x}}\ne 0, \end{equation} then there exists a parameter domain $U_{\Sigma}\subset\Bbb R^p$ and an implicit function \begin{equation} \xi:U_{\Sigma}\ni\tilde{x}\mapsto\xi(\tilde{x})\in \Bbb R^q \end{equation} determined by the constraint $f(\tilde{x},\xi(\tilde{x}))=0\in\Bbb R^q$, or equivalently $(\tilde{x},\xi(\tilde{x}))\in\Sigma$.

Here comes the key part: we recognize $\Sigma$ as a "hypersurface" in $\Bbb R^{p+q}$ parametrized over $\Bbb R^p$, with the aid of the implicit function. Regarding $U_{\Sigma}$ as the "parameter domain" for the hypersurface $\Sigma$, we can immediately define the parametrization mapping $\sigma$ for $\Sigma$ as \begin{equation}\sigma:U_{\Sigma}\ni\tilde{x}\mapsto \sigma(\tilde{x}):=(\tilde{x},\xi(\tilde{x}))\in\Sigma\subset \Bbb R^m\end{equation} Hence we can rewrite the "goal function" $\theta(x)$ as \begin{equation}\Theta:U_{\Sigma}\ni\tilde{x}\mapsto \Theta(\tilde{x}):=\theta\circ\sigma(\tilde{x})\in\Bbb R\end{equation} The significant difference between the original form $\theta(x)$ and the rewritten form $\Theta(\tilde{x})$ is that the latter is defined directly on an open domain $U_{\Sigma}$, "freed" from any constraint. Therefore, to seek local extrema of $\Theta(\tilde{x})$, all we have to do is set \begin{equation} (D\Theta)(\tilde{x})=(D\theta\circ\sigma)(\tilde{x})=0\in\Bbb R^{1\times p} \end{equation}
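A minimal SymPy sketch of this parametrization step, on a toy instance of my own choosing ($m=2$, $q=1$, $f(x_1,x_2)=x_1^2+x_2^2-1$, goal $\theta=x_1+x_2$): on the branch $x_2>0$ the implicit function is $\xi(\tilde{x})=\sqrt{1-\tilde{x}^2}$, and $\Theta=\theta\circ\sigma$ becomes an unconstrained one-variable problem.

```python
# Parametrizing the constraint away (toy instance: f = x1^2 + x2^2 - 1,
# theta = x1 + x2; xi below is the implicit function on the branch x2 > 0).
import sympy as sp

x1 = sp.symbols('x1', real=True)
xi = sp.sqrt(1 - x1**2)        # xi(x1): solves f(x1, xi(x1)) = 0 for x2 > 0
Theta = x1 + xi                # Theta = theta(sigma(x1)), now unconstrained

# Local extrema of Theta: set D Theta = 0 on the open domain U = (-1, 1)
crit = sp.solve(sp.Eq(sp.diff(Theta, x1), 0), x1)
print(crit)
```

SymPy returns the single critical point $\tilde{x}=\sqrt2/2$, matching the constrained maximizer of $x_1+x_2$ on the unit circle.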

By the Chain Rule, we have $$ (D\theta\circ\sigma)(\tilde{x})=(D\theta)(\sigma(\tilde{x}))(D\sigma)(\tilde{x}) $$ Note that $\sigma(\tilde{x})=(\tilde{x},\xi(\tilde{x}))$, hence $$(D\theta)(\cdot)=\left[(D_{\tilde{x}}\theta)(\cdot),(D_{\hat{x}}\theta)(\cdot)\right]$$ and ($I_p$ denotes the $p\times p$ identity matrix) $$(D\sigma)(\tilde{x})=\begin{bmatrix} I_p \\ (D\xi)(\tilde{x}) \end{bmatrix}$$ so we have \begin{equation} (D\theta\circ\sigma)(\tilde{x})=(D_{\tilde{x}}\theta)(x)+(D_{\hat{x}}\theta)(x)(D\xi)(\tilde{x})=0\in\Bbb R^{1\times p} \end{equation} Again, aided by the Implicit Function Theorem, we have \begin{equation} (D\xi)(\tilde{x})=-(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x) \end{equation} Plugging it into the previous equation, we obtain \begin{equation}(D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}\end{equation} Together with the constraint $f(x)=0\in\Bbb R^q$, we have $$ \left\{ \begin{array}{l} (D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}\\ f(x)=0\in\Bbb R^q \end{array} \right. $$ Provided that $\Sigma$ is compact, these $m$ equations determine all the possible $x_*$'s that are not located on $\partial\Sigma$; this, in my opinion, is the intrinsic form of the so-called Lagrange multiplier method.

To see how the common Lagrange function "coincides" with this form, let $$L:\Bbb R^m\times\Bbb R^q\ni (x,\lambda)\mapsto L(x,\lambda):=\theta(x)+\lambda^Tf(x)\in\Bbb R$$ Differentiating $L$, we get \begin{align*} (DL)(x,\lambda)&=(DL)(\tilde{x},\hat{x},\lambda)=\left[(D_{\tilde{x}}L),(D_{\hat{x}}L),(D_{\lambda}L)\right](x,\lambda)\\ &=\left[(D_{\tilde{x}}\theta)(x)+\lambda^T(D_{\tilde{x}}f)(x),(D_{\hat{x}}\theta)(x)+\lambda^T(D_{\hat{x}}f)(x),(f(x))^T\right]\\ &=\left[0\in\Bbb R^{1\times p},0\in\Bbb R^{1\times q},0\in\Bbb R^{1\times q}\right] \end{align*} from which it follows that $$(D_{\hat{x}}\theta)(x)+\lambda^T(D_{\hat{x}}f)(x)=0\in\Bbb R^{1\times q}\implies \lambda^T=-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)\in\Bbb R^{1\times q}$$ Plugging this into $$(D_{\tilde{x}}\theta)(x)+\lambda^T(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}$$ we obtain $$(D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}$$ Together with $(f(x))^T=0\in\Bbb R^{1\times q}$, we have returned to the $m$ equations of the intrinsic form.
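Continuing the same toy instance as before ($\theta=x_1+x_2$, $f=x_1^2+x_2^2-1$, so $q=1$ and $\lambda$ is a scalar), SymPy can carry out this elimination of $\lambda^T$ and confirm that the Lagrange system collapses to the intrinsic system; the instance is mine, chosen only for illustration.

```python
# Eliminating lambda from the Lagrange system (L = theta + lambda^T f) and
# recovering the intrinsic equations, on the toy instance theta = x1 + x2,
# f = x1^2 + x2^2 - 1 (q = 1, so lambda is a scalar here).
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lambda', real=True)
theta = x1 + x2
f = x1**2 + x2**2 - 1
L = theta + lam * f

# lambda^T = -(D_x2 theta)(D_x2 f)^(-1), from the x2-block of DL = 0
lam_val = sp.solve(sp.Eq(sp.diff(L, x2), 0), lam)[0]

# Substitute into the x1-block: this is the intrinsic equation
intrinsic = sp.diff(L, x1).subs(lam, lam_val)

# Solve it together with the constraint f = 0
sols = sp.solve([sp.Eq(intrinsic, 0), sp.Eq(f, 0)], (x1, x2), dict=True)
print(sols)
```

The resulting candidates are $(\pm\sqrt2/2,\pm\sqrt2/2)$, the same points the unconstrained problem for $\Theta$ produces branch by branch.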

Vim

I will explain it in 2D. I believe a similar argument works in 3D.

Let $f(x,y)$ be the objective function, under the constraint $g(x,y)=k$. Suppose the level curve $g(x,y)=k$ is smooth and closed, and can be parametrized by $(x(t),y(t))$.

We are then trying to find the maximum or minimum of $f(x(t),y(t))$. The critical points satisfy $$f_x x'(t)+f_y y'(t)=0$$

which is equivalent to $$\nabla f\cdot(x'(t),y'(t))=0$$

This implies that $\nabla f$ is perpendicular to the tangent direction of the curve defined by $g(x,y)=k$, or, $\nabla f$ is parallel to the normal direction of the curve, i.e.,

$\nabla f=\lambda \nabla g$

for some scalar $\lambda$.
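As a concrete illustration (the circle and objective are my choice, not the answer's), the 2D argument can be checked directly with SymPy: parametrize $g(x,y)=x^2+y^2=1$ by $(\cos t,\sin t)$, find where $\frac{d}{dt}f(x(t),y(t))=0$, and verify that $\nabla f$ is parallel to $\nabla g$ there.

```python
# Checking the 2-D argument on the unit circle with f(x, y) = x + y.
import sympy as sp

t = sp.symbols('t', real=True)
x_t, y_t = sp.cos(t), sp.sin(t)          # parametrization of g = x^2 + y^2 = 1
f_on_curve = x_t + y_t                   # f restricted to the curve

# Critical points: f_x x'(t) + f_y y'(t) = 0, i.e. d/dt f(x(t), y(t)) = 0
crit = sp.solve(sp.Eq(sp.diff(f_on_curve, t), 0), t)

# At each critical point, grad f = (1, 1) and grad g = (2x, 2y) are parallel:
# their 2-D cross product 1*(2y) - 1*(2x) must vanish.
for t0 in crit:
    cross = 1 * (2 * y_t.subs(t, t0)) - 1 * (2 * x_t.subs(t, t0))
    assert sp.simplify(cross) == 0
print(crit)
```

The vanishing cross product is exactly the parallelism $\nabla f=\lambda\nabla g$ in two dimensions.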

KittyL

Let's assume that you have two independent variables $x_1$ and $x_2$. To satisfy the constraint equation $g(x_1,x_2)=c$, the change in $g(x_1,x_2)$ must be zero:

$$dg(x_1,x_2)=\frac{\partial g}{\partial x_1}dx_1+\frac{\partial g}{\partial x_2}dx_2=0$$ $$\longrightarrow dx_2=-\frac{\frac{\partial g}{\partial x_1}}{\frac{\partial g}{\partial x_2}}dx_1 $$

To satisfy the constraint, the differentials must obey the relation above. To maximize $f(x_1,x_2)$, we also need

$$df(x_1,x_2)=\frac{\partial f}{\partial x_1}dx_1+\frac{\partial f}{\partial x_2}dx_2=0$$ $$df(x_1,x_2)=\frac{\partial f}{\partial x_1}dx_1-\frac{\partial f}{\partial x_2}\frac{\frac{\partial g}{\partial x_1}}{\frac{\partial g}{\partial x_2}}dx_1=0$$

$$df(x_1,x_2)=\bigg(\frac{\partial f}{\partial x_1}-\frac{\partial f}{\partial x_2}\frac{\frac{\partial g}{\partial x_1}}{\frac{\partial g}{\partial x_2}}\bigg)dx_1=0$$

Combining these leaves you with two equations:

$$\frac{\partial f}{\partial x_1}-\frac{\partial f}{\partial x_2}\frac{\frac{\partial g}{\partial x_1}}{\frac{\partial g}{\partial x_2}}=0$$ $$g(x_1,x_2)=c$$

Now let's see what happens when you use the Lagrange method.

$$\frac{\partial L}{\partial x_1}=\frac{\partial f}{\partial x_1}-\lambda \frac{\partial g}{\partial x_1}=0$$ $$\frac{\partial L}{\partial x_2}=\frac{\partial f}{\partial x_2}-\lambda \frac{\partial g}{\partial x_2}=0$$ $$g(x_1,x_2)=c$$

To eliminate $\lambda$ you can use $$\frac{\partial L}{\partial x_2}=\frac{\partial f}{\partial x_2}-\lambda \frac{\partial g}{\partial x_2}=0 \Rightarrow \lambda=\frac{\frac{\partial f}{\partial x_2}}{\frac{\partial g}{\partial x_2}}$$

and the system reduces to

$$\frac{\partial L}{\partial x_1}=\frac{\partial f}{\partial x_1}-\frac{\partial g}{\partial x_1}\frac{\frac{\partial f}{\partial x_2}}{\frac{\partial g}{\partial x_2}}=0$$

$$g(x_1,x_2)=c $$

As you can see, both results are identical.
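The identity of the two routes can also be checked symbolically for arbitrary $f$ and $g$; here is a sketch with SymPy (the generic-function setup is mine, added for illustration):

```python
# Generic check: eliminating lambda from the Lagrange equations for
# L = f - lambda*(g - c) reproduces the direct two-equation system.
import sympy as sp

x1, x2, lam, c = sp.symbols('x1 x2 lambda c')
f = sp.Function('f')(x1, x2)
g = sp.Function('g')(x1, x2)
L = f - lam * (g - c)

eq1 = sp.diff(L, x1)                      # f_x1 - lam * g_x1
eq2 = sp.diff(L, x2)                      # f_x2 - lam * g_x2

# lambda from the second equation, substituted into the first
lam_val = sp.solve(sp.Eq(eq2, 0), lam)[0]
reduced = eq1.subs(lam, lam_val)

# The direct system's first equation, obtained without any multiplier
direct = sp.diff(f, x1) - sp.diff(f, x2) * sp.diff(g, x1) / sp.diff(g, x2)

assert sp.simplify(reduced - direct) == 0
print("both routes give the same equation")
```

The assertion holds for undefined functions $f$ and $g$, so the agreement is an algebraic identity, not an accident of a particular example.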

AnilB