
I am writing a program minimizing a real-valued non-linear function of around 90 real variables subject to around 30 non-linear constraints. I found a handy explanation in CERN's Data Analysis BriefBook. I've implemented it and it works, but I am not able to work out how they obtained the equations at the bottom. Could anyone please explain how they can be derived?

Copied from the link:

Trying to minimize $f(x_1,...,x_n)$ subject to $c_1(x_1,...,x_n) = ... = c_m(x_1,...,x_n) = 0$.

Reformulate as $$\partial F/\partial x_1 = \dots = \partial F/\partial x_n = \partial F/\partial\lambda_1 = \dots = \partial F/\partial\lambda_m = 0$$ for $$ F(x_1,\dots,x_n,\lambda_1,\dots,\lambda_m) = f(x_1,\dots,x_n) - \lambda^T c(x_1,\dots,x_n) $$

Using Lagrange multipliers $\lambda$ and Newton-Raphson, they arrive at (1): $$ A\,\Delta x - B^T\lambda = -a, \qquad B\,\Delta x = -c, $$ where $A$ is the Hessian of $f$, $a = (\nabla f)^T$ is the gradient of $f$, and $B = \nabla c$ is the Jacobian of the constraints.

I can't seem to follow them. The way I understand it, they're applying Newton-Raphson to solve $\nabla F = 0$. I believe that $$ \nabla F = \begin{pmatrix} a - B^T \lambda \\ -c \end{pmatrix}. $$ Then we need to take the derivative of $\nabla F$ for Newton-Raphson. First off, it seems to me they're only taking the derivative with respect to $x$ and not $\lambda$. Why is that?
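For reference, the step I have in mind is Newton-Raphson for a general system $G(z) = 0$ with $z = (x,\lambda)$: $$ \nabla G(z_k)\,\Delta z = -G(z_k), \qquad z_{k+1} = z_k + \Delta z, $$ which with $G = \nabla F$ would mean differentiating $\nabla F$ with respect to all of $(x,\lambda)$, not just $x$.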

Even so, this would lead to (1) if the derivative of $\nabla F$ were $\begin{pmatrix} A \\ -B \end{pmatrix}$. However, while it's true that $\nabla a = A$ and $\nabla c = B$, I can't see where the term $-B^T\lambda$ has vanished to.

Many thanks to anyone who could shed some light on this.

2 Answers


I was also interested in a proof of the Newton-Raphson + Lagrange multiplier result from the CERN page. Here is my derivation:

(derivation posted as two images)
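In outline, one standard way to get from Newton-Raphson to (1), in the question's notation, is the following (a sketch of the argument; the two extra ingredients are reading the $\lambda$ in (1) as the updated multiplier, and deciding what to keep in the Hessian block).

Write $z = (x, \lambda)$ and apply Newton-Raphson to $G(z) := \nabla F(z) = 0$, i.e. solve $\nabla G(z)\,\Delta z = -G(z)$. Differentiating $$ G(z) = \begin{pmatrix} a - B^T\lambda \\ -c \end{pmatrix} $$ with respect to both $x$ and $\lambda$ gives $$ \nabla G(z) = \begin{pmatrix} A_L & -B^T \\ -B & 0 \end{pmatrix}, \qquad A_L := \nabla^2 f - \sum_{i=1}^m \lambda_i \nabla^2 c_i, $$ so the Newton system reads $$ A_L\,\Delta x - B^T\Delta\lambda = -a + B^T\lambda, \qquad -B\,\Delta x = c. $$ Moving $B^T\lambda$ to the left-hand side of the first equation and flipping signs in the second gives $$ A_L\,\Delta x - B^T(\lambda + \Delta\lambda) = -a, \qquad B\,\Delta x = -c, $$ which is (1) once you write $\lambda$ for the new multiplier $\lambda + \Delta\lambda$ and take $A = A_L$ (or drop the constraint-curvature terms $\sum_i \lambda_i \nabla^2 c_i$, so that $A \approx \nabla^2 f$ as stated on the CERN page). In particular, the $-B^T\lambda$ term does not vanish: it is absorbed into the updated multiplier.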


Check Sequential Quadratic Programming in any good optimization book. When there are only equality constraints, it really boils down to applying Newton's method to the system you gave. With $F(x,\lambda) := f(x) - \lambda^T c(x)$ (the Lagrangian), differentiating yields $$ \nabla F(x,\lambda) = \begin{bmatrix} \nabla f(x) - J(x)^T \lambda \\ - c(x) \end{bmatrix}, $$ where $$ J(x) = \begin{bmatrix} \nabla c_1(x)^T \\ \vdots \\ \nabla c_m(x)^T \end{bmatrix} $$ is the Jacobian of $c$ at $x$. Applying Newton's method to $\nabla F(x,\lambda) = 0$ requires that we differentiate one more time, which gives the symmetric system $$ \begin{bmatrix} H(x,\lambda) & J(x)^T \\ J(x) & 0 \end{bmatrix} \begin{bmatrix} \Delta x \\ -\Delta \lambda \end{bmatrix} = \begin{bmatrix} -\nabla f(x) + J(x)^T \lambda \\ -c(x) \end{bmatrix}, $$ where $H(x,\lambda) = \nabla^2 f(x) - \sum_{i=1}^m \lambda_i \nabla^2 c_i(x)$ is the Hessian of the Lagrangian with respect to $x$.

There are all sorts of difficulties that come in, ranging from the need to perform a line search on a merit function to avoiding the Maratos effect. A good book such as Numerical Optimization (Nocedal & Wright, Springer) will explain everything.
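To make the step concrete, here is a minimal numerical sketch using NumPy (the toy problem is made up for illustration and is not from the question or the CERN page): one Newton step on $\nabla F(x,\lambda) = 0$ for a quadratic objective with a single linear constraint.

```python
import numpy as np

# Toy problem (illustrative only):
#   minimize   f(x) = x0^2 + x1^2
#   subject to c(x) = x0 + x1 - 1 = 0
# Known solution: x* = (0.5, 0.5), lambda* = 1.

def grad_f(x):
    return 2.0 * x                        # gradient of f, shape (n,)

def hess_f(x):
    return 2.0 * np.eye(x.size)           # Hessian of f, shape (n, n)

def c(x):
    return np.array([x[0] + x[1] - 1.0])  # constraint values, shape (m,)

def jac_c(x):
    return np.array([[1.0, 1.0]])         # Jacobian J of c, shape (m, n)

def newton_lagrange_step(x, lam):
    """One Newton-Raphson step on grad F(x, lam) = 0, with F = f - lam^T c."""
    n, m = x.size, lam.size
    J = jac_c(x)
    # Hessian of the Lagrangian w.r.t. x.  The constraint here is linear,
    # so its curvature term lam_i * hess(c_i) is zero and H is just hess(f).
    H = hess_f(x)
    # Full Hessian of F w.r.t. (x, lam):  [[H, -J^T], [-J, 0]]
    K = np.block([[H, -J.T],
                  [-J, np.zeros((m, m))]])
    # Right-hand side is -grad F = [-(grad f - J^T lam), c]
    rhs = np.concatenate([-(grad_f(x) - J.T @ lam), c(x)])
    delta = np.linalg.solve(K, rhs)
    return x + delta[:n], lam + delta[n:]

x, lam = np.array([3.0, -2.0]), np.zeros(1)
x, lam = newton_lagrange_step(x, lam)
print(x, lam)   # [0.5 0.5] [1.] -- exact after one step for this problem
```

For genuinely nonlinear $f$ and $c$ you would iterate this step, include the constraint curvature in $H$, and, as noted above, add a line search on a merit function.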

Dominique