
My question is similar to this question.

For least squares, if $E=e^2 = \sum_{i=1}^m(a_0 + a_1 x_{1}^i+a_2x_2^i+\dots+a_nx_n^i-y_i)^2$ ($n$ features and $m$ observations), and I set up the system of equations $\partial E/\partial a_j = 0$ for $j = 0, \dots, n$, is there a way to show algebraically that this system of equations is the same as the one found from the normal equations $A^TAa = A^Ty$, where $a = [a_0 \dots a_n]$ and $y = [y_1 \dots y_m]$ (both $a$ and $y$ being column vectors)? I understand that both methods arrive at the same set of equations, but I have trouble showing that they do in the non-numeric case.

Yandle

1 Answer


$ \newcommand{\argmin}{\mathop{\rm arg\,min}} \newcommand{\normt}[1]{\left\lVert #1 \right\rVert_{2}} \newcommand{\normts}[1]{\left\lVert #1 \right\rVert_{2}^{2}} \newcommand{\paren}[1]{ \left( #1 \right) } \newcommand{\at}{ \mathbf{A}^{T} } $

Problem Statement

Polynomial Approximation

Given a sequence of $m$ measurements $\left(x_{\mu},y_{\mu}\right)^{m}_{\mu=1}$, with distinct $x$ values, find the best polynomial approximation of degree $n-1$, $$ y(x) = c_{0} + \sum_{\nu=1}^{n-1}c_{\nu}x^{\nu},$$ using the $2-$norm, $l^{2}$. (Writing the constant term $c_{0}$ separately sidesteps the ambiguity of evaluating $x^{0}$ at the origin.)

Pose Linear System

The problem is described by the linear system \begin{equation} \begin{array}{ccccc} \mathbf{A} &c &= &y \\ \begin{bmatrix} 1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{n-1}\\ \vdots & \vdots & \vdots && \vdots \\ 1 & x_{m} & x_{m}^{2} & \cdots & x_{m}^{n-1} \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} y_{1} \\ \vdots \\ y_{m} \end{bmatrix} \label{z}\tag{0} \end{array} \end{equation} where the system matrix $\mathbf{A}\in\mathbb{R}^{m\times n}_{n}$ has full column rank $n$ (the $x$ values are distinct), the data vector is $y\in\mathbb{R}^{m}$, and the solution vector of amplitudes is $c\in\mathbb{R}^{n}$.

For the typical case of $m>n$, the problem is overdetermined. Because the rank is $\rho = n$ (full column rank), the least squares solution is unique.
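To make the setup concrete, here is a minimal NumPy sketch (my own illustration with made-up sample data, not part of the original answer) that builds the system matrix $\mathbf{A}$ of equation $\ref{z}$ and confirms it has full column rank when the $x$ values are distinct:

```python
# A minimal sketch (not from the original post): build the Vandermonde-style
# system matrix A of equation (0) for synthetic data.
import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 4                                   # m measurements, n amplitudes c_0..c_{n-1}
x = np.sort(rng.uniform(-1.0, 1.0, size=m))    # distinct abscissas
y = 1.0 - 2.0*x + 0.5*x**2 + 0.1*rng.standard_normal(m)   # noisy measurements

# Column nu of A holds x**nu for nu = 0..n-1, matching equation (0).
A = np.vander(x, N=n, increasing=True)

print(A.shape)                       # (20, 4): overdetermined, m > n
print(np.linalg.matrix_rank(A))      # 4: full column rank for distinct x
```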

Least Squares Solution

That is, find the amplitudes $c_{LS}$ which minimize the total error in the linear system $\mathbf{A}c=y$: $$c_{LS}=\underset{c\in\mathbb{R}^{n}}{\argmin} \normts{\mathbf{A}c-y}$$

Merit Function

For expedience, cast the total error as a merit function: $$M(c) = \normts{\mathbf{A}c-y}.$$
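A tiny helper (mine, not part of the answer) writes the merit function out in code; the numerical checks below recompute it inline, but the definition is just the squared residual norm:

```python
# Merit function M(c) = ||Ac - y||_2^2, written out in NumPy (my own helper).
import numpy as np

def merit(A: np.ndarray, c: np.ndarray, y: np.ndarray) -> float:
    """Total squared error of the linear model A @ c against data y."""
    r = A @ c - y          # residual vector
    return float(r @ r)    # squared 2-norm of the residual
```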

Solution via Normal Equations

For the full column rank problem, the normal equations offer a prompt route to a solution. The outline: the original linear system is inconsistent because the data vector $y$ cannot be composed from the column vectors of the matrix $\mathbf{A}$. One remedy is to premultiply both sides of the linear system by the transpose matrix $\at$. The resulting system must be consistent, as the vector $\at y$ is in the column space of $\at$ by construction.

The resulting square linear system, the normal equations, is then \begin{equation} \begin{array}{ccccc} \mathbf{A}^{T}\mathbf{A} &c &= &\mathbf{A}^{T}y \\ \begin{bmatrix} m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\ \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\ \vdots & \vdots & \vdots && \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\ \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} \sum_{\mu=1}^{m}y_{\mu} \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu} \end{bmatrix} \label{a}\tag{1} \end{array} \end{equation}
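As a quick numerical cross-check (my own, with synthetic data), the Gram matrix $\mathbf{A}^{T}\mathbf{A}$ really is the matrix of power sums in $\ref{a}$, and $\mathbf{A}^{T}y$ is the vector of moment sums:

```python
# Check (my own illustration): A^T A equals the matrix of power sums
# sum_mu x_mu^(i+j), and A^T y equals the moment sums sum_mu x_mu^i y_mu.
import numpy as np

rng = np.random.default_rng(1)
m, n = 20, 4
x = rng.uniform(-1.0, 1.0, size=m)
y = rng.standard_normal(m)
A = np.vander(x, N=n, increasing=True)          # columns x^0, ..., x^{n-1}

power_sums = np.array([[np.sum(x**(i + j)) for j in range(n)] for i in range(n)])
moment_sums = np.array([np.sum(x**i * y) for i in range(n)])

assert np.allclose(A.T @ A, power_sums)          # left-hand side of (1)
assert np.allclose(A.T @ y, moment_sums)         # right-hand side of (1)
```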

The least squares pseudoinverse is a left inverse: \begin{equation} \mathbf{A}^{\dagger} = \mathbf{A}^{-LS} =\paren{\mathbf{A}^{T}\mathbf{A}}^{-1}\mathbf{A}^{T} \end{equation} The particular solution is \begin{equation} c_{LS} =\mathbf{A}^{\dagger}y = \paren{\mathbf{A}^{T}\mathbf{A}}^{-1}\mathbf{A}^{T}y \end{equation}
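Numerically (again my own sketch, with a synthetic setup), solving the normal equations, applying the pseudoinverse, and calling NumPy's least-squares routine all return the same amplitudes:

```python
# Check (my own): the normal-equation solution, the pseudoinverse solution,
# and numpy.linalg.lstsq agree on c_LS.
import numpy as np

rng = np.random.default_rng(2)
m, n = 50, 4
x = rng.uniform(-1.0, 1.0, size=m)
y = 2.0 + x - 3.0*x**2 + 0.05*rng.standard_normal(m)
A = np.vander(x, N=n, increasing=True)

c_normal = np.linalg.solve(A.T @ A, A.T @ y)      # (A^T A)^{-1} A^T y
c_pinv   = np.linalg.pinv(A) @ y                  # A^dagger y
c_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)   # orthogonal-factorization route

assert np.allclose(c_normal, c_pinv)
assert np.allclose(c_normal, c_lstsq)
```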

Solution via Calculus

A prime motivator for using the $2-$norm is that it is continuous and permits differentiation, which opens the toolkit of calculus. We find the extrema of the merit function by finding the simultaneous roots of the first derivatives. (Because the least squares problem is convex, the extremum is a minimum.)

Evaluate First Derivatives

Using the shorthand notation $$\partial_{\nu} M(c) = \frac{\partial M(c)}{\partial c_{\nu}},$$ the derivatives become \begin{equation} \begin{array}{rclclc} \partial_{0} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}} &= 0, \\ \partial_{1} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu} &= 0, \\ \partial_{2} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu}^{2} &= 0, \\ & \vdots\\ \partial_{n-1} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu}^{n-1} &= 0 \end{array} \end{equation} The requirement for simultaneous satisfaction implies the construction of a linear system.
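Before rearranging, it is easy to check numerically (my own sanity check, with synthetic data and an arbitrary trial vector $c$) that these scalar derivatives are exactly the components of $2\paren{\mathbf{A}^{T}\mathbf{A}c - \mathbf{A}^{T}y}$:

```python
# Check (my own): the scalar derivatives -2 * sum_mu (y_mu - y(x_mu)) * x_mu^k
# equal the components of the vector 2 * (A^T A c - A^T y) at any c.
import numpy as np

rng = np.random.default_rng(3)
m, n = 30, 4
x = rng.uniform(-1.0, 1.0, size=m)
y = rng.standard_normal(m)
A = np.vander(x, N=n, increasing=True)
c = rng.standard_normal(n)                 # arbitrary trial amplitudes

y_fit = A @ c                              # y(x_mu) = sum_nu c_nu x_mu^nu
scalar_grad = np.array([-2.0*np.sum((y - y_fit) * x**k) for k in range(n)])
vector_grad = 2.0*(A.T @ A @ c - A.T @ y)

assert np.allclose(scalar_grad, vector_grad)
```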

Pose Linear System

Rearrangement of the equations leads to the symmetric linear system: \begin{equation} \begin{array}{ccccc} \mathbf{A}^{T}\mathbf{A} &c &= &\mathbf{A}^{T}y \\ \begin{bmatrix} m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\ \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\ \vdots & \vdots & \vdots && \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\ \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} \sum_{\mu=1}^{m}y_{\mu} \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu} \end{bmatrix} \label{b}\tag{2} \end{array} \end{equation} This system is the same as the one found in $\ref{a}$.


Addendum: Derivation of the Normal Equations

An astute question by the OP reveals the answer needs more work. To paraphrase: "You pulled the normal equations from thin air and showed that they matched the minimization solution provided by calculus. But showing the solutions match is not as strong as showing why they must match."

Expand the Total Error

Expand the total error encoded in the merit function: $$M(c) = \normts{\mathbf{A}c-y} = \paren{\mathbf{A}c-y}^{T}\paren{\mathbf{A}c-y} = c^{T}\mathbf{A}^{T}\mathbf{A}c - c^{T}\mathbf{A}^{T}y - y^{T}\mathbf{A}c + y^{T}y$$

State the dimensions: $$ \mathbf{A}c: \paren{m\times n}\paren{n\times 1}=\paren{m\times 1}. $$ Therefore, $$ \paren{\mathbf{A}c}^{T}y: \paren{1\times m}\paren{m\times 1}=\paren{1\times 1} $$ is a scalar. A scalar equals its own transpose, so $y^{T}\mathbf{A}c = \paren{y^{T}\mathbf{A}c}^{T} = c^{T}\mathbf{A}^{T}y$, and the two cross terms combine: $$ - c^{T}\mathbf{A}^{T}y - y^{T}\mathbf{A}c = -2 c^{T}\mathbf{A}^{T}y. $$
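A quick numerical confirmation (my own, with random $\mathbf{A}$, $y$, and $c$) that the expansion holds:

```python
# Check (my own): ||Ac - y||^2 equals c^T A^T A c - 2 c^T A^T y + y^T y.
import numpy as np

rng = np.random.default_rng(4)
m, n = 25, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
c = rng.standard_normal(n)

merit_direct   = np.sum((A @ c - y)**2)
merit_expanded = c @ A.T @ A @ c - 2.0*(c @ A.T @ y) + y @ y

assert np.isclose(merit_direct, merit_expanded)
```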

Differentiate the Total Error

Again, the $2-$norm supports differentiation, which is done this time in vector form. The basic rule is $$ \frac{\partial c}{\partial c_{k}} = \frac{\partial}{\partial c_{k}} \begin{bmatrix} c_{1} \\ \vdots \\ c_{k} \\ \vdots \\ c_{n} \end{bmatrix} = \begin{bmatrix} 0 \\ \vdots \\ 1 \\ \vdots \\ 0 \end{bmatrix} = e_{k,n} $$ The unit vector $e_{k,n}$ has length $n$; all of its entries are $0$ except the $k$th, which is unity.

With this single rule, the general term becomes $$ \partial_{k} M(c) = e_{k,n}^{T} \mathbf{A}^{T}\mathbf{A}c + c^{T}\mathbf{A}^{T}\mathbf{A} e_{k,n} - 2e_{k,n}^{T} \mathbf{A}^{T}y $$ Because each term is a scalar and $\mathbf{A}^{T}\mathbf{A}$ is symmetric, $c^{T}\mathbf{A}^{T}\mathbf{A}e_{k,n} = e_{k,n}^{T}\mathbf{A}^{T}\mathbf{A}c$; collecting terms, $$ \partial_{k} M(c) = 2 e_{k,n}^{T} \mathbf{A}^{T}\mathbf{A}c - 2e_{k,n}^{T} \mathbf{A}^{T}y $$ Remember, this represents $n$ terms $\paren{k=1..n}$. Setting each to zero, the sequence looks like this \begin{equation} \begin{array}{lccccc} k=1 &\begin{bmatrix} m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\ 0 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots && \vdots \\ 0 & 0 & 0 & \cdots & 0\\ \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} \sum_{\mu=1}^{m}y_{\mu} \\ 0 \\ \vdots \\ 0 \end{bmatrix} \\ %% k=2 &\begin{bmatrix} 0 & 0 & 0 & \cdots & 0\\ \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\ \vdots & \vdots & \vdots && \vdots \\ 0 & 0 & 0 & \cdots & 0\\ \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} 0 \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ 0 \end{bmatrix} \\ %% &&&\vdots \\ %% k=n &\begin{bmatrix} 0 & 0 & 0 & \cdots & 0\\ 0 & 0 & 0 & \cdots & 0\\ \vdots & \vdots & \vdots && \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\ \end{bmatrix} & \begin{bmatrix} c_{0} \\ \vdots \\ c_{n-1} \end{bmatrix} &= & \begin{bmatrix} 0 \\ 0 \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu} \end{bmatrix} \end{array} \end{equation}

These equations are linear, and the requirement for simultaneous solution suggests they be combined into the form in $\ref{b}$, the normal equations.
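As a final check (my own, not part of the original answer), the analytic gradient $2\paren{\mathbf{A}^{T}\mathbf{A}c - \mathbf{A}^{T}y}$ agrees with a central finite-difference approximation of $\partial_{k}M(c)$ built from the unit vectors $e_{k,n}$:

```python
# Check (my own): the analytic gradient 2 (A^T A c - A^T y) matches a
# central-difference approximation of M(c) along each unit vector e_{k,n}.
import numpy as np

rng = np.random.default_rng(5)
m, n = 40, 4
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)
c = rng.standard_normal(n)

def merit(c):
    return np.sum((A @ c - y)**2)

analytic = 2.0*(A.T @ A @ c - A.T @ y)
h = 1e-6
numeric = np.array([(merit(c + h*e) - merit(c - h*e)) / (2.0*h) for e in np.eye(n)])

assert np.allclose(analytic, numeric, atol=1e-4)
```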

Different Paths, Same Destination

Stepping back, the two paths to the normal equations aren't really two paths. Both are minimization by finding the roots of the first derivative. The first method used scalar differentiation, the second used vector differentiation. You could summarize by saying "differentiation of the total error (and subsequent root extraction) produces the same result, as expected, since differentiation is a linear operator." With scalar differentiation we tackle the terms individually; with vector differentiation we tackle them all at once.

Excursions

Equivalence of least squares solutions and normal equations solutions: Deriving the Normal Equation used to solve Least Squares Problems

Least squares and uniqueness: Unique least square solutions

dantopa
  • Intuitively, why does it make sense that the normal equation and calculus method produce the exact same set of equations? Is it because both methods solve for the $c$ that minimizes the same error function? – Yandle Jul 15 '19 at 22:14
  • Thanks for the addendum! I'm wondering if looking at it this way is also correct: the $c$ that minimizes $M(c)$ is the vector such that $Ac-y$ is in the null space of $A^T$ (as to minimize $M(c)$, $Ac-y$ must be orthogonal to the column space of $A$), hence to solve for $c$ we can write $A^T(Ac-y)=0 \to A^TAc=A^Ty$? – Yandle Jul 16 '19 at 23:02
  • @Yandle: Yes, this is an elegant line of thought. The vector $\mathbf{A}c-y$ is the residual error vector, the projection of the data vector onto the orthogonal complement of the column space of $\mathbf{A}$ (the null space of $\mathbf{A}^{T}$). – dantopa Jul 17 '19 at 00:32