$
\newcommand{\argmin}{\mathop{\rm arg\,min}}
\newcommand{\normt}[1]{\left\lVert #1 \right\rVert_{2}}
\newcommand{\normts}[1]{\left\lVert #1 \right\rVert_{2}^{2}}
\newcommand{\paren}[1]{ \left( #1 \right) }
\newcommand{\at}{ \mathbf{A}^{T} }
$
Problem Statement
Polynomial Approximation
Given a sequence of $m$ measurements $\left(x_{\mu},y_{\mu}\right)^{m}_{\mu=1}$, with distinct $x$ values, find the best polynomial approximation,
$$ y(x) = c_{0} + \sum_{\nu=1}^{n-1}c_{\nu}x^{\nu},$$
using the $2$-norm, $\ell^{2}$. (Writing the constant term $c_{0}$ outside the sum avoids the problem of evaluating $x^{0}$ at the origin.)
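As a concrete aside, here is a minimal Python sketch of the model, with fabricated coefficients; the helper `model` and the numeric values are illustrative choices, not part of the problem statement.

```python
import numpy as np

# Hypothetical coefficients for illustration: n = 3, so y(x) = c0 + c1*x + c2*x^2.
c = np.array([1.0, -2.0, 0.5])

def model(x, c):
    """Evaluate y(x) = c_0 + c_1 x + ... + c_{n-1} x^(n-1)."""
    powers = np.arange(len(c))                       # exponents 0, 1, ..., n-1
    return np.sum(c * np.asarray(x, dtype=float)[..., None] ** powers, axis=-1)

print(model([0.0, 1.0, 2.0], c))                     # [ 1.  -0.5 -1. ]
```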
Pose Linear System
The problem is described by the linear system
\begin{equation}
\begin{array}{ccccc}
\mathbf{A} &c &= &y \\
\begin{bmatrix}
1 & x_{1} & x_{1}^{2} & \cdots & x_{1}^{n-1}\\
\vdots & \vdots & \vdots && \vdots \\
1 & x_{m} & x_{m}^{2} & \cdots & x_{m}^{n-1}
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
y_{1} \\ \vdots \\ y_{m}
\end{bmatrix}
\label{z}\tag{0}
\end{array}
\end{equation}
where the system matrix $\mathbf{A}\in\mathbb{R}^{m\times n}_{\rho}$ has rank $\rho$, the data vector is $y\in\mathbb{R}^{m}$, and the solution vector of amplitudes is $c\in\mathbb{R}^{n}$.
For the typical case of $m>n$, the problem is overdetermined. Because the $x$ values are distinct, the matrix has full column rank, $\rho = n$, and the least squares solution is unique.
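Here is a minimal numerical sketch of this setup, assuming numpy and fabricated example data; the values of `x`, `y`, and `n` are arbitrary illustrative choices.

```python
import numpy as np

# Fabricated example: m = 7 measurements, n = 3 polynomial coefficients.
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 7)                  # distinct abscissae
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.01 * rng.standard_normal(x.size)

n = 3
A = np.vander(x, N=n, increasing=True)         # columns: 1, x, x^2, ..., x^(n-1)
print(A.shape)                                 # (7, 3): m x n, overdetermined
print(np.linalg.matrix_rank(A))                # 3: full column rank (distinct x)
```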
Least Squares Solution
That is, find the amplitudes $c_{LS}$ which minimize the total error in the linear system $\mathbf{A}c=y$:
$$c_{LS}=\underset{c\in\mathbb{R}^{n}}{\argmin} \normts{\mathbf{A}c-y}$$
Merit Function
For expedience, cast the total error as a merit function:
$$M(c) = \normts{\mathbf{A}c-y}.$$
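In code the merit function is just the squared residual norm; this continues the sketch above, and the helper name `merit` is an arbitrary choice.

```python
def merit(A, c, y):
    """Total squared error M(c) = ||A c - y||_2^2 (continues the sketch above)."""
    r = A @ c - y
    return r @ r
```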
Solution via Normal Equations
For the full column rank problem, the normal equations offer a prompt route to solution. The original linear system is, in general, inconsistent: the data vector $y$ cannot be composed from the column vectors of the matrix $\mathbf{A}$. One remedy is to premultiply both sides of the linear system by the transpose matrix $\at$. The resulting system must be consistent, as the vector $\at y$ lies in the column space of $\at$, which is also the column space of $\at\mathbf{A}$, by construction.
The resulting linear system, the normal equations, is then
\begin{equation}
\begin{array}{ccccc}
\mathbf{A}^{T}\mathbf{A} &c &= &\mathbf{A}^{T}y \\
\begin{bmatrix}
m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\
\sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\
\vdots & \vdots & \vdots && \vdots \\
\sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
\sum_{\mu=1}^{m}y_{\mu} \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu}
\end{bmatrix}
\label{a}\tag{1}
\end{array}
\end{equation}
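The entries of $\at\mathbf{A}$ and $\at y$ are power sums of the data; a quick numerical check, continuing the earlier sketch (variable names assumed from those snippets):

```python
# Entry (i, j) of A^T A is the power sum  sum_mu x_mu^(i + j),  for i, j = 0..n-1.
i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
gram_from_sums = np.sum(x[None, None, :] ** (i + j)[..., None], axis=-1)
print(np.allclose(gram_from_sums, A.T @ A))    # True

# Entry k of A^T y is  sum_mu x_mu^k * y_mu.
rhs_from_sums = np.array([np.sum(x**k * y) for k in range(n)])
print(np.allclose(rhs_from_sums, A.T @ y))     # True
```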
The least squares pseudoinverse is a left inverse:
\begin{equation}
\mathbf{A}^{\dagger} = \mathbf{A}^{-LS} =\paren{\mathbf{A}^{T}\mathbf{A}}^{-1}\mathbf{A}^{T}
\end{equation}
The particular solution is
\begin{equation}
c_{LS} =\mathbf{A}^{\dagger}y = \paren{\mathbf{A}^{T}\mathbf{A}}^{-1}\mathbf{A}^{T}y
\end{equation}
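Continuing the sketch, the normal equations route, the left pseudoinverse, and a library least squares solver should all agree on this full column rank problem (in finite precision one usually prefers the library solver, since forming $\at\mathbf{A}$ squares the condition number):

```python
# Solve the normal equations directly (A^T A is invertible at full column rank).
c_normal = np.linalg.solve(A.T @ A, A.T @ y)

# Equivalent route through the left pseudoinverse (A^T A)^{-1} A^T.
c_pinv = np.linalg.pinv(A) @ y

# Reference solution from numpy's least squares driver.
c_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.allclose(c_normal, c_pinv), np.allclose(c_normal, c_lstsq))  # True True
```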
Solution via Calculus
A prime motivator for using the $2$-norm is that it is continuous and differentiable, which opens the toolkit of the calculus. We find the extrema of the merit function by finding the simultaneous roots of the first derivatives. (Because the merit function is convex, any such extremum is a minimum.)
Evaluate First Derivatives
Using the shorthand notation
$$\partial_{\nu} M(c) = \frac{\partial M(c)}{\partial c_{\nu}},$$
the derivatives become
\begin{equation}
\begin{array}{rclclc}
\partial_{0} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}} &= 0, \\
\partial_{1} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu} &= 0, \\
\partial_{2} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu}^{2} &= 0, \\
& \vdots\\
\partial_{n-1} M(c) &= &-2\displaystyle\sum^{m}_{\mu=1}\paren{y_{\mu}-y\paren{x_{\mu}}}x_{\mu}^{n-1} &= 0
\end{array}
\end{equation}
The requirement for simultaneous satisfaction implies the construction of a linear system.
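Continuing the numerical sketch, one can confirm that all $n$ partial derivative conditions vanish at the least squares solution (this reuses `A`, `x`, `y`, `n`, and `c_lstsq` from the earlier snippets):

```python
# Each condition: d/dc_nu M(c) = -2 * sum_mu (y_mu - y(x_mu)) * x_mu^nu = 0.
residual = y - A @ c_lstsq                     # y_mu - y(x_mu) at the LS solution
partials = np.array([-2.0 * np.sum(residual * x**nu) for nu in range(n)])
print(np.allclose(partials, 0.0))              # True: all n conditions are satisfied
```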
Pose Linear System
Rearrangement of the equations leads to the symmetric linear system:
\begin{equation}
\begin{array}{ccccc}
\mathbf{A}^{T}\mathbf{A} &c &= &\mathbf{A}^{T}y \\
\begin{bmatrix}
m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\
\sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\
\vdots & \vdots & \vdots && \vdots \\
\sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
\sum_{\mu=1}^{m}y_{\mu} \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu}
\end{bmatrix}
\label{b}\tag{2}
\end{array}
\end{equation}
This system is the same as found in $\ref{a}$.
Addendum: Derivation of the Normal Equations
An astute question by the OP reveals the answer needs more work. To paraphrase: "You pulled the normal equations from thin air and showed that they matched the minimization solution provided by calculus. But showing the solutions match is not as strong as showing why they must match."
Expand the Total Error
Expand the total error encoded in the merit function:
$$M(c)
= \normts{\mathbf{A}c-y}
= \paren{\mathbf{A}c-y}^{T}\paren{\mathbf{A}c-y}
= c^{T}\mathbf{A}^{T}\mathbf{A}c - c^{T}\mathbf{A}^{T}y - y^{T}\mathbf{A}c + y^{T}y$$
State the dimensions:
$$
\mathbf{A}c: \paren{m\times n}\paren{n\times 1}=\paren{m\times 1}.
$$
Therefore,
$$
\paren{\mathbf{A}c}^{T}y: \paren{1\times m}\paren{m\times 1}=\paren{1\times 1}
$$
is a scalar. A scalar equals its own transpose, so $\paren{\mathbf{A}c}^{T}y = y^{T}\mathbf{A}c$ and
$$
- c^{T}\mathbf{A}^{T}y - y^{T}\mathbf{A}c = -2 c^{T}\mathbf{A}^{T}y.
$$
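A quick numerical sanity check of this expansion, continuing the earlier sketch (`c_test` is an arbitrary test vector, and `merit`, `A`, `y`, `n`, `rng` come from the snippets above):

```python
# Verify M(c) = c^T A^T A c - 2 c^T A^T y + y^T y at an arbitrary test vector.
c_test = rng.standard_normal(n)
lhs = merit(A, c_test, y)
rhs = c_test @ (A.T @ A) @ c_test - 2.0 * c_test @ (A.T @ y) + y @ y
print(np.allclose(lhs, rhs))                   # True
```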
Differentiate the Total Error
Again, the $2$-norm supports differentiation, which this time is done in vector form. The only rule needed is the derivative of the coefficient vector with respect to one of its components:
$$
\frac{\partial c}{\partial c_{k}}
=
\frac{\partial}{\partial c_{k}}
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{k} \\ \vdots \\ c_{n-1}
\end{bmatrix}
=
\begin{bmatrix}
0 \\ \vdots \\ 1 \\ \vdots \\ 0
\end{bmatrix}
= e_{k,n}
$$
The unit vector $e_{k,n}$ has length $n$ and all entries are $0$ except for the entry in the position of $c_{k}$, which is unity.
With this single rule, the general term becomes
$$
\partial_{k} M(c) = e_{k,n}^{T} \mathbf{A}^{T}\mathbf{A}c + c^{T}\mathbf{A}^{T}\mathbf{A} e_{k,n} - 2e_{k,n}^{T} \mathbf{A}^{T}y
$$
The first two terms are scalars and each is the transpose of the other, so they combine:
$$
\partial_{k} M(c) = 2 e_{k,n}^{T} \mathbf{A}^{T}\mathbf{A}c - 2e_{k,n}^{T} \mathbf{A}^{T}y
$$
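Stacked over $k$, this is the gradient $2\at\mathbf{A}c - 2\at y$. Continuing the sketch, it can be checked against central finite differences (the step size `eps` is an arbitrary choice):

```python
# Gradient in vector form: grad M(c) = 2 A^T A c - 2 A^T y.
c_test = rng.standard_normal(n)
grad_analytic = 2.0 * A.T @ (A @ c_test) - 2.0 * A.T @ y

# Central finite differences along each unit vector e_k.
eps = 1e-6
grad_fd = np.array([
    (merit(A, c_test + eps * np.eye(n)[k], y)
     - merit(A, c_test - eps * np.eye(n)[k], y)) / (2.0 * eps)
    for k in range(n)
])
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True, up to truncation error
```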
Remember, this represents $n$ equations $\paren{k=0,\ldots,n-1}$. The sequence looks like this:
\begin{equation}
\begin{array}{lccccc}
k=0
&\begin{bmatrix}
m & \sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n-1}\\
0 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots && \vdots \\
0 & 0 & 0 & \cdots & 0\\
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
\sum_{\mu=1}^{m}y_{\mu} \\ 0 \\ \vdots \\ 0
\end{bmatrix} \\
%%
k=1
&\begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
\sum_{\mu=1}^{m}x_{\mu} & \sum_{\mu=1}^{m}x_{\mu}^{2} & \sum_{\mu=1}^{m}x_{\mu}^{3} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{n}\\
\vdots & \vdots & \vdots && \vdots \\
0 & 0 & 0 & \cdots & 0\\
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
0 \\ \sum_{\mu=1}^{m}x_{\mu}y_{\mu} \\ \vdots \\ 0
\end{bmatrix} \\
%%
&&&\vdots \\
%%
k=n-1
&\begin{bmatrix}
0 & 0 & 0 & \cdots & 0\\
0 & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots && \vdots \\
\sum_{\mu=1}^{m}x_{\mu}^{n-1} & \sum_{\mu=1}^{m}x_{\mu}^{n} & \sum_{\mu=1}^{m}x_{\mu}^{n+1} & \cdots & \sum_{\mu=1}^{m}x_{\mu}^{2n-2}\\
\end{bmatrix} &
\begin{bmatrix}
c_{0} \\ \vdots \\ c_{n-1}
\end{bmatrix} &= &
\begin{bmatrix}
0 \\ 0 \\ \vdots \\ \sum_{\mu=1}^{m}x_{\mu}^{n-1}y_{\mu}
\end{bmatrix}
\end{array}
\end{equation}
These equations are linear, and the requirement for simultaneous solution suggests they be combined into the form in $\ref{b}$, the normal equations.
Different Paths, Same Destination
Stepping back, the two paths to the normal equations aren't really two paths. Both are minimization by finding the roots of the first derivative: the first method used scalar differentiation, the second used vector differentiation. You could summarize by saying "differentiation of the total error (and subsequent root extraction) produces the same result either way, as expected, since differentiation is a linear operator." With scalar differentiation we tackle the terms individually; with vector differentiation we tackle them all at once.
Excursions
Equivalence of least squares solutions and normal equations solutions: Deriving the Normal Equation used to solve Least Squares Problems
Least squares and uniqueness: Unique least square solutions