I have a doubt related to the derivation of the normal equations for maximum likelihood and least squares. Excuse me, and do let me know if I shouldn't have asked a new question.
This question relates to Section 3.1.1, page 148, of Pattern Recognition and Machine Learning by Christopher Bishop.
My doubt concerns point (b) in @LCS11's explanation/statement.
Reiterating: this concerns the derivation based on the log of the Gaussian conditional probability:
Step 1: $\ln p(\mathbf t \mid \mathbf x, \mathbf w, \beta) = \sum_{n=1}^N \ln \mathcal N\left(t_n \mid \mathbf w^T \boldsymbol\phi(\mathbf x_n), \beta^{-1}\right)$
Step 2, expanding the RHS: \begin{align*} \frac{N}{2} \ln \beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=1}^N \{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}^2 \end{align*}
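(To make sure I'm reading Steps 1 and 2 correctly, here is a small NumPy check I put together myself; the 1-D polynomial basis, the random data, and all variable names are my own made-up toy setup, not from the book.)

```python
import numpy as np
from scipy.stats import norm

# Toy setup (my assumption): 1-D inputs, basis phi(x) = (1, x, x^2), so M = 3.
rng = np.random.default_rng(0)
N, beta = 50, 4.0
x = rng.uniform(-1.0, 1.0, size=N)
t = rng.normal(size=N)
w = rng.normal(size=3)

Phi = np.stack([np.ones_like(x), x, x**2], axis=1)  # N x M, row n is phi(x_n)^T
mean = Phi @ w                                      # w^T phi(x_n) for each n

# Step 1: sum of per-point Gaussian log densities with precision beta
step1 = norm.logpdf(t, loc=mean, scale=beta**-0.5).sum()

# Step 2: the expanded form of the log-likelihood
step2 = (N / 2) * np.log(beta) - (N / 2) * np.log(2 * np.pi) \
        - (beta / 2) * np.sum((t - mean) ** 2)

print(np.isclose(step1, step2))  # expected: True
```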
Taking the gradient of the Step 2 expression w.r.t. $\mathbf w$ and setting it equal to $0$:
Step 3: \begin{align*} \sum_{n=1}^N \{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\, \boldsymbol\phi(\mathbf x_n)^T = 0 \end{align*}
Step 4: \begin{align*} \sum_{n=1}^N t_n\, \boldsymbol\phi(\mathbf x_n)^T = \mathbf w^T \left(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, \boldsymbol\phi(\mathbf x_n)^T\right) \end{align*}
Solving for $\mathbf w$:
Step 5: \begin{align*} \mathbf w^T \left(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, \boldsymbol\phi(\mathbf x_n)^T\right) = \sum_{n=1}^N t_n\, \boldsymbol\phi(\mathbf x_n)^T \end{align*}
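(Again only for my own sanity: a finite-difference check, with the same kind of made-up toy basis as above, that the summation in Step 3 really is the gradient of the Step 2 expression. Everything here is my own sketch, not the book's code.)

```python
import numpy as np

# Toy setup (my assumption): 1-D inputs, polynomial basis, M = 3.
rng = np.random.default_rng(1)
N, M, beta = 50, 3, 4.0
x = rng.uniform(-1.0, 1.0, size=N)
t = rng.normal(size=N)
w = rng.normal(size=M)
Phi = np.stack([np.ones_like(x), x, x**2], axis=1)  # N x M, row n is phi(x_n)^T

def log_lik(w):
    """Step 2: N/2 ln(beta) - N/2 ln(2 pi) - beta/2 * sum of squared errors."""
    err = t - Phi @ w
    return (N / 2) * np.log(beta) - (N / 2) * np.log(2 * np.pi) \
           - (beta / 2) * np.sum(err ** 2)

# Step 3 (times beta), written exactly as the summation over n
grad_step3 = beta * sum((t[n] - Phi[n] @ w) * Phi[n] for n in range(N))

# Central finite differences of the Step 2 expression
eps = 1e-6
grad_fd = np.array([(log_lik(w + eps * e) - log_lik(w - eps * e)) / (2 * eps)
                    for e in np.eye(M)])

print(np.allclose(grad_step3, grad_fd, atol=1e-4))  # expected: True
```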
After Step 5, I fail to understand how the author moves on to give:
\begin{align*} \mathbf w_{\mathrm{ML}} = (\boldsymbol\Phi^T \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf t \end{align*}
Question: please help me, for I couldn't understand how (after Step 5)
$\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, \boldsymbol\phi(\mathbf x_n)^T = \boldsymbol\Phi^T \boldsymbol\Phi$ and $\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, t_n = \boldsymbol\Phi^T \mathbf t$,
where $\boldsymbol\Phi \in \mathbb R^{N \times M}$ is the design matrix (below):
\begin{align*} \boldsymbol\Phi = \begin{bmatrix} \phi_0(\mathbf x_1) & \phi_1(\mathbf x_1) & \dots & \phi_{M-1}(\mathbf x_1)\\ \phi_0(\mathbf x_2) & \phi_1(\mathbf x_2) & \dots & \phi_{M-1}(\mathbf x_2)\\ \vdots & \vdots & \ddots & \vdots\\ \phi_0(\mathbf x_N) & \phi_1(\mathbf x_N) & \dots & \phi_{M-1}(\mathbf x_N) \end{bmatrix} \end{align*}
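(To at least convince myself numerically that these two identities hold, before understanding why, I wrote this small check. The polynomial basis and random data are my own toy assumptions; the rows of `Phi` are $\boldsymbol\phi(\mathbf x_n)^T$, exactly as in the design matrix above.)

```python
import numpy as np

# Toy setup (my assumption): 1-D inputs, basis phi_j(x) = x^j, j = 0..M-1.
rng = np.random.default_rng(2)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
t = rng.normal(size=N)
Phi = np.stack([x**j for j in range(M)], axis=1)  # N x M, row n is phi(x_n)^T

# Left-hand sides, written exactly as the summations over n
sum_outer = sum(np.outer(Phi[n], Phi[n]) for n in range(N))  # sum_n phi(x_n) phi(x_n)^T
sum_phi_t = sum(t[n] * Phi[n] for n in range(N))             # sum_n phi(x_n) t_n

# Right-hand sides, as the book writes them
print(np.allclose(sum_outer, Phi.T @ Phi))  # expected: True
print(np.allclose(sum_phi_t, Phi.T @ t))    # expected: True
```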
Related question: the author obtains the following after taking the gradient w.r.t. $\mathbf w$:
\begin{align*} \nabla \ln p(\mathbf t \mid \mathbf w, \beta) = \beta \sum_{n=1}^N \{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\, \boldsymbol\phi(\mathbf x_n)^T \end{align*}
Equating the above gradient to $0$ and rearranging gives:
\begin{align*} \mathbf w^T \left(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, \boldsymbol\phi(\mathbf x_n)^T\right) = \sum_{n=1}^N t_n\, \boldsymbol\phi(\mathbf x_n)^T \end{align*}
I convert this to matrix form to get (correct me in case of wrong understanding):
\begin{align*} \mathbf w^T (\boldsymbol\Phi \boldsymbol\Phi^T) = \mathbf t \boldsymbol\Phi^T \end{align*}
Taking the transpose of both sides gives:
\begin{align*} (\boldsymbol\Phi \boldsymbol\Phi^T)\, \mathbf w = \boldsymbol\Phi \mathbf t^T \end{align*}
Pre-multiplying both sides by $(\boldsymbol\Phi \boldsymbol\Phi^T)^{-1}$:
\begin{align*} \mathbf w_{\mathrm{ML}} = (\boldsymbol\Phi \boldsymbol\Phi^T)^{-1} \boldsymbol\Phi \mathbf t^T \end{align*}
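(Shape bookkeeping from my attempt above, with the same kind of toy setup; I'm only printing the dimensions I get, because this is exactly where I lose track of how my rearrangement lines up with the book's $(\boldsymbol\Phi^T\boldsymbol\Phi)^{-1}\boldsymbol\Phi^T\mathbf t$.)

```python
import numpy as np

# Toy setup (my assumption): N = 50 points, M = 4 basis functions.
rng = np.random.default_rng(3)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
t = rng.normal(size=N)
Phi = np.stack([x**j for j in range(M)], axis=1)  # N x M, row n is phi(x_n)^T

print((Phi.T @ Phi).shape)  # (M, M) -- the product the book inverts
print((Phi @ Phi.T).shape)  # (N, N) -- the product appearing in my rearrangement
print((Phi.T @ t).shape)    # (M,)   -- matches the book's Phi^T t
```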
I'm not sure how the author arrives at the same result as explained by @Syd Amerikaner. Please guide me / point out where I'm going off-track.
I've read Strang and Meyer, but it's been some time, and the rustiness shows, for I'm asking an elementary question.