
I have a doubt related to the derivation of the normal equations for maximum likelihood and least squares. Excuse me, and do let me know if I shouldn't have asked a new question.

This question is related to section 3.1.1, page 148, of Pattern Recognition and Machine Learning by Christopher Bishop.

My doubt is related to point (b) in @LCS11's explanation/statement.

Reiterating: this is related to the mathematical derivation of the log-likelihood (of the Gaussian conditional probability):

Step 1: $\ln p(\mathbf t \mid \mathbf x, \mathbf w, \beta) = \sum_{n=1}^N \ln \mathcal N\big(t_n \mid \mathbf w^T \boldsymbol\phi(\mathbf x_n), \beta^{-1}\big)$

Step 2, expanding the RHS: $\begin{align*} \frac{N}{2} \ln \beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=1}^N\{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}^2 \end{align*}$
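(For completeness, Step 2 follows because each Gaussian term expands as $\ln \mathcal N\big(t_n \mid \mathbf w^T \boldsymbol\phi(\mathbf x_n), \beta^{-1}\big) = \tfrac{1}{2}\ln\beta - \tfrac{1}{2}\ln 2\pi - \tfrac{\beta}{2}\{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}^2$, and summing over $n$ gives the expression above.)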

Taking the gradient of Step 2 w.r.t. $\mathbf w$ and setting it equal to $0$:

Step 3: $\begin{align*} \sum_{n=1}^N\{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\boldsymbol\phi(\mathbf x_n)^T = 0 \end{align*}$

Step 4: $\begin{align*} \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T = \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) \end{align*}$

rearranging:

Step 5: $\begin{align*} \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) = \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T \end{align*}$

Hereafter, I fail to understand how the author moves on to give:

$\begin{align*} \mathbf w_{\mathrm{ML}} = (\boldsymbol\Phi^T \boldsymbol\Phi)^{-1} \boldsymbol\Phi^T \mathbf t \end{align*}$

Question: Please help me; I couldn't understand how (after Step 5):

$\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T = \boldsymbol\Phi^T \boldsymbol\Phi$ and $\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n)\, t_n = \boldsymbol\Phi^T \mathbf t$,

where $\boldsymbol\Phi \in \mathbb R^{N\times M}$ is the design matrix (below; a small numerical check of these identities follows the matrix):

$$\boldsymbol\Phi = \begin{bmatrix} \phi_0(\mathbf x_1)&\phi_1(\mathbf x_1)&\dots&\phi_{M-1}(\mathbf x_1)\\ \phi_0(\mathbf x_2)&\phi_1(\mathbf x_2)&\dots&\phi_{M-1}(\mathbf x_2)\\ \vdots &\vdots & \ddots & \vdots\\ \phi_0(\mathbf x_N)&\phi_1(\mathbf x_N)&\dots&\phi_{M-1}(\mathbf x_N)\end{bmatrix}$$
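Numerically the two identities do seem to hold; here is a small check I put together (my own sketch, not from the book; the sizes $N=5$, $M=3$ and the polynomial basis $\boldsymbol\phi(x) = (1, x, x^2)^T$ are arbitrary assumptions):

```python
# Numerical check of: sum_n phi(x_n) phi(x_n)^T == Phi^T Phi
#                and: sum_n t_n phi(x_n)       == Phi^T t
# Arbitrary choices (my assumptions): N = 5, M = 3, phi(x) = (1, x, x^2).
import numpy as np

rng = np.random.default_rng(0)
N, M = 5, 3
x = rng.normal(size=N)
t = rng.normal(size=N)

def phi(x_n):
    """Feature vector phi(x_n) in R^M (here: a simple polynomial basis)."""
    return np.array([1.0, x_n, x_n**2])

# Design matrix: row n is phi(x_n)^T, so Phi has shape N x M.
Phi = np.stack([phi(x_n) for x_n in x])

sum_outer = sum(np.outer(phi(x_n), phi(x_n)) for x_n in x)   # sum_n phi phi^T
sum_moment = sum(t_n * phi(x_n) for x_n, t_n in zip(x, t))   # sum_n t_n phi

print(np.allclose(sum_outer, Phi.T @ Phi))    # True
print(np.allclose(sum_moment, Phi.T @ t))     # True
```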

Related question: the author obtains the following after applying the gradient w.r.t. $\mathbf w$:

\begin{align*} \nabla \ln p(\mathbf t \mid \mathbf w, \beta) = \beta \sum_{n=1}^N \{t_n - \mathbf w^T \boldsymbol\phi(\mathbf x_n)\}\, \boldsymbol\phi(\mathbf x_n)^T \end{align*}

We equate the above gradient to $0$:

\begin{align*} \mathbf w^T\Big(\sum_{n=1}^N \boldsymbol\phi(\mathbf x_n) \boldsymbol\phi(\mathbf x_n)^T\Big) = \sum_{n=1}^N t_n \boldsymbol\phi(\mathbf x_n)^T \end{align*}

I convert this to matrix form to get (correct me in case of a wrong understanding):

\begin{align*} \mathbf w^T(\mathbf {\Phi \Phi^T}) = \mathbf {t\Phi^T} \end{align*}

Applying the transpose to both sides gives:

\begin{align*} (\mathbf {\Phi \Phi^T}) \mathbf w = \mathbf {\Phi t^T} \end{align*}

Pre-multiplying both sides by $(\mathbf {\Phi \Phi^T})^{-1}$:

\begin{align*} \mathbf w_{ML} = (\mathbf {\Phi \Phi^T})^{-1} \mathbf {\Phi t^T} \end{align*}
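(When I try a quick shape check of my conversion, the dimensions do not even conform, so this is possibly where I go off track. A minimal sketch, with arbitrary assumed sizes $N=5$, $M=3$:)

```python
# Shape check of my matrix-form conversion (arbitrary assumed sizes).
# Phi is the N x M design matrix, w is in R^M, t is in R^N.
import numpy as np

N, M = 5, 3
Phi = np.random.default_rng(1).normal(size=(N, M))

print((Phi @ Phi.T).shape)   # (5, 5): N x N, so w^T (Phi Phi^T) does not conform
print((Phi.T @ Phi).shape)   # (3, 3): M x M, which w^T (1 x M) can multiply
```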

I'm not sure how the author arrives at the same result as explained by @Syd Amerikaner. Please guide/point out where I'm going off track.

I've read Strang and Meyer, but it's been some time; the rustiness shows, which is why I'm asking an elementary question.

  • There is something wrong; if $\phi$ maps from $\mathbb R^N$ to $\mathbb R$, $\phi(\boldsymbol x)$ is a scalar for any $\boldsymbol x\in\mathbb R^N$. Thus, the expression $\phi(\boldsymbol x_n)\phi(\boldsymbol x_n)^T$ becomes $\phi(\boldsymbol x_n)^2$. So why is there a transpose? Thus, I suspect that $\phi$ maps from $\mathbb R^N$ to $\mathbb R^M$. – Syd Amerikaner Mar 12 '22 at 09:58
  • In that case, put $$\boldsymbol\Phi = \begin{pmatrix} \phi(\boldsymbol x_1)^T \\ \phi(\boldsymbol x_2)^T \\ \vdots \\ \phi(\boldsymbol x_n)^T \end{pmatrix},$$ and $$\boldsymbol t = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_n \end{pmatrix}.$$ Then apply the usual rules of matrix multiplication to obtain the equalities. – Syd Amerikaner Mar 12 '22 at 10:01
  • Thanks. In my quest to keep my question brief, I erred. I'll add details. – Continue2Learn Mar 12 '22 at 11:34

1 Answer


Let $$\boldsymbol t = \begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_N \end{pmatrix},\qquad\text{and}\qquad \boldsymbol\Phi = \begin{pmatrix} \boldsymbol\phi(\boldsymbol x_1)^\top \\ \boldsymbol\phi(\boldsymbol x_2)^\top \\ \vdots \\ \boldsymbol\phi(\boldsymbol x_N)^\top \end{pmatrix}.$$ Note that $$\boldsymbol\phi(\boldsymbol x_n) = \begin{pmatrix} \phi_0(\boldsymbol x_n) \\ \phi_1(\boldsymbol x_n) \\ \vdots \\ \phi_{M-1}(\boldsymbol x_n) \end{pmatrix}.$$

The log-likelihood function is given by \begin{align*}\log\mathcal L(\boldsymbol w) &= \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2. \end{align*} Since $\boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)$ is a scalar (i.e. an element of $\mathbb R$), it holds that $\boldsymbol\phi(\boldsymbol x_n)^\top\boldsymbol w = \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)$. Thus, \begin{align*}\log\mathcal L(\boldsymbol w) &= \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2 \\ &= \frac N2 \log(\beta) -\frac N2 \log(2\pi) - \frac \beta 2 \sum_{n=1}^N\big(t_n - \boldsymbol\phi(\boldsymbol x_n)^\top\boldsymbol w\big)^2. \end{align*}

Note that $\sum_{n=1}^N\big(t_n - \boldsymbol w^\top\boldsymbol\phi(\boldsymbol x_n)\big)^2 = (\boldsymbol t - \boldsymbol\Phi\boldsymbol w)^\top(\boldsymbol t - \boldsymbol\Phi\boldsymbol w)$, as for any vector $\boldsymbol z\in\mathbb R^N$ it holds that $\boldsymbol z^\top \boldsymbol z = \sum_{n=1}^N z_n^2$. It's a basic linear algebra exercise to show this equality. Next, \begin{align*} (\boldsymbol t - \boldsymbol\Phi\boldsymbol w)^\top(\boldsymbol t - \boldsymbol\Phi\boldsymbol w) &= \boldsymbol t^\top\boldsymbol t - \boldsymbol t^\top\boldsymbol\Phi\boldsymbol w - (\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t + (\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol\Phi\boldsymbol w \\ &= \boldsymbol t^\top\boldsymbol t - 2\boldsymbol t^\top\boldsymbol\Phi\boldsymbol w + \boldsymbol w^\top\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w, \end{align*} using again the fact that $(\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t$ is a scalar and thus $(\boldsymbol\Phi\boldsymbol w)^\top\boldsymbol t = \boldsymbol t^\top\boldsymbol\Phi\boldsymbol w$.

Since the log-likelihood is differentiable and concave in $\boldsymbol w$, setting the derivative to zero yields the maximum. Recall the derivative rules \begin{align*} \frac{\partial}{\partial\boldsymbol w}C &= \boldsymbol 0, \\ \frac{\partial}{\partial\boldsymbol w}\big(\boldsymbol t^\top\boldsymbol\Phi\boldsymbol w\big) &= \boldsymbol\Phi^\top\boldsymbol t,\\ \frac{\partial}{\partial\boldsymbol w}\big(\boldsymbol w^\top\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w\big) &= 2\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w, \end{align*} and the linearity of the derivative operator to obtain $$\frac{\partial}{\partial\boldsymbol w}\big(\log\mathcal L(\boldsymbol w)\big) = \boldsymbol 0 - \frac\beta 2\Big(\boldsymbol 0 - 2\boldsymbol\Phi^\top\boldsymbol t + 2\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w\Big) = \beta\boldsymbol\Phi^\top\boldsymbol t - \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w.$$

Setting the derivative to zero yields $$\boldsymbol 0 = \beta\boldsymbol\Phi^\top\boldsymbol t - \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w,$$ or $$\beta\boldsymbol\Phi^\top\boldsymbol t = \beta\boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w.$$ Dividing both sides by $\beta$ and premultiplying both sides by $(\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}$ from the left yields $$\boldsymbol w = (\boldsymbol\Phi^\top\boldsymbol\Phi)^{-1}\boldsymbol\Phi^\top\boldsymbol t.$$
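As a numerical sanity check (a minimal sketch of my own, not from PRML; the polynomial basis, the sizes, and the toy data below are arbitrary assumptions), the closed form agrees with a standard least-squares solver:

```python
# Verify w = (Phi^T Phi)^{-1} Phi^T t against numpy's least-squares solver.
# The basis (polynomial via np.vander), N, M, and the toy data are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, size=N)
Phi = np.vander(x, M, increasing=True)            # rows are phi(x_n)^T, shape N x M
t = np.sin(np.pi * x) + 0.1 * rng.normal(size=N)  # noisy targets

w_normal_eq = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)  # normal equations
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)      # QR/SVD-based solver

print(np.allclose(w_normal_eq, w_lstsq))   # True
```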

  • Beautiful; got it! A query: does the derivative follow the denominator layout, as mentioned on Wikipedia? Is that the case throughout the PRML chapters, in case you've gone through the book/chapters? And a minor input: shouldn't there be a minus (−) sign before $\beta/2$ after the log function is applied? – Continue2Learn Mar 13 '22 at 05:55
  • I have a question related to the author's proof; I'll be adding it to the question above. I would appreciate it if a fellow learner could answer it. Excuse me if I break any protocol, and do point it out to me. – Continue2Learn Mar 13 '22 at 06:25
  • Thanks for spotting the mistake; of course it's $-\frac{\beta}2$. I fixed it. Regarding your other question: I don't have access to the source, but it doesn't matter whether you decide on the numerator or the denominator layout, as the derivative becomes either $\boldsymbol \Phi^\top\boldsymbol t - \boldsymbol\Phi^\top\boldsymbol\Phi\boldsymbol w$ (denominator layout) or $\boldsymbol t^\top\boldsymbol\Phi - \boldsymbol w^\top\boldsymbol\Phi^\top\boldsymbol\Phi$ (numerator layout). The two expressions differ only by a transpose. – Syd Amerikaner Mar 13 '22 at 10:53
  • 1
    From standard matrix multiplication rules it follows that \begin{align}\boldsymbol\Phi^\top\boldsymbol\Phi &= \Big(\begin{array}{cccc} \boldsymbol\phi(\boldsymbol x_1) & \phi(\boldsymbol x_2) & \cdots & \phi(\boldsymbol x_n)\end{array}\Big)\begin{pmatrix} \phi(\boldsymbol x_1)^\top \ \phi(\boldsymbol x_2)^\top \ \vdots \ \phi(\boldsymbol x_n)^\top \end{pmatrix} \ &= \sum_{n=1}^N\boldsymbol\phi(\boldsymbol x_n)\boldsymbol\phi(\boldsymbol x_n)^\top.\end{align} – Syd Amerikaner Mar 13 '22 at 10:57
  • Got it. Many thanks for patiently answering my questions. Regards. – Continue2Learn Mar 13 '22 at 11:38