What is MAP?
The MAP criterion is derived from Bayes' rule, i.e.
\begin{equation}
P(A \vert B) = \frac{P(B \vert A)P(A)}{P(B)}
\end{equation}
If $B$ is chosen to be your data $\mathcal{D}$ and $A$ to be the parameters you want to estimate, call them $w$, you get
\begin{equation}
\underbrace{P(w \vert \mathcal{D})}_{\text{Posterior}} =
\frac{1}{\underbrace{P(\mathcal{D})}_{\text{Normalization}}}
\overbrace{P(\mathcal{D} \vert w)}^{\text{Likelihood}}\overbrace{P(w)}^{\text{Prior}} \tag{0}
\end{equation}
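Before specializing to $w$ and $\mathcal{D}$, here is a minimal numerical sketch of Bayes' rule (my own illustration, not from your slides) with two discrete hypotheses; the normalization is just the sum of likelihood times prior.

```python
# Minimal numerical check of Bayes' rule with two discrete hypotheses A in {0, 1}
# and one binary observation B = 1. All numbers are made up for illustration.
import numpy as np

P_A = np.array([0.3, 0.7])           # prior P(A)
P_B_given_A = np.array([0.9, 0.2])   # likelihood P(B = 1 | A)

P_B = np.sum(P_B_given_A * P_A)        # normalization P(B = 1)
P_A_given_B = P_B_given_A * P_A / P_B  # posterior via Bayes' rule

print(P_A_given_B)        # approximately [0.659, 0.341]
print(P_A_given_B.sum())  # 1.0 -- the posterior is a proper distribution
```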
Relation between MAP and ML
If $w$ is not random, then MAP is just Maximum Likelihood. Why? Because $P(w)$ is then constant, i.e. the trivial distribution of a non-random term, so it does not affect the maximization. In your case, however, $w$ is modeled as
\begin{equation}
w \sim \mathcal{N}(0, \lambda^{-1}I)
\end{equation}
and, for each of the $N$ data points $(x_k, y_k)$ in $\mathcal{D}$,
\begin{equation}
y_k \vert w \sim \mathcal{N}(w^T x_k, \sigma^2)
\end{equation}
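Concretely, one draw from this generative model looks like the sketch below; the data matrix, $\lambda$, $\sigma$ and the sizes $N$, $D$ are placeholder choices of mine, just for illustration.

```python
# Sketch of the assumed generative model:
#   w ~ N(0, (1/lambda) I)  and  y_k | w ~ N(x_k^T w, sigma^2).
# X, lam, sigma and the sizes N, D are placeholder choices.
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3                  # N data points, D-dimensional weight vector
lam, sigma = 2.0, 0.5

X = rng.standard_normal((N, D))                   # rows are the inputs x_1, ..., x_N
w = rng.normal(0.0, np.sqrt(1.0 / lam), size=D)   # one draw from the prior
y = X @ w + rng.normal(0.0, sigma, size=N)        # y_k = x_k^T w + Gaussian noise
```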
Deriving the Gaussian Prior
Using the multivariate Normal PDF with mean vector $\mu$ and covariance matrix $\Sigma$, which for a $D$-dimensional vector $x$ is
\begin{equation}
f(x) = \frac{1}{\sqrt{(2\pi)^{D} \det \Sigma}}\exp\Big(-\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu)\Big) \tag{1}
\end{equation}
$w$ is Normal with zero mean $\mu = 0$ and covariance $\Sigma = \lambda^{-1} I$. Plug these into $(1)$ and you get
\begin{equation}
f(w) = \frac{1}{\sqrt{(2\pi)^D \frac{1}{\lambda^D}}}\exp\Big(-\frac{1}{2}(w - 0)^T \Big(\frac{1}{\lambda} I\Big)^{-1} (w - 0)\Big)
\end{equation}
that is (your slides are missing the factor $\lambda^{\frac{D}{2}}$ in the numerator, but that doesn't matter since we will optimize with respect to $w$; I will keep it below):
\begin{equation}
f(w) = \frac{\lambda^{\frac{D}{2}}}{(2\pi)^{\frac{D}{2}}}\exp\Big(-\frac{\lambda}{2} w^T w\Big)
\end{equation}
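If you want to double-check the algebra, this closed form agrees with scipy's multivariate normal density; a minimal sketch with an arbitrary test point and an arbitrary $\lambda$:

```python
# Sanity check: the closed-form prior density above vs. scipy's multivariate normal.
import numpy as np
from scipy.stats import multivariate_normal

D, lam = 3, 2.0
w = np.array([0.4, -1.2, 0.7])   # arbitrary test point

# (lambda / (2 pi))^(D/2) * exp(-lambda/2 * w^T w), as derived above
f_w = (lam / (2 * np.pi)) ** (D / 2) * np.exp(-0.5 * lam * (w @ w))

# Reference: N(0, lambda^{-1} I) evaluated at w
ref = multivariate_normal(mean=np.zeros(D), cov=np.eye(D) / lam).pdf(w)

assert np.allclose(f_w, ref)
```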
Deriving the Likelihood function
To get $f(\mathcal{D} \vert w)$ we first need $f(y_k \vert w)$. You can again use equation $(1)$ for this, or, if you prefer, the univariate Normal density with mean $w^T x_k$ and variance $\sigma^2$, i.e.
\begin{equation}
f(y_k \vert w) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y_k- x_k^Tw)^2\Big)
\end{equation}
But since $y_1, \ldots, y_N$ are independent given $w$, we have
\begin{equation}
f(\mathcal{D} \vert w) = f(y_1, \ldots, y_N \vert w) = \prod_{k=1}^N f(y_k \vert w)
=
\prod_{k=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y_k- x_k^Tw)^2\Big)
\end{equation}
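Numerically, the product of the univariate densities indeed matches a single joint Gaussian with mean $Xw$ and covariance $\sigma^2 I$; a small sketch with placeholder data:

```python
# Independence check: the product of the univariate densities f(y_k | w)
# equals the joint Gaussian density N(y; X w, sigma^2 I). Placeholder data.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)
N, D, sigma = 10, 3, 0.5
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
y = X @ w + rng.normal(0.0, sigma, size=N)

per_point = norm(loc=X @ w, scale=sigma).pdf(y)                           # each f(y_k | w)
joint = multivariate_normal(mean=X @ w, cov=sigma**2 * np.eye(N)).pdf(y)  # joint density

assert np.allclose(np.prod(per_point), joint)
```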
Now take the log of equation $(0)$; you will have
\begin{equation}
\log P(w \vert \mathcal{D} ) = \log P( \mathcal{D} \vert w) + \log P(w) - \log P (\mathcal{D})
\end{equation}
MAP maximizes the posterior with respect to $w$, so
\begin{equation}
\hat{w} = \operatorname{argmax}_w \log P(w \vert \mathcal{D} )
\end{equation}
that is
\begin{equation}
\hat{w} = \operatorname{argmax}_w \Big( \log P( \mathcal{D} \vert w) + \log P(w) - \log P (\mathcal{D}) \Big)
\end{equation}
$\log P (\mathcal{D})$ is independent of $w$, so we can drop it:
\begin{equation}
\hat{w} = \operatorname{argmax}_w \Big( \log P( \mathcal{D} \vert w) + \log P(w) \Big) \tag{o}
\end{equation}
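That dropping $\log P(\mathcal{D})$ leaves the maximizer untouched is easy to see numerically: on a grid of candidate $w$ values (1-D here, with placeholder data), normalizing shifts every log-posterior value by the same constant, so the argmax does not move.

```python
# Dropping log P(D) does not move the argmax: the normalization only shifts
# every log-posterior value by the same constant. 1-D grid, placeholder data.
import numpy as np
from scipy.stats import norm
from scipy.special import logsumexp

rng = np.random.default_rng(2)
lam, sigma = 2.0, 0.5
x = rng.standard_normal(20)                      # 1-D inputs
y = 1.3 * x + rng.normal(0.0, sigma, size=20)    # data from a true w of 1.3

w_grid = np.linspace(-3.0, 3.0, 1001)
log_lik = np.array([norm(loc=w * x, scale=sigma).logpdf(y).sum() for w in w_grid])
log_prior = norm(loc=0.0, scale=np.sqrt(1.0 / lam)).logpdf(w_grid)

unnormalized = log_lik + log_prior                                       # log P(D|w) + log P(w)
log_evidence = logsumexp(unnormalized) + np.log(w_grid[1] - w_grid[0])   # grid estimate of log P(D)
normalized = unnormalized - log_evidence                                 # log P(w|D) on the grid

assert np.argmax(unnormalized) == np.argmax(normalized)
```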
Notice that
\begin{equation}
\log P( \mathcal{D} \vert w) = \log \Big( \prod_{k=1}^N \frac{1}{\sqrt{2\pi\sigma^2}}\exp\big(-\frac{1}{2\sigma^2}(y_k- x_k^Tw)^2\big) \Big)
\end{equation}
The log of a product is the sum of the logs, so
\begin{equation}
\log P( \mathcal{D} \vert w) = \sum_{k=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2\sigma^2}(y_k- x_k^Tw)^2\Big)
\end{equation}
which is
\begin{equation}
\log P( \mathcal{D} \vert w) = \sum_{k=1}^N \log \frac{1}{\sqrt{2\pi\sigma^2}} -\frac{1}{2\sigma^2}\sum_{k=1}^N (y_k- x_k^Tw)^2
\end{equation}
$\log \frac{1}{\sqrt{2\pi\sigma^2}}$ is independent of the summation index, hence
\begin{equation}
\log P( \mathcal{D} \vert w) = N\log \frac{1}{\sqrt{2\pi\sigma^2}} -\frac{1}{2\sigma^2}\sum_{k=1}^N (y_k- x_k^Tw)^2 \tag{*}
\end{equation}
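Equation $(*)$ is easy to verify against scipy's log-density (again with placeholder data):

```python
# Check equation (*): the closed form equals the sum of per-point log-densities.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
N, D, sigma = 50, 3, 0.5
X = rng.standard_normal((N, D))
w = rng.standard_normal(D)
y = X @ w + rng.normal(0.0, sigma, size=N)

residuals = y - X @ w
closed_form = N * np.log(1.0 / np.sqrt(2 * np.pi * sigma**2)) \
              - np.sum(residuals**2) / (2 * sigma**2)
reference = norm(loc=X @ w, scale=sigma).logpdf(y).sum()

assert np.allclose(closed_form, reference)
```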
Now take the log of $f(w)$; you'll get
\begin{equation}
\log f(w) = \log \lambda^{\frac{D}{2}} - \log (2\pi)^{\frac{D}{2}} - \frac{\lambda}{2}w^Tw \tag{**}
\end{equation}
Deriving the MAP criterion
Substitute $(*)$ and $(**)$ into $(o)$; you will get
\begin{equation}
\hat{w} = \operatorname{argmax}_w \Big( N\log \frac{1}{\sqrt{2\pi\sigma^2}} -\frac{1}{2\sigma^2}\sum_{k=1}^N (y_k- x_k^Tw)^2 + \log \lambda^{\frac{D}{2}} - \log (2\pi)^{\frac{D}{2}} - \frac{\lambda}{2}w^Tw \Big)
\end{equation}
Again, remove the terms that do not depend on $w$; you will get
\begin{equation}
\hat{w} = \operatorname{argmax}_w \Big( -\frac{1}{2\sigma^2}\sum_{k=1}^N (y_k- x_k^Tw)^2 - \frac{\lambda}{2}w^Tw \Big)
\end{equation}
Maximizing $-g(w)$ is equivalent to minimizing $g(w)$, hence
\begin{equation}
\hat{w} = \operatorname{argmin}_w \Big( \frac{1}{2\sigma^2}\sum_{k=1}^N (y_k- x_k^Tw)^2 + \frac{\lambda}{2}w^Tw \Big)
\end{equation}
In terms of linear regression, this is known as $\ell_2$ regularization, a.k.a. Tikhonov regularization (ridge regression).
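As a final check (my own sketch, with placeholder data), the minimizer of this objective has the well-known closed form $\hat{w} = (X^TX + \sigma^2\lambda I)^{-1}X^Ty$, and a generic numerical optimizer applied to the MAP objective above lands on the same point:

```python
# The MAP objective from the last equation, minimized two ways:
#   (i) the closed form (X^T X + sigma^2 lambda I)^{-1} X^T y,
#  (ii) a generic numerical optimizer.
# Placeholder data; lam and sigma are arbitrary choices.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
N, D, lam, sigma = 100, 3, 2.0, 0.5
X = rng.standard_normal((N, D))
w_true = rng.normal(0.0, np.sqrt(1.0 / lam), size=D)
y = X @ w_true + rng.normal(0.0, sigma, size=N)

def map_objective(w):
    # (1 / (2 sigma^2)) sum_k (y_k - x_k^T w)^2 + (lambda / 2) w^T w
    return np.sum((y - X @ w) ** 2) / (2 * sigma**2) + 0.5 * lam * (w @ w)

w_closed = np.linalg.solve(X.T @ X + sigma**2 * lam * np.eye(D), X.T @ y)
w_numeric = minimize(map_objective, x0=np.zeros(D)).x

assert np.allclose(w_closed, w_numeric, atol=1e-4)
```

Note that multiplying the objective by $2\sigma^2$ gives the more familiar ridge form $\sum_{k=1}^N (y_k- x_k^Tw)^2 + \sigma^2\lambda\, w^Tw$, so the effective regularization strength is $\sigma^2\lambda$.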