
The basic setup of the multiple linear regression model is

\begin{align} Y &= \begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{n} \end{bmatrix} \end{align}

\begin{align} X &= \begin{bmatrix} 1 & x_{11} & \dots & x_{1k}\\ 1 & x_{21} & \dots & x_{2k}\\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n1} & \dots & x_{nk} \end{bmatrix} \end{align}

\begin{align} \beta &= \begin{bmatrix} \beta_{0} \\ \beta_{1} \\ \vdots \\ \beta_{k} \end{bmatrix} \end{align}

\begin{align} \epsilon &= \begin{bmatrix} \epsilon_{1} \\ \epsilon_{2} \\ \vdots \\ \epsilon_{n} \end{bmatrix} \end{align}

The regression model is $Y=X \beta + \epsilon$.

To find the least squares estimator of the vector $\beta$, we need to minimize $S(\beta)=\sum_{i=1}^n \epsilon_i^2 = \epsilon'\epsilon = (y-X\beta)'(y-X\beta)=y'y-2\beta'X'y + \beta'X'X\beta$ and set

$$\frac{\partial S(\beta)}{\partial \beta}=0$$

My question: how do we get $\frac{\partial S(\beta)}{\partial \beta} = -2X'y+2X'X\beta$?
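
For concreteness, here is a minimal NumPy sketch (made-up data; the variable names are only illustrative) that builds $X$ with an intercept column and checks numerically that $S(\beta)=\epsilon'\epsilon$ agrees with the expanded form $y'y-2\beta'X'y+\beta'X'X\beta$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3                          # n observations, k regressors

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta_true = rng.normal(size=k + 1)
y = X @ beta_true + rng.normal(size=n)

def S(b):
    """Sum of squared errors S(b) = (y - Xb)'(y - Xb)."""
    e = y - X @ b
    return e @ e

# Check the expansion y'y - 2 b'X'y + b'X'X b at an arbitrary trial point b
b = rng.normal(size=k + 1)
expanded = y @ y - 2 * b @ (X.T @ y) + b @ X.T @ X @ b
print(np.isclose(S(b), expanded))     # True
```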

Jack
  • 175
Mariana
  • 1,253

2 Answers


The sum of the squared errors can be written as

$$ \left\lVert \epsilon \right\rVert^2 = \left\lVert Y - X\beta \right \rVert^2 $$ $$ = (Y - X\beta)^T(Y - X\beta) = Y^TY - \beta^TX^TY - Y^TX\beta + \beta^TX^TX\beta $$ $$ = \left\lVert Y \right\rVert^2 - 2Y^TX\beta + \left\lVert X\beta \right\rVert^2 $$

Then, taking the gradient $\frac{d\left\lVert \epsilon \right\rVert^2}{d\beta}$ gives

$$ \frac{d\left\lVert \epsilon \right\rVert^2}{d\beta} = -2X^TY + 2X^TX\beta $$

Reviewing this differentiation term by term (since differentiation is a linear operator):

$\left\lVert Y \right\rVert^2$ does not depend on $\beta$, so its gradient is $0$.

$-2Y^TX\beta$ is a sum whose $i^{th}$ term is $-2\beta_iY^Tx_i$, where $x_i$ is the $i^{th}$ column of $X$. Its partial derivative with respect to $\beta_i$ is $-2Y^Tx_i = -2x_i^TY$, so the gradient of this term is $-2X^TY$.

$\left\lVert X\beta \right\rVert^2 = \beta^TX^TX\beta$ can be differentiated using the product rule: $d\left(\beta^TX^TX\beta\right) = (d\beta)^TX^TX\beta + \beta^TX^TX\,d\beta = 2\beta^TX^TX\,d\beta$, so the gradient of this term is $2X^TX\beta$.

In general, the Jacobian of $f(x) = Ax$ is $J_f = A$; in the denominator-layout convention this is written $\partial(Ax)/\partial x = A^T$.
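
As a quick sanity check on this gradient, a small NumPy sketch (synthetic data, illustrative names) comparing $-2X^TY + 2X^TX\beta$ with a central finite-difference approximation of $\left\lVert Y - X\beta \right\rVert^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 40, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
Y = rng.normal(size=n)
beta = rng.normal(size=p)

def sse(b):
    """|| Y - X b ||^2"""
    r = Y - X @ b
    return r @ r

# Analytic gradient from above: -2 X'Y + 2 X'X beta
grad_analytic = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite differences, one coordinate of beta at a time
h = 1e-6
grad_fd = np.array([(sse(beta + h * e) - sse(beta - h * e)) / (2 * h)
                    for e in np.eye(p)])

print(np.allclose(grad_analytic, grad_fd, atol=1e-4))   # True
```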

Jack
  • 175
  • Your last sentence is very helpful. But how do I apply that rule to $\beta'X'X\beta$? – Mariana Sep 16 '21 at 12:12
  • I find this source more helpful: http://www.gatsby.ucl.ac.uk/teaching/courses/sntn/sntn-2017/resources/Matrix_derivatives_cribsheet.pdf – Mariana Sep 16 '21 at 12:20

$ \def\b{\beta}\def\e{\varepsilon}\def\l{\lambda}\def\p{\partial} \def\L{\left}\def\R{\right}\def\LR#1{\L(#1\R)} \def\c#1{\color{red}{#1}} \def\CLR#1{\c{\LR{#1}}} \def\grad#1#2{\frac{\p #1}{\p #2}} $Instead of expanding $\,\e\,$ before differentiating, do the differentiation first.

$$\eqalign{ \e &= X\b-Y \\ \l &= \e^T\e \\ d\l &= 2\e^T\c{d\e} = 2\e^T\CLR{X\,d\b} = \LR{2X^T\e}^Td\b \\ \grad{\l}{\b} &= 2X^T\c{\e} = 2X^T\CLR{X\b-Y} \\ }$$

To find the minimizer, set this gradient to zero and solve for $\b$.

$$\eqalign{ X^TX\b &= X^TY \\ \b &= \c{\LR{X^TX}^{-1}X^T}Y \;=\; \c{X^+}Y \\ }$$

where $X^+$ denotes the pseudoinverse of $X$.
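
A brief numerical illustration of the last step, assuming NumPy (the data below are made up): the normal-equations solution agrees with the pseudoinverse solution $X^+Y$ and with `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

# Normal equations: X'X beta = X'Y
beta_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Pseudoinverse solution: beta = X^+ Y
beta_pinv = np.linalg.pinv(X) @ Y

# Library least-squares solver for comparison
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(np.allclose(beta_normal, beta_pinv),
      np.allclose(beta_normal, beta_lstsq))              # True True
```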

greg
  • 35,825