
Let $x_i \in \mathbb{R}^n$, $y_i\in\mathbb{R}$, $i=1,\cdots,l$, be a training set for a linear model of the form $y = w^Tx$, for some $w\in\mathbb{R}^n$.

We take the mean squared error (MSE) as the loss function: $$L(w) = \frac{1}{l} \sum_{i=1}^l(w^Tx_i-y_i)^2 = \frac{1}{l}||Xw-y||^2,$$ where $X = \begin{bmatrix}x_1^T\\\vdots\\ x_l^T\end{bmatrix}$.

So, can someone explain to me why, when we set $L'(w) = 0$, we get $w = (X^TX)^{-1}X^Ty$?

Lucas Resende

1 Answer


All we need to do is compute the derivative of $L(w)$ and set it equal to zero.

If $f(x) = ||x||^2$, then $f'(x) = 2x$. Since $X$ is a linear transformation and $y$ is constant, we have $(Xw-y)' = X$. By the chain rule we have: $$ L'(w) = \frac{2}{l}(Xw-y)^TX = \frac{2}{l}\left( w^TX^TX - y^TX \right) $$

Setting this equal to zero, we have $$ \frac{2}{l}\left( w^TX^TX - y^TX \right) = 0 \Rightarrow w^T X^TX = y^TX \Rightarrow w^T = y^TX(X^TX)^{-1},$$ where the inverse of $X^TX$ exists if and only if $\{x_1, \cdots, x_l\}$ spans $\mathbb{R}^n$; for that we need at least $l\geq n$, which typically holds for large datasets.

Now, transposing $w^T$ (note that $(X^TX)^{-1}$ is symmetric, so transposing leaves it unchanged), we have $$ w = (X^TX)^{-1}X^Ty, $$ as desired.
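As a sanity check, here is a minimal NumPy sketch (toy random data, illustrative variable names; not part of the original derivation) that solves the normal equations, verifies the gradient vanishes at the solution, and compares against NumPy's least-squares solver:

```python
import numpy as np

# Toy data: l = 50 samples, n = 3 features, so l >= n and X^T X is
# invertible with probability 1 for random X.
rng = np.random.default_rng(0)
l, n = 50, 3
X = rng.standard_normal((l, n))
y = rng.standard_normal(l)

# Closed form derived above: w = (X^T X)^{-1} X^T y,
# computed by solving the linear system X^T X w = X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)

# The gradient (2/l) X^T (Xw - y) -- the transpose of L'(w) above --
# should vanish at the minimizer...
grad = (2 / l) * X.T @ (X @ w - y)
print(np.allclose(grad, 0))  # True

# ...and the result should match NumPy's least-squares solver,
# which minimizes ||Xw - y||^2 directly.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w, w_lstsq))  # True
```

Solving the system with `np.linalg.solve` avoids forming $(X^TX)^{-1}$ explicitly, which is both cheaper and numerically safer than inverting.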

Lucas Resende
  • Where did you get the T in this formula: $L'(w) = 2(Xw-y)^TX$? There is no transposition in the original formula. Why does a T appear when we take the derivative? – randomuser228 Aug 09 '20 at 08:31
  • And why, when we transpose $w^T$, does $(X^TX)^{-1}$ not change? – randomuser228 Aug 09 '20 at 08:39
  • The derivative of $f(x) = ||x||^2$ isn't really $2x$, but $\nabla f (x) = 2x$. The derivative itself is the linear transformation given by the matrix $diag(2x)$ and is such that $diag(2x)h = \langle \nabla f (x), h\rangle = \nabla f(x)^T h$. The derivative of $g(w) = (Xw-y)$ is the transformation given by $X$. The chain rule gives you a linear transformation $ (f\circ g)'(w) = f'(g(w)) \circ g'(w) : \mathbb{R}^n \to \mathbb{R}$. Let $h\in\mathbb{R}^n$, then $g'(w)h = Xh \in \mathbb{R}^l$. Composing with $f$ we have $f'(Xw-y) (Xh) = \nabla f(Xw-y)^T (Xh) = 2(Xw-y)^TXh$. – Lucas Resende Aug 09 '20 at 18:54
  • And the transpose of $(X^TX)^{-1}$ is itself because for every invertible $A$ we have $(A^{-1})^T = (A^T)^{-1}$. And so $((X^TX)^{-1})^T =((X^TX)^{T})^{-1} = (X^TX)^{-1} $. – Lucas Resende Aug 09 '20 at 18:56
  • But why can't we take a derivative of a matrix, yet we can take a gradient of a matrix? – randomuser228 Aug 09 '20 at 20:09
  • Oh, I'm sorry: the derivative of $f$ is indeed the transformation from $\mathbb{R}^l \to \mathbb{R}$ given by $h\mapsto \langle \nabla f(x), h \rangle$. I was mistaken; forget about the $diag(2x)$, it is wrong. – Lucas Resende Aug 09 '20 at 20:28
  • Can you explain the T in the formula $L'(w) = 2(Xw-y)^TX$ to me again, please? I don't understand. – randomuser228 Aug 09 '20 at 22:03
  • The dot product of $x$ and $y$ can be viewed as $x^Ty = y^Tx$. It is just that. – Lucas Resende Aug 09 '20 at 22:04
  • If we take the derivative of $L(w) =\frac{1}{l}||Xw-y||^2$, we get $L'(w) =\frac{2}{l}||Xw-y||X$, right? – randomuser228 Aug 09 '20 at 22:15
  • No, we don't. In higher dimensions the derivative is a linear transformation, and the product is just a composition. To apply the chain rule you need to compose $f(x) = ||x||^2$ with $g(w) = Xw-y$, and you get the linear transformation $(f\circ g)'(w) = f'(g(w))\circ g'(w)$. What is the image of $h\in \mathbb{R}^n$ under the transformation $(f\circ g)'(w)$? – Lucas Resende Aug 09 '20 at 22:22
  • I don't know((( – randomuser228 Aug 09 '20 at 22:30
  • Maybe it is a background problem. Have you ever taken a real analysis course? – Lucas Resende Aug 09 '20 at 22:31
  • I finished my first year at University and there was mathematical analysis for a year. But over the summer I had forgotten it. – randomuser228 Aug 09 '20 at 22:34
  • I understand that you are imagining a complex function as a composition of simple functions. But I can't understand where the T comes from. Where does the transposition come from? And where does $h$ come from? – randomuser228 Aug 09 '20 at 22:49
  • We have $f'(g(w))h = \langle \nabla f(g(w)), h \rangle = (\nabla f(g(w)))^Th$ and when we compose with $g'(w)$ we have $f'(g(w))\circ g'(w) k = f'(g(w)) (Xk) = (\nabla f(g(w)))^T(Xk) = 2(Xw-y)^TXk$, for every $k$, then the linear transformation is given by $2(Xw-y)^TX$. – Lucas Resende Aug 09 '20 at 22:54
  • Sorry for a very stupid question, but what does ⟨...⟩ mean? – randomuser228 Aug 09 '20 at 23:01
  • It is the inner product (the dot product). – Lucas Resende Aug 09 '20 at 23:02
  • But what are $h$ and $k$? Something like free terms? – randomuser228 Aug 09 '20 at 23:04
  • The derivative is a linear transformation! It is a function that takes a vector in $\mathbb{R}^n$ (or $\mathbb{R}^l$ in the case of $f'$) and sends it somewhere. To define a function you need to say where each vector goes. So I took a $k$ (or an $h$) and followed its path (see the numerical sketch after this thread). – Lucas Resende Aug 09 '20 at 23:07
  • It requires some background, you may enjoy reading this book: https://www.amazon.com.br/Elements-Analysis-Robert-Gardner-Bartle/dp/047105464X. – Lucas Resende Aug 09 '20 at 23:10
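
To make the thread concrete, here is a minimal NumPy sketch (toy data, illustrative variable names; not from the original discussion) of the directional-derivative view: $(f\circ g)'(w)$ is the linear map $h \mapsto 2(Xw-y)^TXh$, which we can check against a difference quotient, along with the transpose identity $((X^TX)^{-1})^T = (X^TX)^{-1}$ used for the final step of the answer:

```python
import numpy as np

# Toy data; names follow the thread: f(x) = ||x||^2, g(w) = Xw - y.
rng = np.random.default_rng(1)
l, n = 6, 4
X = rng.standard_normal((l, n))
y = rng.standard_normal(l)
w = rng.standard_normal(n)
h = rng.standard_normal(n)  # the direction vector h discussed above

def L(w):
    r = X @ w - y
    return r @ r  # f(g(w)) = ||Xw - y||^2

# (f o g)'(w) is the linear map h |-> 2(Xw - y)^T X h.
analytic = 2 * (X @ w - y) @ X @ h

# Compare with the difference quotient (L(w + t*h) - L(w)) / t for small t.
t = 1e-7
numeric = (L(w + t * h) - L(w)) / t
print(analytic, numeric)  # the two values agree up to O(t)

# Side check from the thread: (A^{-1})^T = (A^T)^{-1}, so (X^T X)^{-1} is symmetric.
A = X.T @ X
print(np.allclose(np.linalg.inv(A).T, np.linalg.inv(A.T)))  # True
```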