
I am trying to understand the multivariable derivation of the least squares estimator, and I have difficulty differentiating the loss function with respect to $\vec{\beta}$: \begin{align} \frac{\partial L\left(D, \vec{\beta}\right)}{\partial\vec{\beta}} &= \frac{\partial \left(Y^\textsf{T}Y - Y^\textsf{T}X\vec{\beta} - \vec{\beta}^\textsf{T}X^\textsf{T}Y + \vec{\beta}^\textsf{T}X^\textsf{T}X\vec{\beta}\right)}{\partial \vec{\beta}} \\ &= -2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta} \end{align}

I get stuck when we have to differentiate a term containing $\beta^\textsf{T}$ with respect to $\beta$. How should that be dealt with?
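
A note on convention: a term-by-term sketch under the gradient (denominator-layout) convention, in which the derivative of a scalar with respect to the column vector $\vec{\beta}$ is again a column vector, gives \begin{align} \frac{\partial\, Y^\textsf{T}Y}{\partial\vec{\beta}} &= 0, & \frac{\partial\, Y^\textsf{T}X\vec{\beta}}{\partial\vec{\beta}} &= X^\textsf{T}Y, & \frac{\partial\, \vec{\beta}^\textsf{T}X^\textsf{T}Y}{\partial\vec{\beta}} &= X^\textsf{T}Y, & \frac{\partial\, \vec{\beta}^\textsf{T}X^\textsf{T}X\vec{\beta}}{\partial\vec{\beta}} &= 2X^\textsf{T}X\vec{\beta}, \end{align} and summing these with the signs from the expansion yields $-2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta}$. The last term uses the identity $\frac{\partial\, \vec{x}^\textsf{T}A\vec{x}}{\partial\vec{x}} = \left(A + A^\textsf{T}\right)\vec{x}$ with $A = X^\textsf{T}X$ symmetric.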

  • The given derivative uses the (in my view confusing) convention that the derivative is the transpose of the actual derivative. For the derivative of $f$ w.r.t. a vector you can use the following heuristic, treating $\vec{x}$ and $\vec{x}^T$ as independent variables: $$\frac{\partial f(\vec{x},\vec{x}^T)}{\partial \vec{x}} = \left(\frac{\partial f}{\partial \vec{x}}\right)_{\vec{x}^T} + \left(\left(\frac{\partial f}{\partial \vec{x}^T}\right)_{\vec{x}}\right)^{T}$$ – Ninad Munshi Apr 05 '21 at 15:16
  • Thanks, according to your heuristic I would write: $$\frac{\partial (\beta^TX^TX\beta)}{\partial \beta} = \beta^T X^T X + X^TX\beta$$ but that's not equal to $2X^TX\beta$. I must be missing something – outofthegreen Apr 05 '21 at 15:25
  • You're forgetting the extra transpose on the second term – Ninad Munshi Apr 05 '21 at 15:28
  • Thanks, and why is the derivative of $-Y^TX\vec{\beta}$ with respect to $\beta$ equal to $-X^TY$? Shouldn't it be $-Y^TX$? – outofthegreen Apr 05 '21 at 15:37
  • That is because machine learning is a mockery of mathematics, with contradictory and confusing conventions. – Ninad Munshi Apr 05 '21 at 15:39
  • Fair point, but is there a mistake or am I missing something? – outofthegreen Apr 05 '21 at 15:41
  • The answer is in the first sentence of my first comment. They sometimes do and sometimes do not use the convention that the derivative is the transpose of what it actually is. I don't just say words to waste space, I promise. – Ninad Munshi Apr 05 '21 at 15:45
  • Oh okay, so you meant that the whole second term, because it is written horizontally, is considered the transpose of the actual derivative. I got it. This is indeed a bit confusing. Thank you for your patient help. – outofthegreen Apr 05 '21 at 15:57
  • You're welcome! If it helps, I recommend sticking with a convention where the partials are the correct derivative (the derivative w.r.t. a column vector is a row vector and vice versa) and the "gradient" is the transpose of the partial. This will get you through 50% of machine learning material hassle-free, but more importantly it will leave you able to actually compute things. – Ninad Munshi Apr 05 '21 at 16:04
  • I'll remember it, thanks for the tip. – outofthegreen Apr 05 '21 at 16:27
  • Please specify the dimensions of $X$ and $Y$. –  Apr 05 '21 at 19:39
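
As a sanity check on the result above, here is a minimal numerical sketch (not from the thread; the shapes are assumed, with $X$ an $n \times p$ design matrix, $Y$ an $n \times 1$ response, and $\vec{\beta}$ a $p \times 1$ vector, which is what the last comment asks about). It compares the closed-form gradient $-2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta}$ against a finite-difference approximation of the loss:

```python
import numpy as np

# Assumed shapes: X is n x p, Y is n x 1, beta is p x 1.
rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, 1))
beta = rng.normal(size=(p, 1))

def loss(b):
    # L(D, beta) = (Y - X b)^T (Y - X b)
    r = Y - X @ b
    return (r.T @ r).item()

# Closed-form gradient (gradient convention: a column vector)
grad_closed = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_fd = np.zeros_like(beta)
for i in range(p):
    e = np.zeros_like(beta)
    e[i] = eps
    grad_fd[i] = (loss(beta + e) - loss(beta - e)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-4))  # expected: True
```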

0 Answers