
I am trying to understand the multivariable derivation of the least squares estimator, and I have difficulty differentiating the loss function with respect to $\vec{\beta}$: \begin{align} \frac{\partial L\left(D, \vec{\beta}\right)}{\partial\vec{\beta}} &= \frac{\partial \left(Y^\textsf{T}Y - Y^\textsf{T}X\vec{\beta} - \vec{\beta}^\textsf{T}X^\textsf{T}Y + \vec{\beta}^\textsf{T}X^\textsf{T}X\vec{\beta}\right)}{\partial \vec{\beta}} \\ &= -2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta} \end{align}

I get stuck when we have to differentiate a term containing $\beta^\textsf{T}$ with respect to $\beta$. How should that be dealt with?
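
A note on convention: a term-by-term sketch under the gradient (denominator-layout) convention, in which the derivative of a scalar with respect to the column vector $\vec{\beta}$ is again a column vector, gives \begin{align} \frac{\partial\, Y^\textsf{T}Y}{\partial\vec{\beta}} &= 0, & \frac{\partial\, Y^\textsf{T}X\vec{\beta}}{\partial\vec{\beta}} &= X^\textsf{T}Y, & \frac{\partial\, \vec{\beta}^\textsf{T}X^\textsf{T}Y}{\partial\vec{\beta}} &= X^\textsf{T}Y, & \frac{\partial\, \vec{\beta}^\textsf{T}X^\textsf{T}X\vec{\beta}}{\partial\vec{\beta}} &= 2X^\textsf{T}X\vec{\beta}, \end{align} and summing these with the signs from the expansion yields $-2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta}$. The last term uses the identity $\frac{\partial\, \vec{x}^\textsf{T}A\vec{x}}{\partial\vec{x}} = \left(A + A^\textsf{T}\right)\vec{x}$ with $A = X^\textsf{T}X$ symmetric.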

  • The given derivative uses the (in my view confusing) convention that the derivative is the transpose of the actual derivative. For the derivative of $f$ w.r.t. a vector you can use the following heuristic, treating $\vec{x}$ and $\vec{x}^T$ as independent variables: $$\frac{\partial f(\vec{x},\vec{x}^T)}{\partial \vec{x}} = \left(\frac{\partial f}{\partial \vec{x}}\right)_{\vec{x}^T} + \left(\left(\frac{\partial f}{\partial \vec{x}^T}\right)_{\vec{x}}\right)^{T}$$ – Ninad Munshi Apr 05 '21 at 15:16
  • Thanks, according to your heuristic I would write: $$\frac{\partial (\beta^TX^TX\beta)}{\partial \beta} = \beta^T X^T X + X^TX\beta$$ but that's not equal to $2X^TX\beta$. I must be missing something – outofthegreen Apr 05 '21 at 15:25
  • You're forgetting the extra transpose on the second term – Ninad Munshi Apr 05 '21 at 15:28
  • Thanks, and why is the derivative of $-Y^TX\vec{\beta}$ with respect to $\beta$ equal to $-X^TY$? Shouldn't it be $-Y^TX$? – outofthegreen Apr 05 '21 at 15:37
  • That is because machine learning is a mockery of mathematics, with contradictory and confusing conventions. – Ninad Munshi Apr 05 '21 at 15:39
  • Fair point, but is there a mistake or am I missing something? – outofthegreen Apr 05 '21 at 15:41
  • The answer is in the first sentence of my first comment. They sometimes do and sometimes do not use the convention that the derivative is the transpose of what it actually is. I don't just say words to waste space, I promise. – Ninad Munshi Apr 05 '21 at 15:45
  • Oh okay, so you meant that the whole second term, because it is written horizontally, is considered the transpose of the actual derivative. I got it. This is indeed a bit confusing. Thank you for your patient help. – outofthegreen Apr 05 '21 at 15:57
  • You're welcome! If it helps, I recommend sticking with a convention where the partials are the correct derivative (the derivative w.r.t. a column vector is a row vector and vice versa) and the "gradient" is the transpose of the partial. This will get you through 50% of machine learning material hassle-free, but more importantly it will leave you able to actually compute things. – Ninad Munshi Apr 05 '21 at 16:04
  • I'll remember it, thanks for the tip. – outofthegreen Apr 05 '21 at 16:27
  • Please specify the dimensions of $X$ and $Y$. –  Apr 05 '21 at 19:39
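
As a sanity check on the result above, here is a minimal numerical sketch (not from the thread; the shapes are assumed, with $X$ an $n \times p$ design matrix, $Y$ an $n \times 1$ response, and $\vec{\beta}$ a $p \times 1$ vector, which is what the last comment asks about). It compares the closed-form gradient $-2X^\textsf{T}Y + 2X^\textsf{T}X\vec{\beta}$ against a finite-difference approximation of the loss:

```python
import numpy as np

# Assumed shapes: X is n x p, Y is n x 1, beta is p x 1.
rng = np.random.default_rng(0)
n, p = 20, 3
X = rng.normal(size=(n, p))
Y = rng.normal(size=(n, 1))
beta = rng.normal(size=(p, 1))

def loss(b):
    # L(D, beta) = (Y - X b)^T (Y - X b)
    r = Y - X @ b
    return (r.T @ r).item()

# Closed-form gradient (gradient convention: a column vector)
grad_closed = -2 * X.T @ Y + 2 * X.T @ X @ beta

# Central finite-difference approximation of the gradient
eps = 1e-6
grad_fd = np.zeros_like(beta)
for i in range(p):
    e = np.zeros_like(beta)
    e[i] = eps
    grad_fd[i] = (loss(beta + e) - loss(beta - e)) / (2 * eps)

print(np.allclose(grad_closed, grad_fd, atol=1e-4))  # expected: True
```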

0 Answers