The loss function of least squares regression is defined (for example, in this question) as:
$L(w) = (y - Xw)^T (y - Xw) = (y^T - w^TX^T)(y - Xw)$
Taking the derivative of the loss w.r.t. the parameter vector $w$:
\begin{align} \frac{d L(w)}{d w} & = \frac{d}{dw} (y^T - w^TX^T)(y - Xw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - w^TX^Ty + w^TX^TXw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - (y^TXw)^T + w^TX^TXw) \end{align}
Since the second and third terms are scalars that are transposes of each other (and a scalar equals its own transpose), they are equal, which gives
\begin{align} & = \frac{d}{dw} (y^Ty - 2y^TXw + w^TX^TXw) \end{align}
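(As a quick sanity check that the expansion is right, here is a minimal NumPy sketch with made-up data; `X`, `y`, and `w` are arbitrary values I picked just for the check:)

```python
import numpy as np

# Made-up data just for the check: 5 observations, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))
w = rng.normal(size=(3, 1))

# Original loss: (y - Xw)^T (y - Xw).
r = y - X @ w
loss = (r.T @ r).item()

# Expanded loss: y^T y - 2 y^T X w + w^T X^T X w.
expanded = (y.T @ y - 2 * y.T @ X @ w + w.T @ X.T @ X @ w).item()

print(np.isclose(loss, expanded))  # True
```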
My question is:
for the second term, shouldn't the derivative w.r.t. $w$ be $-2y^TX$?
and because $\frac{d}{dx}(x^TAx) = x^T(A^T + A)$ (see this question for an explanation),
shouldn't the derivative of the third term (which is also a scalar) be the following, applying that identity with $A = X^TX$ (which is symmetric)? \begin{align} \frac{d}{dw} (w^TX^TXw) = w^T\left((X^TX)^T + X^TX\right) = 2 w^TX^TX \end{align}
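(To double-check that identity numerically, here is a minimal NumPy sketch with made-up data, taking $A = X^TX$ and comparing $w^T(A^T + A)$ against central finite differences:)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
w = rng.normal(size=3)
A = X.T @ X

def f(w):
    return w @ A @ w  # the scalar w^T A w

# Analytic gradient as a row vector: w^T (A^T + A).
analytic = w @ (A.T + A)

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(np.allclose(analytic, numeric))  # True
```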
Combining the above, shouldn't the derivative of the loss function be $-2y^TX + 2w^TX^TX$?
What I see in textbooks (for example, page 25 of these stanford.edu notes and page 10 of these harvard.edu notes) is a different expression: $-2X^Ty + 2X^TXw$.
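(As a sanity check, both expressions match a finite-difference gradient in a quick NumPy experiment with made-up data; one comes out as a row vector and the other as a column vector:)

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))
w = rng.normal(size=(3, 1))

def L(w):
    r = y - X @ w
    return (r.T @ r).item()  # the loss (y - Xw)^T (y - Xw)

mine = -2 * y.T @ X + 2 * w.T @ X.T @ X    # shape (1, 3): a row vector
textbook = -2 * X.T @ y + 2 * X.T @ X @ w  # shape (3, 1): a column vector

# Central finite differences, one coordinate at a time.
eps = 1e-6
numeric = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps)
                    for e in np.eye(3).reshape(3, 3, 1)])

print(np.allclose(mine.ravel(), numeric))      # True
print(np.allclose(textbook.ravel(), numeric))  # True
print(np.allclose(mine.T, textbook))           # True: each is the other's transpose
```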
What am I missing here?
If so, thanks for the clarification. I was getting really confused. – vbp Nov 21 '15 at 21:53

The isites.harvard.edu link is broken, but a snapshot is saved on the Wayback Machine. – The Amplitwist Jun 20 '22 at 13:30