
The loss function of least squares regression is defined (for example, in this question) as:

$L(w) = (y - Xw)^T (y - Xw) = (y^T - w^TX^T)(y - Xw)$

Taking the derivative of the loss w.r.t. the parameter vector $w$:

\begin{align} \frac{d L(w)}{d w} & = \frac{d}{dw} (y^T - w^TX^T)(y - Xw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - w^TX^Ty + w^TX^TXw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - (y^TXw)^T + w^TX^TXw) \end{align}

Since the second and third terms are scalars, and one is the transpose of the other, they are equal; hence

\begin{align} & = \frac{d}{dw} (y^Ty - 2y^TXw + w^TX^TXw) \end{align}
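As a quick sanity check, the expansion can be verified numerically; here is a minimal NumPy sketch (random data, arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # arbitrary: 5 samples, 3 features
y = rng.standard_normal(5)
w = rng.standard_normal(3)

lhs = (y - X @ w) @ (y - X @ w)                # (y - Xw)^T (y - Xw)
rhs = y @ y - 2 * y @ X @ w + w @ X.T @ X @ w  # expanded form
assert np.isclose(lhs, rhs)
```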

My question is:

for the second term, shouldn't the derivative w.r.t. $w$ be $-2y^TX$?

and because $\frac{d}{dx}(x^TAx) = x^T(A^T + A)$ (see this question for an explanation),

shouldn't the derivative of the third term (which is also a scalar) be the following, applying that identity with $A = X^TX$ (which is symmetric)? \begin{align} \frac{d}{dw} (w^TX^TXw) = w^T\big((X^TX)^T + X^TX\big) = 2 w^TX^TX \end{align}

From the above expressions, shouldn't the derivative of the loss function be $-2y^TX + 2w^TX^TX$?

What I see in textbooks (including, for example, page 25 of these stanford.edu notes and page 10 of these harvard.edu notes) is a different expression: $-2X^Ty + 2X^TXw$.

What am I missing here?
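Here is a quick numerical check (a minimal NumPy sketch, random data, arbitrary shapes) showing that my expression and the textbook expression carry the same numbers, one being the transpose of the other, and that both match a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

def loss(w):
    r = y - X @ w
    return r @ r

row_grad = -2 * y @ X + 2 * w @ X.T @ X    # my result (row-vector convention)
col_grad = -2 * X.T @ y + 2 * X.T @ X @ w  # textbook result (column-vector convention)

# as 1-D arrays the two coincide entry by entry: one is the transpose of the other
assert np.allclose(row_grad, col_grad)

# finite-difference check: both conventions carry the same numbers
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(fd, col_grad, atol=1e-4)
```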

vbp
  • Looks to me like your solution is the transpose of the other. So it just depends on if you want to treat the answer as a row vector or column vector. – Michael Nov 21 '15 at 19:15
  • Yes, it is, but then how do all textbooks have the same representation, and not the one I derived? Is my derivation wrong, or did I miss something? If you read the harvard.edu notes, they explain the derivation in a different way. – vbp Nov 21 '15 at 20:28
  • Why do you ask "is my derivation wrong?" If you choose to represent derivatives as row vectors, you get your answer; if as column vectors, you just take the transpose of your derivatives everywhere. It looks like you are just asking about conventions. Hopefully textbooks are consistent in how they represent these derivatives (as row or column), but if they are not, you can perhaps forgive the authors. Also, there is no reason for different books to use the same representations. Perhaps you can write a book that represents the derivatives your way. – Michael Nov 21 '15 at 21:19
  • So, is it just about convention? Are you saying the more popular convention is to express $d(x^TAx)/dx = Ax + A^Tx = (A+A^T)x$, instead of $d(x^TAx)/dx = x^TA^T + x^TA = x^T(A^T+A)$? If so, thanks for the clarification. I was getting really confused. – vbp Nov 21 '15 at 21:53
  • Could you please post it as an answer so that I can accept it? Thanks. – vbp Nov 21 '15 at 22:35
  • Well, you have two posted answers now so you can choose one of those. It seems lots of people want to write their own textbooks with their own notation! – Michael Nov 24 '15 at 04:29
  • Also note that, in general, you can post your own answer if you come up with something based on the comments. It is actually encouraged for some basic problems where hints are given; you could say something like "based on the hints given by, I can fill in the details by..." I don't know if that is relevant here, but in some questions it is relevant. – Michael Nov 24 '15 at 04:32
  • The link to isites.harvard.edu is broken, but a snapshot is saved on the Wayback Machine. – The Amplitwist Jun 20 '22 at 13:30

2 Answers


Let $z=(Xw-y)$, then the loss function can be expressed in terms of the Frobenius norm or better yet, the Frobenius product as $$L=\|z\|^2_F = z:z$$ The differential of this function is simply $$\eqalign{ dL &= 2\,z:dz \cr &= 2\,z:X\,dw \cr &= 2\,X^Tz:dw \cr }$$ Since $dL=\frac{\partial L}{\partial w}:dw,\,$ the gradient is $$\eqalign{ \frac{\partial L}{\partial w} &= 2\,X^Tz \cr &= 2\,X^T(Xw-y) \cr }$$ The advantage of this derivation is that it holds true even if the vectors $\{w,y,z\}$ are replaced by rectangular matrices.
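As a numerical illustration, here is a minimal NumPy sketch (random data, arbitrary shapes, column vectors to match the derivation) checking the key step $z:X\,dw = X^Tz:dw$ and the resulting gradient, with the Frobenius product implemented as ${\rm tr}(A^TB)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
w = rng.standard_normal((3, 1))    # column vectors, matching the derivation
y = rng.standard_normal((5, 1))
dw = rng.standard_normal((3, 1))   # an arbitrary perturbation

def frob(A, B):
    """Frobenius product A : B = tr(A^T B)."""
    return np.trace(A.T @ B)

z = X @ w - y
# the key step: z : (X dw) = (X^T z) : dw
assert np.isclose(frob(z, X @ dw), frob(X.T @ z, dw))

# and the gradient 2 X^T z matches a first-order expansion of L = z : z
eps = 1e-6
L = lambda w: float(np.sum((X @ w - y) ** 2))
assert np.isclose((L(w + eps * dw) - L(w)) / eps, frob(2 * X.T @ z, dw), atol=1e-3)
```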

lynn
  • What does the symbol $:$ indicate, and how do you go from $dL$ to $2\,z:X\,dw$ to $2\,X^Tz:dw$? – vbp Nov 22 '15 at 14:02
  • The colon (:) represents the Frobenius product. The algebraic properties of the Frobenius product follow from those of the trace function, to which it is equivalent, i.e. $A:B={\rm tr}(A^TB)$. I find the product notation convenient and easier to work with than the trace. – lynn Nov 22 '15 at 15:46
  1. Let $A=Xw-y$ and find the derivative map of the squared norm $L=\|A\|^{2}$: $D_A\|A\|^{2}(H)=\left.\frac{d}{dt}\right|_{0}\|A+tH\|^{2}=\left.\frac{d}{dt}\right|_{0}\langle A+tH,A+tH\rangle=2\langle A,H\rangle$

  2. Use the chain rule $D_{w}(L\circ A)=D_{A}L\circ D_{w}A$ as follows, $D_w\|A(w)\|^{2}(h)=\left.\frac{d}{dt}\right|_{0}\|A(w+th)\|^{2}=2\langle A,D_wA(h)\rangle$

  3. Arrive at the result: $2\langle Xw-y,Xh\rangle$

  4. Rewrite $2\langle Xw-y,Xh\rangle = 2\langle X^T(Xw-y),h\rangle$ and define the gradient vector $2X^T(Xw-y)$
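Here is a minimal NumPy sketch (random data, arbitrary shapes) checking steps 3 and 4 numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
w = rng.standard_normal(3)
y = rng.standard_normal(5)
h = rng.standard_normal(3)   # an arbitrary direction

A = X @ w - y

# step 3: the directional derivative of ||Xw - y||^2 along h is 2<A, Xh>
t = 1e-6
num = (np.sum((X @ (w + t * h) - y) ** 2) - np.sum((X @ (w - t * h) - y) ** 2)) / (2 * t)
assert np.isclose(num, 2 * A @ (X @ h), atol=1e-4)

# step 4: the adjoint identity <Xw - y, Xh> = <X^T(Xw - y), h>
assert np.isclose(A @ (X @ h), (X.T @ A) @ h)
```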

rych
  • What is $H$? Your answer is not clear; please explain the steps of the derivation. – vbp Nov 22 '15 at 14:00
  • $H$ (and $h$) is an auxiliary element of the linear space, the argument of the linear map. Then I simply use the definition of the derivative and the chain rule. Finally, because there is an inner product, we can introduce the gradient vector that you're probably after. – rych Nov 22 '15 at 23:32