
The loss function of least squares regression is defined (for example, in this question) as:

$L(w) = (y - Xw)^T (y - Xw) = (y^T - w^TX^T)(y - Xw)$

Taking the derivative of the loss w.r.t. the parameter vector $w$:

\begin{align} \frac{d L(w)}{d w} & = \frac{d}{dw} (y^T - w^TX^T)(y - Xw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - w^TX^Ty + w^TX^TXw) \\ & = \frac{d}{dw} (y^Ty - y^TXw - (y^TXw)^T + w^TX^TXw) \end{align}

Since the second and third terms are scalars, and one is the transpose of the other, they are equal; hence

\begin{align} & = \frac{d}{dw} (y^Ty - 2y^TXw + w^TX^TXw) \end{align}
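As a quick sanity check, the expansion can be verified numerically; here is a minimal NumPy sketch (random data, arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # arbitrary: 5 samples, 3 features
y = rng.standard_normal(5)
w = rng.standard_normal(3)

lhs = (y - X @ w) @ (y - X @ w)                # (y - Xw)^T (y - Xw)
rhs = y @ y - 2 * y @ X @ w + w @ X.T @ X @ w  # expanded form
assert np.isclose(lhs, rhs)
```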

My question is:

for the second term, shouldn't the derivative w.r.t. $w$ be $-2y^TX$?

and because $\frac{d}{dx}(x^TAx) = x^T(A^T + A)$ (see this question for an explanation),

shouldn't the derivative of the third term (which is also a scalar) be the following, applying that identity with $A = X^TX$ (which is symmetric)? \begin{align} \frac{d}{dw} (w^TX^TXw) = w^T\big((X^TX)^T + X^TX\big) = 2 w^TX^TX \end{align}

From the above expressions, shouldn't the derivative of the loss function be $-2y^TX + 2w^TX^TX$?

What I see in textbooks (including, for example, page 25 of these stanford.edu notes and page 10 of these harvard.edu notes) is a different expression: $-2X^Ty + 2X^TXw$.

What am I missing here?
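Here is a quick numerical check (a minimal NumPy sketch, random data, arbitrary shapes) showing that my expression and the textbook expression carry the same numbers, one being the transpose of the other, and that both match a finite-difference gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

def loss(w):
    r = y - X @ w
    return r @ r

row_grad = -2 * y @ X + 2 * w @ X.T @ X    # my result (row-vector convention)
col_grad = -2 * X.T @ y + 2 * X.T @ X @ w  # textbook result (column-vector convention)

# as 1-D arrays the two coincide entry by entry: one is the transpose of the other
assert np.allclose(row_grad, col_grad)

# finite-difference check: both conventions carry the same numbers
eps = 1e-6
fd = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps) for e in np.eye(3)])
assert np.allclose(fd, col_grad, atol=1e-4)
```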

vbp
  • Looks to me like your solution is the transpose of the other. So it just depends on if you want to treat the answer as a row vector or column vector. – Michael Nov 21 '15 at 19:15
  • Yes, it is, but then how do all textbooks have the same representation, and not the one I derived? Is my derivation wrong, or did I miss something? If you read the harvard.edu notes, they explain the derivation in a different way. – vbp Nov 21 '15 at 20:28
  • Why do you ask "is my derivation wrong?" If you choose to represent derivatives as row vectors, you get your answer; if as column vectors, you just take the transpose of your derivatives everywhere. It looks like you are just asking about conventions. Hopefully textbooks are consistent in how they represent these derivatives (as row or column), but if they are not, you can perhaps forgive the authors. Also, there is no reason for different books to use the same representations. Perhaps you can write a book that represents the derivatives your way. – Michael Nov 21 '15 at 21:19
  • So, is it just about convention? Are you saying the more popular convention is to express $d(x^TAx)/dx = Ax + A^Tx = (A+A^T)x$, instead of $d(x^TAx)/dx = x^TA^T + x^TA = x^T(A^T+A)$? If so, thanks for the clarification. I was getting really confused. – vbp Nov 21 '15 at 21:53
  • Could you please post it as an answer so that I can accept it? Thanks. – vbp Nov 21 '15 at 22:35
  • Well, you have two posted answers now so you can choose one of those. It seems lots of people want to write their own textbooks with their own notation! – Michael Nov 24 '15 at 04:29
  • Also note that, in general, you can post your own answer if you come up with something based on the comments. It is actually encouraged for some basic problems where hints are given; you could say something like "based on the hints given by, I can fill in the details by..." I don't know if that is relevant here, but in some questions it is relevant. – Michael Nov 24 '15 at 04:32
  • The link to isites.harvard.edu is broken, but a snapshot is saved on the Wayback Machine. – The Amplitwist Jun 20 '22 at 13:30

2 Answers


Let $z=(Xw-y)$, then the loss function can be expressed in terms of the Frobenius norm or better yet, the Frobenius product as $$L=\|z\|^2_F = z:z$$ The differential of this function is simply $$\eqalign{ dL &= 2\,z:dz \cr &= 2\,z:X\,dw \cr &= 2\,X^Tz:dw \cr }$$ Since $dL=\frac{\partial L}{\partial w}:dw,\,$ the gradient is $$\eqalign{ \frac{\partial L}{\partial w} &= 2\,X^Tz \cr &= 2\,X^T(Xw-y) \cr }$$ The advantage of this derivation is that it holds true even if the vectors $\{w,y,z\}$ are replaced by rectangular matrices.
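As a numerical illustration, here is a minimal NumPy sketch (random data, arbitrary shapes, column vectors to match the derivation) checking the key step $z:X\,dw = X^Tz:dw$ and the resulting gradient, with the Frobenius product implemented as ${\rm tr}(A^TB)$:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((5, 3))
w = rng.standard_normal((3, 1))    # column vectors, matching the derivation
y = rng.standard_normal((5, 1))
dw = rng.standard_normal((3, 1))   # an arbitrary perturbation

def frob(A, B):
    """Frobenius product A : B = tr(A^T B)."""
    return np.trace(A.T @ B)

z = X @ w - y
# the key step: z : (X dw) = (X^T z) : dw
assert np.isclose(frob(z, X @ dw), frob(X.T @ z, dw))

# and the gradient 2 X^T z matches a first-order expansion of L = z : z
eps = 1e-6
L = lambda w: float(np.sum((X @ w - y) ** 2))
assert np.isclose((L(w + eps * dw) - L(w)) / eps, frob(2 * X.T @ z, dw), atol=1e-3)
```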

lynn
  • What does the symbol $:$ indicate, and how do you go from $dL$ to $2\,z:X\,dw$ to $2\,X^Tz:dw$? – vbp Nov 22 '15 at 14:02
  • The colon (:) represents the Frobenius product. The algebraic properties of the Frobenius product follow from those of the trace function, to which it is equivalent, i.e. $A:B={\rm tr}(A^TB)$. I find the product notation convenient and easier to work with than the trace. – lynn Nov 22 '15 at 15:46
  1. Let $A=Xw-y$ and find the derivative map of the squared norm $L=\|A\|^{2}$: $D_A\|A\|^{2}(H)=\left.\frac{d}{dt}\right|_{0}\|A+tH\|^{2}=\left.\frac{d}{dt}\right|_{0}\langle A+tH,A+tH\rangle=2\langle A,H\rangle$

  2. Use the chain rule $D_{w}(L\circ A)=D_{A}L\circ D_{w}A$ as follows, $D_w\|A(w)\|^{2}(h)=\left.\frac{d}{dt}\right|_{0}\|A(w+th)\|^{2}=2\langle A,D_wA(h)\rangle$

  3. Arrive at the result: $2\langle Xw-y,Xh\rangle$

  4. Rewrite $2\langle Xw-y,Xh\rangle = 2\langle X^T(Xw-y),h\rangle$ and define the gradient vector $2X^T(Xw-y)$
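Here is a minimal NumPy sketch (random data, arbitrary shapes) checking steps 3 and 4 numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((5, 3))
w = rng.standard_normal(3)
y = rng.standard_normal(5)
h = rng.standard_normal(3)   # an arbitrary direction

A = X @ w - y

# step 3: the directional derivative of ||Xw - y||^2 along h is 2<A, Xh>
t = 1e-6
num = (np.sum((X @ (w + t * h) - y) ** 2) - np.sum((X @ (w - t * h) - y) ** 2)) / (2 * t)
assert np.isclose(num, 2 * A @ (X @ h), atol=1e-4)

# step 4: the adjoint identity <Xw - y, Xh> = <X^T(Xw - y), h>
assert np.isclose(A @ (X @ h), (X.T @ A) @ h)
```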

rych
  • What is $H$? Your answer is not clear; please explain the steps of the derivation. – vbp Nov 22 '15 at 14:00
  • $H$ (and $h$) is an auxiliary element of the linear space, the argument of the linear map. Then I simply use the definition of the derivative and the chain rule. Finally, because there is an inner product, we can introduce the gradient vector that you're probably after. – rych Nov 22 '15 at 23:32