When I try to find the gradient of the MSE loss $$L(w)=\|y-Xw\|_2^2$$ (ignoring constant factors), I find two different solutions; one is the transpose of the other:
The post "Compute the gradient of mean square error" claims the gradient is $$\nabla L(w)=X^TXw-X^Ty,$$
while "MSE Loss function and derivatives" claims it is $$\nabla L(w)=w^TX^TX-y^TX.$$
I also tried doing it myself:
$$ \begin{aligned} L(w) &= \|y-Xw\|_2^2 \\ &= (y-Xw)^T(y-Xw) \\ &= (y^T-w^TX^T)(y-Xw) \\ &= y^Ty-y^TXw-w^TX^Ty+w^TX^TXw \end{aligned} $$
and
$$\nabla L(w) = -y^TX - y^TX + w^T\left(X^TX+(X^TX)^T\right) = 2\left(w^TX^TX-y^TX\right)$$
(using that $w^TX^Ty$ is a scalar, so $w^TX^Ty = (w^TX^Ty)^T = y^TXw$, and both middle terms differentiate to $y^TX$),
which is equivalent to the second post. Why am I getting the transposed solution?
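To make the comparison concrete, here is a quick numerical sanity check (a sketch using NumPy with made-up data; the names `X`, `y`, `w` are illustrative, not taken from either post) that the column-vector form $2(X^TXw-X^Ty)$ matches a finite-difference gradient:

```python
import numpy as np

# Hypothetical small problem: random data just for checking the formula.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.standard_normal(5)
w = rng.standard_normal(3)

def loss(w):
    r = y - X @ w
    return r @ r  # ||y - Xw||^2

# Closed-form gradient, column-vector (denominator) layout.
grad_closed = 2 * (X.T @ X @ w - X.T @ y)

# Central finite differences as an independent check.
eps = 1e-6
grad_fd = np.array([
    (loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
    for e in np.eye(3)
])

print(np.allclose(grad_closed, grad_fd, atol=1e-5))  # True
```

The row-vector answer $2(w^TX^TX - y^TX)$ is exactly the transpose of `grad_closed`, so the two posts only differ in layout convention.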
So if I want to get a column vector, I should use $\frac{d(Ax)}{dx} = A^T$ and $\frac{d(x^TAx)}{dx} = (A^T + A)x$?
– R3lay Jul 22 '23 at 09:38