
For the loss function of linear regression, $L(w) = (y - Xw)^T(y - Xw)$, where $y$ is $N \times 1$, $X$ is $N \times D$, and $w$ is $D \times 1$, I tried to apply the product rule:

$\nabla_wL(w) = -(y - Xw)^T X - X^T(y- Xw) $.

But as you can see, the dimensions of the first term ($1\times D$) do not match those of the second term ($D \times 1$). Also, if I transpose the first term, I get the familiar gradient $-2X^T(y - Xw)$. Where am I going wrong?
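For reference, here is a quick numerical sanity check (a minimal sketch; the sizes, seed, and variable names are purely illustrative) that the standard gradient $-2X^T(y - Xw)$ has the same $D \times 1$ shape as $w$ and agrees with finite differences:

    import numpy as np

    # Check shapes and values of the gradient of L(w) = (y - Xw)^T (y - Xw).
    # N and D are small, arbitrary sizes chosen only for illustration.
    rng = np.random.default_rng(0)
    N, D = 5, 3
    X = rng.normal(size=(N, D))
    y = rng.normal(size=(N, 1))
    w = rng.normal(size=(D, 1))

    def loss(w):
        r = y - X @ w
        return (r.T @ r).item()

    # Closed-form gradient -2 X^T (y - Xw): a D x 1 column, same shape as w.
    grad = -2 * X.T @ (y - X @ w)

    # Central finite differences, one coordinate at a time.
    eps = 1e-6
    fd = np.zeros((D, 1))
    for i in range(D):
        e = np.zeros((D, 1))
        e[i, 0] = eps
        fd[i, 0] = (loss(w + e) - loss(w - e)) / (2 * eps)

    print(grad.shape)             # (3, 1)
    print(np.allclose(grad, fd))  # True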

Delsilon
  • You can view your loss function as a squared 2-norm, i.e., $\lVert z \rVert_2^2$, whose derivative w.r.t. $z$ is $2z$. To match the dimensions you have to transpose either the first part or the second part, depending on which layout you choose for the gradient. In any case, there are several ways to take the derivative of such a squared norm. See here for instance: https://math.stackexchange.com/questions/1540047/unclear-about-matrix-calculus-in-least-squares-regression?rq=1 – user550103 Sep 19 '18 at 19:40
  • Yes, I am aware of this method, but I was trying to apply the product rule. For example, what happens if I consider $X_1$ and $X_2$ instead of just $X$? – Delsilon Sep 20 '18 at 05:09
  • If you consider $X_1^T X_2$, then I would suggest working with the differential first and obtaining the gradient from it: \begin{align} X_1^T X_2 = X_1 : X_2 = {\rm tr}\left(X_1^T X_2\right) \ , \end{align} where the double dot "$:$" denotes the Frobenius product. The differential is then $d\left( X_1 : X_2\right) = dX_1: X_2 + X_1 : dX_2 = X_2:dX_1 + X_1:dX_2$, where the last step, $dX_1: X_2 = X_2:dX_1$, follows from the symmetry of the Frobenius product (i.e., of the trace). So compute $dX_1$ and $dX_2$, and then you can obtain your gradients. – user550103 Sep 20 '18 at 05:40
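As a sketch, the differential technique from the last comment applied to the original loss (writing $r = y - Xw$, so $L = r : r$) gives
\begin{align}
dL = dr : r + r : dr = 2\, r : dr = 2\, r : (-X\, dw) = -2\, X^T r : dw \ ,
\end{align}
hence $\nabla_w L = -2\, X^T (y - Xw)$, which is $D \times 1$ like $w$: both product-rule terms reduce to the same $X^T r$ once a single layout for the gradient is fixed.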

0 Answers