I'm trying to work through the derivation of ordinary least squares for linear regression, and I'm stumbling over taking the partial derivative of this term: $$\frac{\partial (\beta^TX^TX\beta)}{\partial\beta}$$ I found this post, so I can see the derivative should be $$X^TX\beta + (\beta^TX^TX)^T = 2X^TX\beta,$$ but I still don't see how the chain rule was used to get the derivative with respect to $\beta$. Where does the addition come from, and how is the chain rule for matrices different from the chain rule for regular variables?
Just take the gradient of $(1/2)\|Ax-b\|_2^2$ using the chain rule and you'll get $A^T(Ax-b)$. – littleO Sep 09 '17 at 23:36
2 Answers
It is the same chain rule. Actually, it is the product rule in this case.
In single-variable calculus:
$\frac {d}{dx} x^2 = \frac {d}{dx} (x)(x) = (1)(x) + (x)(1) = 2x$
You might not usually do it that way, but you should agree that it works.
$\frac {d}{d\beta} \beta^TX^TX\beta = \frac {d}{d\beta}(X\beta)^T(X\beta) = \left(\frac {d}{d\beta}(X\beta)^T\right)(X\beta) + (X\beta)^T\left(\frac {d}{d\beta}(X\beta)\right)$
Since $\frac {d}{d\beta}(X\beta) = X$, the first term is $X^T(X\beta) = X^TX\beta$ and the second is $(X\beta)^TX = (X^TX\beta)^T$; written as column vectors, both terms are equal, and their sum is $2X^TX\beta$. That sum of two terms, one per factor, is exactly where the addition comes from.
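If it helps, here is a minimal NumPy sketch (my own illustration, with arbitrary random $X$ and $\beta$) that checks $2X^TX\beta$ against a finite-difference estimate of the gradient of $f(\beta)=\beta^TX^TX\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # arbitrary 5x3 design matrix
beta = rng.standard_normal(3)     # arbitrary coefficient vector

# f(beta) = beta^T X^T X beta, written with @ so it stays a scalar
f = lambda b: b @ X.T @ X @ b

# Analytic gradient from the product rule above
grad_analytic = 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(len(beta))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```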

user317176
To find the gradient of $f(x)=(1/2)\|Ax-b\|^2$, note that $f(x)=g(h(x))$ where $h(x) =Ax$ and $g(y)=(1/2) \|y-b\|^2$. By the chain rule, $f'(x)=g'(h(x))h'(x)$. Clearly $h'(x)=A$ and $g'(y) = (y-b)^T$. Thus, we have $f'(x)=(Ax-b)^TA$. If we use the convention that $\nabla f(x) = f'(x)^T$, it follows that $\nabla f(x) = A^T(Ax-b)$.
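The same finite-difference sanity check works here; a minimal NumPy sketch (with arbitrary random $A$, $b$, and $x$) comparing $A^T(Ax-b)$ to a numerical gradient of $f(x)=\frac12\|Ax-b\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))   # arbitrary 6x4 matrix
b = rng.standard_normal(6)
x = rng.standard_normal(4)

# f(x) = (1/2) * ||Ax - b||^2
f = lambda v: 0.5 * np.sum((A @ v - b) ** 2)

# Gradient from the chain rule: A^T (Ax - b)
grad_analytic = A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```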

littleO