I'm trying to work through the derivation of ordinary least squares for linear regression, and I'm stumbling over taking the partial derivative of this term: $$\frac{\partial (\beta^TX^TX\beta)}{\partial\beta}$$ I found this post, so I can see the derivative should be $$X^TX\beta + (\beta^TX^TX)^T = 2X^TX\beta,$$ but I still don't see how the chain rule was used to get the derivative with respect to $\beta$. Where does the addition come from, and how is the chain rule for matrices different from the chain rule for regular variables?
Just take the gradient of $(1/2)\|Ax-b\|_2^2$ using the chain rule and you'll get $A^T(Ax-b)$. – littleO Sep 09 '17 at 23:36
2 Answers
It is the same chain rule. Actually, it is the product rule in this case.
In single-variable calculus:
$\frac {d}{dx} x^2 = \frac {d}{dx} (x)(x) = (1)(x) + (x)(1) = 2x$
You might not usually do it that way, but you should agree that it works.
$\frac {d}{d\beta} \beta^TX^TX\beta = \frac {d}{d\beta}(X\beta)^T(X\beta) = \left(\frac {d}{d\beta}(X\beta)^T\right)(X\beta) + (X\beta)^T\left(\frac {d}{d\beta}(X\beta)\right)$
Since $\frac {d}{d\beta}(X\beta) = X$, the first term is $X^T(X\beta) = X^TX\beta$ and the second is $(X\beta)^TX = (X^TX\beta)^T$; written as column vectors, both terms are equal, and their sum is $2X^TX\beta$. That sum of two terms, one per factor, is exactly where the addition comes from.
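If it helps, here is a minimal NumPy sketch (my own illustration, with arbitrary random $X$ and $\beta$) that checks $2X^TX\beta$ against a finite-difference estimate of the gradient of $f(\beta)=\beta^TX^TX\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))   # arbitrary 5x3 design matrix
beta = rng.standard_normal(3)     # arbitrary coefficient vector

# f(beta) = beta^T X^T X beta, written with @ so it stays a scalar
f = lambda b: b @ X.T @ X @ b

# Analytic gradient from the product rule above
grad_analytic = 2 * X.T @ X @ beta

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (f(beta + eps * e) - f(beta - eps * e)) / (2 * eps)
    for e in np.eye(len(beta))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```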

user317176
To find the gradient of $f(x)=(1/2)\|Ax-b\|^2$, note that $f(x)=g(h(x))$ where $h(x) =Ax$ and $g(y)=(1/2) \|y-b\|^2$. By the chain rule, $f'(x)=g'(h(x))h'(x)$. Clearly $h'(x)=A$ and $g'(y) = (y-b)^T$. Thus, we have $f'(x)=(Ax-b)^TA$. If we use the convention that $\nabla f(x) = f'(x)^T$, it follows that $\nabla f(x) = A^T(Ax-b)$.
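The same finite-difference sanity check works here; a minimal NumPy sketch (with arbitrary random $A$, $b$, and $x$) comparing $A^T(Ax-b)$ to a numerical gradient of $f(x)=\frac12\|Ax-b\|^2$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))   # arbitrary 6x4 matrix
b = rng.standard_normal(6)
x = rng.standard_normal(4)

# f(x) = (1/2) * ||Ax - b||^2
f = lambda v: 0.5 * np.sum((A @ v - b) ** 2)

# Gradient from the chain rule: A^T (Ax - b)
grad_analytic = A.T @ (A @ x - b)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(len(x))
])

print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))  # True
```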

littleO