I always have trouble computing matrix derivatives, such as the following: $$\frac{\partial(W^TX^T-Y^T)(XW-Y)}{\partial W} =\frac{\partial(W^TX^TXW - W^TX^TY - Y^TXW)}{\partial W} $$ After this step, I do not know how to compute the matrix derivative when a term involves a matrix transpose or a matrix inverse. Can someone help me and provide some rules for matrix derivatives, especially when the terms involve matrix transposes and matrix inverses? Thank you very much!
1 Answer
Take $X$ and $Y$ to be constant matrices, and define $f(W)\stackrel{\text{def}}{=}(W^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})(XW-Y)$. Then by the distributive property of matrix multiplication and linearity of matrix transposition, $$\begin{split} f(W+h\Delta W)&=((W+h\Delta W)^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})(X(W+h\Delta W)-Y)\\ &=(W^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})(XW-Y)\\ &\quad+h(\Delta W^{\mathsf{T}}X^{\mathsf{T}}(XW-Y)+(W^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})X\Delta W)\\ &\quad + h^2\Delta W^{\mathsf{T}}X^{\mathsf{T}}X\Delta W \\ \therefore f(W+h\Delta W)&=f(W)+h\langle\nabla f(W),\Delta W\rangle+o(h) \end{split}$$ where the linear map $\nabla f(W)$ is defined by $$\langle\nabla f(W),\Delta W\rangle\stackrel{\text{def}}{=}\Delta W^{\mathsf{T}}X^{\mathsf{T}}(XW-Y)+(W^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})X\Delta W\text{.}$$
By definition, $\nabla f(W)$ is the Gâteaux derivative of $f$ at $W$—for a change $\Delta W$ in $W$, it gives the "first-order" change in $f(W)$.
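A quick numerical sanity check of this first-order expansion (a sketch in NumPy; the sizes `m, n, p` are arbitrary choices for illustration, not anything fixed by the question):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 4, 3, 2                      # illustrative sizes: X is m x n, W is n x p, Y is m x p
X = rng.standard_normal((m, n))
Y = rng.standard_normal((m, p))
W = rng.standard_normal((n, p))
dW = rng.standard_normal((n, p))       # the direction Delta W
h = 1e-6

def f(W):
    # f(W) = (W^T X^T - Y^T)(X W - Y)
    return (W.T @ X.T - Y.T) @ (X @ W - Y)

# first-order coefficient <grad f(W), Delta W> from the expansion above
lin = dW.T @ X.T @ (X @ W - Y) + (W.T @ X.T - Y.T) @ X @ dW

# finite-difference approximation of the directional derivative
fd = (f(W + h * dW) - f(W)) / h
```

Here `fd` and `lin` agree up to $O(h)$ (the leftover $h\,\Delta W^{\mathsf{T}}X^{\mathsf{T}}X\Delta W$ term), which is exactly the statement that the coefficient of $h$ is the derivative.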

- The answer is $2X^TXW - 2X^TY$, but I just don't know how it is derived. – Steve Yang Mar 07 '18 at 20:34
- The derivative is the linear map $\langle \nabla f(W),\Delta W\rangle = \Delta W^{\mathsf{T}}X^{\mathsf{T}}(XW-Y)+(W^{\mathsf{T}}X^{\mathsf{T}}-Y^{\mathsf{T}})X\Delta W$—you can't simplify further unless you know the terms in this expression are symmetric. – K B Dave Mar 07 '18 at 20:46
- To calculate it, you do what I did: expand $f(W+h\Delta W)$ through first order in $h$. Then the coefficient of $h$ is linear in $\Delta W$, and this linear map is the derivative. – K B Dave Mar 07 '18 at 20:47
- What if $X$ is an $m\times n$ matrix, $W$ is an $n\times 1$ vector, and $Y$ is an $n\times 1$ vector? – Steve Yang Mar 07 '18 at 21:10
- Everything already said still applies. But then $W$ is a vector and $f$ is a scalar, so this is a duplicate of https://math.stackexchange.com/questions/2594184/gradients-of-functions-involving-matrices-and-vectors-e-g-nabla-w-wtx – K B Dave Mar 07 '18 at 21:21
- Yes, thanks, that's what I am looking for. But in their answer, $$\lim_{h \to 0} \frac{f (\mathrm x + h \mathrm v) - f (\mathrm x)}{h} = \mathrm v^\top \mathrm A \,\mathrm x + \mathrm x^\top \mathrm A \,\mathrm v = \langle \mathrm v , \mathrm A \,\mathrm x \rangle + \langle \mathrm A^\top \mathrm x , \mathrm v \rangle = \langle \mathrm v , \color{blue}{\left(\mathrm A + \mathrm A^\top\right) \mathrm x} \rangle,$$ what does the $\langle \cdot,\cdot \rangle$ mean? – Steve Yang Mar 08 '18 at 00:13
- Let us continue this discussion in chat. – K B Dave Mar 08 '18 at 00:25
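For reference, $\langle u, v\rangle = u^\top v$ denotes the Euclidean inner product, and the blue expression in the linked answer is the identity $\nabla_{\mathrm x}\,\mathrm x^\top \mathrm A\,\mathrm x = (\mathrm A + \mathrm A^\top)\,\mathrm x$. A quick NumPy check (the size `n` is an arbitrary illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
v = rng.standard_normal(n)          # direction vector
h = 1e-7

def f(x):
    return x @ A @ x                # f(x) = x^T A x

grad = (A + A.T) @ x                # claimed gradient (A + A^T) x

# directional derivative via finite differences; should equal <v, grad>
fd = (f(x + h * v) - f(x)) / h
```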