I have seen lots of people asking this question: what is $dF/dW$ when $F = WX$? Here $W$ is an $m \times n$ matrix and $X$ is an $n \times p$ matrix.
The simple answer usually given is $X^{T}$. Where does that come from?
I googled this question, and Stanford's CS231N gives an explanation of it. If you actually derive it, the result is supposed to be a higher-order tensor (4 free indices), something like a matrix whose elements are themselves matrices.
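To pin down what I mean by 4 free indices, here is the entrywise version as I understand it (the index names $i, j, k, l, r$ and the Kronecker delta $\delta_{ik}$ are my own notation, not taken from the CS231N notes):

$$F_{ij} = \sum_{r=1}^{n} W_{ir} X_{rj}, \qquad \frac{\partial F_{ij}}{\partial W_{kl}} = \delta_{ik} X_{lj},$$

with $i, k \in \{1, \dots, m\}$, $l \in \{1, \dots, n\}$ and $j \in \{1, \dots, p\}$. So the full derivative has $m \cdot p \cdot m \cdot n$ entries, most of them zero, and it is not by itself the $p \times n$ matrix $X^{T}$.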
In case you are wondering whether I searched this site before asking, and are thinking of closing this question, here are some of my findings from here and from other resources I came across.
This question attempted to demystify the answer. The answer given there is elaborate, but it says the result can be realized using the Kronecker product. Isn't that a rather roundabout way? What if we want to derive it from the basic rules (that is, multiply the two matrices and then differentiate each of the $mp$ entries of $F$ with respect to all the entries of $W$)?
Resources mentioned in CS231N. Yes, I checked those. I understand the material on matrix derivatives, but I can't find the connection between the two.
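To make the "from the basic rules" approach concrete, here is a small NumPy sketch of the brute-force computation I have in mind. The helper name `jacobian_by_entries`, the sizes `m, n, p = 2, 3, 4`, and the upstream matrix `G` (standing in for the gradient of some scalar loss with respect to $F$) are all my own choices for illustration, not something from the linked notes.

```python
import numpy as np

def jacobian_by_entries(W, X):
    """Return J with J[i, j, k, l] = d F[i, j] / d W[k, l] for F = W @ X."""
    m, n = W.shape
    p = X.shape[1]
    J = np.zeros((m, p, m, n))
    for k in range(m):
        for l in range(n):
            # Unit perturbation of the single entry W[k, l]
            E = np.zeros((m, n))
            E[k, l] = 1.0
            # F is linear in W, so this difference is the exact partial derivative
            J[:, :, k, l] = (W + E) @ X - W @ X
    return J

rng = np.random.default_rng(0)
m, n, p = 2, 3, 4                      # hypothetical sizes
W = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))

J = jacobian_by_entries(W, X)

# Candidate closed form: J[i, j, k, l] = delta_{ik} * X[l, j]
J_closed = np.einsum("ik,lj->ijkl", np.eye(m), X)

# Contract the 4-index tensor with a hypothetical upstream gradient G = dL/dF
G = rng.standard_normal((m, p))
dW = np.einsum("ij,ijkl->kl", G, J)

print(J.shape)                         # 4 free indices: (m, p, m, n)
print(np.allclose(J, J_closed))        # does the delta form match?
print(np.allclose(dW, G @ X.T))        # does the contraction equal G X^T?
```

Numerically the contraction with an upstream gradient does seem to come out as $G X^{T}$, which I assume is what the $X^{T}$ answer refers to, but this is only a check, not a derivation.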
What am I missing? How does one derive this kind of expression from the basics?
I want to make sure that I understand this. Thanks.
- The CS231N resource I mentioned: Vector, Matrix, and Tensor Derivatives, by Erik Learned-Miller
- Another resource from the same CS231N course: Derivatives, Backpropagation, and Vectorization, by Justin Johnson