I am studying backpropagation in neural networks, and I am currently watching the following video (not needed to answer the question): https://www.youtube.com/watch?v=GlcnxUlrtek&t=29s
There are a lot of great questions and answers on this site around this topic, so I will refer to some of them.
Loss function: $J = \sum_{i=1}^n(y_i - \hat{y}_i)^2\tag{1}$
where the following formulas describe the network (and the function $f$ is the sigmoid function, applied elementwise):
$\hat{\mathbf y} = f(\mathbf z^{(3)})\tag{2}$
$\mathbf z^{(3)} = \mathbf a^{(2)} \mathbf W^{(2)} \tag{3}$
$\mathbf a^{(2)} = f(\mathbf z^{(2)})\tag{4}$
$\mathbf z^{(2)} = \mathbf X \mathbf W^{(1)} \tag{5}$
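To make the setup concrete, here is a minimal NumPy sketch of the forward pass as I understand it (the shapes and random data are my own assumptions, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy sizes: 4 samples, 3 input features, 5 hidden units, 2 outputs
rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))   # W^(1)
W2 = rng.standard_normal((5, 2))   # W^(2)
y  = rng.standard_normal((4, 2))

z2    = X @ W1        # eq. (5)
a2    = sigmoid(z2)   # eq. (4)
z3    = a2 @ W2       # eq. (3)
y_hat = sigmoid(z3)   # eq. (2)

J = np.sum((y - y_hat) ** 2)   # eq. (1)
```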
I want to learn to find the derivatives $\frac{\partial J}{\partial \mathbf W^{(2)}}$ and $\frac{\partial J}{\partial \mathbf W^{(1)}}$ in a clean way, without hacking together something to get the result. I found similar questions here (Understanding the derivatives in backpropagation algorithm) and here (Not understanding derivative of a matrix-matrix product), but I still have some questions.
- Can I use matrix notation in formula (1) instead of the summation symbol? (See my guess below this list.)
- Do I need to understand the differential approach that @greg uses in order to understand this? I only know elementary linear algebra, so the use of the trace and the Hadamard product in this context is unknown to me.
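Regarding the first bullet, my own guess would be $J = (\mathbf y - \hat{\mathbf y})^T(\mathbf y - \hat{\mathbf y})$ for a single output vector, or $J = \operatorname{tr}\left[(\mathbf Y - \hat{\mathbf Y})^T(\mathbf Y - \hat{\mathbf Y})\right]$ in the batched case, but I am not sure whether this is the standard convention.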
Attempt:
First, I temporarily forget about the sum (which is unsatisfactory to me): \begin{align} \frac{\partial J}{\partial \mathbf W^{(1)}} & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf W^{(1)}}\right) \\ & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf z^{(3)}}\right)\frac{\partial \mathbf z^{(3)}}{\partial \mathbf W^{(1)}} \\ & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf z^{(3)}}\right) \frac{\partial \mathbf z^{(3)}}{\partial \mathbf a^{(2)}} \frac{\partial \mathbf a^{(2)}}{\partial \mathbf W^{(1)}} \end{align}
Now, from the answer of @GeorgSaliba (Not understanding derivative of a matrix-matrix product), I can use the formula $\frac{\partial AXB}{\partial X} = B^T \otimes A$ with $A$ set to the identity matrix, so that $\frac{\partial \mathbf z^{(3)}}{\partial \mathbf a^{(2)}} = (\mathbf W^{(2)})^T$. But I do not understand this formula from @GeorgSaliba if I have to calculate $\frac{\partial AX}{\partial X} = I^T \otimes A$. Is this Kronecker product of $I^T$ and $A$ equal to $A^T$?
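For concreteness, this is how I tried to check it numerically (the toy shapes are my own assumption; `np.kron` computes the Kronecker product), which only deepened my confusion, since the shapes do not even match:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
I = np.eye(4)

# I^T (x) A: a block-diagonal matrix with copies of A on the diagonal
K = np.kron(I.T, A)

print(K.shape)      # (8, 12)
print(A.T.shape)    # (3, 2) -- so K cannot simply equal A^T
```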
Continuing my calculations above, and calling the product of the first two factors $\delta_1$, I now get:
\begin{align} \frac{\partial J}{\partial \mathbf W^{(1)}} & = \delta_1 (\mathbf W^{(2)})^T \frac{\partial \mathbf a^{(2)}}{\partial \mathbf W^{(1)}} \\ & = \delta_1 (\mathbf W^{(2)})^T \frac{\partial f(\mathbf z^{(2)})}{\partial \mathbf z^{(2)}} \frac{\partial \mathbf z^{(2)}}{\partial \mathbf W^{(1)}} \end{align}
Now, supposedly, the last term becomes $\mathbf X^T$ (do I use @GeorgSaliba's rule here too?), and it also "jumps" to the front, so that the result becomes: $\frac{\partial J}{\partial \mathbf W^{(1)}} = \mathbf X^T \delta_1 (\mathbf W^{(2)})^T \frac{\partial f(\mathbf z^{(2)})}{\partial \mathbf z^{(2)}}$
And also, this jump brought the sum (which we disregarded) back into the equation. I feel like I lack a deep understanding of how to do this; I need some "rules" to follow, so that I can eventually do this on my own. I am more than willing to read up on anything (the differential approach, the trace, and Kronecker products) if I really need to. I appreciate any help. Thank you so much!
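Edit: to convince myself that the final formula is at least numerically plausible, I wrote this small check (the toy shapes are my own assumption, and I had to apply the $f'$ factors elementwise, i.e. as Hadamard products, to make the shapes work out):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, W1, W2):
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))
W2 = rng.standard_normal((5, 2))
y  = rng.standard_normal((4, 2))

# Forward pass
z2 = X @ W1
a2 = sigmoid(z2)
z3 = a2 @ W2
y_hat = sigmoid(z3)

# Analytic gradient per the final formula; sigmoid'(z) = f(z)(1 - f(z)),
# and both f' factors enter elementwise (Hadamard)
delta1 = -2 * (y - y_hat) * y_hat * (1 - y_hat)      # shape (4, 2)
grad_W1 = X.T @ ((delta1 @ W2.T) * a2 * (1 - a2))    # shape (3, 5), like W1

# Numerical gradient via central differences
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(X, y, Wp, W2) - loss(X, y, Wm, W2)) / (2 * eps)

print(np.max(np.abs(grad_W1 - num)))  # tiny (~1e-9), so the formula checks out
```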