I am studying backpropagation in neural networks, and I am currently watching the following video (not needed to answer the question): https://www.youtube.com/watch?v=GlcnxUlrtek&t=29s
There are a lot of great questions and answers on this site around this topic, so I will refer to some of them.
Loss function: $J = \sum_{i=1}^n(y_i - \hat{y}_i)^2\tag{1}$
where the following formulas describe the network (and the function $f$ is the sigmoid function, applied elementwise):
$\hat{\mathbf y} = f(\mathbf z^{(3)})\tag{2}$
$\mathbf z^{(3)} = \mathbf a^{(2)} \mathbf W^{(2)} \tag{3}$
$\mathbf a^{(2)} = f(\mathbf z^{(2)})\tag{4}$
$\mathbf z^{(2)} = \mathbf X \mathbf W^{(1)} \tag{5}$
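To make the setup concrete, here is a minimal NumPy sketch of the forward pass as I understand it (the shapes and random data are my own assumptions, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy sizes: 4 samples, 3 input features, 5 hidden units, 2 outputs
rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))   # W^(1)
W2 = rng.standard_normal((5, 2))   # W^(2)
y  = rng.standard_normal((4, 2))

z2    = X @ W1        # eq. (5)
a2    = sigmoid(z2)   # eq. (4)
z3    = a2 @ W2       # eq. (3)
y_hat = sigmoid(z3)   # eq. (2)

J = np.sum((y - y_hat) ** 2)   # eq. (1)
```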
I want to learn to find the derivatives $\frac{\partial J}{\partial \mathbf W^{(2)}}$ and $\frac{\partial J}{\partial \mathbf W^{(1)}}$ in a clean way, without hacking together something to get the result. I found similar questions here (Understanding the derivatives in backpropagation algorithm) and here (Not understanding derivative of a matrix-matrix product), but I still have some questions.
- Can I use matrix notation in formula (1) instead of the summation symbol? (See my guess below this list.)
- Do I need to understand the differential approach that @greg uses in order to understand this? I only know elementary linear algebra, so the use of the trace and the Hadamard product in this context is unknown to me.
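Regarding the first bullet, my own guess would be $J = (\mathbf y - \hat{\mathbf y})^T(\mathbf y - \hat{\mathbf y})$ for a single output vector, or $J = \operatorname{tr}\left[(\mathbf Y - \hat{\mathbf Y})^T(\mathbf Y - \hat{\mathbf Y})\right]$ in the batched case, but I am not sure whether this is the standard convention.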
Attempt:
First, I temporarily forget about the sum (which is unsatisfactory to me): \begin{align} \frac{\partial J}{\partial \mathbf W^{(1)}} & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf W^{(1)}}\right) \\ & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf z^{(3)}}\right)\frac{\partial \mathbf z^{(3)}}{\partial \mathbf W^{(1)}} \\ & = 2(y - \hat{y}) \left(-\frac{\partial \hat{y}}{\partial \mathbf z^{(3)}}\right) \frac{\partial \mathbf z^{(3)}}{\partial \mathbf a^{(2)}} \frac{\partial \mathbf a^{(2)}}{\partial \mathbf W^{(1)}} \end{align}
Now, from the answer of @GeorgSaliba (Not understanding derivative of a matrix-matrix product), I can use the formula $\frac{\partial AXB}{\partial X} = B^T \otimes A$ with $A$ set to the identity matrix, so that $\frac{\partial \mathbf z^{(3)}}{\partial \mathbf a^{(2)}} = (\mathbf W^{(2)})^T$. But I do not understand this formula from @GeorgSaliba if I have to calculate $\frac{\partial AX}{\partial X} = I^T \otimes A$. Is this Kronecker product of $I^T$ and $A$ equal to $A^T$?
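For concreteness, this is how I tried to check it numerically (the toy shapes are my own assumption; `np.kron` computes the Kronecker product), which only deepened my confusion, since the shapes do not even match:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((2, 3))
I = np.eye(4)

# I^T (x) A: a block-diagonal matrix with copies of A on the diagonal
K = np.kron(I.T, A)

print(K.shape)      # (8, 12)
print(A.T.shape)    # (3, 2) -- so K cannot simply equal A^T
```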
Continuing my calculations above, and calling the product of the first two factors $\delta_1$, I now get:
\begin{align} \frac{\partial J}{\partial \mathbf W^{(1)}} & = \delta_1 (\mathbf W^{(2)})^T \frac{\partial \mathbf a^{(2)}}{\partial \mathbf W^{(1)}} \\ & = \delta_1 (\mathbf W^{(2)})^T \frac{\partial f(\mathbf z^{(2)})}{\partial \mathbf z^{(2)}} \frac{\partial \mathbf z^{(2)}}{\partial \mathbf W^{(1)}} \end{align}
Now, supposedly, the last term becomes $\mathbf X^T$ (do I use @GeorgSaliba's rule here too?), and it also "jumps" to the front, so that the result becomes: $\frac{\partial J}{\partial \mathbf W^{(1)}} = \mathbf X^T \delta_1 (\mathbf W^{(2)})^T \frac{\partial f(\mathbf z^{(2)})}{\partial \mathbf z^{(2)}}$
And also, this jump brought the sum (which we disregarded) back into the equation. I feel like I lack a deep understanding of how to do this; I need some "rules" to follow, so that I can eventually do this on my own. I am more than willing to read up on anything (the differential approach, the trace, and Kronecker products) if I really need to. I appreciate any help. Thank you so much!
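Edit: to convince myself that the final formula is at least numerically plausible, I wrote this small check (the toy shapes are my own assumption, and I had to apply the $f'$ factors elementwise, i.e. as Hadamard products, to make the shapes work out):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, y, W1, W2):
    y_hat = sigmoid(sigmoid(X @ W1) @ W2)
    return np.sum((y - y_hat) ** 2)

rng = np.random.default_rng(0)
X  = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))
W2 = rng.standard_normal((5, 2))
y  = rng.standard_normal((4, 2))

# Forward pass
z2 = X @ W1
a2 = sigmoid(z2)
z3 = a2 @ W2
y_hat = sigmoid(z3)

# Analytic gradient per the final formula; sigmoid'(z) = f(z)(1 - f(z)),
# and both f' factors enter elementwise (Hadamard)
delta1 = -2 * (y - y_hat) * y_hat * (1 - y_hat)      # shape (4, 2)
grad_W1 = X.T @ ((delta1 @ W2.T) * a2 * (1 - a2))    # shape (3, 5), like W1

# Numerical gradient via central differences
eps = 1e-6
num = np.zeros_like(W1)
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        Wp, Wm = W1.copy(), W1.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        num[i, j] = (loss(X, y, Wp, W2) - loss(X, y, Wm, W2)) / (2 * eps)

print(np.max(np.abs(grad_W1 - num)))  # tiny (~1e-9), so the formula checks out
```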