I'm having trouble understanding the derivatives in the backpropagation algorithm. I'll use the example presented here.
If you're unfamiliar with the algorithm I'm talking about, that's okay; my question is only about the derivatives.
So I have the following functions:
$$ x_1 = W_1x_0$$ $$ x_2 = f_1(x_1)$$ $$E = \frac{1}{2} || x_2 - y||^2$$
where $x_0$ is a vector of size $4 \times 1$, $W_1$ is a matrix of size $5 \times 4$, and $f_1$ is some nonlinear function (for example, the logistic function). $y$ is a vector with the same dimension as $x_2$.
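To make the setup concrete, here is a minimal NumPy sketch of the forward pass (the random data is just an assumption for illustration; only the shapes matter):

```python
import numpy as np

rng = np.random.default_rng(0)

# example data, just to make the shapes concrete
x0 = rng.standard_normal((4, 1))   # input, 4x1
W1 = rng.standard_normal((5, 4))   # weight matrix, 5x4
y  = rng.standard_normal((5, 1))   # target, same size as x2

def f1(z):
    # logistic (sigmoid) nonlinearity, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

x1 = W1 @ x0                        # 5x1
x2 = f1(x1)                         # 5x1
E  = 0.5 * np.sum((x2 - y) ** 2)    # scalar error
```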
Now, I need to take the derivative of E w.r.t. $W_1$. I'll use the chain rule:
$$ \frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial x_2} \frac{\partial x_2}{\partial x_1} \frac{\partial x_1}{\partial W_1}$$
I can understand the first factor: the derivative of a scalar (the output of $E$) w.r.t. a vector is a vector.
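Concretely, for the error above I get (if I'm not mistaken)
$$ \frac{\partial E}{\partial x_2} = (x_2 - y)^T, $$
which is a $1 \times 5$ row vector (or the column vector $x_2 - y$, depending on the layout convention).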
I'm not sure about the next factor. The derivative of $x_2$ w.r.t. $x_1$ is the derivative of a vector w.r.t. another vector. Isn't that supposed to be a matrix, somehow?
And the part I understand least is the last factor: the derivative of $x_1$ w.r.t. $W_1$. Isn't it impossible to take the derivative of a vector w.r.t. a matrix?
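For what it's worth, I can compute $\partial E / \partial W_1$ entry by entry with finite differences (a rough NumPy sketch, reusing `x0`, `W1`, `y`, and `f1` from the snippet above), so the object I'm after clearly exists and has the same shape as $W_1$; I just don't see how to obtain it from the chain rule as written:

```python
def error(W):
    # recompute E for a given weight matrix W
    x1 = W @ x0
    x2 = f1(x1)
    return 0.5 * np.sum((x2 - y) ** 2)

eps = 1e-6
dE_dW1 = np.zeros_like(W1)   # numerical gradient, same 5x4 shape as W1
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus = W1.copy()
        W_plus[i, j] += eps
        W_minus = W1.copy()
        W_minus[i, j] -= eps
        # central-difference approximation of dE / dW1[i, j]
        dE_dW1[i, j] = (error(W_plus) - error(W_minus)) / (2 * eps)
```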