I'm having trouble understanding the derivatives in the backpropagation algorithm. I'll use the example presented here.
If you're unfamiliar with the algorithm I'm talking about, that's okay; my question is only about the derivatives.
So I have the following functions:
$$ x_1 = W_1x_0$$ $$ x_2 = f_1(x_1)$$ $$E = \frac{1}{2} || x_2 - y||^2$$
where $x_0$ is a vector of size $4 \times 1$, $W_1$ is a matrix of size $5 \times 4$, and $f_1$ is some nonlinear function (for example, the logistic function). $y$ is a vector with the same dimension as $x_2$.
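To make the setup concrete, here is a minimal NumPy sketch of the forward pass (the random data is just an assumption for illustration; only the shapes matter):

```python
import numpy as np

rng = np.random.default_rng(0)

# example data, just to make the shapes concrete
x0 = rng.standard_normal((4, 1))   # input, 4x1
W1 = rng.standard_normal((5, 4))   # weight matrix, 5x4
y  = rng.standard_normal((5, 1))   # target, same size as x2

def f1(z):
    # logistic (sigmoid) nonlinearity, applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

x1 = W1 @ x0                        # 5x1
x2 = f1(x1)                         # 5x1
E  = 0.5 * np.sum((x2 - y) ** 2)    # scalar error
```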
Now, I need to take the derivative of E w.r.t. $W_1$. I'll use the chain rule:
$$ \frac{\partial E}{\partial W_1} = \frac{\partial E}{\partial x_2} \frac{\partial x_2}{\partial x_1} \frac{\partial x_1}{\partial W_1}$$
I can understand the first factor: the derivative of a scalar (the output of $E$) w.r.t. a vector is a vector.
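Concretely, for the error above I get (if I'm not mistaken)
$$ \frac{\partial E}{\partial x_2} = (x_2 - y)^T, $$
which is a $1 \times 5$ row vector (or the column vector $x_2 - y$, depending on the layout convention).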
I'm not sure about the next factor. The derivative of $x_2$ w.r.t. $x_1$ is the derivative of a vector w.r.t. another vector. Isn't that supposed to be a matrix, somehow?
And the part I understand least is the last factor: the derivative of $x_1$ w.r.t. $W_1$. Isn't it impossible to take the derivative of a vector w.r.t. a matrix?
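For what it's worth, I can compute $\partial E / \partial W_1$ entry by entry with finite differences (a rough NumPy sketch, reusing `x0`, `W1`, `y`, and `f1` from the snippet above), so the object I'm after clearly exists and has the same shape as $W_1$; I just don't see how to obtain it from the chain rule as written:

```python
def error(W):
    # recompute E for a given weight matrix W
    x1 = W @ x0
    x2 = f1(x1)
    return 0.5 * np.sum((x2 - y) ** 2)

eps = 1e-6
dE_dW1 = np.zeros_like(W1)   # numerical gradient, same 5x4 shape as W1
for i in range(W1.shape[0]):
    for j in range(W1.shape[1]):
        W_plus = W1.copy()
        W_plus[i, j] += eps
        W_minus = W1.copy()
        W_minus[i, j] -= eps
        # central-difference approximation of dE / dW1[i, j]
        dE_dW1[i, j] = (error(W_plus) - error(W_minus)) / (2 * eps)
```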