
I am reading about backpropagation for fully connected neural networks and I found a very interesting article by Jeremy Jordan. It explains the process from start to finish. There is one section, though, that confused me a bit. The partial derivative of the cost function (MSE) with respect to the weights $\theta_{jk}^{(2)}$ is:

$$\frac{\partial J(\theta)}{\partial \theta_{jk}^{(2)}} = \left( \frac{\partial J(\theta)}{\partial a_j^{(3)}}\right) \left( \frac{\partial a_j^{(3)}}{\partial z_j^{(3)}}\right) \left(\frac{\partial z_j^{(3)}}{\partial \theta_{jk}^{(2)}} \right) \tag{1}$$

The article defines the next equation as the "error" term. Equation $(2)$ combines the first two partial derivatives in the chain rule:

$$ \delta_i^{(3)} = \frac {1}{m} (y_i - a_i^{(3)}) f^{'}(a^{(3)}) \tag{2}$$

Where:

  • $ i: $ The index of the neuron in the layer
  • $ ^{(3)}: $ Denotes the layer (in this case 3 is the output layer)
  • $ z_i: $ The weighted sum of the inputs of the $i$-th neuron
  • $ m: $ The number of training samples
  • $ y_i: $ The expected value of the $i$-th neuron
  • $ a_i: $ The predicted value of the $i$-th neuron
  • $ f^{'}: $ The derivative of the activation function

A few lines after the definition above, the article states:

$ \delta^{(3)} $ is a vector of length $j$, where $j$ is equal to the number of output neurons: $$ \delta^{(3)} = \begin{bmatrix} y_1 - a_1^{(3)} \newline y_2 - a_2^{(3)} \newline \vdots \newline y_j - a_j^{(3)} \newline \end{bmatrix} f^{'}(a^{(3)}) \tag{3} $$

Q1. I strongly suspect that $ f^{'}(a^{(3)}) $ is a vector of length $j$ and not a scalar; basically, it is a vector containing the derivative of the activation function for every neuron of the output layer. How is it possible in equation $(3)$ to multiply it with another vector and still get a vector, and not a $j \times j$ matrix? Is the multiplication elementwise?
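For example, here is roughly what I mean with a small NumPy sketch (the numbers, including the derivative values, are made up just to illustrate the shapes):

```python
import numpy as np

y = np.array([1.0, 0.0])               # targets for j = 2 output neurons
a3 = np.array([0.8, 0.3])              # predicted activations a^(3)
fprime = np.array([0.16, 0.21])        # some derivative value per output neuron

elementwise = (y - a3) * fprime        # shape (2,)   -> a vector of length j
outer = np.outer(y - a3, fprime)       # shape (2, 2) -> a j x j matrix

print(elementwise.shape, outer.shape)  # (2,) (2, 2)
```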

Q2. How is $ f^{'}(a^{(3)}) $ calculated for every neuron over multiple training samples? From what I understand, while training with batches I would have to average the $ (y_i - a_i^{(3)}) $ term over the whole batch for every neuron. So, in fact, the term $ (y_i - a_i^{(3)}) $ is the sum over the whole batch, and that is why the $ \frac {1}{m} $ is present. Does that apply to the derivative too? That is, do I have to calculate the average of the derivative over the whole batch for each neuron?

Q3. What does $ f^{'}(a^{(3)}) $ actually mean? Is it the derivative of the activation function evaluated at the values of the $a_i^{(3)}$ outputs? Or is it the derivative of the activation function evaluated at the values of the weighted sums $ z_i $ that are actually passed through the activation function to produce the outputs $a_i^{(3)} = f(z_i)$? And if it is the second, would I have to keep track of the average of the $z_i$ for each neuron in order to obtain the average of $ f^{'} $?

2 Answers


Re your Q1 and Q3: assuming a single training example for now, you are indeed right that, mathematically speaking, $f^{'}(a^{(3)})$ should not be a constant scalar. From the author's derivation section just above your referenced equation (3), this derivative of the activation function within the same layer is evaluated at different values $z_1^{(3)}, z_2^{(3)}$, i.e. the (weighted-sum) net inputs of the two demonstrated neurons in the output layer, which have already been computed in the previous feedforward pass. This confirms the second interpretation in your Q3: the derivative of the activation function is evaluated at the weighted sums $z_i$ that are actually passed through the activation function. You may also refer to the delta rule, which is a special case of the backpropagation algorithm and confirms the same interpretation.
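As a minimal sketch (the sigmoid here is just an example activation, not necessarily the article's choice), the derivative is evaluated at the net inputs $z^{(3)}$ that were already computed during the forward pass:

```python
import numpy as np

def f(z):                      # sigmoid, used here only as an example activation
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):                # its derivative, evaluated at the net input z
    s = f(z)
    return s * (1.0 - s)

z3 = np.array([0.4, -1.2])     # net inputs z^(3) of the 2 output neurons (forward pass)
a3 = f(z3)                     # activations a^(3)
d_act = f_prime(z3)            # one derivative value per output neuron, not a scalar
```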

For a neuron $j$ with activation function $g(x)$, the delta rule for neuron $j$'s $i$-th weight $w_{ji}$ is given by $\Delta w_{ji}=\alpha (t_{j}-y_{j})\,g'(h_{j})\,x_{i}$, where $\alpha$ is a small constant called the learning rate, $g'$ is the derivative of $g$, $t_{j}$ is the target output, $h_{j}$ is the weighted sum of the neuron's inputs, $y_{j}$ is the actual output, and $x_{i}$ is the $i$-th input.
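A minimal sketch of that update in code (my own names; the activation $g$ and its derivative are passed in):

```python
import numpy as np

def delta_rule_step(w, x, t, g, g_prime, alpha=0.1):
    """One delta-rule update for a single neuron with weight vector w and inputs x."""
    h = np.dot(w, x)                              # weighted sum of the inputs
    y = g(h)                                      # actual output
    return w + alpha * (t - y) * g_prime(h) * x   # delta w_ji = alpha (t_j - y_j) g'(h_j) x_i
```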

Finally, $\delta^{(3)}=[\delta_1^{(3)}, \space \delta_2^{(3)}]$ is a $1 \times 2$ vector representing the "error" terms of the same two output neurons, which can be confirmed by the author's conclusion section in more abstract linear-algebra notation. It is therefore best to view $f^{'}(a^{(3)})$ as a $2 \times 2$ diagonal matrix whose diagonal entries are $f^{'}(z_1^{(3)})$ and $f^{'}(z_2^{(3)})$, and to treat the explicit vector in your equation (3) as a $1 \times 2$ row vector; then the final result matches: $[y_1-a_1^{(3)}, \space y_2-a_2^{(3)}]\,f^{'}(a^{(3)}) = \delta^{(3)}$.
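A quick NumPy check (with made-up numbers) that the row vector times the diagonal matrix is just the elementwise product:

```python
import numpy as np

err = np.array([[0.2, -0.3]])      # 1 x 2 row vector [y1 - a1^(3), y2 - a2^(3)]
d = np.array([0.16, 0.21])         # f'(z1^(3)), f'(z2^(3))

as_matrix = err @ np.diag(d)       # row vector times 2 x 2 diagonal matrix
as_elementwise = err * d           # plain elementwise product

assert np.allclose(as_matrix, as_elementwise)
```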

As for your final Q2: when training on multiple examples with the scaled loss function, you don't usually need to calculate any additional average; the scaled loss function already takes care of minimizing the mean squared error. The only difference is that you are now dealing with much larger vectors/matrices than in your equation (3). Say you have 3 training examples; then the above $\delta^{(3)}$ becomes a $ 1 \times 6$ vector and $f^{'}(a^{(3)})$ a $ 6 \times 6$ matrix. You still start with the same small random values for the same 8 weights in the same network architecture as above, but you have to compute larger vectors and matrices during both the feedforward and backpropagation passes of any epoch.
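For instance, with $m=3$ training examples and 2 output neurons, one way to organize the same computation (a sketch with made-up numbers, not the article's code) keeps one row per example and lets the $\frac1m$ factor do the averaging:

```python
import numpy as np

Y = np.array([[1.0, 0.0],          # targets, one row per training example
              [0.0, 1.0],
              [1.0, 1.0]])
A3 = np.array([[0.8, 0.3],         # activations a^(3) from the forward pass
               [0.4, 0.6],
               [0.7, 0.9]])
D3 = np.array([[0.16, 0.21],       # f'(z^(3)), one row per example
               [0.24, 0.24],
               [0.21, 0.09]])

m = Y.shape[0]
delta3 = (1.0 / m) * (Y - A3) * D3   # error terms, shape (m, 2); the 1/m does the averaging
```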

cinch

The author switches rather freely between row and column format. The main philosophy or framework seems to be to implement the directions of "forward" evaluation and "backward" gradient differentiation as left-to-right, in diagrams as well as in formulas.

However, this philosophy is broken several times, for instance in writing $a=f(z)$ instead of $f(z)=a$, or in using weight matrices that are indexed for matrix-vector multiplication, that is, the usual right-to-left direction $(z^{(3)})^T=\theta^{(2)} (a^{(2)})^T$, which following the philosophy should be written as $a^{(2)}(\theta^{(2)})^T=z^{(3)}$.

But then again the philosophy gets reversed in formulas like $$ \delta^{(l)} = \delta^{(l + 1)}\,\Theta^{(l)}\,f'\left( a^{(l)} \right) $$ which clearly is right-to-left, which suggests that the gradients $δ^{(l)}$ are row vectors and the argument vectors like $a^{(l)}$ are column vectors.

In short, it's a mess.


Despite that, your questions have direct answers that do not rely too much on which directional philosophy is used.

Q1. $f'(a^{(3)})$, as used and positioned relative to the other vectors and matrices, is the diagonal matrix with the entries $f'(a_j^{(3)})$ on the diagonal. This comes out as a component-wise product in matrix-vector or vector-matrix multiplications.

Q2. If you were to discuss that, you would need another index in all the formulas indicating the training sample, such as $J(x^{[k]},\theta)$ for the residual of the net output against $y^{[k]}$. The gradient of the sum $\frac1m\sum_{k=1}^mJ(x^{[k]},\theta)$ would be the (scaled) sum of the single gradients, which get computed independently of each other. Another interpretation of the factor $\frac1m$ is that it is the gradient at the top level $J$ that gets propagated backwards to the gradients of each variable.
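As a sketch of that bookkeeping (with a hypothetical helper `grad_single(x, y, theta)` that returns the gradient of $J(x^{[k]},\theta)$ for one training pair):

```python
def batch_gradient(grad_single, X, Y, theta):
    """Average the independent per-sample gradients.
    grad_single(x, y, theta) is assumed to return the gradient of
    J(x^[k], theta) with respect to theta for one pair (x^[k], y^[k])."""
    grads = [grad_single(x, y, theta) for x, y in zip(X, Y)]
    return sum(grads) / len(grads)   # the 1/m factor
```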

Q3. Of course some doubt is appropriate. As the activated value is a function of the linearly combined input, $a^{(3)}=f(z^{(3)})$, so is the derivative $\frac{\partial a^{(3)}}{\partial z^{(3)}}=f'(z^{(3)})$. Both must have the same argument. This is just a typo, perhaps copy-pasted a few times.
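As a side note (mine, not from the article): for some activations the distinction does not even require storing $z$, because $f'(z)$ can be expressed through $a=f(z)$; for the sigmoid, $f'(z)=a(1-a)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([0.4, -1.2])
a = sigmoid(z)

# derivative evaluated at z, written two equivalent ways
assert np.allclose(a * (1 - a), sigmoid(z) * (1 - sigmoid(z)))
```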


I've tried to find a way to implement the left-to-right philosophy consistently, but it is too cumbersome. One would have to use something unfamiliar like a kind of reverse Polish notation, so instead of $v=\phi(u,w)$ one would have to write $[u,w:\,\phi]=v$ or similar. So it is better to stay with right-to-left consistently, as the author also ended up doing. Thus $x,z,a$ are column vectors, and gradients (against the tradition in differential calculus) are row vectors. In algorithmic differentiation it is one tradition to denote tangent vectors for forward differentiation with a dot, $\dot x, \dot z,\dot a$, and gradients that get pushed back with a bar, so $\bar x,\bar z=\delta, \bar a$.

The construction principle for gradient propagation is that if $v=\phi(u,w)$, then the relation of tangents pushed forward to the level of before and after the operation $\phi$ and gradients pushed back to that stage satisfy $$ \bar v\dot v=\bar u\dot u+\bar w\dot w. $$ Inserting $\dot v=\phi_{,u}\dot u+\phi_{,w}\dot w$ results in $$ \bar v\phi_{,u}\dot u+\bar v\phi_{,w}\dot w=\bar u\dot u+\bar w\dot w. $$ Comparing both sides gives $\bar u=\bar v\phi_{,u}$, $\bar w=\bar v\phi_{,w}$.
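As a quick sanity check of this principle (my own example, not from the article): for a scalar product $v=\phi(u,w)=u\,w$ one has $\phi_{,u}=w$ and $\phi_{,w}=u$, so the comparison gives $$ \bar u=\bar v\,w,\qquad \bar w=\bar v\,u, $$ which is just the product rule read backwards.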

In a wider context $\bar w$ is a linear functional, meaning with scalar value, of $\dot w$. So if $w$ is a matrix, then the linear functional is obtained via the trace, ${\rm Tr}(\bar w·\dot w)$. So if for instance $\phi(u,w)=w·u$ in a matrix-vector product, then by the product rule $\dot v=w·\dot u+\dot w·u$ and $$ {\rm Tr}(\bar w·\dot w)=\bar v·\dot w·u={\rm Tr}(u·\bar v·\dot w), $$ so the comparison gives $\bar w=u·\bar v$.
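A quick numerical check of that last identity (a NumPy sketch with shapes of my own choosing; here $w$ is $4\times 3$, $u$ a $3\times 1$ column and $\bar v$ a $1\times 4$ row):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal((3, 1))      # column vector u
vbar = rng.standard_normal((1, 4))   # gradient pushed back to v = w·u (row vector)
wdot = rng.standard_normal((4, 3))   # arbitrary tangent direction for w (same shape as w)

wbar = u @ vbar                      # claimed gradient of w: u · vbar

lhs = np.trace(wbar @ wdot)          # Tr(wbar · wdot)
rhs = (vbar @ wdot @ u).item()       # vbar · wdot · u
assert np.isclose(lhs, rhs)
```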


The example network in atomic formulas is \begin{align} z^{(2)}&=\Theta^{(1)}·x \\ a^{(2)}&=f(z^{(2)}) \\ z^{(3)}&=\Theta^{(2)}·a^{(2)} \\ a^{(3)}&=f(z^{(3)}) \\ \end{align} and then $J$ is computed via some loss function from $a^{(3)}$ and the reference value $y$.

Starting from the gradient $\bar a^{(3)}$ computed from the loss function, the pushed-back gradients compute as \begin{align} \bar z^{(3)} &= \bar a^{(3)}·{\rm diag}(f'(z^{(3)})) \\ \bar a^{(2)} &= \bar z^{(3)}·\Theta^{(2)} \\ \bar \Theta^{(2)} &= a^{(2)}·\bar z^{(3)} \\ \bar z^{(2)} &= \bar a^{(2)}·{\rm diag}(f'(z^{(2)})) \\ \bar x &= \bar z^{(2)}·\Theta^{(1)} \\ \bar \Theta^{(1)} &= x·\bar z^{(2)} \\ \end{align}
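For what it's worth, here is a minimal NumPy transcription of these formulas (a sketch under my own names; the sigmoid is only a placeholder for $f$, and $\bar a^{(3)}$ is taken from the squared-error loss below with $\bar J=1$):

```python
import numpy as np

def f(z):                                  # placeholder activation (sigmoid)
    return 1.0 / (1.0 + np.exp(-z))

def f_prime(z):
    s = f(z)
    return s * (1.0 - s)

def forward_backward(x, y, Theta1, Theta2):
    # forward pass: x, z, a are column vectors
    z2 = Theta1 @ x
    a2 = f(z2)
    z3 = Theta2 @ a2
    a3 = f(z3)

    # gradient of J = 1/2 * sum |a3 - y|^2 with respect to a3, as a row vector
    a3_bar = (a3 - y).T

    # pushed-back gradients, transcribed line by line from the formulas above
    z3_bar = a3_bar @ np.diag(f_prime(z3).ravel())
    a2_bar = z3_bar @ Theta2
    Theta2_bar = a2 @ z3_bar
    z2_bar = a2_bar @ np.diag(f_prime(z2).ravel())
    x_bar = z2_bar @ Theta1
    Theta1_bar = x @ z2_bar
    return Theta1_bar, Theta2_bar, x_bar
```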

Of course one can combine some of these formulas, like $$ \delta^{(2)}=\bar z^{(2)} =\bar z^{(3)}·\Theta^{(2)}·{\rm diag}(f'(z^{(2)})) =\delta^{(3)}·\Theta^{(2)}·{\rm diag}(f'(z^{(2)})) $$ and if $J=\frac12\sum |a_j^{(3)}-y_j|^2$, then also $$ \delta^{(3)}=\bar z^{(3)}=\bar J\,[a_1^{(3)}-y_1,a_2^{(3)}-y_2,…]·{\rm diag}(f'(z^{(3)})) $$ with for instance $\bar J=\frac1m$.

Lutz Lehmann