I am reading about backpropagation for fully connected neural networks and I found a very interesting article by Jeremy Jordan that explains the process from start to finish. There is one section, though, that confused me a bit. The partial derivative of the cost function (MSE) with respect to the $\theta_{jk}^{(2)}$ weights is:
$$\frac{\partial J(\theta)}{\partial \theta_{jk}^{(2)}} = \left( \frac{\partial J(\theta)}{\partial a_j^{(3)}}\right) \left( \frac{\partial a_j^{(3)}}{\partial z_j^{(3)}}\right) \left(\frac{\partial z_j^{(3)}}{\partial \theta_{jk}^{(2)}} \right) \tag{1}$$
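To convince myself I am reading equation (1) correctly, I wrote a tiny numerical check. The setup below (a single output neuron with a sigmoid activation and the cost $J = \frac{1}{2m}\sum_i (y_i - a_i^{(3)})^2$, $m = 1$) is entirely my own assumption, not necessarily the article's exact setup:

```python
import numpy as np

# Toy setup (my assumption, not the article's): two activations from layer 2
# feeding a single sigmoid output neuron, MSE cost with m = 1 sample.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

a2 = np.array([0.3, 0.9])        # a^{(2)}: activations feeding the output layer
theta = np.array([0.5, -0.25])   # theta^{(2)}: weights into the output neuron
y = 1.0                          # target
m = 1

def forward(theta):
    z3 = theta @ a2              # z^{(3)} = sum_k theta_k^{(2)} a_k^{(2)}
    a3 = sigmoid(z3)             # a^{(3)} = f(z^{(3)})
    J = (1.0 / (2 * m)) * (y - a3) ** 2
    return z3, a3, J

z3, a3, J = forward(theta)

# The three factors of equation (1)
dJ_da = -(1.0 / m) * (y - a3)    # dJ/da^{(3)} for my assumed cost
da_dz = a3 * (1 - a3)            # f'(z^{(3)}) for a sigmoid
dz_dtheta = a2                   # dz^{(3)}/dtheta_k^{(2)} = a_k^{(2)}
grad_chain = dJ_da * da_dz * dz_dtheta

# Finite-difference check of dJ/dtheta_k
eps = 1e-6
grad_num = np.zeros_like(theta)
for k in range(len(theta)):
    t_plus, t_minus = theta.copy(), theta.copy()
    t_plus[k] += eps
    t_minus[k] -= eps
    grad_num[k] = (forward(t_plus)[2] - forward(t_minus)[2]) / (2 * eps)

print(grad_chain, grad_num)      # these agree, so I believe I read eq (1) correctly
```

That part I am comfortable with; my confusion starts with the error term below.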
The article defines the following equation as the "error" term. Equation (2) is the combination of the first two partials in the chain rule:
$$ \delta_i^{(3)} = \frac {1}{m} (y_i - a_i^{(3)}) f^{'}(a^{(3)}) \tag{2}$$
Where:
- $ i: $ The index of the neuron in the layer
- $ ^{(3)}: $ Denotes the layer (in this case 3 is the output layer)
- $ z_i: $ The weighted sum of the inputs of the $i^{th}$ neuron
- $ m: $ The number of training samples
- $ y_i: $ The expected value of the $ i^{th} $ neuron
- $ a_i: $ The predicted value of the $ i^{th} $ neuron
- $ f^{'}: $ The derivative of the activation function
A few lines after the definition above, the article states:
$ \delta^{(3)} $ is a vector of length $j$, where $j$ is equal to the number of output neurons: $$ \delta^{(3)} = \begin{bmatrix} y_1 - a_1^{(3)} \newline y_2 - a_2^{(3)} \newline \vdots \newline y_j - a_j^{(3)} \end{bmatrix} f^{'}(a^{(3)}) \tag{3} $$
Q1. I strongly suspect that $ f^{'}(a^{(3)}) $ is a vector of length $j$ and not a scalar. Basically, it is a vector containing the derivative of the activation function for every neuron of the output layer. How is it possible in equation (3) to multiply it with another vector and still get a vector and not a $j \times j$ matrix? Is the multiplication elementwise?
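To make the question concrete, here is a small numpy sketch of the two readings I can imagine (the sigmoid and the concrete numbers are made up purely for illustration):

```python
import numpy as np

# j = 3 output neurons, a single sample; the sigmoid is an assumed activation.
a3 = np.array([0.2, 0.7, 0.9])                 # predicted outputs a^{(3)}
y  = np.array([0.0, 1.0, 1.0])                 # expected outputs
f_prime = a3 * (1 - a3)                        # f'(a^{(3)}) as a vector of length j

delta_elementwise = (y - a3) * f_prime         # shape (3,)   <- what I think eq (3) means
delta_outer       = np.outer(y - a3, f_prime)  # shape (3, 3) <- a j x j matrix otherwise

print(delta_elementwise.shape, delta_outer.shape)
```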
Q2. How is $ f^{'}(a^{(3)}) $ calculated for every neuron over multiple training samples? From what I understand, while training with batches I would have to average the $ (y_i - a_i^{(3)}) $ term over the whole batch for every neuron. So in fact the term $ (y_i - a_i^{(3)}) $ is really a sum over the whole batch, which is why the $ \frac {1}{m} $ is present. Does that apply to the derivative too? That is, do I have to calculate the average of the derivative over the whole batch for each neuron?
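Here is how I picture the two interpretations for a batch, again as a numpy sketch with made-up numbers and an assumed sigmoid (rows are samples, columns are output neurons):

```python
import numpy as np

m = 4                                          # batch of 4 samples, j = 3 output neurons
A3 = np.array([[0.2, 0.7, 0.9],
               [0.1, 0.6, 0.8],
               [0.3, 0.8, 0.7],
               [0.4, 0.5, 0.9]])               # a^{(3)} for each sample
Y  = np.array([[0., 1., 1.],
               [0., 1., 1.],
               [1., 0., 1.],
               [0., 1., 1.]])                  # targets for each sample

F_prime = A3 * (1 - A3)                        # f' per sample and per neuron (sigmoid assumed)

# Interpretation A: keep everything per sample, average only at the end
delta_A = ((Y - A3) * F_prime).mean(axis=0)    # shape (j,)

# Interpretation B: average (y - a) and f' separately over the batch, then multiply
delta_B = (Y - A3).mean(axis=0) * F_prime.mean(axis=0)

print(delta_A, delta_B)                        # not the same in general
```

The two give different numbers in general, so I would like to know which (if either) the article intends.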
Q3. What does $ f^{'}(a^{(3)}) $ actually mean? Is it the derivative of the activation function evaluated at the $a_i^{(3)}$ outputs? Or is it the derivative of the activation function evaluated at the weighted sum $ z_i $ that is actually passed through the activation function to produce the output $a_i^{(3)} = f(z_i)$? And if it is the latter, would I have to keep track of the average of the $ z_i $ for each neuron in order to obtain the average of $ f^{'} $?
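My only guess so far, and it is just a guess, is that for a sigmoid the distinction would not matter numerically, since $\sigma^{'}(z) = \sigma(z)(1 - \sigma(z)) = a(1 - a)$, which would explain writing $f^{'}(a^{(3)})$ instead of $f^{'}(z^{(3)})$. A quick check of that identity:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z3 = np.array([-1.0, 0.5, 2.0])                     # weighted sums z^{(3)}
a3 = sigmoid(z3)                                    # activations a^{(3)} = f(z^{(3)})

f_prime_from_z = sigmoid(z3) * (1 - sigmoid(z3))    # derivative evaluated at z
f_prime_from_a = a3 * (1 - a3)                      # the same thing written in terms of a

print(np.allclose(f_prime_from_z, f_prime_from_a))  # True, at least for a sigmoid
```

But I do not see how that would generalize to other activation functions, hence the question.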