I found this article that describes how neural networks work. This paragraph near the end caught my eye and explains how weights are updated:
So we see that $\theta_i := \theta_i + \Delta\theta_i$, where $\Delta\theta_i = -\eta \cdot \frac{d}{d\theta_i}\bigl(CostFunction(outputOfNeuralNetwork)\bigr)$
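To make the update concrete, here is a minimal numeric sketch of that rule (the toy cost and all names are my own, not from the article): repeatedly stepping $\theta$ by $-\eta$ times the derivative of a one-dimensional cost.

```python
def cost(theta):
    # toy cost function, minimized at theta = 3
    return (theta - 3.0) ** 2

def d_cost_d_theta(theta):
    # analytic derivative of the toy cost
    return 2.0 * (theta - 3.0)

eta = 0.1    # learning rate
theta = 0.0  # initial weight

for _ in range(100):
    delta = -eta * d_cost_d_theta(theta)  # -eta * dC/d(theta)
    theta = theta + delta                 # theta := theta + delta

print(round(theta, 4))  # converges toward 3.0, the minimizer of the toy cost
```

Each step moves $\theta$ a small distance against the slope of the cost, which is exactly the quantity whose units the question is about.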
I was wondering whether the units of each variable here are consistent, making this a valid update operation.
My attempt at understanding this is:
- the weight $\theta_i$ is written in $[\theta_i units]$
- $\frac{d}{d\theta_i}(CostFunction(outputOfNeuralNetwork))$ is written in $[\frac{error units}{\theta_i units}]$
- the learning rate $\eta$ is written in $[\theta_i units]$; it is a small value, and the factor of $-1$ turns it into a small, backwards step along the $\theta_i$ axis.
Since the $[\theta_i units]$ of $\eta$ cancel against the $[\frac{error units}{\theta_i units}]$ of the derivative, the quantity $-\eta \cdot \frac{d}{d\theta_i}(CostFunction(outputOfNeuralNetwork))$ is written in $[error units]$, so we can add it to $\theta_i$ as long as we suppose that $[error units]$ and $[\theta_i units]$ are actually the same for all $i$.
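My unit bookkeeping above can be sketched mechanically, tracking each quantity's units as a dict of exponents (a toy helper of my own, not a real units library):

```python
from collections import Counter

def multiply_units(*unit_dicts):
    # units multiply by adding their exponents
    total = Counter()
    for u in unit_dicts:
        total.update(u)
    return {k: v for k, v in total.items() if v != 0}

grad_units = {"error": 1, "theta": -1}  # d(CostFunction)/d(theta_i)
eta_units = {"theta": 1}                # learning rate, per my assumption above

delta_units = multiply_units(eta_units, grad_units)
print(delta_units)  # {'error': 1}
```

The theta exponents cancel, leaving the step in error units, which match $[\theta_i units]$ only under the supposition that the two are the same.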
Have I understood this correctly?
EDIT: or perhaps $[error units]$ don't actually exist: the neurons in the very last layer are all probabilities (unitless, just like radians). The same goes for the units of a weight parameter; the weights just scale terms in a linear combination that looks like $a x_1 + b x_2 + c x_3 + ... + bias$, where $a, b, c, ...$ are the weight parameters. I take it the $bias$ parameters are also unitless values that shift the incoming input into the appropriate domain of some activation function. So I guess this is a valid update operation because there are no units?
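The linear-combination-plus-activation picture in the edit can be sketched like this (variable names are mine): a sigmoid squashes the weighted sum plus bias into a unitless value in $(0, 1)$, i.e. a probability.

```python
import math

def sigmoid(z):
    # standard logistic activation, maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # a*x_1 + b*x_2 + c*x_3 + ... + bias, then activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

p = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(0.0 < p < 1.0)  # True: the output is a unitless probability
```

Whatever scale the raw weighted sum has, the activation's output is dimensionless, which is the sense in which the update could be "valid because there are no units".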