I found this article that describes how neural networks work. This paragraph near the end caught my eye and explains how weights are updated:
So we see that $\theta_i := \theta_i + \Delta\theta_i$, where $\Delta\theta_i = -\eta \cdot \frac{d}{d\theta_i}\bigl(CostFunction(outputOfNeuralNetwork)\bigr)$
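To make the update concrete, here is a minimal numeric sketch of that rule (the toy cost and all names are my own, not from the article): repeatedly stepping $\theta$ by $-\eta$ times the derivative of a one-dimensional cost.

```python
def cost(theta):
    # toy cost function, minimized at theta = 3
    return (theta - 3.0) ** 2

def d_cost_d_theta(theta):
    # analytic derivative of the toy cost
    return 2.0 * (theta - 3.0)

eta = 0.1    # learning rate
theta = 0.0  # initial weight

for _ in range(100):
    delta = -eta * d_cost_d_theta(theta)  # -eta * dC/d(theta)
    theta = theta + delta                 # theta := theta + delta

print(round(theta, 4))  # converges toward 3.0, the minimizer of the toy cost
```

Each step moves $\theta$ a small distance against the slope of the cost, which is exactly the quantity whose units the question is about.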
I was wondering whether the units of each variable here are consistent, making this a valid update operation.
My attempt at understanding this is:
- the weight $\theta_i$ is written in $[\theta_i units]$
- $\frac{d}{d\theta_i}(CostFunction(outputOfNeuralNetwork))$ is written in $[\frac{error units}{\theta_i units}]$
- the learning rate $\eta$ is written in $[\theta_i units]$; it is a small value, and the factor of $-1$ turns it into a small, backwards step along the $\theta_i$ axis.
Since the $[\theta_i units]$ of $\eta$ cancel against the $[\frac{error units}{\theta_i units}]$ of the derivative, the quantity $-\eta \cdot \frac{d}{d\theta_i}(CostFunction(outputOfNeuralNetwork))$ is written in $[error units]$, so we can add it to $\theta_i$ as long as we suppose that $[error units]$ and $[\theta_i units]$ are actually the same for all $i$.
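My unit bookkeeping above can be sketched mechanically, tracking each quantity's units as a dict of exponents (a toy helper of my own, not a real units library):

```python
from collections import Counter

def multiply_units(*unit_dicts):
    # units multiply by adding their exponents
    total = Counter()
    for u in unit_dicts:
        total.update(u)
    return {k: v for k, v in total.items() if v != 0}

grad_units = {"error": 1, "theta": -1}  # d(CostFunction)/d(theta_i)
eta_units = {"theta": 1}                # learning rate, per my assumption above

delta_units = multiply_units(eta_units, grad_units)
print(delta_units)  # {'error': 1}
```

The theta exponents cancel, leaving the step in error units, which match $[\theta_i units]$ only under the supposition that the two are the same.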
Have I understood this correctly?
EDIT: or perhaps $[error units]$ don't actually exist: the neurons in the very last layer are all probabilities (unitless, just like radians). The same goes for the units of a weight parameter; the weights just scale terms in a linear combination that looks like $a x_1 + b x_2 + c x_3 + ... + bias$, where $a, b, c, ...$ are the weight parameters. I take it the $bias$ parameters are also unitless values that shift the incoming input into the appropriate domain of some activation function. So I guess this is a valid update operation because there are no units?
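The linear-combination-plus-activation picture in the edit can be sketched like this (variable names are mine): a sigmoid squashes the weighted sum plus bias into a unitless value in $(0, 1)$, i.e. a probability.

```python
import math

def sigmoid(z):
    # standard logistic activation, maps any real number into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias):
    # a*x_1 + b*x_2 + c*x_3 + ... + bias, then activation
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

p = neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.2)
print(0.0 < p < 1.0)  # True: the output is a unitless probability
```

Whatever scale the raw weighted sum has, the activation's output is dimensionless, which is the sense in which the update could be "valid because there are no units".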