I've been reading the literature on vanishing/exploding gradients, and specifically how they connect to weight initialization. An idea I've come across repeatedly, and which seems central to this area, is that we want the variance to stay roughly constant across the layers of a neural network: if $v_n$ is the variance of the $n$-th layer's outputs, we want all the $v_n$ to be approximately equal. For example, Kumar (2017) writes:
> In this paper, I revisit the oldest, and most widely used approach to the problem with the goal of resolving some of the unanswered theoretical questions which remain in the literature. The problem can be stated as follows: If the weights in a neural network are initialized using samples from a normal distribution, $N(0, v^2)$, how should $v^2$ be chosen to ensure that the variance of the outputs from the different layers are approximately the same?
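To check that I'm reading the problem statement correctly, here is a small NumPy experiment I put together (not from either paper; the depth, width, and the parametrization $v = c/\sqrt{\text{fan-in}}$ are my own choices for illustration). It stacks many linear layers with weights drawn from $N(0, v^2)$ and tracks the empirical variance of the layer outputs:

```python
# Toy sketch (mine, not from Kumar 2017 or Glorot & Bengio): stack fully connected
# *linear* layers with weights drawn from N(0, v^2) and watch how the variance of
# the layer outputs evolves for different choices of v = c / sqrt(fan_in).
import numpy as np

rng = np.random.default_rng(0)
depth, width, n_samples = 50, 256, 1000

x = rng.standard_normal((width, n_samples))    # inputs with roughly unit variance

for c in (0.9, 1.0, 1.1):
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, c / np.sqrt(width), size=(width, width))
        h = W @ h                              # linear layer, no bias, no activation
    print(f"c = {c}: variance of layer {depth} outputs = {h.var():.3e}")
```

In this toy run, $c = 1$ keeps the output variance within a small factor of 1, while $c = 0.9$ drives it toward zero and $c = 1.1$ blows it up after only 50 layers, which I take to be the vanishing/exploding behaviour the initialization is meant to avoid.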
Kumar then claims that "the first systematic analysis of this problem was conducted by Glorot and Bengio", citing Glorot & Bengio (2010). But that paper seems to just assume that the reader already accepts that keeping the variance stable is a good idea; I can't find anything resembling an explanation in it. They simply state:
> From a forward-propagation point of view, to keep information flowing we would like that $$\forall (i, i'), \quad \mathrm{Var}\left[z^i\right] = \mathrm{Var}\left[z^{i'}\right]$$ From a back-propagation point of view we would similarly like to have $$\forall (i, i'), \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^{i'}}\right]$$
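If I unpack the forward condition for a single fully connected layer in the way I *think* it is meant, with the variance taken over the random draw of the weights at initialization, and assuming zero-mean i.i.d. weights, zero-mean inputs independent of the weights, and an activation that behaves like the identity near 0, I get the usual back-of-the-envelope calculation (my reconstruction, in the paper's notation $s^i = z^i W^i + b^i$, $z^{i+1} = f(s^i)$, with $n_i$ units in layer $i$): $$\mathrm{Var}\left[z^{i+1}\right] = n_i\,\mathrm{Var}\left[W^i\right]\,\mathrm{Var}\left[z^i\right],$$ so the forward condition holds when $n_i\,\mathrm{Var}[W^i] = 1$, the analogous backward calculation gives $n_{i+1}\,\mathrm{Var}[W^i] = 1$, and the compromise between the two is the familiar $\mathrm{Var}[W^i] = \frac{2}{n_i + n_{i+1}}$.

Even granting that calculation, I'm stuck on three questions: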
1. What variance is actually being taken here? Variance with respect to what?
2. What exactly is helped by keeping the variance stable in this way?
3. Whatever the answer to (2) is, what is the proof or evidence that it actually holds?