I've been reading the literature on vanishing/exploding gradients, and specifically how they connect to weight initialization. An idea I've come across repeatedly, and which seems central to this area, is that we want the variance to stay roughly constant across the layers of a neural network: if $v_n$ is the variance of the $n$-th layer's outputs, we want all the $v_n$ to be approximately equal. For example, Kumar (2017) writes:
> In this paper, I revisit the oldest, and most widely used approach to the problem with the goal of resolving some of the unanswered theoretical questions which remain in the literature. The problem can be stated as follows: If the weights in a neural network are initialized using samples from a normal distribution, $N(0, v^2)$, how should $v^2$ be chosen to ensure that the variance of the outputs from the different layers are approximately the same?
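To check that I'm reading the problem statement correctly, here is a small NumPy experiment I put together (not from either paper; the depth, width, and the parametrization $v = c/\sqrt{\text{fan-in}}$ are my own choices for illustration). It stacks many linear layers with weights drawn from $N(0, v^2)$ and tracks the empirical variance of the layer outputs:

```python
# Toy sketch (mine, not from Kumar 2017 or Glorot & Bengio): stack fully connected
# *linear* layers with weights drawn from N(0, v^2) and watch how the variance of
# the layer outputs evolves for different choices of v = c / sqrt(fan_in).
import numpy as np

rng = np.random.default_rng(0)
depth, width, n_samples = 50, 256, 1000

x = rng.standard_normal((width, n_samples))    # inputs with roughly unit variance

for c in (0.9, 1.0, 1.1):
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, c / np.sqrt(width), size=(width, width))
        h = W @ h                              # linear layer, no bias, no activation
    print(f"c = {c}: variance of layer {depth} outputs = {h.var():.3e}")
```

In this toy run, $c = 1$ keeps the output variance within a small factor of 1, while $c = 0.9$ drives it toward zero and $c = 1.1$ blows it up after only 50 layers, which I take to be the vanishing/exploding behaviour the initialization is meant to avoid.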
Kumar then claims that "the first systematic analysis of this problem was conducted by Glorot and Bengio", citing Glorot & Bengio (2010). But that paper seems to just assume that the reader already accepts that keeping the variance stable is a good idea; I can't find anything resembling an explanation in it. They simply state:
> From a forward-propagation point of view, to keep information flowing we would like that $$\forall (i, i'), \quad \mathrm{Var}\left[z^i\right] = \mathrm{Var}\left[z^{i'}\right]$$ From a back-propagation point of view we would similarly like to have $$\forall (i, i'), \quad \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^i}\right] = \mathrm{Var}\left[\frac{\partial \mathrm{Cost}}{\partial s^{i'}}\right]$$
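If I unpack the forward condition for a single fully connected layer in the way I *think* it is meant, with the variance taken over the random draw of the weights at initialization, and assuming zero-mean i.i.d. weights, zero-mean inputs independent of the weights, and an activation that behaves like the identity near 0, I get the usual back-of-the-envelope calculation (my reconstruction, in the paper's notation $s^i = z^i W^i + b^i$, $z^{i+1} = f(s^i)$, with $n_i$ units in layer $i$): $$\mathrm{Var}\left[z^{i+1}\right] = n_i\,\mathrm{Var}\left[W^i\right]\,\mathrm{Var}\left[z^i\right],$$ so the forward condition holds when $n_i\,\mathrm{Var}[W^i] = 1$, the analogous backward calculation gives $n_{i+1}\,\mathrm{Var}[W^i] = 1$, and the compromise between the two is the familiar $\mathrm{Var}[W^i] = \frac{2}{n_i + n_{i+1}}$.

Even granting that calculation, I'm stuck on three questions: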
1. What variance is actually being taken here? Variance with respect to what?
2. What exactly is helped by keeping the variance stable in this way?
3. Whatever the answer to (2) is, what is the proof or evidence that it actually holds?