I watched Andrew Ng's Deep Learning course, and he said we should initialize the weights W with small values, like:

parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
But in the last application assignment, they chose another way:
layers_dims = [12288, 20, 7, 5, 1]
import numpy as np

def initialize_parameters_deep(layer_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layer_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) / np.sqrt(layer_dims[l - 1])
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))
        assert parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1])
        assert parameters['b' + str(l)].shape == (layer_dims[l], 1)
    return parameters
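For comparison, here is a small sketch (assuming the same layers_dims as above) showing how different the weight scales actually are for a hidden layer. The layer shape (7, 20) corresponds to the third layer in layers_dims; the specific seed is arbitrary:

```python
import numpy as np

np.random.seed(3)
fan_in = 20  # number of inputs feeding the layer (layer_dims[l - 1])

# Course-style "small" initialization: scale by a fixed 0.01
w_small = np.random.randn(7, fan_in) * 0.01

# Assignment-style scaling: divide by sqrt of the fan-in
w_scaled = np.random.randn(7, fan_in) / np.sqrt(fan_in)

print(np.std(w_small))   # roughly 0.01, regardless of fan_in
print(np.std(w_scaled))  # roughly 1/sqrt(20), about 0.22
```

So the two schemes give comparable weights only when the fan-in happens to be near 10,000 (as for the 12288-input first layer), but for the narrow later layers the fixed 0.01 is more than twenty times smaller than the 1/sqrt(fan-in) scale.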
The result of this way is very good, but if I initialize W the old way shown above, the model only reaches about 34% accuracy. Can someone explain why?