Let's assume a vanilla MLP for classification with a given activation function for hidden layers.
I know it is considered best practice to normalise the network input to [0, 1] when the hidden activation is sigmoid, and to [-0.5, 0.5] when it is tanh.
What about ReLU?
Should I normalise the network input to [0, 1], [-0.5, 0.5], or [-1, 1]?
Any known best practices there?
To be clear, I am not asking about normalising the input of the ReLU itself, e.g. applying Batch Normalization just before or just after the ReLU (https://arxiv.org/pdf/1508.00330).
I am asking about normalising the input of the whole network.
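To make the question concrete, here is a minimal sketch of the three scalings I am asking about, using NumPy. The data and variable names are just for illustration; the statistics are computed on the training set only.

```python
import numpy as np

# Placeholder data for illustration (rows = samples, columns = features)
X_train = np.random.rand(100, 20) * 10.0
X_test = np.random.rand(20, 20) * 10.0

# Per-feature min/max computed on the training set only
x_min = X_train.min(axis=0)
x_max = X_train.max(axis=0)

def scale(X, lo, hi):
    """Min-max scale X into [lo, hi] using training-set statistics."""
    X01 = (X - x_min) / (x_max - x_min)  # map to [0, 1]
    return X01 * (hi - lo) + lo          # shift/stretch to [lo, hi]

X_01   = scale(X_train, 0.0, 1.0)    # option 1: [0, 1]
X_half = scale(X_train, -0.5, 0.5)   # option 2: [-0.5, 0.5]
X_11   = scale(X_train, -1.0, 1.0)   # option 3: [-1, 1]

# The same transform would be applied to X_test with the training-set min/max.
X_test_01 = scale(X_test, 0.0, 1.0)
```

Which of these (if any) is the recommended choice when the hidden layers use ReLU?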