
It's common practice to normalize the inputs to a neural network.

Let's assume we have a vector of activations.

One such technique, Layer Normalization, simply looks at the vector's components, re-centers the vector from its mean $\mu$ to zero, then divides by the standard deviation $\sigma$.

How is it then possible to distinguish the activations [1, 2, 3, 4] from [4, 5, 6, 7] if both are re-centered to the same vector [-1.5, -0.5, 0.5, 1.5] and then divided by the standard deviation? I can see the same problem when merely normalizing the input vectors of any neural net.
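To make the concern concrete, here is a minimal NumPy sketch (the `layer_norm` helper and the example vectors are just illustrative, not taken from the paper) showing that both vectors map to the same normalized output:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Re-center to zero mean, then divide by the standard deviation
    mu = x.mean()
    sigma = x.std()
    return (x - mu) / (sigma + eps)

a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([4.0, 5.0, 6.0, 7.0])

print(layer_norm(a))  # [-1.3416, -0.4472, 0.4472, 1.3416]
print(layer_norm(b))  # same values: the shift by 3 is normalized away
```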

Edit:

There seems to be a hint in the first half of page 4 of the paper; however, due to my weakness in maths I can't comprehend it :(

Edit after accepting the answer:

Don't forget that Layer Norm (and Batch Norm) both have learnable gain and bias terms. If normalization hurts the network, the gain can be tweaked to undo the division by the standard deviation, and the bias can be tweaked to undo the shift (the re-centering). This lets some neurons actually pay attention to scaling and shifting when it really matters (see the sketch below).
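As a rough sketch of that last point, assuming the usual affine formulation $y = g \odot \hat{x} + b$ with a learnable `gain` and `bias` (the specific values below are contrived so they undo the normalization exactly):

```python
import numpy as np

def layer_norm_affine(x, gain, bias, eps=1e-5):
    # Normalize, then apply the learnable gain and bias
    x_hat = (x - x.mean()) / (x.std() + eps)
    return gain * x_hat + bias

x = np.array([1.0, 2.0, 3.0, 4.0])

# If training drives gain -> sigma and bias -> mu, the normalization is undone:
g = np.full(4, x.std())
b = np.full(4, x.mean())
print(layer_norm_affine(x, g, b))  # approximately [1, 2, 3, 4] again
```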

Kari

1 Answer

The math you are talking about is in equation (7):

Let $\mathbf{x}'$ be a new data point obtained by re-scaling $\mathbf{x}$ by $\delta$. Then we have

$$\mathbf{h}' = f\left(\frac{\mathbf{g}}{\sigma'} \odot \left(W\mathbf{x}' - \mu'\right) + \mathbf{b}\right) = f\left(\frac{\mathbf{g}}{\delta\sigma} \odot \left(\delta W\mathbf{x} - \delta\mu\right) + \mathbf{b}\right) = \mathbf{h}$$

It is easy to see re-scaling individual data points does not change the model’s prediction under layer normalization. Similar to the re-centering of the weight matrix in layer normalization, we can also show that batch normalization is invariant to re-centering of the dataset.

It shows that the prediction for $x'$ is the same as for $x$. So what you say is right: they are indistinguishable.
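Here is a quick numerical sanity check of that invariance; the weight matrix `W`, gain `g`, bias `b`, the `tanh` activation, and the scale `delta` below are arbitrary, made-up choices:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))          # arbitrary weight matrix
g = rng.normal(size=3)               # gain
b = rng.normal(size=3)               # bias
x = np.array([1.0, 2.0, 3.0, 4.0])
delta = 5.0                          # re-scaling factor

def ln_layer(x):
    a = W @ x                            # pre-activations
    a_hat = (a - a.mean()) / a.std()     # layer normalization
    return np.tanh(g * a_hat + b)        # any nonlinearity; tanh as an example

print(np.allclose(ln_layer(x), ln_layer(delta * x)))  # True: same prediction
```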

And I think this is what we want, because it improves generalization!

For example, in image classification you may be trying to detect orangutans: if an image $I$ represents an orangutan, then $(I \times 5) + 2$ still represents an orangutan!