
I am trying to gradient-check my C++ LSTM. The structure is as follows:

output vector (5D)

Dense Layer with softmax (5 neurons)

LSTM layer (5 neurons)

input vector (5D)

My gradient check uses the following formula, where $g$ is the backpropagated gradient and $n$ is the numerical gradient:

$$d = \frac{\vert \vert (g-n) \vert \vert _2 }{ \vert \vert g \vert \vert _2 + \vert \vert n \vert \vert _2}$$
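
For clarity, here is a minimal sketch of how that discrepancy can be computed, assuming both gradients are flattened into plain float arrays (the names are illustrative, not my actual code):

```
#include <cmath>
#include <cstddef>
#include <vector>

// Relative discrepancy d = ||g - n||_2 / (||g||_2 + ||n||_2)
float gradientDiscrepancy(const std::vector<float>& g,   // backpropagated gradient
                          const std::vector<float>& n)   // numerical gradient
{
    float diffNorm = 0.0f, gNorm = 0.0f, nNorm = 0.0f;
    for (std::size_t i = 0; i < g.size(); ++i) {
        diffNorm += (g[i] - n[i]) * (g[i] - n[i]);
        gNorm    += g[i] * g[i];
        nNorm    += n[i] * n[i];
    }
    return std::sqrt(diffNorm) / (std::sqrt(gNorm) + std::sqrt(nNorm));
}
```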

The Dense Layer returns a discrepancy of 2.843e-05 with an epsilon of 5e-3.

With an epsilon of 5e-5 or lower, the network doesn't register any change in cost at all; anything greater than 5e-2 results in low precision.

This is only for the top layer, and I am using 32-bit float values for everything.
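
For reference, a central-difference numerical gradient for a single weight looks roughly like the sketch below (illustrative names; `computeCost` stands in for a full forward pass over the data):

```
#include <functional>

// Sketch of a central-difference estimate dCost/dw for one weight.
float numericalGradient(float& w, float eps,
                        const std::function<float()>& computeCost)
{
    const float original = w;

    w = original + eps;
    const float costPlus = computeCost();

    w = original - eps;
    const float costMinus = computeCost();

    w = original;                              // restore the weight
    // With 32-bit floats, a tiny eps can change the cost by less than the
    // rounding error of the forward pass, so the two costs round to the same
    // value -- which would explain the "no change in cost" behaviour at 5e-5.
    return (costPlus - costMinus) / (2.0f * eps);
}
```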


For the LSTM layer, I am using 5e-3 as well, but the discrepancy is 0.95.


The LSTM seems to converge fairly quickly on any task.

The 0.95 bothered me, so I manually compared the NumericalGradient array against the BackpropGradient array, side by side. All of the gradients match in sign; they only differ in the magnitude of each entry.

For example:

    numerical:    backpropped:
    -0.015223     -0.000385
     0.000000      0.000000
    -0.058794     -0.001509
    -0.000381     -9.238e-06
     9.537e-05     2.473e-06
     0.000215      6.266e-06
    -0.015223     -0.000385
    ...

As you can see, the signs do indeed match, and the numerical gradient is always larger than the backpropagated gradient.

Would you say this is acceptable, and can I blame float precision?


Edit:

Somewhat solved: I simply forgot to turn off "average the initial gradients by the number of timesteps" during backprop of my LSTM.

That's why my gradients were always smaller in the "backpropped" column.
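
As a toy illustration of the effect (made-up numbers, not my LSTM code): if the checked cost is the sum of the per-timestep losses, the analytic gradient has to be the sum of the per-timestep gradients, not their mean.

```
#include <iostream>
#include <vector>

int main() {
    // hypothetical per-timestep gradients dC_t/dw for one weight
    const std::vector<float> perStepGrad = {0.4f, 0.1f, 0.3f, 0.2f};
    const int T = static_cast<int>(perStepGrad.size());

    float summed = 0.0f;
    for (float g : perStepGrad) summed += g;   // what the numerical check measures

    float averaged = summed / T;               // the extra division described above

    // `averaged` has the same sign as `summed` but is T times too small,
    // matching the pattern in the side-by-side comparison.
    std::cout << "summed: " << summed << "  averaged: " << averaged << '\n';
}
```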

I am now getting a discrepancy of 0.00025 for the LSTM.

Edit: setting epsilon to 0.02 seems like a sweet spot, as it results in a discrepancy of 6.5e-05. Anything larger or smaller makes the discrepancy deviate from 6.5e-05, so it seems like a numerical issue. Only 2 layers deep though, which is weird.

Has anyone run into this precision issue before?

Kari

1 Answer


The fact that there is a sweet spot is a common issue in numerical differentiation. A high epsilon gives you a large error because the numerical derivative has truncation error $O(\epsilon)$ (or $O(\epsilon^2)$ if you use the centered difference), while a very low epsilon leads to cancellation error (see the catastrophic cancellation question), because you are subtracting two numbers that are very close to each other. It seems that $\epsilon = 0.01$ is a fair trade-off between these two extreme situations in your case.
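
You can reproduce the trade-off in isolation with a tiny float experiment, independent of your network (a sketch using $f(x) = e^x$, whose exact derivative is known):

```
#include <cmath>
#include <cstdio>

int main() {
    const float x = 1.0f;
    const float exact = std::exp(x);        // d/dx exp(x) = exp(x)

    // Error of the centered difference in 32-bit floats: it first shrinks
    // as epsilon shrinks (O(eps^2) truncation), then grows again once the
    // subtraction of two nearly equal values dominates (cancellation).
    for (float eps : {1e-1f, 1e-2f, 1e-3f, 1e-4f, 1e-5f, 1e-6f}) {
        const float numeric = (std::exp(x + eps) - std::exp(x - eps)) / (2.0f * eps);
        std::printf("eps = %g   |error| = %g\n", eps, std::fabs(numeric - exact));
    }
}
```

For 32-bit floats the optimum typically lands near the cube root of the machine epsilon, i.e. somewhere around $5\cdot10^{-3}$ to $10^{-2}$, which is consistent with the sweet spot you observed.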

David Masip