
I am trying to gradient-check my C++ LSTM. The structure is as follows:

output vector (5D)

Dense Layer with softmax (5 neurons)

LSTM layer (5 neurons)

input vector (5D)

My gradient check uses the following formula, where $g$ is the backpropagated gradient and $n$ is the numerical gradient:

$$d = \frac{\vert \vert (g-n) \vert \vert _2 }{ \vert \vert g \vert \vert _2 + \vert \vert n \vert \vert _2}$$
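
For clarity, here is a minimal sketch of how that discrepancy can be computed, assuming both gradients are flattened into plain float arrays (the names are illustrative, not my actual code):

```
#include <cmath>
#include <cstddef>
#include <vector>

// Relative discrepancy d = ||g - n||_2 / (||g||_2 + ||n||_2)
float gradientDiscrepancy(const std::vector<float>& g,   // backpropagated gradient
                          const std::vector<float>& n)   // numerical gradient
{
    float diffNorm = 0.0f, gNorm = 0.0f, nNorm = 0.0f;
    for (std::size_t i = 0; i < g.size(); ++i) {
        diffNorm += (g[i] - n[i]) * (g[i] - n[i]);
        gNorm    += g[i] * g[i];
        nNorm    += n[i] * n[i];
    }
    return std::sqrt(diffNorm) / (std::sqrt(gNorm) + std::sqrt(nNorm));
}
```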

The Dense Layer returns a discrepancy of 2.843e-05 with an epsilon of 5e-3.

With an epsilon of 5e-5 or lower, the network doesn't register any change in cost at all; anything greater than 5e-2 results in low precision.

This is only for the top layer, and I am using 32-bit float values for everything.
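
For reference, a central-difference numerical gradient for a single weight looks roughly like the sketch below (illustrative names; `computeCost` stands in for a full forward pass over the data):

```
#include <functional>

// Sketch of a central-difference estimate dCost/dw for one weight.
float numericalGradient(float& w, float eps,
                        const std::function<float()>& computeCost)
{
    const float original = w;

    w = original + eps;
    const float costPlus = computeCost();

    w = original - eps;
    const float costMinus = computeCost();

    w = original;                              // restore the weight
    // With 32-bit floats, a tiny eps can change the cost by less than the
    // rounding error of the forward pass, so the two costs round to the same
    // value -- which would explain the "no change in cost" behaviour at 5e-5.
    return (costPlus - costMinus) / (2.0f * eps);
}
```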


For the LSTM layer, I am using 5e-3 as well, but the discrepancy is 0.95.


The LSTM seems to converge fairly quickly on any task.

The 0.95 bothered me, so I manually compared the NumericalGradient array against the BackpropGradient array, side by side. All of the gradients match in sign; they only differ in the magnitude of each entry.

For example:

    numerical:    backpropped:
    -0.015223     -0.000385
     0.000000      0.000000
    -0.058794     -0.001509
    -0.000381     -9.238e-06
     9.537e-05     2.473e-06
     0.000215      6.266e-06
    -0.015223     -0.000385
    ...

As you can see, the signs do indeed match, and the numerical gradient is always larger than the backpropagated gradient.

Would you say this is acceptable, and can I blame float precision?


Edit:

Somewhat solved: I simply forgot to turn off "average the initial gradients by the number of timesteps" during backprop of my LSTM.

That's why my gradients were always smaller in the "backpropped" column.
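
As a toy illustration of the effect (made-up numbers, not my LSTM code): if the checked cost is the sum of the per-timestep losses, the analytic gradient has to be the sum of the per-timestep gradients, not their mean.

```
#include <iostream>
#include <vector>

int main() {
    // hypothetical per-timestep gradients dC_t/dw for one weight
    const std::vector<float> perStepGrad = {0.4f, 0.1f, 0.3f, 0.2f};
    const int T = static_cast<int>(perStepGrad.size());

    float summed = 0.0f;
    for (float g : perStepGrad) summed += g;   // what the numerical check measures

    float averaged = summed / T;               // the extra division described above

    // `averaged` has the same sign as `summed` but is T times too small,
    // matching the pattern in the side-by-side comparison.
    std::cout << "summed: " << summed << "  averaged: " << averaged << '\n';
}
```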

I am now getting a discrepancy of 0.00025 for the LSTM.

Edit: setting epsilon to 0.02 seems like a sweet spot, as it results in a discrepancy of 6.5e-05. Anything larger or smaller makes the discrepancy deviate from 6.5e-05, so it seems like a numerical issue. Only 2 layers deep though, which is weird.

Has anyone run into this precision issue before?

Kari

1 Answer


The fact that there is a sweet spot is a common issue in numerical differentiation. A high epsilon gives you a large error because the numerical derivative has truncation error $O(\epsilon)$ (or $O(\epsilon^2)$ if you use the centered difference), while a very low epsilon leads to cancellation error (see the catastrophic cancellation question), because you are subtracting two numbers that are very close to each other. It seems that $\epsilon = 0.01$ is a fair trade-off between these two extreme situations in your case.
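
You can reproduce the trade-off in isolation with a tiny float experiment, independent of your network (a sketch using $f(x) = e^x$, whose exact derivative is known):

```
#include <cmath>
#include <cstdio>

int main() {
    const float x = 1.0f;
    const float exact = std::exp(x);        // d/dx exp(x) = exp(x)

    // Error of the centered difference in 32-bit floats: it first shrinks
    // as epsilon shrinks (O(eps^2) truncation), then grows again once the
    // subtraction of two nearly equal values dominates (cancellation).
    for (float eps : {1e-1f, 1e-2f, 1e-3f, 1e-4f, 1e-5f, 1e-6f}) {
        const float numeric = (std::exp(x + eps) - std::exp(x - eps)) / (2.0f * eps);
        std::printf("eps = %g   |error| = %g\n", eps, std::fabs(numeric - exact));
    }
}
```

For 32-bit floats the optimum typically lands near the cube root of the machine epsilon, i.e. somewhere around $5\cdot10^{-3}$ to $10^{-2}$, which is consistent with the sweet spot you observed.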

David Masip