You might choose to demand predictions only after the first N steps of your sequence have elapsed; only then are the predictions trustworthy. You have to give your LSTM something to begin from, some context so to speak.
Usually you sum the errors your network produces across all timesteps, but in this case you ignore its outputs before the N-th timestep.
In other words, any prediction made before step N shouldn't contribute to the sum of your errors: TotalErrorOfTheSequence = error_N + error_(N+1) + ... + error_T
Notice that in this case, when backpropagating, no error gradient originates at the outputs before the N-th timestep (gradients from the later errors still flow backward through the earlier hidden states, which is what teaches the network to build up useful context during those warm-up steps).
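As a rough sketch of what that masking could look like (PyTorch-style; the shapes, the name warmup_steps and the MSE loss are my assumptions, not something from the original setup):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: inputs (batch, T, features), targets (batch, T, 1)
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
criterion = nn.MSELoss()

warmup_steps = 10  # N: outputs before this step are ignored

def sequence_loss(inputs, targets):
    outputs, _ = lstm(inputs)   # (batch, T, hidden)
    preds = head(outputs)       # (batch, T, 1)
    # Only timesteps from N onwards contribute to the total error,
    # so no error gradient originates at the earlier outputs.
    return criterion(preds[:, warmup_steps:], targets[:, warmup_steps:])
```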
You can also learn the initial state directly, by backpropagating with respect to the actual cell_t0. Traditionally you would initialize cell_t0 with zeros, but because you can compute a gradient towards it, you can pull its starting value closer and closer to what is usually needed.
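A minimal sketch of that idea (again PyTorch-style; making both h0 and c0 trainable parameters is my assumption of how you'd wire it up):

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    def __init__(self, input_size=8, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Instead of fixed zeros, make the initial states trainable.
        # Gradients flow into them during backprop, pulling the starting
        # values closer to what is usually needed.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):
        batch = x.size(0)
        h0 = self.h0.expand(-1, batch, -1).contiguous()
        c0 = self.c0.expand(-1, batch, -1).contiguous()
        out, _ = self.lstm(x, (h0, c0))
        return out
```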
It's also possible to perturb the learned starting state with noise, which does improve the quality of predictions. It mitigates overfitting to the early states of your sequence, where the error is usually the largest (little context is available) and where the weight corrections are therefore the strongest. See the 3rd trick in "Forecasting with Recurrent Neural Networks: 12 Tricks".
Here is my understanding; please correct it if it's wrong:
1) Train your initial cell state cell_t0 as a variable (as described above) to get a good value.
2) Construct a random variable centered around your learned cell_t0.
3) Collect a buffer of several hundred G-values, then sample from it randomly and use each sampled value as the maximum possible deviation from your learned cell_t0. In other words, you randomly perturb the learned cell_t0 by some sampled G-value. Each such G-value is the gradient computed with respect to the already-learned cell_t0.
I think it also makes sense to keep track of which G-values are the oldest and slowly remove them from the buffer once they become obsolete.
The noise is to be disabled during testing.
Once again, I'm not sure if I understood that approach correctly; a rough sketch of how I imagine it is below.
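This is purely my interpretation of steps 1-3 above, not the paper's exact procedure. The buffer size, the function names (record_grad, noisy_initial_state) and the way a sampled G-value bounds the noise are all assumptions:

```python
import collections
import torch

grad_buffer = collections.deque(maxlen=500)  # oldest G-values drop out automatically

def record_grad(c0_param):
    # Call after loss.backward(): store the gradient magnitude w.r.t.
    # the learned cell_t0 as a G-value.
    if c0_param.grad is not None:
        grad_buffer.append(c0_param.grad.detach().abs().clone())

def noisy_initial_state(c0_param, training=True):
    # During training, perturb the learned cell_t0 by at most one sampled
    # G-value per element; at test time, use the learned state as-is.
    if not training or not grad_buffer:
        return c0_param
    g = grad_buffer[torch.randint(len(grad_buffer), (1,)).item()]
    noise = (torch.rand_like(g) * 2 - 1) * g   # uniform in [-g, +g] elementwise
    return c0_param + noise
```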
Also have a look at these posts:
https://datascience.stackexchange.com/a/33994/43077
https://stats.stackexchange.com/a/319854/187816