You might choose to demand predictions only after the first N steps of your sequence have elapsed; only then are the predictions trustworthy. You have to give your LSTM something to begin from, some context so to speak.
Usually you sum the errors your network produces across all timesteps, but in this case you ignore its outputs before the N-th timestep.
In other words, any prediction made before step N shouldn't contribute to the sum of your errors: TotalErrorOfTheSequence = error_N + error_(N+1) + ... + error_T
Notice that in this case, when backpropagating, no error gradient originates at the outputs before the N-th timestep (gradients from the later errors still flow backward through the earlier hidden states, which is what teaches the network to build up useful context during those warm-up steps).
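As a rough sketch of what that masking could look like (PyTorch-style; the shapes, the name warmup_steps and the MSE loss are my assumptions, not something from the original setup):

```python
import torch
import torch.nn as nn

# Hypothetical shapes: inputs (batch, T, features), targets (batch, T, 1)
lstm = nn.LSTM(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
criterion = nn.MSELoss()

warmup_steps = 10  # N: outputs before this step are ignored

def sequence_loss(inputs, targets):
    outputs, _ = lstm(inputs)   # (batch, T, hidden)
    preds = head(outputs)       # (batch, T, 1)
    # Only timesteps from N onwards contribute to the total error,
    # so no error gradient originates at the earlier outputs.
    return criterion(preds[:, warmup_steps:], targets[:, warmup_steps:])
```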
You can also learn the initial state directly, by backpropagating with respect to the actual cell_t0. Traditionally you would initialize cell_t0 with zeros, but because you can compute a gradient towards it, you can pull its starting value closer and closer to what is usually needed.
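A minimal sketch of that idea (again PyTorch-style; making both h0 and c0 trainable parameters is my assumption of how you'd wire it up):

```python
import torch
import torch.nn as nn

class LSTMWithLearnedInit(nn.Module):
    def __init__(self, input_size=8, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, batch_first=True)
        # Instead of fixed zeros, make the initial states trainable.
        # Gradients flow into them during backprop, pulling the starting
        # values closer to what is usually needed.
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))
        self.c0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):
        batch = x.size(0)
        h0 = self.h0.expand(-1, batch, -1).contiguous()
        c0 = self.c0.expand(-1, batch, -1).contiguous()
        out, _ = self.lstm(x, (h0, c0))
        return out
```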
It's also possible to perturb the learned starting state with noise, which does improve the quality of predictions. It mitigates overfitting to the early states of your sequence, where the error is usually the largest (little context is available) and where the weight corrections are therefore the strongest. See the 3rd trick in "Forecasting with Recurrent Neural Networks: 12 Tricks".
Here is my understanding; please correct it if it's wrong:
1) Train your initial cell state cell_t0 as a variable (as described above) to get a good value.
2) Construct a random variable centered around your learned cell_t0.
3) Collect a buffer of several hundred G-values, then sample from it randomly and use each sampled value as the maximum possible deviation from your learned cell_t0. In other words, you randomly perturb the learned cell_t0 by some sampled G-value. Each such G-value is the gradient computed with respect to the already-learned cell_t0.
I think it also makes sense to keep track of which G-values are the oldest and slowly remove them from the buffer once they become obsolete.
The noise is to be disabled during testing.
Once again, I'm not sure if I understood that approach correctly; a rough sketch of how I imagine it is below.
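This is purely my interpretation of steps 1-3 above, not the paper's exact procedure. The buffer size, the function names (record_grad, noisy_initial_state) and the way a sampled G-value bounds the noise are all assumptions:

```python
import collections
import torch

grad_buffer = collections.deque(maxlen=500)  # oldest G-values drop out automatically

def record_grad(c0_param):
    # Call after loss.backward(): store the gradient magnitude w.r.t.
    # the learned cell_t0 as a G-value.
    if c0_param.grad is not None:
        grad_buffer.append(c0_param.grad.detach().abs().clone())

def noisy_initial_state(c0_param, training=True):
    # During training, perturb the learned cell_t0 by at most one sampled
    # G-value per element; at test time, use the learned state as-is.
    if not training or not grad_buffer:
        return c0_param
    g = grad_buffer[torch.randint(len(grad_buffer), (1,)).item()]
    noise = (torch.rand_like(g) * 2 - 1) * g   # uniform in [-g, +g] elementwise
    return c0_param + noise
```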
Also have a look at these posts:
https://datascience.stackexchange.com/a/33994/43077
https://stats.stackexchange.com/a/319854/187816