I've just realized my prediction approach for LSTM might not be correct.
I am trying to predict a book's text character by character, reading over the book one letter at a time. The way I've approached the problem is as follows:
   b              c              d              e
   ^              ^              ^              ^
   |   carry cell state forward
LSTM_t0 -----> LSTM_t1 -----> LSTM_t2 -----> LSTM_t3
   ^              ^              ^              ^
   a              b              c              d
This means I have 4 timesteps, and at each one I feed the next letter into the LSTM, expecting it to immediately predict the letter that follows.
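To make this concrete, here is a minimal PyTorch sketch of that first setup (hypothetical sizes and variable names, not my actual model): every timestep's output is scored, so the loss has 4 terms.

```python
import torch
import torch.nn as nn

vocab_size, hidden_size, seq_len = 27, 64, 4          # made-up sizes for illustration
embed = nn.Embedding(vocab_size, hidden_size)
lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
head = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

x = torch.randint(0, vocab_size, (1, seq_len))        # inputs:  a b c d (placeholder ids)
y = torch.randint(0, vocab_size, (1, seq_len))        # targets: b c d e (placeholder ids)

out, _ = lstm(embed(x))                               # (1, 4, hidden_size), one output per timestep
logits = head(out)                                    # (1, 4, vocab_size)
loss = loss_fn(logits.view(-1, vocab_size), y.view(-1))  # 4 per-timestep loss terms, averaged
loss.backward()
```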
Should I instead do this:
ignore         ignore         ignore            e
   ^              ^              ^              ^
LSTM_t0 -----> LSTM_t1 -----> LSTM_t2 -----> LSTM_t3
   ^              ^              ^              ^
   a              b              c              d
In the first case I get 4 loss values, but in the second example I only have 1 source of gradient, at t3.
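The second setup, reusing the modules and tensors from the sketch above (again just an illustration), would instead score only the final timestep:

```python
out, _ = lstm(embed(x))                               # fresh forward pass over a b c d
logits_last = head(out[:, -1, :])                     # (1, vocab_size), prediction for 'e' only
loss_last = loss_fn(logits_last, y[:, -1])            # single loss term, at t3
loss_last.backward()                                  # only 1 source of gradient
```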
My main concern is that in the first example I demand the LSTM predict 'b' and 'c' without supplying it enough previous context. That's fine for 'd' and 'e', but asking for an answer at timesteps 0 and 1 seems a bit unfair.
What would be best for this particular example?
Now, there may be better models for f(abcd) that are not sequential, but that is outside the scope of this question.