I've backed myself into a corner with this. Can someone please explain?
I feel I'm missing something obvious...
If, for an LSTM, each layer is trained with inputs from t and t-1, then that would mean that if I've got a training set of 10 000 observations, the network is trained to take 10 000 observations and produce a result as a function of all of them. If I then use it on a test set of, say, 1 000 observations, why would it work?
Or, if I want to make a prediction from a single observation, why would that work at all?
In the case of LSTMs, should the test set (in the toy example above) be 10 000 observations long (i.e. 9 000 old 'train' observations and 1 000 new 'test' ones)?
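To make the setup concrete, here is a minimal sketch of the recurrence as I understand it from the standard LSTM equations. It is not taken from any library: it is a scalar LSTM (one input feature, one hidden unit), the weights are random rather than trained, and the names `init_params`, `lstm_step` and `run_lstm` are made up for illustration. The step function only ever sees x_t and the state carried over from t-1, and the same parameters are reused at every step, so the same function runs on 10 000, 1 000 or a single observation.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(seed=0):
    """Random (untrained) parameters for a scalar LSTM.

    Per gate: (weight on x_t, weight on h_{t-1}, bias).
    Nothing here depends on the sequence length.
    """
    rng = random.Random(seed)
    return {
        gate: (rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5), 0.0)
        for gate in ("i", "f", "o", "g")
    }


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step: uses only x_t and the state from t-1."""

    def pre(gate):
        w_x, w_h, b = params[gate]
        return w_x * x_t + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate
    f = sigmoid(pre("f"))    # forget gate
    o = sigmoid(pre("o"))    # output gate
    g = math.tanh(pre("g"))  # candidate cell value
    c_t = f * c_prev + i * g
    h_t = o * math.tanh(c_t)
    return h_t, c_t


def run_lstm(xs, params):
    """Apply the same step, with the same parameters, to every observation."""
    h, c = 0.0, 0.0  # initial state (an assumption: start from zeros)
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
    return h


params = init_params()
data_rng = random.Random(1)
outputs = {}
for length in (10_000, 1_000, 1):
    xs = [data_rng.gauss(0.0, 1.0) for _ in range(length)]
    outputs[length] = run_lstm(xs, params)
```

In this sketch `params` always holds 12 numbers, whatever the length of `xs`. The sequence length only changes how many times `lstm_step` is called.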