
I am using TF Eager to train a stateful RNN (GRU).

I have several variable-length time sequences, each about 1 minute long, which I split into windows of length 1 s.

In TF Eager, like in Keras, if stateful=True, "the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch." (source)

Thus, how should I design my batches? I obviously can't sample random windows from random sequences. I also can't split a sequence into windows and place adjacent windows in the same batch (e.g. batch 1 = [[seq1 0-1s], [seq1 1-2s], [seq1 2-3s], ...]), as the state from the previous window won't get passed to the next window, which is the point of a stateful RNN.

I was thinking of mixing sequences in the same batch as in:

batch 1 = [[seq1 0-1s], [seq2 0-1s], [seq3 0-1s], ...]
batch 2 = [[seq1 1-2s], [seq2 1-2s], [seq3 1-2s], ...]
...

However, the issue there is that the sequences have different lengths, and thus some will finish before others.

So what is the best way to implement this?

(FYI, I couldn't find anything in the academic literature or the blogosphere that discusses this, so refs would be great.)

Thanks!

DankMasterDan
  • Have you considered padding all sequences with silence (all 0's) to get the same length? Your approach sounds like it's correct. – AlexR May 01 '19 at 10:13
  • I have not. Is that the standard practice? One con would be wasted inference, which would lengthen the training process. – DankMasterDan May 01 '19 at 13:09

1 Answer


Your specific case

After [seq1 0-1s] (the 1st second of the long sequence seq1) at index 0 of batch b, there is [seq1 1-2s] (the 2nd second of the same sequence seq1) at index 0 of batch b+1; this is exactly what is required when we set stateful=True.

Note that the samples inside each batch must be the same length; if this is done correctly, a difference in sequence length between (not inside) batches causes no problem. That is, when all samples from batch b have been processed, the next batch b+1 will be processed, and so on.
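For illustration, a minimal stateful GRU in tf.keras could be set up as below; the layer size and the window/feature dimensions are assumptions for the sketch, not values from the question:

```python
import tensorflow as tf

# Illustrative sizes, not taken from the question.
batch_size = 32     # index i of every batch always holds the same sequence
window_steps = 100  # time steps inside each 1-second window
n_features = 8      # feature dimension per time step

model = tf.keras.Sequential([
    tf.keras.layers.GRU(
        64,
        stateful=True,  # final state of sample i in batch b initialises sample i in batch b+1
        batch_input_shape=(batch_size, window_steps, n_features)),
    tf.keras.layers.Dense(1)
])

# Once all windows of the current set of sequences have been fed, clear the
# carried state before starting a new set of sequences (or a new epoch):
model.reset_states()
```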

A general example

As a general example, for stateful=True and batch_size=2, a data set like

seq1: s11, s12, s13, s14
seq2: s21, s22, s23
seq3: s31, s32, s33, s34, s35
seq4: s41, s42, s43, s44, s45, s46

where sij denotes the j-th time step of sequence i, must be structured like

    batch 1         batch 2         batch 3         batch 4  

0   s21, s22        s23, <pad>      s31, s32, s33   s34, s35, <pad>   ...
1   s11, s12        s13, s14        s41, s42, s43   s44, s45, s46

or like (with overlap)

    batch 1         batch 2         batch 3         

0   s21, s22        s22, s23        s23, <pad>    ...
1   s11, s12        s12, s13        s13, s14   

where, for example, the long sequence s21, s22, s23 (3 time steps) is broken down into two sub-sequences, s21, s22 and s23, <pad>. Also, as you can see, it is possible to have batches with different sequence lengths (by using a custom batch generator, sketched below).
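A sketch of one such custom batch generator (without overlap) is shown below; the fixed chunk length per group and the pad value are assumptions for illustration:

```python
import numpy as np

def stateful_batches(sequences, batch_size, chunk_len, pad_value=0.0):
    """Yield batches in which index i always continues the same sequence.

    sequences: list of arrays of shape (timesteps, features); lengths may differ.
    chunk_len: time steps per batch (all samples inside a batch share this length).
    Assumes len(sequences) is a multiple of batch_size; otherwise the last group
    must be dropped or filled with dummy sequences.
    """
    for g in range(0, len(sequences), batch_size):
        group = sequences[g:g + batch_size]
        longest = max(len(s) for s in group)
        # pad every sequence of the group up to a chunk-aligned common length
        total = int(np.ceil(longest / chunk_len)) * chunk_len
        padded = np.stack([
            np.pad(s, ((0, total - len(s)), (0, 0)), constant_values=pad_value)
            for s in group
        ])
        # emit consecutive chunks: batch b+1 continues where batch b stopped
        for start in range(0, total, chunk_len):
            yield padded[:, start:start + chunk_len, :]
        # the caller should reset the RNN state here, before the next group

# For the data set above, stateful_batches([seq2, seq1], batch_size=2, chunk_len=2)
# yields batch 1 = [[s21, s22], [s11, s12]] and batch 2 = [[s23, <pad>], [s13, s14]].
```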

Note that <pad> (padded values) should be masked to prevent the RNN from considering them as actual values (more info in this post). We can also avoid padded values entirely by opting for batch_size=1, which might be too restrictive (more info in this post).
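One way to apply such masking is a Masking layer in front of the stateful GRU; this is only a sketch, and the mask value, the batch size of 2 and the layer size are assumptions (the mask value must not occur in the real data):

```python
import tensorflow as tf

pad_value = 0.0  # assumed pad value; must match what the batch generator inserts

model = tf.keras.Sequential([
    # time steps whose features all equal pad_value are skipped by downstream layers
    tf.keras.layers.Masking(
        mask_value=pad_value,
        batch_input_shape=(2, None, 128)),  # batch_size=2 as in the example, variable steps
    tf.keras.layers.GRU(32, stateful=True),
    tf.keras.layers.Dense(1)
])
```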

Here are two examples of a sequence with 5 time steps:

            s11   s12   s13   s14   s15

example 1   23,   25,   27,   24,    28     # 5 temperature readings for t to t+4

example 2   I,    like, LSTM, very,  much   # 5 128-dim word embeddings
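In array terms, each example above is a single sample of shape (timesteps, features), and a batch stacks batch_size such samples. A quick NumPy illustration (values are placeholders):

```python
import numpy as np

# example 1: 5 temperature readings -> one sample of shape (5, 1)
temps = np.array([[23.0], [25.0], [27.0], [24.0], [28.0]])
print(temps.shape)  # (5, 1)

# example 2: 5 word embeddings of dimension 128 -> one sample of shape (5, 128)
words = np.random.rand(5, 128)

# two equal-length samples stacked together give a batch of shape (2, 5, 128)
batch = np.stack([words, np.random.rand(5, 128)])
print(batch.shape)  # (2, 5, 128)
```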

Some resources

  1. You may find this article on stateful vs stateless LSTM helpful. Some quotes from the article:

    The stateless LSTM with the same configuration may perform better on this problem than the stateful version.

    and

    When a large batch size is used, a stateful LSTM can be simulated with a stateless LSTM.

  2. This blog on Stateful LSTM in Keras by Philippe Remy

  3. Some opinions on Keras github, like this.

Esmailian
  • Thanks! 2 q's: (1) what do you mean when you say the 0'th index of batch 1 is s11, s12. Each batch index will only fit 1 second (e.g. only either s11 or s12)? Also, you increase this to 3 samples in batch 3,4 (e.g. s31 s32 s33). (2) What do you mean by <pad>? Do you mean place a sample with all 0's? Wouldn't this negatively affect training time? – DankMasterDan May 01 '19 at 16:11
  • @DankMasterDan I have added extra remarks regarding your questions; hope it helps. Samples inside a batch must be the same length, and by using masking there would be no wasted training time. – Esmailian May 01 '19 at 19:09
  • Esmailian, thanks! You answered my 2nd question, but not my first. Is the 0th sample of batch 1 s21 or s22? I don't see how it can be both. Concretely, if each sample s_ij is shape [5, 128] (as is a 128-dim word embedding), the batch shape will be [2, 5, 128]. So how can batch 1 contain s21, s22, s11, and s12? – DankMasterDan May 01 '19 at 20:43
  • @DankMasterDan each sij is one time step (e.g. one word with dimension [1 128]). A 1-second chunk of 1-minute sequence could be either 1 time-step or N time-steps (N > 1). – Esmailian May 02 '19 at 08:58
  • Got it. Had a brain fart there for a second. This is exactly what I was looking for! – DankMasterDan May 03 '19 at 17:55
  • I asked a related question on how to implement this in TensorFlow: https://stackoverflow.com/questions/55978050/using-tf-dataset-api-to-process-sequences-for-stateful-rnn – DankMasterDan May 03 '19 at 23:08
  • This is exactly what I've been looking for, thank you. Although it was rather hard to implement this reshaping as a general function. Does Keras not have a shaping function for its RNNs? – meliksahturker Dec 31 '20 at 08:54