
I am using TF Eager to train a stateful RNN (GRU).

I have several variable-length time sequences, each about 1 minute long, which I split into windows of length 1 s.

In TF Eager, like in Keras, if stateful=True, "the last state for each sample at index i in a batch will be used as initial state for the sample of index i in the following batch." (source)

Thus, how should I design my batches? I obviously can't sample random windows from random sequences. I also can't split a sequence into windows and place adjacent windows in the same batch (e.g. batch 1 = [[seq1 0-1s], [seq1 1-2s], [seq1 2-3s], ...]), as the state from the previous window won't get passed to the next window, which is the point of a stateful RNN.

I was thinking of mixing sequences in the same batch as in:

batch 1 = [[seq1 0-1s], [seq2 0-1s], [seq3 0-1s], ...]
batch 2 = [[seq1 1-2s], [seq2 1-2s], [seq3 1-2s], ...]
...

However, the issue there is that the sequences have different lengths, and thus some will finish before others.

So what is the best way to implement this?

(FYI, I couldn't find anything in the academic literature or the blogosphere that discusses this, so refs would be great.)

Thanks!

DankMasterDan
  • Have you considered padding all sequences with silence (all 0's) to get the same length? Your approach sounds like it's correct. – AlexR May 01 '19 at 10:13
  • I have not. Is that the standard practice? One con would be wasted inference, which would lengthen the training process. – DankMasterDan May 01 '19 at 13:09

1 Answer


Your specific case

After [seq1 0-1s] (the 1st second of the long sequence seq1) at index 0 of batch b, there is [seq1 1-2s] (the 2nd second of the same sequence seq1) at index 0 of batch b+1; this is exactly what is required when we set stateful=True.

Note that the samples inside each batch must be the same length; if this is done correctly, a difference in sequence length between (not inside) batches causes no problem. That is, when all samples from batch b have been processed, the next batch b+1 will be processed, and so on.
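For illustration, a minimal stateful GRU in tf.keras could be set up as below; the layer size and the window/feature dimensions are assumptions for the sketch, not values from the question:

```python
import tensorflow as tf

# Illustrative sizes, not taken from the question.
batch_size = 32     # index i of every batch always holds the same sequence
window_steps = 100  # time steps inside each 1-second window
n_features = 8      # feature dimension per time step

model = tf.keras.Sequential([
    tf.keras.layers.GRU(
        64,
        stateful=True,  # final state of sample i in batch b initialises sample i in batch b+1
        batch_input_shape=(batch_size, window_steps, n_features)),
    tf.keras.layers.Dense(1)
])

# Once all windows of the current set of sequences have been fed, clear the
# carried state before starting a new set of sequences (or a new epoch):
model.reset_states()
```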

A general example

As a general example, for stateful=True and batch_size=2, a data set like

seq1: s11, s12, s13, s14
seq2: s21, s22, s23
seq3: s31, s32, s33, s34, s35
seq4: s41, s42, s43, s44, s45, s46

where sij denotes the j-th time step of sequence i, must be structured like

    batch 1         batch 2         batch 3         batch 4  

0   s21, s22        s23, <pad>      s31, s32, s33   s34, s35, <pad>   ...
1   s11, s12        s13, s14        s41, s42, s43   s44, s45, s46

or like (with overlap)

    batch 1         batch 2         batch 3         

0   s21, s22        s22, s23        s23, <pad>    ...
1   s11, s12        s12, s13        s13, s14   

where, for example, the long sequence s21, s22, s23 (3 time steps) is broken down into two sub-sequences, s21, s22 and s23, <pad>. Also, as you can see, it is possible to have batches with different sequence lengths (by using a custom batch generator, sketched below).
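A sketch of one such custom batch generator (without overlap) is shown below; the fixed chunk length per group and the pad value are assumptions for illustration:

```python
import numpy as np

def stateful_batches(sequences, batch_size, chunk_len, pad_value=0.0):
    """Yield batches in which index i always continues the same sequence.

    sequences: list of arrays of shape (timesteps, features); lengths may differ.
    chunk_len: time steps per batch (all samples inside a batch share this length).
    Assumes len(sequences) is a multiple of batch_size; otherwise the last group
    must be dropped or filled with dummy sequences.
    """
    for g in range(0, len(sequences), batch_size):
        group = sequences[g:g + batch_size]
        longest = max(len(s) for s in group)
        # pad every sequence of the group up to a chunk-aligned common length
        total = int(np.ceil(longest / chunk_len)) * chunk_len
        padded = np.stack([
            np.pad(s, ((0, total - len(s)), (0, 0)), constant_values=pad_value)
            for s in group
        ])
        # emit consecutive chunks: batch b+1 continues where batch b stopped
        for start in range(0, total, chunk_len):
            yield padded[:, start:start + chunk_len, :]
        # the caller should reset the RNN state here, before the next group

# For the data set above, stateful_batches([seq2, seq1], batch_size=2, chunk_len=2)
# yields batch 1 = [[s21, s22], [s11, s12]] and batch 2 = [[s23, <pad>], [s13, s14]].
```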

Note that <pad> (padded values) should be masked to prevent the RNN from considering them as actual values (more info in this post). We can also avoid padded values entirely by opting for batch_size=1, which might be too restrictive (more info in this post).
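One way to apply such masking is a Masking layer in front of the stateful GRU; this is only a sketch, and the mask value, the batch size of 2 and the layer size are assumptions (the mask value must not occur in the real data):

```python
import tensorflow as tf

pad_value = 0.0  # assumed pad value; must match what the batch generator inserts

model = tf.keras.Sequential([
    # time steps whose features all equal pad_value are skipped by downstream layers
    tf.keras.layers.Masking(
        mask_value=pad_value,
        batch_input_shape=(2, None, 128)),  # batch_size=2 as in the example, variable steps
    tf.keras.layers.GRU(32, stateful=True),
    tf.keras.layers.Dense(1)
])
```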

Here are two examples of a sequence with 5 time steps:

            s11   s12   s13   s14   s15

example 1   23,   25,   27,   24,    28     # 5 temperature readings for t to t+4

example 2   I,    like, LSTM, very,  much   # 5 128-dim word embeddings
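In array terms, each example above is a single sample of shape (timesteps, features), and a batch stacks batch_size such samples. A quick NumPy illustration (values are placeholders):

```python
import numpy as np

# example 1: 5 temperature readings -> one sample of shape (5, 1)
temps = np.array([[23.0], [25.0], [27.0], [24.0], [28.0]])
print(temps.shape)  # (5, 1)

# example 2: 5 word embeddings of dimension 128 -> one sample of shape (5, 128)
words = np.random.rand(5, 128)

# two equal-length samples stacked together give a batch of shape (2, 5, 128)
batch = np.stack([words, np.random.rand(5, 128)])
print(batch.shape)  # (2, 5, 128)
```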

Some resources

  1. You may find this article on stateful vs stateless LSTM helpful. Some quotes from the article:

    The stateless LSTM with the same configuration may perform better on this problem than the stateful version.

    and

    When a large batch size is used, a stateful LSTM can be simulated with a stateless LSTM.

  2. This blog on Stateful LSTM in Keras by Philippe Remy

  3. Some opinions on Keras github, like this.

Esmailian
  • Thanks! 2 q's: (1) what do you mean when you say the 0'th index of batch 1 is s11, s12. Each batch index will only fit 1 second (e.g. only either s11 or s12)? Also, you increase this to 3 samples in batch 3,4 (e.g. s31 s32 s33). (2) What do you mean by <pad>? Do you mean place a sample with all 0's? Wouldn't this negatively affect training time? – DankMasterDan May 01 '19 at 16:11
  • @DankMasterDan I have added extra remarks regarding your questions; hope it helps. Samples inside a batch must be the same length, and by using masking there would be no wasted training time. – Esmailian May 01 '19 at 19:09
  • Esmailian, thanks! You answered my 2nd question, but not my first. Is the 0th sample of batch 1 s21 or s22? I don't see how it can be both. Concretely, if each sample s_ij is shape [5, 128] (as is a 128-dim word embedding), the batch shape will be [2, 5, 128]. So how can batch 1 contain s21, s22, s11, and s12? – DankMasterDan May 01 '19 at 20:43
  • @DankMasterDan each sij is one time step (e.g. one word with dimension [1 128]). A 1-second chunk of 1-minute sequence could be either 1 time-step or N time-steps (N > 1). – Esmailian May 02 '19 at 08:58
  • Got it. Had a brain fart there for a second. This is exactly what I was looking for! – DankMasterDan May 03 '19 at 17:55
  • I asked a related question on how to implement this in TensorFlow: https://stackoverflow.com/questions/55978050/using-tf-dataset-api-to-process-sequences-for-stateful-rnn – DankMasterDan May 03 '19 at 23:08
  • This is exactly what I've been looking for, thank you. Although it was rather hard to implement this reshaping as a general function. Does Keras not have a shaping function for its RNNs? – meliksahturker Dec 31 '20 at 08:54