
Most videos in benchmark datasets like UCF101 are short (<40 sec) and monotonous, in the sense that they focus on a person performing a specific action (jumping, running, etc.). The whole video can be run through a ConvNet+LSTM with a logit output predicting the class of the video.
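For concreteness, here is a minimal sketch of that short-clip setup in PyTorch, assuming a ResNet-18 frame encoder and an LSTM over per-frame features; the backbone, hidden size, and shapes are illustrative, not taken from any specific implementation.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class ClipClassifier(nn.Module):
        """ConvNet+LSTM over a short clip: one set of class logits per video."""
        def __init__(self, num_classes, hidden_size=256):
            super().__init__()
            resnet = models.resnet18(weights=None)                    # any frame encoder works
            self.cnn = nn.Sequential(*list(resnet.children())[:-1])   # drop the fc head -> 512-d features
            self.lstm = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
            self.head = nn.Linear(hidden_size, num_classes)

        def forward(self, frames):                     # frames: (B, T, 3, H, W)
            b, t = frames.shape[:2]
            feats = self.cnn(frames.flatten(0, 1))     # (B*T, 512, 1, 1)
            feats = feats.flatten(1).view(b, t, -1)    # (B, T, 512)
            _, (h_n, _) = self.lstm(feats)             # h_n: (1, B, hidden)
            return self.head(h_n[-1])                  # (B, num_classes) class logits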

What if these conditions are very different? Suppose the videos are

  • long, e.g. 10-20 minutes,
  • 'diverse', in the sense that they switch between different, seemingly unrelated scenes, e.g. full-screen subtitles,
  • labelled with only two classes whose difference is subtle, e.g. the type of sentiment.

Regarding the first condition, there's a problem with GPU memory: if I extract, say, 1000 frames from the video to represent it, I don't know of any way to put them all on CUDA at once. The way I'm thinking of solving this is by splitting the video into consecutive segments (e.g. 10) and building a batch from these segments, e.g. if I take 100 frames per segment (data point), the batch will have shape (10, 100, 3, H, W). But the label has shape (1, 1) rather than (10, 1), because there is one label per whole video.
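As a sketch of that layout (assuming 10 segments of 100 frames and a stand-in frame encoder instead of a real backbone), the whole tensor can stay on the CPU while segments are moved to CUDA one at a time, so only a single segment's frames ever occupy GPU memory:

    import torch
    import torch.nn as nn

    device = "cuda" if torch.cuda.is_available() else "cpu"

    num_segments, frames_per_seg, H, W = 10, 100, 112, 112
    video = torch.randn(num_segments, frames_per_seg, 3, H, W)   # whole video stays on the CPU
    label = torch.tensor([[1.0]])                                # shape (1, 1): one label per video

    encoder = nn.Sequential(                                     # stand-in for a real frame encoder
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),                   # -> (frames, 32)
    ).to(device)

    segment_features = []
    with torch.no_grad():                        # frozen feature extraction for the sketch
        for seg in video:                        # seg: (100, 3, H, W)
            feats = encoder(seg.to(device))      # only this segment lives on the GPU
            segment_features.append(feats.cpu())

    features = torch.stack(segment_features)     # (10, 100, 32), back on the CPU
    print(features.shape, label.shape)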

So it would be one sequence split into minibatches, is that right?

I want to employ a ConvNet+LSTM to predict the label of the video. The way I see it, the LSTM can output one logit per minibatch (segment), 10 values in total, which then serve as inputs to a single-layer classifier with 1 output.
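A minimal sketch of that two-stage head, assuming per-frame features have already been extracted per segment (e.g. as in the snippet above): the LSTM summarises each segment into a single logit, and one linear layer maps the 10 segment logits to the video-level output. All sizes are illustrative.

    import torch
    import torch.nn as nn

    class SegmentLSTMClassifier(nn.Module):
        """One logit per segment from the LSTM, then a single-layer classifier over them."""
        def __init__(self, feat_dim=32, hidden_size=128, num_segments=10):
            super().__init__()
            self.lstm = nn.LSTM(feat_dim, hidden_size, batch_first=True)
            self.segment_logit = nn.Linear(hidden_size, 1)     # one logit per segment
            self.video_head = nn.Linear(num_segments, 1)       # combines the 10 segment logits

        def forward(self, segment_feats):           # (num_segments, frames_per_seg, feat_dim)
            _, (h_n, _) = self.lstm(segment_feats)  # h_n: (1, num_segments, hidden)
            seg_logits = self.segment_logit(h_n[-1]).squeeze(-1)   # (num_segments,)
            return self.video_head(seg_logits)      # (1,) video-level logit

    model = SegmentLSTMClassifier()
    features = torch.randn(10, 100, 32)             # e.g. the feature tensor from the previous sketch
    print(model(features).shape)                    # torch.Size([1])

With only two classes, that single output could then be trained with a binary cross-entropy loss against the video-level label.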

I haven't implemented it yet, but in general my question is this:

HOW CAN AN LSTM HANDLE A VERY LONG VIDEO WITH DIVERSE INFORMATION IN IT?

CAN I EMPLOY A SEQUENCE OF LSTMS?

Alex