
I have been trying to find an answer to these questions, but I only find conflicting information. Is the Transformer as a whole autoregressive or not? And what about the decoder? I understand that the decoder proceeds autoregressively during inference, but I am not sure what happens at training time.

Here are posts saying that the Transformer is not autoregressive:

Minimal working example or tutorial showing how to use Pytorch's nn.TransformerDecoder for batch text generation in training and inference modes?

Here are some saying that it is:

What would be the target input for Transformer Decoder during test phase?

https://www.tensorflow.org/text/tutorials/transformer

https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0

https://huggingface.co/transformers/summary.html#seq-to-seq-models

Dametime

2 Answers


The normal Transformer decoder is autoregressive at inference time and non-autoregressive at training time.

Non-autoregressive training is possible because of two factors:

  1. We don't use the decoder's predictions as the next timestep input. Instead, we always use the gold tokens. This is referred to as teacher forcing.
  2. The hidden states of all time steps are computed simultaneously in the attention heads. This is different from recurrent units (LSTMs, GRUs), where the previous timestep's hidden state is needed to compute the current one. The sketch after this list illustrates both points.
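To make this concrete, here is a minimal PyTorch sketch of a teacher-forced training step (the module names and sizes are illustrative assumptions, not a prescribed setup). The decoder input is the gold sequence shifted right, a causal mask prevents each position from attending to future positions, and a single forward pass covers all time steps:

    import torch
    import torch.nn as nn

    # Illustrative sizes (assumptions, not prescriptions).
    vocab_size, d_model, batch, tgt_len, src_len = 1000, 512, 8, 16, 10

    embed = nn.Embedding(vocab_size, d_model)
    layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=8, batch_first=True)
    decoder = nn.TransformerDecoder(layer, num_layers=2)
    to_logits = nn.Linear(d_model, vocab_size)

    gold = torch.randint(0, vocab_size, (batch, tgt_len))  # gold target tokens
    memory = torch.randn(batch, src_len, d_model)          # stand-in encoder output

    # Teacher forcing: the input is the gold sequence shifted right,
    # and the prediction target is the gold sequence itself.
    decoder_input, target = gold[:, :-1], gold[:, 1:]

    # Causal mask: position t may attend only to positions <= t.
    mask = nn.Transformer.generate_square_subsequent_mask(decoder_input.size(1))

    # One parallel forward pass computes the hidden states of all time steps.
    hidden = decoder(embed(decoder_input), memory, tgt_mask=mask)
    loss = nn.functional.cross_entropy(
        to_logits(hidden).reshape(-1, vocab_size), target.reshape(-1)
    )
    loss.backward()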

At inference time, we don't have gold tokens, so we use the prediction of the decoder as the next timestep input.
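For contrast, a greedy decoding loop reusing the same (hypothetical) modules would look like the sketch below; bos_id and max_len are assumed values, and a real implementation would also stop at an end-of-sequence token:

    # Greedy autoregressive decoding: each prediction is fed back as input.
    bos_id, max_len = 1, 20
    generated = torch.full((batch, 1), bos_id, dtype=torch.long)

    with torch.no_grad():
        for _ in range(max_len):
            mask = nn.Transformer.generate_square_subsequent_mask(generated.size(1))
            hidden = decoder(embed(generated), memory, tgt_mask=mask)
            next_token = to_logits(hidden[:, -1]).argmax(-1, keepdim=True)
            generated = torch.cat([generated, next_token], dim=1)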

Note that, although inference is autoregressive, efficient implementations normally cache the hidden states (keys and values) of the previous timesteps, so they are not recomputed at each step.
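The caching idea can be sketched as a key/value cache; this is a simplification for illustration, not something nn.TransformerDecoder exposes out of the box. No causal mask is needed here, because the query is only the newest token and the cached keys and values cover exactly the past:

    import torch
    import torch.nn.functional as F

    class CachedSelfAttention(torch.nn.Module):
        """Single-head self-attention with a key/value cache (a sketch)."""

        def __init__(self, d_model):
            super().__init__()
            self.q = torch.nn.Linear(d_model, d_model)
            self.k = torch.nn.Linear(d_model, d_model)
            self.v = torch.nn.Linear(d_model, d_model)
            self.cache_k = self.cache_v = None  # states of previous timesteps

        def step(self, x):  # x: (batch, 1, d_model), only the newest token
            k, v = self.k(x), self.v(x)
            if self.cache_k is not None:
                k = torch.cat([self.cache_k, k], dim=1)  # reuse, don't recompute
                v = torch.cat([self.cache_v, v], dim=1)
            self.cache_k, self.cache_v = k, v
            scores = self.q(x) @ k.transpose(1, 2) / (x.size(-1) ** 0.5)
            return F.softmax(scores, dim=-1) @ v  # attends over all steps so far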

As a whole, the Transformer model is autoregressive, because its decoder is autoregressive.

There are non-autoregressive variants of the Transformer (e.g. this), but they are more research topics than out-of-the-box solutions.

noe
    "The normal Transformer decoder is autoregressive at inference time and non-autoregressive at training time." That's the essence. Very well put into the words! – Kamil Czerski May 11 '22 at 20:25
  • "efficient implementations normally cache the hidden states of the previous timesteps, so they are not re-computed at each step". Why are anything but the final hidden states needed at inference time? I'm having trouble understanding why the others are generated. – dashnick Apr 07 '23 at 02:36
  • @dashnick could you please ask a new question with your doubt? – noe Apr 07 '23 at 10:41
  • Will do, thanks. – dashnick Apr 07 '23 at 16:35
  • @noe see here https://datascience.stackexchange.com/q/120792/45069 – dashnick Apr 07 '23 at 16:58

The vanilla Transformer decoder proposed in the original paper "Attention Is All You Need", as well as the OpenAI GPT-series models, is autoregressive at inference time. In addition, I think the term "autoregressive" matters mostly for the inference behavior. In fact, thanks primarily to the masked attention mechanism, with appropriate caching optimization one only needs to feed one token into, and produce one token from, a Transformer decoder at each inference step. For more information, please check the analysis in my blog post "Transformer Autoregressive Inference Optimization".
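For reference, the masked attention mentioned above boils down to a triangular mask like the one below (shown with PyTorch's helper; the exact API is incidental to the argument):

    import torch

    # Position t may attend only to positions <= t; the -inf entries
    # become zero weights after the softmax inside attention.
    print(torch.nn.Transformer.generate_square_subsequent_mask(4))
    # tensor([[0., -inf, -inf, -inf],
    #         [0., 0., -inf, -inf],
    #         [0., 0., 0., -inf],
    #         [0., 0., 0., 0.]])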

The vanilla Transformer decoder models are not trained in an autoregressive fashion, although they can be, just like other classical language models (such as recurrent neural networks) or time-series models. As noe pointed out, we use the gold tokens during training, instead of the tokens predicted by the model being trained.

One can still train an autoregressive model on the predicted tokens if one really wants to, but it would not make much sense, especially early in training when the model is not yet well trained. For example, suppose the user asks the model to learn a language model of the sentence ["How", "are", "you", "?"], but because the model is immature, its autoregressive predictions starting from the first token come out as ["day", "wow", "oh", "haha", "bro", "."]. How would the user use these predicted tokens as inputs and still correctly model the sentence ["How", "are", "you", "?"]? Specifically, in the vanilla Transformer decoder, the input/target pairs would become:

  1. Given ["How"], predict ["day"].
  2. Given ["How", "are"], predict ["wow"].
  3. Given ["How", "are", "you"], predict ["oh"].
  4. ...

which makes no sense at all.
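Schematically, with the toy tokens from the example above, the two input-construction choices compare as follows (plain Python lists, just to spell out the mismatch):

    # Teacher forcing: inputs are gold prefixes, so inputs and targets line up.
    inputs_tf = [["How"], ["How", "are"], ["How", "are", "you"]]
    targets   = ["are", "you", "?"]

    # Free running: inputs grow with the immature model's own predictions,
    # but the targets are still the gold tokens -- the pairs no longer match.
    inputs_fr = [["How"], ["How", "day"], ["How", "day", "wow"]]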

There are non-autoregressive models for sequence-to-sequence tasks. However, the modeling often requires much more sophisticated orchestration. Please refer to my blog post "Non-Autoregressive Model and Non-Autoregressive Decoding for Sequence to Sequence Tasks" for details if you are interested.

Lei Mao