
I have followed this tutorial https://www.youtube.com/watch?v=U0s0f995w14 to create a minimal version of a transformer architecture, but I am confused about the final shape of the output.

Here's the code: https://github.com/aladdinpersson/Machine-Learning-Collection/blob/558557c7989f0b10fee6e8d8f953d7269ae43d4f/ML/Pytorch/more_advanced/transformer_from_scratch/transformer_from_scratch.py#L2

On the final lines (slightly modified):

print(x.shape)
print(trg[:, :-1].shape)
out = model(x, trg[:, :-1])
print(out.shape)

The output shapes don't seem to make sense:

torch.Size([2, 9]) #input sentence (num_examples, num_tokens)
torch.Size([2, 7]) #generated sentence so far (num_examples, num_tokens_generated)
torch.Size([2, 7, 10]) # probabilities for next token (num_examples, ???, size_vocab)

The transformer is supposed to predict the next token across 2 training examples (which is why there's a 2 for the number of examples and a 10 for the size of the vocab), by generating probabilities for each token in the vocab. But I can't make sense of why there's a 7 there. The only explanation I can come up with is that it outputs all predictions simultaneously, but that would require feeding the outputs iteratively through the transformer, and that never happens (see lines 267-270).

So is there a mistake or am I not understanding something correctly? What is that output shape supposed to represent?

Can somebody make sense of this?

1 Answer


7 is the length of the target sequence passed as argument to the model, which is trg[:, :-1], that is, the target sequence except the last token. The last token is removed because it contains either the end-of-sequence token of the longest sentence in the batch or padding tokens of the shorter sequences in the batch, and therefore it is useless.

The output of the decoder is of the same length as its input. The shape of trg[:, :-1] is [2, 7], so the shape of the output is the same.
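As a rough illustration (this is not the tutorial's model; it is a sketch using PyTorch's built-in nn.TransformerDecoder with made-up sizes, and the encoder output replaced by random numbers), the decoder produces one distribution over the vocabulary per position of its input:

import torch
import torch.nn as nn

vocab_size, d_model = 10, 32                 # toy sizes; the vocab of 10 matches the example above
embed = nn.Embedding(vocab_size, d_model)
layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=1)
to_vocab = nn.Linear(d_model, vocab_size)

memory = torch.randn(2, 9, d_model)          # stand-in for the encoder output: (batch, src_len, d_model)
trg = torch.randint(0, vocab_size, (2, 8))   # (batch=2, trg_len=8)

out = to_vocab(decoder(embed(trg[:, :-1]), memory))  # feed the target without its last token
print(trg[:, :-1].shape)                     # torch.Size([2, 7])
print(out.shape)                             # torch.Size([2, 7, 10]): one vocab distribution per position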

Note that in the video they are invoking the model in an unusual way, because they are passing a whole target sequence to the model but they are not training it. Normally, the model would be used in one of the following ways:

  • In training mode, the model receives a full target sequence and its output is used to compute the loss and update the network weights via gradient descent.
  • In inference mode, the model is used auto-regressively, that is, we decode token by token, incorporating each new predicted token into the input for the next step (see the sketch below).

I guess they used the model this way just to illustrate that the model works.
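For example, the two usages might look roughly like this (a sketch, not the repository's code: model, x and trg are the objects from the question's snippet, while loss_fn, sos_token and max_len are assumed placeholders, e.g. loss_fn = torch.nn.CrossEntropyLoss()):

# Training (teacher forcing): one forward pass over the whole shifted target.
out = model(x, trg[:, :-1])                         # (batch, trg_len - 1, vocab_size)
loss = loss_fn(out.reshape(-1, out.size(-1)),       # flatten positions for cross-entropy
               trg[:, 1:].reshape(-1))              # labels = target shifted one position
loss.backward()

# Inference (auto-regressive): decode one token at a time.
generated = torch.full((x.size(0), 1), sos_token)   # start with the start-of-sequence token
for _ in range(max_len):
    out = model(x, generated)                       # (batch, current_len, vocab_size)
    next_token = out[:, -1].argmax(dim=-1, keepdim=True)  # only the last position is used here
    generated = torch.cat([generated, next_token], dim=1)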

noe
  • I'm confused. According to the Attention Is All You Need paper, the output is a set of probabilities over the vocab for the next token. Why would it return the same sentence that was fed into it? – user1510024 Jun 11 '23 at 17:47
  • The output is the probability distribution over the vocab for all the tokens. At inference time, we choose to ignore everything but the last position to make the prediction of the next token. At training time, we take the output at all the positions to compute the loss. – noe Jun 11 '23 at 17:49
  • Okay, but then shouldn't the output be 8? Or is 7 the context size and in the real application the rest is simply padded with 0s? Also, wouldn't the Transformer simply learn a 1-to-1 mapping of everything but the last token? – user1510024 Jun 11 '23 at 18:13
  • The last position is usually removed. At inference time, this removal is optional, you may or may not remove it; removing the last token does not hurt because it contains either end-of-sequence or padding, neither of which is needed in any case. In training, the removal of the last position is actually needed because we need to shift the expected outputs one position (so that the expected output at the first position is the target input at the second position), which means that we lose one token in the expected output and, therefore, we need to remove the last token of the target input. – noe Jun 11 '23 at 18:20
  • So if, in the case of the decoder, the input length (previously generated tokens) is equal to the output length (previously generated tokens plus the next token), does this mean 7 is equal to this transformer's context size? Also, again, wouldn't that lead to the Transformer just learning a 1-to-1 mapping between everything but the last token? I just don't see the point of assigning part of the input as the output. – user1510024 Jun 11 '23 at 18:31
  • About the 1:1 mapping: as I said, the expected output is shifted one position so that the input at position $i$ has as expected output the input at position $i+1$; that is, at each position the output is not the input at the same position but the input at the next position (i.e. the next token). See the small sketch after these comments. – noe Jun 11 '23 at 18:53
  • About the context size: I don't know what you are referring to "this transformer's context size". Please clarify what you mean by "context" here. – noe Jun 11 '23 at 18:54
  • https://analyticsindiamag.com/context-is-everything-the-context-length-problem-with-gpt-models/#:~:text=GPT's%20context%20length%20limitation,of%20the%20seminal%20GPT%2D4%20. I thought maybe there is a fixed length token input – user1510024 Jun 12 '23 at 05:33
  • So you are saying during training time the transformer is given the entire sentence and asked to predict the entire sentence only once per epoch? Auto-regressive token by token generation is only done during inference? – user1510024 Jun 12 '23 at 06:21
  • Autoregressive token generation is only done during inference, yes; you can check this answer for a deeper explanation of this. I think you are mistaking the concept of "epoch" in your statement, though; I suggest you check your understanding with this answer. – noe Jun 12 '23 at 09:03
  • About the fixed-length input: transformers accept variable-length input, but there is usually a maximum number of tokens due to the positional encodings. Check more details about it in here, here and here. – noe Jun 12 '23 at 09:05
  • If you have further questions, please create a new question on the site. – noe Jun 12 '23 at 09:06
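To make the shift discussed in these comments concrete, here is a tiny sketch with made-up token ids showing how the decoder input and the expected output are aligned during training:

import torch

trg = torch.tensor([[1, 5, 7, 3, 2]])   # e.g. <sos>, three word ids, <eos> (made-up ids)

decoder_input = trg[:, :-1]             # tensor([[1, 5, 7, 3]])  -> fed to the decoder
expected_out  = trg[:, 1:]              # tensor([[5, 7, 3, 2]])  -> used as labels in the loss

# Thanks to the causal mask, at position i the decoder only sees decoder_input[:, :i+1]
# and is trained to output expected_out[:, i], i.e. the *next* token,
# so it cannot get away with copying its input one-to-one.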