ChatGPT is based on a decoder-only Transformer, so it does not have an encoder. Given that, how is a user's question passed as input to ChatGPT's decoder? In a regular encoder-decoder architecture, the final embeddings of the encoder are passed to the decoder along with the start token, and the decoder then auto-regressively outputs tokens. How would this work in a decoder-only architecture?
Let's say I have the following question: "How many countries are there in the world?" and its tokenized form is [3, 5, 8, 2, 10, 4, 1, 6, 7]. How does the decoder take in this input?
In an encoder-decoder architecture, I would just pass the start token to the decoder, which would, conditioned on the encoder embeddings, auto-regressively output the next tokens in the sequence to form an answer.
In a decoder-only architecture, how does that work? If I pass the first token of the question (token "3") to the decoder, it will output what it thinks is the most likely token after "3", but it will have no context of the rest of the question... So how is the context taken into account, given that we are not encoding it?
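To make my mental model concrete, here is a minimal sketch of how I currently picture a decoder-only model being used at generation time (the `dummy_decoder_step` function is a made-up stand-in for the model, not a real API):

```python
def dummy_decoder_step(tokens):
    # Stand-in for the real model: given all tokens so far, return the id
    # of the "most likely" next token (here, always 0, treated as EOS).
    return 0

def generate(decoder_step, prompt_tokens, max_new_tokens, eos_id=0):
    """Greedy autoregressive decoding starting from an initial token sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_id = decoder_step(tokens)  # the model sees every token appended so far
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Is prompt_tokens supposed to be the whole question [3, 5, 8, 2, 10, 4, 1, 6, 7],
# or is the question somehow fed in one token at a time?
print(generate(dummy_decoder_step, [3, 5, 8, 2, 10, 4, 1, 6, 7], max_new_tokens=20))
```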
I am aware of this post, which is similar, but I more specifically want to know how we go from a decoder-only model that outputs the most likely next word in general, to a model that outputs a whole answer to a specific question. Put differently, how does a problem that is inherently a sequence-to-sequence problem (question answering) get solved with an autoregressive model instead of a seq2seq model?
One potential answer I have been looking into is fine-tuning, but I am still unsure how a model that outputs a single token, x[n+1] = f(x[n], ..., x[1]), can be fine-tuned on a training set of sequence pairs (one question and one answer per record).
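To show the mismatch I am struggling with, here is the shape of one training record versus the next-token interface, as I currently picture it (the names and token ids are made up for illustration):

```python
# One record of my fine-tuning data: a pair of sequences.
training_record = {
    "question": [3, 5, 8, 2, 10, 4, 1, 6, 7],  # "How many countries are there in the world?"
    "answer":   [12, 9, 14, 11, 13],           # some tokenized answer
}

# The model itself only exposes a next-token prediction:
#   x[n+1] = f(x[1], ..., x[n])
# How does a pair like the one above get turned into training targets
# for a function that only ever predicts one next token from a prefix?
```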