ChatGPT is based on a decoder-only Transformer, so it does not have an encoder. Given that, how is a user's question passed as input to ChatGPT's decoder? In a regular encoder-decoder architecture, the final embeddings of the encoder are passed to the decoder along with the start token, and the decoder then auto-regressively outputs tokens. How would this work in a decoder-only architecture?
Let's say I have the following question: "How many countries are there in the world?" and its tokenized form is [3, 5, 8, 2, 10, 4, 1, 6, 7]. How does the decoder take in this input?
In an encoder-decoder architecture, I would just pass the start token to the decoder, which would, conditioned on the encoder embeddings, auto-regressively output the next tokens in the sequence to form an answer.
In a decoder-only architecture, how does that work? If I pass the first token of the question (token "3") to the decoder, it will output what it thinks is the most likely token after "3", but it will have no context of the rest of the question... So how is the context taken into account, given that we are not encoding it?
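To make my mental model concrete, here is a minimal sketch of how I currently picture a decoder-only model being used at generation time (the `dummy_decoder_step` function is a made-up stand-in for the model, not a real API):

```python
def dummy_decoder_step(tokens):
    # Stand-in for the real model: given all tokens so far, return the id
    # of the "most likely" next token (here, always 0, treated as EOS).
    return 0

def generate(decoder_step, prompt_tokens, max_new_tokens, eos_id=0):
    """Greedy autoregressive decoding starting from an initial token sequence."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_id = decoder_step(tokens)  # the model sees every token appended so far
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Is prompt_tokens supposed to be the whole question [3, 5, 8, 2, 10, 4, 1, 6, 7],
# or is the question somehow fed in one token at a time?
print(generate(dummy_decoder_step, [3, 5, 8, 2, 10, 4, 1, 6, 7], max_new_tokens=20))
```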
I am aware of this post, which is similar, but I more specifically want to know how we go from a decoder-only model that outputs the most likely next word in general, to a model that outputs a whole answer to a specific question. Put differently, how does a problem that is inherently a sequence-to-sequence problem (question answering) get solved with an autoregressive model instead of a seq2seq model?
One potential answer I have been looking into is fine-tuning, but I am still unsure how a model that outputs a single token, x[n+1] = f(x[n], ..., x[1]), can be fine-tuned on a training set of sequence pairs (one question and one answer per record).
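To show the mismatch I am struggling with, here is the shape of one training record versus the next-token interface, as I currently picture it (the names and token ids are made up for illustration):

```python
# One record of my fine-tuning data: a pair of sequences.
training_record = {
    "question": [3, 5, 8, 2, 10, 4, 1, 6, 7],  # "How many countries are there in the world?"
    "answer":   [12, 9, 14, 11, 13],           # some tokenized answer
}

# The model itself only exposes a next-token prediction:
#   x[n+1] = f(x[1], ..., x[n])
# How does a pair like the one above get turned into training targets
# for a function that only ever predicts one next token from a prefix?
```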