I'm a PhD student in natural language processing, and I hope I can clear up some of the terminology used in previous answers to this question in a way that's helpful for a full understanding.
To clarify terminology (source):
> Depending on the text, "query", "key", and "value" can either refer to the original encodings, the learned weight matrices that multiply the encodings, or the value of the original encodings multiplied by the weights. Usually $Q$, $K$ and $V$ either refer to the original encodings (as I have done here) or the encodings multiplied by the weights (as the transformer paper does), and $W^Q$, $W^K$, $W^V$ refer to the learned weight matrices, but you often have to figure it out by context.
My guess is that you've been confused by these differing conventions.
In most of the other answers to this question, people seem to be using $Q$, $K$, and $V$ to refer to the original encodings. But, as I mention in the quote, the "Attention is All You Need" paper uses $Q$, $K$, and $V$ to refer to the original encodings multiplied by the weights.
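To make the two conventions concrete, here is the paper's scaled dot-product attention for a single head, written with the stacked input encodings as $X$ (the symbol $X$ is just my shorthand for the matrix of encodings):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW^Q,\quad K = XW^K,\quad V = XW^V$$

In the paper's usage, $Q$, $K$, and $V$ are the products on the right; in the other usage, the same letters stand for $X$ itself, and the multiplications by the weight matrices are written out explicitly.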
Because of this, it's also unclear exactly what your question is asking. If you're asking "how are the original encodings trained" and you're working in natural language processing, then you might find this paper interesting (it's the work BERT cites for its embeddings, and later embedding approaches build on it). Note, though, that in many transformer systems (such as BERT), the original encodings are fixed and not learned at all during model training; the weight matrices, on the other hand, are always trained and are the main focus of the technique.
So, if your question is how the weight matrices are trained, then the answer is that they're trained using gradient descent based on whatever the end task is. Attention is differentiable, meaning that if you can compute how much error is in each of the outputs of the attention module, you can also compute the direction to change every single weight in each of the three weight matrices ($W^Q$, $W^K$, $W^V$) to reduce that error. And gradient descent is just nudging the weights in that direction.
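To make this concrete, here is a minimal sketch in PyTorch (my own toy example, not the paper's multi-head implementation; the dimensions and the made-up regression loss are only there to show that the three weight matrices receive gradients like any other parameters):

```python
# A toy single-head attention layer whose W^Q, W^K, W^V are ordinary
# learnable parameters, trained by gradient descent on a made-up loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        # The three learned weight matrices (bias-free linear maps).
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds the original encodings: (batch, seq_len, d_model).
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)         # encodings times weights
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)  # scaled dot products
        return F.softmax(scores, dim=-1) @ V                    # weighted sum of values

attn = SingleHeadAttention(d_model=16, d_k=8)
opt = torch.optim.SGD(attn.parameters(), lr=0.1)

x = torch.randn(2, 5, 16)      # fixed input encodings (not trained in this sketch)
target = torch.randn(2, 5, 8)  # stand-in for whatever the end task supervises

loss = F.mse_loss(attn(x), target)
loss.backward()                # autograd fills in gradients for every entry of W^Q, W^K, W^V
opt.step()                     # gradient descent nudges each weight in that direction
```

Inside a real transformer the mechanics are identical; the loss just comes from the actual end task (language modeling, translation, classification, and so on) and backpropagates through all the stacked layers.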
If your question is "How is the product of the original encodings and the weight matrices trained?", then the answer is that usually the original encodings are held fixed (or come from a previous layer), and we're only interested in how the weight matrices are trained (in which case, see the previous paragraph).
Hopefully this provides a clearer answer to your question.