I'm a PhD student in natural language processing, and I hope I can clear up some of the terminology used in previous answers to this question in a way that's helpful for a full understanding.
To clarify terminology (source):
> Depending on the text, "query", "key", and "value" can either refer to the original encodings, the learned weight matrices that multiply the encodings, or the value of the original encodings multiplied by the weights. Usually $Q$, $K$ and $V$ either refer to the original encodings (as I have done here) or the encodings multiplied by the weights (as the transformer paper does), and $W^Q$, $W^K$, $W^V$ refer to the learned weight matrices, but you often have to figure it out by context.
My guess is that you've been confused by these differing conventions.
In most of the other answers to this question, people seem to be using $Q$, $K$, and $V$ to refer to the original encodings. But, as I mention in the quote, the "Attention is All You Need" paper uses $Q$, $K$, and $V$ to refer to the original encodings multiplied by the weights.
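To make the two conventions concrete, here is the paper's scaled dot-product attention for a single head, written with the stacked input encodings as $X$ (the symbol $X$ is just my shorthand for the matrix of encodings):

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \qquad Q = XW^Q,\quad K = XW^K,\quad V = XW^V$$

In the paper's usage, $Q$, $K$, and $V$ are the products on the right; in the other usage, the same letters stand for $X$ itself, and the multiplications by the weight matrices are written out explicitly.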
Because of this, it's also unclear exactly what your question is asking. If you're asking "how are the original encodings trained" and you're working in natural language processing, then you might find this paper interesting (it's the work BERT cites for its embeddings, and later embedding approaches build on it). Note, though, that in many transformer systems (such as BERT), the original encodings are fixed and not learned at all during model training; the weight matrices, on the other hand, are always trained and are the main focus of the technique.
So, if your question is how the weight matrices are trained, then the answer is that they're trained using gradient descent based on whatever the end task is. Attention is differentiable, meaning that if you can compute how much error is in each of the outputs of the attention module, you can also compute the direction to change every single weight in each of the three weight matrices ($W^Q$, $W^K$, $W^V$) to reduce that error. And gradient descent is just nudging the weights in that direction.
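To make this concrete, here is a minimal sketch in PyTorch (my own toy example, not the paper's multi-head implementation; the dimensions and the made-up regression loss are only there to show that the three weight matrices receive gradients like any other parameters):

```python
# A toy single-head attention layer whose W^Q, W^K, W^V are ordinary
# learnable parameters, trained by gradient descent on a made-up loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleHeadAttention(nn.Module):
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        # The three learned weight matrices (bias-free linear maps).
        self.W_q = nn.Linear(d_model, d_k, bias=False)
        self.W_k = nn.Linear(d_model, d_k, bias=False)
        self.W_v = nn.Linear(d_model, d_k, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x holds the original encodings: (batch, seq_len, d_model).
        Q, K, V = self.W_q(x), self.W_k(x), self.W_v(x)         # encodings times weights
        scores = Q @ K.transpose(-2, -1) / (K.size(-1) ** 0.5)  # scaled dot products
        return F.softmax(scores, dim=-1) @ V                    # weighted sum of values

attn = SingleHeadAttention(d_model=16, d_k=8)
opt = torch.optim.SGD(attn.parameters(), lr=0.1)

x = torch.randn(2, 5, 16)      # fixed input encodings (not trained in this sketch)
target = torch.randn(2, 5, 8)  # stand-in for whatever the end task supervises

loss = F.mse_loss(attn(x), target)
loss.backward()                # autograd fills in gradients for every entry of W^Q, W^K, W^V
opt.step()                     # gradient descent nudges each weight in that direction
```

Inside a real transformer the mechanics are identical; the loss just comes from the actual end task (language modeling, translation, classification, and so on) and backpropagates through all the stacked layers.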
If your question is "How is the product of the original encodings and the weight matrices trained?", then the answer is that usually the original encodings are held fixed (or come from a previous layer), and we're only interested in how the weight matrices are trained (in which case, see the previous paragraph).
Hopefully this provides a clearer answer to your question.