104

I'm trying to read and understand the paper Attention Is All You Need, and in it there is a picture:

[figure from the paper]

I don't know what positional encoding is. By listening to some YouTube videos I've found out that it is an embedding that has both the meaning and the position of a word in it, and that it has something to do with $\sin(x)$ or $\cos(x)$,

but I couldn't understand what exactly it is and how exactly it does that. So I'm here for some help. Thanks in advance.

Peyman

4 Answers

68

For example, for word $w$ at position $pos \in [0, L-1]$ in the input sequence $\boldsymbol{w}=(w_0,\cdots, w_{L-1})$, with 4-dimensional embedding $e_{w}$, and $d_{model}=4$, the operation would be $$\begin{align*}e_{w}' &= e_{w} + \left[\sin\left(\frac{pos}{10000^{0}}\right), \cos\left(\frac{pos}{10000^{0}}\right),\sin\left(\frac{pos}{10000^{2/4}}\right),\cos\left(\frac{pos}{10000^{2/4}}\right)\right]\\ &=e_{w} + \left[\sin\left(pos\right), \cos\left(pos\right),\sin\left(\frac{pos}{100}\right),\cos\left(\frac{pos}{100}\right)\right]\\ \end{align*}$$

where the formula for the positional encoding is $$\text{PE}(pos,2i)=\sin\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$ $$\text{PE}(pos,2i+1)=\cos\left(\frac{pos}{10000^{2i/d_{model}}}\right),$$ with $d_{model}=512$ (thus $i \in [0, 255]$) in the original paper.
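As a quick numeric check of these formulas (a minimal sketch, not code from the paper; the toy sizes $d_{model}=4$ and three positions are chosen only to match the example above):

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int d_model = 4, L = 3;  /* toy sizes matching the example above */
        for (int pos = 0; pos < L; pos++) {
            for (int i = 0; i < d_model / 2; i++) {
                double freq = pow(10000.0, -2.0 * i / d_model);  /* 1 / 10000^(2i/d_model) */
                printf("PE(%d,%d)=%.4f  PE(%d,%d)=%.4f\n",
                       pos, 2 * i,     sin(pos * freq),
                       pos, 2 * i + 1, cos(pos * freq));
            }
        }
        return 0;
    }

For $pos=1$ this prints $\sin(1), \cos(1), \sin(1/100), \cos(1/100)$, matching the second line of the expansion above.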

This technique is used because there is no notion of word order (1st word, 2nd word, ...) in the proposed architecture. All words of the input sequence are fed to the network with no special order or position; in contrast, in an RNN architecture, the $n$-th word is fed at step $n$, and in a ConvNet, it is fed to specific input indices. Therefore, the proposed model has no idea how the words are ordered. Consequently, a position-dependent signal is added to each word embedding to help the model incorporate the order of words. Based on experiments, this addition not only avoids destroying the embedding information but also adds vital position information.

This blog post by Kazemnejad explains that the specific choice of ($\sin$, $\cos$) pairs helps the model learn patterns that rely on relative positions. As an example, consider a pattern like

if 'are' comes after 'they', then 'playing' is more likely than 'play'

which relies on relative position "$pos(\text{are}) - pos(\text{they})$" being 1, independent of absolute positions $pos(\text{are})$ and $pos(\text{they})$. To learn this pattern, any positional encoding should make it easy for the model to arrive at an encoding for "they are" that (a) is different from "are they" (considers relative position), and (b) is independent of where "they are" occurs in a given sequence (ignores absolute positions), which is what $\text{PE}$ manages to achieve.
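One way to see this concretely (an illustrative sketch, not taken from the blog post; the embedding size of 64 is arbitrary): the dot product between two sinusoidal encodings reduces to $\sum_i \cos\left(\frac{p-q}{10000^{2i/d_{model}}}\right)$, so it depends only on the offset $p-q$, not on the absolute positions.

    #include <stdio.h>
    #include <math.h>

    #define D 64  /* arbitrary even embedding size for this illustration */

    /* sinusoidal encoding of a single position */
    static void encode(int pos, double enc[D]) {
        for (int i = 0; i < D / 2; i++) {
            double freq = pow(10000.0, -2.0 * i / D);
            enc[2 * i]     = sin(pos * freq);
            enc[2 * i + 1] = cos(pos * freq);
        }
    }

    static double dot(const double a[D], const double b[D]) {
        double s = 0.0;
        for (int k = 0; k < D; k++) s += a[k] * b[k];
        return s;
    }

    int main(void) {
        double a[D], b[D], c[D], d[D];
        encode(3, a);  encode(4, b);    /* positions 3 and 4: offset 1   */
        encode(50, c); encode(51, d);   /* positions 50 and 51: offset 1 */
        /* both dot products print the same value: only the offset matters */
        printf("%f %f\n", dot(a, b), dot(c, d));
        return 0;
    }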

This article by Jay Alammar explains the paper with excellent visualizations. The example on positional encoding calculates $\text{PE}(\cdot)$ the same way, with the only difference that it puts $\sin$ in the first half of the embedding dimensions (as opposed to the even indices) and $\cos$ in the second half (as opposed to the odd indices). As pointed out by ShaohuaLi, this difference does not matter since vector operations would be invariant to the permutation of dimensions.
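A minimal numeric check of ShaohuaLi's point (illustrative only; the vectors are made up): applying the same reordering of dimensions to both vectors leaves their dot product unchanged, so the interleaved layout and the half-split layout give identical PE-to-PE dot products.

    #include <stdio.h>

    int main(void) {
        /* two arbitrary 4-dimensional vectors standing in for encodings */
        double u[4] = {0.10, 0.20, 0.30, 0.40};
        double v[4] = {0.50, 0.60, 0.70, 0.80};
        /* reorder the interleaved layout [sin0, cos0, sin1, cos1]
           into the half-split layout     [sin0, sin1, cos0, cos1] */
        int perm[4] = {0, 2, 1, 3};
        double s1 = 0.0, s2 = 0.0;
        for (int k = 0; k < 4; k++) {
            s1 += u[k] * v[k];               /* original layout         */
            s2 += u[perm[k]] * v[perm[k]];   /* same reordering on both */
        }
        printf("%f %f\n", s1, s2);           /* prints the same value twice */
        return 0;
    }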

Esmailian
  • 9
    You also have this excellent article purely focused on positional embeddings: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ – Yohan Obadia Feb 05 '20 at 21:51
  • 1
    Is the 10000 in the denominator related to this comment by Jay Alammar in his post: "Let’s assume that our model knows 10,000 unique English words (our model’s 'output vocabulary')"? – tallamjr Dec 03 '20 at 15:53
  • 1
    Which half of the positional encoding is sin and which is cos doesn't matter. The dot product is the same after shuffling the embedding dimensions :-) – Shaohua Li Jan 11 '21 at 09:15
  • @ShaohuaLi Thanks, I believe you are correct. Updated. – Esmailian Jan 11 '21 at 09:55
  • "In ConvNet, it is fed to specific input indices." This is true as well for transformers. I think the reason position can't be as easily easily inferred from the order within the input vector is that the input vector consists of a sequence of length-n tokens and query, key, and value MLPs have no built-in structure to delineate token boundaries. – crizCraig Jul 22 '21 at 20:57
  • 2
    @crizCraig In the Transformer we do not have a unit with index 0 for the first word, and so on. In other words, we can feed n words to n units in arbitrary order, which is finally manifested as exchanging rows/columns in the word-word attention matrix. – Esmailian Jul 22 '21 at 21:56
  • Yep you're right, thanks @Esmailian. The tokens in the input window are stacked, similar to how inputs across a batch are stacked. – crizCraig Jul 23 '21 at 18:07
  • "this difference does not matter since vector operations would be invariant to the permutation of dimensions." ->I think it's not. Consider some simple examples. @ShaohuaLi's comment is not correct, unless dot product between the same vectors. (The original transformer doesn't even do dot product, rather it adds the positional encoding vector with the embedding vector.) – starriet 주녕차 Nov 23 '22 at 03:39
  • 1
    @starriet If a positional encoding is added to a feature vector, the dot product between two such sums can be decomposed into two types of interactions: 1. the dot product between two different positional encodings, and 2. the dot product between a positional encoding and a feature vector. It should be apparent that the Type 1 dot product is shuffle-invariant w.r.t. the PE dimensions. The Type 2 dot product is not shuffle-invariant, but the PE channels have totally different semantics from the feature-vector channels; therefore, after shuffling the PE dimensions, the model is still equivalent to the original model. – Shaohua Li Nov 23 '22 at 14:27
  • @ShaohuaLi Let me describe my thoughts: 1. The "Type 1" dot product you mentioned is not shuffle-invariant, because it's two different functions that are being shuffled (sin and cos here), not the components themselves. The inputs of those two functions are determined by the index of the components in the PE dimension. 2. What do you mean by "PE channels have totally different semantics", and how does it make the dot product between a PE vector and a feature (embedding) vector remain the same even after changing the components of the PE vector? Could you elaborate on this? – starriet 주녕차 Dec 10 '22 at 06:08
  • @starriet 1. Well, let's think about the simplest case of a 2D PE. It's either (sin, cos) or (cos, sin). If two PEs are (sin1, cos1) and (sin2, cos2), their dot product is the same as that of (cos1, sin1) and (cos2, sin2), isn't it? 2. Consider that the feature vectors are randomly initialized. How to define the order of the PE dimensions is arbitrary. It's just as arbitrary as the dimension order of the feature vector (I mean, nobody defines in advance which feature dimension means what; the model assigns semantics by learning from data). – Shaohua Li Dec 11 '22 at 14:42
  • @ShaohuaLi 1. It's not, if the dimension is longer. Consider a 4D PE. It's something like (sin0, cos0, sin2, cos2). This is different from (sin0, sin0, cos2, cos2). 2. As you said, the feature vector (in this context, the embedding vector) will not be random anymore once the weights are trained. If it were random, why would we make the embedding vectors in the first place? – starriet 주녕차 Dec 16 '22 at 01:36
  • @starriet I'm very frustrated that you didn't get the obvious thing. In case 1, I mean what matters is the dot product. When permuting the dimensions of two PEs, as long as the way of permutation is the same, the dot product of the two PEs remains unchanged. This is basic math. – Shaohua Li Dec 17 '22 at 02:36
  • @ShaohuaLi I think you misunderstood my comments and what we're discussing. Yes, it's unchanged if we simply permute the components of the vectors. You're right, that's just basic math. But, here, the components are not permuted like that. As I mentioned above, only the functions (here, sin and cos) are permuted. That's the difference between the original Transformer paper and the article by Jay Alammar which is mentioned in this post. Please let me know if you think that's not the difference. Thanks. – starriet 주녕차 Dec 17 '22 at 14:16
  • P.S. Also, as mentioned above, PEs are added to the embedding vectors, so the dot products become different anyway. Your "Type 1" and "Type 2" explanation is good and I know what you're trying to say, but "Type 1" is not just a simple shuffling (rather, it's a shuffling of sin and cos), as I said in the above comment, so it's not shuffle-invariant. "Type 2" is also not shuffle-invariant, obviously, and even if the PE has different semantics, if everything else is the same (including the embedding vectors) but only the PEs are different, I don't think the result would be the same. – starriet 주녕차 Dec 17 '22 at 15:00
  • (For future reference) And I forgot to mention that, in the Transformer, the results of the positional encoding (embedding vector + positional-encoding vector) are not directly compared via dot products. Instead, they get linearly projected before the dot product, changing the dimension from d_model to d_k. Just for future readers, to avoid confusion. – starriet 주녕차 Dec 28 '22 at 05:22
  • The article by Amirhossein was really clear and useful – drkostas Jun 05 '23 at 21:31
60

Here is an awesome recent YouTube video that covers position embeddings in great depth, with beautiful animations:

Visual Guide to Transformer Neural Networks - (Part 1) Position Embeddings

Taking excerpts from the video, let us try understanding the “sin” part of the formula to compute the position embeddings:

[figure from the video: the sine part of the position-embedding formula]

Here “pos” refers to the position of the “word” in the sequence. P0 refers to the position embedding of the first word. “d” means the size of the word/token embedding; in this example d=5. Finally, “i” refers to each of the 5 individual dimensions of the embedding (i.e. 0, 1, 2, 3, 4).

While “d” is fixed, “pos” and “i” vary. Let us try understanding the latter two.

"pos"

[figure from the video: a sine curve with “pos” on the x-axis]

If we plot a sine curve and vary “pos” (on the x-axis), you will land up with different position values on the y-axis. Therefore, words at different positions will have different position-embedding values.

There is a problem, though. Since the sine curve repeats in intervals, you can see in the figure above that P0 and P6 have the same position-embedding value, despite being at two very different positions. This is where the “i” part of the equation comes into play.

"i"

[figure from the video: sine curves of different frequencies, one for each value of “i”]

If you vary “i” in the equation above, you will get a bunch of curves with varying frequencies. Reading off the position-embedding values against these different frequencies ends up giving different values at different embedding dimensions for P0 and P6.

Correction

Thanks to @starriet for the correction. "i" is not the index of an element within each vector; it is used to generate the alternating even and odd dimension indices (2i and 2i+1).
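To make that concrete (a minimal sketch, not code from the video; a small even embedding size of 6 is assumed here instead of the video's d=5), printing the multi-frequency vectors for positions 0 and 6 shows that they differ, even though a single sine curve can assign the same value to both positions:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int d = 6;                        /* assumed even embedding size for illustration */
        int positions[2] = {0, 6};
        for (int p = 0; p < 2; p++) {
            int pos = positions[p];
            printf("P%d:", pos);
            for (int i = 0; i < d / 2; i++) {
                double freq = pow(10000.0, -2.0 * i / d);
                printf("  %.3f %.3f", sin(pos * freq), cos(pos * freq));
            }
            printf("\n");                 /* the two printed vectors clearly differ */
        }
        return 0;
    }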

Batool
  • 5
    The video link is awesome! It clearly explains the concept. – Azhar Khan Feb 10 '22 at 13:34
  • 2
    Is there no collision induced by adding the word and position vectors? Or do we just not care? – Ryder Brooks Oct 05 '22 at 17:34
  • 2
    @RyderBrooks I had exactly the same question. My hypothesis is that the model will naturally learn word embeddings that ensure that the positional modification does not induce collisions (since collisions would surely impair performance, and hence be trained out) but I would love someone with more knowledge to clarify that! – atkins Oct 29 '22 at 19:50
  • 2
    @RyderBrooks The first FAQ in this article addresses this point a bit, and suggests a couple of references: https://kazemnejad.com/blog/transformer_architecture_positional_encoding/ – atkins Oct 29 '22 at 19:55
  • 4
    This is not correct. i itself is not the index of the element. It's just used for making alternate even and odd sequences (2i and 2i+1). The linked video is also incorrect. Actually, the original paper("attention is all you need") has a slight error on this notation, I guess that's why many people got it wrong. – starriet 주녕차 Nov 23 '22 at 03:13
14

Positional encoding is a re-representation of the value of a word together with its position in a sentence (given that being at the beginning is not the same as being at the end or in the middle).

But you have to take into account that sentences can be of any length, so saying "word X is the third in the sentence" does not make sense if there are sentences of different lengths: 3rd in a 3-word sentence is completely different from 3rd in a 20-word sentence.

What a positional encoder does is use the cyclic nature of the $\sin(x)$ and $\cos(x)$ functions to encode information about the position of a word in a sentence.

  • 4
    Thank you. Could you elaborate on how this positional encoder does this with $\sin$ and $\cos$? – Peyman Apr 28 '19 at 16:56
8

To add to the other answers, OpenAI's reference implementation calculates it in natural log space (to improve precision, I think). They did not come up with the encoding.

Here is the PE lookup table generation rewritten in C with a nested for loop:

    #include <math.h>
    #include <stdlib.h>

    int main(void) {
        int d_model = 512, max_len = 5000;
        /* heap allocation: 5000 x 512 doubles (~20 MB) would overflow the stack */
        double (*pe)[d_model] = malloc(sizeof(double) * max_len * d_model);

        for (int pos = 0; pos < max_len; pos++) {
            for (int k = 0; k < d_model; k += 2) {
                double div_term = exp(k * -log(10000.0) / d_model);  /* 10000^(-k/d_model) */
                pe[pos][k]     = sin(pos * div_term);  /* even dimensions */
                pe[pos][k + 1] = cos(pos * div_term);  /* odd dimensions  */
            }
        }
        free(pe);
        return 0;
    }

Eris