Transformers Well Explained: Positional Encoding

Ahmad Mustapha
Mar 22, 2024 · 3 min read


This is the third part of a four-article series that explains transformers. Each article is associated with a hands-on notebook. In the previous articles, we explained word embeddings in detail and trained an embedding on the task of predicting the third word of a trigram given the two previous words. We then delved into masking and trained an embedding by predicting masked words. Now we will introduce the simple concept of “Positional Encoding” and how it helps in transformer networks.

Photo by Mike Uderevsky on Unsplash

In the previous article, we trained an embedding by masking words and asking a neural network model to predict them. We introduced several concepts and variables such as “sequence size”, “padding”, and “masking”. The model acts on a sequence of tokens and uses them to predict the masked words. However, it has no information about the token order, and order matters in language processing: “Why transformers are better than LSTMs” is different from “Why LSTMs are better than transformers”.

To fix the transformer's missing positional information, its authors deliberately added positional information to the input. Consider this sequence:

[“Being”, “strong”, “is”, “all”, “what”, “matters”]

To add the positional information, we simply add the order. Like this.

[1, 2, 3, 4, 5, 6]

Wrong!

Position as real-valued vectors!

Neural networks operate on real-valued vectors, so we can't simply add the position as an integer to the input. For back-propagation to work and for higher layers to ingest the positional information, the position should be represented as a constant vector added alongside the embedding. In mathematical terms, each token is now represented as follows:

Token(pos) = E(token) + P(pos)

where E is the token's word embedding and P is the vector for its position pos.

The difference between P and E is that E is trainable while P is constant: it is fixed and computed once during model initialization. By adding the position as a constant, we ensure that it is taken into consideration during both the feedforward and the back-propagation steps.
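
For intuition, here is a minimal sketch in PyTorch (the toy sizes of 6 tokens and 4 dimensions are arbitrary) showing that when a constant positional vector P is added to a trainable embedding E, gradients flow into E but not into P:

import torch

E = torch.randn(6, 4, requires_grad=True)  # trainable word embeddings
P = torch.randn(6, 4)                      # constant positional vectors (requires_grad is False by default)

x = E + P               # each token representation = word embedding + positional vector
loss = x.sum()          # a dummy loss, just to run back-propagation
loss.backward()

print(E.grad is not None)  # True: the embeddings receive gradients and get updated
print(P.grad is None)      # True: the positional vectors stay fixed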

From an application perspective, we can simply use PyTorch's nn.Embedding layer to hold the positional vectors created during model initialization, while making sure it is not trainable by setting requires_grad of its weights to False.
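
A possible sketch of that setup, assuming PyTorch and the sizes used later in this article (sequence size 10, dimensionality 30); the vocabulary size of 5000 is just a placeholder:

import torch
import torch.nn as nn

seq_len, d_model, vocab_size = 10, 30, 5000

word_emb = nn.Embedding(vocab_size, d_model)  # trainable word embeddings
pos_emb = nn.Embedding(seq_len, d_model)      # one vector per position
pos_emb.weight.requires_grad = False          # freeze it: the positional vectors stay constant

token_ids = torch.randint(0, vocab_size, (1, seq_len))  # a dummy batch with one sequence
positions = torch.arange(seq_len).unsqueeze(0)          # [[0, 1, ..., 9]]

x = word_emb(token_ids) + pos_emb(positions)  # token representations, shape (1, 10, 30)

The frozen weights can then be overwritten with whatever constant values we want during model initialization, for example the sinusoidal values described next, using pos_emb.weight.copy_(...) inside a torch.no_grad() block.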

What The Transformer’s Authors Did

Long story short, they needed a mathematical trick that creates, for each position (1, 2, ... etc.) in the input sequence, a vector of a given size, while ensuring that the vectors of different positions are not similar, regardless of the vector's dimensionality. The authors of the transformer introduced the following functions to generate the positional vectors of the input sequence:

PE(pos, 2i)   = sin( pos / 10000**(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000**(2i / d_model) )

where pos is the position in the sequence, i indexes the sine/cosine pairs of the vector, and d_model is the embedding dimensionality.

This means that for every position “pos”, we compute the vector entry at index “i” using one of the two trigonometric functions above, depending on whether “i” is even or odd. Assume the embedding dimensionality equals 30 and the sequence size equals 10. Then each token's embedding will be added to one of the following vectors, according to the token's position:

 1 -> [a0, a1, a2, a3, a4, a5, a6, ..., a29]
 2 -> [b0, b1, b2, b3, b4, b5, b6, ..., b29]
 .
10 -> [c0, c1, c2, c3, c4, c5, c6, ..., c29]

Where the entries are calculated as follows:

a0 = sin( 1 / 10000**(2*0/30) )   // even index 0 uses sin
a1 = cos( 1 / 10000**(2*0/30) )   // odd index 1 uses cos
a2 = sin( 1 / 10000**(2*1/30) )   // even index 2 uses sin
a3 = cos( 1 / 10000**(2*1/30) )   // odd index 3 uses cos
...
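
As a rough sketch, here is how such a matrix of positional vectors could be computed in PyTorch. The function name is mine, and positions are 1-based to match the example above:

import torch

def sinusoidal_positions(seq_len, d_model, n=10000.0):
    # Build the (seq_len, d_model) matrix of constant positional vectors (assumes d_model is even).
    pe = torch.zeros(seq_len, d_model)
    pos = torch.arange(1, seq_len + 1, dtype=torch.float32).unsqueeze(1)  # positions 1..seq_len
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)              # 2i = 0, 2, 4, ...
    angle = pos / n ** (two_i / d_model)                                  # pos / 10000**(2i/d_model)
    pe[:, 0::2] = torch.sin(angle)  # even entries use sin
    pe[:, 1::2] = torch.cos(angle)  # odd entries use cos
    return pe

P = sinusoidal_positions(seq_len=10, d_model=30)
print(P.shape)    # torch.Size([10, 30])
print(P[0, :4])   # a0, a1, a2, a3 for position 1, as in the table above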

In the linked notebook we trained an Arabic language embedding through the masking task while using a positional embedding alongside the word embeddings.


Written by Ahmad Mustapha

A computer engineer with a master's degree in engineering from AUB. He has worked on AI projects of different natures.
