The Right Way to Explain Attention

Ahmad Mustapha
3 min read · Mar 28, 2024

In this article, we will explain “Attention”.

Disclaimer! If you search online for attention, you will find K, Q, and V everywhere. Forget about those terms. I will explain “Attention” the way it should be explained, without bringing up the scientific jargon [keys, queries, values] from the field it was originally borrowed from.

Photo by Marek Piwnicki on Unsplash

Attention is Abstraction

My following statement is very important: from an application perspective, attention allows higher layers in the architecture to operate on relations, grammar, and semantics rather than on raw words, much like convolution allows higher layers to operate on visual concepts rather than on raw pixels. Language is understood in context. Words by themselves are symbols; they get meaning when they are grouped. That is why we can take a paragraph and say “This paragraph talks about immigration” even if the word “immigration” is never mentioned in it. This is the role of “attention”. I would rather call it “abstraction”, but I didn’t coin the term.

The Simple Math Behind Attention

We know how words can be expressed as vectors in an n-dimensional space. But how can we express a sentence? A sentence is more than a collection of words: it includes grammar, time, action, and meaning. A sentence is a collection of words related to each other. If words are vectors in a space, then a sentence is the similarity matrix between those words. How do we compute the similarity between two vectors? By taking their dot product.
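
To make this concrete, here is a tiny sketch with made-up 4-dimensional vectors (toy numbers, not real embeddings): the dot product is larger for words whose vectors point in similar directions.

import numpy as np

# Hypothetical toy embeddings, invented for illustration only
cat = np.array([0.9, 0.1, 0.3, 0.0])
dog = np.array([0.8, 0.2, 0.4, 0.1])
mat = np.array([0.0, 0.9, 0.1, 0.7])

print(np.dot(cat, dog))  # 0.86 -> "cat" and "dog" are relatively similar
print(np.dot(cat, mat))  # 0.12 -> "cat" and "mat" are less similar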

Consider the sentence S = “The cat is on the mat”. The visual concepts that pop into our minds are based on the sentence as a whole and on how its different parts relate to each other. Each one of these words is represented as a vector that holds information about the word and its position (remember, we added positional encoding).

Let V be the raw representation of the sentence S. V is a matrix of size (n, d), where n is the number of tokens (here 6) and d is the dimensionality of the word embedding. V is a collection of words; it does not contain the interactions between those words. To get the attended representation of S, we multiply V by the interactions between the words, that is, by the similarity matrix.

Attention is that simple.

                [The] [Cat] [Is]  [On]  [The] [Mat]
         [The]    :     :     :     :     :     :
         [Cat]    :     :     :     :     :     :
V.V^T =  [Is]     :     :     :     :     :     :
         [On]     :     :     :     :     :     :
         [The]    :     :     :     :     :     :
         [Mat]    :     :     :     :     :     :

: = similarity (dot product) between the row word’s vector and the column word’s vector
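
Here is a rough sketch of that computation with random stand-in embeddings (not a trained model): we build V for the six tokens, compute the similarity matrix V.V^T, and multiply it back into V to get the attended representation.

import numpy as np

n, d = 6, 8                      # 6 tokens of "The cat is on the mat", toy embedding size 8
rng = np.random.default_rng(0)
V = rng.normal(size=(n, d))      # stand-in for word embeddings + positional encoding

similarity = V @ V.T             # (n, n): how each token relates to every other token
attended = similarity @ V        # (n, d): each token becomes a mixture of all tokens,
                                 #         weighted by how strongly it relates to them
print(similarity.shape, attended.shape)   # (6, 6) (6, 8)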

Trainable Attention[s]

Now, the attended matrix that we get assumes the representation, the word embedding, is perfect. The truth is that the word embedding isn’t; we are training it to be good. Similarly, the attention must be trained. We will eventually end up with multiple attentions, because each attention learns something different about the sentence: some attentions learn grammatical rules, others learn semantic rules, and there are a lot of rules.

For some rules, the words should be represented differently: some words should get priority, other words should be dimmed, and so on. Hence, for attention we introduce three weight matrices W1, W2, and W3. We multiply each of them with V to get V1, V2, and V3, which gives us trainable attention.

By adding weights and using multiple attentions (multi-head attention), we allow our model to learn different sets of rules. By stacking more attention layers, we allow it to learn more and more complex rules.

N.B. In practice, we scale the similarity down and apply a softmax over it. Also, V1, V2, and V3 happen to be called Q, K, and V.
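
Putting the pieces together, here is a hedged PyTorch sketch of a single trainable attention head along the lines described above (the class name and sizes are illustrative, not taken from the notebook):

import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.W1 = nn.Linear(d, d, bias=False)   # produces V1, a.k.a. Q
        self.W2 = nn.Linear(d, d, bias=False)   # produces V2, a.k.a. K
        self.W3 = nn.Linear(d, d, bias=False)   # produces V3, a.k.a. V
        self.d = d

    def forward(self, x):                       # x: (n, d) token representations
        Q, K, V = self.W1(x), self.W2(x), self.W3(x)
        scores = Q @ K.T / self.d ** 0.5        # scaled similarity matrix, (n, n)
        weights = torch.softmax(scores, dim=-1) # each row sums to 1
        return weights @ V                      # attended representation, (n, d)

x = torch.randn(6, 8)                           # "The cat is on the mat", toy d = 8
out = SimpleAttention(8)(x)
print(out.shape)                                # torch.Size([6, 8])

Multi-head attention simply runs several such heads in parallel, each with its own W1, W2, and W3, and combines their outputs.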

In the linked notebook, we trained an Arabic language embedding on the masking task, using word embeddings, positional encoding, and attention.


Ahmad Mustapha

A computer engineer with a master’s degree in engineering from the AUB. He has worked on AI projects of different natures.