Introduction
Before 2017, language models struggled with long-range dependencies. Words far apart in a sentence would lose their relationship. Then came the paper “Attention Is All You Need”—and everything changed.
The core innovation? Self-attention and multi-head attention. Today, nearly every major language model (GPT, BERT, Gemini, Llama) builds on these mechanisms.
In this post, we’ll break down:
- What self-attention is and why it matters
- How attention scores are computed
- Multi-head attention — why one head is not enough
No advanced math. Just clear intuition.
The Problem: RNNs and Their Limits
Before Transformers, recurrent neural networks (RNNs) processed words one by one. To understand the last word of a sentence, the RNN had to remember everything before it. This led to:
- Vanishing gradients (forgetting early words)
- Sequential processing (slow, no parallelization)
Example sentence:
“The cat that chased the mouse under the old wooden table was tired.”
An RNN struggles to connect “was” with “cat” (not “table”) because the signal has to pass through every word in between. Self-attention links the two words directly, in one step.
What Is Self-Attention?
Self-attention allows every word in a sentence to look at every other word simultaneously and decide how much attention to pay to each.
Think of it like a meeting:
- Each person (word) prepares a question (Query).
- Each person also has a label describing what they know (Key) and the information itself (Value).
- You find who has the most relevant answer by matching your Query against everyone’s Keys.
- Then you collect their Values, giving the best matches the most weight.
Three Vectors for Each Word
For every input word, the model creates three vectors:
- Query (Q): What am I looking for?
- Key (K): What do I offer?
- Value (V): The actual information I carry
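The vectors themselves are not stored. Each one is computed by multiplying the word’s embedding by a weight matrix, and those three weight matrices are what the model learns during training.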
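As a rough sketch of where the three vectors come from, assume each word starts as an embedding vector and the model holds three weight matrices. The sizes and the names `X`, `W_q`, `W_k`, and `W_v` are illustrative assumptions, and the random matrices stand in for weights a real model would learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all sizes are assumptions for illustration):
# 3 words, each represented by a 4-dimensional embedding.
n_words, d_model, d_k = 3, 4, 2
X = rng.normal(size=(n_words, d_model))  # one row per word

# In a real model these matrices are learned; here they are random stand-ins.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # what each word is looking for
K = X @ W_k  # what each word offers
V = X @ W_v  # the information each word carries

print(Q.shape, K.shape, V.shape)
```

Every word gets its own row in `Q`, `K`, and `V`, so the whole sentence is projected in one matrix multiplication.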
How Attention Scores Are Calculated
Step-by-step for a sentence of length n:
- Dot Product: For each word, compute the dot product of its Query with every word’s Key: score(Q_i, K_j) = Q_i · K_j
- Scale: Divide each score by √dₖ, the square root of the Key dimension (this keeps scores from growing with the dimension, which stabilizes gradients).
- Softmax: Convert scores to probabilities (sum = 1).
- Weighted Sum: Multiply each value by its probability and sum up.
The result is a context-aware representation of each word.
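The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the function names `softmax` and `self_attention` are chosen here for the example, and the random `Q`, `K`, and `V` stand in for projections a trained model would produce.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability; the result is unchanged.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def self_attention(Q, K, V):
    """Scaled dot-product attention for one sentence.

    Q, K: arrays of shape (n, d_k); V: array of shape (n, d_v).
    Returns (output, weights) with shapes (n, d_v) and (n, n).
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T                    # 1. dot product of every Query with every Key
    scores = scores / np.sqrt(d_k)      # 2. scale by sqrt(d_k)
    weights = softmax(scores, axis=-1)  # 3. softmax: each row sums to 1
    output = weights @ V                # 4. weighted sum of Values
    return output, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out, w = self_attention(Q, K, V)
print(out.shape, w.shape)
```

Row i of `w` holds the attention that word i pays to every word in the sentence, and row i of `out` is that word’s context-aware representation.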
Example
Sentence: “I love you”
To understand “love,” the model might assign the following:
- 20% attention to “I”
- 60% attention to “love” (itself)
- 20% attention to “you”
These numbers are illustrative. The actual weights depend on the sentence context and on what the model learned during training.
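To see how a 20/60/20 split can arise, here is a small sketch with hand-picked scores. The scores are hypothetical, chosen so the arithmetic is easy to follow: their exponentials are 1, 3, and 1, which sum to 5.

```python
import numpy as np

words = ["I", "love", "you"]

# Hypothetical scaled scores for the Query of "love" against each Key.
# Chosen so that exp(score) is 1, 3, 1, which gives a 20/60/20 split.
scores = np.array([0.0, np.log(3.0), 0.0])

weights = np.exp(scores) / np.exp(scores).sum()  # softmax

for word, weight in zip(words, weights):
    print(f"{word}: {weight:.0%}")
```

A trained model would produce different scores for a different sentence, and the softmax would redistribute the attention accordingly.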
Why “Self”? Because It’s Within the Same Sentence
- Self-attention = Attention between words of the same sequence.
- Cross-attention = Attention between two different sequences, such as encoder output and decoder input (we’ll cover it in Post 2).
Self-attention captures syntactic and semantic relationships:
- Pronouns to nouns (“she” → “Maria”)
- Adjectives to nouns (“beautiful” → “sunset”)
- Long-range dependencies (opening clause to closing verb)
Multi-Head Attention: Seeing Different Relationships
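A single attention head is powerful but limited. Because it produces only one set of attention weights per word, it tends to capture one kind of relationship at a time. A word may need to be understood in terms of its: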
- Grammatical role (subject vs. object)
- Semantic meaning (synonyms, antonyms)
- Positional proximity (nearby vs. far words)
Multi-head attention runs multiple self-attention operations in parallel, each with its own learned Q, K, and V projections.
How It Works
- Input → Split into h heads (typically 8 or 12)
- Each head computes self-attention independently
- Outputs of all heads are concatenated and projected
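The split, attend, and concatenate steps can be sketched as follows. This is a simplified illustration under a few assumptions: the function name `multi_head_attention` and all sizes are made up for the example, the weights are random rather than learned, and each head’s projection is taken as a column block of one larger matrix, which is a common way to give every head its own Q, K, and V projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    """X: (n, d_model). W_q, W_k, W_v, W_o: (d_model, d_model)."""
    n, d_model = X.shape
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads

    def split_heads(M):
        # (n, d_model) -> (n_heads, n, d_head)
        return M.reshape(n, n_heads, d_head).transpose(1, 0, 2)

    Q = split_heads(X @ W_q)
    K = split_heads(X @ W_k)
    V = split_heads(X @ W_v)

    # Each head attends independently: scores has shape (n_heads, n, n).
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)
    heads = weights @ V  # (n_heads, n, d_head)

    # Concatenate heads back to (n, d_model), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, n_heads = 5, 16, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))
out, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads)
print(out.shape, weights.shape)
```

The output has the same shape as the input, while `weights` holds one full attention map per head, so each head is free to focus on different word pairs.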
What Each Head Learns
Research suggests that different heads tend to specialize, though rarely as cleanly as in this illustrative list:
- Head 1: Grammar (subject-verb agreement)
- Head 2: Coreference (pronouns to nouns)
- Head 3: Semantic similarity
- Head 4: Positional patterns
Together, they give the model a richer understanding.
Mathematical Overview (Simplified)
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V

MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) Wᴼ

where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)
Don’t memorize this. Understand the idea:
- QKᵀ measures similarity
- softmax turns similarity into probability
- Multiply by V to collect information
- Do this multiple times in parallel (multi-head)
Why This Matters for Your Company
If you’re building:
- Chatbots: Attention helps capture user intent
- Search engines: Query-key matching is the same idea attention is built on
- Document summarization: Attention helps find important sentences
- Code generation: Attention links variable usage across many lines
Few modern NLP products ship without attention.
Summary
| Concept | One-Line Explanation |
|---|---|
| Self-attention | Every word looks at every other word |
| Query, Key, Value | Vectors computed from learned projections that guide attention |
| Softmax | Converts similarity scores to probabilities |
| Multi-head | Multiple attention mechanisms in parallel |