Self-Attention & Multi-Head Attention — The Brain of Modern AI

Introduction

Before 2017, language models struggled with long-range dependencies: words far apart in a sentence would lose their relationship. Then came the paper “Attention Is All You Need,” and everything changed.

The core innovation? Self-attention and multi-head attention. Today, every major AI model (GPT, BERT, Gemini, Llama) uses these mechanisms.

In this post, we’ll break down:

What self-attention is and why it matters
How attention scores are computed
Multi-head attention — why one head is not enough

No advanced math. Just clear intuition.

The Problem: RNNs and Their Limits

Before Transformers, recurrent neural networks (RNNs) processed words one by one. To understand the last word of a sentence, the RNN had to remember everything before it. This led to:

Vanishing gradients (forgetting early words)
Sequential processing (slow, no parallelization)

Example sentence:

“The cat that chased the mouse under the old wooden table was tired.”

An RNN struggles to connect “was” with “cat” (not “table”) because so many words sit between them. Self-attention relates the two words directly, in one step.

What Is Self-Attention?

Self-attention allows every word in a sentence to look at every other word simultaneously and decide how much attention to pay to each.

Think of it like a meeting:

Each person (word) prepares a question (Query).
Each person also has a label describing what they know (Key) and the information itself (Value).
Each person finds who has the most relevant answers by matching their Query against everyone’s Keys.
Then they collect the Values, weighted by how well each Key matched.

Three Vectors for Each Word

For every input word, the model creates three vectors:

Query (Q): What am I looking for?
Key (K): What do I offer?
Value (V): The actual information I carry

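The vectors are computed by multiplying each word’s embedding by three weight matrices. Those matrices are what the model learns during training.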
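As a rough illustration of how the three vectors come about, here is a minimal NumPy sketch. The sizes (`d_model`, `d_k`), the random embeddings `X`, and the matrix names `W_q`, `W_k`, `W_v` are assumptions made for this example; in a real model the matrices would hold trained weights rather than random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 3 words, embedding size 8, Query/Key/Value size 4.
n_words, d_model, d_k = 3, 8, 4

# One embedding vector per word (random stand-ins for real embeddings).
X = rng.normal(size=(n_words, d_model))

# Projection matrices. A trained model learns these; here they are random.
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q  # what each word is looking for
K = X @ W_k  # what each word offers
V = X @ W_v  # the information each word carries

print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```

Every word gets its own row in Q, K, and V, and all rows are produced by the same three matrices.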

How Attention Scores Are Calculated

Step-by-step for a sentence of length n:

Dot Product: For each word, compute the dot product of its Query with every word’s Key: score(Q_i, K_j) = Q_i · K_j
Scale: Divide each score by √dₖ, the square root of the Key dimension (this stabilizes gradients).
Softmax: Convert scores to probabilities (sum = 1).
Weighted Sum: Multiply each value by its probability and sum up.

The result is a context-aware representation of each word.
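The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration, not a production implementation: the function names `softmax` and `self_attention` are chosen for this example, the inputs are random, and masking and batching are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T                    # 1. dot product of every Query with every Key
    scores = scores / np.sqrt(d_k)      # 2. scale by the square root of the Key dimension
    weights = softmax(scores, axis=-1)  # 3. each row becomes probabilities that sum to 1
    output = weights @ V                # 4. weighted sum of the Values
    return output, weights

# Random stand-ins for the Q, K, V of a 3-word sentence.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = self_attention(Q, K, V)
print(weights.shape)  # (3, 3): one row of attention weights per word
print(output.shape)   # (3, 4): one context-aware vector per word
```

Row i of `weights` says how much word i attends to each word in the sentence, and row i of `output` is the resulting context-aware representation.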

Example

Sentence: “I love you”

To understand “love,” the model might assign the following:

20% attention to “I”
60% attention to “love” (itself)
20% attention to “you”

These numbers are illustrative. The actual weights depend on the sentence context and on what the model has learned.
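To make the percentages concrete, the sketch below picks hypothetical scores for the word “love” that happen to produce a 20% / 60% / 20% split after softmax, then blends three made-up Value vectors with those weights. The scores and Values are invented for this example, not taken from a real model.

```python
import math

words = ["I", "love", "you"]

# Hypothetical scaled scores for the Query of "love" against each Key.
scores = [0.0, math.log(3), 0.0]

# Softmax: exponentiate, then normalize so the weights sum to 1.
exps = [math.exp(s) for s in scores]
total = sum(exps)
weights = [e / total for e in exps]  # approximately [0.2, 0.6, 0.2]

# Made-up 2-dimensional Value vectors, one per word.
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Weighted sum of the Values: the new representation of "love".
context = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(2)]

for word, w in zip(words, weights):
    print(f"{word}: {w:.0%}")
print([round(c, 2) for c in context])  # [0.4, 0.8]
```

The output vector for “love” is mostly its own Value, with smaller contributions from “I” and “you”.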

Why “Self”? Because It’s Within the Same Sentence

Self-attention = Attention between words of the same sequence.
Cross-attention = Attention between two different sequences, such as encoder and decoder (we’ll cover it in Post 2).
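In code, the difference comes down to where the Queries, Keys, and Values come from. The sketch below is a simplified illustration: the `attention` helper and the array names `encoder_states` and `decoder_states` are assumptions for this example, and the learned Q/K/V projections are omitted to keep it short.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
d = 4
encoder_states = rng.normal(size=(5, d))  # a 5-word input sequence
decoder_states = rng.normal(size=(2, d))  # 2 output words generated so far

# Self-attention: Queries, Keys, and Values all come from the same sequence.
_, self_weights = attention(encoder_states, encoder_states, encoder_states)

# Cross-attention: Queries come from one sequence, Keys and Values from another.
_, cross_weights = attention(decoder_states, encoder_states, encoder_states)

print(self_weights.shape)   # (5, 5): every input word attends to every input word
print(cross_weights.shape)  # (2, 5): every output word attends to every input word
```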

Self-attention captures syntactic and semantic relationships:

Pronouns to nouns (“she” → “Maria”)
Adjectives to nouns (“beautiful” → “sunset”)
Long-range dependencies (opening clause to closing verb)

Multi-Head Attention: Seeing Different Relationships

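A single self-attention operation is powerful but limited: it produces only one set of attention weights per word, so it tends to capture one kind of relationship at a time. A word might need to be understood in terms of its: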

Grammatical role (subject vs. object)
Semantic meaning (synonyms, antonyms)
Positional proximity (nearby vs. far words)

Multi-head attention runs multiple self-attention operations in parallel, each with its own learned Q, K, and V projections.

How It Works

Input → Split into h heads (typically 8 or 12)
Each head computes self-attention independently
Outputs of all heads are concatenated and projected
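The three steps above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the function name `multi_head_attention`, the sizes, and the random weight matrices are all made up for the example, and it uses the common trick of one large projection per Q, K, and V that is then split into h slices, which is equivalent to giving each head its own smaller projection.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    n, d_model = X.shape
    d_k = d_model // h  # size of each head

    def split(M):
        # (n, d_model) -> (h, n, d_k): one slice of the projection per head
        return M.reshape(n, h, d_k).transpose(1, 0, 2)

    Q, K, V = split(X @ W_q), split(X @ W_k), split(X @ W_v)

    # Each head computes scaled dot-product attention independently.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)  # (h, n, n)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                               # (h, n, d_k)

    # Concatenate the heads and apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d_model)
    return concat @ W_o, weights

rng = np.random.default_rng(0)
n, d_model, h = 3, 8, 2  # hypothetical sizes: 3 words, 2 heads of size 4
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

output, weights = multi_head_attention(X, W_q, W_k, W_v, W_o, h)
print(output.shape)   # (3, 8): same shape as the input
print(weights.shape)  # (2, 3, 3): a separate attention pattern per head
```

Because each head has its own slice of the projections, the two 3×3 attention patterns in `weights` are generally different from each other.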

What Each Head Learns

Research suggests that different heads tend to specialize, although real heads are rarely this clean-cut. An idealized picture:

Head 1: Grammar (subject-verb agreement)
Head 2: Coreference (pronouns to nouns)
Head 3: Semantic similarity
Head 4: Positional patterns

Together, they give the model a richer understanding.

Mathematical Overview (Simplified)


Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ)Wᴼ
where headᵢ = Attention(QWᵢ^Q, KWᵢ^K, VWᵢ^V)

Don’t memorize this. Understand the idea:

QKᵀ measures similarity
softmax turns similarity into probability
Multiply by V to collect information
Do this multiple times in parallel (multi-head)
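One detail the list above skips is the division by √dₖ. The small experiment below hints at why it is there. It assumes, purely for illustration, that Query and Key entries are independent standard normal values and that dₖ = 64; under that assumption the raw dot products spread out with a standard deviation near √dₖ, and dividing by √dₖ brings the spread back to roughly 1, which keeps softmax from saturating.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64          # hypothetical Key dimension
n_pairs = 10_000  # number of random Query/Key pairs to sample

q = rng.normal(size=(n_pairs, d_k))
k = rng.normal(size=(n_pairs, d_k))

raw = (q * k).sum(axis=1)    # unscaled dot products Q_i · K_j
scaled = raw / np.sqrt(d_k)  # the same scores after scaling

raw_std = raw.std()        # close to sqrt(64) = 8
scaled_std = scaled.std()  # close to 1

print(round(float(raw_std), 2), round(float(scaled_std), 2))
```

Without the scaling, larger dₖ would push scores further apart, and softmax would put almost all of its weight on a single word.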

Why This Matters for Your Company

If you’re building:

Chatbots: Attention captures user intent
Search engines: Matching a query against document keys is the same idea attention uses
Document summarization: Attention finds important sentences
Code generation: Attention links variable usage across many lines

Almost no modern NLP product ships without attention.

Summary

Concept	One-Line Explanation
Self-attention	Every word looks at every other word
Query, Key, Value	Vectors from learned projections that guide attention
Softmax	Converts similarity scores to probabilities
Multi-head	Multiple attention mechanisms in parallel