Self-Attention & Multi-Head Attention — The Brain of Modern AI


Introduction

Before 2017, language models struggled with long-range dependencies. The relationship between words far apart in a sentence was often lost. Then came the paper “Attention Is All You Need,” and everything changed.

The core innovation? Self-attention and multi-head attention. Today, nearly every major language model (GPT, BERT, Gemini, Llama) is built on these mechanisms.

In this post, we’ll break down:

  • What self-attention is and why it matters
  • How attention scores are computed
  • Multi-head attention — why one head is not enough

No advanced math. Just clear intuition.


The Problem: RNNs and Their Limits

Before Transformers, recurrent neural networks (RNNs) processed words one by one. To understand the last word of a sentence, the RNN had to remember everything before it. This led to:

  • Vanishing gradients (forgetting early words)
  • Sequential processing (slow, no parallelization)

Example sentence:

“The cat that chased the mouse under the old wooden table was tired.”

An RNN struggles to connect “was” with “cat” (not “table”), because the signal from “cat” has to survive every word in between. Self-attention relates the two words directly, in one step.


What Is Self-Attention?

Self-attention allows every word in a sentence to look at every other word simultaneously and decide how much attention to pay to each.

Think of it like a meeting:

  • Each person (word) prepares a question (Query).
  • Each person also wears a label describing what they know (Key) and holds the actual information (Value).
  • Each person finds who has the most relevant answer by matching their Query against everyone’s Keys.
  • They then collect the Values, weighted toward the best matches.

Three Vectors for Each Word

For every input word, the model creates three vectors:

  1. Query (Q): What am I looking for?
  2. Key (K): What do I offer?
  3. Value (V): The actual information I carry

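Each vector is computed by multiplying the word’s embedding by a weight matrix. The three weight matrices (one each for Q, K, and V) are learned during training.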
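
To make the three vectors concrete, here is a minimal sketch in plain Python. The 2-dimensional embeddings and the matrices W_Q, W_K, and W_V are made-up numbers chosen for readability, and the helper name project is hypothetical; a real model learns much larger matrices during training.

```python
# Toy 2-dimensional embeddings for a three-word sentence (made-up numbers).
embeddings = {
    "I":    [1.0, 0.0],
    "love": [0.0, 1.0],
    "you":  [1.0, 1.0],
}

# Hypothetical 2x2 projection matrices. In a trained model these are learned.
W_Q = [[1.0, 0.0], [0.0, 2.0]]
W_K = [[0.5, 0.5], [1.0, 0.0]]
W_V = [[0.0, 1.0], [1.0, 0.0]]


def project(vector, matrix):
    """Multiply a row vector by a matrix: result[j] = sum_i vector[i] * matrix[i][j]."""
    return [
        sum(vector[i] * matrix[i][j] for i in range(len(vector)))
        for j in range(len(matrix[0]))
    ]


# Every word gets its own Query, Key, and Value from the same three matrices.
qkv = {
    word: {
        "Q": project(x, W_Q),
        "K": project(x, W_K),
        "V": project(x, W_V),
    }
    for word, x in embeddings.items()
}

for word, vectors in qkv.items():
    print(word, vectors)
```

The same three matrices are shared by every word, so the only thing that makes one word’s Query differ from another’s is its embedding.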


How Attention Scores Are Calculated

Step-by-step for a sentence of length n:

  1. Dot Product: For each word, compute the dot product of its Query with every word’s Key: score(Q_i, K_j) = Q_i · K_j
  2. Scale: Divide each score by √dₖ, the square root of the Key dimension. This keeps the softmax out of its saturated range and stabilizes gradients.
  3. Softmax: Convert scores to probabilities (sum = 1).
  4. Weighted Sum: Multiply each value by its probability and sum up.

The result is a context-aware representation of each word.
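
The four steps above can be sketched in a few lines of plain Python. This is an illustration rather than a production implementation: the function names softmax and self_attention are our own, the Q, K, and V numbers are arbitrary, and real systems do all of this as batched matrix operations on a GPU.

```python
import math


def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q, K, V):
    """Scaled dot-product attention for lists of row vectors Q, K, V."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every Key, then scale by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into attention weights.
        weights = softmax(scores)
        # Step 4: weighted sum of the Value vectors.
        context = [
            sum(w * v[j] for w, v in zip(weights, V))
            for j in range(len(V[0]))
        ]
        outputs.append(context)
        all_weights.append(weights)
    return outputs, all_weights


# Toy inputs: 3 words, d_k = 2 (arbitrary values).
Q = [[1.0, 0.0], [0.0, 2.0], [1.0, 2.0]]
K = [[0.5, 0.5], [1.0, 0.0], [1.5, 0.5]]
V = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

outputs, weights = self_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each printed row is one word’s attention distribution over the whole sentence, and each entry of outputs is that word’s new context-aware vector.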

Example

Sentence: “I love you”

To understand “love,” the model might assign the following:

  • 20% attention to “I”
  • 60% attention to “love” (itself)
  • 20% attention to “you”

These percentages are illustrative. The actual weights depend on the sentence context and on the model’s learned parameters.
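
Where could a 20/60/20 split come from? The short sketch below picks hypothetical scaled scores for “love” (0, ln 3, and 0, chosen only because they make the arithmetic come out round) and runs them through softmax. No real model produced these scores.

```python
import math

# Hypothetical scaled scores for "love" against each word in the sentence.
# ln(3) is chosen so that the exponentials come out to 1, 3, 1.
words = ["I", "love", "you"]
scores = [0.0, math.log(3), 0.0]

exps = [math.exp(s) for s in scores]      # approximately [1, 3, 1]
weights = [e / sum(exps) for e in exps]   # approximately [0.2, 0.6, 0.2]

for word, weight in zip(words, weights):
    print(f"{word}: {weight:.0%}")
# I: 20%
# love: 60%
# you: 20%
```

A score gap of about 1.1 is enough to give one word three times the attention of the others, which shows how sharply softmax rewards the best match.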


Why “Self”? Because It’s Within the Same Sentence

  • Self-attention = Attention between words of the same sequence.
  • Cross-attention = Attention between two different sequences, such as decoder Queries attending to encoder outputs (we’ll cover this in Post 2).

Self-attention captures syntactic and semantic relationships:

  • Pronouns to nouns (“she” → “Maria”)
  • Adjectives to nouns (“beautiful” → “sunset”)
  • Long-range dependencies (opening clause to closing verb)

Multi-Head Attention: Seeing Different Relationships

A single attention head is powerful but limited: it produces one set of attention weights per word, so it tends to capture one kind of relationship at a time. A word might need to be understood in terms of its:

  • Grammatical role (subject vs. object)
  • Semantic meaning (synonyms, antonyms)
  • Positional proximity (nearby vs. far words)

Multi-head attention runs multiple self-attention operations in parallel, each with its own learned Q, K, and V projections.

How It Works

  • Input → Projected into h heads (typically 8 or 12), each working in a smaller subspace
  • Each head computes self-attention independently
  • Outputs of all heads are concatenated and passed through a final linear projection
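
Here is a rough sketch of those three steps in plain Python. Everything in it is an assumption made for illustration: the helper names (matmul, attention, multi_head_attention, random_matrix), the tiny sizes (3 words, model dimension 4, 2 heads), and the random weights that stand in for learned ones.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one triple per head."""
    # Each head projects the input with its own matrices and attends independently.
    head_outputs = [
        attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
        for W_Q, W_K, W_V in heads
    ]
    # Concatenate the heads word by word, then apply the output projection.
    concatenated = [sum((head[i] for head in head_outputs), []) for i in range(len(X))]
    return matmul(concatenated, W_O)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    """Stand-in for learned weights: uniform random values in [-1, 1]."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


d_model, num_heads = 4, 2
d_head = d_model // num_heads  # each head works in a smaller subspace

X = random_matrix(3, d_model)  # 3 words, each a d_model-dimensional vector
heads = [
    tuple(random_matrix(d_model, d_head) for _ in range(3))
    for _ in range(num_heads)
]
W_O = random_matrix(d_model, d_model)

output = multi_head_attention(X, heads, W_O)
print(len(output), len(output[0]))  # 3 4
```

Note that the output has the same shape as the input (3 words by 4 dimensions), which is what lets attention layers be stacked on top of one another.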

What Each Head Learns

Research suggests that heads often specialize, though rarely as neatly as this illustrative breakdown:

  • Head 1: Grammar (subject-verb agreement)
  • Head 2: Coreference (pronouns to nouns)
  • Head 3: Semantic similarity
  • Head 4: Positional patterns

Together, they give the model a richer understanding.


Mathematical Overview (Simplified)

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) V
MultiHead(Q, K, V) = Concat(head₁, ..., headₕ) W^O
where headᵢ = Attention(Q Wᵢ^Q, K Wᵢ^K, V Wᵢ^V)

Don’t memorize this. Understand the idea:

  • QKᵀ measures similarity
  • softmax turns similarity into probability
  • Multiply by V to collect information
  • Do this multiple times in parallel (multi-head)
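
One detail in the formula that is easy to skip past is the division by √dₖ. The sketch below compares softmax with and without it for random vectors. The dimension of 256, the random Gaussian vectors, and the seed are all arbitrary choices made for this demonstration.

```python
import math
import random


def softmax(scores):
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
d_k = 256               # hypothetical Key dimension

# One Query and four Keys with standard-normal entries.
query = [rng.gauss(0, 1) for _ in range(d_k)]
keys = [[rng.gauss(0, 1) for _ in range(d_k)] for _ in range(4)]

raw = [sum(q * k for q, k in zip(query, key)) for key in keys]
scaled = [s / math.sqrt(d_k) for s in raw]

print("unscaled:", [round(w, 3) for w in softmax(raw)])
print("scaled:  ", [round(w, 3) for w in softmax(scaled)])
```

Dot products of long vectors tend to be large in magnitude, so the unscaled softmax usually puts almost all of its weight on a single word. Scaling flattens the distribution, which leaves room for useful gradients during training.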

Why This Matters for Your Company

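If you’re building any of the following, attention is doing much of the work:

  • Chatbots: attention relates the words of a message to one another, which helps capture user intent
  • Search engines: matching a Query against Keys is the same idea as matching a search against documents
  • Document summarization: attention helps identify the important sentences
  • Code generation: attention links variable usage across many lines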

Almost no modern NLP product ships without attention.


Summary

Concept             One-Line Explanation
Self-attention      Every word looks at every other word
Query, Key, Value   Learned vectors that guide attention
Softmax             Converts similarity scores to probabilities
Multi-head          Multiple attention mechanisms in parallel
