Introduction
Self-attention and multi-head attention are the engines. But an engine alone doesn’t make a car. The Transformer architecture has two other critical components:
- Encoder-decoder structure: enables sequence-to-sequence tasks (translation, summarization)
- Positional encoding: needed because self-attention has no built-in sense of order
In this post, we’ll understand how these pieces fit together to create the most influential AI architecture of the decade.
The Encoder-Decoder Paradigm
The Transformer is not a single block. It’s a pipeline of two main components:
Encoder: Understanding the Input
The encoder reads the entire input sequence (e.g., an English sentence) and produces a deep representation for each word.
Each encoder layer (the original Transformer stacks 6; BERT-base stacks 12) contains:
- Multi-head self-attention
- Feed-forward neural network
- Residual connections + LayerNorm around each sub-layer
Output of encoder: A set of context-rich vectors, one per input token.
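To make the layer structure concrete, here is a minimal NumPy sketch of one encoder layer. It is an illustration under simplifying assumptions, not a faithful implementation: attention uses a single head with identity projections, LayerNorm has no learned scale or shift, and the names `encoder_layer`, `w1`, and `w2` are made up for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x):
    # Single head, identity Q/K/V projections (a simplification).
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

def feed_forward(x, w1, w2):
    # Position-wise two-layer network with ReLU.
    return np.maximum(0.0, x @ w1) @ w2

def encoder_layer(x, w1, w2):
    # Residual connection + LayerNorm around each sub-layer.
    x = layer_norm(x + self_attention(x))
    x = layer_norm(x + feed_forward(x, w1, w2))
    return x

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 8))      # 3 tokens, d_model = 8
w1 = rng.normal(size=(8, 32)) * 0.1   # random stand-ins for trained weights
w2 = rng.normal(size=(32, 8)) * 0.1
out = encoder_layer(tokens, w1, w2)
print(out.shape)  # (3, 8)
```

The output has the same shape as the input, which is what lets layers be stacked: one vector per token goes in, one context-enriched vector per token comes out.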
Decoder: Generating the Output
The decoder generates the output sequence one token at a time (e.g., French translation).
Decoder layers have three sub-layers:
- Masked multi-head self-attention (can’t look at future output tokens)
- Cross-attention (encoder-decoder attention) — queries from decoder, keys/values from encoder
- Feed-forward network
The masked attention prevents cheating: when generating word 3, the decoder cannot see words 4, 5, 6.
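One common way to implement the mask is to set the attention scores for future positions to negative infinity before the softmax, so they receive zero weight. The sketch below assumes random scores in place of real query-key products; `causal_mask` and `masked_softmax` are illustrative names.

```python
import numpy as np

def causal_mask(n):
    # True where query position i may attend to key position j (j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions become -inf, which turns into weight 0 after exp.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.normal(size=(4, 4))   # stand-in for Q @ K.T / sqrt(d_k)
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
```

Every entry above the diagonal is zero, so position 3 never receives information from positions 4 and later, while each row still sums to 1.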
How They Work Together: Translation Example
Task: Translate English “I love you” to French “Je t’aime.”
Step 1 — Encoding:
- Input [“I,” “love,” “you”] passes through encoder
- Encoder produces vectors enriched with context
Step 2 — Decoding (autoregressive):
- Start with the <START> token
- Decoder attends to encoder’s output (cross-attention) to understand input
- Generate first word: “Je”
- Feed “Je” back into decoder
- Generate second word: “t’”
- Feed “Je t’” back
- Generate third word: “aime”
- Stop at <END>
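The decoding loop above can be sketched in a few lines. Note that `toy_next_token` is a hypothetical stand-in for a trained decoder: it simply replays the expected translation, whereas a real model would run masked self-attention and cross-attention to score every vocabulary entry.

```python
START, END = "<START>", "<END>"

def toy_next_token(source, generated):
    # Hypothetical stand-in for a trained decoder: replays a fixed answer.
    # A real model would attend over `generated` (masked self-attention)
    # and over the encoded `source` (cross-attention).
    target = ["Je", "t'", "aime", END]
    return target[len(generated) - 1]

def greedy_decode(source, next_token_fn, max_len=10):
    generated = [START]
    while len(generated) < max_len:
        token = next_token_fn(source, generated)
        generated.append(token)      # feed the new token back in
        if token == END:
            break
    # Drop the <START> marker, and <END> if it was produced.
    body = generated[1:]
    return body[:-1] if body and body[-1] == END else body

translation = greedy_decode(["I", "love", "you"], toy_next_token)
print(translation)
```

The `max_len` cap is a safety net: a real model is not guaranteed to emit `<END>`, so decoders usually stop after a fixed number of steps.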
Each generation step uses:
- Self-attention: On previously generated words
- Cross-attention: On encoder’s representation of input
This is why Transformers excel at sequence-to-sequence tasks.
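Cross-attention is easiest to see through its shapes: queries come from the decoder, keys and values from the encoder. The sketch below is a single-head version with random matrices standing in for trained projection weights (`w_q`, `w_k`, `w_v` are assumed names).

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q    # (tgt_len, d_k): queries from the decoder
    k = encoder_output @ w_k    # (src_len, d_k): keys from the encoder
    v = encoder_output @ w_v    # (src_len, d_k): values from the encoder
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tgt_len, src_len)
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(3, d_model))  # "I", "love", "you"
decoder_states = rng.normal(size=(2, d_model))  # "<START>", "Je"
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (2, 8) (2, 3)
```

Each row of `weights` is a distribution over the three source tokens, telling one target position how much to read from each input word.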
The Silent Crisis: Self-Attention Has No Order
Here’s a shocking fact: Self-attention treats a sentence as a bag of words. Consider these two sentences:
- “Dog bites man”
- “Man bites dog”
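Self-attention sees the same set of words: {dog, bites, man}. Without position information, each word receives exactly the same output vector in both sentences, because the mechanism is permutation-equivariant: reordering the input only reorders the outputs.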
But meaning is completely different! We need to encode position.
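A quick numerical check makes the problem visible. The sketch assumes random 4-dimensional word embeddings and a bare single-head attention with identity projections; neither comes from a real model.

```python
import numpy as np

def self_attention(x):
    # Single head with identity projections, no positional information.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=4) for w in ["dog", "bites", "man"]}

a = np.stack([emb[w] for w in ["dog", "bites", "man"]])
b = np.stack([emb[w] for w in ["man", "bites", "dog"]])

out_a = self_attention(a)
out_b = self_attention(b)

# "dog" is row 0 in sentence a and row 2 in sentence b.
same_dog = np.allclose(out_a[0], out_b[2])
print(same_dog)  # True
```

The vector for "dog" is identical whether the dog bites or gets bitten, so nothing downstream can tell the two sentences apart.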
Positional Encoding: Adding Back the Order
The original Transformer injects position information by adding a position vector to each token embedding before the first attention layer.
The Simple Approach (No—Too Limiting)
Naively adding [0, 1, 2, 3…] fails because:
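- Unbounded values: at large positions the index swamps the embedding values it is added to
- Poor generalization: positions beyond the longest training sentence are numbers the model has never seen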
The Transformer Solution: Sinusoidal Encoding
The original paper used sine and cosine functions of different frequencies:
$$
PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$

$$
PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)
$$
Where:
- pos = position of the word (0, 1, 2…)
- i = dimension index
- d_model = embedding size
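The formula translates almost directly into NumPy. This is a minimal sketch that assumes `d_model` is even; the function name is chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even: each index i yields one sin/cos pair.
    pos = np.arange(max_len)[:, None]        # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]     # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)             # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)             # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]
```

The resulting matrix is added to the token embeddings row by row, so row `pos` carries the signature of position `pos`.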
Why Sinusoidal?
- Unique pattern for each position
- Bounded values between -1 and 1 (no domination)
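- Relative positioning: for a fixed offset k, PE(pos + k) is a linear function of PE(pos), so the model can learn to attend based on distance (posₖ – posⱼ)
- No learned parameters: the encoding is defined for any position, so it may extrapolate to sequences longer than those seen during training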
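The relative-positioning property can be checked numerically: the dot product between two sinusoidal encodings depends only on how far apart the positions are, not on where they sit. The helper `pe_vector` below is an illustrative name, and `d_model = 16` is an arbitrary choice.

```python
import numpy as np

def pe_vector(pos, d_model=16):
    # Sinusoidal encoding for a single position (d_model assumed even).
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

offset = 5
near_start = pe_vector(3) @ pe_vector(3 + offset)
far_along = pe_vector(40) @ pe_vector(40 + offset)
print(np.isclose(near_start, far_along))  # True
```

This follows from the identity sin(a)sin(b) + cos(a)cos(b) = cos(a - b): each sine/cosine pair contributes a term that depends only on the offset.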
Learned vs. Fixed Positional Encoding
Modern models have experimented with both:
| Model | Approach |
|---|---|
| Original Transformer | Fixed sinusoidal |
| BERT | Learned positional embeddings |
| GPT-3 | Learned |
| Llama | Rotary positional embeddings (RoPE) |
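Learned encodings are more flexible, but the embedding table only covers positions up to the maximum length chosen for training. Sinusoidal encodings are defined for every position, which is why the original paper expected them to extrapolate better to longer sequences.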
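The length limit of learned encodings is visible in a toy comparison. Here `learned_table` is a random matrix standing in for trained embedding weights, and `max_len = 8` is an arbitrary assumption for the example.

```python
import numpy as np

max_len, d_model = 8, 4
rng = np.random.default_rng(0)
learned_table = rng.normal(size=(max_len, d_model))  # stand-in for trained weights

def learned_pe(pos):
    # A lookup table has no row for positions >= max_len.
    return learned_table[pos]

def sinusoidal_pe(pos):
    # A closed-form function can be evaluated at any position.
    i = np.arange(d_model // 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(angles)
    pe[1::2] = np.cos(angles)
    return pe

try:
    learned_pe(20)
    learned_ok = True
except IndexError:
    learned_ok = False

print(learned_ok)               # False
print(sinusoidal_pe(20).shape)  # (4,)
```

Being computable at position 20 does not guarantee the model handles it well; it only means the encoding itself is available for lengths beyond training.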
Visualizing the Complete Architecture
Input Sequence
↓
[Positional Encoding + Embedding]
↓
Encoder Stack (N×)
├─ Multi-Head Self-Attention
├─ Add & LayerNorm
├─ Feed-Forward Network
└─ Add & LayerNorm
↓
Encoder Output (Memory)
↓
Decoder Stack (N×)
├─ Masked Multi-Head Self-Attention
├─ Add & LayerNorm
├─ Cross-Attention (Q from decoder, K/V from encoder)
├─ Add & LayerNorm
├─ Feed-Forward Network
└─ Add & LayerNorm
↓
Linear + Softmax
↓
Output Sequence
Why This Architecture Won
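- Parallelization: Unlike RNNs, all positions are processed simultaneously during training (generation at inference time is still token by token)
- Long-range dependencies: Attention connects any two positions in a single step, so information does not have to survive a long chain of recurrent updates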
- Scalability: More layers, heads, and parameters predictably improve performance
- Transfer learning: Pretrained encoders (BERT) or decoders (GPT) fine-tune beautifully
Summary
| Component | Role |
|---|---|
| Encoder | Understands input sequence |
| Decoder | Generates output step-by-step |
| Cross-attention | The decoder queries encoder’s memory |
| Positional encoding | Adds word order information |
| Masked attention | Prevents peeking at future tokens |