Encoder-Decoder Architecture and Positional Encoding: The Complete Transformer Blueprint

Introduction

Self-attention and multi-head attention are the engines. But an engine alone doesn’t make a car. The Transformer architecture has two other critical components:

Encoder-decoder structure: enables sequence-to-sequence tasks (translation, summarization)
Positional encoding: needed because self-attention has no built-in sense of order

In this post, we’ll understand how these pieces fit together to create the most influential AI architecture of the decade.

The Encoder-Decoder Paradigm

The Transformer is not a single block. It’s a pipeline of two main components:

Encoder: Understanding the Input

The encoder reads the entire input sequence (e.g., an English sentence) and produces a deep representation for each word.

Each encoder layer (the original Transformer stacks 6 of them; BERT-base stacks 12) contains:

Multi-head self-attention
Feed-forward neural network
Residual connections + LayerNorm around each sub-layer

Output of encoder: A set of context-rich vectors, one per input token.
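To make the layer structure concrete, here is a minimal NumPy sketch of one encoder layer. It rests on several simplifying assumptions: a single attention head instead of several, no output projection after attention, no learned scale and shift in LayerNorm, and random weights from a hypothetical `init_params` helper. Treat it as an illustration of the wiring, not a faithful implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance
    # (learned scale and shift omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single head for brevity; a real layer runs several heads in parallel.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def feed_forward(x, w1, b1, w2, b2):
    # Position-wise two-layer MLP with ReLU, applied to every token alike.
    return np.maximum(0.0, x @ w1 + b1) @ w2 + b2

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, then residual add and LayerNorm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sub-layer 2: feed-forward, then residual add and LayerNorm.
    return layer_norm(x + feed_forward(x, p["w1"], p["b1"], p["w2"], p["b2"]))

def init_params(d_model, d_ff, rng):
    # Hypothetical small random weights, for illustration only.
    def w(*shape):
        return rng.normal(0.0, 0.1, shape)
    return {
        "w_q": w(d_model, d_model), "w_k": w(d_model, d_model),
        "w_v": w(d_model, d_model),
        "w1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "w2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3
params = init_params(d_model, d_ff, rng)
tokens = rng.normal(size=(seq_len, d_model))  # stand-in for embedded "I love you"
encoded = encoder_layer(tokens, params)
print(encoded.shape)  # (3, 8): one vector per input token
```

Because the output has the same shape as the input, stacking layers is just a matter of calling `encoder_layer` repeatedly with a fresh set of parameters each time.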

Decoder: Generating the Output

The decoder generates the output sequence one token at a time (e.g., French translation).

Decoder layers have three sub-layers:

Masked multi-head self-attention (can’t look at future output tokens)
Cross-attention (encoder-decoder attention) — queries from decoder, keys/values from encoder
Feed-forward network

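Masked attention prevents cheating. During training the whole target sentence is fed to the decoder at once, so the mask hides later positions: when predicting word 3, the decoder can attend to words 1 and 2 but not to words 3, 4, 5, or 6.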
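One common way to realize the mask is a lower-triangular matrix applied to the attention scores before the softmax. The sketch below assumes uniform raw scores (all zeros) purely so the effect of the mask is easy to read; `causal_mask` and `masked_softmax` are illustrative names.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may see positions 0..i.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions are set to -inf, so they get exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # hypothetical uniform raw scores
weights = masked_softmax(scores, causal_mask(4))
print(weights)
```

Each row still sums to 1, but every entry above the diagonal is zero: the first position attends only to itself, the second splits its weight over the first two positions, and so on.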

How They Work Together: Translation Example

Task: Translate English “I love you” to French “Je t’aime.”

Step 1 — Encoding:

Input [“I”, “love”, “you”] passes through encoder
Encoder produces vectors enriched with context

Step 2 — Decoding (autoregressive):

Start with <START> token
Decoder attends to encoder’s output (cross-attention) to understand input
Generate first word: “Je”
Feed “Je” back into decoder
Generate second word: “t’”
Feed “Je t’” back
Generate third word: “aime”
Feed “Je t’aime” back; the decoder emits <END> and generation stops
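The loop above can be sketched in a few lines. Here `toy_step` is a hypothetical hard-coded stand-in for the real decoder (it knows exactly one translation), and always taking the single next token it returns corresponds to greedy decoding, which is only one of several possible decoding strategies.

```python
from typing import Callable, List

START, END = "<START>", "<END>"

def greedy_decode(step: Callable[[List[str], List[str]], str],
                  source: List[str], max_len: int = 10) -> List[str]:
    # `step` stands in for the full decoder: given the source tokens and
    # the output generated so far, it returns the next token.
    output = [START]
    for _ in range(max_len):
        token = step(source, output)
        if token == END:
            break
        output.append(token)  # feed the new token back in on the next step
    return output[1:]  # drop <START>

def toy_step(source: List[str], output: List[str]) -> str:
    # Hypothetical "model" that ignores the source and replays one answer.
    target = ["Je", "t'", "aime", END]
    return target[len(output) - 1]

print(greedy_decode(toy_step, ["I", "love", "you"]))
# ['Je', "t'", 'aime']
```

The `max_len` cap is a practical safeguard: a real model is not guaranteed to ever produce `<END>`.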

Each generation step uses:

Self-attention: On previously generated words
Cross-attention: On encoder’s representation of input

This is why Transformers excel at sequence-to-sequence tasks.
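Cross-attention differs from self-attention only in where its inputs come from. A minimal single-head sketch, assuming random stand-in vectors for the encoder output and decoder states and random projection weights:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q   # queries come from the decoder
    k = encoder_output @ w_k   # keys come from the encoder
    v = encoder_output @ w_v   # values come from the encoder
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 8
encoder_output = rng.normal(size=(3, d_model))  # stand-in for "I", "love", "you"
decoder_states = rng.normal(size=(2, d_model))  # stand-in for "<START>", "Je"
w_q, w_k, w_v = (rng.normal(0.0, 0.1, (d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(context.shape, weights.shape)  # (2, 8) (2, 3)
```

The weight matrix has one row per output position and one column per input token, so each row shows how strongly a given output position draws on each source word.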

The Silent Crisis: Self-Attention Has No Order

Here’s a shocking fact: Self-attention treats a sentence as a bag of words. Consider these two sentences:

“Dog bites man”
“Man bites dog”

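Self-attention sees the same set of words: {dog, bites, man}. Without position information, each word gets exactly the same output vector in both sentences, because the mechanism is permutation-equivariant: shuffling the input tokens only shuffles the output vectors in the same way.

But the meaning is completely different! We need to encode position.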
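This is easy to check numerically. The sketch below uses random stand-in embeddings and weights with a single attention head and no positional information; reordering the input only reorders the output, so “dog” ends up with the same vector whether it is the biter or the bitten.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

rng = np.random.default_rng(0)
d = 4
# Hypothetical word embeddings; no position information is added.
emb = {w: rng.normal(size=d) for w in ["dog", "bites", "man"]}
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

a = np.stack([emb[w] for w in ["dog", "bites", "man"]])
b = np.stack([emb[w] for w in ["man", "bites", "dog"]])
out_a = self_attention(a, w_q, w_k, w_v)
out_b = self_attention(b, w_q, w_k, w_v)

# "dog" is first in sentence a and last in sentence b.
print(np.allclose(out_a[0], out_b[2]))   # True
# The whole output is just the reversed output of the other sentence.
print(np.allclose(out_a, out_b[::-1]))   # True
```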

Positional Encoding: Adding Back the Order

Transformers inject position information by adding a position vector to each token embedding before the first attention layer.

The Simple Approach (Too Limiting)

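Naively adding the raw index [0, 1, 2, 3…] to each embedding fails because:

Large positions dominate: the values are unbounded, so a position like 500 swamps embedding values that sit near unit scale
No generalization to longer sentences: positions beyond the training length produce values the model has never seen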
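A quick numeric illustration of the first problem, assuming a hypothetical embedding whose values sit near unit scale and a naive scheme that adds the raw index to every dimension. The helper `token_share` is an illustrative measure, not a standard metric: it reports how large the original embedding is relative to the combined vector.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8
token = rng.normal(size=d_model)  # hypothetical embedding near unit scale

def token_share(token, pos):
    # Naive scheme: add the raw position index to every dimension.
    combined = token + pos
    # Size of the original embedding relative to the combined vector.
    return float(np.linalg.norm(token) / np.linalg.norm(combined))

for pos in (10, 500):
    print(pos, round(token_share(token, pos), 4))
```

The share shrinks as the position grows: by position 500 the combined vector is almost entirely position and the word's identity is nearly lost.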

The Transformer Solution: Sinusoidal Encoding

The original paper used sine and cosine functions of different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

pos = position of the word (0, 1, 2…)
i = index of the sine/cosine pair (0 ≤ i < d_model/2), so 2i and 2i+1 are the even and odd dimensions
d_model = embedding size
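The two formulas translate almost directly into NumPy. This sketch assumes `d_model` is even so the dimensions split cleanly into sine/cosine pairs; the function name and the sizes used are arbitrary choices for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes d_model is even, so dimensions split into sine/cosine pairs.
    positions = np.arange(max_len)[:, None]        # shape (max_len, 1)
    pair_index = np.arange(d_model // 2)[None, :]  # i = 0 .. d_model/2 - 1
    angles = positions / np.power(10000.0, 2 * pair_index / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)  # odd dimensions 2i+1
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0 alternates sin(0) = 0 and cos(0) = 1
```

The resulting matrix is added to the token embeddings row by row, so row `pos` is the position vector for the token at that position.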

Why Sinusoidal?

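Unique pattern for each position
Bounded values between -1 and 1 (no domination)
Relative positioning: for any fixed offset k, PE(pos + k) is a linear function of PE(pos), so the model can learn to attend based on distance
No learned parameters: the encoding can be computed for positions beyond any seen during training, though how well a model actually extrapolates is not guaranteed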
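The relative-positioning property follows from the angle-addition identities: shifting a position by k rotates each sine/cosine pair by an angle that depends only on k. The sketch below checks this numerically; `encode` and `shift_matrix` are illustrative helpers and `d_model` is assumed even.

```python
import numpy as np

def frequencies(d_model):
    # One frequency per sine/cosine pair, matching the formulas above.
    return 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def encode(pos, d_model):
    # Sinusoidal encoding of a single position.
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * frequencies(d_model))
    pe[1::2] = np.cos(pos * frequencies(d_model))
    return pe

def shift_matrix(k, d_model):
    # Block-diagonal matrix of 2x2 rotations; depends only on the offset k.
    m = np.zeros((d_model, d_model))
    for i, w in enumerate(frequencies(d_model)):
        c, s = np.cos(k * w), np.sin(k * w)
        m[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, s], [-s, c]]
    return m

d_model, k = 16, 3
m = shift_matrix(k, d_model)
for pos in (0, 7, 42):
    # The same matrix maps every position to the one k steps later.
    assert np.allclose(m @ encode(pos, d_model), encode(pos + k, d_model))
print("PE(pos + k) = M_k @ PE(pos) holds for all tested positions")
```

Because the matrix is the same for every starting position, a linear layer can in principle express “look k tokens back” without knowing the absolute position.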

Learned vs. Fixed Positional Encoding

Later models have used fixed, learned, and rotary schemes:

Model	Approach
Original Transformer	Fixed sinusoidal
BERT	Learned positional embeddings
GPT-3	Learned positional embeddings
Llama	Rotary positional embeddings (RoPE)

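Learned encodings are more flexible, but the embedding table has a fixed size, so the model has no representation for positions beyond the maximum length chosen at training time. Sinusoidal encodings can be computed for any position, which in principle allows longer inputs, although good extrapolation is not guaranteed in practice.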
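The difference shows up as soon as a position exceeds the table size. In this sketch the “learned” table is just random numbers standing in for trained weights, and the tiny `max_len` is chosen only to make the limit obvious.

```python
import numpy as np

max_len, d_model = 8, 4
rng = np.random.default_rng(0)
# Hypothetical stand-in for a trained embedding table: one row per position.
learned_table = rng.normal(size=(max_len, d_model))

def learned_pe(pos):
    # Plain table lookup; there is no row for pos >= max_len.
    return learned_table[pos]

def sinusoidal_pe(pos):
    # Computed from the formula, so any non-negative position works.
    freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)
    pe = np.empty(d_model)
    pe[0::2] = np.sin(pos * freqs)
    pe[1::2] = np.cos(pos * freqs)
    return pe

print(sinusoidal_pe(100).shape)  # (4,)
try:
    learned_pe(100)
except IndexError as err:
    print("learned table has no entry for position 100:", err)
```

Being able to compute a vector for position 100 does not mean a model trained on shorter inputs will use it well; it only means the encoding itself is defined there.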

Visualizing the Complete Architecture

Input Sequence
     ↓
[Positional Encoding + Embedding]
     ↓
Encoder Stack (N×)
  ├─ Multi-Head Self-Attention
  ├─ Add & LayerNorm
  ├─ Feed-Forward Network
  └─ Add & LayerNorm
     ↓
Encoder Output (Memory)
     ↓
Decoder Stack (N×)
  ├─ Masked Multi-Head Self-Attention
  ├─ Add & LayerNorm
  ├─ Cross-Attention (Q from decoder, K/V from encoder)
  ├─ Add & LayerNorm
  ├─ Feed-Forward Network
  └─ Add & LayerNorm
     ↓
Linear + Softmax
     ↓
Output Sequence

Why This Architecture Won

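Parallelization: Unlike RNNs, all positions are processed simultaneously during training (generation at inference time is still one token at a time)
Long-range dependencies: Any two positions are connected by a single attention step, so information does not have to survive many recurrent steps
Scalability: Adding layers, heads, and parameters has tended to improve performance when data and compute grow with them
Transfer learning: Pretrained encoders (BERT) or decoders (GPT) fine-tune well on downstream tasks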

Summary

Component	Role
Encoder	Understands input sequence
Decoder	Generates output step-by-step
Cross-attention	Lets the decoder query the encoder’s memory
Positional encoding	Adds word order information
Masked attention	Prevents peeking at future tokens