Encoder-Decoder Architecture and Positional Encoding: The Complete Transformer Blueprint


Introduction

Self-attention and multi-head attention are the engines. But an engine alone doesn’t make a car. The Transformer architecture has two other critical components:

  1. Encoder-decoder structure: enables sequence-to-sequence tasks (translation, summarization)
  2. Positional encoding: needed because self-attention has no built-in sense of order

In this post, we’ll understand how these pieces fit together to create the most influential AI architecture of the decade.


The Encoder-Decoder Paradigm

The Transformer is not a single block. It’s a pipeline of two main components:

Encoder: Understanding the Input

The encoder reads the entire input sequence (e.g., an English sentence) and produces a context-aware representation for each token.

Each encoder layer (the original paper stacks 6; BERT-base stacks 12) contains:

  1. Multi-head self-attention
  2. Feed-forward neural network
  3. Residual connections + LayerNorm around each sub-layer

Output of encoder: A set of context-rich vectors, one per input token.
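To make the layer structure concrete, here is a minimal NumPy sketch of one encoder layer. It is an illustration under simplifying assumptions, not the reference implementation: it uses a single attention head instead of several, a LayerNorm without learned scale and shift, random untrained weights, and parameter names (w_q, w_1, and so on) chosen only for this example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified LayerNorm: normalizes each token vector, no learned gain/bias.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, w_q, w_k, w_v):
    # Single-head scaled dot-product attention (assumption: one head for brevity).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores) @ v

def encoder_layer(x, p):
    # Sub-layer 1: self-attention, then residual connection + LayerNorm.
    x = layer_norm(x + self_attention(x, p["w_q"], p["w_k"], p["w_v"]))
    # Sub-layer 2: position-wise feed-forward network (ReLU), then residual + LayerNorm.
    hidden = np.maximum(0.0, x @ p["w_1"] + p["b_1"])
    return layer_norm(x + hidden @ p["w_2"] + p["b_2"])

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 3  # toy sizes, chosen for the example
params = {
    "w_q": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_k": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_v": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_1": rng.normal(scale=0.1, size=(d_model, d_ff)),
    "b_1": np.zeros(d_ff),
    "w_2": rng.normal(scale=0.1, size=(d_ff, d_model)),
    "b_2": np.zeros(d_model),
}

tokens = rng.normal(size=(seq_len, d_model))  # stand-in for 3 embedded input tokens
out = encoder_layer(tokens, params)
print(out.shape)  # (3, 8)
```

The output has the same shape as the input, which is what allows identical layers to be stacked N times.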

Decoder: Generating the Output

The decoder generates the output sequence one token at a time (e.g., French translation).

Decoder layers have three sub-layers, each wrapped in a residual connection + LayerNorm just like in the encoder:

  1. Masked multi-head self-attention (can’t look at future output tokens)
  2. Cross-attention (encoder-decoder attention) — queries from decoder, keys/values from encoder
  3. Feed-forward network

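The masked attention prevents cheating: when generating word 3, the decoder cannot see words 4, 5, 6. This matters most during training, where the whole target sentence is fed to the decoder at once and only the mask keeps each position from reading the tokens it is supposed to predict.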
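As a rough illustration of the masking idea, the sketch below builds a lower-triangular mask and applies it to a matrix of attention scores before the softmax. The function names causal_mask and masked_softmax are made up for this example, and the all-zero scores are placeholder data.

```python
import numpy as np

def causal_mask(n):
    # True where attention is allowed: position i may look at positions 0..i only.
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores, mask):
    # Blocked positions get -inf, so they receive exactly zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))  # placeholder scores: every allowed position is equally relevant
weights = masked_softmax(scores, causal_mask(4))
print(np.round(weights, 2))
# Row 0 attends only to position 0; row 3 spreads its weight over positions 0..3.
```

Everything above the diagonal comes out as zero, which is the "no peeking at future tokens" rule expressed as a matrix.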


How They Work Together: Translation Example

Task: Translate English “I love you” to French “Je t’aime.”

Step 1 — Encoding:

  • Input [“I”, “love”, “you”] passes through encoder
  • Encoder produces vectors enriched with context

Step 2 — Decoding (autoregressive):

  • Start with <START> token
  • Decoder attends to encoder’s output (cross-attention) to understand input
  • Generate first word: “Je”
  • Feed “Je” back into decoder
  • Generate second word: “t’”
  • Feed “Je t’” back
  • Generate “aime”
  • Feed “Je t’aime” back; the decoder emits <END> and generation stops
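The loop above can be sketched in a few lines of Python. This is only an illustration of the control flow: toy_decoder_step is a hypothetical stand-in for a trained decoder (a hard-coded lookup table rather than a neural network), and greedy_decode is a name chosen for this example.

```python
START, END = "<START>", "<END>"

# Hypothetical stand-in for a trained decoder: maps the tokens generated so far
# to the next token. A real decoder would compute this with attention layers.
TOY_NEXT = {
    (START,): "Je",
    (START, "Je"): "t'",
    (START, "Je", "t'"): "aime",
    (START, "Je", "t'", "aime"): END,
}

def toy_decoder_step(encoder_output, generated):
    # encoder_output is ignored here; a real decoder would cross-attend to it.
    return TOY_NEXT[tuple(generated)]

def greedy_decode(encoder_output, step_fn, max_len=10):
    generated = [START]
    while len(generated) < max_len:
        next_token = step_fn(encoder_output, generated)
        if next_token == END:
            break
        generated.append(next_token)  # feed the new token back in on the next step
    return generated[1:]  # drop the <START> marker

result = greedy_decode(encoder_output=None, step_fn=toy_decoder_step)
print(result)  # ['Je', "t'", 'aime']
```

The important part is the feedback: each call to the step function sees everything generated so far, which is what "autoregressive" means.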

Each generation step uses:

  • Self-attention: On previously generated words
  • Cross-attention: On encoder’s representation of input

This is why Transformers excel at sequence-to-sequence tasks.
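Cross-attention differs from self-attention only in where the inputs come from. The sketch below assumes a single head, random untrained weights, and toy sizes; the function name cross_attention and its parameters are illustrative, not taken from any particular library.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_states, encoder_output, w_q, w_k, w_v):
    q = decoder_states @ w_q   # queries come from the decoder
    k = encoder_output @ w_k   # keys come from the encoder
    v = encoder_output @ w_v   # values come from the encoder
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 4
encoder_output = rng.normal(size=(3, d_model))  # 3 source tokens: "I", "love", "you"
decoder_states = rng.normal(size=(2, d_model))  # 2 target tokens so far: <START>, "Je"
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

context, weights = cross_attention(decoder_states, encoder_output, w_q, w_k, w_v)
print(weights.shape)  # (2, 3): one row per target token, one column per source token
print(context.shape)  # (2, 4)
```

Each row of the weight matrix says how much one target position draws on each source token, so the source and target sequences are free to have different lengths.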


The Silent Crisis: Self-Attention Has No Order

Here’s a shocking fact: Self-attention treats a sentence as a bag of words. Consider these two sentences:

  • “Dog bites man”
  • “Man bites dog”

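Self-attention sees the same set of words: {dog, bites, man}. Without position information, each word ends up with exactly the same output vector in both sentences, because the mechanism is permutation-equivariant: shuffling the input tokens only shuffles the outputs in the same way.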

But meaning is completely different! We need to encode position.
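A quick numerical check makes the problem visible. The sketch below runs single-head self-attention, with random weights and made-up embeddings and no positional information, on both word orders and compares the vector each word receives.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(0)
d_model = 4
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Made-up embeddings for the three words (no positional encoding added).
dog, bites, man = rng.normal(size=(3, d_model))

out_a = self_attention(np.stack([dog, bites, man]), w_q, w_k, w_v)  # "Dog bites man"
out_b = self_attention(np.stack([man, bites, dog]), w_q, w_k, w_v)  # "Man bites dog"

# "dog" is at index 0 in the first sentence and index 2 in the second.
print(np.allclose(out_a[0], out_b[2]))  # True
```

The representation of "dog" is the same whether it is the one biting or the one being bitten, so nothing downstream can tell the two sentences apart.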


Positional Encoding: Adding Back the Order

Transformers inject information about a word’s position by adding a position-dependent vector to each token embedding before the first attention layer.

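The Simple Approach (Too Limiting)

Naively adding the raw index [0, 1, 2, 3…] to each embedding fails because:

  • Large position values swamp the embedding values, which are typically small
  • Positions beyond the training length produce values the model has never seen, so nothing generalizes to longer sentences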

The Transformer Solution: Sinusoidal Encoding

The original paper used sine and cosine functions of different frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where:

  • pos = position of the word (0, 1, 2…)
  • i = index of the sine/cosine pair (0 ≤ i < d_model/2), so 2i and 2i+1 are the actual embedding dimensions
  • d_model = embedding size
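The two formulas translate almost directly into NumPy. The sketch below is one straightforward way to build the table, assuming an even d_model; the function name is chosen for this example.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.arange(max_len)[:, None]    # shape (max_len, 1)
    i = np.arange(d_model // 2)[None, :]  # shape (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions: PE(pos, 2i)
    pe[:, 1::2] = np.cos(angles)  # odd dimensions:  PE(pos, 2i+1)
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16)
print(pe[0, :4])  # [0. 1. 0. 1.]  (sin(0) = 0, cos(0) = 1)
```

In use, the first seq_len rows of this table are added element-wise to the token embeddings, for example x = token_embeddings + pe[:seq_len], before the first layer.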

Why Sinusoidal?

  • Unique pattern for each position
  • Bounded values between -1 and 1 (no domination)
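  • Relative positioning: for any fixed offset, the encoding of the shifted position is a linear function of the original encoding, so the model can learn to attend based on distance (posₖ – posⱼ)
  • No learned parameters: the encoding can be computed for positions beyond any seen during training, although how well a model actually extrapolates varies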
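The relative-positioning property follows from the angle-addition identities: for each frequency, moving from pos to pos + k rotates the (sin, cos) pair by an angle that depends only on k. The sketch below checks this numerically; d_model = 8, k = 5, and the helper name pe_pairs are arbitrary choices for the example.

```python
import numpy as np

d_model, k = 8, 5
freqs = 1.0 / np.power(10000.0, 2 * np.arange(d_model // 2) / d_model)

def pe_pairs(pos):
    # Shape (d_model/2, 2): one (sin, cos) pair per frequency.
    return np.stack([np.sin(pos * freqs), np.cos(pos * freqs)], axis=-1)

# One 2x2 rotation matrix per frequency. It depends on the offset k, not on pos.
rot = np.stack([
    np.stack([np.cos(k * freqs), np.sin(k * freqs)], axis=-1),   # row producing sin
    np.stack([-np.sin(k * freqs), np.cos(k * freqs)], axis=-1),  # row producing cos
], axis=1)  # shape (d_model/2, 2, 2)

def shift(pairs):
    # Apply each frequency's rotation to its own (sin, cos) pair.
    return np.einsum("fij,fj->fi", rot, pairs)

all_match = all(np.allclose(shift(pe_pairs(p)), pe_pairs(p + k)) for p in (0, 3, 17))
print(all_match)  # True
```

The same matrices work for every starting position, which is why a linear layer can, in principle, pick up on "k tokens away" regardless of where in the sentence it happens.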

Learned vs. Fixed Positional Encoding

Later models have used fixed, learned, and rotary schemes:

Model                  Approach
Original Transformer   Fixed sinusoidal
BERT                   Learned positional embeddings
GPT-3                  Learned positional embeddings
Llama                  Rotary positional embeddings (RoPE)

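Learned encodings are more flexible, but they only cover positions up to the maximum length fixed at training time. Sinusoidal encodings can be computed for any position, which makes them the better starting point for extrapolation, though in practice quality still tends to degrade well beyond the training length.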


Visualizing the Complete Architecture

Input Sequence
     ↓
[Positional Encoding + Embedding]
     ↓
Encoder Stack (N×)
  ├─ Multi-Head Self-Attention
  ├─ Add & LayerNorm
  ├─ Feed-Forward Network
  └─ Add & LayerNorm
     ↓
Encoder Output (Memory)
     ↓
Decoder Stack (N×)  ← [Positional Encoding + Embedding] of the output tokens so far
  ├─ Masked Multi-Head Self-Attention
  ├─ Add & LayerNorm
  ├─ Cross-Attention (Q from decoder, K/V from encoder)
  ├─ Add & LayerNorm
  ├─ Feed-Forward Network
  └─ Add & LayerNorm
     ↓
Linear + Softmax
     ↓
Output Sequence

Why This Architecture Won

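  • Parallelization: Unlike RNNs, all positions are processed simultaneously during training (decoding at inference time is still one token at a time)
  • Long-range dependencies: Any two positions are connected by a single attention step, so distant context does not have to survive many recurrent updates
  • Scalability: More layers, heads, and parameters have tended to improve performance in a fairly predictable way
  • Transfer learning: Pretrained encoders (BERT) or decoders (GPT) fine-tune well on downstream tasks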

Summary

Component             Role
Encoder               Understands input sequence
Decoder               Generates output step-by-step
Cross-attention       The decoder queries encoder’s memory
Positional encoding   Adds word order information
Masked attention      Prevents peeking at future tokens

