Auto-regressive vs. Masked Language Models: Which One Actually Thinks Like You?

Pre-training, scaling laws, emergent abilities, and a hands-on GPT-2 mini-project, explained without the PhD.

If you have heard of ChatGPT or BERT, you have already met the two most important families of language models: auto-regressive (AR) models and masked language models (MLM). They learn language in completely different ways, and that difference shapes what they can and cannot do.

In this post, I will explain everything in plain English: pre-training, scaling laws, and emergent abilities. Then we will actually build a mini-project: generate text with GPT-2 using the Hugging Face transformers library. No API keys are required, and you can run this on your own computer.

Part 1: Auto-regressive vs. Masked – The Core Difference

Auto-regressive (AR) Models – The Writers

Examples: GPT-2, GPT-3, GPT-4, Llama, Gemini (chat versions)

How they learn: They read text from left to right, one word at a time, and try to predict the next word. (Strictly speaking, they work on tokens, which are words or pieces of words.) After each guess, they move forward and repeat. They never look at future words, only past ones. This is called causal language modeling.

Training task: “Given all previous words, what is the next word?”

Human analogy: You are writing an email. You type word by word. You do not jump ahead and then edit the middle. You simply continue.

What they are good at: Generating stories, code, poems, emails – any task that requires producing new text.
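
To make the training task concrete, here is a minimal sketch of how one sentence turns into "predict the next word" examples. It splits on spaces for simplicity, which is an assumption for illustration; real models use sub-word tokens, and the function name next_word_examples is made up for this post.

```python
def next_word_examples(sentence):
    """Turn one sentence into (context, next word) training pairs."""
    words = sentence.split()
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]   # everything seen so far (the past)
        target = words[i]     # the word the model must predict
        pairs.append((context, target))
    return pairs

for context, target in next_word_examples("The cat sat on the mat"):
    print(" ".join(context), "->", target)
```

Notice that the context never contains anything to the right of the target. That is the "causal" part.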

Masked Language Models (MLM) – The Understanders

Examples: BERT, RoBERTa, DistilBERT, ALBERT

How they learn: Take a sentence and randomly hide (mask) about 15 percent of the words. The model must guess those missing words using both the left and the right context. It sees the whole sentence except the blanks.

Training task: “What word goes into this [MASK]?” This is called masked language modeling.

Human analogy: Solving a crossword clue. Given “The [MASK] sat on the mat”, you need the whole sentence to guess that the missing word is probably “cat”.

What they are good at: Understanding tasks – sentiment analysis, spam detection, question answering, and named entity recognition.
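
The masking step can be sketched in a few lines of plain Python. This is a simplified illustration, not BERT's actual pre-processing: it works on whole words, always hides at least one word, and the function name mask_words and the seed parameter are assumptions made for this example.

```python
import random

def mask_words(sentence, mask_rate=0.15, seed=0):
    """Hide roughly mask_rate of the words and remember the answers."""
    rng = random.Random(seed)          # fixed seed so the example is repeatable
    words = sentence.split()
    n_mask = max(1, round(len(words) * mask_rate))
    positions = sorted(rng.sample(range(len(words)), n_mask))
    masked = list(words)
    answers = {}
    for pos in positions:
        answers[pos] = words[pos]      # what the model must guess
        masked[pos] = "[MASK]"         # what the model actually sees
    return " ".join(masked), answers

masked_sentence, answers = mask_words("The cat sat on the mat")
print(masked_sentence)
print(answers)
```

The model sees the full masked sentence, including the words to the right of each blank, and is scored only on the hidden positions.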

Quick Comparison Table

Feature	Auto-regressive (GPT)	Masked (BERT)
Direction	Left to right only	Bidirectional (left and right)
Training task	Predict next word	Fill in the [MASK]
Best at	Generation	Understanding
Can it chat?	Yes (ChatGPT)	Not really
Example use	Write a blog post	Detect negative reviews
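
The "Direction" row can be pictured as a grid of who is allowed to look at whom. The sketch below is only an illustration (the function names are made up for this post): a 1 means the word in that row may use the word in that column, and a 0 means it may not.

```python
def causal_mask(n):
    """Row i marks what word i may look at: itself and earlier words only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

def bidirectional_mask(n):
    """Every word may look at every position in the sentence."""
    return [[1] * n for _ in range(n)]

words = ["The", "cat", "sat"]
for name, mask in [("auto-regressive", causal_mask(3)),
                   ("masked LM", bidirectional_mask(3))]:
    print(name)
    for word, row in zip(words, mask):
        print(f"  {word:>4} sees {row}")
```

The triangle of zeros in the auto-regressive grid is what stops a GPT-style model from peeking at future words.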

Part 2: Pre-training – The Secret Sauce

Both AR and MLM models start with no knowledge. They read massive amounts of raw text – Wikipedia, books, Reddit, news articles – without any human labels. This is called self-supervised learning.

AR pre-training: Next-word prediction on billions to trillions of words. The model picks up grammar, facts, reasoning patterns, and also the biases present in its training text.
MLM pre-training: Fill-in-the-blank on the same kind of data. The model learns deep bidirectional context and word relationships.

After pre-training (which can cost millions of dollars in compute for the largest models), you can fine-tune the model on a small labeled dataset for your specific task, such as sentiment analysis or translation. That is why pre-trained models are so powerful: they have already learned the structure of language.

Part 3: Scaling Laws – Bigger Usually Means Better

Researchers at OpenAI, Google, and DeepMind discovered something surprising. If you plot a model's loss (how wrong its predictions are) against compute, data, or number of parameters, you get a smooth and predictable power law. Each doubling of compute reduces the loss by roughly the same factor, so you can estimate how good a bigger model will be before you train it.
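
Here is a small sketch of what "power law" means in practice. The constants a and b below are invented for illustration and are not taken from any published scaling-law fit; the point is only the shape of the curve.

```python
def predicted_loss(compute, a=10.0, b=0.05):
    """Toy power law: loss = a * compute^(-b). Constants are hypothetical."""
    return a * compute ** (-b)

# Doubling compute multiplies the loss by the same factor at any scale.
ratio_small = predicted_loss(2e3) / predicted_loss(1e3)
ratio_large = predicted_loss(2e9) / predicted_loss(1e9)

print(f"loss ratio after doubling (small model): {ratio_small:.4f}")
print(f"loss ratio after doubling (large model): {ratio_large:.4f}")
```

Both ratios come out the same, which is exactly what makes the improvement predictable. It also shows the catch: each doubling buys the same modest percentage gain, so large improvements need very large increases in compute.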

What scales?

Parameters: The number of knobs the model can tune. More knobs mean more memory capacity.
Data: More text to learn from.
Compute: More GPU hours for training.

Example: The largest GPT-2 had 1.5 billion parameters. GPT-3 had 175 billion. GPT-4 is rumored to have over 1 trillion, although OpenAI has not confirmed a number. Each jump cost far more to train but delivered qualitatively better results.

Important nuance: Scaling laws apply to both AR and MLM models, but AR models benefit more from scaling when it comes to generation tasks. MLM scaling improves understanding accuracy but does not suddenly make the model a good writer.

Part 4: Emergent Abilities – The Magic of Large AR Models

Here is where things get interesting. When you scale an auto-regressive model past a certain size – roughly 10 billion parameters or more – new abilities emerge. These abilities were not explicitly trained for, but they appear on their own.

Examples of emergent abilities:

Few-shot learning: “Here are two examples of sentiment analysis. Now you do it on a third example.” This works without any fine-tuning.
Basic arithmetic: Adding, subtracting, and even multiplication.
Code generation: Writing Python functions from natural language descriptions.
Instruction following: “Translate this to French, then summarize it.”
Chain-of-thought reasoning: When asked to “think step by step”, the model writes out intermediate steps and often reaches a better answer as a result.
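
Few-shot learning is easiest to understand by looking at the prompt itself. The sketch below only builds the text you would send to a large model; it does not call any model. The "Review:" and "Sentiment:" labels and the function name are assumptions chosen for this example, not a required format.

```python
def build_few_shot_prompt(examples, query):
    """Put a few solved examples in front of the new, unsolved one."""
    lines = []
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")               # blank line between examples
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")         # the model continues from here
    return "\n".join(lines)

examples = [
    ("I loved this movie.", "positive"),
    ("The food was cold and bland.", "negative"),
]
prompt = build_few_shot_prompt(examples, "Best purchase I have made all year.")
print(prompt)
```

Because an auto-regressive model simply continues the text, the most natural continuation after the final "Sentiment:" is a label in the same style as the examples.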

Why does this happen? Researchers do not know exactly. It is like a small ant colony that simply digs tunnels, but a huge ant colony suddenly builds bridges. The scale creates new dynamics. Masked language models like BERT rarely show these emergent abilities. They stay good at understanding but never magically learn to write stories or do math.

Key takeaway: Emergence is mostly an auto-regressive phenomenon. That is why GPT-3 and GPT-4 feel “smart” in a general way, while BERT feels like a specialized tool.

Part 5: Mini-Project – Generate Text with GPT-2

Let us build something real. We will use a small, free, open-source auto-regressive model: GPT-2. This is the little cousin of ChatGPT. You can run this code on your own computer or on Google Colab. No API key is required.

Step 1: Install the required libraries

Open your terminal (or a Google Colab cell) and run the following command:

pip install transformers torch

What this does: The transformers library from Hugging Face gives us access to pre-trained models like GPT-2. The torch library is PyTorch, which runs the neural network computations.

Step 2: Understand the code, line by line

Below is the complete Python script. I will explain each part so you understand exactly what is happening.

# Import the necessary classes from the transformers library
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and its tokenizer
# "gpt2" refers to the smallest version (124 million parameters)
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)

# GPT-2 does not have a padding token by default.
# We reuse the end-of-sequence token (<|endoftext|>) as the padding token.
tokenizer.pad_token = tokenizer.eos_token

def generate_text(prompt, max_length=150, temperature=0.7, top_k=50):
    """
    Generate text using GPT-2.

    Parameters:
    - prompt: the input text that starts the generation
    - max_length: total length of prompt + generated text (in tokens)
    - temperature: controls randomness; lower = more deterministic, higher = more random
    - top_k: only sample from the top k most likely words at each step
    """
    
    # Convert the prompt text into token IDs that the model understands
    # return_tensors="pt" means return PyTorch tensors
    inputs = tokenizer(prompt, return_tensors="pt")
    
    # Generate new tokens
    outputs = model.generate(
        inputs.input_ids,               # The tokenized prompt
        attention_mask=inputs.attention_mask,  # Tells the model which tokens are real vs padding
        max_length=max_length,          # Stop when we reach this length
        temperature=temperature,        # Controls randomness
        top_k=top_k,                    # Only consider top k most likely next words
        do_sample=True,                 # Sample randomly instead of always picking the most likely
        pad_token_id=tokenizer.eos_token_id  # Use the end-of-sequence token for padding
    )
    
    # Convert the generated token IDs back into human-readable text
    generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return generated_text

# Example 1: Story starter
prompt1 = "Once upon a time, in a small village"
print("Prompt:", prompt1)
print("Generation:")
print(generate_text(prompt1, max_length=100, temperature=0.8))
print("\n" + "="*50 + "\n")

# Example 2: Opinion or explanation
prompt2 = "The best thing about artificial intelligence is that"
print("Prompt:", prompt2)
print("Generation:")
print(generate_text(prompt2, max_length=120, temperature=0.7))
print("\n" + "="*50 + "\n")

# Example 3: Code generation style
prompt3 = "def greet_user(name):"
print("Prompt:", prompt3)
print("Generation:")
print(generate_text(prompt3, max_length=80, temperature=0.6))

Step 3: Detailed explanation of key functions and parameters

Tokenizer: The tokenizer converts text into numbers (token IDs) that the model can process. It also converts numbers back into text. Different models require their own specific tokenizer.
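
To show the text-to-numbers round trip without downloading anything, here is a toy tokenizer. It is deliberately much simpler than the real thing: it assigns one ID per whole lowercase word, whereas GPT-2's tokenizer uses byte-pair encoding and splits text into sub-word pieces. All function names here are made up for this post.

```python
def build_vocab(texts):
    """Assign each distinct word an ID in order of first appearance."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def encode(text, vocab):
    """Text -> list of token IDs."""
    return [vocab[word] for word in text.lower().split()]

def decode(ids, vocab):
    """List of token IDs -> text."""
    id_to_word = {i: w for w, i in vocab.items()}
    return " ".join(id_to_word[i] for i in ids)

vocab = build_vocab(["the cat sat on the mat"])
ids = encode("the mat sat", vocab)
print(ids)                  # [0, 4, 2]
print(decode(ids, vocab))   # the mat sat
```

This also shows why each model needs its own tokenizer: the IDs only mean something relative to the vocabulary they were built from.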

GPT2LMHeadModel: This is the actual neural network. “LMHead” means it has a language modeling head on top, which outputs a score for every token in the vocabulary. Those scores are turned into probabilities for the next word.

Temperature: This controls how creative or random the output is.
– Low temperature (0.2 to 0.5): The model picks the most likely words. Output is safe but boring.
– High temperature (0.8 to 1.2): The model takes more risks. Output is more creative but can become nonsensical.
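– Temperature of 0 is not allowed when sampling, because the model's scores are divided by the temperature and you cannot divide by zero. If you want fully deterministic output (always the top word), set do_sample=False instead.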
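
Under the hood, temperature divides the model's raw scores before they are turned into probabilities. The sketch below reproduces that idea in plain Python with made-up scores for three candidate words; it is an illustration of the math, not the code that transformers runs internally.

```python
import math

def softmax_with_temperature(scores, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    biggest = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.0]                        # hypothetical scores for 3 words
p_low = softmax_with_temperature(scores, temperature=0.5)
p_high = softmax_with_temperature(scores, temperature=1.5)
print("low temperature: ", [round(p, 3) for p in p_low])
print("high temperature:", [round(p, 3) for p in p_high])
```

At low temperature the top word takes most of the probability; at high temperature the probabilities move closer together, so less likely words get picked more often.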

Top-k sampling: Instead of considering all possible next words, the model only looks at the top k most likely words. For example, top_k=50 means the model will only sample from the 50 best candidates. This prevents very unlikely words from appearing.
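
Top-k filtering is simple enough to write by hand. This sketch uses a tiny made-up probability table and k=2 so the effect is easy to see; the function name top_k_filter is an assumption for this example.

```python
def top_k_filter(probs, k):
    """Keep the k most likely words and rescale them to sum to 1."""
    top = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {word: p / total for word, p in top}

next_word_probs = {"cat": 0.5, "dog": 0.3, "car": 0.15, "xylophone": 0.05}
filtered = top_k_filter(next_word_probs, k=2)
print(filtered)
```

After filtering, "car" and "xylophone" can no longer be sampled at all, and the remaining probabilities are rescaled so they still add up to 1.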

do_sample=True: This enables random sampling. If set to False, the model always picks the single most likely next word (greedy decoding), which tends to produce repetitive and dull text.
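
The difference between the two modes comes down to one decision per step. The sketch below mimics that decision on a made-up probability table; pick_next is a hypothetical helper written for this post, not a transformers function.

```python
import random

def pick_next(probs, do_sample, rng=None):
    """Choose the next word either greedily or by weighted random sampling."""
    if not do_sample:
        return max(probs, key=probs.get)        # always the most likely word
    rng = rng or random.Random()
    words = list(probs)
    weights = [probs[w] for w in words]
    return rng.choices(words, weights=weights, k=1)[0]

probs = {"village": 0.6, "town": 0.3, "castle": 0.1}
print("greedy: ", [pick_next(probs, do_sample=False) for _ in range(5)])
rng = random.Random(42)                          # seeded so runs are repeatable
print("sampled:", [pick_next(probs, do_sample=True, rng=rng) for _ in range(5)])
```

Greedy picking gives the same word every time. Sampling usually picks "village" but sometimes picks the others, which is where the variety in generated text comes from.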

max_length: This is the total length of the prompt plus the generated text, measured in tokens (not words). A token is roughly 3/4 of a word on average in English, so 100 tokens is about 75 words.
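
Because the limit counts the prompt too, a long prompt leaves less room for new text. The arithmetic is sketched below with two hypothetical helper functions; the 0.75 words-per-token figure is only the rough average mentioned above.

```python
def new_token_budget(prompt_token_count, max_length):
    """How many new tokens can be generated before hitting max_length."""
    return max(0, max_length - prompt_token_count)

def rough_word_estimate(token_count):
    """Very rough English estimate: about 3/4 of a word per token."""
    return round(token_count * 0.75)

budget = new_token_budget(prompt_token_count=8, max_length=100)
print(budget, "new tokens, roughly", rough_word_estimate(budget), "words")
```

If you would rather set the amount of new text directly, generate() also accepts max_new_tokens, which counts only the generated tokens and ignores the prompt length.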

Step 4: Run the code

Save the script as generate.py and run it with:

python generate.py

If you are using Google Colab, simply paste the code into a cell and run it. The first time you run it, the model will download (about 500 MB). After that, it runs locally.

Step 5: Experiment with your own prompts

Try changing the prompts to anything you like. For example:

“The future of renewable energy is”
“Explain how a computer works in simple terms”
“Write a haiku about machine learning”

Also experiment with the parameters. Set temperature to 0.3 and see how the output becomes repetitive. Set it to 1.1 and see how it becomes more surprising but sometimes chaotic.

What you just built

You have successfully used a real auto-regressive language model. GPT-2 is tiny compared to GPT-4, but the same principles apply: left-to-right generation, pre-training on massive data, and the ability to complete text in surprisingly coherent ways. At 124 million parameters it is far below the scale where the emergent abilities from Part 4 appear, but you can already see early hints of them, such as the model continuing code or producing sensible-sounding answers.

Challenge for you: Change the prompt to something like “Three reasons why masked language models are better for search engines than auto-regressive models”. See how GPT-2 handles this. It will not be perfect, but you will see the auto-regressive approach in action.

Final Summary – Your Cheat Sheet

Auto-regressive (GPT): Left to right, predicts next word. Great at generating text. Shows emergent abilities when scaled.
Masked (BERT): Bidirectional, fills in masked words. Great at understanding text. Rarely shows emergent abilities.
Pre-training: Self-supervised learning on raw text. No human labels required.
Scaling laws: More parameters, more data, and more compute lead to predictably better performance.
Emergent abilities: Unexpected skills like math, reasoning, and code generation that appear only in large auto-regressive models.

Your next step: Try replacing “gpt2” with “distilgpt2” (a smaller, faster version) or even “gpt2-medium” if you have enough memory. You can also explore masked models by loading “bert-base-uncased” and using the fill-mask pipeline. The Hugging Face ecosystem makes all of this easy.
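
For the masked-model experiment, a minimal sketch could look like the following. It uses the transformers pipeline function with the "fill-mask" task; the two function names are made up for this post, and the model is only downloaded when you actually call fill_mask_demo.

```python
def format_predictions(predictions):
    """Turn fill-mask results into short 'word (score)' strings."""
    return [f"{p['token_str']} ({p['score']:.3f})" for p in predictions]

def fill_mask_demo(sentence="The [MASK] sat on the mat.", top_k=3):
    """Ask bert-base-uncased for its best guesses for the [MASK] position."""
    # Imported inside the function so nothing is loaded until you call it.
    from transformers import pipeline
    unmasker = pipeline("fill-mask", model="bert-base-uncased")
    return format_predictions(unmasker(sentence, top_k=top_k))
```

Call fill_mask_demo() and print the result to see BERT's top guesses for the blank. Unlike the GPT-2 script, this model uses the words on both sides of the blank, and it returns candidate words rather than a continuation of the text.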

That is it. You now understand the core of modern large language models and you can actually build with them. Use this knowledge for your own projects, blog posts, or conversations with fellow developers.