Activation Functions and Loss Functions: The Engines of Neural Network Learning

Introduction

If the perceptron is the brick of artificial intelligence, then activation functions and loss functions are the mortar and the blueprint. A neural network without an activation function is merely a glorified linear regression model. A network without a loss function is a ship without a compass—capable of moving but utterly unable to navigate toward a meaningful destination.

In the previous exploration of the perceptron, we encountered two specific functions: the step activation function (which output 0 or 1 based on a threshold) and the perceptron loss (implicitly, the number of misclassifications). These worked for the simplest binary classification tasks, but they collapsed when faced with the complexity of real-world data, multi-class problems, or deep networks.

Modern deep learning, from convolutional neural networks (CNNs) to transformers and large language models (LLMs), relies on a rich palette of activation and loss functions. This article will demystify both families, explain their mathematical properties, compare their trade-offs, and provide practical guidance on when to use which.

Part I: Activation Functions

What Is an Activation Function?

An activation function is a mathematical operation applied to the output of a neuron (the weighted sum of inputs plus bias) before passing it to the next layer. In the language of neural networks: $\text{Output} = f\left(\sum_{i=1}^{n} w_{i} x_{i} + b\right)$

Where $f$ is the activation function.

The purpose of an activation function is twofold:

Introduce Non-Linearity: Without non-linear activation functions, stacking multiple layers would be mathematically equivalent to a single linear layer. Non-linearity allows the network to learn complex patterns, curves, and interactions between features.
Constrain or Transform Output: Some activation functions squash outputs into a specific range (e.g., [0, 1] or [-1, 1]), while others allow unbounded positive values.
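
To make the first point concrete, here is a minimal sketch in plain Python. The weight matrices `W1` and `W2` and the input `x` are made-up illustrative values, and biases are omitted for brevity. It suggests that two stacked linear layers give the same result as one merged layer, and that placing a ReLU between them breaks that equivalence.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    """Apply ReLU element-wise to a matrix."""
    return [[max(0.0, v) for v in row] for row in m]


# Illustrative values only (assumptions, not taken from a trained model)
W1 = [[1.0, -2.0], [0.5, 0.5]]   # first layer weights
W2 = [[2.0, 0.0], [-1.0, 3.0]]   # second layer weights
x = [[1.0, 2.0]]                 # one sample with two features

# Two linear layers applied in sequence
two_linear_layers = matmul(matmul(x, W1), W2)

# A single layer whose weights are the product W1 @ W2
one_merged_layer = matmul(x, matmul(W1, W2))

# The same two layers with a ReLU in between
with_relu = matmul(relu(matmul(x, W1)), W2)

print(two_linear_layers, one_merged_layer, with_relu)
```

Here `two_linear_layers` and `one_merged_layer` both evaluate to `[[5.0, -3.0]]`, while `with_relu` evaluates to `[[4.0, 0.0]]`, so only the version with a non-linearity cannot be replaced by a single linear layer.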

The Evolution: From Step to Smooth

The perceptron used the step function: $f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases}$

The step function is non-linear but has a fatal flaw for learning: its derivative is zero everywhere except at the threshold (where it is undefined). Gradient-based optimization needs a derivative that is non-zero over a useful range of inputs. The step function cannot tell you how much to adjust weights, only that an error occurred.
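
A small numerical check can illustrate the flaw. The sketch below estimates derivatives with a central difference; the helper name `numerical_derivative` and the sample points are assumptions chosen for illustration. Away from the threshold the step function reports a slope of zero, while a smooth function such as the sigmoid reports a usable, non-zero slope.

```python
import math


def step(z):
    """Perceptron step activation."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Logistic sigmoid, used here only as a smooth comparison."""
    return 1.0 / (1.0 + math.exp(-z))


def numerical_derivative(f, z, h=1e-5):
    """Central-difference estimate of f'(z)."""
    return (f(z + h) - f(z - h)) / (2 * h)


points = (-2.0, -0.5, 0.5, 2.0)
step_grads = [numerical_derivative(step, z) for z in points]
sigmoid_grads = [numerical_derivative(sigmoid, z) for z in points]

print("step:   ", step_grads)
print("sigmoid:", sigmoid_grads)
```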

Modern Activation Functions in Depth

1. Sigmoid (Logistic Function)

Formula: $\sigma(z) = \frac{1}{1 + e^{-z}}$

Range: (0, 1)

Derivative: $\sigma'(z) = \sigma(z) \cdot (1 - \sigma(z))$

When to Use:

Output layer of binary classification networks (interpreting as probability).
Hidden layers of shallow networks (historical use; now largely replaced).

Advantages:

Smooth, differentiable, and monotonic.
Outputs can be interpreted as probabilities.
Historically important and widely documented.

Disadvantages (Severe):

Vanishing Gradient Problem: For very positive or very negative inputs, the derivative approaches zero, and even at its peak ($z = 0$) it is only 0.25. In deep networks, gradients can shrink exponentially as they backpropagate through many such layers, preventing earlier layers from learning.
Not Zero-Centered: Outputs are always positive, which can cause inefficient gradient updates (zigzagging optimization).
Expensive Computation: Involves exponential operations.

Modern Verdict: Avoid in hidden layers. Use only for binary classification output layers.
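
The vanishing gradient argument can be checked with a few lines of arithmetic. This is a rough sketch that deliberately ignores the weight matrices (which also scale the gradient in a real network) and looks only at the factor contributed by the sigmoid derivative at each layer.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid: s * (1 - s)."""
    s = sigmoid(z)
    return s * (1.0 - s)


max_grad = sigmoid_grad(0.0)    # the largest value the derivative can take
saturated = sigmoid_grad(10.0)  # derivative for a strongly positive input

# Best-case factor after passing through `depth` sigmoid layers
bound = {depth: max_grad ** depth for depth in (1, 5, 10, 20)}

print("max derivative:", max_grad)
print("saturated derivative:", saturated)
for depth, value in bound.items():
    print(f"depth {depth:2d}: at most {value:.3e}")
```

Even in the best case, ten sigmoid layers scale the gradient by less than one millionth under this simplification.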

2. Tanh (Hyperbolic Tangent)

Formula: $\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1$

Range: (-1, 1)

Derivative: $\tanh'(z) = 1 - \tanh^{2}(z)$

When to Use:

Hidden layers of smaller networks (though largely replaced by ReLU).
Situations where zero-centered outputs are beneficial.

Advantages:

Zero-centered output (mean ~0), which improves gradient flow.
Steeper gradient than sigmoid near zero, allowing faster learning.
Still smooth and differentiable.

Disadvantages:

Still suffers from vanishing gradient for extreme values (though less severe than sigmoid).
Exponential computation remains expensive.

Modern Verdict: Outperforms sigmoid for hidden layers but is generally inferior to the ReLU family.
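
As a quick sanity check of the formulas in this section, the following sketch verifies numerically that tanh is a rescaled sigmoid and compares the two slopes at zero. The sample points are arbitrary choices for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def tanh_grad(z):
    """Derivative of tanh: 1 - tanh(z)^2."""
    return 1.0 - math.tanh(z) ** 2


def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)


# Largest gap between tanh(z) and 2*sigmoid(2z) - 1 over a few sample points
max_diff = max(
    abs(math.tanh(z) - (2.0 * sigmoid(2.0 * z) - 1.0))
    for z in (-3.0, -0.5, 0.0, 0.5, 3.0)
)

tanh_slope_at_zero = tanh_grad(0.0)
sigmoid_slope_at_zero = sigmoid_grad(0.0)

print(max_diff, tanh_slope_at_zero, sigmoid_slope_at_zero)
```

The slope of tanh at zero is 1, four times the sigmoid's 0.25, which is the "steeper gradient" mentioned among the advantages.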

3. ReLU (Rectified Linear Unit) — The Workhorse of Deep Learning

Formula: $\text{ReLU}(z) = \max(0, z)$

Range: [0, ∞)

Derivative: $\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}$

When to Use:

Default choice for hidden layers in almost all modern deep networks (CNNs, MLPs, transformers, etc.).

Advantages:

Mitigates Vanishing Gradient: For positive inputs, the gradient is exactly 1, so it does not shrink as it passes through the activation.
Computationally Trivial: Just a max comparison, with no exponentials and no divisions.
Sparsity: Outputs are exactly zero for negative inputs, which can make the network more efficient.
Enables Deep Networks: A key reason, together with careful weight initialization and normalization, that networks with dozens (or hundreds) of layers can be trained effectively.

Disadvantages:

Dying ReLU Problem: If a neuron’s weights are updated such that it always receives negative inputs, its gradient becomes zero forever. That neuron “dies” and never recovers.
Not Zero-Centered: Outputs are non-negative.
Unbounded: Can produce extremely large activations (though usually managed by weight initialization and normalization).

Dying ReLU Example:
If a large gradient flows through a ReLU neuron, it might push its weights so that for all training examples, $z < 0$. From that point onward, the gradient is zero, and the neuron contributes nothing. With proper initialization and learning rates, this is rare but possible.
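
The scenario can be sketched with a single neuron. The inputs, weight, and bias below are hypothetical values chosen to mimic the state after one bad update; they are not from a real training run.

```python
def relu(z):
    return max(0.0, z)


def relu_grad(z):
    """Derivative of ReLU, using the convention that it is 0 at z = 0."""
    return 1.0 if z > 0 else 0.0


# Hypothetical training inputs and neuron parameters after a large update
inputs = [0.2, 0.8, 1.5, 3.0]
w, b = 0.5, -10.0

pre_activations = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_activations]

# Gradient of the neuron's output with respect to w, per example: relu'(z) * x
weight_grads = [relu_grad(z) * x for z, x in zip(pre_activations, inputs)]

print("pre-activations:", pre_activations)
print("outputs:        ", outputs)
print("weight grads:   ", weight_grads)
```

Every pre-activation is negative, so every output and every gradient is zero. Whatever the loss is, gradient descent has no signal to move `w` or `b` back into the active region.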

Modern Verdict: The default choice for hidden layers. Start here unless you have a specific reason to do otherwise.

4. Leaky ReLU and Variants — Fixing the Dying Problem

Leaky ReLU Formula: $\text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{otherwise} \end{cases}$

Where $\alpha$ is a small constant (typically 0.01).

Range: (-∞, ∞)

Derivative: $\text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{otherwise} \end{cases}$

When to Use:

When you observe “dying neurons” with standard ReLU.
For very deep networks (e.g., 100+ layers) as a safety measure.

Advantages:

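Preserves most benefits of ReLU (cheap to compute, gradient of 1 for positive inputs), although outputs for negative inputs are no longer exactly zero.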
Eliminates dying ReLU by allowing a small gradient for negative inputs.
Minimal computational overhead.

Variants:

Parametric ReLU (PReLU): Learns $\alpha$ during training.
Exponential Linear Unit (ELU): Smooth negative region, pushes mean activation toward zero.
Swish: $z \cdot \sigma(z)$, discovered by automated search, used in some advanced architectures.

Modern Verdict: Use Leaky ReLU if ReLU causes dead neurons. Otherwise, standard ReLU is fine.
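
For reference, here is a minimal sketch of Leaky ReLU and two of the variants above as scalar Python functions. The ELU formula used here, $\alpha(e^{z} - 1)$ for negative inputs, is the standard definition and is not spelled out in this article; the default `alpha` values are common choices, not requirements.

```python
import math


def leaky_relu(z, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope otherwise."""
    return z if z > 0 else alpha * z


def leaky_relu_grad(z, alpha=0.01):
    """Derivative of Leaky ReLU: never exactly zero."""
    return 1.0 if z > 0 else alpha


def elu(z, alpha=1.0):
    """ELU: smooth negative branch that saturates at -alpha."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)


def swish(z):
    """Swish: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


for z in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"z={z:+.1f}  leaky={leaky_relu(z):+.4f}  "
          f"elu={elu(z):+.4f}  swish={swish(z):+.4f}")
```

Unlike plain ReLU, `leaky_relu_grad` returns `alpha` rather than 0 for negative inputs, which is what lets a neuron recover.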

5. Softmax — The Multi-Class Output King

Formula: $\text{Softmax}(z_{i}) = \frac{e^{z_{i}}}{\sum_{j=1}^{K} e^{z_{j}}} \quad \text{for } i = 1, \dots, K$

Range: (0, 1) for each output, and all outputs sum to 1.

When to Use:

Output layer of multi-class classification networks where each sample belongs to exactly one class. The number of output neurons equals the number of classes.

Advantages:

Produces a valid probability distribution over classes.
Differentiable and smooth.
The exponential amplifies differences: the largest logit becomes the dominant probability.

Disadvantages:

Computationally expensive for many classes (requires computing all exponentials and a sum).
Can suffer from numerical overflow for large logits (mitigated by subtracting the max logit).

Modern Verdict: The standard choice for multi-class classification output with mutually exclusive classes.
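
The max-subtraction trick mentioned under the disadvantages can be sketched as follows. Subtracting the largest logit leaves the result unchanged, because the common factor cancels in the ratio, and it keeps every exponent at or below zero. The example logits are arbitrary.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)                            # shift so the largest exponent is 0
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])

# A naive math.exp(1000.0) would overflow; the shifted version does not
large = softmax([1000.0, 1001.0, 1002.0])
small = softmax([0.0, 1.0, 2.0])

print(probs, sum(probs))
print(large)
```

Because softmax depends only on differences between logits, `large` and `small` are the same distribution.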

Activation Function Selection Cheat Sheet

Network Component	Recommended Activation	Alternatives
Hidden layers (general)	ReLU	Leaky ReLU, ELU
Hidden layers (very deep)	Leaky ReLU or Swish	PReLU
Binary classification output	Sigmoid	None
Multi-class classification output	Softmax	None
Regression output (unbounded)	Linear (no activation)	None
Regression output (bounded to [0,1])	Sigmoid	None

Part II: Loss Functions

What Is a Loss Function?

A loss function (also called a cost function or objective function) quantifies how “wrong” the network’s predictions are compared to the true targets. During training, the optimization algorithm (e.g., stochastic gradient descent, Adam) adjusts the network’s weights to minimize the loss.

If activation functions are the heart of the network, loss functions are its conscience.

The Perceptron Loss and Its Failure

The original perceptron used a simple loss: the number of misclassifications. This is the 0-1 loss: $L_{0-1}(y, \hat{y}) = \begin{cases} 0 & \text{if } y = \hat{y} \\ 1 & \text{otherwise} \end{cases}$

The problem? The 0-1 loss is discontinuous, has zero gradient wherever it is differentiable, and provides no information about how close a prediction was. For a positive example, a prediction that is barely wrong (0.49) incurs the same loss as a confidently wrong one (0.01). Gradient-based optimization is impossible.

Modern loss functions are continuous, differentiable almost everywhere, and provide meaningful gradient signals.
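
To see the difference in feedback, the sketch below scores two wrong predictions for a positive example (true label 1) under the 0-1 loss and under a log loss. The helper names and the 0.5 threshold are assumptions made for illustration.

```python
import math


def zero_one_loss(y, p, threshold=0.5):
    """0-1 loss after thresholding the predicted probability p."""
    predicted_class = 1 if p >= threshold else 0
    return 0 if predicted_class == y else 1


def log_loss(y, p):
    """Negative log-likelihood of the true label under probability p."""
    return -math.log(p) if y == 1 else -math.log(1.0 - p)


barely_wrong, badly_wrong = 0.49, 0.01

for p in (barely_wrong, badly_wrong):
    print(f"p={p:.2f}  0-1 loss={zero_one_loss(1, p)}  log loss={log_loss(1, p):.3f}")
```

Both predictions receive a 0-1 loss of 1, but the log loss for the confidently wrong prediction is several times larger, which gives the optimizer a sense of direction and magnitude.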

Loss Functions for Regression Tasks

1. Mean Squared Error (MSE) / L2 Loss

Formula: $L_{\text{MSE}} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$

When to Use:

Regression problems where outliers are rare or you want to heavily penalize large errors.
Normally distributed targets (Gaussian noise assumption).

Advantages:

Smooth, convex (for linear models), and easy to optimize.
Heavily penalizes large errors, pushing the model to reduce its largest mistakes first.

Disadvantages:

Sensitive to Outliers: Squaring amplifies the effect of outliers, pulling the model away from the majority of data.
Units are squared (e.g., “meters squared” for a length prediction), which can be unintuitive.

2. Mean Absolute Error (MAE) / L1 Loss

Formula: $L_{\text{MAE}} = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}|$

When to Use:

Regression with outliers (MAE is robust to outliers).
When you want a linear penalty proportional to the error.

Advantages:

Robust to outliers (error grows linearly, not quadratically).
Units match the original target.

Disadvantages:

Gradient is constant (not proportional to error magnitude), making it harder to converge precisely.
Not differentiable at zero (though subgradients work in practice).

3. Huber Loss (Best of Both Worlds)

Formula: $L_{\text{Huber}}(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^{2} & \text{for } |y - \hat{y}| \leq \delta \\ \delta \cdot |y - \hat{y}| - \frac{1}{2}\delta^{2} & \text{otherwise} \end{cases}$

Where $\delta > 0$ is the error threshold at which the loss switches from quadratic to linear.

When to Use:

Regression with potential outliers where MSE is too sensitive but MAE is too slow to converge.

Advantages:

Quadratic near zero (smooth, precise convergence).
Linear for large errors (robust to outliers).
Differentiable everywhere.

Modern Verdict: A common default for robust regression.
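
The three regression losses can be compared side by side on a toy example. In this sketch the target and prediction lists are invented for illustration, and the Huber loss is averaged over samples in the same way as MSE and MAE (the formula above is per sample).

```python
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def huber(y_true, y_pred, delta=1.0):
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        if r <= delta:
            total += 0.5 * r * r                    # quadratic region
        else:
            total += delta * r - 0.5 * delta ** 2   # linear region
    return total / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
y_clean = [1.5, 2.5, 2.5, 3.5]      # every error is 0.5
y_outlier = [1.5, 2.5, 2.5, 14.0]   # last error is 10

for name, pred in (("clean", y_clean), ("outlier", y_outlier)):
    print(f"{name:8s} MSE={mse(y_true, pred):.4f}  "
          f"MAE={mae(y_true, pred):.4f}  Huber={huber(y_true, pred):.4f}")
```

With the single outlier, MSE grows from 0.25 to about 25.19, while MAE (0.5 to 2.875) and Huber (0.125 to about 2.47) grow far more gently.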

Loss Functions for Classification Tasks

1. Binary Cross-Entropy (Log Loss)

Formula: $L_{\text{BCE}} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_{i} \log(\hat{y}_{i}) + (1 - y_{i}) \log(1 - \hat{y}_{i}) \right]$

When to Use:

Binary classification (one output neuron with sigmoid activation).

Intuition:

If $y = 1$, loss is $-\log(\hat{y})$. Penalty goes to infinity as $\hat{y} \to 0$.
If $y = 0$, loss is $-\log(1 - \hat{y})$. Penalty goes to infinity as $\hat{y} \to 1$.

Advantages:

Provides very strong gradients when predictions are confidently wrong.
The natural loss for probabilistic binary classification.
Pairs naturally with a sigmoid output: the gradient with respect to the pre-activation simplifies to $\hat{y} - y$.

Modern Verdict: The standard for binary classification.
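
One reason the sigmoid and binary cross-entropy pairing works so well is that the gradient of the loss with respect to the pre-activation $z$ reduces to $\hat{y} - y$. The sketch below checks this numerically for one example; the function names, the clipping constant `eps`, and the chosen values of `y` and `z` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_from_logit(y, z, eps=1e-12):
    """Binary cross-entropy for one example, computed from the logit z."""
    p = min(max(sigmoid(z), eps), 1.0 - eps)   # clip to avoid log(0)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def numerical_grad(y, z, h=1e-6):
    """Central-difference estimate of d(loss)/dz."""
    return (bce_from_logit(y, z + h) - bce_from_logit(y, z - h)) / (2 * h)


y, z = 1.0, -2.0                  # a positive example the model gets wrong
analytic = sigmoid(z) - y         # the simplified gradient: y_hat - y
numeric = numerical_grad(y, z)

print(f"analytic={analytic:.6f}  numeric={numeric:.6f}")
```

The two values agree closely, and the gradient magnitude stays large precisely when the prediction is confidently wrong.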

2. Categorical Cross-Entropy

Formula: $L_{\text{CCE}} = -\sum_{i=1}^{K} y_{i} \log(\hat{y}_{i})$

Where $K$ is the number of classes, $y$ is a one-hot encoded vector, and $\hat{y}$ comes from softmax.

When to Use:

Multi-class classification (mutually exclusive classes, e.g., digit classification: 0-9).

Advantages:

Measures the divergence between predicted probability distribution and true distribution.
Naturally paired with softmax activation.
Large gradients when the model is confidently wrong.

Modern Verdict: The standard for multi-class classification.

3. Sparse Categorical Cross-Entropy

Formula: Same as categorical cross-entropy, but $y$ is provided as an integer class index (e.g., 3) rather than a one-hot vector.

When to Use:

Multi-class classification with many classes (one-hot would be memory-inefficient).
When labels are already stored as integers.

Advantages:

Memory efficient for large $K$.
Same mathematical effect as categorical cross-entropy.
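
The equivalence is easy to see in code: with a one-hot target, every term of the sum except the true class is multiplied by zero. A minimal sketch, with a made-up probability vector standing in for a softmax output:

```python
import math


def categorical_ce(one_hot, probs):
    """Cross-entropy with a one-hot target vector."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)


def sparse_categorical_ce(class_index, probs):
    """Cross-entropy with an integer class index as the target."""
    return -math.log(probs[class_index])


probs = [0.1, 0.7, 0.2]   # illustrative softmax output for 3 classes

dense_loss = categorical_ce([0, 1, 0], probs)
sparse_loss = sparse_categorical_ce(1, probs)

print(dense_loss, sparse_loss)
```

Both calls return $-\log(0.7)$; the sparse form just skips building the one-hot vector.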

4. Hinge Loss (For SVMs and Some Neural Networks)

Formula: $L_{\text{Hinge}} = \max(0, 1 - y \cdot \hat{y})$

Where $y \in \{-1, +1\}$ and $\hat{y}$ is the raw output score (no sigmoid applied).

When to Use:

When you care more about “margin” than calibrated probabilities.
Historically used with SVMs; sometimes used in neural networks for classification.

Advantages:

Encourages confident, correct predictions with a margin.
Examples classified correctly beyond the margin incur exactly zero loss, so they stop influencing training.

Disadvantages:

Does not produce well-calibrated probabilities.
Not as widely used in deep learning as cross-entropy.
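
A short sketch makes the margin behaviour visible. The scores below are arbitrary raw outputs chosen to cover the three interesting cases.

```python
def hinge_loss(y, score):
    """Hinge loss for a label y in {-1, +1} and a raw score."""
    return max(0.0, 1.0 - y * score)


cases = [
    (+1, 2.5),   # correct and beyond the margin
    (+1, 0.3),   # correct but inside the margin
    (-1, 0.3),   # wrong side of the decision boundary
]

for y, score in cases:
    print(f"y={y:+d}  score={score:+.1f}  loss={hinge_loss(y, score):.2f}")
```

Only the first case has zero loss. The second is classified correctly but still penalized (0.7) because the score is within the margin, and the third is penalized more (1.3).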

Loss Function Selection Cheat Sheet

Task Type	Output Activation	Recommended Loss
Binary classification	Sigmoid	Binary Cross-Entropy
Multi-class classification (mutually exclusive)	Softmax	Categorical Cross-Entropy
Multi-label classification (multiple classes per sample)	Sigmoid (per class)	Binary Cross-Entropy (averaged)
Regression (normal data)	Linear	Mean Squared Error (MSE)
Regression (with outliers)	Linear	Huber Loss or MAE
Regression (bounded output [0,1])	Sigmoid	MSE or Binary Cross-Entropy

The Symbiotic Relationship

Activation functions and loss functions are not independent choices. They must be paired correctly:

Binary classification: Sigmoid output + Binary Cross-Entropy
Multi-class classification: Softmax output + Categorical Cross-Entropy
Regression: Linear output + MSE/Huber

Mismatched pairs can slow or stall training. For example, using MSE with a softmax output works mathematically but yields small gradients when the output saturates, so convergence is slow. Using binary cross-entropy with a linear regression output is nonsensical because cross-entropy requires probability-like inputs between 0 and 1.
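
The effect of a mismatched pair can be sketched in the simpler binary case, using a sigmoid output as a stand-in for softmax. The example compares the gradient with respect to the logit under binary cross-entropy and under a per-sample squared error $(\hat{y} - y)^{2}$ when the model is confidently wrong. The values of `y` and `z` are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bce_grad_wrt_logit(y, z):
    """d/dz of -[y log(p) + (1 - y) log(1 - p)] with p = sigmoid(z)."""
    return sigmoid(z) - y


def mse_grad_wrt_logit(y, z):
    """d/dz of (p - y)^2 with p = sigmoid(z), via the chain rule."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)


y, z = 1.0, -8.0   # true label 1, but the model is confident in class 0

bce_grad = bce_grad_wrt_logit(y, z)
mse_grad = mse_grad_wrt_logit(y, z)

print(f"BCE gradient: {bce_grad:.5f}")
print(f"MSE gradient: {mse_grad:.5f}")
```

The cross-entropy gradient is close to -1, a strong correction signal. The squared-error gradient is more than a thousand times smaller because it carries the extra factor $\hat{y}(1 - \hat{y})$, which vanishes exactly where the model most needs to change.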

Practical Implementation Example (PyTorch)

Below is a concise example showing common activation and loss function pairings:

```python
import torch
import torch.nn as nn

# Binary classification
class BinaryClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))  # Activation in forward

model = BinaryClassifier(10)
criterion = nn.BCELoss()  # Binary Cross-Entropy (expects probabilities)

# Multi-class classification
class MultiClassifier(nn.Module):
    def __init__(self, input_dim, num_classes):
        super().__init__()
        self.fc = nn.Linear(input_dim, num_classes)

    def forward(self, x):
        # No activation here; CrossEntropyLoss applies softmax internally
        return self.fc(x)  # Raw logits

model2 = MultiClassifier(10, 5)
criterion2 = nn.CrossEntropyLoss()  # Includes softmax internally

# Regression
class Regressor(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 1)

    def forward(self, x):
        return self.fc(x)  # Linear activation

model3 = Regressor(10)
criterion3 = nn.MSELoss()  # Mean Squared Error
```

Practical Implementation Example (NumPy, from Scratch)

```python
"""
Example 1: Binary Classification using Sigmoid Activation + Binary Cross-Entropy Loss
Dataset: Synthetic binary classification with two features (moon-shaped data)
"""

import time

import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


class BinaryClassifier:
    """
    Neural network for binary classification with sigmoid activation and BCE loss.
    Architecture: Input -> Hidden(16, ReLU) -> Hidden(8, ReLU) -> Output(1, Sigmoid)
    """

    def __init__(self, input_dim, hidden_dims=(16, 8), learning_rate=0.01):
        """
        Initialize network with Xavier/Glorot initialization.

        Args:
            input_dim: Number of input features (2 for make_moons)
            hidden_dims: Sequence of hidden layer sizes
            learning_rate: Step size for gradient descent
        """
        self.learning_rate = learning_rate

        # Build layer dimensions: [input_dim, hidden_dims..., 1]
        layer_dims = [input_dim] + list(hidden_dims) + [1]
        self.num_layers = len(layer_dims) - 1

        # Initialize weights and biases
        self.weights = []
        self.biases = []

        for i in range(self.num_layers):
            # Xavier initialization for weights
            scale = np.sqrt(2.0 / (layer_dims[i] + layer_dims[i + 1]))
            w = np.random.randn(layer_dims[i], layer_dims[i + 1]) * scale
            b = np.zeros((1, layer_dims[i + 1]))

            self.weights.append(w)
            self.biases.append(b)

        # Store caches for backpropagation
        self.caches = []

    def relu(self, z):
        """ReLU activation function."""
        return np.maximum(0, z)

    def relu_derivative(self, z):
        """Derivative of ReLU for backpropagation."""
        return (z > 0).astype(float)

    def sigmoid(self, z):
        """Sigmoid activation function for output layer."""
        # Clip to prevent numerical overflow
        z = np.clip(z, -500, 500)
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_derivative(self, z):
        """Derivative of sigmoid."""
        sig = self.sigmoid(z)
        return sig * (1 - sig)

    def forward(self, X):
        """
        Forward pass through the network.

        Args:
            X: Input data of shape (n_samples, input_dim)

        Returns:
            output: Final predictions (probabilities)
        """
        self.caches = []  # Clear previous caches
        current_input = X

        # Forward through hidden layers (ReLU activation)
        for i in range(self.num_layers - 1):
            z = np.dot(current_input, self.weights[i]) + self.biases[i]
            a = self.relu(z)
            self.caches.append((current_input, z, a, 'relu'))
            current_input = a

        # Final layer (sigmoid activation)
        z_final = np.dot(current_input, self.weights[-1]) + self.biases[-1]
        a_final = self.sigmoid(z_final)
        self.caches.append((current_input, z_final, a_final, 'sigmoid'))

        return a_final

    def compute_loss(self, y_true, y_pred):
        """
        Binary Cross-Entropy Loss.

        Formula: L = -[y*log(y_hat) + (1-y)*log(1-y_hat)]

        Args:
            y_true: Ground truth labels (0 or 1)
            y_pred: Predicted probabilities (0 to 1)

        Returns:
            loss: Scalar loss value
        """
        # Clip predictions to avoid log(0)
        y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)
        loss = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
        return loss

    def backward(self, y_true, y_pred):
        """
        Backward pass using gradient descent.

        Must be called right after forward() on the same batch, because it
        relies on the caches stored during that forward pass.

        Args:
            y_true: Ground truth labels
            y_pred: Predicted probabilities from forward pass

        Returns:
            None (updates weights and biases in-place)
        """
        m = y_true.shape[0]  # Number of samples

        # For sigmoid + BCE, dL/dz at the output simplifies to (y_pred - y_true).
        # This is already the gradient w.r.t. the pre-activation, so the sigmoid
        # derivative must not be applied a second time.
        dZ = (y_pred - y_true) / m
        dA = None

        # Backpropagate through layers (from output to input)
        for i in reversed(range(self.num_layers)):
            input_prev, z, a, activation_type = self.caches[i]

            if activation_type == 'relu':
                # dA was propagated from the layer above
                dZ = dA * self.relu_derivative(z)

            # Compute gradients
            dW = np.dot(input_prev.T, dZ)
            dB = np.sum(dZ, axis=0, keepdims=True)

            # Propagate gradient to the previous layer using the weights
            # from before this step's update
            if i > 0:
                dA = np.dot(dZ, self.weights[i].T)

            # Update parameters
            self.weights[i] -= self.learning_rate * dW
            self.biases[i] -= self.learning_rate * dB

    def train(self, X_train, y_train, X_val, y_val, epochs=500, batch_size=32, verbose=True):
        """
        Train the network using mini-batch gradient descent.

        Args:
            X_train: Training features
            y_train: Training labels
            X_val: Validation features
            y_val: Validation labels
            epochs: Number of training epochs
            batch_size: Mini-batch size
            verbose: Print progress

        Returns:
            history: Dictionary of training and validation losses and accuracies
        """
        history = {
            'train_loss': [],
            'val_loss': [],
            'train_acc': [],
            'val_acc': []
        }

        num_samples = X_train.shape[0]

        for epoch in range(epochs):
            # Shuffle training data
            indices = np.random.permutation(num_samples)
            X_shuffled = X_train[indices]
            y_shuffled = y_train[indices]

            epoch_loss = 0
            num_batches = 0

            # Mini-batch training
            for start_idx in range(0, num_samples, batch_size):
                end_idx = min(start_idx + batch_size, num_samples)
                X_batch = X_shuffled[start_idx:end_idx]
                y_batch = y_shuffled[start_idx:end_idx]

                # Forward pass
                predictions = self.forward(X_batch)

                # Compute loss
                batch_loss = self.compute_loss(y_batch, predictions)
                epoch_loss += batch_loss
                num_batches += 1

                # Backward pass and parameter update
                self.backward(y_batch, predictions)

            # Average loss for the epoch
            avg_train_loss = epoch_loss / num_batches

            # Validation metrics
            val_pred = self.forward(X_val)
            val_loss = self.compute_loss(y_val, val_pred)
            train_acc = self.accuracy(y_train, self.forward(X_train))
            val_acc = self.accuracy(y_val, val_pred)

            # Store history
            history['train_loss'].append(avg_train_loss)
            history['val_loss'].append(val_loss)
            history['train_acc'].append(train_acc)
            history['val_acc'].append(val_acc)

            if verbose and (epoch % 50 == 0):
                print(f"Epoch {epoch:3d} | Train Loss: {avg_train_loss:.4f} | "
                      f"Val Loss: {val_loss:.4f} | Train Acc: {train_acc:.2%} | "
                      f"Val Acc: {val_acc:.2%}")

        return history

    def accuracy(self, y_true, y_pred):
        """Compute classification accuracy (threshold at 0.5)."""
        y_pred_class = (y_pred >= 0.5).astype(int)
        return np.mean(y_true == y_pred_class)

    def predict(self, X, threshold=0.5):
        """Predict class labels (0 or 1) and return them with the probabilities."""
        probabilities = self.forward(X)
        return (probabilities >= threshold).astype(int), probabilities


def plot_decision_boundary(model, X, y, title):
    """Plot decision boundary of the classifier."""
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z, _ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(12, 5))

    plt.subplot(1, 2, 1)
    plt.contourf(xx, yy, Z, alpha=0.8, cmap='RdBu')
    plt.scatter(X[:, 0], X[:, 1], c=np.ravel(y), cmap='RdBu', edgecolors='black')
    plt.title(f'{title} - Decision Boundary')
    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')

    return plt


def main():
    """Main execution function."""
    print("=" * 60)
    print("Example 1: Binary Classification with Sigmoid + BCE")
    print("=" * 60)

    # Generate dataset (non-linearly separable)
    print("\n[1] Generating make_moons dataset...")
    X, y = make_moons(n_samples=2000, noise=0.2, random_state=42)
    y = y.reshape(-1, 1)  # Reshape to column vector

    # Split data
    X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, random_state=42)
    X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=42)

    # Standardize features
    print("[2] Standardizing features...")
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_val = scaler.transform(X_val)
    X_test = scaler.transform(X_test)

    print(f"Training samples: {X_train.shape[0]}")
    print(f"Validation samples: {X_val.shape[0]}")
    print(f"Test samples: {X_test.shape[0]}")

    # Initialize model
    print("\n[3] Initializing binary classifier...")
    model = BinaryClassifier(input_dim=2, hidden_dims=[32, 16, 8], learning_rate=0.05)

    # Train model
    print("[4] Training model...")
    print("-" * 60)
    start_time = time.time()
    history = model.train(X_train, y_train, X_val, y_val, epochs=500, batch_size=64, verbose=True)
    training_time = time.time() - start_time
    print("-" * 60)
    print(f"Training completed in {training_time:.2f} seconds")

    # Final evaluation
    print("\n[5] Final evaluation on test set...")
    test_pred, test_probs = model.predict(X_test)
    test_acc = model.accuracy(y_test, test_pred)
    test_loss = model.compute_loss(y_test, test_probs)
    print(f"Test Accuracy: {test_acc:.2%}")
    print(f"Test Loss: {test_loss:.4f}")

    # Plot results
    print("\n[6] Generating visualizations...")

    # Plot training curves
    plt.figure(figsize=(15, 5))

    plt.subplot(1, 3, 1)
    plt.plot(history['train_loss'], label='Training Loss', linewidth=2)
    plt.plot(history['val_loss'], label='Validation Loss', linewidth=2)
    plt.xlabel('Epoch')
    plt.ylabel('Loss')
    plt.title('Loss Curves (Binary Cross-Entropy)')
    plt.legend()
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 3, 2)
    plt.plot(history['train_acc'], label='Training Accuracy', linewidth=2)
    plt.plot(history['val_acc'], label='Validation Accuracy', linewidth=2)
    plt.xlabel('Epoch')
    plt.ylabel('Accuracy')
    plt.title('Accuracy Curves')
    plt.legend()
    plt.grid(True, alpha=0.3)

    # Decision boundary
    plt.subplot(1, 3, 3)
    X_full = np.vstack([X_train, X_val, X_test])
    y_full = np.vstack([y_train, y_val, y_test])
    x_min, x_max = X_full[:, 0].min() - 0.5, X_full[:, 0].max() + 0.5
    y_min, y_max = X_full[:, 1].min() - 0.5, X_full[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
    Z, _ = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8, cmap='RdBu')
    plt.scatter(X_full[:, 0], X_full[:, 1], c=y_full.flatten(), cmap='RdBu', edgecolors='black', s=20)
    plt.title('Decision Boundary (Sigmoid + BCE)')
    plt.xlabel('Feature 1 (Standardized)')
    plt.ylabel('Feature 2 (Standardized)')

    plt.tight_layout()
    plt.savefig('binary_classification_results.png', dpi=150)
    print("   Saved: binary_classification_results.png")

    print("\n[7] Summary statistics:")
    print(f"   Final Training Loss: {history['train_loss'][-1]:.4f}")
    print(f"   Final Validation Loss: {history['val_loss'][-1]:.4f}")
    print(f"   Final Validation Accuracy: {history['val_acc'][-1]:.2%}")
    print(f"   Best Validation Accuracy: {max(history['val_acc']):.2%}")

    print("\n✅ Example 1 completed successfully!")
    plt.show()


if __name__ == "__main__":
    main()
```

Conclusion

Activation functions and loss functions are the silent partners in every neural network success story. The activation function injects the non-linearity that allows networks to approximate a very wide class of functions, while the loss function provides the objective that guides learning toward useful solutions.

The journey from the perceptron’s step function and 0-1 loss to ReLU and cross-entropy represents decades of hard-won insight. Today, the defaults are well-established: ReLU for hidden layers, softmax/sigmoid for classification outputs, and cross-entropy as the loss. But understanding why these work—and when to deviate (Leaky ReLU for dying neurons, Huber loss for outliers)—separates practitioners who copy-paste from those who truly engineer solutions.

As you build your own networks, remember: the activation function shapes what the network can express; the loss function defines what it should value. Master both, and you master the learning problem itself.