{"id":3531,"date":"2026-06-11T10:41:33","date_gmt":"2026-06-11T10:41:33","guid":{"rendered":"https:\/\/www.mhtechin.com\/support\/?p=3531"},"modified":"2026-06-11T10:41:33","modified_gmt":"2026-06-11T10:41:33","slug":"backpropagation-and-gradient-descent","status":"publish","type":"post","link":"https:\/\/www.mhtechin.com\/support\/backpropagation-and-gradient-descent\/","title":{"rendered":"Backpropagation and Gradient Descent"},"content":{"rendered":"\n<h3 class=\"wp-block-heading\">Backpropagation and Gradient Descent: The Engines of Deep Learning<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">Table of Contents<\/h3>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Introduction<\/li>\n\n\n\n<li>The Optimization Problem<\/li>\n\n\n\n<li>Gradient Descent: First Principles\n<ul class=\"wp-block-list\">\n<li>Intuition: Walking Down the Mountain<\/li>\n\n\n\n<li>Learning Rate: The Step Size<\/li>\n\n\n\n<li>Batch, Stochastic, and Mini-Batch GD<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Backpropagation: The Chain Rule in Action\n<ul class=\"wp-block-list\">\n<li>Why Backpropagation?<\/li>\n\n\n\n<li>Forward Pass<\/li>\n\n\n\n<li>Backward Pass (Error Propagation)<\/li>\n\n\n\n<li>Computing Gradients for All Layers<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li>Numerical Example: Backprop by Hand<\/li>\n\n\n\n<li>Vanishing &amp; Exploding Gradients<\/li>\n\n\n\n<li>Advanced Optimizers: Momentum, RMSprop, Adam<\/li>\n\n\n\n<li>Python Implementation from Scratch<\/li>\n\n\n\n<li>Gradient Checking: Debugging Your Backprop<\/li>\n\n\n\n<li>Common Pitfalls &amp; Best Practices<\/li>\n\n\n\n<li>Conclusion<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">1. Introduction<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Every neural network learns by minimizing a&nbsp;<strong>loss function<\/strong>&nbsp;\u2014 a measure of how wrong its predictions are. 
But how does the network know&nbsp;<em>which direction<\/em>&nbsp;to adjust its thousands (or billions) of weights?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Two algorithms answer this question:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Gradient Descent<\/strong>\u00a0provides the\u00a0<em>strategy<\/em>: repeatedly step along the negative gradient of the loss function toward a minimum.<\/li>\n\n\n\n<li><strong>Backpropagation<\/strong>\u00a0provides the\u00a0<em>mechanism<\/em>: efficiently compute that gradient for every parameter in a multi-layer network.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Without backpropagation, training deep networks would be computationally impractical. Without gradient descent, we would have no way to use those gradients. Together, they are the twin engines of modern deep learning, powering everything from multilayer perceptrons to large language models.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">2. The Optimization Problem<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Consider a neural network with parameters&nbsp;<strong>\u03b8<\/strong>&nbsp;(all weights and biases). For a given input&nbsp;<strong>x<\/strong>, the network produces a prediction&nbsp;<strong>\u0177<\/strong>. 
We have a loss function&nbsp;<strong>L(\u03b8)<\/strong>&nbsp;that measures the error between&nbsp;<strong>\u0177<\/strong>&nbsp;and the true label&nbsp;<strong>y<\/strong>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Our goal is to find:<math display=\"block\"><semantics><mrow><msup><mi>\u03b8<\/mi><mo>\u2217<\/mo><\/msup><mo>=<\/mo><mi>arg<\/mi><mo>\u2061<\/mo><munder><mrow><mi>min<\/mi><mo>\u2061<\/mo><\/mrow><mi>\u03b8<\/mi><\/munder><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For deep networks, there is&nbsp;<strong>no closed-form solution<\/strong>&nbsp;(unlike linear regression). Instead, we use iterative optimization: start with random parameters, repeatedly compute the gradient of the loss with respect to each parameter, and take a small step in the opposite direction.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The gradient \u2207L(\u03b8) is a vector of partial derivatives:<math display=\"block\"><semantics><mrow><mi mathvariant=\"normal\">\u2207<\/mi><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><mi>\u03b8<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><mrow><mo fence=\"true\">[<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>w<\/mi><mn>1<\/mn><\/msub><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>w<\/mi><mn>2<\/mn><\/msub><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mo separator=\"true\">,<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>b<\/mi><mn>1<\/mn><\/msub><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mo>\u2026<\/mo><mtext>\u2009<\/mtext><mo 
fence=\"true\">]<\/mo><\/mrow><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Each partial derivative tells us:&nbsp;<em>If we increase this parameter slightly, how much will the loss change?<\/em><\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">3. Gradient Descent: First Principles<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">3.1 Intuition: Walking Down the Mountain<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Imagine standing on a foggy mountain at night. You want to reach the valley (minimum loss), but you can only feel the slope beneath your feet. Gradient descent says:&nbsp;<em>Take a step in the steepest downhill direction.<\/em><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The update rule is deceptively simple:<math display=\"block\"><semantics><mrow><msub><mi>\u03b8<\/mi><mtext>new<\/mtext><\/msub><mo>=<\/mo><msub><mi>\u03b8<\/mi><mtext>old<\/mtext><\/msub><mo>\u2212<\/mo><mi>\u03b7<\/mi><mo>\u22c5<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>\u03b8<\/mi><mtext>old<\/mtext><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where&nbsp;<strong>\u03b7<\/strong>&nbsp;(eta) is the&nbsp;<strong>learning rate<\/strong>, a small positive number controlling step size.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.2 The Learning Rate: Too Big, Too Small, Just Right<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Choosing the learning rate is one of the most critical decisions in training:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table 
class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Learning Rate<\/th><th class=\"has-text-align-left\" data-align=\"left\">Effect<\/th><th class=\"has-text-align-left\" data-align=\"left\">Consequence<\/th><\/tr><\/thead><tbody><tr><td><strong>Too small<\/strong><\/td><td>Tiny steps<\/td><td>Extremely slow convergence, may get stuck in local minima<\/td><\/tr><tr><td><strong>Too large<\/strong><\/td><td>Giant leaps<\/td><td>Oscillation, divergence, loss goes to infinity<\/td><\/tr><tr><td><strong>Just right<\/strong><\/td><td>Steady descent<\/td><td>Smooth convergence to minimum<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">In practice, learning rates are often tuned empirically, starting with common values like 0.01, 0.001, or 0.0001, and using&nbsp;<strong>learning rate schedules<\/strong>&nbsp;(decay over time).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">3.3 Batch, Stochastic, and Mini-Batch Gradient Descent<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Computing the true gradient \u2207L(\u03b8) requires evaluating the loss on&nbsp;<strong>every training example<\/strong>&nbsp;\u2014 expensive for large datasets. Three variants balance accuracy and speed:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>1. Batch Gradient Descent (BGD)<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses the entire dataset to compute the gradient.<\/li>\n\n\n\n<li><strong>Pros:<\/strong>\u00a0Accurate gradient estimate, stable convergence.<\/li>\n\n\n\n<li><strong>Cons:<\/strong>\u00a0Very slow for large datasets, cannot update online.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>2. 
Stochastic Gradient Descent (SGD)<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses\u00a0<strong>one random sample<\/strong>\u00a0to approximate the gradient.<\/li>\n\n\n\n<li>Update after every sample: \u03b8 \u2190 \u03b8 &#8211; \u03b7\u00b7\u2207L_i(\u03b8)<\/li>\n\n\n\n<li><strong>Pros:<\/strong>\u00a0Very fast, can escape shallow local minima, online learning.<\/li>\n\n\n\n<li><strong>Cons:<\/strong>\u00a0Noisy gradient (loss jumps around), never settles perfectly.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>3. Mini-Batch Gradient Descent (The Winner)<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Uses a small batch (e.g., 32, 64, 128, 256) of samples.<\/li>\n\n\n\n<li><strong>Pros:<\/strong>\u00a0Best of both worlds \u2014 stable gradient estimates, hardware efficient (vectorized operations), widely used in practice.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why mini-batches work well:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Modern GPUs excel at parallel matrix operations on small batches.<\/li>\n\n\n\n<li>The gradient from a batch is a good-enough approximation of the true gradient.<\/li>\n\n\n\n<li>Noise in the gradient helps escape poor local minima.<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">4. 
Backpropagation: The Chain Rule in Action<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">4.1 Why Do We Need Backpropagation?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A neural network is a&nbsp;<strong>composition of functions<\/strong>:<math display=\"block\"><semantics><mrow><mover accent=\"true\"><mi>y<\/mi><mo>^<\/mo><\/mover><mo>=<\/mo><msub><mi>f<\/mi><mi>L<\/mi><\/msub><mo stretchy=\"false\">(<\/mo><msub><mi>f<\/mi><mrow><mi>L<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">(<\/mo><mo>\u2026<\/mo><msub><mi>f<\/mi><mn>1<\/mn><\/msub><mo stretchy=\"false\">(<\/mo><mi>x<\/mi><mo separator=\"true\">,<\/mo><msub><mi>W<\/mi><mn>1<\/mn><\/msub><mo separator=\"true\">,<\/mo><msub><mi>b<\/mi><mn>1<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u2026<\/mo><mtext>\u2009<\/mtext><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">To compute \u2202L\/\u2202W\u2081 (the gradient for the&nbsp;<strong>first layer&#8217;s weights<\/strong>), we must propagate the error backward through all subsequent layers. 
The&nbsp;<strong>chain rule<\/strong>&nbsp;from calculus makes this efficient.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.2 The Chain Rule Refresher<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For composite functions h(x) = f(g(x)):<math display=\"block\"><semantics><mrow><msup><mi>h<\/mi><mo lspace=\"0em\" rspace=\"0em\">\u2032<\/mo><\/msup><mo stretchy=\"false\">(<\/mo><mi>x<\/mi><mo stretchy=\"false\">)<\/mo><mo>=<\/mo><msup><mi>f<\/mi><mo lspace=\"0em\" rspace=\"0em\">\u2032<\/mo><\/msup><mo stretchy=\"false\">(<\/mo><mi>g<\/mi><mo stretchy=\"false\">(<\/mo><mi>x<\/mi><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">)<\/mo><mo>\u22c5<\/mo><msup><mi>g<\/mi><mo lspace=\"0em\" rspace=\"0em\">\u2032<\/mo><\/msup><mo stretchy=\"false\">(<\/mo><mi>x<\/mi><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For multivariate functions, the&nbsp;<strong>chain rule generalizes<\/strong>:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If z = f(y) and y = g(x), then:<math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>z<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>x<\/mi><\/mrow><\/mfrac><mo>=<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>z<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>y<\/mi><\/mrow><\/mfrac><mo>\u22c5<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>y<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>x<\/mi><\/mrow><\/mfrac><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Backpropagation applies this rule repeatedly, from the output layer back to the input layer.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.3 Forward Pass: 
Computing Predictions<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Consider a simple 3-layer network:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Input:\u00a0<strong>x<\/strong>\u00a0(size d\u2080)<\/li>\n\n\n\n<li>Hidden Layer 1:\u00a0<strong>z\u2081 = W\u2081\u00b7x + b\u2081<\/strong>,\u00a0<strong>a\u2081 = \u03c3(z\u2081)<\/strong>\u00a0(activation, e.g., ReLU)<\/li>\n\n\n\n<li>Hidden Layer 2:\u00a0<strong>z\u2082 = W\u2082\u00b7a\u2081 + b\u2082<\/strong>,\u00a0<strong>a\u2082 = \u03c3(z\u2082)<\/strong><\/li>\n\n\n\n<li>Output Layer:\u00a0<strong>z\u2083 = W\u2083\u00b7a\u2082 + b\u2083<\/strong>,\u00a0<strong>\u0177 = \u03c3(z\u2083)<\/strong>\u00a0(sigmoid for binary classification)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Where \u03c3 is a non-linear activation function.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">We store&nbsp;<strong>all intermediate values<\/strong>&nbsp;(a\u2081, z\u2081, a\u2082, z\u2082, \u0177) because the backward pass needs them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.4 Backward Pass: Propagating Errors<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let&nbsp;<strong>L<\/strong>&nbsp;be the loss function (e.g., Binary Cross-Entropy). 
We compute:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 1: Gradient at output (\u0177)<\/strong><math display=\"block\"><semantics><mrow><msub><mi>\u03b4<\/mi><mn>3<\/mn><\/msub><mo>=<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>z<\/mi><mn>3<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mover accent=\"true\"><mi>y<\/mi><mo>^<\/mo><\/mover><\/mrow><\/mfrac><mo>\u22c5<\/mo><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mover accent=\"true\"><mi>y<\/mi><mo>^<\/mo><\/mover><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>z<\/mi><mn>3<\/mn><\/msub><\/mrow><\/mfrac><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">For sigmoid + binary cross-entropy, this simplifies beautifully:<math display=\"block\"><semantics><mrow><msub><mi>\u03b4<\/mi><mn>3<\/mn><\/msub><mo>=<\/mo><mover accent=\"true\"><mi>y<\/mi><mo>^<\/mo><\/mover><mo>\u2212<\/mo><mi>y<\/mi><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 2: Gradients for layer 3 parameters<\/strong><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>W<\/mi><mn>3<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>3<\/mn><\/msub><mo>\u22c5<\/mo><msubsup><mi>a<\/mi><mn>2<\/mn><mi>T<\/mi><\/msubsup><\/mrow><\/semantics><\/math><math 
display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>b<\/mi><mn>3<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>3<\/mn><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 3: Propagate to previous layer<\/strong><math display=\"block\"><semantics><mrow><msub><mi>\u03b4<\/mi><mn>2<\/mn><\/msub><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><msubsup><mi>W<\/mi><mn>3<\/mn><mi>T<\/mi><\/msubsup><mo>\u22c5<\/mo><msub><mi>\u03b4<\/mi><mn>3<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u2299<\/mo><msup><mi>\u03c3<\/mi><mo lspace=\"0em\" rspace=\"0em\">\u2032<\/mo><\/msup><mo stretchy=\"false\">(<\/mo><msub><mi>z<\/mi><mn>2<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where \u2299 is element-wise multiplication.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 4: Gradients for layer 2 parameters<\/strong><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>W<\/mi><mn>2<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>2<\/mn><\/msub><mo>\u22c5<\/mo><msubsup><mi>a<\/mi><mn>1<\/mn><mi>T<\/mi><\/msubsup><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi 
mathvariant=\"normal\">\u2202<\/mi><msub><mi>b<\/mi><mn>2<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>2<\/mn><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 5: Propagate to layer 1<\/strong><math display=\"block\"><semantics><mrow><msub><mi>\u03b4<\/mi><mn>1<\/mn><\/msub><mo>=<\/mo><mo stretchy=\"false\">(<\/mo><msubsup><mi>W<\/mi><mn>2<\/mn><mi>T<\/mi><\/msubsup><mo>\u22c5<\/mo><msub><mi>\u03b4<\/mi><mn>2<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo>\u2299<\/mo><msup><mi>\u03c3<\/mi><mo lspace=\"0em\" rspace=\"0em\">\u2032<\/mo><\/msup><mo stretchy=\"false\">(<\/mo><msub><mi>z<\/mi><mn>1<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Step 6: Gradients for layer 1 parameters<\/strong><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>W<\/mi><mn>1<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>1<\/mn><\/msub><mo>\u22c5<\/mo><msup><mi>x<\/mi><mi>T<\/mi><\/msup><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>b<\/mi><mn>1<\/mn><\/msub><\/mrow><\/mfrac><mo>=<\/mo><msub><mi>\u03b4<\/mi><mn>1<\/mn><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">4.5 The General Backpropagation Algorithm<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For an L-layer 
network:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Forward pass:<\/strong>\u00a0Compute and store all pre-activations z\u2081, \u2026, z_L and activations a\u2081, a\u2082, \u2026, a_L (with a\u2080 = x)<\/li>\n\n\n\n<li><strong>Compute \u03b4_L<\/strong>\u00a0at output layer (depends on loss function)<\/li>\n\n\n\n<li><strong>For k = L down to 1:<\/strong>\n<ul class=\"wp-block-list\">\n<li>Store gradients: \u2202L\/\u2202W_k = \u03b4_k \u00b7 a_{k-1}^T, \u2202L\/\u2202b_k = \u03b4_k<\/li>\n\n\n\n<li>If k &gt; 1, compute \u03b4_{k-1} = (W_k^T \u00b7 \u03b4_k) \u2299 \u03c3'(z_{k-1})<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Update parameters<\/strong>\u00a0using gradient descent (or an optimizer)<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Time complexity:<\/strong>&nbsp;O(n\u00b7d\u00b2) per training example for a dense network with n = L layers of width d: linear in depth, quadratic in layer width. This is&nbsp;<strong>efficient<\/strong>&nbsp;compared to naive numerical differentiation, which needs at least one extra forward pass per parameter. With about n\u00b7d\u00b2 parameters and O(n\u00b7d\u00b2) work per forward pass, that comes to O(n\u00b2\u00b7d\u2074) operations.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">5. 
Numerical Example: Backpropagation by Hand<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Let&#8217;s work through a tiny network&nbsp;<strong>manually<\/strong>&nbsp;to see backpropagation in action.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Network:<\/strong>&nbsp;Input (2 units) \u2192 Hidden (2 units, sigmoid) \u2192 Output (1 unit, sigmoid)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Weights (initialized randomly):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>W\u2081 = [[0.15, 0.20], [0.25, 0.30]] (2\u00d72)<\/li>\n\n\n\n<li>b\u2081 = [0.35, 0.35]<\/li>\n\n\n\n<li>W\u2082 = [[0.40], [0.45]] (2\u00d71)<\/li>\n\n\n\n<li>b\u2082 = [0.60]<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Input:<\/strong>&nbsp;x = [0.05, 0.10]<br><strong>Target:<\/strong>&nbsp;y = 0.01<br><strong>Learning rate:<\/strong>&nbsp;\u03b7 = 0.5<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Forward Pass:<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Hidden layer z\u2081:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>z\u2081\u2081 = 0.15\u00d70.05 + 0.25\u00d70.10 + 0.35 = 0.0075 + 0.025 + 0.35 = 0.3825<\/li>\n\n\n\n<li>z\u2081\u2082 = 0.20\u00d70.05 + 0.30\u00d70.10 + 0.35 = 0.01 + 0.03 + 0.35 = 0.39<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Hidden activation a\u2081 (sigmoid):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>a\u2081\u2081 = \u03c3(0.3825) = 1\/(1+e\u207b\u2070\u00b7\u00b3\u2078\u00b2\u2075) = 0.5945<\/li>\n\n\n\n<li>a\u2081\u2082 = \u03c3(0.39) = 0.5963<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Output layer z\u2082:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>z\u2082 = 0.40\u00d70.5945 + 0.45\u00d70.5963 + 0.60 = 0.2378 + 0.2683 + 0.60 = 1.1061<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Output \u0177 (sigmoid):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u0177 = \u03c3(1.1061) = 0.7514<\/li>\n<\/ul>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Loss (Mean Squared Error):<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>L = \u00bd(0.7514 &#8211; 0.01)\u00b2 = \u00bd(0.7414)\u00b2 = \u00bd(0.5497) = 0.2748<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Backward Pass:<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Output \u03b4\u2082:<\/strong><br>For MSE with a sigmoid output, the chain rule gives \u03b4\u2082 = (\u0177 &#8211; y) \u00d7 \u0177 \u00d7 (1 &#8211; \u0177) = 0.7414 \u00d7 0.7514 \u00d7 0.2486 = 0.1385. (The shortcut \u03b4 = \u0177 &#8211; y holds only for sigmoid combined with binary cross-entropy.)<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>W\u2082 gradients:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u2202L\/\u2202W\u2082\u2081 = \u03b4\u2082 \u00d7 a\u2081\u2081 = 0.1385 \u00d7 0.5945 = 0.0823<\/li>\n\n\n\n<li>\u2202L\/\u2202W\u2082\u2082 = \u03b4\u2082 \u00d7 a\u2081\u2082 = 0.1385 \u00d7 0.5963 = 0.0826<\/li>\n\n\n\n<li>\u2202L\/\u2202b\u2082 = \u03b4\u2082 = 0.1385<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Propagate to hidden layer:<\/strong><br>\u03b4\u2081 = (W\u2082 \u00d7 \u03b4\u2082) \u2299 \u03c3'(z\u2081), where W\u2082 is stored here as a 2\u00d71 column, so no transpose is needed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">First, W\u2082 \u00d7 \u03b4\u2082 = [0.40, 0.45]\u1d40 \u00d7 0.1385 = [0.0554, 0.0623]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Sigmoid derivative: \u03c3'(z) = \u03c3(z)\u00b7(1-\u03c3(z))<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u03c3'(z\u2081\u2081) = 0.5945 \u00d7 (1-0.5945) = 0.5945 \u00d7 0.4055 = 0.2411<\/li>\n\n\n\n<li>\u03c3'(z\u2081\u2082) = 0.5963 \u00d7 0.4037 = 0.2407<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">\u03b4\u2081 = [0.0554\u00d70.2411, 0.0623\u00d70.2407] = [0.0134, 0.0150]<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>W\u2081 gradients:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u2202L\/\u2202W\u2081\u2081\u2081 = \u03b4\u2081\u2081 \u00d7 x\u2081 = 0.0134 \u00d7 0.05 = 0.00067<\/li>\n\n\n\n<li>\u2202L\/\u2202W\u2081\u2081\u2082 = \u03b4\u2081\u2081 \u00d7 x\u2082 = 0.0134 \u00d7 0.10 = 0.00134<\/li>\n\n\n\n<li>\u2202L\/\u2202W\u2081\u2082\u2081 
= \u03b4\u2081\u2082 \u00d7 x\u2081 = 0.0150 \u00d7 0.05 = 0.00075<\/li>\n\n\n\n<li>\u2202L\/\u2202W\u2081\u2082\u2082 = \u03b4\u2081\u2082 \u00d7 x\u2082 = 0.0150 \u00d7 0.10 = 0.0015<\/li>\n\n\n\n<li>\u2202L\/\u2202b\u2081\u2081 = \u03b4\u2081\u2081 = 0.0134<\/li>\n\n\n\n<li>\u2202L\/\u2202b\u2081\u2082 = \u03b4\u2081\u2082 = 0.0150<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Update Weights (\u03b7 = 0.5):<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>W\u2082 new:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>W\u2082\u2081 = 0.40 &#8211; 0.5\u00d70.0823 = 0.35885<\/li>\n\n\n\n<li>W\u2082\u2082 = 0.45 &#8211; 0.5\u00d70.0826 = 0.4087<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>W\u2081 new:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>W\u2081\u2081\u2081 = 0.15 &#8211; 0.5\u00d70.00067 = 0.149665<\/li>\n\n\n\n<li>W\u2081\u2081\u2082 = 0.25 &#8211; 0.5\u00d70.00134 = 0.24933<\/li>\n\n\n\n<li>W\u2081\u2082\u2081 = 0.20 &#8211; 0.5\u00d70.00075 = 0.199625<\/li>\n\n\n\n<li>W\u2081\u2082\u2082 = 0.30 &#8211; 0.5\u00d70.0015 = 0.29925<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The biases are updated the same way (for example, b\u2082 = 0.60 &#8211; 0.5\u00d70.1385 = 0.53075). Rerunning the forward pass with the updated parameters gives a loss below the initial 0.2748, and repeated iterations continue to reduce it.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">6. Vanishing &amp; Exploding Gradients<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">6.1 Vanishing Gradients<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In deep networks (e.g., 10+ layers), gradients can become&nbsp;<strong>exponentially small<\/strong>&nbsp;as they propagate backward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Cause:<\/strong>&nbsp;Saturating activation functions have small derivatives: the sigmoid derivative is at most 0.25, and the tanh derivative is at most 1 and falls toward 0 as the input moves away from zero. 
Multiplying many such small numbers makes gradients vanish.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Symptoms:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Early layers learn very slowly (or not at all)<\/li>\n\n\n\n<li>Model performance plateaus prematurely<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Solutions:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Use\u00a0<strong>ReLU<\/strong>\u00a0activation (gradient = 1 for positive inputs)<\/li>\n\n\n\n<li><strong>Batch Normalization<\/strong><\/li>\n\n\n\n<li><strong>Residual connections<\/strong>\u00a0(ResNet)<\/li>\n\n\n\n<li>Careful weight initialization (He\/Xavier)<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">6.2 Exploding Gradients<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Conversely, gradients can become&nbsp;<strong>exponentially large<\/strong>, causing numerical overflow and unstable training.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Cause:<\/strong>&nbsp;Large weights, repeated multiplication, or certain activation functions.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Symptoms:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Loss becomes NaN or Inf<\/li>\n\n\n\n<li>Extreme weight values<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Solutions:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Gradient clipping<\/strong>\u00a0(cap gradients at a threshold, e.g., 1.0 or 5.0)<\/li>\n\n\n\n<li>Lower learning rate<\/li>\n\n\n\n<li>Weight regularization (L1\/L2)<\/li>\n\n\n\n<li>Proper initialization<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">7. Advanced Optimizers: Beyond Vanilla SGD<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Plain SGD is slow and sensitive to learning rate choices. 
Modern optimizers add momentum, adapt the step size for each parameter, or do both.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7.1 Momentum<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong>&nbsp;Accumulate a velocity vector to smooth updates and accelerate convergence.<math display=\"block\"><semantics><mrow><msub><mi>v<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><mi>\u03b2<\/mi><msub><mi>v<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mi>\u03b7<\/mi><mi mathvariant=\"normal\">\u2207<\/mi><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mi>\u03b8<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><msub><mi>v<\/mi><mi>t<\/mi><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where \u03b2 (typically 0.9) is the momentum coefficient.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Benefit:<\/strong>&nbsp;Dampens oscillations, accelerates through shallow ravines.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7.2 Nesterov Accelerated Gradient (NAG)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong>&nbsp;Look ahead before computing the gradient.<math display=\"block\"><semantics><mrow><msub><mi>v<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><mi>\u03b2<\/mi><msub><mi>v<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mi>\u03b7<\/mi><mi mathvariant=\"normal\">\u2207<\/mi><mi>L<\/mi><mo 
stretchy=\"false\">(<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><mi>\u03b2<\/mi><msub><mi>v<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo stretchy=\"false\">)<\/mo><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Benefit:<\/strong>&nbsp;More responsive correction, often faster than standard momentum.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7.3 AdaGrad<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong>&nbsp;Adapt learning rates per parameter: larger updates for infrequent features.<math display=\"block\"><semantics><mrow><msub><mi>G<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>G<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mi>\u03b8<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><mfrac><mi>\u03b7<\/mi><msqrt><mrow><msub><mi>G<\/mi><mi>t<\/mi><\/msub><mo>+<\/mo><mi>\u03f5<\/mi><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Problem:<\/strong>&nbsp;Because G<sub>t<\/sub> only grows, the effective learning rate shrinks monotonically toward zero, so training can stall before it converges.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7.4 RMSprop<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Idea:<\/strong>&nbsp;Replace AdaGrad's ever-growing sum with an exponential moving average of squared gradients, so the effective learning rate no longer decays to zero.<math display=\"block\"><semantics><mrow><mi>E<\/mi><mo stretchy=\"false\">[<\/mo><msup><mi>g<\/mi><mn>2<\/mn><\/msup><msub><mo stretchy=\"false\">]<\/mo><mi>t<\/mi><\/msub><mo>=<\/mo><mi>\u03b2<\/mi><mi>E<\/mi><mo stretchy=\"false\">[<\/mo><msup><mi>g<\/mi><mn>2<\/mn><\/msup><msub><mo stretchy=\"false\">]<\/mo><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><mi>\u03b2<\/mi><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mi>\u03b8<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><mfrac><mi>\u03b7<\/mi><msqrt><mrow><mi>E<\/mi><mo stretchy=\"false\">[<\/mo><msup><mi>g<\/mi><mn>2<\/mn><\/msup><msub><mo stretchy=\"false\">]<\/mo><mi>t<\/mi><\/msub><mo>+<\/mo><mi>\u03f5<\/mi><\/mrow><\/msqrt><\/mfrac><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">7.5 Adam (Adaptive Moment Estimation):&nbsp;<strong>The Default Choice<\/strong><\/h3>\n\n\n\n<p 
class=\"wp-block-paragraph\"><strong>Idea:<\/strong>&nbsp;Combine momentum (first moment) and RMSprop (second moment) with bias correction.<math display=\"block\"><semantics><mrow><msub><mi>m<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b2<\/mi><mn>1<\/mn><\/msub><msub><mi>m<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mn>1<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mi>v<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b2<\/mi><mn>2<\/mn><\/msub><msub><mi>v<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>+<\/mo><mo stretchy=\"false\">(<\/mo><mn>1<\/mn><mo>\u2212<\/mo><msub><mi>\u03b2<\/mi><mn>2<\/mn><\/msub><mo stretchy=\"false\">)<\/mo><mo stretchy=\"false\">(<\/mo><mi mathvariant=\"normal\">\u2207<\/mi><msub><mi>L<\/mi><mi>t<\/mi><\/msub><msup><mo stretchy=\"false\">)<\/mo><mn>2<\/mn><\/msup><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mover accent=\"true\"><mi>m<\/mi><mo>^<\/mo><\/mover><mi>t<\/mi><\/msub><mo>=<\/mo><mfrac><msub><mi>m<\/mi><mi>t<\/mi><\/msub><mrow><mn>1<\/mn><mo>\u2212<\/mo><msubsup><mi>\u03b2<\/mi><mn>1<\/mn><mi>t<\/mi><\/msubsup><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"1em\"><\/mspace><msub><mover 
accent=\"true\"><mi>v<\/mi><mo>^<\/mo><\/mover><mi>t<\/mi><\/msub><mo>=<\/mo><mfrac><msub><mi>v<\/mi><mi>t<\/mi><\/msub><mrow><mn>1<\/mn><mo>\u2212<\/mo><msubsup><mi>\u03b2<\/mi><mn>2<\/mn><mi>t<\/mi><\/msubsup><\/mrow><\/mfrac><\/mrow><\/semantics><\/math><math display=\"block\"><semantics><mrow><msub><mi>\u03b8<\/mi><mi>t<\/mi><\/msub><mo>=<\/mo><msub><mi>\u03b8<\/mi><mrow><mi>t<\/mi><mo>\u2212<\/mo><mn>1<\/mn><\/mrow><\/msub><mo>\u2212<\/mo><mi>\u03b7<\/mi><mfrac><msub><mover accent=\"true\"><mi>m<\/mi><mo>^<\/mo><\/mover><mi>t<\/mi><\/msub><mrow><msqrt><msub><mover accent=\"true\"><mi>v<\/mi><mo>^<\/mo><\/mover><mi>t<\/mi><\/msub><\/msqrt><mo>+<\/mo><mi>\u03f5<\/mi><\/mrow><\/mfrac><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here t counts parameter updates (optimizer steps), starting at 1, not epochs.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Default hyperparameters:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\u03b7 = 0.001<\/li>\n\n\n\n<li>\u03b2\u2081 = 0.9<\/li>\n\n\n\n<li>\u03b2\u2082 = 0.999<\/li>\n\n\n\n<li>\u03b5 = 1e-8<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Why Adam is so popular:<\/strong><\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Works well across many problems without extensive tuning<\/li>\n\n\n\n<li>Handles sparse gradients<\/li>\n\n\n\n<li>Adapts the effective step size of each parameter automatically<\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">8. 
Python Implementation from Scratch<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\">import numpy as np\n\nclass NeuralNetwork:\n    def __init__(self, layer_sizes, activation='relu', loss='mse'):\n        \"\"\"\n        layer_sizes: list of integers, e.g., [2, 4, 3, 1]\n        activation: hidden-layer activation ('relu', 'sigmoid', 'tanh'; anything else is linear)\n        loss: 'mse' (linear output) or 'binary_cross_entropy' (sigmoid output)\n        \"\"\"\n        self.layer_sizes = layer_sizes\n        self.num_layers = len(layer_sizes)\n        self.activation = activation\n        self.loss = loss\n        \n        # Initialize weights and biases (He initialization)\n        self.weights = []\n        self.biases = []\n        for i in range(self.num_layers - 1):\n            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 \/ layer_sizes[i])\n            b = np.zeros((1, layer_sizes[i+1]))\n            self.weights.append(w)\n            self.biases.append(b)\n        \n        # Storage for forward pass\n        self.caches = []\n    \n    def _activation(self, z):\n        if self.activation == 'relu':\n            return np.maximum(0, z)\n        elif self.activation == 'sigmoid':\n            return 1 \/ (1 + np.exp(-np.clip(z, -500, 500)))\n        elif self.activation == 'tanh':\n            return np.tanh(z)\n        else:\n            return z  # linear\n    \n    def _activation_derivative(self, z):\n        if self.activation == 'relu':\n            return (z &gt; 0).astype(float)\n        elif self.activation == 'sigmoid':\n            sig = 1 \/ (1 + np.exp(-np.clip(z, -500, 500)))\n            return sig * (1 - sig)\n        elif self.activation == 'tanh':\n            return 1 - np.tanh(z) ** 2\n        else:\n            return np.ones_like(z)\n    \n    def forward(self, X):\n        \"\"\"Forward pass with caching.\"\"\"\n        self.caches = []\n        current = X\n        \n        for i in range(self.num_layers - 2):\n            z = np.dot(current, self.weights[i]) + self.biases[i]\n            a = self._activation(z)\n            self.caches.append((current, z, a))\n            current = a\n        \n        # Output layer: sigmoid for BCE, linear for MSE.\n        # This choice is independent of the hidden-layer activation.\n        z_final = np.dot(current, self.weights[-1]) + self.biases[-1]\n        if self.loss == 'binary_cross_entropy':\n            a_final = 1 \/ (1 + np.exp(-np.clip(z_final, -500, 500)))  # sigmoid for BCE\n        else:\n            a_final = z_final  # linear for MSE\n        \n        self.caches.append((current, z_final, a_final))\n        return a_final\n    \n    def compute_loss(self, y_true, y_pred):\n        if self.loss == 'mse':\n            return np.mean((y_true - y_pred) ** 2)\n        elif self.loss == 'binary_cross_entropy':\n            y_pred = np.clip(y_pred, 1e-12, 1 - 1e-12)\n            return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))\n        else:\n            raise ValueError(f\"Unknown loss: {self.loss}\")\n    \n    def backward(self, X, y_true, y_pred):\n        \"\"\"Backpropagation using stored caches.\"\"\"\n        m = X.shape[0]\n        gradients_w = [None] * (self.num_layers - 1)\n        gradients_b = [None] * (self.num_layers - 1)\n        \n        # Output layer delta (dL\/dz at the output)\n        if self.loss == 'mse':\n            # Linear output: the activation derivative is 1\n            delta = 2 * (y_pred - y_true) \/ m\n        elif self.loss == 'binary_cross_entropy':\n            # For sigmoid + BCE, delta simplifies to y_pred - y_true\n            delta = (y_pred - y_true) \/ m\n        else:\n            raise ValueError(f\"Unknown loss: {self.loss}\")\n        \n        # Walk backward from the output layer to the first layer\n        for i in reversed(range(self.num_layers - 1)):\n            input_prev, z, a = self.caches[i]\n            \n            # The output-layer delta is already computed above;\n            # hidden layers pull delta back through the next layer's weights\n            if i != self.num_layers - 2:\n                delta = np.dot(delta, 
self.weights[i+1].T) * self._activation_derivative(z)\n            \n            # Compute gradients\n            gradients_w[i] = np.dot(input_prev.T, delta)\n            gradients_b[i] = np.sum(delta, axis=0, keepdims=True)\n            \n            # Update delta for next iteration\n            dA = delta\n        \n        return gradients_w, gradients_b\n    \n    def update_parameters(self, gradients_w, gradients_b, learning_rate=0.01, optimizer='sgd', \n                          momentum=0.9, beta1=0.9, beta2=0.999, epsilon=1e-8, t=1):\n        \"\"\"Update weights using various optimizers.\"\"\"\n        if not hasattr(self, 'v_w'):\n            self.v_w = [np.zeros_like(w) for w in self.weights]\n            self.v_b = [np.zeros_like(b) for b in self.biases]\n            self.s_w = [np.zeros_like(w) for w in self.weights]\n            self.s_b = [np.zeros_like(b) for b in self.biases]\n        \n        if optimizer == 'sgd':\n            for i in range(len(self.weights)):\n                self.weights[i] -= learning_rate * gradients_w[i]\n                self.biases[i] -= learning_rate * gradients_b[i]\n        \n        elif optimizer == 'momentum':\n            for i in range(len(self.weights)):\n                self.v_w[i] = momentum * self.v_w[i] - learning_rate * gradients_w[i]\n                self.v_b[i] = momentum * self.v_b[i] - learning_rate * gradients_b[i]\n                self.weights[i] += self.v_w[i]\n                self.biases[i] += self.v_b[i]\n        \n        elif optimizer == 'adam':\n            for i in range(len(self.weights)):\n                self.v_w[i] = beta1 * self.v_w[i] + (1 - beta1) * gradients_w[i]\n                self.v_b[i] = beta1 * self.v_b[i] + (1 - beta1) * gradients_b[i]\n                self.s_w[i] = beta2 * self.s_w[i] + (1 - beta2) * (gradients_w[i] ** 2)\n                self.s_b[i] = beta2 * self.s_b[i] + (1 - beta2) * (gradients_b[i] ** 2)\n                \n                v_w_corrected = self.v_w[i] \/ 
(1 - beta1 ** t)\n                v_b_corrected = self.v_b[i] \/ (1 - beta1 ** t)\n                s_w_corrected = self.s_w[i] \/ (1 - beta2 ** t)\n                s_b_corrected = self.s_b[i] \/ (1 - beta2 ** t)\n                \n                self.weights[i] -= learning_rate * v_w_corrected \/ (np.sqrt(s_w_corrected) + epsilon)\n                self.biases[i] -= learning_rate * v_b_corrected \/ (np.sqrt(s_b_corrected) + epsilon)\n    \n    def train(self, X, y, epochs=100, batch_size=32, learning_rate=0.01,\n              optimizer='adam', verbose=True):\n        \"\"\"Full training loop.\"\"\"\n        history = {'loss': []}\n        n_samples = X.shape[0]\n        step = 0  # Number of parameter updates so far (Adam's t counts updates, not epochs)\n        \n        for epoch in range(epochs):\n            # Shuffle data\n            indices = np.random.permutation(n_samples)\n            X_shuffled = X[indices]\n            y_shuffled = y[indices]\n            \n            epoch_loss = 0\n            num_batches = 0\n            \n            for start in range(0, n_samples, batch_size):\n                end = min(start + batch_size, n_samples)\n                X_batch = X_shuffled[start:end]\n                y_batch = y_shuffled[start:end]\n                \n                # Forward pass\n                y_pred = self.forward(X_batch)\n                \n                # Compute loss\n                batch_loss = self.compute_loss(y_batch, y_pred)\n                epoch_loss += batch_loss\n                num_batches += 1\n                \n                # Backward pass\n                grads_w, grads_b = self.backward(X_batch, y_batch, y_pred)\n                \n                # Update parameters\n                step += 1\n                self.update_parameters(grads_w, grads_b, learning_rate, optimizer, t=step)\n            \n            avg_loss = epoch_loss \/ num_batches\n            history['loss'].append(avg_loss)\n            \n            if verbose and (epoch % 10 == 0):\n                print(f\"Epoch {epoch:3d} | Loss: {avg_loss:.6f}\")\n        \n        return history\n\n# Example usage\nif __name__ == \"__main__\":\n    # Generate synthetic data: y = x1 XOR x2\n    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)\n    y = np.array([[0], [1], [1], [0]], dtype=float)\n    \n    # Create network: 2 inputs -&gt; 4 hidden -&gt; 1 output\n    nn = NeuralNetwork([2, 4, 1], activation='relu', loss='binary_cross_entropy')\n    \n    # Train. The rate is above Adam's usual 0.001 default because the problem is tiny.\n    # A small ReLU network on XOR is sensitive to initialization: if the loss stalls,\n    # rerun or widen the hidden layer.\n    history = nn.train(X, y, epochs=200, batch_size=4, learning_rate=0.05, optimizer='adam')\n    \n    # Test\n    predictions = nn.forward(X)\n    print(\"\\nFinal predictions:\")\n    for i, pred in enumerate(predictions.flatten()):\n        print(f\"  {X[i]} -&gt; {pred:.4f} (true: {y[i][0]})\")<\/pre>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">9. Gradient Checking: Debugging Your Backprop<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Implementing backpropagation is error-prone.&nbsp;<strong>Gradient checking<\/strong>&nbsp;validates your implementation by comparing the analytical gradients from backpropagation against a numerical approximation.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Numerical gradient formula (central difference):<\/strong><math display=\"block\"><semantics><mrow><mfrac><mrow><mi mathvariant=\"normal\">\u2202<\/mi><mi>L<\/mi><\/mrow><mrow><mi mathvariant=\"normal\">\u2202<\/mi><msub><mi>w<\/mi><mi>i<\/mi><\/msub><\/mrow><\/mfrac><mo>\u2248<\/mo><mfrac><mrow><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>w<\/mi><mi>i<\/mi><\/msub><mo>+<\/mo><mi>\u03f5<\/mi><mo stretchy=\"false\">)<\/mo><mo>\u2212<\/mo><mi>L<\/mi><mo stretchy=\"false\">(<\/mo><msub><mi>w<\/mi><mi>i<\/mi><\/msub><mo>\u2212<\/mo><mi>\u03f5<\/mi><mo 
stretchy=\"false\">)<\/mo><\/mrow><mrow><mn>2<\/mn><mi>\u03f5<\/mi><\/mrow><\/mfrac><\/mrow><\/semantics><\/math><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Where \u03b5 is a small number (e.g., 1e-7).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import numpy as np\n\ndef gradient_check(model, X, y, epsilon=1e-7, tolerance=1e-5):\n    \"\"\"Compare analytical gradients to numerical gradients.\"\"\"\n    y_pred = model.forward(X)\n    \n    # Analytical gradients, flattened in the order W0, b0, W1, b1, ...\n    grads_w, grads_b = model.backward(X, y, y_pred)\n    analytical_grads = np.concatenate(\n        [g.ravel() for gw, gb in zip(grads_w, grads_b) for g in (gw, gb)]\n    )\n    \n    # Numerical gradients: perturb each parameter in place, in the same order\n    numerical_grads = []\n    for w, b in zip(model.weights, model.biases):\n        for param in (w, b):\n            for idx in np.ndindex(param.shape):\n                original = param[idx]\n                \n                # Loss with the parameter nudged up by epsilon\n                param[idx] = original + epsilon\n                loss_plus = model.compute_loss(y, model.forward(X))\n                \n                # Loss with the parameter nudged down by epsilon\n                param[idx] = original - epsilon\n                loss_minus = model.compute_loss(y, model.forward(X))\n                \n                # Restore the parameter, then apply the central difference\n                param[idx] = original\n                numerical_grads.append((loss_plus - loss_minus) \/ (2 * epsilon))\n    \n    # Compare\n    for i, (analytical, numerical) in enumerate(zip(analytical_grads, numerical_grads)):\n        if abs(analytical - numerical) &gt; tolerance:\n            print(f\"Gradient mismatch at index {i}: analytical={analytical:.6f}, numerical={numerical:.6f}\")\n            return False\n    \n    print(\"Gradient check passed!\")\n    return True<\/pre>\n\n\n\n<hr class=\"wp-block-separator 
has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">10. Common Pitfalls &amp; Best Practices<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 1: Wrong learning rate<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;A rate that is too high makes the loss oscillate or diverge; one that is too low makes training crawl. Use a learning rate scheduler or an adaptive optimizer such as Adam. Reasonable starting points are 0.001 for Adam and 0.01 for SGD.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 2: Vanishing gradients with sigmoid\/tanh<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;Use ReLU for hidden layers. Add batch normalization.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 3: Not shuffling data<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;Shuffle the training data before each epoch so the network does not pick up spurious patterns from the order of the samples.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 4: Exploding gradients<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;Apply gradient clipping, for example by value:&nbsp;<code>grad = np.clip(grad, -1.0, 1.0)<\/code><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 5: Overfitting (low training loss, high validation loss)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;Add regularization (L1\/L2), dropout, or early stopping, or collect more training data.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Pitfall 6: Dead ReLU neurons<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Fix:<\/strong>&nbsp;Use Leaky ReLU or ELU, or lower the learning rate.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcfa YouTube Video Tutorials (Free &amp; High-Quality)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Video lectures are often the best starting point because they combine visual explanations with mathematical derivations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udf93 University Courses (Most Comprehensive)<\/h3>\n\n\n\n<figure 
class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Lecture<\/th><th class=\"has-text-align-left\" data-align=\"left\">Source<\/th><th class=\"has-text-align-left\" data-align=\"left\">Key Topics Covered<\/th><th class=\"has-text-align-left\" data-align=\"left\">Duration<\/th><\/tr><\/thead><tbody><tr><td><strong>&#8220;Gradient Descent and Backpropagation&#8221;<\/strong><\/td><td>FAU Machine Learning for Physicists S24<\/td><td>Core concepts, neural network training fundamentals, mathematical derivations<a href=\"https:\/\/www.fau.tv\/series\/machine-learning-for-physicists-s24\/2-machine-learning-for-physicists-s24\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><td>~90 min<\/td><\/tr><tr><td><strong>CS230 Section 2<\/strong><\/td><td>Stanford Deep Learning<\/td><td>Step-by-step backpropagation for 4 different network architectures (univariate regression to 2-layer nonlinear networks)<a href=\"https:\/\/cs230.stanford.edu\/section\/2\/\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><td>~60 min<\/td><\/tr><tr><td><strong>CS 4740 Lecture 11<\/strong><\/td><td>Cornell University<\/td><td>Computation graphs, chain rule visualization, forward\/backward differentiation<a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs4740\/2025sp\/lectures\/Lec11.pdf#1#1\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><td>~50 min<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to use these:<\/strong>&nbsp;Watch the Stanford or Cornell lectures first for academic rigor, then the FAU lecture for physics-oriented intuition. 
The Stanford CS230 notes are particularly valuable because they show explicit gradient derivations for multiple network types<a href=\"https:\/\/cs230.stanford.edu\/section\/2\/\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udc68\u200d\ud83c\udfeb Expert Individual Tutorials<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Class Central lists over 200 backpropagation courses<a href=\"https:\/\/www.classcentral.com\/subject\/backpropagation\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>. The most recommended for beginners:<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Creator<\/th><th class=\"has-text-align-left\" data-align=\"left\">Focus<\/th><th class=\"has-text-align-left\" data-align=\"left\">Best For<\/th><\/tr><\/thead><tbody><tr><td><strong>Andrej Karpathy<\/strong><\/td><td>&#8220;Micrograd&#8221; &#8211; Building backprop from scratch<\/td><td>Deep mathematical intuition, hands-on coding<\/td><\/tr><tr><td><strong>3Blue1Brown<\/strong><\/td><td>&#8220;Neural Networks&#8221; series (chapters 3-4)<\/td><td>Visual geometric intuition for gradients<\/td><\/tr><tr><td><strong>StatQuest with Josh Starmer<\/strong><\/td><td>Backpropagation clearly explained<\/td><td>Accessible explanations without heavy math<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\"><strong>\ud83d\udca1 Pro Tip:<\/strong>&nbsp;Watch 3Blue1Brown first for visual intuition, then implement Karpathy&#8217;s micrograd tutorial (code is on his GitHub) to cement understanding through actual coding.<\/p>\n<\/blockquote>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcda Research Papers &amp; Academic Documents<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">For 
deeper theoretical understanding, these documents provide the mathematical foundations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Foundational Papers<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Title<\/th><th class=\"has-text-align-left\" data-align=\"left\">Authors\/Institution<\/th><th class=\"has-text-align-left\" data-align=\"left\">Year<\/th><th class=\"has-text-align-left\" data-align=\"left\">Key Contribution<\/th><\/tr><\/thead><tbody><tr><td><strong>&#8220;An Investigation of the Gradient Descent Process in Neural Networks&#8221;<\/strong><\/td><td>Barak A. Pearlmutter, CMU<\/td><td>1996<\/td><td>Comprehensive PhD thesis on gradient descent dynamics, convergence properties, and Hessian computation<a href=\"http:\/\/reports-archive.adm.cs.cmu.edu\/anon\/anon\/usr\/ftp\/usr\/ftp\/usr\/anon\/usr0\/ftp\/1996\/abstracts\/96-114.html\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><\/tr><tr><td><strong>&#8220;Limitations of neural network training due to numerical instability of backpropagation&#8221;<\/strong><\/td><td>Karner et al., arXiv<\/td><td>2022-2023<\/td><td>Modern analysis of numerical stability issues in ReLU networks and practical training limitations<a href=\"https:\/\/browse-export.arxiv.org\/abs\/2210.00805\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h3 class=\"wp-block-heading\">Course Notes &amp; Slides<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Resource<\/th><th class=\"has-text-align-left\" data-align=\"left\">Source<\/th><th class=\"has-text-align-left\" data-align=\"left\">Content<\/th><\/tr><\/thead><tbody><tr><td><strong>CS230 Section 2 Slides<\/strong><\/td><td>Stanford<\/td><td>Complete backprop equations for 4 network types + optimization techniques (momentum, 
RMSprop, Adam)<a href=\"https:\/\/cs230.stanford.edu\/section\/2\/\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><\/tr><tr><td><strong>CS 4740 Computation Graphs<\/strong><\/td><td>Cornell<\/td><td>Visual chain rule derivations, node-by-node gradient flow<a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs4740\/2025sp\/lectures\/Lec11.pdf#1#1\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to use these:<\/strong>&nbsp;The Pearlmutter thesis<a href=\"http:\/\/reports-archive.adm.cs.cmu.edu\/anon\/anon\/usr\/ftp\/usr\/ftp\/usr\/anon\/usr0\/ftp\/1996\/abstracts\/96-114.html\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>&nbsp;is dense but excellent for understanding why gradient descent works theoretically. Save it for after you have practical experience. The arXiv paper<a href=\"https:\/\/browse-export.arxiv.org\/abs\/2210.00805\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>&nbsp;discusses real numerical issues you&#8217;ll encounter.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcbb GitHub Repositories (From Scratch Implementations)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The best way to truly understand these algorithms is to implement them. 
These repositories provide clean, educational code.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udfc6 Top Pick: Beginner-Friendly<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Rabia-Akhtr\/Back-propagation-Machine-Learning-Tutorial<\/strong><a href=\"https:\/\/github.com\/Rabia-Akhtr\/Back-propagation-Machine-Learning-Tutorial\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This repository is specifically designed for learning:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>\ud83d\udcc4\u00a0<strong>PDF guide<\/strong>\u00a0explaining theory of backpropagation and gradient descent<\/li>\n\n\n\n<li>\ud83d\udcd3\u00a0<strong>Jupyter notebook<\/strong>\u00a0with working code implementation<\/li>\n\n\n\n<li>\ud83c\udfaf\u00a0<strong>MNIST dataset<\/strong>\u00a0demonstration<\/li>\n\n\n\n<li>\ud83d\udcc9\u00a0<strong>Loss reduction visualization<\/strong><\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Perfect if you want theory + code side-by-side.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\ude80 Comprehensive Learning Hub<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>vxnuaj\/awesome-neural-networks<\/strong><a href=\"https:\/\/github.com\/vxnuaj\/awesome-neural-networks\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is a complete curriculum covering:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Logistic &amp; Softmax regression from scratch<\/li>\n\n\n\n<li>Forward pass, backpropagation, weight updates<\/li>\n\n\n\n<li>All major activation functions (Sigmoid, ReLU, Tanh, Leaky ReLU)<\/li>\n\n\n\n<li>Optimization algorithms: Momentum, RMSprop, Adam, Adamax, Nadam<\/li>\n\n\n\n<li>Regularization: L1, L2, Dropout, Batch Normalization<\/li>\n\n\n\n<li>Learning rate schedules (exponential decay, cyclical rates)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Each concept has dedicated code files with NumPy 
implementations.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcd6 Reference Implementation<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Slavigrad\/NeuralNetworkSimulator<\/strong><a href=\"https:\/\/github.com\/Slavigrad\/NeuralNetworkSimulator\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">While more of a visualization tool (covered below), its implementation architecture is open-source and well-documented for reference.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to use these:<\/strong><\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Start with\u00a0<strong>Rabia-Akhtr<\/strong>\u00a0repository to get a working backprop implementation quickly<\/li>\n\n\n\n<li>Move to\u00a0<strong>vxnuaj<\/strong>\u00a0repository for deep dives into each component (activation functions, optimizers, regularization)<\/li>\n\n\n\n<li>Use both as references when implementing your own networks<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udfae Interactive Visualizations &amp; Simulators<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Interactive tools help build intuition by letting you see gradients flow in real-time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\u2b50 Neural Network Playground (JavaScript)<a href=\"https:\/\/github.com\/Jayanta2004\/dev-card-showcase\/issues\/926\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">A fully functional neural network built in vanilla JS with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Real-time training visualization<\/strong>\u00a0&#8211; watch gradient descent update the decision boundary 60 times\/second<\/li>\n\n\n\n<li><strong>Custom math engine<\/strong>\u00a0&#8211; implements matrix operations, forward propagation, and backpropagation via chain rule entirely from scratch<\/li>\n\n\n\n<li><strong>Interactive 
controls<\/strong>\u00a0&#8211; modify learning rate, hidden layer size, activation functions (Sigmoid vs ReLU)<\/li>\n\n\n\n<li><strong>Zero dependencies<\/strong>\u00a0&#8211; proves complex AI logic can be built with pure mathematical implementation<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">\ud83d\udc49&nbsp;<strong>Try it live:<\/strong>&nbsp;Link in the GitHub repo<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83c\udfaf Neural Network Simulator<a href=\"https:\/\/github.com\/Slavigrad\/NeuralNetworkSimulator\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">More advanced simulator with:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Color-coded weight visualization<\/strong>\u00a0&#8211; line thickness shows weight magnitude, colors indicate positive\/negative values<\/li>\n\n\n\n<li><strong>Animated gradient flow<\/strong>\u00a0&#8211; watch error signals propagate backward through layers<\/li>\n\n\n\n<li><strong>Layer-wise error rings<\/strong>\u00a0&#8211; visualize error magnitude at each neuron<\/li>\n\n\n\n<li><strong>Multiple activation functions<\/strong>\u00a0(ReLU, Sigmoid, Tanh, Linear)<\/li>\n\n\n\n<li><strong>Multiple loss functions<\/strong>\u00a0(MSE, MAE, Binary Cross Entropy)<\/li>\n\n\n\n<li><strong>Momentum and batch processing<\/strong>\u00a0support<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>How to use these:<\/strong>&nbsp;Before writing any code, spend 15-30 minutes playing with these simulators. Change the learning rate dramatically (0.1 \u2192 1.0) and watch training fail. Add a hidden layer and see how the decision boundary becomes flexible. 
This builds invaluable intuition.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcd6 Recommended Learning Path<\/h3>\n\n\n\n<h3 class=\"wp-block-heading\">Week 1: Foundation (Theory)<\/h3>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Watch\u00a0<strong>3Blue1Brown<\/strong>\u00a0&#8220;What is backpropagation really doing?&#8221;<\/li>\n\n\n\n<li>Study\u00a0<strong>Stanford CS230 Section 2<\/strong>\u00a0slides<a href=\"https:\/\/cs230.stanford.edu\/section\/2\/\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/li>\n\n\n\n<li>Read the\u00a0<strong>Rabia-Akhtr PDF guide<\/strong><a href=\"https:\/\/github.com\/Rabia-Akhtr\/Back-propagation-Machine-Learning-Tutorial\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a><\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Week 2: Hands-On Practice<\/h3>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Implement forward\/backward pass for a 2-layer network in NumPy (use\u00a0<code>vxnuaj<\/code>\u00a0repo as reference<a href=\"https:\/\/github.com\/vxnuaj\/awesome-neural-networks\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>)<\/li>\n\n\n\n<li>Train on XOR problem (classic test case)<\/li>\n\n\n\n<li>Experiment with\u00a0<strong>Neural Network Playground<\/strong><a href=\"https:\/\/github.com\/Jayanta2004\/dev-card-showcase\/issues\/926\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>\u00a0&#8211; observe how learning rate affects convergence<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Week 3: Optimization &amp; Advanced Topics<\/h3>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Implement different optimizers (SGD, Momentum, Adam)<\/li>\n\n\n\n<li>Study\u00a0<strong>Cornell computation graphs<\/strong><a href=\"https:\/\/www.cs.cornell.edu\/courses\/cs4740\/2025sp\/lectures\/Lec11.pdf#1#1\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>\u00a0for chain rule visualization<\/li>\n\n\n\n<li>Add 
regularization and observe its effect<\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">Week 4: Debugging &amp; Best Practices<\/h3>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>Implement gradient checking (numerical gradient verification)<\/li>\n\n\n\n<li>Experiment with different activation functions<\/li>\n\n\n\n<li>Read the\u00a0<strong>Pearlmutter thesis<\/strong><a href=\"http:\/\/reports-archive.adm.cs.cmu.edu\/anon\/anon\/usr\/ftp\/usr\/ftp\/usr\/anon\/usr0\/ftp\/1996\/abstracts\/96-114.html\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>\u00a0sections on convergence<\/li>\n<\/ol>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udd11 Key Concepts to Master<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Based on the Stanford CS230 notes<a href=\"https:\/\/cs230.stanford.edu\/section\/2\/\" target=\"_blank\" rel=\"noreferrer noopener\"><\/a>, ensure you can derive these core equations:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Gradient Descent Update Rule<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted\">\u03b8_new = \u03b8_old - \u03b7 * \u2207L(\u03b8_old)<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Backpropagation Steps for a 2-Layer Network<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">Layer<\/th><th class=\"has-text-align-left\" data-align=\"left\">Forward<\/th><th class=\"has-text-align-left\" data-align=\"left\">Backward Gradient<\/th><\/tr><\/thead><tbody><tr><td>Hidden<\/td><td>Z = W\u2081X + b\u2081, A = \u03c3(Z)<\/td><td>\u2202L\/\u2202W\u2081 = ((w\u2082\u1d40 * 2\/m*(\u0177-y)) \u2299 A\u2299(1-A)) X\u1d40<\/td><\/tr><tr><td>Output<\/td><td>\u0177 = w\u2082A + b\u2082<\/td><td>\u2202L\/\u2202w\u2082 = 2\/m*(\u0177-y) A\u1d40<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator 
has-alpha-channel-opacity\" \/>\n\n\n\n<h3 class=\"wp-block-heading\">\ud83d\udcdd Summary Table: Resources by Learning Style<\/h3>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th class=\"has-text-align-left\" data-align=\"left\">If you prefer&#8230;<\/th><th class=\"has-text-align-left\" data-align=\"left\">Start with&#8230;<\/th><\/tr><\/thead><tbody><tr><td>\ud83d\udcfa Visual lectures<\/td><td>3Blue1Brown \u2192 Stanford CS230<\/td><\/tr><tr><td>\ud83d\udcd6 Reading theory<\/td><td>Rabia-Akhtr PDF + Cornell slides<\/td><\/tr><tr><td>\ud83d\udcbb Coding from scratch<\/td><td>vxnuaj\/awesome-neural-networks<\/td><\/tr><tr><td>\ud83c\udfae Hands-on experimentation<\/td><td>Neural Network Playground (web-based)<\/td><\/tr><tr><td>\ud83d\udd2c Deep mathematical rigor<\/td><td>Pearlmutter PhD thesis + arXiv paper<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\" \/>\n\n\n\n<h2 class=\"wp-block-heading\">11. Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Backpropagation and gradient descent are the foundational algorithms that made deep learning possible. 
Gradient descent provides the iterative optimization strategy, while backpropagation efficiently computes the necessary gradients through the chain rule.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Key takeaways:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Mini-batch gradient descent<\/strong>\u00a0balances the speed of stochastic updates against the stability of full-batch updates.<\/li>\n\n\n\n<li><strong>Backpropagation<\/strong>\u00a0propagates errors backward, computing the gradient for every parameter in a single backward pass at a cost comparable to one forward pass (roughly O(n) in the number of weights).<\/li>\n\n\n\n<li><strong>Vanishing\/exploding gradients<\/strong>\u00a0are real problems, mitigated by ReLU, batch norm, and proper initialization.<\/li>\n\n\n\n<li><strong>Adam<\/strong>\u00a0is a common default optimizer for many applications, but SGD with momentum remains competitive.<\/li>\n\n\n\n<li><strong>Always check your gradients<\/strong>\u00a0when implementing backprop from scratch.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Understanding these algorithms at a mathematical and implementation level separates machine learning practitioners who simply call libraries from those who can debug, innovate, and push the field forward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Whether you&#8217;re training a small perceptron or a billion-parameter transformer, the same principles apply: compute gradients with backpropagation, step downhill with gradient descent, and repeat until convergence.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Backpropagation and Gradient Descent: The Engines of Deep Learning Table of Contents 1. Introduction Every neural network learns by minimizing a&nbsp;loss function&nbsp;\u2014 a measure of how wrong its predictions are. But how does the network know&nbsp;which direction&nbsp;to adjust its thousands (or billions) of weights? 
Two algorithms answer this question: Without backpropagation, training deep networks would [&hellip;]<\/p>\n","protected":false},"author":73,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-3531","post","type-post","status-publish","format-standard","hentry","category-support"],"_links":{"self":[{"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/posts\/3531","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/users\/73"}],"replies":[{"embeddable":true,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/comments?post=3531"}],"version-history":[{"count":1,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/posts\/3531\/revisions"}],"predecessor-version":[{"id":3532,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/posts\/3531\/revisions\/3532"}],"wp:attachment":[{"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/media?parent=3531"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/categories?post=3531"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.mhtechin.com\/support\/wp-json\/wp\/v2\/tags?post=3531"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}