Part 5 of The Math Behind Neural Networks

Backpropagation: How Neural Networks Learn from Mistakes

The forward pass computes a prediction. The loss measures how wrong it is. Backpropagation answers the question: which weights should change, and in which direction, to reduce the loss?

The answer comes from the chain rule, applied systematically from the output backward through every layer to every weight. This post walks through that calculation in full.


The Computation Graph

A neural network is a composition of functions. The easiest way to reason about it is as a computation graph: a directed acyclic graph where each node is an operation and each edge carries a value.

For a minimal network, x → z = W·x + b → sigmoid → loss, the two passes work as follows:

Forward pass: values flow left to right; we compute and store each intermediate result.

Backward pass: gradients flow right to left; at each node we apply the chain rule and multiply the incoming gradient by the local derivative.


The Key Idea: Local Gradients

At each node in the graph, we only need to know:

  1. The local gradient — how the node’s output changes with respect to its input.
  2. The upstream gradient — how the loss changes with respect to this node’s output (arriving from the right).

Then, by the chain rule:

gradient flowing backward = upstream gradient × local gradient

This is the entire algorithm. Let’s work through it.
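
Before the numbers, the rule can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular library implements it: the three-node chain (x → 3x + 1 → sigmoid → square) and the names nodes, forward, and backward are made up for this example. A finite-difference estimate serves as an independent check.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

# Each node is (forward function, local derivative with respect to its input).
# Hypothetical chain for illustration: x -> 3x + 1 -> sigmoid -> square.
nodes = [
    (lambda v: 3.0 * v + 1.0, lambda v: 3.0),
    (sigmoid, lambda v: sigmoid(v) * (1.0 - sigmoid(v))),
    (lambda v: v * v, lambda v: 2.0 * v),
]

def forward(x):
    """Run the chain left to right, storing the input seen by each node."""
    inputs = []
    value = x
    for fn, _ in nodes:
        inputs.append(value)
        value = fn(value)
    return value, inputs

def backward(inputs):
    """Walk right to left: new gradient = upstream gradient * local gradient."""
    grad = 1.0  # gradient of the output with respect to itself
    for (_, local_grad), node_input in zip(reversed(nodes), reversed(inputs)):
        grad = grad * local_grad(node_input)
    return grad

output, stored = forward(0.5)
grad_x = backward(stored)

# Central finite difference as an independent check of the chain-rule result.
eps = 1e-6
numeric = (forward(0.5 + eps)[0] - forward(0.5 - eps)[0]) / (2 * eps)
print(grad_x, numeric)
```

The two printed numbers should agree to many decimal places. Note that backward only ever needs the stored inputs and each node's local derivative, which is why the forward pass keeps its intermediate results.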


Worked Example: A Single Neuron

Setup:

x = 2.0, w = -3.0, b = -3.0
z = w·x + b
ŷ = sigmoid(z)
L = (ŷ - y)² / 2    (MSE loss, true label y = 0)

Forward pass:

z  = (-3.0)(2.0) + (-3.0) = -6.0 - 3.0 = -9.0
ŷ  = sigmoid(-9.0) = 1/(1+e^9.0) = 1/8104.1 ≈ 0.000123
L  = (0.000123 - 0)² / 2 ≈ 0.0000000076   (very small — already close)

With a loss this small, the gradients would be almost zero. Let’s use worse parameters so the gradients are more instructive. Try w = 1.0, b = 0.5:

x = 2.0, w = 1.0, b = 0.5, y = 0

z  = 1.0×2.0 + 0.5 = 2.5
ŷ  = sigmoid(2.5) = 1/(1+e^-2.5) = 1/1.082 ≈ 0.924
L  = (0.924 - 0)² / 2 = 0.427

Loss is 0.427. We want to reduce it by adjusting w and b.

Backward pass — computing ∂L/∂w and ∂L/∂b:

Step 1: gradient of loss with respect to ŷ

∂L/∂ŷ = ŷ - y = 0.924 - 0 = 0.924

Step 2: gradient through sigmoid. Using the identity sigmoid'(z) = sigmoid(z)·(1-sigmoid(z)):

∂ŷ/∂z = sigmoid(2.5)·(1 - sigmoid(2.5)) = 0.924 × 0.076 = 0.0702

∂L/∂z = ∂L/∂ŷ · ∂ŷ/∂z = 0.924 × 0.0702 = 0.0649

Step 3: gradient through the linear layer z = wx + b:

∂z/∂w = x = 2.0
∂z/∂b = 1

∂L/∂w = ∂L/∂z · ∂z/∂w = 0.0649 × 2.0 = 0.1298
∂L/∂b = ∂L/∂z · ∂z/∂b = 0.0649 × 1.0 = 0.0649
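
The three steps translate almost line for line into Python. The sketch below uses the same x, w, b, and y as above; the variable names are chosen for this example. Because the hand calculation rounds intermediate results, the printed values should match it to about three decimal places, not exactly.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

x, w, b, y = 2.0, 1.0, 0.5, 0.0

# Forward pass: compute and keep every intermediate value.
z = w * x + b
y_hat = sigmoid(z)
loss = (y_hat - y) ** 2 / 2

# Backward pass: upstream gradient times local gradient at each node.
dL_dyhat = y_hat - y               # Step 1: derivative of the loss
dyhat_dz = y_hat * (1.0 - y_hat)   # Step 2: sigmoid'(z)
dL_dz = dL_dyhat * dyhat_dz
dL_dw = dL_dz * x                  # Step 3: dz/dw = x
dL_db = dL_dz * 1.0                #         dz/db = 1

print(f"loss={loss:.4f}  dL/dw={dL_dw:.4f}  dL/db={dL_db:.4f}")
```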

Gradient descent update (with η = 0.5):

w ← 1.0 - 0.5 × 0.1298 = 1.0 - 0.0649 = 0.9351
b ← 0.5 - 0.5 × 0.0649 = 0.5 - 0.0325 = 0.4675

After one update, let’s check the new loss:

z  = 0.9351×2.0 + 0.4675 = 2.338
ŷ  = sigmoid(2.338) ≈ 0.912
L  = 0.912² / 2 = 0.416    (was 0.427, so slightly improved)

Repeating this update many times drives the loss toward 0, although progress slows as the sigmoid saturates and the gradient shrinks.
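
To see the repeated update in action, here is a small training loop for the same neuron. It is a sketch under the assumptions of this example: a single training point, the learning rate η = 0.5 from above, and an arbitrarily chosen 1000 steps. The helper name loss_and_grads is made up for the illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def loss_and_grads(w, b, x=2.0, y=0.0):
    """Return the loss and its gradients with respect to w and b."""
    y_hat = sigmoid(w * x + b)
    loss = (y_hat - y) ** 2 / 2
    dL_dz = (y_hat - y) * y_hat * (1.0 - y_hat)
    return loss, dL_dz * x, dL_dz

w, b, lr = 1.0, 0.5, 0.5
history = []
for step in range(1000):
    loss, dw, db = loss_and_grads(w, b)
    history.append(loss)   # loss before this step's update
    w -= lr * dw           # gradient descent update
    b -= lr * db

print(f"first={history[0]:.3f}  second={history[1]:.3f}  last={history[-1]:.5f}")
```

The first two entries of history correspond to the losses before and after the single update computed by hand; the last entry shows how far the loss has fallen by the end of the loop.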


Backprop Through a 2-Layer Network

Now let’s extend this to the 2-layer network from the previous post. The structure:

x → [W1, b1, ReLU] → A1 → [W2, b2, Softmax] → ŷ → Cross-Entropy → L

Using the values from post 4’s forward pass:

x  = [1.5, -0.5]
Z1 = [0.85, -0.65, 1.15]
A1 = [0.85, 0.0, 1.15]      (after ReLU)
Z2 = [0.74, 0.59]
ŷ  = [0.538, 0.463]
y  = [1, 0]                  (class 0 is correct)
L  = -log(0.538) = 0.620

Backward pass:

Step 1: gradient of cross-entropy + softmax (combined)

The combined gradient of cross-entropy loss + softmax has a beautifully clean form:

∂L/∂Z2 = ŷ - y = [0.538-1, 0.463-0] = [-0.462, 0.463]

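This is one of the most elegant results in neural network math: the gradient is just the prediction error. (The two components should sum to zero, since both ŷ and y sum to 1; the 0.001 discrepancy here comes from rounding ŷ to three decimals.)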
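
If the ŷ - y shortcut feels too good to be true, it can be checked numerically. The sketch below compares it with a central finite-difference estimate of ∂L/∂Z2, using the Z2 and y from this example. The helper names softmax and cross_entropy are written for this illustration and are not taken from any library.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, y):
    """Cross-entropy loss of softmax(z) against a one-hot target y."""
    probs = softmax(z)
    return -sum(t * math.log(p) for t, p in zip(y, probs))

Z2 = [0.74, 0.59]
y = [1.0, 0.0]

# Analytic gradient: prediction minus target.
analytic = [p - t for p, t in zip(softmax(Z2), y)]

# Numerical gradient: nudge each logit up and down by eps.
eps = 1e-6
numeric = []
for i in range(len(Z2)):
    plus = list(Z2)
    minus = list(Z2)
    plus[i] += eps
    minus[i] -= eps
    numeric.append((cross_entropy(plus, y) - cross_entropy(minus, y)) / (2 * eps))

print(analytic)
print(numeric)
```

At full precision the softmax output is close to [0.5374, 0.4626], so the analytic gradient comes out near [-0.4626, 0.4626]. That matches the rounded hand values to within about 0.001.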

Step 2: gradient w.r.t. W2 and b2

For each weight W2[i,j], the gradient is the product of the upstream gradient at output i and the input activation at position j:

∂L/∂W2[i, j] = (∂L/∂Z2)[i] × A1[j]

# Row 0 (output neuron 0):
∂L/∂W2[0, :] = -0.462 × [0.85, 0.0, 1.15] = [-0.393, 0.0, -0.531]

# Row 1 (output neuron 1):
∂L/∂W2[1, :] =  0.463 × [0.85, 0.0, 1.15] = [ 0.394, 0.0,  0.532]

∂L/∂b2 = ∂L/∂Z2 = [-0.462, 0.463]
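
The row-by-row products above are an outer product of the upstream gradient and the layer input. A minimal sketch in plain Python, using the rounded values from the hand calculation (the helper outer is defined here for the example):

```python
def outer(u, v):
    """Outer product: result[i][j] = u[i] * v[j]."""
    return [[ui * vj for vj in v] for ui in u]

dL_dZ2 = [-0.462, 0.463]     # upstream gradient from Step 1
A1 = [0.85, 0.0, 1.15]       # layer input, stored during the forward pass

dL_dW2 = outer(dL_dZ2, A1)   # shape (2, 3), the same shape as W2
dL_db2 = list(dL_dZ2)        # the bias gradient equals the upstream gradient

for row in dL_dW2:
    print([round(g, 3) for g in row])
```

The middle column is zero because A1[1] is zero: a weight attached to an inactive input gets no gradient.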

Step 3: gradient flowing back to A1

∂L/∂A1 = W2ᵀ @ (∂L/∂Z2)

W2ᵀ has shape (3, 2), ∂L/∂Z2 has shape (2,)

∂L/∂A1[0] = 0.6×(-0.462) + (-0.1)×0.463 = -0.277 - 0.046 = -0.323
∂L/∂A1[1] = (-0.4)×(-0.462) + 0.3×0.463 = 0.185 + 0.139 = 0.324
∂L/∂A1[2] = 0.2×(-0.462) + 0.5×0.463  = -0.092 + 0.232 = 0.140
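
The same three sums can be written as a transposed matrix-vector product. In this sketch, W2 is read off the coefficients used in the arithmetic above (row i holds the weights of output neuron i); the helper matvec_transposed is a name made up for the example.

```python
# W2 as implied by the coefficients in Step 3.
W2 = [[0.6, -0.4, 0.2],
      [-0.1, 0.3, 0.5]]
dL_dZ2 = [-0.462, 0.463]

def matvec_transposed(W, v):
    """Compute W^T @ v without building the transpose explicitly."""
    n_rows, n_cols = len(W), len(W[0])
    return [sum(W[i][j] * v[i] for i in range(n_rows)) for j in range(n_cols)]

dL_dA1 = matvec_transposed(W2, dL_dZ2)
print([round(g, 3) for g in dL_dA1])
```

Each entry of the result sums the contributions of one hidden neuron to both outputs, which is why the transpose appears: the forward pass reads W2 by rows, the backward pass reads it by columns.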

Step 4: gradient through ReLU

ReLU’s derivative is 1 where Z1 > 0 and 0 where Z1 < 0. At exactly Z1 = 0 it is undefined, and implementations conventionally use 0. This gives the ReLU mask:

Z1 = [0.85, -0.65, 1.15]
mask = [1, 0, 1]              (1 where Z1 > 0, 0 elsewhere)

∂L/∂Z1 = ∂L/∂A1 ⊙ mask = [-0.323, 0.324, 0.140] ⊙ [1, 0, 1] = [-0.323, 0.0, 0.140]    (⊙ is the element-wise product)

The second neuron was inactive (ReLU output was 0), so no gradient flows through it. This is correct: for this input, small changes to that neuron’s weights leave its output at 0 and therefore have no effect on the loss.
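
In code, the mask is a comparison followed by an element-wise product. A minimal sketch with the values from this step:

```python
Z1 = [0.85, -0.65, 1.15]
dL_dA1 = [-0.323, 0.324, 0.140]

# 1.0 where the neuron was active in the forward pass, 0.0 elsewhere.
mask = [1.0 if z > 0 else 0.0 for z in Z1]

# Element-wise product: gradient passes through active neurons only.
dL_dZ1 = [g * m for g, m in zip(dL_dA1, mask)]
print(mask, dL_dZ1)   # [1.0, 0.0, 1.0] [-0.323, 0.0, 0.14]
```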

Step 5: gradient w.r.t. W1 and b1

Same pattern as Step 2 — multiply the upstream gradient by each input feature:

∂L/∂W1[i, j] = (∂L/∂Z1)[i] × x[j]

∂L/∂W1[0, :] = -0.323 × [1.5, -0.5] = [-0.485,  0.162]
∂L/∂W1[1, :] =  0.0   × [1.5, -0.5] = [ 0.0,    0.0  ]   (zeroed by ReLU mask)
∂L/∂W1[2, :] =  0.140 × [1.5, -0.5] = [ 0.210, -0.070]

∂L/∂b1 = ∂L/∂Z1 = [-0.323, 0.0, 0.140]


The Pattern

Every layer in a neural network has the same two ingredients:

  1. Forward: Z = W @ A_prev + b, A = activation(Z).
  2. Backward: receive ∂L/∂A from the next layer, multiply element-wise by the activation’s derivative to get ∂L/∂Z, then compute ∂L/∂W and ∂L/∂b (for gradient updates) and ∂L/∂A_prev (to pass further backward).

The backward pass is just the chain rule applied layer by layer, starting from the loss and working back to every weight.
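
The pattern fits in one small function. The sketch below is an illustration, not a library API: layer_backward and relu_grad are names invented here, and it assumes an element-wise activation such as ReLU. It is applied to layer 1 of the worked example. The post does not list W1, so W1_placeholder holds made-up values; they affect only the gradient passed further back (dL_dx), not dL_dW1 or dL_db1.

```python
def relu_grad(z):
    """Derivative of ReLU; 0 is used at z == 0 by convention."""
    return 1.0 if z > 0 else 0.0

def layer_backward(dL_dA, Z, A_prev, W, activation_grad):
    """Backward step for one layer with Z = W @ A_prev + b, A = activation(Z).

    Assumes an element-wise activation. Returns (dL_dW, dL_db, dL_dA_prev).
    """
    # Through the activation: element-wise product with its derivative.
    dL_dZ = [g * activation_grad(z) for g, z in zip(dL_dA, Z)]
    # Weight gradient: outer product of dL_dZ and the layer input.
    dL_dW = [[dz * a for a in A_prev] for dz in dL_dZ]
    # Bias gradient: the pre-activation gradient itself.
    dL_db = list(dL_dZ)
    # Gradient for the previous layer: W^T @ dL_dZ.
    dL_dA_prev = [
        sum(W[i][j] * dL_dZ[i] for i in range(len(W)))
        for j in range(len(A_prev))
    ]
    return dL_dW, dL_db, dL_dA_prev

# Hypothetical placeholder weights, shape (3, 2); NOT the values from post 4.
W1_placeholder = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

dL_dW1, dL_db1, dL_dx = layer_backward(
    dL_dA=[-0.323, 0.324, 0.140],
    Z=[0.85, -0.65, 1.15],
    A_prev=[1.5, -0.5],
    W=W1_placeholder,
    activation_grad=relu_grad,
)
for row in dL_dW1:
    print([round(g, 3) for g in row])
```

Stacking layers then amounts to calling this function once per layer, from last to first, feeding each call's dL_dA_prev into the next call as dL_dA. The softmax output layer is the one exception in this post: its activation is not element-wise, so the combined ŷ - y result supplies ∂L/∂Z2 directly.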


Why “Backpropagation”?

The word comes from “backward propagation of errors.” Information about the loss is propagated backward through the network to every weight. Each weight receives a gradient that measures how sensitive the loss is to that weight.

Weights whose small changes would move the loss a lot get a large gradient and are adjusted significantly. Weights that barely affect the output get a near-zero gradient and change little. In this way the updates concentrate on the parameters that matter most for the current error.


Practical Implementation

In PyTorch, this is handled automatically:

import torch

x  = torch.tensor([1.5, -0.5])
y  = torch.tensor([1.0, 0.0])

# Define weights with gradient tracking
W1 = torch.randn(3, 2, requires_grad=True)
b1 = torch.zeros(3, requires_grad=True)
W2 = torch.randn(2, 3, requires_grad=True)
b2 = torch.zeros(2, requires_grad=True)

# Forward pass
A1 = torch.relu(W1 @ x + b1)
Z2 = W2 @ A1 + b2
probs = torch.softmax(Z2, dim=0)
loss = -torch.sum(y * torch.log(probs))   # cross-entropy; y is one-hot, true class is index 0

# Backward pass — PyTorch computes all gradients automatically
loss.backward()

# Gradients are now in W1.grad, b1.grad, W2.grad, b2.grad
print(W1.grad)

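The forward pass records the computation graph as it runs; loss.backward() then traverses that graph in reverse, computing every gradient via the chain rule. These are the same steps we carried out by hand, although the numbers will differ from the worked example because W1 and W2 are randomly initialized here.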


Key Takeaways

  • Backpropagation is the chain rule applied to a computation graph, starting from the loss and moving backward through each layer.
  • At each node, the local gradient × upstream gradient gives the gradient to pass further back.
  • The gradient of cross-entropy + softmax is simply ŷ - y — the prediction error.
  • The gradient through ReLU is either 1 (pass through) or 0 (blocked). A neuron whose pre-activation is negative for a given input gets no weight update from that input.
  • Every weight receives a gradient that measures how sensitive the loss is to that weight.

With backpropagation understood, you know how neural networks learn. The next two posts cover the specific techniques that make large language models work: embeddings and attention.