Part 2 of The Math Behind Neural Networks

Derivatives and Gradients: Teaching Machines to Improve


A neural network learns by making a mistake and then figuring out which weights caused it — and by how much. The tool that makes this possible is calculus: specifically, derivatives and gradients.

If you learned derivatives in school and then forgot them, this post will rebuild exactly what you need. If you’ve never seen them, this covers everything from the ground up.


What Is a Derivative?

A derivative measures how much a function’s output changes when you nudge its input.

Formally, the derivative of f(x) at a point x₀ is:

f'(x₀) = lim(h→0) [f(x₀ + h) - f(x₀)] / h

The notation lim(h→0) means “the value this expression approaches as h gets vanishingly small.” The whole formula is asking: if I move x by a tiny amount h, how much does f(x) change per unit of that movement? The answer is the slope of the function at that point.

Example: f(x) = x²

[f(x₀ + h) - f(x₀)] / h = [(x₀ + h)² - x₀²] / h
                        = [x₀² + 2x₀h + h² - x₀²] / h
                        = 2x₀ + h

As h → 0, this becomes 2x₀. So the derivative of x² is 2x. At x = 3, the slope is 6.
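The limit can also be watched numerically. The sketch below is only an illustration: the helper name difference_quotient and the particular values of h are choices made here, not part of the definition. It evaluates the expression inside the limit for f(x) = x² at x₀ = 3 with progressively smaller h.

```python
def f(x):
    return x ** 2

def difference_quotient(f, x0, h):
    """The expression inside the limit: slope of f between x0 and x0 + h."""
    return (f(x0 + h) - f(x0)) / h

x0 = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    # For f(x) = x^2 the quotient equals 2*x0 + h, so it approaches 6 as h shrinks.
    print(f"h = {h}: slope ≈ {difference_quotient(f, x0, h):.4f}")
```

The printed slopes shrink toward 6, matching the algebraic result 2x₀ + h with x₀ = 3.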


Derivative Rules

You rarely need the limit definition in practice. These rules cover almost everything:

Function          Derivative
xⁿ                n·xⁿ⁻¹   (power rule)
ln(x)             1/x
c (constant)      0
c·f(x)            c·f'(x)
f(x) + g(x)       f'(x) + g'(x)

The chain rule is the most important one for neural networks:

If y = f(g(x)), then dy/dx = f'(g(x)) · g'(x)

In plain words: the derivative of a composition equals the outer derivative times the inner derivative.

Example:

y = (3x + 1)²

Let g(x) = 3x + 1   →  g'(x) = 3
Let f(u) = u²        →  f'(u) = 2u

dy/dx = f'(g(x)) · g'(x) = 2(3x+1) · 3 = 6(3x+1)

At x = 1: dy/dx = 6(3+1) = 24.
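One way to sanity-check a chain-rule result is to compare it with a numerical slope. The sketch below does that for the example above; central_difference is a helper written for this illustration, and the step size h = 1e-6 is an arbitrary small value.

```python
def g(x):
    return 3 * x + 1

def f(u):
    return u ** 2

def y(x):
    return f(g(x))

def dy_dx(x):
    # Chain rule: f'(g(x)) * g'(x) = 2 * (3x + 1) * 3
    return 2 * g(x) * 3

def central_difference(func, x, h=1e-6):
    """Numerical slope of func at x, using points on both sides of x."""
    return (func(x + h) - func(x - h)) / (2 * h)

x = 1.0
print("chain rule:", dy_dx(x))
print("numerical: ", central_difference(y, x))
```

Both numbers should agree with the hand calculation of 24, up to floating-point error in the numerical estimate.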

A neural network is a long composition of functions. The chain rule is how gradients flow backward through all of them — which is exactly what backpropagation does (covered in post 5).


Partial Derivatives

Neural networks have millions of parameters, not one. For a function of several variables, we use partial derivatives: the derivative with respect to one variable, treating all the others as constants.

Notation: ∂f/∂x means “the partial derivative of f with respect to x.”

Example: f(x, y) = x² + 3xy + y²

∂f/∂x = 2x + 3y      (treat y as a constant, differentiate by x)
∂f/∂y = 3x + 2y      (treat x as a constant, differentiate by y)

At (x=2, y=1):

∂f/∂x = 2(2) + 3(1) = 7
∂f/∂y = 3(2) + 2(1) = 8

This tells us: at the point (2, 1), moving slightly in the x direction increases f by about 7× that movement; moving slightly in the y direction increases f by about 8×.
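"Treat the other variable as a constant" translates directly into code: nudge one input and hold the other fixed. The sketch below checks the two partial derivatives at (2, 1) this way. The helper names and the step size h are assumptions made for the illustration.

```python
def f(x, y):
    return x ** 2 + 3 * x * y + y ** 2

def df_dx(x, y):
    return 2 * x + 3 * y   # y treated as a constant

def df_dy(x, y):
    return 3 * x + 2 * y   # x treated as a constant

def numeric_partial_x(func, x, y, h=1e-6):
    # Nudge x only; y stays fixed.
    return (func(x + h, y) - func(x - h, y)) / (2 * h)

def numeric_partial_y(func, x, y, h=1e-6):
    # Nudge y only; x stays fixed.
    return (func(x, y + h) - func(x, y - h)) / (2 * h)

x, y = 2.0, 1.0
print("∂f/∂x:", df_dx(x, y), "numerical:", numeric_partial_x(f, x, y))
print("∂f/∂y:", df_dy(x, y), "numerical:", numeric_partial_y(f, x, y))
```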


The Gradient

The gradient of a function is the vector of all its partial derivatives:

∇f(x, y) = [∂f/∂x, ∂f/∂y]

For a function of n variables, the gradient is an n-dimensional vector.

The gradient points in the direction of steepest increase. If you’re standing on a hillside, the gradient points uphill. Its negative, -∇f, points in the direction of steepest descent, which is the direction to step in when searching for a minimum.

For the example above at (2, 1):

∇f(2, 1) = [7, 8]

The function increases fastest in the direction [7, 8]. To decrease f as quickly as possible, we move in the opposite direction, [-7, -8].
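The "steepest increase" claim can be tested by brute force. The sketch below is an illustration under assumed settings (a step length of 0.001 and 360 evenly spaced directions, both arbitrary). It takes a tiny step from (2, 1) in every direction around the circle and records which one changes f the most.

```python
import numpy as np

def f(p):
    x, y = p
    return x ** 2 + 3 * x * y + y ** 2

def grad_f(p):
    x, y = p
    return np.array([2 * x + 3 * y, 3 * x + 2 * y])

p = np.array([2.0, 1.0])
g = grad_f(p)                      # [7, 8]
unit_g = g / np.linalg.norm(g)     # gradient direction with length 1

# Take a step of the same small length in 360 directions and measure the change in f.
step = 1e-3
angles = np.linspace(0.0, 2 * np.pi, 360, endpoint=False)
directions = np.stack([np.cos(angles), np.sin(angles)], axis=1)
changes = np.array([f(p + step * d) - f(p) for d in directions])

best = directions[np.argmax(changes)]    # largest increase
worst = directions[np.argmin(changes)]   # largest decrease

print("gradient direction:  ", unit_g)
print("steepest increase at:", best)
print("steepest decrease at:", worst)
```

The direction of largest increase should line up with the normalised gradient, and the direction of largest decrease with its negative, to within the one-degree resolution of the search.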


Gradient Descent

Training a neural network means minimising a loss function L(w), where w represents all the weights. The loss measures how wrong the network’s predictions are.

Gradient descent does this iteratively:

  1. Compute the gradient ∇L(w) — which direction does the loss increase fastest?
  2. Take a small step in the opposite direction (downhill).
  3. Repeat.

The update rule is:

w ← w - η · ∇L(w)

Where η (eta) is the learning rate — a small positive number like 0.001 or 0.01 that controls how large each step is.

Why not take huge steps? If η is too large, you overshoot the minimum and the loss oscillates or diverges. Too small, and training takes forever. Finding a good learning rate is one of the main practical challenges in training neural networks.

Worked example:

Suppose the loss is L(w) = (w - 3)² (a simple parabola with minimum at w = 3). Start at w = 0.

L'(w) = 2(w - 3)     (derivative of the loss)
η = 0.3

Step 1: w = 0      L'(0)    = 2(0 - 3)    = -6      w ← 0    - 0.3×(-6)    = 1.8
Step 2: w = 1.8    L'(1.8)  = 2(1.8 - 3)  = -2.4    w ← 1.8  - 0.3×(-2.4)  = 2.52
Step 3: w = 2.52   L'(2.52) = 2(2.52 - 3) = -0.96   w ← 2.52 - 0.3×(-0.96) = 2.808
Step 4: w = 2.808 → ...

Each step gets closer to w = 3. Gradient descent is working.
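Rerunning the worked example with different learning rates illustrates the trade-off described earlier. This is a sketch only: the function name run, the choice of 10 steps, and the particular learning rates are assumptions made for the comparison.

```python
def grad_loss(w):
    # Derivative of L(w) = (w - 3)^2
    return 2 * (w - 3)

def run(lr, steps=10, w=0.0):
    """Apply the update w <- w - lr * L'(w) a fixed number of times."""
    for _ in range(steps):
        w = w - lr * grad_loss(w)
    return w

for lr in [0.01, 0.3, 1.0, 1.1]:
    print(f"lr = {lr}: w after 10 steps = {run(lr):.4f}")
```

For this loss, each update multiplies the distance to the minimum, w - 3, by (1 - 2η). With η = 0.01 that factor is 0.98, so progress is slow. With η = 0.3 it is 0.4, so w closes in on 3 quickly. With η = 1.0 it is -1, so w bounces between 0 and 6 forever. With η = 1.1 it is -1.2, so every step lands further from the minimum than the last.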


The Loss Surface

In a real neural network, L(w) is a function of millions of weights — the “loss surface” is a high-dimensional landscape. We can’t visualise it, but the mathematics is identical: compute the gradient, step downhill, repeat.

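# simple quadratic loss: L(w) = (w - 3)^2, with its minimum at w = 3
def loss(w):
    return (w - 3) ** 2

# derivative of the loss: L'(w) = 2(w - 3)
def grad_loss(w):
    return 2 * (w - 3)

w = 0.0    # starting weight
lr = 0.3   # learning rate (eta)

for step in range(20):
    g = grad_loss(w)   # which way does the loss increase?
    w = w - lr * g     # step in the opposite direction
    print(f"step {step+1:2d}: w={w:.4f}, L={loss(w):.4f}")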

Each step shrinks the distance to the minimum by a factor of 0.4, so w agrees with 3 to four decimal places after roughly a dozen steps.
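To see that the mathematics really is identical with more than one weight, here is a minimal sketch using a vector of three weights. The loss L(w) = ‖w - target‖² and the target values are hypothetical, chosen only so the minimum is known in advance. They do not come from a real network.

```python
import numpy as np

# Hypothetical location of the minimum, chosen only for this illustration.
target = np.array([3.0, -1.0, 2.0])

def loss(w):
    # Sum of squared distances from the target, one term per weight.
    return np.sum((w - target) ** 2)

def grad_loss(w):
    # Vector of partial derivatives: dL/dw_i = 2 * (w_i - target_i)
    return 2 * (w - target)

w = np.zeros(3)   # start with all weights at 0
lr = 0.3

for step in range(20):
    w = w - lr * grad_loss(w)   # same update rule, applied to every weight at once

print("final w:", w)
print("final loss:", loss(w))
```

The update line is the same as in the one-weight version. The only change is that w and the gradient are now arrays.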


Why the Chain Rule Is Central

In a neural network, the weights w produce hidden activations z, which produce the output ŷ, which determines the loss L. This is a chain of compositions:

w  →  z  →  ŷ  →  L

To compute ∂L/∂w, you apply the chain rule once for each link in the chain:

∂L/∂w = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂w

This is the mathematical heart of backpropagation. Post 5 works through this in full detail.
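As a small preview, the three factors can be computed by hand for the simplest possible case. The sketch below assumes a one-weight model with intermediate value z = w·x, a sigmoid output ŷ = σ(z), and a squared-error loss L = (ŷ - y)². These choices, and the numbers used, are assumptions for illustration rather than the setup a later post necessarily uses.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(w, x, y):
    """Run the chain w -> z -> y_hat -> L and return every intermediate value."""
    z = w * x
    y_hat = sigmoid(z)
    loss = (y_hat - y) ** 2
    return z, y_hat, loss

def grad_w(w, x, y):
    """dL/dw as a product of the three chain-rule factors."""
    z, y_hat, _ = forward(w, x, y)
    dL_dyhat = 2 * (y_hat - y)        # derivative of (y_hat - y)^2
    dyhat_dz = y_hat * (1 - y_hat)    # derivative of the sigmoid
    dz_dw = x                         # derivative of w * x with respect to w
    return dL_dyhat * dyhat_dz * dz_dw

# Hypothetical weight, input and target.
w, x, y = 0.5, 2.0, 1.0

# Numerical check: nudge w and watch how the loss changes.
h = 1e-6
numeric = (forward(w + h, x, y)[2] - forward(w - h, x, y)[2]) / (2 * h)

print("chain rule:", grad_w(w, x, y))
print("numerical: ", numeric)
```

The two values should match closely. The gradient is negative here, so the update rule would increase w and push ŷ toward the target of 1.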


Key Takeaways

  • A derivative f'(x) measures the slope of f at x — how much the output changes per unit change in input.
  • The chain rule handles compositions: d/dx f(g(x)) = f'(g(x)) · g'(x).
  • A partial derivative ∂f/∂x treats all other variables as constants.
  • The gradient ∇f is the vector of all partial derivatives. It points in the direction of steepest increase.
  • Gradient descent moves weights in the direction -∇L, reducing the loss one step at a time.

Next up: probability and the Gaussian — the mathematical language that neural networks use to express uncertainty and produce predictions.