Part 4 of The Math Behind Neural Networks

A Neural Network from Scratch: Perceptrons, Layers, and Forward Pass

The previous three posts gave you vectors, matrices, derivatives, and probability. Now you have everything needed to build and run a neural network from scratch — no libraries, just arithmetic.


The Perceptron: One Neuron

A single neuron takes a vector of inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function.

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = activation(z)

In vector form (using the dot product):

z = w · x + b
a = activation(z)

Where w and x are vectors of the same length, and b is a scalar.

Worked example:

x = [0.5, 1.2, -0.3]       # input (3 features)
w = [0.4, -0.1, 0.8]       # weights
b = 0.2                      # bias

z = (0.4×0.5) + (-0.1×1.2) + (0.8×-0.3) + 0.2
  = 0.20 - 0.12 - 0.24 + 0.2
  = 0.04

a = activation(0.04)        # apply activation function
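The arithmetic above can be checked in a few lines of plain Python. This is a minimal sketch; the helper name `pre_activation` is made up for illustration and is not part of any library.

```python
# Inputs, weights, and bias from the worked example above.
x = [0.5, 1.2, -0.3]
w = [0.4, -0.1, 0.8]
b = 0.2

def pre_activation(w, x, b):
    """Weighted sum of the inputs plus the bias: z = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

z = pre_activation(w, x, b)
print(round(z, 6))  # 0.04
```

The rounding only hides floating-point noise in the last digits; the underlying value matches the hand calculation.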

The weights w and bias b are the parameters that training will learn. A single neuron can only learn a linear decision boundary — it separates space with a line (or hyperplane). That’s why we stack neurons into layers and add nonlinear activations.


Activation Functions

Without activation functions, stacking layers is pointless: W₂(W₁x + b₁) + b₂ expands to (W₂W₁)x + (W₂b₁ + b₂), which is still a single linear (strictly, affine) layer with weights W₂W₁ and bias W₂b₁ + b₂. Activations introduce nonlinearity, enabling networks to learn complex patterns.
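The collapse of two activation-free layers into one can be confirmed numerically. The sketch below uses random weights and arbitrary layer sizes (3 inputs, 4 hidden units, 2 outputs); those sizes and the seed are assumptions chosen only for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only

# Two layers with no activation in between: 3 inputs -> 4 hidden -> 2 outputs.
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

two_layers = W2 @ (W1 @ x + b1) + b2

# The same map written as a single layer.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True
```

Inserting any nonlinear activation between the two layers breaks this equality, which is exactly why activations are needed.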

ReLU (Rectified Linear Unit)

ReLU(z) = max(0, z)

If the input is positive, pass it through unchanged. If negative, output 0.

ReLU(-2.5) = 0
ReLU(0)    = 0
ReLU(3.1)  = 3.1

ReLU is the default choice for hidden layers in modern networks. It’s fast to compute and its gradient is either 0 or 1, which makes backpropagation simple.

Derivative of ReLU:

d/dz ReLU(z) = 1  if z > 0
             = 0  if z < 0

(Undefined at z = 0, but we set it to 0 by convention — rarely matters in practice.)
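As a minimal sketch, ReLU and its derivative fit in two short functions. The names `relu` and `relu_grad` are illustrative, and the value at z = 0 follows the convention stated above.

```python
def relu(z):
    """Pass positive inputs through unchanged; clamp the rest to 0."""
    return max(0.0, z)

def relu_grad(z):
    """Slope of ReLU; the value at z == 0 is set to 0 by convention."""
    return 1.0 if z > 0 else 0.0

for z in (-2.5, 0.0, 3.1):
    print(z, relu(z), relu_grad(z))
```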

Sigmoid

sigmoid(z) = 1 / (1 + e^(-z))

Squashes any input to the range (0, 1). Historically used everywhere, now mostly used in output layers for binary classification.

sigmoid(-3) ≈ 0.047
sigmoid(0)  = 0.5
sigmoid(3)  ≈ 0.953

Derivative of sigmoid (a neat identity):

d/dz sigmoid(z) = sigmoid(z) · (1 - sigmoid(z))
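One way to gain confidence in the identity is to compare it against a numerical slope. The sketch below uses a central finite difference; the test point z = 0.7 and step size h = 1e-5 are arbitrary choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

# Central finite difference as an independent estimate of the slope.
z, h = 0.7, 1e-5
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)

print(round(sigmoid(3), 3))                    # 0.953
print(abs(numeric - sigmoid_grad(z)) < 1e-8)   # True
```

The identity is convenient in practice because the forward pass has already computed sigmoid(z), so the derivative costs only one extra multiplication.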

tanh

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))

Squashes to (-1, 1). Outputs are centred around 0 (unlike sigmoid, which is centred around 0.5), which makes training easier in some cases.

tanh(-2) ≈ -0.964
tanh(0)  = 0.0
tanh(2)  ≈  0.964
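The definition can be checked against Python's built-in `math.tanh`. The sketch also verifies a standard identity that is not stated above: tanh(z) = 2·sigmoid(2z) − 1, which shows that tanh is a rescaled sigmoid shifted to be centred on 0. The function name `tanh_from_definition` is illustrative.

```python
import math

def tanh_from_definition(z):
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for z in (-2.0, 0.0, 2.0):
    direct = tanh_from_definition(z)
    via_sigmoid = 2.0 * sigmoid(2.0 * z) - 1.0  # rescaled, shifted sigmoid
    assert math.isclose(direct, math.tanh(z), abs_tol=1e-12)
    assert math.isclose(direct, via_sigmoid, abs_tol=1e-12)
    print(z, round(direct, 3))
```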

From One Neuron to a Layer

A layer applies a weight matrix W to the input vector x, adds a bias vector b, and then applies an activation elementwise.

Z = W @ x + b        # linear transformation
A = activation(Z)    # nonlinear activation

If the layer takes n inputs and produces m outputs:

  • W has shape (m, n)
  • x has shape (n,)
  • b has shape (m,)
  • Z and A both have shape (m,)

Worked example — a 2-neuron layer with 3 inputs:

W = [[ 0.5, -0.2,  0.1],   # shape (2, 3)
     [-0.3,  0.4,  0.7]]

x = [1.0, 2.0, -1.0]       # shape (3,)
b = [0.1, -0.2]             # shape (2,)

Z = W @ x + b
  = [(0.5×1 + -0.2×2 + 0.1×-1 + 0.1),
     (-0.3×1 + 0.4×2 + 0.7×-1 + -0.2)]
  = [(0.5 - 0.4 - 0.1 + 0.1),
     (-0.3 + 0.8 - 0.7 - 0.2)]
  = [0.1, -0.4]

A = ReLU(Z) = ReLU([0.1, -0.4]) = [0.1, 0.0]

The second neuron got a negative pre-activation, so ReLU clamps it to 0. That neuron is “off” for this input.
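The same layer can be reproduced with NumPy as a sanity check on both the shapes and the values. This is a sketch; `dense_relu` is a hypothetical helper name, not a library function.

```python
import numpy as np

W = np.array([[ 0.5, -0.2,  0.1],
              [-0.3,  0.4,  0.7]])   # shape (2, 3)
x = np.array([1.0, 2.0, -1.0])        # shape (3,)
b = np.array([0.1, -0.2])             # shape (2,)

def dense_relu(W, x, b):
    """One layer: linear transformation followed by elementwise ReLU."""
    Z = W @ x + b
    return Z, np.maximum(0.0, Z)

Z, A = dense_relu(W, x, b)
print(Z.shape, A.shape)   # (2,) (2,)
print(np.round(Z, 6))     # pre-activations
print(np.round(A, 6))     # activations after ReLU
```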


A Complete 2-Layer Network

Now let’s stack two layers and run a full forward pass. The network will classify 2D points into 2 categories.

Architecture:

  • Input: 2 features
  • Hidden layer: 3 neurons, ReLU activation
  • Output layer: 2 neurons, softmax (class probabilities)

Layer 1 weights and biases:
W1 = [[ 0.4, -0.3],     # shape (3, 2)
      [-0.2,  0.5],
      [ 0.8,  0.1]]
b1 = [0.1, -0.1, 0.0]

Layer 2 weights and biases:
W2 = [[ 0.6, -0.4,  0.2],   # shape (2, 3)
      [-0.1,  0.3,  0.5]]
b2 = [0.0, 0.1]

Input:
x = [1.5, -0.5]

Forward pass — Layer 1:

Z1 = W1 @ x + b1

Z1[0] = 0.4×1.5 + (-0.3)×(-0.5) + 0.1 = 0.6 + 0.15 + 0.1 = 0.85
Z1[1] = (-0.2)×1.5 + 0.5×(-0.5) + (-0.1) = -0.3 - 0.25 - 0.1 = -0.65
Z1[2] = 0.8×1.5 + 0.1×(-0.5) + 0.0 = 1.2 - 0.05 = 1.15

Z1 = [0.85, -0.65, 1.15]

A1 = ReLU(Z1) = [0.85, 0.0, 1.15]

The second neuron’s pre-activation was negative, so ReLU clamped it to 0.

Forward pass — Layer 2:

Z2 = W2 @ A1 + b2

Z2[0] = 0.6×0.85 + (-0.4)×0.0 + 0.2×1.15 + 0.0 = 0.51 + 0 + 0.23 = 0.74
Z2[1] = (-0.1)×0.85 + 0.3×0.0 + 0.5×1.15 + 0.1 = -0.085 + 0 + 0.575 + 0.1 = 0.59

Z2 = [0.74, 0.59]

# softmax: ŷ[i] = e^Z2[i] / (e^Z2[0] + e^Z2[1])
e^0.74 = 2.096
e^0.59 = 1.804
sum    = 3.900

ŷ = [2.096/3.900, 1.804/3.900] = [0.537, 0.463]

The network assigns 53.7% probability to class 0 and 46.3% to class 1.

Computing the loss (assuming class 0 is correct, using the natural log):

L = -log(ŷ₀) = -log(0.5374) ≈ 0.621

Training will adjust all 17 parameters (6 in W1, 3 in b1, 6 in W2, 2 in b2) to reduce this loss.
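The whole forward pass, from input to loss, can be run end to end as a check on the hand calculation. This is a sketch under one assumption beyond the text: the `softmax` helper subtracts the maximum before exponentiating, a common guard against overflow that does not change the result.

```python
import numpy as np

W1 = np.array([[ 0.4, -0.3],
               [-0.2,  0.5],
               [ 0.8,  0.1]])          # shape (3, 2)
b1 = np.array([0.1, -0.1, 0.0])
W2 = np.array([[ 0.6, -0.4,  0.2],
               [-0.1,  0.3,  0.5]])    # shape (2, 3)
b2 = np.array([0.0, 0.1])
x = np.array([1.5, -0.5])

def softmax(z):
    # Subtracting the max is an overflow guard (an addition to the text);
    # it cancels in the ratio, so the probabilities are unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

A1 = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU
Z2 = W2 @ A1 + b2                   # output pre-activations
y_hat = softmax(Z2)                 # class probabilities
loss = -np.log(y_hat[0])            # cross-entropy with true class 0

# Count every trainable number in the network.
n_params = sum(p.size for p in (W1, b1, W2, b2))

print(np.round(Z2, 3), np.round(y_hat, 3), round(float(loss), 3), n_params)
```

Running it prints the output pre-activations, the two class probabilities, the loss, and the parameter count, which makes it easy to compare against the numbers worked out by hand.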


Batched Computation

In practice, you don’t process one example at a time — you process a batch of many examples simultaneously using matrix operations, which is far more efficient on GPU hardware.

If your batch has B examples and each example has n features, the input X has shape (B, n). The layer computes:

Z = X @ W.T + b     # shape (B, m)
A = activation(Z)

(Note: we use X @ W.T rather than W @ x because X stores examples as rows, not columns. Row i of X @ W.T equals W @ x for the i-th example, so the two forms agree.)

NumPy and PyTorch broadcast the bias b across the batch dimension automatically, so the code matches the formula directly:

import numpy as np

# batch of 4 examples, 2 features each
X = np.array([[1.5, -0.5],
              [0.0,  1.0],
              [-1.0, 2.0],
              [0.5,  0.5]])  # shape (4, 2)

W1 = np.array([[ 0.4, -0.3],
               [-0.2,  0.5],
               [ 0.8,  0.1]])  # shape (3, 2)
b1 = np.array([0.1, -0.1, 0.0])

Z1 = X @ W1.T + b1   # shape (4, 3)
A1 = np.maximum(0, Z1)  # ReLU, shape (4, 3)
print(A1)

All 4 examples are handled by a single matrix multiplication instead of a Python loop.
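To see that batching changes only how the work is organised, not the result, the sketch below compares the batched product with a loop that applies W1 @ x + b1 to one example at a time. The variable names `Z_batch` and `Z_loop` are illustrative.

```python
import numpy as np

X = np.array([[ 1.5, -0.5],
              [ 0.0,  1.0],
              [-1.0,  2.0],
              [ 0.5,  0.5]])           # shape (4, 2)
W1 = np.array([[ 0.4, -0.3],
               [-0.2,  0.5],
               [ 0.8,  0.1]])          # shape (3, 2)
b1 = np.array([0.1, -0.1, 0.0])

Z_batch = X @ W1.T + b1                        # all examples at once
Z_loop = np.stack([W1 @ x + b1 for x in X])    # one example at a time

print(Z_batch.shape)                  # (4, 3)
print(np.allclose(Z_batch, Z_loop))   # True
print(np.round(Z_batch[0], 2))        # first row: pre-activations for x = [1.5, -0.5]
```

The first row reproduces the Z1 vector from the single-example forward pass, since the first example in the batch is the same input.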


Key Takeaways

  • A perceptron computes z = w·x + b, then a = activation(z).
  • Without activations, stacking layers is equivalent to one linear layer.
  • ReLU max(0,z) is the default hidden-layer activation — simple, fast, and effective.
  • Sigmoid squashes to (0,1) for binary outputs; tanh squashes to (-1,1).
  • A layer is A = activation(W @ x + b).
  • The forward pass chains these layer computations to produce a prediction.
  • Batching processes many examples at once via matrix multiply — essential for GPU efficiency.

The network can now produce predictions. But how does it learn? Next post: backpropagation — where the chain rule does the heavy lifting.