The previous three posts gave you vectors, matrices, derivatives, and probability. Now you have everything needed to build and run a neural network from scratch — no libraries, just arithmetic.
The Perceptron: One Neuron
A single neuron takes a vector of inputs, multiplies each by a weight, adds a bias, and passes the result through an activation function.
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
a = activation(z)
In vector form (using the dot product):
z = w · x + b
a = activation(z)
Where w and x are vectors of the same length, and b is a scalar.
Worked example:
x = [0.5, 1.2, -0.3] # input (3 features)
w = [0.4, -0.1, 0.8] # weights
b = 0.2 # bias
z = (0.4×0.5) + (-0.1×1.2) + (0.8×-0.3) + 0.2
= 0.20 - 0.12 - 0.24 + 0.2
= 0.04
a = activation(0.04) # apply activation function
The weights w and bias b are the parameters that training will learn. A single neuron can only learn a linear decision boundary — it separates space with a line (or hyperplane). That’s why we stack neurons into layers and add nonlinear activations.
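To check the worked example by machine, here is a minimal sketch in plain Python. The helper name `neuron_pre_activation` is made up for illustration; the numbers are the ones from the example above.

```python
def neuron_pre_activation(w, x, b):
    """Return z = w . x + b for two equal-length lists w and x."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    # Multiply each input by its weight, sum the products, then add the bias.
    return sum(wi * xi for wi, xi in zip(w, x)) + b

x = [0.5, 1.2, -0.3]   # input (3 features)
w = [0.4, -0.1, 0.8]   # weights
b = 0.2                # bias

z = neuron_pre_activation(w, x, b)
print(round(z, 2))  # 0.04
```

The rounding is only there to hide floating-point noise; the activation would be applied to `z` as the next step.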
Activation Functions
Without activation functions, stacking layers is pointless: W₂(W₁x + b₁) + b₂ expands to (W₂W₁)x + (W₂b₁ + b₂), which is still a single linear (strictly speaking, affine) transformation, equivalent to one layer with weights W₂W₁ and bias W₂b₁ + b₂. Activations introduce nonlinearity, enabling networks to learn complex patterns.
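You can confirm the collapse numerically. The sketch below uses arbitrary illustrative weights and small hypothetical helpers (`matvec`, `matmul`, `add`) written in plain Python; it runs two layers with no activation, then a single layer built from the merged weights and bias, and compares the results.

```python
def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Matrix-matrix product A @ B for list-of-rows matrices."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    """Elementwise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]

# Arbitrary illustrative parameters (2 inputs -> 3 hidden -> 2 outputs).
W1 = [[0.4, -0.3], [-0.2, 0.5], [0.8, 0.1]]
b1 = [0.1, -0.1, 0.0]
W2 = [[0.6, -0.4, 0.2], [-0.1, 0.3, 0.5]]
b2 = [0.0, 0.1]
x = [1.5, -0.5]

# Two stacked layers with no activation in between.
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# One layer with merged weights W2 @ W1 and merged bias W2 @ b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(all(abs(p - q) < 1e-12 for p, q in zip(two_layers, one_layer)))  # True
```

Put a ReLU between the two layers and the equality generally stops holding, which is exactly the point of the activation.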
ReLU (Rectified Linear Unit)
ReLU(z) = max(0, z)
If the input is positive, pass it through unchanged. If negative, output 0.
ReLU(-2.5) = 0
ReLU(0) = 0
ReLU(3.1) = 3.1
ReLU is the default choice for hidden layers in modern networks. It’s fast to compute and its gradient is either 0 or 1, which makes backpropagation simple.
Derivative of ReLU:
d/dz ReLU(z) = 1 if z > 0
= 0 if z < 0
(Undefined at z = 0, but we set it to 0 by convention — rarely matters in practice.)
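As a small sketch, ReLU and its derivative fit in a few lines of plain Python. The function names are illustrative, and the value at z = 0 follows the convention stated above.

```python
def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_grad(z):
    """Derivative of ReLU; returns 0 at z == 0 by the convention in the text."""
    return 1.0 if z > 0 else 0.0

for z in (-2.5, 0.0, 3.1):
    print(z, relu(z), relu_grad(z))
```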
Sigmoid
sigmoid(z) = 1 / (1 + e^(-z))
Squashes any input to the range (0, 1). Historically used everywhere, now mostly used in output layers for binary classification.
sigmoid(-3) ≈ 0.047
sigmoid(0) = 0.5
sigmoid(3) ≈ 0.953
Derivative of sigmoid (a neat identity):
d/dz sigmoid(z) = sigmoid(z) · (1 - sigmoid(z))
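One way to convince yourself of the identity is to compare it against a numerical derivative. This is a minimal sketch using only the standard library; `numeric_grad` is a hypothetical helper that uses a central finite difference, and the step size `h` is an arbitrary choice.

```python
import math

def sigmoid(z):
    """sigmoid(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity sigmoid(z) * (1 - sigmoid(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def numeric_grad(f, z, h=1e-6):
    """Central finite-difference estimate of df/dz."""
    return (f(z + h) - f(z - h)) / (2 * h)

for z in (-3.0, 0.0, 3.0):
    analytic = sigmoid_grad(z)
    estimate = numeric_grad(sigmoid, z)
    # The two columns should agree to many decimal places.
    print(z, round(sigmoid(z), 3), analytic, estimate)
```

At z = 0 the derivative is 0.5 × 0.5 = 0.25, which is the largest value it ever takes.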
tanh
tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
Squashes to (-1, 1). Outputs are centred around 0 (unlike sigmoid, which is centred around 0.5), which makes training easier in some cases.
tanh(-2) ≈ -0.964
tanh(0) = 0.0
tanh(2) ≈ 0.964
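For completeness, here is the tanh formula written out directly and compared with Python's built-in `math.tanh`. The function name `tanh_from_definition` is illustrative; in real code you would normally just call the built-in.

```python
import math

def tanh_from_definition(z):
    """tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))."""
    ez, enz = math.exp(z), math.exp(-z)
    return (ez - enz) / (ez + enz)

for z in (-2.0, 0.0, 2.0):
    print(z, round(tanh_from_definition(z), 3), round(math.tanh(z), 3))
```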
From One Neuron to a Layer
A layer applies a weight matrix W to the input vector x, adds a bias vector b, and then applies an activation elementwise.
Z = W @ x + b # linear transformation
A = activation(Z) # nonlinear activation
If the layer takes n inputs and produces m outputs:
- W has shape (m, n)
- x has shape (n,)
- b has shape (m,)
- Z and A both have shape (m,)
Worked example — a 2-neuron layer with 3 inputs:
W = [[ 0.5, -0.2, 0.1], # shape (2, 3)
[-0.3, 0.4, 0.7]]
x = [1.0, 2.0, -1.0] # shape (3,)
b = [0.1, -0.2] # shape (2,)
Z = W @ x + b
= [(0.5×1 + -0.2×2 + 0.1×-1 + 0.1),
(-0.3×1 + 0.4×2 + 0.7×-1 + -0.2)]
= [(0.5 - 0.4 - 0.1 + 0.1),
(-0.3 + 0.8 - 0.7 - 0.2)]
= [0.1, -0.4]
A = ReLU(Z) = ReLU([0.1, -0.4]) = [0.1, 0.0]
The second neuron got a negative pre-activation, so ReLU clamps it to 0. That neuron is “off” for this input.
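The same layer can be sketched in plain Python, with one row of W per neuron. The helper name `dense_relu` is an assumption for illustration; the numbers are the ones from the worked example.

```python
def dense_relu(W, x, b):
    """Compute Z = W @ x + b row by row, then apply ReLU elementwise."""
    Z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    A = [max(0.0, z) for z in Z]
    return Z, A

W = [[0.5, -0.2, 0.1],    # shape (2, 3): 2 neurons, 3 inputs each
     [-0.3, 0.4, 0.7]]
x = [1.0, 2.0, -1.0]      # shape (3,)
b = [0.1, -0.2]           # shape (2,)

Z, A = dense_relu(W, x, b)
print([round(z, 2) for z in Z])  # [0.1, -0.4]
print([round(a, 2) for a in A])  # [0.1, 0.0]
```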
A Complete 2-Layer Network
Now let’s stack two layers and run a full forward pass. The network will classify 2D points into 2 categories.
Architecture:
- Input: 2 features
- Hidden layer: 3 neurons, ReLU activation
- Output layer: 2 neurons, softmax (class probabilities)
Layer 1 weights and biases:
W1 = [[ 0.4, -0.3], # shape (3, 2)
[-0.2, 0.5],
[ 0.8, 0.1]]
b1 = [0.1, -0.1, 0.0]
Layer 2 weights and biases:
W2 = [[ 0.6, -0.4, 0.2], # shape (2, 3)
[-0.1, 0.3, 0.5]]
b2 = [0.0, 0.1]
Input:
x = [1.5, -0.5]
Forward pass — Layer 1:
Z1 = W1 @ x + b1
Z1[0] = 0.4×1.5 + (-0.3)×(-0.5) + 0.1 = 0.6 + 0.15 + 0.1 = 0.85
Z1[1] = (-0.2)×1.5 + 0.5×(-0.5) + (-0.1) = -0.3 - 0.25 - 0.1 = -0.65
Z1[2] = 0.8×1.5 + 0.1×(-0.5) + 0.0 = 1.2 - 0.05 = 1.15
Z1 = [0.85, -0.65, 1.15]
A1 = ReLU(Z1) = [0.85, 0.0, 1.15]
The second neuron's pre-activation was negative, so ReLU clamped it to 0.
Forward pass — Layer 2:
Z2 = W2 @ A1 + b2
Z2[0] = 0.6×0.85 + (-0.4)×0.0 + 0.2×1.15 + 0.0 = 0.51 + 0 + 0.23 = 0.74
Z2[1] = (-0.1)×0.85 + 0.3×0.0 + 0.5×1.15 + 0.1 = -0.085 + 0 + 0.575 + 0.1 = 0.59
Z2 = [0.74, 0.59]
# softmax
e^0.74 = 2.096
e^0.59 = 1.804
sum = 3.900
ŷ = [2.096/3.900, 1.804/3.900] = [0.537, 0.463]
The network assigns 53.7% probability to class 0 and 46.3% to class 1.
Computing the loss (assuming class 0 is correct, using the natural log):
L = -log(ŷ₀) = -log(0.5374) ≈ 0.621
Training will adjust all 17 parameters (6 in W1, 3 in b1, 6 in W2, 2 in b2) to reduce this loss.
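Here is the whole forward pass as a minimal plain-Python sketch, using the weights and input from this section. The helper names (`dense`, `relu`, `softmax`) are illustrative. One detail goes beyond the hand calculation: `softmax` subtracts the largest logit before exponentiating, a common numerical-stability trick that leaves the probabilities unchanged.

```python
import math

def dense(W, x, b):
    """Z = W @ x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, z) for z in v]

def softmax(v):
    # Subtracting max(v) avoids overflow and does not change the result.
    m = max(v)
    exps = [math.exp(z - m) for z in v]
    total = sum(exps)
    return [e / total for e in exps]

W1 = [[0.4, -0.3], [-0.2, 0.5], [0.8, 0.1]]   # shape (3, 2)
b1 = [0.1, -0.1, 0.0]
W2 = [[0.6, -0.4, 0.2], [-0.1, 0.3, 0.5]]     # shape (2, 3)
b2 = [0.0, 0.1]
x = [1.5, -0.5]

A1 = relu(dense(W1, x, b1))       # hidden layer
Z2 = dense(W2, A1, b2)            # output logits
y_hat = softmax(Z2)               # class probabilities
loss = -math.log(y_hat[0])        # cross-entropy, class 0 assumed correct

# Count every weight and bias in the network.
n_params = (sum(len(r) for r in W1) + len(b1)
            + sum(len(r) for r in W2) + len(b2))

print([round(p, 3) for p in y_hat])  # [0.537, 0.463]
print(round(loss, 3))                # 0.621
print(n_params)                      # 17
```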
Batched Computation
In practice, you don’t process one example at a time — you process a batch of many examples simultaneously using matrix operations, which is far more efficient on GPU hardware.
If your batch has B examples and each example has n features, the input X has shape (B, n). The layer computes:
Z = X @ W.T + b # shape (B, m)
A = activation(Z)
(Note: X @ W.T rather than W @ x because X has examples as rows, not columns. Both are equivalent for a single example.)
NumPy (like PyTorch) handles this transparently, broadcasting the bias vector across every row of the batch:
import numpy as np
# batch of 4 examples, 2 features each
X = np.array([[1.5, -0.5],
[0.0, 1.0],
[-1.0, 2.0],
[0.5, 0.5]]) # shape (4, 2)
W1 = np.array([[ 0.4, -0.3],
[-0.2, 0.5],
[ 0.8, 0.1]]) # shape (3, 2)
b1 = np.array([0.1, -0.1, 0.0])
Z1 = X @ W1.T + b1 # shape (4, 3)
A1 = np.maximum(0, Z1) # ReLU, shape (4, 3)
print(A1)
All 4 examples are computed in a single matrix multiplication, with no Python loop over the batch.
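To check that batching changes only the bookkeeping and not the answer, this sketch compares the batched result with the same layer applied to one example at a time. It assumes NumPy is installed and reuses the arrays from the snippet above.

```python
import numpy as np

X = np.array([[1.5, -0.5],
              [0.0, 1.0],
              [-1.0, 2.0],
              [0.5, 0.5]])            # shape (4, 2)
W1 = np.array([[0.4, -0.3],
               [-0.2, 0.5],
               [0.8, 0.1]])           # shape (3, 2)
b1 = np.array([0.1, -0.1, 0.0])

# Batched: examples are rows, so multiply by W1.T; b1 broadcasts over rows.
batched = np.maximum(0, X @ W1.T + b1)

# One at a time: the single-example form W1 @ x + b1, stacked back into rows.
one_by_one = np.stack([np.maximum(0, W1 @ x + b1) for x in X])

print(batched.shape)                      # (4, 3)
print(np.allclose(batched, one_by_one))   # True
```

The first row of `batched` is the hidden activation for x = [1.5, -0.5], matching the hand-computed A1 from the 2-layer example.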
Key Takeaways
- A perceptron computes z = w·x + b, then a = activation(z).
- Without activations, stacking layers is equivalent to one linear layer.
- ReLU, max(0, z), is the default hidden-layer activation: simple, fast, and effective.
- Sigmoid squashes to (0, 1) for binary outputs; tanh squashes to (-1, 1).
- A layer is A = activation(W @ x + b).
- The forward pass chains these layer computations to produce a prediction.
- Batching processes many examples at once via matrix multiply — essential for GPU efficiency.
The network can now produce predictions. But how does it learn? Next post: backpropagation — where the chain rule does the heavy lifting.