Part 1 of The Math Behind Neural Networks

Vectors and Matrices: The Language Neural Networks Speak

6 min read

Every neural network, no matter how large, is doing one thing repeatedly: multiplying matrices and adding vectors. GPT-4 with its hundreds of billions of parameters is, at its core, a long chain of these operations. Before you can understand attention, backpropagation, or transformers, you need to be fluent in the arithmetic they’re written in.

This post builds that fluency from scratch, with worked examples at every step.


Scalars, Vectors, and Matrices

A scalar is a single number. Temperature (22°C), price ($9.99), age (31) — all scalars.

A vector is an ordered list of numbers. It has both a magnitude (how big) and a direction (which way).

v = [3, 2]       (a 2D vector)
w = [1, 0, -4]   (a 3D vector)

Geometrically, v = [3, 2] means “go 3 units right and 2 units up.” Vectors live in space.

A matrix is a rectangular grid of numbers, arranged in rows and columns.

A = [[2, 0],
     [1, 3]]    (a 2×2 matrix — 2 rows, 2 columns)

We say matrix A has shape (2, 2). A matrix with m rows and n columns has shape (m, n).


Vector Operations

Addition

Add vectors component by component. Both vectors must have the same length.

a = [1, 2, 3]
b = [4, 5, 6]

a + b = [1+4, 2+5, 3+6] = [5, 7, 9]

Geometrically: place vector b at the tip of vector a. The result points from the origin to where b ends.

Scalar Multiplication

Multiply every component by the same number.

a = [1, 2, 3]
3 * a = [3, 6, 9]

This stretches (or shrinks, or flips) the vector without changing its direction.

The Dot Product

This one is crucial. The dot product of two vectors is a single number (a scalar):

a · b = a[0]*b[0] + a[1]*b[1] + ... + a[n]*b[n]

Worked example:

a = [1, 2, 3]
b = [4, 5, 6]

a · b = (1×4) + (2×5) + (3×6)
      = 4 + 10 + 18
      = 32

The dot product measures how much two vectors point in the same direction. If two vectors are perpendicular, their dot product is 0. If they point in the same direction, it’s large and positive. This property becomes the foundation of the attention mechanism in transformers.

Geometric interpretation:

a · b = |a| × |b| × cos(θ)

Where |a| is the length (or magnitude) of vector a — computed as sqrt(a[0]² + a[1]² + ...). θ is the angle between the two vectors. cos(0°) = 1 (same direction), cos(90°) = 0 (perpendicular), cos(180°) = -1 (opposite).


Matrices

What a Matrix Represents

A matrix can represent many things: a dataset (rows = examples, columns = features), a transformation of space, or — most importantly for neural networks — the weights connecting one layer to the next.

Matrix Multiplication

This is the core operation. To multiply matrix A (shape m×n) by matrix B (shape n×p), the output C has shape m×p.

The rule: entry C[i][j] is the dot product of row i of A with column j of B.

Worked example (2×2 × 2×2):

A = [[1, 2],    B = [[5, 6],
     [3, 4]]         [7, 8]]

C[0][0] = row 0 of A · col 0 of B = [1,2]·[5,7] = (1×5)+(2×7) = 19
C[0][1] = row 0 of A · col 1 of B = [1,2]·[6,8] = (1×6)+(2×8) = 22
C[1][0] = row 1 of A · col 0 of B = [3,4]·[5,7] = (3×5)+(4×7) = 43
C[1][1] = row 1 of A · col 1 of B = [3,4]·[6,8] = (3×6)+(4×8) = 50

C = [[19, 22],
     [43, 50]]

Worked example (2×3 × 3×2 → 2×2):

A = [[1, 0, 2],     B = [[1, 2],
     [3, 1, 0]]          [0, 1],
                          [4, 0]]

C[0][0] = [1,0,2]·[1,0,4] = 1+0+8 = 9
C[0][1] = [1,0,2]·[2,1,0] = 2+0+0 = 2
C[1][0] = [3,1,0]·[1,0,4] = 3+0+0 = 3
C[1][1] = [3,1,0]·[2,1,0] = 6+1+0 = 7

C = [[9, 2],
     [3, 7]]

Important: matrix multiplication is not commutative. A × B ≠ B × A in general. The inner dimensions must match: (m×n) × (n×p) works; (m×n) × (p×n) does not (unless p = m).

The Transpose

The transpose of a matrix, written Aᵀ, flips rows and columns.

A = [[1, 2, 3],       Aᵀ = [[1, 4],
     [4, 5, 6]]              [2, 5],
                              [3, 6]]

If A has shape (m, n), then Aᵀ has shape (n, m).

The transpose appears constantly in neural networks, particularly in the attention formula QKᵀ (which you’ll see in post 7).


Why Shape Tracking Matters

In code, shape errors are the most common bug when building neural networks. Getting comfortable with shapes now prevents a lot of pain later.

import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])   # shape (2, 3)

B = np.array([[7, 8],
              [9, 10],
              [11, 12]])    # shape (3, 2)

C = A @ B                   # shape (2, 2) — inner dims match (3 == 3)
print(C)
# [[58  64]
#  [139 154]]

The @ operator in Python is matrix multiplication (same as np.matmul). If you try to multiply shapes that don’t align, NumPy will raise an error immediately.


A Neural Network Layer in One Line

Here’s the payoff. A single layer in a neural network computes:

output = activation(W @ input + b)

Where:

  • W is the weight matrix — shape (output_size, input_size)
  • input is the input vector — shape (input_size,)
  • b is the bias vector — shape (output_size,)
  • activation is a nonlinear function (ReLU, sigmoid, etc.)

Worked example:

W = [[0.5, -0.2],    # shape (3, 2)
     [0.1,  0.8],
     [-0.3, 0.4]]

input = [2.0, 1.0]   # shape (2,)
b     = [0.1, -0.1, 0.2]

z = W @ input + b
  = [0.5*2 + (-0.2)*1 + 0.1,
     0.1*2 +   0.8*1 + (-0.1),
     -0.3*2 + 0.4*1 + 0.2]
  = [0.9, 0.9, 0.0]

That’s a 3-neuron layer processing a 2-dimensional input. The matrix W has 3×2 = 6 learnable weights — the numbers that training will adjust to minimise the loss.


Key Takeaways

  • A vector is an ordered list of numbers with magnitude and direction.
  • The dot product a · b measures similarity between two vectors — large when they point the same way, zero when perpendicular.
  • A matrix is a grid of numbers that can represent a transformation, a dataset, or a set of weights.
  • Matrix multiplication chains transformations together. Shape (m×n) × (n×p) → (m×p).
  • One neural network layer = W @ x + b followed by an activation function.

Everything in the rest of this series builds directly on these operations. Next: the calculus that teaches a network to improve.