When a language model says “the next word is probably ‘cat’”, it is producing a probability distribution over its entire vocabulary. When a classifier is 94% confident an image is a dog, that confidence is a number from probability theory. Understanding how neural networks reason about uncertainty requires understanding a handful of core probability concepts.
Probability Basics
A probability is a number between 0 and 1 that expresses how likely an event is.
- P(event) = 0 means impossible.
- P(event) = 1 means certain.
- P(heads on a fair coin) = 0.5.
For a set of mutually exclusive outcomes that cover all possibilities, probabilities must sum to 1:
P(heads) + P(tails) = 0.5 + 0.5 = 1.0
This constraint — probabilities summing to 1 — is something we’ll enforce explicitly with softmax.
Expected Value
The expected value E[X] is the probability-weighted average of all possible outcomes:
E[X] = Σ x · P(X = x)
Example: Roll a fair six-sided die. What’s the expected value?
E[X] = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6)
= (1+2+3+4+5+6)/6
= 21/6 = 3.5
You’ll never actually roll 3.5, but on average over many rolls, this is what you expect.
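The die calculation can be checked in a few lines of Python. This is a minimal sketch using only the standard library; the helper name `expected_value`, the seed, and the number of simulated rolls are illustrative choices, not part of the definition.

```python
import random
from fractions import Fraction

def expected_value(outcomes, probs):
    """Probability-weighted average: sum of x * P(X = x)."""
    return sum(x * p for x, p in zip(outcomes, probs))

faces = [1, 2, 3, 4, 5, 6]
probs = [Fraction(1, 6)] * 6          # fair die: each face has probability 1/6

exact = expected_value(faces, probs)  # exact arithmetic, equals 7/2

# Empirical check: the average of many simulated rolls should land near 3.5.
rng = random.Random(0)                # fixed seed so the run is repeatable
rolls = [rng.choice(faces) for _ in range(100_000)]
sample_mean = sum(rolls) / len(rolls)

print(float(exact))   # 3.5
print(sample_mean)    # close to 3.5; the exact digits depend on the seed
```

The exact result uses `Fraction` to avoid floating-point rounding, while the simulation shows the "on average over many rolls" reading of the same number.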
Probability Distributions
A probability distribution describes how probability is spread across possible values of a random variable.
For a discrete distribution (a finite or countable set of outcomes), we list each outcome and its probability.
For a continuous distribution (outcomes anywhere on a continuum, such as any real number in a range), we use a probability density function (PDF): a curve whose total area equals 1, where the area between two values a and b gives P(a ≤ X ≤ b).
The Gaussian Distribution
The Gaussian, also called the normal distribution, is arguably the most important distribution in statistics.
Its PDF is:
f(x) = 1 / (σ√(2π)) · exp(-(x - μ)² / (2σ²))
Two parameters control it completely:
- μ (mu): the mean, where the peak sits.
- σ (sigma): the standard deviation, how spread out the distribution is. σ² is the variance.
The bell curve is symmetric around μ. About 68% of the probability mass falls within one standard deviation of the mean (μ ± σ). About 95% falls within μ ± 2σ.
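The 68% and 95% figures can be verified numerically. The sketch below implements the PDF above and checks the one-sigma area two ways: with the error function `math.erf`, and with a simple midpoint-rule integration of the curve. The function names and the values of μ and σ are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

def prob_within(k):
    """P(mu - k*sigma <= X <= mu + k*sigma) for any Gaussian, via the error function."""
    return math.erf(k / math.sqrt(2.0))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the area under f between a and b."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

mu, sigma = 2.0, 0.5  # arbitrary illustrative values
area_1sigma = integrate(lambda x: gaussian_pdf(x, mu, sigma), mu - sigma, mu + sigma)

print(round(prob_within(1), 4))  # 0.6827
print(round(prob_within(2), 4))  # 0.9545
print(round(area_1sigma, 4))     # 0.6827
```

The area within k standard deviations does not depend on μ or σ, which is why the same percentages apply to every Gaussian.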
Why the Gaussian Appears Everywhere
The Central Limit Theorem says that the sum of many independent random variables with finite variance, once suitably normalised, tends toward a Gaussian, largely regardless of their individual distributions. Measurement errors, natural variation in populations, noise in signals: many of these arise from the sum of many small independent effects, so they often look approximately Gaussian.
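A small simulation makes this concrete. The sketch below sums uniform random numbers, which individually look nothing like a bell curve, and checks that the sums follow the 68% and 95% pattern of a Gaussian. The number of terms, the number of sums, and the seed are arbitrary illustrative choices.

```python
import math
import random

rng = random.Random(42)          # fixed seed for repeatability

n_terms = 30                     # independent uniform draws in each sum
n_sums = 20_000                  # how many sums to simulate

# Each Uniform(0, 1) draw has mean 1/2 and variance 1/12, so a sum of
# n_terms draws has mean n_terms/2 and variance n_terms/12.
mean = n_terms / 2
std = math.sqrt(n_terms / 12)

sums = [sum(rng.random() for _ in range(n_terms)) for _ in range(n_sums)]

# Fraction of sums within one and two standard deviations of the mean.
within_1 = sum(abs(s - mean) <= std for s in sums) / n_sums
within_2 = sum(abs(s - mean) <= 2 * std for s in sums) / n_sums

print(within_1)  # roughly 0.68 if the sums are close to Gaussian
print(within_2)  # roughly 0.95
```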
In neural networks, the Gaussian appears in:
- Weight initialisation: weights are often initialised by sampling from N(0, σ²), which is shorthand for “a normal distribution with mean 0 and variance σ²”.
- Noise: some regularisation techniques add controlled randomness, often Gaussian noise on inputs or activations (standard dropout uses random on/off masks instead).
- Gaussian processes: probabilistic models closely related to neural networks.
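As a rough illustration of the weight-initialisation case, the sketch below fills a weight matrix with samples from N(0, σ²) using the standard library. The helper `init_weights`, the layer sizes, and σ = 0.01 are assumptions chosen for the example; practical schemes usually pick σ based on the layer size.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed for repeatability

def init_weights(n_out, n_in, sigma=0.01):
    """Return an n_out x n_in matrix with entries drawn independently from N(0, sigma^2).

    Note that random.gauss takes the standard deviation sigma, not the variance.
    """
    return [[rng.gauss(0.0, sigma) for _ in range(n_in)] for _ in range(n_out)]

W = init_weights(n_out=100, n_in=200, sigma=0.01)
flat = [w for row in W for w in row]

print(statistics.mean(flat))    # close to 0
print(statistics.pstdev(flat))  # close to sigma = 0.01
```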
Softmax: Turning Numbers Into Probabilities
A neural network’s output layer for classification typically produces raw scores called logits — one per class, any real value, not constrained to sum to 1. Softmax converts these into a proper probability distribution.
For a vector of logits z = [z₁, z₂, ..., zₙ]:
softmax(z)ᵢ = e^zᵢ / Σⱼ e^zⱼ
Each output is always between 0 and 1, and they sum to exactly 1.
Worked example:
z = [2.0, 1.0, 0.1]
e^2.0 = 7.389
e^1.0 = 2.718
e^0.1 = 1.105
sum = 11.212
softmax = [7.389/11.212, 2.718/11.212, 1.105/11.212]
= [0.659, 0.242, 0.099]
Check: 0.659 + 0.242 + 0.099 = 1.0 ✓
The class with logit 2.0 gets about 66% of the probability. The exponential function amplifies differences: adding 1 to a logit multiplies its (unnormalised) weight by e ≈ 2.72, so small changes in logits become large probability gaps.
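The worked example translates directly into code. This is a minimal sketch in plain Python; subtracting the largest logit before exponentiating is a common numerical-stability trick that is not part of the formula above, but it does not change the result because the extra factor cancels in the ratio.

```python
import math

def softmax(logits):
    """Convert a list of real-valued logits into probabilities that sum to 1."""
    # Shift by the max logit: e^(z - m) / sum e^(z - m) equals e^z / sum e^z,
    # but the shifted form cannot overflow for large logits.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]

# Adding the same constant to every logit leaves the probabilities unchanged,
# and the shifted implementation handles values that would overflow e^z.
shifted = softmax([1000.0, 999.0, 998.1])
print([round(p, 3) for p in shifted])
```

Because only differences between logits matter, the second call gives the same probabilities as the first, up to floating-point rounding.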
The Temperature Parameter
In language models, you’ll see softmax with a temperature T:
softmax(z/T)ᵢ = e^(zᵢ/T) / Σⱼ e^(zⱼ/T)
- T = 1.0: standard softmax.
- T < 1.0: sharper distribution (more confident, more predictable outputs).
- T > 1.0: flatter distribution (more random, more creative outputs).
When you use “temperature” in a language model API, this is the parameter you’re controlling.
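To see the effect, the sketch below applies temperature to the same logits used in the softmax example. The function name `softmax_with_temperature` and the specific temperatures are illustrative; real APIs may apply temperature alongside other sampling settings.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature. Temperature must be positive."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
results = {}
for t in (0.5, 1.0, 2.0):
    results[t] = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in results[t]])
```

The top class gets the largest share at T = 0.5 and the smallest at T = 2.0, while the ranking of the classes stays the same at every temperature.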
Cross-Entropy Loss
How do you measure how wrong a predicted probability distribution is? With cross-entropy.
For a single example with true label y (one-hot encoded — a vector of zeros with a single 1 at the correct class) and predicted probabilities ŷ:
L = -Σᵢ yᵢ · log(ŷᵢ)
Because y is one-hot (only one class is correct), this simplifies to:
L = -log(ŷ_correct)
Where ŷ_correct is the predicted probability assigned to the true class.
Throughout this series, log means the natural logarithm (base e), which is the standard in machine learning.
Worked example:
True class: cat (index 1)
Predictions: [0.10, 0.70, 0.20]
L = -log(0.70) ≈ 0.357
If the model had been confident and wrong:
Predictions: [0.05, 0.10, 0.85] (cat is index 1, model predicts index 2)
L = -log(0.10) ≈ 2.303 ← much higher loss
Cross-entropy loss grows sharply as the predicted probability of the correct class approaches zero. This is what motivates the model to assign high probability to correct classes.
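Both worked examples can be reproduced with a short function. This sketch implements the general sum and the one-hot shortcut; the small `eps` clamp is an added safeguard against log(0), an assumption of this example rather than part of the definition.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """General form: -sum_i y_i * log(y_pred_i), using the natural log."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def cross_entropy_from_index(correct_index, y_pred, eps=1e-12):
    """One-hot shortcut: -log of the probability given to the true class."""
    return -math.log(max(y_pred[correct_index], eps))

cat = [0, 1, 0]                      # one-hot label: cat is index 1
good = [0.10, 0.70, 0.20]            # mostly right
bad = [0.05, 0.10, 0.85]             # confident and wrong

print(round(cross_entropy(cat, good), 3))            # 0.357
print(round(cross_entropy(cat, bad), 3))             # 2.303
print(round(cross_entropy_from_index(1, good), 3))   # 0.357
```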
Why log?
The logarithm has two key properties: log(1) = 0, so a perfect prediction gives zero loss, and log(x) → -∞ as x → 0, so the loss -log(x) grows without bound as the probability assigned to the true class approaches zero. It also has nice mathematical properties that make the gradients clean.
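One way to see the "clean gradients" point: when cross-entropy is applied to softmax outputs, the gradient of the loss with respect to each logit works out to ŷᵢ - yᵢ. The sketch below checks that standard result numerically with finite differences. The logits, the true class, and the step size `h` are illustrative choices.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def loss(logits, correct_index):
    """Cross-entropy of softmax(logits) against a one-hot label."""
    return -math.log(softmax(logits)[correct_index])

logits = [2.0, 1.0, 0.1]
correct_index = 1
one_hot = [1.0 if i == correct_index else 0.0 for i in range(len(logits))]

# Analytic gradient with respect to each logit: predicted minus true.
analytic = [p - y for p, y in zip(softmax(logits), one_hot)]

# Central finite differences as an independent numerical check.
h = 1e-6
numeric = []
for i in range(len(logits)):
    up = logits.copy()
    down = logits.copy()
    up[i] += h
    down[i] -= h
    numeric.append((loss(up, correct_index) - loss(down, correct_index)) / (2 * h))

print([round(g, 4) for g in analytic])
print([round(g, 4) for g in numeric])   # should match the line above
```

The gradient is negative for the true class (push its logit up) and positive for the others (push them down), with a size proportional to how wrong each probability is.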
Putting It Together: The Classification Pipeline
Input → Neural Network → Logits → Softmax → Probabilities → Cross-Entropy Loss
Step by step for a 3-class classifier:
input = [0.5, 1.2] # some input features
W = [[0.3, -0.1], # weight matrix (3×2)
[0.2, 0.4],
[-0.5, 0.7]]
b = [0.1, 0.0, -0.2] # biases
logits = W @ input + b
= [0.3*0.5 + (-0.1)*1.2 + 0.1,
0.2*0.5 + 0.4*1.2 + 0.0,
-0.5*0.5 + 0.7*1.2 + (-0.2)]
= [0.13, 0.58, 0.39]
# softmax:
e^0.13 ≈ 1.139, e^0.58 ≈ 1.786, e^0.39 ≈ 1.477
sum ≈ 4.402
probs = [1.139/4.402, 1.786/4.402, 1.477/4.402]
≈ [0.259, 0.406, 0.336]
true class: index 1 (probability = 0.406)
loss = -log(0.406) ≈ 0.902
Training reduces this loss by adjusting W and b using gradient descent.
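The whole pipeline fits in a short script. The sketch below reproduces the forward pass from the worked example in plain Python and then takes a single gradient descent step to show the loss going down. The learning rate of 0.1 is an arbitrary illustrative value, and the update uses the standard softmax plus cross-entropy gradient (predicted minus one-hot) rather than anything specific to this article.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, b, x):
    """Return (logits, probabilities) for weight matrix W, biases b, input x."""
    logits = [sum(w * xj for w, xj in zip(row, x)) + bi for row, bi in zip(W, b)]
    return logits, softmax(logits)

x = [0.5, 1.2]                                   # input features
W = [[0.3, -0.1], [0.2, 0.4], [-0.5, 0.7]]       # weight matrix (3x2)
b = [0.1, 0.0, -0.2]                             # biases
true_index = 1

logits, probs = forward(W, b, x)
loss_before = -math.log(probs[true_index])
print([round(z, 2) for z in logits])  # [0.13, 0.58, 0.39]
print([round(p, 3) for p in probs])   # [0.259, 0.406, 0.336]
print(round(loss_before, 3))          # 0.902

# One gradient descent step. Gradient of the loss w.r.t. each logit is
# (probability - one_hot); the chain rule gives the W and b gradients.
learning_rate = 0.1  # illustrative value, not from the text
grad_logits = [p - (1.0 if i == true_index else 0.0) for i, p in enumerate(probs)]
W_new = [[w - learning_rate * g * xj for w, xj in zip(row, x)]
         for row, g in zip(W, grad_logits)]
b_new = [bi - learning_rate * g for bi, g in zip(b, grad_logits)]

_, probs_after = forward(W_new, b_new, x)
loss_after = -math.log(probs_after[true_index])
print(loss_after < loss_before)       # True
```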
Key Takeaways
- A probability distribution describes how likely each possible outcome is. Probabilities sum to 1.
- The Gaussian (normal) distribution is defined by mean μ and standard deviation σ. It appears everywhere because of the Central Limit Theorem.
- Softmax converts raw logits to probabilities: e^zᵢ / Σⱼ e^zⱼ. Always positive, always sums to 1.
- Temperature controls how sharp or flat the softmax output is.
- Cross-entropy loss = -log(ŷ_correct) measures how wrong a prediction is. High when the model assigns low probability to the true class.
With vectors, derivatives, and probability in hand, you’re ready to build an actual neural network from scratch. That’s next.