Before transformers, recurrent neural networks processed sequences one token at a time. To remember something from 100 words ago, the information had to survive through 100 sequential updates — often fading or being overwritten. The attention mechanism solves this by allowing every position in the sequence to directly attend to every other position, in a single step.
The math behind this is elegant and entirely built from tools you already have: matrix multiplication, dot products, and softmax.
The Problem: Context-Dependent Meaning
The word “bank” means different things in:
"I deposited money in the bank" → financial institution
"I sat by the river bank" → side of a river
For a language model to process “bank” correctly, it needs to know which other words in the sentence are relevant. That’s what attention computes.
The Library Analogy
Imagine a library system:
- You arrive with a query — what you’re looking for.
- Every book has a key — a label describing its contents.
- Each book also has a value — the actual content.
The system compares your query to every key, figures out which books are most relevant, and returns a weighted blend of those books’ values.
Attention works the same way, with a query, a key, and a value vector for each position in the sequence:
- Q (Query): what this position is looking for.
- K (Key): what each position offers.
- V (Value): the actual content each position contributes.
Scaled Dot-Product Attention
The full formula:
Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V
Where dₖ is the dimension of the key vectors.
This looks concise but unpacks into four meaningful steps:
Step 1: Compute attention scores
scores = Q @ Kᵀ
Each entry scores[i, j] is the dot product of query i with key j — how relevant position j is to position i. Shape: (seq_len, seq_len).
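As a rough illustration of Step 1, the sketch below builds small random Q and K matrices with NumPy and checks that entry [i, j] of the score matrix really is the dot product of query i with key j. The sizes, the seed, and the variable names are arbitrary choices for this example, not anything fixed by the formula.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
seq_len, d_k = 3, 4             # toy sizes chosen for illustration

Q = rng.normal(size=(seq_len, d_k))  # one query vector per position
K = rng.normal(size=(seq_len, d_k))  # one key vector per position

scores = Q @ K.T  # shape (seq_len, seq_len)

# Entry [i, j] is the dot product of query i with key j.
i, j = 0, 2
print(scores.shape)
print(np.isclose(scores[i, j], np.dot(Q[i], K[j])))
```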
Step 2: Scale
scores = scores / √dₖ
Why scale? If the components of the query and key vectors are independent with mean 0 and variance 1, their dot product has variance dₖ, so typical scores grow like √dₖ. Without scaling, a large dₖ pushes the scores into the “saturated” region of softmax, where one weight is close to 1, the rest are close to 0, and the gradients are near zero. Dividing by √dₖ brings the variance back to about 1 regardless of dimension.
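The variance claim can be checked empirically. The sketch below assumes query and key components drawn independently from a standard normal distribution (an assumption made for this experiment, not a property of real trained models) and compares the variance of the raw dot products with the variance after dividing by √dₖ. The helper name dot_product_variances and the sample count are made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000  # number of random (query, key) pairs per dimension

def dot_product_variances(d_k):
    """Return (variance of raw dot products, variance after dividing by sqrt(d_k))."""
    q = rng.normal(size=(n_samples, d_k))  # components with mean 0, variance 1
    k = rng.normal(size=(n_samples, d_k))
    dots = np.sum(q * k, axis=1)           # one dot product per (q, k) pair
    return dots.var(), (dots / np.sqrt(d_k)).var()

for d_k in (4, 64, 256):
    raw, scaled = dot_product_variances(d_k)
    print(f"d_k={d_k:4d}  raw variance ~ {raw:6.1f}  scaled variance ~ {scaled:.2f}")
```

Under these assumptions the raw variance should track dₖ, while the scaled variance should stay close to 1.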
Step 3: Softmax to get weights
weights = softmax(scores, applied row by row)
Each row of weights is a probability distribution over all positions. weights[i, j] is how much position i should attend to position j. Each row sums to 1.
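A minimal row-wise softmax might look like the sketch below. Subtracting each row's maximum before exponentiating is a common numerical-stability trick that is not part of the formula above; it leaves the result unchanged because softmax is invariant to adding a constant to every entry in a row. The function name softmax_rows and the sample scores are illustrative.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax; subtracting each row's max avoids overflow in exp."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=-1, keepdims=True)

scores = np.array([[2.0, 1.0, 0.0],
                   [0.0, 0.0, 0.0],            # equal scores give uniform weights
                   [1000.0, 999.0, 998.0]])    # a naive exp(1000) would overflow

weights = softmax_rows(scores)
print(np.round(weights, 3))
print(weights.sum(axis=-1))  # every row sums to 1 (up to floating-point error)
```

Rows 0 and 2 come out identical, since their entries differ only by a constant shift.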
Step 4: Weighted sum of values
output = weights @ V
Position i’s output is a weighted average of all value vectors, weighted by how much i should attend to each position.
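Put together, the four steps fit in a few lines of NumPy. This is a sketch rather than a production implementation: the function name, the random inputs, and the toy sizes are assumptions for the example, and the max subtraction is again only there for numerical stability.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T                      # Step 1: (seq_len, seq_len) relevance scores
    scores = scores / np.sqrt(d_k)        # Step 2: scale by sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)     # stability shift; softmax unchanged
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # Step 3: row-wise softmax
    return weights @ V, weights           # Step 4: weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # 5 positions, d_k = 8 (arbitrary toy sizes)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 16))  # value vectors may have a different width than keys

output, weights = scaled_dot_product_attention(Q, K, V)
print(output.shape, weights.shape)  # (5, 16) (5, 5)
```

Because each output row is a weighted average, every component of the output lies between the smallest and largest value of that component across the value vectors.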
A Worked Numerical Example
Sequence: “the cat sat” — 3 tokens. Embedding dimension dₖ = 4.
Input embeddings (3 × 4):
X = [[0.9, 0.3, 0.1, 0.5], # "the"
[0.1, 0.8, 0.4, 0.2], # "cat"
[0.6, 0.1, 0.9, 0.3]] # "sat"
In practice, Q, K, V are computed by multiplying X by learned weight matrices Wq, Wk, Wv. For simplicity, let’s use Q = K = V = X (which is what happens when those weight matrices are identity matrices).
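To make the simplification concrete, the sketch below projects X with identity matrices and confirms that Q, K, and V all equal X. The second half uses a random matrix as a stand-in for a learned Wq, purely to show that the shapes work out the same way; the names Wq_random and Q_random are hypothetical and the values carry no meaning.

```python
import numpy as np

X = np.array([[0.9, 0.3, 0.1, 0.5],   # "the"
              [0.1, 0.8, 0.4, 0.2],   # "cat"
              [0.6, 0.1, 0.9, 0.3]])  # "sat"

d_model = X.shape[1]

# Identity projections reproduce the simplification used in this example.
Wq = Wk = Wv = np.eye(d_model)
Q, K, V = X @ Wq, X @ Wk, X @ Wv
print(np.allclose(Q, X), np.allclose(K, X), np.allclose(V, X))

# In a trained model the matrices differ; a random stand-in shows the shapes only.
rng = np.random.default_rng(0)
Wq_random = rng.normal(size=(d_model, d_model))
Q_random = X @ Wq_random  # still one query vector per token
print(Q_random.shape)
```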
Step 1: Scores = Q @ Kᵀ
Kᵀ (transposed, shape 4×3):
[[0.9, 0.1, 0.6],
[0.3, 0.8, 0.1],
[0.1, 0.4, 0.9],
[0.5, 0.2, 0.3]]
Scores = X @ Kᵀ (shape 3×3):
scores[0,0] = 0.9×0.9 + 0.3×0.3 + 0.1×0.1 + 0.5×0.5 = 0.81+0.09+0.01+0.25 = 1.16
scores[0,1] = 0.9×0.1 + 0.3×0.8 + 0.1×0.4 + 0.5×0.2 = 0.09+0.24+0.04+0.10 = 0.47
scores[0,2] = 0.9×0.6 + 0.3×0.1 + 0.1×0.9 + 0.5×0.3 = 0.54+0.03+0.09+0.15 = 0.81
scores[1,0] = 0.1×0.9 + 0.8×0.3 + 0.4×0.1 + 0.2×0.5 = 0.09+0.24+0.04+0.10 = 0.47
scores[1,1] = 0.1×0.1 + 0.8×0.8 + 0.4×0.4 + 0.2×0.2 = 0.01+0.64+0.16+0.04 = 0.85
scores[1,2] = 0.1×0.6 + 0.8×0.1 + 0.4×0.9 + 0.2×0.3 = 0.06+0.08+0.36+0.06 = 0.56
scores[2,0] = 0.6×0.9 + 0.1×0.3 + 0.9×0.1 + 0.3×0.5 = 0.54+0.03+0.09+0.15 = 0.81
scores[2,1] = 0.6×0.1 + 0.1×0.8 + 0.9×0.4 + 0.3×0.2 = 0.06+0.08+0.36+0.06 = 0.56
scores[2,2] = 0.6×0.6 + 0.1×0.1 + 0.9×0.9 + 0.3×0.3 = 0.36+0.01+0.81+0.09 = 1.27
Scores:
[[1.16, 0.47, 0.81],
[0.47, 0.85, 0.56],
[0.81, 0.56, 1.27]]
Step 2: Scale by √dₖ = √4 = 2
Scaled scores:
[[0.58, 0.235, 0.405],
[0.235, 0.425, 0.28],
[0.405, 0.28, 0.635]]
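The hand calculation for Steps 1 and 2 can be double-checked with NumPy. The sketch below uses the same X and the same Q = K = X simplification; only the variable names are new.

```python
import numpy as np

X = np.array([[0.9, 0.3, 0.1, 0.5],   # "the"
              [0.1, 0.8, 0.4, 0.2],   # "cat"
              [0.6, 0.1, 0.9, 0.3]])  # "sat"
Q = K = X

scores = Q @ K.T                        # Step 1
scaled = scores / np.sqrt(K.shape[1])   # Step 2: sqrt(4) = 2

print(np.round(scores, 2))
print(np.round(scaled, 3))
print(np.allclose(scores, scores.T))  # symmetric, because Q and K are the same matrix here
```

The symmetry is a side effect of setting Q = K. With separate learned projections, the score for (i, j) generally differs from the score for (j, i).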
Step 3: Softmax (row by row)
For row 0: [0.58, 0.235, 0.405]
e^0.58 = 1.786, e^0.235 = 1.265, e^0.405 = 1.500
sum = 4.551
weights[0] = [1.786/4.551, 1.265/4.551, 1.500/4.551]
= [0.393, 0.278, 0.330]
For row 1: [0.235, 0.425, 0.28]
e^0.235 = 1.265, e^0.425 = 1.530, e^0.28 = 1.323
sum = 4.118
weights[1] = [0.307, 0.371, 0.321]
For row 2: [0.405, 0.28, 0.635]
e^0.405 = 1.500, e^0.28 = 1.323, e^0.635 = 1.887
sum = 4.710
weights[2] = [0.318, 0.281, 0.401]
Attention weights:
[[0.393, 0.278, 0.330], # "the" attends to: the(39%), cat(28%), sat(33%)
[0.307, 0.371, 0.321], # "cat" attends to: the(31%), cat(37%), sat(32%)
[0.318, 0.281, 0.401]] # "sat" attends to: the(32%), cat(28%), sat(40%)
In this example, each position attends most strongly to itself (the diagonal), but it also gathers information from the other positions.
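The softmax step can be reproduced from the scaled scores as follows. Since the hand calculation rounds the exponentials to three decimals, a full-precision run may differ in the last digit (for instance, the third weight in row 0 comes out very close to 0.3295, which can round either way).

```python
import numpy as np

scaled = np.array([[0.58, 0.235, 0.405],
                   [0.235, 0.425, 0.28],
                   [0.405, 0.28, 0.635]])  # scaled scores from Step 2

exps = np.exp(scaled)
weights = exps / exps.sum(axis=1, keepdims=True)  # normalize each row to sum to 1

print(np.round(weights, 3))
print(weights.argmax(axis=1))  # [0 1 2]: the largest weight in each row is on the diagonal
```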
Step 4: Output = weights @ V
Using V = X:
output[0] = 0.393×[0.9,0.3,0.1,0.5] + 0.278×[0.1,0.8,0.4,0.2] + 0.330×[0.6,0.1,0.9,0.3]
= [0.354,0.118,0.039,0.197] + [0.028,0.222,0.111,0.056] + [0.198,0.033,0.297,0.099]
= [0.580, 0.373, 0.447, 0.352]
output[0] is now a blend of all three value vectors, weighted by how relevant each token is to “the”. The embedding for “the” has been enriched with contextual information from the whole sentence.
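The same calculation for all three tokens takes only a few lines. The sketch below recomputes everything from X at full precision, so its first row may differ from the hand-rounded output[0] above by about 0.001 in some components.

```python
import numpy as np

X = np.array([[0.9, 0.3, 0.1, 0.5],   # "the"
              [0.1, 0.8, 0.4, 0.2],   # "cat"
              [0.6, 0.1, 0.9, 0.3]])  # "sat"
Q = K = V = X

scaled = Q @ K.T / np.sqrt(K.shape[1])            # Steps 1 and 2
exps = np.exp(scaled)
weights = exps / exps.sum(axis=1, keepdims=True)  # Step 3
output = weights @ V                              # Step 4: shape (3, 4)

for token, row in zip(["the", "cat", "sat"], output):
    print(f"{token}: {np.round(row, 3)}")
```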
Causal Masking (GPT-style)
GPT generates text left to right. When predicting the next token, it must not be able to see future tokens. This is enforced with a causal mask: before the softmax, positions are blocked from attending to future positions by adding -∞ to those scores.
Masked scores for sequence "the cat sat":
[[0.58, -inf, -inf], # "the" can only see itself
[0.235, 0.425, -inf], # "cat" can see "the" and itself
[0.405, 0.28, 0.635]] # "sat" can see all three
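Because e^(-inf) = 0, softmax assigns those positions a weight of exactly zero, and the remaining weights in each row still sum to 1. No information from future tokens can reach the current position through attention.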
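One way to build and apply such a mask in NumPy is sketched below, reusing the scaled scores from the worked example. Using np.triu to mark the future positions is an implementation choice for this illustration; real frameworks handle masking in their own ways.

```python
import numpy as np

scaled = np.array([[0.58, 0.235, 0.405],
                   [0.235, 0.425, 0.28],
                   [0.405, 0.28, 0.635]])  # scaled scores from the worked example

seq_len = scaled.shape[0]
# True above the diagonal marks "future" positions (column j > row i).
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
masked = np.where(future, -np.inf, scaled)

# Row-wise softmax; exp(-inf) evaluates to 0, so masked positions get zero weight.
shifted = masked - masked.max(axis=1, keepdims=True)
exps = np.exp(shifted)
weights = exps / exps.sum(axis=1, keepdims=True)

print(np.round(weights, 3))
```

The first row puts all of its weight on "the", and the last row matches the unmasked weights for "sat", since nothing is hidden from the final position.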
Multi-Head Attention
A single attention computation produces one set of weights per position, so it tends to capture one type of relationship at a time. Multi-head attention runs h parallel attention computations (“heads”), each with its own learned Wq, Wk, Wv matrices, then concatenates and projects the results.
head_i = Attention(X·Wqᵢ, X·Wkᵢ, X·Wvᵢ)
MultiHead(X) = Concat(head_1, ..., head_h) · Wo
Different heads learn to attend to different types of relationships simultaneously:
- One head might focus on syntactic dependencies.
- Another on semantic similarity.
- Another on positional proximity.
GPT-2 (small) uses 12 heads, each of dimension 64, for a total model dimension of 12 × 64 = 768.
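The two formulas above can be sketched in NumPy as follows. The head count and head size are the GPT-2 (small) figures quoted above, but everything else is an assumption for illustration: the weights are random stand-ins rather than trained values, the function names are made up, and causal masking is left out to keep the example short.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # stability shift
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Wq, Wk, Wv: one (d_model, d_head) matrix per head; Wo: (h * d_head, d_model)."""
    heads = [attention(X @ wq, X @ wk, X @ wv) for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo  # concatenate heads, then project

seq_len, h, d_head = 5, 12, 64
d_model = h * d_head  # 12 * 64 = 768

rng = np.random.default_rng(0)
scale = 1 / np.sqrt(d_model)  # keeps the random projections at a moderate magnitude
X = rng.normal(size=(seq_len, d_model))
Wq = [rng.normal(size=(d_model, d_head)) * scale for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_head)) * scale for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_head)) * scale for _ in range(h)]
Wo = rng.normal(size=(d_model, d_model)) * scale

out = multi_head_attention(X, Wq, Wk, Wv, Wo)
print(out.shape)  # (5, 768)
```

Each head works in a 64-dimensional subspace, and the concatenation restores the full 768 dimensions before the final projection by Wo.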
Key Takeaways
- Attention computes a context-aware representation of each position by attending to every position in the sequence, including itself.
- Q (query), K (key), V (value) are linear projections of the input embeddings.
- Scaled dot-product attention: softmax(QKᵀ / √dₖ) · V.
- Scaling by √dₖ prevents the dot products from growing too large and saturating softmax.
- Causal masking prevents future positions from influencing past predictions, which is essential for autoregressive text generation.
- Multi-head attention runs h attention heads in parallel, each learning different relationships.
The final post puts everything together into the full transformer architecture — and walks through exactly what GPT does on every forward pass.