The Attention Mechanism: How Transformers Focus

Before transformers, recurrent neural networks processed sequences one token at a time. To remember something from 100 words ago, the information had to survive through 100 sequential updates — often fading or being overwritten. The attention mechanism solves this by allowing every position in the sequence to directly attend to every other position, in a single step.

The math behind this is elegant and entirely built from tools you already have: matrix multiplication, dot products, and softmax.

The Problem: Context-Dependent Meaning

The word “bank” means different things in:

"I deposited money in the bank"     → financial institution
"I sat by the river bank"           → side of a river

For a language model to process “bank” correctly, it needs to know which other words in the sentence are relevant. That’s what attention computes.

The Library Analogy

Imagine a library system:

You arrive with a query — what you’re looking for.
Every book has a key — a label describing its contents.
Each book also has a value — the actual content.

The system compares your query to every key, figures out which books are most relevant, and returns a weighted blend of those books’ values.

Attention works exactly the same way:

Q (Query): what this position is looking for.
K (Key): what each position offers.
V (Value): the actual content each position contributes.

Scaled Dot-Product Attention

The full formula:

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

Where dₖ is the dimension of the key vectors.

This looks concise but unpacks into four meaningful steps:

Step 1: Compute attention scores

scores = Q @ Kᵀ

Each entry scores[i, j] is the dot product of query i with key j — how relevant position j is to position i. Shape: (seq_len, seq_len).

Step 2: Scale

scores = scores / √dₖ

Why scale? The dot product of two dₖ-dimensional vectors has variance that grows with dₖ. Without scaling, large dₖ values push the dot products into regions where softmax produces near-zero gradients (the “saturated” region). Dividing by √dₖ keeps the variance stable regardless of dimension.

Step 3: Softmax to get weights

weights = softmax(scores, applied row by row)

Each row of weights is a probability distribution over all positions. weights[i, j] is how much position i should attend to position j. Each row sums to 1.

Step 4: Weighted sum of values

output = weights @ V

Position i’s output is a weighted average of all value vectors, weighted by how much i should attend to each position.

A Worked Numerical Example

Sequence: “the cat sat” — 3 tokens. Embedding dimension dₖ = 4.

Input embeddings (3 × 4):

X = [[0.9, 0.3, 0.1, 0.5],    # "the"
     [0.1, 0.8, 0.4, 0.2],    # "cat"
     [0.6, 0.1, 0.9, 0.3]]    # "sat"

In practice, Q, K, V are computed by multiplying X by learned weight matrices Wq, Wk, Wv. For simplicity, let’s use Q = K = V = X (which is what happens when those weight matrices are identity matrices).

Step 1: Scores = Q @ Kᵀ

Kᵀ (transposed, shape 4×3):
[[0.9, 0.1, 0.6],
 [0.3, 0.8, 0.1],
 [0.1, 0.4, 0.9],
 [0.5, 0.2, 0.3]]

Scores = X @ Kᵀ (shape 3×3):

scores[0,0] = 0.9×0.9 + 0.3×0.3 + 0.1×0.1 + 0.5×0.5 = 0.81+0.09+0.01+0.25 = 1.16
scores[0,1] = 0.9×0.1 + 0.3×0.8 + 0.1×0.4 + 0.5×0.2 = 0.09+0.24+0.04+0.10 = 0.47
scores[0,2] = 0.9×0.6 + 0.3×0.1 + 0.1×0.9 + 0.5×0.3 = 0.54+0.03+0.09+0.15 = 0.81

scores[1,0] = 0.1×0.9 + 0.8×0.3 + 0.4×0.1 + 0.2×0.5 = 0.09+0.24+0.04+0.10 = 0.47
scores[1,1] = 0.1×0.1 + 0.8×0.8 + 0.4×0.4 + 0.2×0.2 = 0.01+0.64+0.16+0.04 = 0.85
scores[1,2] = 0.1×0.6 + 0.8×0.1 + 0.4×0.9 + 0.2×0.3 = 0.06+0.08+0.36+0.06 = 0.56

scores[2,0] = 0.6×0.9 + 0.1×0.3 + 0.9×0.1 + 0.3×0.5 = 0.54+0.03+0.09+0.15 = 0.81
scores[2,1] = 0.6×0.1 + 0.1×0.8 + 0.9×0.4 + 0.3×0.2 = 0.06+0.08+0.36+0.06 = 0.56
scores[2,2] = 0.6×0.6 + 0.1×0.1 + 0.9×0.9 + 0.3×0.3 = 0.36+0.01+0.81+0.09 = 1.27

Scores:
[[1.16, 0.47, 0.81],
 [0.47, 0.85, 0.56],
 [0.81, 0.56, 1.27]]

Step 2: Scale by √dₖ = √4 = 2

Scaled scores:
[[0.58, 0.235, 0.405],
 [0.235, 0.425, 0.28],
 [0.405, 0.28, 0.635]]

Step 3: Softmax (row by row)

For row 0: [0.58, 0.235, 0.405]

e^0.58 = 1.786, e^0.235 = 1.265, e^0.405 = 1.500
sum = 4.551

weights[0] = [1.786/4.551, 1.265/4.551, 1.500/4.551]
           = [0.393, 0.278, 0.330]

For row 1: [0.235, 0.425, 0.28]

e^0.235 = 1.265, e^0.425 = 1.530, e^0.28 = 1.323
sum = 4.118

weights[1] = [0.307, 0.371, 0.321]

For row 2: [0.405, 0.28, 0.635]

e^0.405 = 1.500, e^0.28 = 1.323, e^0.635 = 1.887
sum = 4.710

weights[2] = [0.318, 0.281, 0.401]

Attention weights:
[[0.393, 0.278, 0.330],   "the" attends to: the(39%), cat(28%), sat(33%)
 [0.307, 0.371, 0.321],   "cat" attends to: the(31%), cat(37%), sat(32%)
 [0.318, 0.281, 0.401]]   "sat" attends to: the(32%), cat(28%), sat(40%)

Each position attends most strongly to itself (the diagonal), but also gathers information from other positions.

Step 4: Output = weights @ V

Using V = X:

output[0] = 0.393×[0.9,0.3,0.1,0.5] + 0.278×[0.1,0.8,0.4,0.2] + 0.330×[0.6,0.1,0.9,0.3]
          = [0.354,0.118,0.039,0.197] + [0.028,0.222,0.111,0.056] + [0.198,0.033,0.297,0.099]
          = [0.580, 0.373, 0.447, 0.352]

output[0] is now a blend of all three value vectors, weighted by how relevant each token is to “the”. The embedding for “the” has been enriched with contextual information from the whole sentence.

Causal Masking (GPT-style)

GPT generates text left to right. When predicting the next token, it must not be able to see future tokens. This is enforced with a causal mask: before the softmax, positions are blocked from attending to future positions by adding -∞ to those scores.

Masked scores for sequence "the cat sat":

[[0.58,  -inf, -inf],   "the" can only see itself
 [0.235, 0.425, -inf],  "cat" can see "the" and itself
 [0.405, 0.28,  0.635]] "sat" can see all three

softmax(-inf) = 0, so those positions get zero weight. The model genuinely cannot leak information from the future.

Multi-Head Attention

Running attention once captures one type of relationship. Multi-head attention runs h parallel attention computations (“heads”), each with its own learned Wq, Wk, Wv matrices, then concatenates and projects the results.

head_i = Attention(X·Wqᵢ, X·Wkᵢ, X·Wvᵢ)

MultiHead(X) = Concat(head_1, ..., head_h) · Wo

Different heads learn to attend to different types of relationships simultaneously:

One head might focus on syntactic dependencies.
Another on semantic similarity.
Another on positional proximity.

GPT-2 (small) uses 12 heads each of dimension 64, for a total dimension of 768.

Key Takeaways

Attention computes a context-aware representation of each position by attending to all other positions.
Q (query), K (key), V (value) are linear projections of the input embeddings.
Scaled dot-product attention: softmax(QKᵀ / √dₖ) · V.
Scaling by √dₖ prevents the dot products from growing too large and saturating softmax.
Causal masking prevents future positions from influencing past predictions — essential for autoregressive text generation.
Multi-head attention runs h attention heads in parallel, each learning different relationships.

The final post puts everything together into the full transformer architecture — and walks through exactly what GPT does on every forward pass.