The Transformer and GPT: Putting It All Together

You now have every mathematical tool needed to understand GPT. Vectors and matrices (post 1) move data. Derivatives and gradients (post 2) drive learning. Probability and softmax (post 3) produce predictions. Layers with activations (post 4) learn nonlinear functions. Backpropagation (post 5) trains all the weights. Embeddings (post 6) represent words. Attention (post 7) builds contextual representations.

This post assembles these pieces into the transformer architecture and walks through a complete forward pass.

The Transformer Block

The core unit of a transformer is the transformer block (also called a transformer layer). GPT stacks many of these in sequence.

Each block has two sublayers:

Multi-Head Self-Attention — lets positions attend to each other.
Feed-Forward Network — processes each position independently with a two-layer MLP.

Each sublayer is wrapped with two important additions:

A residual connection (add the input to the output).
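Layer normalisation (applied before each sublayer, the "pre-LN" arrangement GPT-2 uses; the original transformer applied it after the residual addition).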

The full computation for one block:

# Sub-layer 1: Multi-Head Attention
X' = X + MultiHeadAttention(LayerNorm(X))

# Sub-layer 2: Feed-Forward Network
X'' = X' + FFN(LayerNorm(X'))

Layer Normalisation

Layer normalisation stabilises training by rescaling each token’s activation vector to zero mean and unit variance, so every sublayer sees inputs on a consistent scale regardless of the scale of the previous layer’s outputs.

For a vector x:

μ = mean(x)           # mean of all elements
σ = std(x)            # standard deviation

x_norm = (x - μ) / σ  # normalise

output = γ · x_norm + β   # learnable scale γ and shift β

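γ and β are learned parameters (each a vector of length d_model). After normalising, the network can still learn to rescale and shift the distribution as needed. Implementations typically add a small constant ε under the square root, dividing by sqrt(σ² + ε), so that a constant vector does not cause a division by zero.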

Worked example (4-dimensional vector):

x = [0.5, -0.3, 1.2, 0.8]

μ = (0.5 - 0.3 + 1.2 + 0.8) / 4 = 2.2 / 4 = 0.55
σ = sqrt(((0.5-0.55)² + (-0.3-0.55)² + (1.2-0.55)² + (0.8-0.55)²) / 4)
  = sqrt((0.0025 + 0.7225 + 0.4225 + 0.0625) / 4)
  = sqrt(0.3025) = 0.55

x_norm = [(0.5-0.55)/0.55, (-0.3-0.55)/0.55, (1.2-0.55)/0.55, (0.8-0.55)/0.55]
       = [-0.091, -1.545, 1.182, 0.455]

# With γ = [1,1,1,1] and β = [0,0,0,0], output = x_norm
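The worked example can be checked numerically. This is a minimal NumPy sketch, not a library implementation: the function name `layer_norm` and the default `eps=1e-5` are assumptions made for illustration, and the tiny ε is the only difference from the hand calculation.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalise the last axis to zero mean and unit variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)        # population variance (divides by n)
    x_norm = (x - mu) / np.sqrt(var + eps)     # eps guards against division by zero
    return gamma * x_norm + beta

x = np.array([0.5, -0.3, 1.2, 0.8])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(np.round(out, 3))   # [-0.091 -1.545  1.182  0.455]
```

Because the function normalises along the last axis, the same code works unchanged on a whole (seq_len, d_model) matrix, normalising each token’s row separately.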

Without LayerNorm, deep networks become difficult to train because small changes in early layers compound through many subsequent layers, causing exploding or vanishing gradients.

Residual Connections

A residual connection adds the sublayer’s input directly to its output:

output = x + Sublayer(x)

This looks simple but has a profound effect: during backpropagation, gradients can flow directly from the output back to any earlier layer without being diminished by each intermediate layer. This is why transformers with dozens of layers can be trained at all.

Without residuals, training a 96-layer transformer (like GPT-3) would be extremely difficult: gradients would tend to vanish long before reaching the early layers.
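The gradient claim follows from one application of the chain rule. As a short derivation (writing the sublayer as F and the loss as a calligraphic L, with gradients treated as row vectors, all notational assumptions of this sketch):

```latex
y = x + F(x)
\quad\Longrightarrow\quad
\frac{\partial \mathcal{L}}{\partial x}
  = \frac{\partial \mathcal{L}}{\partial y}
    \left( I + \frac{\partial F}{\partial x} \right)
  = \underbrace{\frac{\partial \mathcal{L}}{\partial y}}_{\text{passes through unchanged}}
  + \frac{\partial \mathcal{L}}{\partial y}\,\frac{\partial F}{\partial x}
```

The identity term means the upstream gradient always reaches x intact, even when the sublayer’s own Jacobian ∂F/∂x is very small. Stacking many blocks repeats this at every layer.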

The Feed-Forward Network

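The FFN sublayer is a standard two-layer MLP applied independently to each position’s vector (shown here with ReLU, as in the original transformer; GPT-2 uses the smoother GELU activation in the same place):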

FFN(x) = ReLU(x · W₁ + b₁) · W₂ + b₂

(This post uses the row-vector convention x · W, common in transformer papers. It’s equivalent to Wᵀ @ x from the earlier posts — just with the matrix transposed.)

W₁ has shape (d_model, d_ff) — expands to a larger dimension.
W₂ has shape (d_ff, d_model) — projects back.
d_ff is typically 4 × d_model. In GPT-2 small: d_model = 768, d_ff = 3072.

The FFN is applied position-wise: each token’s vector goes through the same MLP independently. It does not mix information across positions — that’s attention’s job.

The expansion to 4×d_model gives the network capacity to store complex transformations. Research suggests the FFN layers act as associative memories, storing factual knowledge in their weights.
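A small NumPy sketch makes the position-wise property concrete. The sizes are toy values chosen for readability and the weights are random stand-ins, not trained parameters; the point is only that running the whole sequence through the FFN gives the same row as running that one token alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 8, 32, 5      # toy sizes; GPT-2 Small uses 768 and 3072

W1 = rng.normal(0, 0.02, (d_model, d_ff))
b1 = np.zeros(d_ff)
W2 = rng.normal(0, 0.02, (d_ff, d_model))
b2 = np.zeros(d_model)

def ffn(x):
    hidden = np.maximum(0.0, x @ W1 + b1)   # expand to d_ff, then ReLU
    return hidden @ W2 + b2                 # project back to d_model

X = rng.normal(size=(seq_len, d_model))
out_all = ffn(X)        # every position at once
out_row2 = ffn(X[2])    # position 2 on its own

print(out_all.shape)                        # (5, 8)
print(np.allclose(out_all[2], out_row2))    # True: no mixing across positions
```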

Positional Encoding

Attention has no built-in sense of position: softmax(QKᵀ/√dₖ)V treats its input as an unordered set, so reordering the tokens simply reorders the output rows in the same way. Position is injected by adding a positional encoding to each token’s embedding before it enters the transformer stack.
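The lack of position awareness can be checked directly. The sketch below uses single-head, unmasked attention with random weights (all names and sizes are illustrative assumptions) and shows that shuffling the input rows just shuffles the output rows.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))                          # 4 tokens, d_model = 6
Wq, Wk, Wv = (rng.normal(size=(6, 6)) for _ in range(3))

perm = np.array([2, 0, 3, 1])                        # an arbitrary reordering
out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

print(np.allclose(out_perm, out[perm]))              # True
```

Each token ends up with the same vector wherever it sits in the sequence, which is why position has to be added to the input explicitly.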

The original transformer paper (Vaswani et al. 2017) uses sinusoidal functions:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Where pos is the token position and i indexes the sine/cosine dimension pairs (i = 0, ..., d_model/2 − 1).

Why sinusoids?

Each position gets a unique pattern of sine/cosine values.
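The model may generalise to sequence lengths not seen during training (this was the paper’s hypothesis rather than a guarantee).
For any fixed offset k, PE(pos + k) is a linear function of PE(pos), which makes relative positions easy for the model to represent.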
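The two formulas translate into a few lines of NumPy. This is a sketch that assumes d_model is even; the function name is illustrative.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]              # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2), one per sin/cos pair
    angles = pos / (10000 ** (2 * i / d_model))    # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

pe = sinusoidal_positional_encoding(max_len=50, d_model=16)
print(pe.shape)      # (50, 16)
print(pe[0, :4])     # [0. 1. 0. 1.] since sin(0) = 0 and cos(0) = 1
```

Low values of i give fast-oscillating dimensions and high values give slow ones, so each row combines many wavelengths into a pattern that differs from position to position.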

GPT-2 uses learned positional embeddings instead: a separate embedding matrix of shape (max_seq_len, d_model) where each row is a trainable vector. Both approaches work; learned embeddings are simpler.

The Full GPT Forward Pass

GPT is a decoder-only transformer. It takes a sequence of tokens and predicts the next token.

Let’s trace through GPT-2 Small (12 layers, 12 heads, d_model = 768):

Input: the sequence “The cat sat on the”

Step 1: Tokenise and embed

tokens = [464, 3797, 3332, 319, 262]   # BPE token ids
# 464="The", 3797=" cat", 3332=" sat", 319=" on", 262=" the"
# (GPT-2's BPE attaches the leading space to the token that follows it)

# Look up token embeddings:
E = embedding_matrix[tokens]           # shape (5, 768)

# Add positional embeddings:
P = positional_matrix[[0,1,2,3,4]]     # shape (5, 768)

X = E + P                              # shape (5, 768)
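To run Step 1 as real code, the two matrices need to exist. The sketch below fills them with random values as stand-ins for the trained weights (an assumption: a real model would load them from a checkpoint) and uses GPT-2 Small’s sizes so the shapes match the comments above.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 50257, 1024, 768   # GPT-2 Small sizes

# Random stand-ins for the trained matrices (float32 keeps memory modest)
embedding_matrix = rng.standard_normal((vocab_size, d_model), dtype=np.float32)
embedding_matrix *= 0.02
positional_matrix = rng.standard_normal((max_seq_len, d_model), dtype=np.float32)
positional_matrix *= 0.01

tokens = np.array([464, 3797, 3332, 319, 262])        # ids from the example
E = embedding_matrix[tokens]                          # row lookup, shape (5, 768)
P = positional_matrix[np.arange(len(tokens))]         # positions 0..4, shape (5, 768)
X = E + P

print(X.shape)                                        # (5, 768)
```

Embedding lookup is plain row indexing: no multiplication happens until the first transformer block.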

Step 2: Pass through 12 transformer blocks

For each block i in [0..11]:

# Attention sublayer
X_norm = LayerNorm(X)                   # shape (5, 768)
attn_out = MultiHeadAttention(X_norm)   # shape (5, 768)
X = X + attn_out                        # residual

# FFN sublayer
X_norm = LayerNorm(X)
ffn_out = FFN(X_norm)                   # shape (5, 768)
X = X + ffn_out                         # residual
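The loop above can be turned into a runnable toy. What follows is a sketch under several assumptions rather than GPT-2’s actual code: the sizes are tiny, the weights are random, attention has no bias terms, the FFN uses ReLU, and all function and parameter names are made up for this example. It does include the causal mask that GPT applies inside attention.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def causal_multi_head_attention(x, Wqkv, Wo, n_heads):
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = np.split(x @ Wqkv, 3, axis=-1)               # each (seq_len, d_model)
    # Split into heads: (n_heads, seq_len, d_head)
    q, k, v = (t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2) for t in (q, k, v))
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)    # (n_heads, seq_len, seq_len)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -1e9, scores)                # hide later positions
    heads = softmax(scores) @ v                            # (n_heads, seq_len, d_head)
    merged = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ Wo

def ffn(x, W1, b1, W2, b2):
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p, n_heads):
    x = x + causal_multi_head_attention(
        layer_norm(x, p["ln1_g"], p["ln1_b"]), p["Wqkv"], p["Wo"], n_heads)
    x = x + ffn(layer_norm(x, p["ln2_g"], p["ln2_b"]), p["W1"], p["b1"], p["W2"], p["b2"])
    return x

def init_block(rng, d_model, d_ff):
    w = lambda *shape: rng.normal(0, 0.02, shape)
    return {
        "ln1_g": np.ones(d_model), "ln1_b": np.zeros(d_model),
        "Wqkv": w(d_model, 3 * d_model), "Wo": w(d_model, d_model),
        "ln2_g": np.ones(d_model), "ln2_b": np.zeros(d_model),
        "W1": w(d_model, d_ff), "b1": np.zeros(d_ff),
        "W2": w(d_ff, d_model), "b2": np.zeros(d_model),
    }

def run_stack(x, blocks, n_heads):
    for p in blocks:
        x = transformer_block(x, p, n_heads)
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_heads, n_layers = 5, 16, 64, 4, 3
blocks = [init_block(rng, d_model, d_ff) for _ in range(n_layers)]
X = rng.normal(size=(seq_len, d_model))

out = run_stack(X, blocks, n_heads)
print(out.shape)   # (5, 16): the shape is preserved through every block
```

Swapping in 12 blocks, 12 heads, d_model = 768 and d_ff = 3072 gives the GPT-2 Small shapes from the walkthrough; only the constants change.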

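After 12 blocks, X still has shape (5, 768), but each token’s vector is now a rich contextual representation. Because GPT’s attention is causally masked, each position incorporates information only from itself and the tokens before it, never from later tokens.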

Step 3: Final layer norm + linear projection

X_final = LayerNorm(X)                  # shape (5, 768)
logits = X_final @ W_unembed            # shape (5, 50257)

W_unembed has shape (768, 50257), with one column per vocabulary token. (In practice, GPT-2 reuses the transposed token embedding matrix here, a trick called weight tying, so the unembedding adds no extra parameters.)
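Weight tying is easy to see in code. In this toy-sized sketch (random values, illustrative names) the unembedding is literally a transposed view of the embedding matrix, so each logit is the dot product between a position’s final vector and one token’s embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 100, 16, 5        # toy sizes for readability

embedding_matrix = rng.normal(0, 0.02, (vocab_size, d_model))
W_unembed = embedding_matrix.T                   # a view: no new parameters

X_final = rng.normal(size=(seq_len, d_model))    # stand-in for the final LayerNorm output
logits = X_final @ W_unembed

print(logits.shape)                                    # (5, 100)
print(np.shares_memory(W_unembed, embedding_matrix))   # True
```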

Step 4: Sample the next token

We only need the last position’s logits (predicting what comes after “the”):

last_logits = logits[-1]                # shape (50257,)
probs = softmax(last_logits / T)        # T = temperature (post 3)

next_token = sample(probs)              # e.g., "mat" (token 17680)

Here the sampled token is “mat”, completing “The cat sat on the mat.” Because sampling is random, another run could pick a different plausible token.
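The effect of temperature is easiest to see on a tiny vocabulary. In this sketch the four logits are invented for illustration and `sample_next_token` is a hypothetical helper, not part of any library.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_next_token(last_logits, temperature=1.0, rng=None):
    """Return (sampled token id, probability vector) for one position."""
    if rng is None:
        rng = np.random.default_rng()
    probs = softmax(last_logits / temperature)
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

last_logits = np.array([2.0, 1.0, 0.1, -1.0])    # toy 4-token vocabulary
for T in (0.5, 1.0, 2.0):
    token, probs = sample_next_token(last_logits, T, np.random.default_rng(0))
    print(f"T={T}: probs={np.round(probs, 3)}")
```

Low temperatures concentrate probability on the highest logit, while high temperatures flatten the distribution and make less likely tokens more competitive.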

Parameter Count

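Let’s understand where GPT-2 Small’s parameters come from. The figure originally reported was 117M, but the components below add up to about 124M, which is the count usually quoted for the released model:

Component	Parameters
Token embeddings	50,257 × 768 = 38.6M
Positional embeddings	1,024 × 768 = 0.8M
Per block: Attention (Q,K,V,O matrices)	4 × (768 × 768) = 2.4M
Per block: FFN (W₁, W₂)	(768×3072) + (3072×768) = 4.7M
Per block: LayerNorms	2 × (768 + 768) ≈ 0.003M
12 blocks × (2.4 + 4.7)M	≈ 85M
Final LayerNorm	2 × 768 ≈ 0.002M
Unembedding	0 (tied to the token embeddings)
Total	≈ 124M

Bias vectors are left out of the per-block rows; together they add less than 0.1M.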
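The tally can be reproduced exactly in a few lines. This sketch assumes the weight-tied unembedding mentioned in Step 3 and includes the bias vectors; the function name and its arguments are illustrative.

```python
def gpt2_param_count(vocab=50257, max_len=1024, d_model=768, n_layers=12, d_ff=None):
    if d_ff is None:
        d_ff = 4 * d_model
    token_emb = vocab * d_model
    pos_emb = max_len * d_model
    attn = 4 * d_model * d_model + 4 * d_model    # Q, K, V, O weights plus biases
    ffn = 2 * d_model * d_ff + d_ff + d_model     # W1, W2 plus b1, b2
    layer_norms = 2 * 2 * d_model                 # two LayerNorms, gamma and beta each
    per_block = attn + ffn + layer_norms
    final_ln = 2 * d_model
    # The unembedding is tied to token_emb, so it contributes nothing extra
    return token_emb + pos_emb + n_layers * per_block + final_ln

print(f"{gpt2_param_count():,}")   # 124,439,808
```

Under these assumptions the embeddings account for roughly a third of the total, and the 12 blocks for nearly all of the rest.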

GPT-3 uses 175 billion parameters with essentially the same architecture, scaled up: 96 layers, 96 heads, d_model = 12,288. Every forward pass still executes the same sequence of operations you now understand.

What Training Actually Does

During training on a large text corpus, the loss is the average cross-entropy across all positions (here N is sequence length):

L = -(1/N) Σₜ log P(tokenₜ | token₁, ..., tokenₜ₋₁)
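In code, the loss pairs each position’s logits with the token that actually came next. The sketch below is one common way to arrange this, under the assumption that the first token has no preceding context and is therefore not predicted, so the average runs over the remaining positions. The logits are random stand-ins for a model’s output.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def next_token_loss(logits, tokens):
    """Average cross-entropy, where logits[t] predicts tokens[t + 1]."""
    log_probs = log_softmax(logits[:-1])          # the last position has no target
    targets = tokens[1:]                          # shift the sequence left by one
    picked = log_probs[np.arange(len(targets)), targets]
    return -picked.mean()

rng = np.random.default_rng(0)
vocab_size = 10
tokens = np.array([3, 1, 4, 1, 5])
logits = rng.normal(size=(len(tokens), vocab_size))
print(next_token_loss(logits, tokens))

# A model that knows nothing (uniform predictions) scores log(vocab_size):
uniform_logits = np.zeros((len(tokens), vocab_size))
print(np.isclose(next_token_loss(uniform_logits, tokens), np.log(vocab_size)))   # True
```

Training pushes the loss below that uniform baseline by raising the probability assigned to each token that really followed.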

For each training step:

Run a forward pass on a batch of sequences.
Compute the cross-entropy loss.
Run backpropagation to compute gradients for every parameter in the model.
Update all parameters with a gradient-based optimiser (usually Adam or a variant).

After training on hundreds of billions of tokens, the weights encode enough statistical structure about language that the model can answer questions, write code, translate, summarise, and reason — all from next-token prediction.

Key Takeaways

A transformer block = MultiHeadAttention + FFN, each wrapped with LayerNorm and a residual connection.
Layer normalisation stabilises training by keeping activations at a consistent scale.
Residual connections allow gradients to flow directly to early layers, making deep networks trainable.
The FFN sublayer applies a two-layer MLP at each position independently, with hidden dimension 4×d_model.
Positional encoding injects position information by adding sinusoidal or learned vectors to token embeddings.
GPT is decoder-only: it predicts the next token autoregressively using a causally masked transformer stack.
One forward pass = embed → position encode → N×(LayerNorm+Attention+Residual+LayerNorm+FFN+Residual) → final LayerNorm → linear → softmax.

You now have the complete mathematical picture of how GPT works, built from nothing but addition, multiplication, and the chain rule. Every “emergent” capability in a language model traces back to these operations, repeated billions of times on billions of tokens.