A neural network can only process numbers. Before a language model can reason about the word “cat”, the word must be converted into a vector of numbers. This conversion is called an embedding, and the mathematical properties of those vectors are what give language models their apparent ability to understand meaning.
Why Not Just Use Integers?
The simplest approach: assign each word an integer. “cat” = 1, “dog” = 2, “car” = 3.
This has a fatal problem. The number 3 is larger than 2, which is larger than 1, but “car” is not greater than “dog” in any meaningful sense. A neural network treats its inputs as magnitudes, so it would try to learn from an ordering and spacing that are purely accidental.
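A small sketch of why the numbers are arbitrary: the two ID assignments below are equally valid, yet they disagree about how "far apart" the same two words are. The dictionaries and the `id_gap` helper are hypothetical names used only for illustration.

```python
# Two equally valid, arbitrary ID assignments for the same three words.
ids_a = {"cat": 1, "dog": 2, "car": 3}
ids_b = {"cat": 1, "car": 2, "dog": 3}

def id_gap(ids, w1, w2):
    """Absolute difference between two word IDs."""
    return abs(ids[w1] - ids[w2])

gap_a = id_gap(ids_a, "cat", "dog")
gap_b = id_gap(ids_b, "cat", "dog")
print(gap_a, gap_b)  # 1 2
```

Nothing about the words changed between the two runs, only the numbering, so any pattern a model finds in these gaps is noise.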
One-Hot Encoding
A better idea: give each word a vector of zeros with a single 1 at the word’s index.
vocabulary = ["cat", "dog", "car", "runs", "sleeps"] # size = 5
"cat" → [1, 0, 0, 0, 0]
"dog" → [0, 1, 0, 0, 0]
"car" → [0, 0, 1, 0, 0]
This fixes the ordering problem — no word is numerically larger than another. But it has two major flaws:
- Dimensionality: Real vocabularies have 50,000–100,000+ words. Each vector would be 50,000+ dimensions, mostly zeros. Storing and multiplying with these is extremely wasteful.
- No notion of similarity: The dot product of any two different one-hot vectors is 0. The model has no way to know that “cat” and “dog” are more related to each other than “cat” and “car”.
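The similarity flaw can be checked directly. This is a minimal sketch in plain Python using the five-word vocabulary from above; `one_hot` and `dot` are helper names introduced here, not part of any library.

```python
vocabulary = ["cat", "dog", "car", "runs", "sleeps"]

def one_hot(word, vocab):
    """Vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

cat, dog, car = (one_hot(w, vocabulary) for w in ("cat", "dog", "car"))
print(cat)            # [1, 0, 0, 0, 0]
print(dot(cat, dog))  # 0
print(dot(cat, car))  # 0
print(dot(cat, cat))  # 1
```

Every pair of different words scores 0, so “dog” looks exactly as unrelated to “cat” as “car” does.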
Dense Embeddings
The solution: represent each word as a dense vector in a much smaller space, typically 64 to 1024 dimensions. These vectors are learned during training.
"cat" → [0.2, -0.4, 0.7, 0.1, ...] # 768-dimensional vector
"dog" → [0.3, -0.3, 0.8, 0.0, ...] # similar to "cat"
"car" → [-0.5, 0.6, -0.2, 0.9, ...] # different direction
Words with similar meanings end up with similar vectors. Words that appear in similar contexts during training pull their vectors together. After training, the geometry of the embedding space encodes semantic relationships.
The Embedding Matrix
In practice, embeddings are stored in an embedding matrix E of shape (vocab_size, embedding_dim). Row i of E is the embedding vector for the word with index i.
To look up the embedding for word i, multiply the one-hot vector eᵢ by E:
embedding = eᵢ @ E
But multiplying a one-hot vector by a matrix just selects a row — it’s equivalent to a table lookup:
embedding = E[i, :]
This is all an embedding layer does: look up a row in a learned table. The parameters E are learned through backpropagation, just like any other weights.
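The equivalence between the matrix product and the row lookup can be verified on a tiny example. The matrix values below are made up, and `one_hot` and `vec_mat` are hypothetical helpers written out in plain Python so that no library is needed.

```python
# Hypothetical embedding matrix E: 5 words, 3 dimensions (values made up).
E = [
    [0.2, -0.4, 0.7],
    [0.3, -0.3, 0.8],
    [-0.5, 0.6, -0.2],
    [0.1, 0.9, 0.4],
    [0.0, 0.8, 0.5],
]

def one_hot(i, size):
    """One-hot row vector with a 1.0 at index i."""
    return [1.0 if j == i else 0.0 for j in range(size)]

def vec_mat(v, M):
    """Row vector times matrix: result[c] = sum over r of v[r] * M[r][c]."""
    return [sum(v[r] * M[r][c] for r in range(len(M))) for c in range(len(M[0]))]

i = 2
via_matmul = vec_mat(one_hot(i, len(E)), E)  # e_i @ E
via_lookup = E[i]                            # E[i, :]
print(via_matmul)
print(via_lookup)
```

Both lines print the same row. The lookup skips all the multiplications by zero, which is why embedding layers are implemented as indexing and not as a matrix product.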
In PyTorch:
import torch
import torch.nn as nn
vocab_size = 10000
embed_dim = 256
embedding_layer = nn.Embedding(vocab_size, embed_dim)
# Look up embeddings for a batch of token indices
token_ids = torch.tensor([4, 17, 3, 42]) # 4 tokens
embeddings = embedding_layer(token_ids) # shape (4, 256)
Dot Product as Similarity
Recall from post 1 that the dot product of two vectors is a · b = |a||b|cos(θ), where θ is the angle between them. The dot product is large when vectors point in the same direction and small (or negative) when they don’t.
This makes it a natural similarity measure:
sim("cat", "dog") = cat_vec · dog_vec ≈ large positive
sim("cat", "car") = cat_vec · car_vec ≈ small
sim("cat", "hate") = cat_vec · hate_vec ≈ small or negative
The dot product is used directly in the attention mechanism (post 7).
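As a rough illustration, the sketch below computes dot products on the first four components of the example vectors shown earlier for “cat”, “dog”, and “car”. Those numbers are illustrative, not values from a trained model, so only the pattern matters.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Illustrative 4-dimensional vectors (not trained values).
cat = [0.2, -0.4, 0.7, 0.1]
dog = [0.3, -0.3, 0.8, 0.0]
car = [-0.5, 0.6, -0.2, 0.9]

print(round(dot(cat, dog), 2))  # 0.74
print(round(dot(cat, car), 2))  # -0.39
```

The vectors that point in roughly the same direction score positive, and the pair pointing in different directions scores negative.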
Cosine Similarity
The dot product depends on the magnitude of both vectors as well as the angle between them. A long vector produces large-magnitude dot products with almost everything, regardless of how well the directions match. Cosine similarity normalises for this:
cosine_sim(a, b) = (a · b) / (|a| × |b|)
where |a| is the length (L2 norm) of vector a: |a| = sqrt(a[0]² + a[1]² + ... + a[n-1]²) for an n-dimensional vector.
The result is always between -1 and +1:
- +1: same direction (identical meaning).
- 0: perpendicular (unrelated).
- -1: opposite directions (opposite meaning).
Worked example:
a = [0.6, 0.8] (embedding for "king")
b = [0.5, 0.9] (embedding for "queen")
c = [-0.9, 0.2] (embedding for "terrible")
a · b = 0.6×0.5 + 0.8×0.9 = 0.30 + 0.72 = 1.02
|a| = sqrt(0.6² + 0.8²) = sqrt(0.36 + 0.64) = 1.0
|b| = sqrt(0.5² + 0.9²) = sqrt(0.25 + 0.81) = sqrt(1.06) ≈ 1.03
cosine_sim(a, b) = 1.02 / (1.0 × 1.03) ≈ 0.99 ← very similar
a · c = 0.6×(-0.9) + 0.8×0.2 = -0.54 + 0.16 = -0.38
|c| = sqrt(0.81 + 0.04) = sqrt(0.85) ≈ 0.92
cosine_sim(a, c) = -0.38 / (1.0 × 0.92) ≈ -0.41 ← dissimilar
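The worked example can be reproduced in a few lines. This sketch uses only the standard library. The function names are chosen here for illustration, and the last two lines add a scaled copy of the “queen” vector (an assumption for the demo) to show that length does not affect the result.

```python
import math

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    """L2 norm (length) of a vector."""
    return math.sqrt(dot(a, a))

def cosine_sim(a, b):
    """Cosine of the angle between a and b (both must be non-zero)."""
    return dot(a, b) / (norm(a) * norm(b))

king = [0.6, 0.8]
queen = [0.5, 0.9]
terrible = [-0.9, 0.2]

print(round(cosine_sim(king, queen), 2))     # 0.99
print(round(cosine_sim(king, terrible), 2))  # -0.41

# Scaling a vector changes the dot product but not the cosine similarity.
long_queen = [10 * x for x in queen]
print(round(dot(king, long_queen), 2))         # 10.2
print(round(cosine_sim(king, long_queen), 2))  # 0.99
```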
Semantic Arithmetic
One of the most surprising properties of trained embeddings is that arithmetic on vectors corresponds to semantic relationships.
The classic example:
embedding("king") - embedding("man") + embedding("woman") ≈ embedding("queen")
The embedding space has learned to encode the concepts of “royalty” and “gender” as directions in space. Subtracting the “man” direction and adding the “woman” direction finds the nearby word “queen”.
Other examples:
Paris - France + Germany ≈ Berlin
walking - walk + swim ≈ swimming
puppy - dog + cat ≈ kitten
These relationships are not explicitly programmed. They emerge from training on large corpora, because words that appear in similar contexts end up with similar vector directions.
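The mechanics of an analogy query can be sketched with hand-built toy vectors. These are not trained embeddings: the two axes are assumed to mean [royalty, gender], chosen so the arithmetic is easy to follow. Excluding the query words from the search is a common convention in analogy evaluation and is adopted here as an assumption.

```python
import math

# Hand-built toy vectors, NOT trained embeddings.
# Assumed axes: [royalty, gender], with male = -1 and female = +1.
vectors = {
    "king":     [1.0, -1.0],
    "queen":    [1.0,  1.0],
    "man":      [0.0, -1.0],
    "woman":    [0.0,  1.0],
    "prince":   [0.8, -1.0],
    "princess": [0.8,  1.0],
}

def cosine_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the query words."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine_sim(vectors[w], target))

result = analogy("king", "man", "woman")
print(result)  # queen
```

In a trained model the directions are not axis-aligned or hand-labelled like this, and the match is approximate, but the query follows the same pattern: do the arithmetic, then search for the nearest remaining vector.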
A caveat: because embeddings are shaped entirely by the text they were trained on, they also absorb biases present in that text. Early word embeddings notoriously encoded sexist and racist stereotypes (e.g. doctor - man + woman ≈ nurse). Modern models partially mitigate this with curated training data and additional fine-tuning, but it remains an active area of research.
Subword Tokenisation
Modern language models don’t embed whole words — they embed subword tokens. The word “unbelievable” might be tokenised as ["un", "believ", "able"], and each subword gets its own embedding. This handles rare words and morphology automatically.
GPT-2 uses Byte Pair Encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols, so common character sequences become single tokens. GPT-4 uses a similar approach. The vocabulary size is typically 50,000–100,000 tokens.
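To give a feel for how a BPE vocabulary grows, here is a simplified character-level sketch of the merge-learning loop. It is not GPT-2's actual tokenizer: it works on characters instead of bytes, uses no end-of-word marker, and breaks ties by first occurrence. The toy word counts and all function names are assumptions for illustration.

```python
from collections import Counter

def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)

def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a {word: count} dictionary."""
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        vocab = {merge_pair(s, best): f for s, f in vocab.items()}
    return merges, vocab

# Toy corpus statistics (made up).
word_freqs = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(word_freqs, num_merges=3)
print(merges)
print(list(vocab))
```

After three merges on this toy corpus, the frequent suffix “est” has become a single symbol, while rarer fragments are still split into characters. Each learned symbol would then get its own row in the embedding matrix.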
Positional Embeddings
Embeddings give each word an identity, but they carry no positional information — the embedding of “cat” is the same whether it appears first or last in a sentence. Transformers solve this by adding a positional embedding to each token’s embedding:
final_embedding[i] = token_embedding[i] + positional_embedding[i]
The positional embeddings encode the position of each token in the sequence. Post 8 covers how these are constructed.
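The addition itself is elementwise, as the sketch below shows. All values are hypothetical 3-dimensional numbers made up for illustration, and the positional table here is just a lookup by index, with no claim about how a real model constructs it.

```python
# Hypothetical 3-dimensional values, made up for illustration.
token_embedding = {
    "cat": [0.2, -0.4, 0.7],
    "sat": [0.1, 0.5, -0.3],
}
positional_embedding = [
    [0.0, 0.1, 0.0],   # position 0
    [0.1, 0.0, -0.1],  # position 1
    [0.2, -0.1, 0.1],  # position 2
]

def embed(tokens):
    """final_embedding[i] = token_embedding[i] + positional_embedding[i]."""
    return [
        [t + p for t, p in zip(token_embedding[tok], positional_embedding[i])]
        for i, tok in enumerate(tokens)
    ]

final = embed(["cat", "sat", "cat"])
print(final[0])  # "cat" at position 0
print(final[2])  # "cat" at position 2
```

The two “cat” rows differ even though the token is the same, so the layers after the embedding can tell the first occurrence from the last.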
Key Takeaways
- One-hot encoding is sparse and carries no semantic information. Dot products between one-hot vectors are always 0 or 1.
- Dense embeddings are learned vectors of typically 64–1024 dimensions. Similar words get similar vectors.
- An embedding matrix E stores one row per vocabulary word. Looking up an embedding is just indexing a row.
- The dot product a · b measures how much two vectors point in the same direction, which is the foundation of attention.
- Cosine similarity normalises for vector length, giving a similarity score in [-1, +1].
- Trained embeddings exhibit semantic arithmetic: king - man + woman ≈ queen.
Next: the attention mechanism — where embeddings meet dot products to build context-aware representations.