Part 4 of Database Internals

B-Trees and B+ Trees: From Binary Trees to Database Indexes

11 min read

You already understand binary search trees. Each node has one key, two children, the left subtree is always smaller, the right always larger. For a tree with a million nodes, you find any key in about 20 comparisons.

That sounds fast. At in-memory speeds it is. At disk speeds, it kills you.


Why BSTs Break at Disk Scale

Every time you descend a level in a BST, you are potentially reading from a different location on disk. In RAM, that is a cache miss — nanoseconds. On a spinning disk, that is a random seek (a seek is the disk arm physically moving to a new track on a hard drive) — milliseconds. A million-node BST has ~20 levels. Twenty disk reads per query. At 5ms per seek, that is 100ms before you return a single row.

A page is the fixed-size block (typically 4–16 KB) that databases read from disk in a single I/O — we’ll see why this matters in a moment.

The fix is not to speed up disk. The fix is to read fewer levels.

If each node could hold 1,000 keys instead of 1, the tree would be 1,000x shallower. A million-row table needs only 2 levels. A billion-row table needs 3.

That is the entire idea behind B-trees.


B-Trees: BSTs Gone Wide

A B-tree generalizes the BST by allowing each node to hold many keys and have many children. The sorted invariant is identical — it just applies across more keys per node.

Formal rules for a B-tree of order t:

  • Each node holds between t-1 and 2t-1 keys (the root can have as few as 1)
  • Each internal node has exactly one more child than it has keys
  • All leaves sit at the same depth — always
  • Keys within a node are sorted ascending

Here t is the tree’s minimum degree — the lower bound on keys per node. The branching factor (children per internal node) is up to 2t.

For t=2 (the smallest B-tree allowed by the definition, also called a 2-3-4 tree), each node holds 1 to 3 keys and has 2 to 4 children.

Here is what a B-tree of order t=2 looks like:

The same BST invariant holds: every key in n1 is less than 30, every key in n2 is between 30 and 70, every key in n3 is greater than 70. It is just a wider tree.

In a B-tree, internal nodes store both keys and the actual data values. You can hit your target at any level of the tree.


Search: Same Idea, Wider Nodes

Search works identically to a BST. At each node, scan the keys left-to-right to find which child to follow (or return if you find the key directly).

Find key 50:

  1. At root [30 | 70]: 50 > 30 and 50 < 70 → take middle child
  2. At [40 | 50 | 60]: scan left to right, find 50 → return value

Two disk reads. In a BST of 9 nodes that is also two reads — but the B-tree scales to millions of nodes while still needing only 2–3 reads. The branching factor is the lever.

The per-node scan is O(t) (or O(log t) with binary search), and tree height is O(log of n in base t) — so total work simplifies to O(log n) since t is a constant for any given database. The win is entirely in the constant — fewer levels means fewer disk reads.


Insertion: Splits Keep the Tree Balanced

You always insert at a leaf. Walk down using the same search logic, land at the right leaf, insert the key in sorted position.

The problem: a leaf can overflow — it may already have 2t-1 keys (the maximum). The fix is to split the overfull node before inserting into it.

Splitting takes a full node with 2t-1 keys and produces:

  • A left child with the first t-1 keys
  • A right child with the last t-1 keys
  • The middle key, which is promoted up to the parent

Example: insert 55. The right child is full and must be split first.

Before:

Split [40 | 50 | 60]. Middle key 50 is promoted to the parent:

Now insert 55 into right child [60]:

The critical detail: split on the way down, not on the way up. Pre-splitting full children means we never need to walk back up to handle parent overflow — a one-pass algorithm that also simplifies concurrent locking.

If the root itself fills and splits, a new root is created holding only the promoted middle key. This is the only way a B-tree grows taller — always from the top.


Deletion: Three Cases

Deletion mirrors insertion. You always want to delete from a leaf, and no node can drop below t-1 keys (the minimum).

Case 1 — Leaf has more than the minimum: just remove the key.

Case 2 — Key is in an internal node: replace it with its in-order predecessor (the largest key in the left subtree) or in-order successor (the smallest key in the right subtree), then delete that key from the leaf. Same trick as BST deletion.

Case 3 — The leaf is at the minimum (t-1 keys): fix the underflow before deleting. Two options:

  • Borrow from a sibling that has a spare key. The sibling donates a key up to the parent; the parent sends its separator key down to the underflowing node. This is sometimes called a rotation — distinct from the AVL/red-black rotations used to rebalance binary search trees.
  • Merge with a sibling when no sibling has a spare. Pull the parent’s separator key down and combine two underfull siblings into one valid node. The parent loses a key — which might trigger a cascade upward.

If the root loses its last key during a merge, the root is removed and the merged child becomes the new root. This is the only way a B-tree shrinks — always from the top.


B+ Trees: The Database Standard

B+ trees make one structural change to the B-tree that turns out to be decisive: internal nodes store only keys, not values. All actual data lives exclusively at the leaf level.

Two things stand out immediately.

Internal nodes are denser. Without values, they hold far more keys per disk page. A 16KB page with 8-byte keys and 8-byte child pointers fits roughly several hundred to ~1,000 keys after accounting for page header and metadata. That is the branching factor. A 3-level tree now covers 1,000³ = one billion rows. The same tree in a standard B-tree, carrying 100-byte records in every node, might only branch by 100 — covering one million rows.

Leaf nodes form a sorted linked list. Every leaf has a pointer to the next leaf in key order. This is the feature that makes databases fast for the queries they actually run.


Range Queries: The Decisive Advantage

A point lookup (WHERE id = 42) is equally fast on a B-tree and a B+ tree — navigate to the leaf, return the value.

A range query (WHERE age BETWEEN 25 AND 55) is where B+ trees dominate.

In a B-tree, matching keys can be scattered across internal and leaf nodes. You must traverse the tree for each one.

In a B+ tree:

  1. Navigate to the first leaf containing 25 — O(log n), one tree traversal
  2. Follow the linked list rightward, collecting every key until you pass 55

No backtracking. No re-traversing the tree. The linked list turns a range query into a sequential scan of exactly the relevant pages. Sequential reads are 10–100x faster than random reads on spinning disks; even on SSDs the gap is 2–10x and prefetching makes sequential scans dramatically cheaper.

This is why every major relational database — PostgreSQL, MySQL InnoDB, SQLite, SQL Server — uses B+ trees as the primary index structure.


One Structural Difference: Promoted vs. Copied Keys

When a node splits in a B-tree, the middle key moves up — it leaves the child and lives only in the parent.

When a node splits in a B+ tree, the middle key copies up — it stays in the right leaf and also appears in the parent as a routing guide.

The leaf [20 | 30] in the B+ tree still contains 20, even though 20 also appears in the parent. Internal nodes are routing indexes, not the authoritative location of data. All data lives in the leaves.


Numbers That Put It in Perspective

Assume a 16KB page, 8-byte keys, 8-byte pointers, 100-byte records:

StructureKeys per internal nodeRows covered at 3 levels
BST17
B-tree (t=50)~99~970,000
B+ tree~1,000~1,000,000,000

B+ trees get that extra order of magnitude because internal nodes never carry record data — every byte goes toward branching. At a billion rows and 3 levels, you need exactly 3 disk reads per point lookup. In practice the upper levels of the tree are kept hot in the buffer cache, so the amortized cost is closer to one read per query.


Summary

B-TreeB+ Tree
Data stored atAny nodeLeaves only
Internal nodes containKeys + valuesKeys only (denser)
Leaf linked listNoYes
Point queryO(log n)O(log n)
Range queryTree traversal per keyFollow linked list — O(log n + k)
Typical height (1B rows)~5 levels~3 levels
Used byMany filesystems (HFS+, ext4, NTFS, APFS, Btrfs use B-trees or B+ trees)Databases (PostgreSQL, MySQL, SQLite, SQL Server)

The mental model builds directly from what you already know:

  • BST — one key, two children, branch by 2
  • B-tree — many keys, many children, branch by many, data anywhere
  • B+ tree — many keys, many children, data only at leaves, leaves linked in sorted order

The B+ tree is not a micro-optimization of the B-tree. It is a deliberate trade: give up the ability to find data in internal nodes, gain the ability to do range scans in O(k) time after a single O(log n) lookup.

That trade is the reason WHERE created_at BETWEEN '2026-01-01' AND '2026-03-31' returns in milliseconds instead of minutes.