Part 3 of Database Internals

How Databases Actually Store and Find Your Data

12 min read

When you write SELECT * FROM orders WHERE id = 42, you’re thinking in tables and rows. The database is thinking in pages, heaps, and B-trees. Understanding the gap between those two views explains every performance decision you’ll ever make about indexes.

Let’s build the picture from the ground up.


The Table: A Logical Abstraction

A table is a logical concept — a named collection of rows with defined columns. It says nothing about how data is physically stored. Two databases can expose the exact same CREATE TABLE syntax and store the data in completely different ways underneath.

The physical reality lives one level down, in pages.


Pages: The Unit of I/O

Databases never read a single row from disk. They read a page — a fixed-size block of data, typically 8KB (PostgreSQL, SQL Server) or 16KB (MySQL InnoDB). A page is the smallest unit the storage engine reads or writes.

A page contains a header, a slot array pointing to row locations, and the actual row data:

Rows are written from the bottom of the page upward. The slot array at the top grows downward. This design means rows can be moved within a page (for compaction) without changing the slot number — callers reference (page, slot), not a raw byte offset.

Why this matters: If your query touches 1 row that happens to share a page with 99 other rows, the database reads all 100 rows’ worth of data into memory regardless. Page size is why “one row lookups” are cheap only when paired with an index — without one, you may read every page in the table.


The Heap: An Unordered Table

The default table structure in most databases is a heap. A heap is simply a collection of pages with no inherent ordering — rows are inserted wherever there is free space, not in any sorted order.

Notice: the rows are in no particular order. id=1 is on page 1, id=2 is on page 2, id=3 is on page 2. There is no relationship between the key value and where the row physically lives.

Consequence: To find id = 4 without an index, the database must read every page from start to finish — a sequential scan or full table scan. For a 10-million-row table, that means reading millions of pages from disk.


Row IDs: The Physical Address of a Row

Every row in a heap has a physical address — a pointer to exactly which page and which slot it occupies. Different databases name this differently:

PostgreSQLOracleSQL Server
ctidROWIDRID
e.g. (2, 3)e.g. AAABXSAABAAAACFAAHe.g. 1:2:3
= page 2, slot 3encoded page + slot= file 1, page 2, slot 3

These row IDs are physical, not logical. They change when a row is moved — during VACUUM FULL (VACUUM is PostgreSQL’s process for reclaiming space from deleted/updated row versions), CLUSTER, or other operations that physically rewrite tuples. Standard VACUUM does not move existing tuples and does not change ctids. This has critical implications for how indexes work, as we’ll see next.

Row IDs are not normally exposed to application code, but they are the mechanism that makes indexes fast.


Indexes and How They Use Row IDs

An index is a separate data structure — almost always a B-tree or B+ tree — that maps a column’s values to the physical location of the corresponding rows in the heap.

When you query WHERE id = 7:

  1. The database traverses the B-tree from root to leaf — typically 2–4 page reads for a large index.
  2. The leaf node contains the row ID: (2, 2) — page 2, slot 2.
  3. The database reads page 2 from the heap and extracts the row at slot 2.

This is a double lookup: index traversal, then a heap fetch. For a single row, this is very fast. At scale, it’s the foundation of every indexed query plan.


The Clustered Index: Data Ordered by the Index

A clustered index eliminates the double lookup by storing the actual row data inside the B-tree leaf nodes — not a pointer to a heap, but the row itself.

The leaf level of a clustered index is the table. There is no separate heap. Rows are physically stored in sorted order by the index key.

Benefits:

  • Range scans are extremely fast — WHERE id BETWEEN 4 AND 7 reads three contiguous leaf pages, no heap fetch.
  • No double lookup — one B-tree traversal retrieves the full row.

Cost:

  • Inserts must go into the right sorted position. Random inserts (e.g., UUIDs as keys) cause page splits — when an insert won’t fit in the target page, the database allocates a new page and moves half the rows to it, an expensive operation that fragments the index over time.
  • There can only be one clustered index per table — the data can only be physically sorted one way.

Aside — FILLFACTOR: Most databases let you reserve free space on each page at index creation time (PostgreSQL’s WITH (fillfactor = 70), MySQL’s similar setting). A 70% fillfactor leaves 30% of each page empty for future inserts and HOT updates, dramatically reducing page splits on write-heavy tables. The tradeoff is read amplification — the index occupies more pages.


Primary Key vs Clustered Index

This is one of the most misunderstood distinctions in databases. They are related but not the same thing.

MySQL InnoDB

InnoDB always has a clustered index. If you define a primary key, it becomes the clustered index. If you don’t, InnoDB finds the first unique NOT NULL column. If none exists, InnoDB creates a hidden 6-byte row ID and clusters on that. You cannot opt out.

SQL Server

A primary key creates a clustered index by default — but you can override this. You can create a primary key as a non-clustered index, and separately define a different column as the clustered index. Primary key and clustered index are independent choices.

PostgreSQL

PostgreSQL has no clustered indexes in the traditional sense. All tables are heaps. A primary key creates a regular B-tree index that stores row IDs pointing back to the heap. The CLUSTER command can physically reorder a table’s pages to match an index — but this ordering is not maintained on subsequent writes. It’s a one-time reorganization, not a structural property.


Secondary Indexes

A secondary index is any index that is not the clustered index. How it works depends on the table’s storage structure.

Secondary Index on a Heap Table (PostgreSQL)

Both the primary key index and any secondary indexes are B-trees pointing to physical row IDs (ctid).

Problem with heap + secondary index: If a row is vacuumed and moves to a different page, every index pointing to that row ID must be updated. PostgreSQL’s HOT (Heap-Only Tuple) optimization handles a different case: when an UPDATE doesn’t change any indexed column, Postgres places the new row version on the same page and chains it from the old one — so no new index entries need to be created. This avoids index bloat from frequent updates to non-indexed columns and keeps secondary indexes pointing at the page rather than at every individual row version.

Secondary Index on a Clustered Table (MySQL InnoDB)

Secondary indexes don’t store physical row IDs — they store the primary key value instead. To fetch the full row, InnoDB does a secondary index lookup followed by a primary key (clustered index) lookup.

Why primary key value instead of physical address? When a clustered index page splits (due to an insert), rows shift around. If secondary indexes stored physical addresses, every secondary index would need updating on every page split. By storing the primary key, secondary indexes remain stable — only the clustered index itself needs to maintain the physical layout.

Cost: Every secondary index lookup on InnoDB requires two B-tree traversals — one in the secondary index, one in the clustered index. This is called a double lookup (also called a bookmark lookup in SQL Server terminology).

Mitigation — Covering Index: If the query only needs columns that are already in the index, the database doesn’t need to look up the full row at all. A covering index includes all columns needed by the query, allowing the lookup to be resolved entirely from the index leaf node.

-- Without covering index: secondary index lookup + clustered index lookup
SELECT email, created_at FROM users WHERE email = 'alice@x.com';

-- With covering index on (email, created_at): resolved from index alone
CREATE INDEX idx_email_created ON users (email, created_at);

How It All Connects



Key Takeaways

  • Pages are the unit of I/O. Indexes reduce the number of pages read; they don’t make individual page reads faster.
  • Heaps store rows in insertion order. Without an index, finding a row requires reading every page.
  • Row IDs (ctid, ROWID, RID) are physical addresses. They are what B-tree indexes store to bridge the index and the heap.
  • A clustered index stores the actual rows in the B-tree leaf nodes. The leaf level is the table. There is no heap.
  • A primary key is a logical uniqueness constraint. Whether it becomes a clustered index depends on the database: always in InnoDB, by default in SQL Server, never in PostgreSQL.
  • Secondary indexes on a clustered table store the primary key value, not a physical address, to stay stable across page splits — at the cost of a double lookup.
  • A covering index avoids the second lookup entirely by including all the columns a query needs directly in the index.

The next time you wonder whether to add an index, you’re really asking: is it worth the cost of maintaining a separate B-tree and potentially doing a second lookup, in exchange for skipping a sequential scan of all those pages?