Part 5 of Database Internals

PostgreSQL Indexing: Internals, Types, and Trade-offs

43 min read

An index is the single most impactful performance tool in a relational database. It can turn a 30-second query into a 2ms one. It can also double your write latency, bloat your storage, and confuse the query planner — all silently. This post goes deep on how indexes actually work inside PostgreSQL, every index type available, and the non-obvious rules that govern when an index helps versus hurts.


How PostgreSQL Stores Data Without an Index

To understand indexes you have to understand what they replace. PostgreSQL stores table rows in heap files — a collection of 8KB pages on disk. Rows are written in the order they arrive. There is no inherent sort order. There is no clustering. A row’s physical location is its TID (Tuple ID) — a pair (page_number, slot_offset).

When you run SELECT * FROM orders WHERE user_id = 42 against a table with no index, PostgreSQL has no choice but to read every page of the heap file, look at every row, and discard the ones where user_id != 42. That’s a sequential scan. On a table with 10 million rows and 80,000 pages, that’s 80,000 page reads.

An index is a separate data structure that maps column values to TIDs. Instead of reading 80,000 pages, you consult the index to find the handful of TIDs where user_id = 42, then fetch just those heap pages. The index does not contain the full row — only the indexed column(s) and the TID pointer back to the heap.


B-tree Internals: How the Default Index Works

The default index type in PostgreSQL is a B-tree (Balanced Tree). Understanding it deeply unlocks understanding of nearly everything else about index performance.

The Structure

A B-tree index is a tree of fixed-size pages (also 8KB, same as heap pages). There are three kinds of pages:

  • Root page: The entry point. Contains separator keys that route the search left or right.
  • Internal pages: Branch nodes. Store separator keys and child page pointers. Never store TIDs.
  • Leaf pages: Store the actual index entries: (indexed_value, TID) pairs sorted by value. Every leaf page is doubly linked to its neighbors — this is what makes range scans cheap.

A Point Lookup: WHERE id = 42

  1. Read the root page. Binary search for 42 among separator keys. Follow the child pointer that covers 42’s range.
  2. Read the internal page. Binary search again. Follow the child pointer.
  3. Read the leaf page. Binary search for 42. Find the TID(s).
  4. Use the TID to fetch the actual row from the heap.

For a table with 10 million rows, a B-tree index is typically 3-4 levels deep. That’s 3-4 page reads to find any value — vs 80,000 for a seq scan. This is the O(log N) lookup.

A Range Scan: WHERE id BETWEEN 100 AND 200

  1. Find the leaf page containing 100 (same point lookup process).
  2. Scan forward across leaf pages (following the linked list) collecting TIDs until the value exceeds 200.
  3. Fetch heap rows for all collected TIDs.

The doubly-linked leaf chain makes this efficient — no need to go back to the root for each subsequent value. This is why B-trees support ORDER BY, BETWEEN, <, > natively.

Page Splits — The Hidden Cost of Inserts

Every leaf page has a fill factor (default 90% for indexes). When a new index entry arrives and the target leaf page is full, PostgreSQL splits the page: creates a new page, moves half the entries there, and updates the parent page to add a new separator key pointing to the new page.

If the parent is also full, it splits too — a cascade that can propagate up to the root. Root splits increase the tree height by one. This is rare but means a B-tree grows in height slowly over time.

Why this matters for UUIDs — we’ll cover this in depth later, but the short version is: random values cause every insert to land on a random page, requiring that page to be in the buffer cache. Monotonically increasing values always append to the rightmost leaf — only one page needs to be hot.

The Visibility Map and Index-Only Scans

Every heap page has a corresponding bit in the visibility map — a compact structure that tracks whether all rows on a page are visible to all transactions (fully frozen/vacuumed). When PostgreSQL does an index scan, it normally has to fetch the heap page to check row visibility. But if the visibility map says a page is all-visible, it can skip the heap fetch entirely and return the indexed values directly.

This is an index-only scan — the fastest possible read path. You see Heap Fetches: 0 in EXPLAIN ANALYZE output. It requires:

  1. All SELECTed columns are in the index.
  2. The visibility map is current (i.e., VACUUM has run recently on the table).

Tables with high write volume that haven’t been vacuumed recently will have many pages not marked all-visible, causing index-only scans to degrade back to regular index scans.


Index Types in PostgreSQL

PostgreSQL has six built-in index types plus extensions, each optimized for different access patterns and data shapes.

B-tree (Default)

CREATE INDEX idx_orders_user_id ON orders (user_id);
-- equivalent to:
CREATE INDEX idx_orders_user_id ON orders USING btree (user_id);

Supports: =, <, <=, >, >=, BETWEEN, IN, IS NULL, IS NOT NULL, LIKE 'prefix%', ORDER BY, DISTINCT.

Works with: Any type that has a defined ordering — integers, text, timestamps, enums, UUIDs, numeric, etc.

Use for: Primary keys, foreign keys, range queries, sorting, prefix string matches. When in doubt, start here.

Does not support: Full-text search, containment queries (@>), geometric overlap, JSONB key existence.

Hash

CREATE INDEX idx_orders_status_hash ON orders USING hash (status);

Computes a hash of the column value and stores (hash_bucket, TID). There is no tree — lookup is O(1) for equality.

Supports: Only = (equality). Nothing else.

Performance vs B-tree for equality: Hash indexes are theoretically faster for pure equality lookups on high-cardinality columns. In practice, the difference is marginal — B-tree lookup is O(log N) but with a very small constant (3-4 page reads), and PostgreSQL’s buffer cache often keeps hot pages warm anyway.

Downsides:

  • Cannot be used for <, >, BETWEEN, ORDER BY, LIKE, range anything.
  • Cannot be used in a composite index.
  • Before PostgreSQL 10, hash indexes were not WAL-logged and would be lost on crash — they had to be manually rebuilt. Since PostgreSQL 10 they are fully crash-safe, but the stigma remains in older documentation.
  • The planner rarely chooses a hash index over a B-tree even for equality unless you force it.

When to use: Almost never in practice. The one valid use case is an equality-only lookup on a column with very long values (e.g., a SHA-256 hash column) where storing the full value in a B-tree entry is wasteful — a hash index stores only the hash of the hash, making index pages much denser. Even then, most engineers reach for B-tree and a text_pattern_ops or just accept the slightly larger index.

GIN (Generalized Inverted Index)

CREATE INDEX idx_articles_tags ON articles USING GIN (tags);
CREATE INDEX idx_products_attrs ON products USING GIN (attributes);  -- JSONB
CREATE INDEX idx_articles_fts ON articles USING GIN (to_tsvector('english', body));

A GIN index is an inverted index — it maps from individual element values to the rows that contain them. For an array column tags = ['go', 'performance', 'database'], the GIN index contains three entries: go → {row 5, row 12, row 30}, performance → {row 5, row 88}, etc.

Supports:

  • Arrays: @> (contains), <@ (contained by), && (overlap), = ANY()
  • JSONB: @> (contains), ? (key exists), ?| (any key exists), ?& (all keys exist)
  • Full-text search: @@ (tsvector matches tsquery)
  • Range types: with range_ops
  • Trigrams (with pg_trgm): LIKE '%pattern%', ILIKE, similarity (%)

Characteristics:

  • Lookup is very fast — find elements in O(log N), intersect posting lists.
  • Insert is expensive — every new element in an array/JSONB must update potentially many posting lists. GIN uses a “pending list” buffer to batch writes (controlled by gin_pending_list_limit), but the eventual merge is costly.
  • Index size is large — an inverted index with many distinct elements can be several times larger than the source data.
  • Not suitable for columns where every row has a unique value — there’s nothing to invert.

The pending list: GIN batches inserts into a pending list (default 4MB) and merges into the main index structure lazily (during VACUUM or when the pending list fills). This makes individual inserts fast, but bulk loads can trigger expensive background merges. For bulk inserts, use CREATE INDEX after loading rather than maintaining a GIN index during load.

When to use:

  • Any array column you query with && or @>.
  • JSONB columns where you query by key existence or containment.
  • Full-text search (tsvector columns or to_tsvector() expressions).
  • Trigram search for LIKE '%substring%' or fuzzy matching.

GiST (Generalized Search Tree)

CREATE INDEX idx_locations_geom ON locations USING gist (coordinates);  -- PostGIS
CREATE INDEX idx_events_period ON events USING gist (during);           -- tsrange
CREATE INDEX idx_articles_fts ON articles USING gist (to_tsvector('english', body));

GiST is a framework for building balanced search trees for non-standard data types. Where B-tree needs a total ordering, GiST needs only a set of user-defined predicates (consistent, union, penalty, picksplit). This makes it extensible to types that don’t have a natural linear order.

Supports:

  • Geometric types: PostGIS (geometry, geography), built-in PostgreSQL geometric types (point, polygon, circle). Operators: && (overlap), @> (contains), ~ (same), <-> (distance).
  • Range types: int4range, tsrange, daterange, etc. Operators: &&, @>, -|- (adjacent).
  • Full-text search: tsvector @@ tsquery (but GIN is generally faster for this).
  • Nearest-neighbor search: GiST supports ORDER BY column <-> value LIMIT N — find the N closest points to a given coordinate. GIN cannot do this.

GiST vs GIN for full-text:

  • GiST: faster to build, smaller index, supports <-> nearest-neighbor. Slower at search time.
  • GIN: larger index, slower to build, much faster search (especially with many matching documents).

For read-heavy full-text search, GIN almost always wins. GiST is preferred when you need nearest-neighbor or when write performance matters more.

When to use: Geospatial queries (PostGIS), date/time range overlap queries, nearest-neighbor lookups, exclusion constraints (EXCLUDE USING gist).

BRIN (Block Range INdex)

CREATE INDEX idx_logs_created_at ON logs USING brin (created_at);
CREATE INDEX idx_logs_created_at ON logs USING brin (created_at) WITH (pages_per_range = 64);

A BRIN index is radically different from everything else. Instead of indexing individual values, it divides the heap into ranges of consecutive pages (default: 128 pages per range) and stores only the min and max value of the indexed column within that range.

A BRIN index on a 100GB table might be only 256KB — tens of thousands of times smaller than a B-tree index on the same column.

How a lookup works: For WHERE created_at > '2025-01-01', scan the BRIN index (a tiny read) and discard any page ranges whose max < '2025-01-01'. Then do a sequential scan of only the surviving page ranges. This is still a scan — BRIN doesn’t give you direct row-level access — but it skips huge chunks of the table.

The critical requirement: physical correlation. BRIN only works when the physical order of rows in the heap matches the order of the indexed values. If you insert rows in timestamp order (which is natural for logs, events, IoT data), then page 1 has the oldest rows, page N has the newest — the min/max ranges are tight and exclusive. If rows are inserted randomly, the min/max of every range will be nearly the full column domain, and BRIN can’t skip anything.

-- Check physical correlation before creating a BRIN index
SELECT correlation FROM pg_stats WHERE tablename = 'logs' AND attname = 'created_at';
-- correlation near 1.0 = good for BRIN
-- correlation near 0.0 = BRIN will be useless

When to use:

  • Append-only or mostly-append tables (logs, events, time-series, audit trails).
  • created_at, inserted_at, serial IDs — any column that grows monotonically with insertion order.
  • Very large tables (100GB+) where a B-tree index would be several GB itself.
  • When you need only a small fraction of the table for a time-range query and can tolerate a scan of the relevant page ranges.

When NOT to use:

  • Random insert patterns (correlation near 0) — BRIN will be ignored by the planner.
  • High-selectivity point lookups — BRIN will still scan hundreds of pages; a B-tree is far better.
  • Columns that are updated frequently — BRIN summaries can become stale (use brin_summarize_new_values() or autovacuum to refresh).

SP-GiST (Space-Partitioned GiST)

CREATE INDEX idx_users_phone ON users USING spgist (phone_number);
CREATE INDEX idx_network_cidr ON networks USING spgist (cidr);

SP-GiST builds space-partitioned trees — structures like tries, quad-trees, k-d trees, and radix trees. Unlike B-tree and GiST which are balanced, SP-GiST trees can be unbalanced and work well for data with natural partition structure.

Supports:

  • Text prefix matching (trie structure) — similar to B-tree for LIKE 'prefix%' but faster on high-cardinality text.
  • IP addresses and CIDR containment (inet, cidr types) — << (subnet), >>= (supernet), etc.
  • Geometric types (quad-tree for points) — 2D &&, @>, <->.
  • Phone numbers and other hierarchically structured strings.

When to use: IP/CIDR containment queries, text prefix queries on very long strings, 2D point data when GiST is too slow.

In practice, SP-GiST is the most specialized index type and you’ll encounter it mainly when working with IP ranges, hierarchical codes, or specific geometric workloads.


Composite Indexes

A composite (multi-column) index stores entries for multiple columns together.

CREATE INDEX idx_orders_user_status ON orders (user_id, status);

The Prefix Rule

A composite index (a, b, c) can be used by queries that filter on:

  • a alone
  • a, b
  • a, b, c

But not by queries that filter on:

  • b alone
  • c alone
  • b, c
  • c, a

The planner can only use the index from the left. This is because the index is physically sorted first by a, then by b within each a value, then by c within each (a, b) combination. Searching for b = 'pending' without knowing a is equivalent to searching a phone book by first name — the sort order is useless.

-- These use the index:
WHERE user_id = 42
WHERE user_id = 42 AND status = 'pending'
WHERE user_id = 42 AND status = 'pending' AND created_at > '2025-01-01'
WHERE user_id = 42 AND created_at > '2025-01-01'  -- uses user_id part only

-- These do NOT use the index:
WHERE status = 'pending'
WHERE created_at > '2025-01-01'
WHERE status = 'pending' AND created_at > '2025-01-01'

Exception: index skip scan. PostgreSQL 18+ (skip scans for skipping leading equality columns; not yet released as of this writing — verify against your target version) introduced skip scans for some cases, and even earlier versions can sometimes use a composite index for the second column alone via a bitmap scan — but you cannot rely on this.

Column Order in a Composite Index

The rule of thumb: put equality conditions first, range conditions last.

-- GOOD: equality first, range last
CREATE INDEX ON orders (user_id, status, created_at);
-- Query: WHERE user_id = 42 AND status = 'pending' AND created_at > '2025-01-01'
-- Narrows to one user, one status, then scans the date range

-- BAD: range in the middle
CREATE INDEX ON orders (user_id, created_at, status);
-- Query: WHERE user_id = 42 AND status = 'pending' AND created_at > '2025-01-01'
-- After a range condition (created_at BETWEEN ...), columns to its right (status)
-- cannot be used as further index conditions — but they can still be applied as a
-- filter on the rows the index retrieves. The index is used; the filter just runs
-- after the index lookup, post-fetching all rows that match the range.

Why? Once the index encounters a range predicate (>, <, BETWEEN, LIKE 'x%'), it can narrow to a start point but then must scan all subsequent entries. Columns after a range column in a composite index cannot be used to skip entries.

Covering Indexes (Index-Only Scans)

If you add the SELECTed columns to the index, PostgreSQL can serve the entire query from the index without touching the heap:

-- Query needs user_id (filter) and name, email (select)
-- Without this index: index scan + heap fetch per row
-- With this index: index-only scan (Heap Fetches: 0)
CREATE INDEX idx_users_covering ON users (user_id) INCLUDE (name, email);

The INCLUDE syntax (PostgreSQL 11+) adds columns to the leaf pages only — they’re not part of the sort key, so they don’t affect the tree structure or index lookups. The included columns just ride along as payload, allowing index-only scans without the filtering confusion of making them full key columns.

Do not put frequently updated columns as key columns in a composite index — every update to those columns requires updating the index entry, which means a page write + potential page split. Frequently updated columns work better in INCLUDE (leaf-only) or in a separate index.


Partial Indexes

A partial index indexes only the rows matching a WHERE predicate.

-- Only index orders that aren't completed — active orders are the hot working set
CREATE INDEX idx_orders_active ON orders (user_id, created_at)
WHERE status != 'completed';

-- Only index non-null values
CREATE INDEX idx_users_referral ON users (referral_code)
WHERE referral_code IS NOT NULL;

-- Unique constraint only on active users (allow multiple deleted users with same email)
CREATE UNIQUE INDEX idx_users_email_active ON users (email)
WHERE deleted_at IS NULL;

Why partial indexes are powerful:

  1. Smaller index. If 95% of orders are completed and you only ever query active orders, a partial index on the 5% that are active is 20x smaller, fits more in memory, and has faster tree traversal.

  2. Better statistics. The planner’s estimates for a partial index cover only the indexed subset — tighter statistics, better plan choices.

  3. Conditional uniqueness. You can enforce uniqueness only within a subset of rows (as in the email example above), which is impossible with a regular unique index.

For a query to use a partial index, the query’s WHERE clause must imply the index predicate. The planner must be able to prove that any row satisfying the query also satisfies the index’s WHERE clause.

-- Uses the partial index (query implies status != 'completed')
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';

-- Does NOT use the partial index (query doesn't constrain status)
SELECT * FROM orders WHERE user_id = 42;

UUIDs and Indexes: The Fragmentation Problem

This is one of the most important and commonly misunderstood topics in PostgreSQL performance.

Why UUIDv4 Destroys B-tree Performance

UUIDv4 values are random — 550e8400-e29b-41d4-a716-446655440000 has no relationship to the UUID inserted before or after it. When you use UUIDv4 as a primary key (which creates a B-tree index), every insert goes to a random position in the index tree.

Here’s what happens at scale:

The cascade of problems:

  1. Cache thrash. Each insert touches a different random leaf page. On a large table, the entire leaf level of the index is larger than your buffer cache (shared_buffers is the chunk of RAM Postgres reserves as a database-wide page cache; default 128 MB, typically tuned to ~25% of system memory). Every insert requires reading a cold page from disk before writing to it — a random read + a dirty write per insert.

  2. Index bloat. Because inserts land on random pages, leaf pages fill up and split uniformly across the entire key space. After a split, both resulting pages are around 50% full. On a sequential insert pattern, the rightmost page is the only one that fills and splits — everything else stays at 90%+ fill factor. With random UUIDs, average fill factor degrades toward 70-75%, making the index ~25% larger than necessary.

  3. WAL amplification. Every random page touched must be written to WAL. Sequential inserts touch one hot page repeatedly — the WAL entry for a page write is amortized across many inserts. Random inserts each dirty a new page.

  4. Write latency spikes. Under high insert volume, the buffer eviction pressure from constantly loading new random pages causes latency spikes as the OS does I/O. This is most visible on spinning disks but happens on SSDs too.

How bad is it in practice? On a table of 50 million rows with a UUIDv4 primary key:

  • Index size: ~3GB (B-tree on 16-byte UUID × 50M rows, with bloat)
  • Insert throughput: drops significantly compared to serial/BIGSERIAL as the table grows
  • The working set of “hot index pages” = the entire leaf level ≈ the entire 3GB index

With a BIGSERIAL primary key (sequential integers):

  • Index size: ~800MB (8-byte integers, no bloat — rightmost leaf stays hot)
  • Insert throughput: stable as the table grows — only the rightmost leaf + its path to root need to be hot

The Fixes

Option 1: Use BIGSERIAL / BIGINT instead of UUID

The simplest fix. A 64-bit integer primary key is:

  • Half the size per row (8 bytes vs 16 bytes)
  • Always sequential — no fragmentation
  • Faster for equality lookups (smaller comparison)

Downside: exposes sequential IDs in URLs/APIs (enumeration risk), coordination complexity in distributed systems.

Option 2: UUIDv7 (time-ordered UUID)

UUIDv7 encodes a millisecond-precision timestamp in the most significant bits:

018e3c2f-1234-7xxx-xxxx-xxxxxxxxxxxx
^^^^^^^^^
timestamp prefix (48 bits, ms precision)

Because the timestamp prefix increases monotonically, UUIDv7 inserts always go to the rightmost leaf of the B-tree — exactly like sequential integers. You get global uniqueness without coordination, URL-safe IDs, and no B-tree fragmentation.

-- gen_random_uuid() has been in core PostgreSQL since 13 (and in pgcrypto long
-- before). For UUIDv7 specifically, use the pg_uuidv7 extension or generate at
-- the application layer.
CREATE EXTENSION pg_uuidv7;
SELECT uuid_generate_v7();  -- 018e3c2f-1234-7abc-...

-- Or generate in application code (most UUID libraries support v7 now)

Option 3: ULID

Universally Unique Lexicographically Sortable Identifier — 128-bit, 26-character Crockford Base32 string. The first 48 bits are a millisecond timestamp, the remaining 80 bits are random. Same properties as UUIDv7: monotonically increasing within a millisecond, globally unique, B-tree friendly.

Option 4: Keep UUIDv4 but use fillfactor

FILLFACTOR (the percentage of each index/heap page Postgres fills before considering it ‘full’ — default is 90 for indexes, 100 for heap tables; lowering it leaves room for HOT updates and reduces page splits on write-heavy tables).

If you must keep UUIDv4 (e.g., existing system, external ID requirements), reduce the index fill factor to leave room for out-of-order inserts without triggering as many splits:

CREATE INDEX idx_orders_uuid ON orders (id) WITH (fillfactor = 70);

This trades index size (larger, less dense pages) for fewer splits. It doesn’t fix cache thrash — random pages are still loaded cold for each insert — but reduces write amplification from constant splits.

Summary:

PK TypeInsert patternB-tree behaviorRecommended?
BIGSERIALSequentialRightmost-leaf only, no bloatYes — for internal systems
UUIDv4RandomFull-tree random access, ~25% bloatAvoid for high-write tables
UUIDv7 / ULIDMonotonic (by time)Rightmost-leaf only, like BIGSERIALYes — for distributed/external IDs
UUIDv1Random in B-tree key orderFragments like UUIDv4Avoid — see note below

Note on UUIDv1: Despite being timestamp-based, UUIDv1’s time bits are stored low-order-first — so it sorts roughly randomly in B-tree key order and fragments leaves like UUIDv4. UUIDv1 also leaks the host MAC address. UUIDv6 reorders v1’s bits to be monotonic but is rarely used; UUIDv7 is the right modern choice.


Data Types and Index Compatibility

Not every data type works well with every index type. Here’s the full picture.

Operator Classes

PostgreSQL links data types to index types via operator classes. An operator class defines which operators the index can answer for a given type. When you create an index, you can specify the operator class explicitly:

-- Default operator class for text (uses locale-aware collation)
CREATE INDEX ON users (last_name);

-- text_pattern_ops: ignores collation, supports LIKE 'prefix%' and pattern matching
-- Required for LIKE queries that need to use an index with non-C locale
CREATE INDEX ON users (last_name text_pattern_ops);

If your database locale is not C or POSIX, a regular B-tree index on a text column cannot support LIKE 'prefix%' — you need text_pattern_ops. This surprises many engineers who add an index on email and then find their WHERE email LIKE 'john%' query still does a seq scan.

-- Check your database locale
SHOW lc_collate;
-- If not 'C' or 'POSIX', use text_pattern_ops for LIKE support

-- Correct approach for pattern matching on non-C locale:
CREATE INDEX idx_users_email_pattern ON users (email text_pattern_ops);
SELECT * FROM users WHERE email LIKE 'john%';  -- now uses the index

Type-to-Index Quick Reference

Data TypeBest IndexNotes
integer, bigint, numericB-treeDefault. Full range/equality/order support.
text, varcharB-treeUse text_pattern_ops for LIKE on non-C locale.
text (substring/fuzzy)GIN + pg_trgmFor LIKE '%middle%', ILIKE, similarity.
uuidB-treeWorks. Random UUIDv4 causes fragmentation — use UUIDv7.
timestamp, timestamptzB-treeFull range support. BRIN if append-only large table.
dateB-treeSame as timestamp.
booleanPartial indexA plain B-tree on a boolean column has only two distinct values, so it can’t narrow a query much on its own. Use a partial index (WHERE active = true) to index only the rows that matter, or include the boolean as a leading or trailing column of a composite index.
enumB-treeEquality and ordering works. Low cardinality — partial index often better.
jsonbGIN@>, ?, `?
text[], int[] (arrays)GIN@>, &&, = ANY().
tsvectorGINFull-text @@. GiST if nearest-neighbor needed.
geometry (PostGIS)GiSTSpatial operators (&&, @>, <->).
tsrange, daterangeGiSTRange overlap &&, containment @>.
inet, cidrSP-GiSTSubnet containment <<, >>=. B-tree for exact equality.
hstoreGINKey existence, containment.

Low-Cardinality Columns: When B-tree Is Counterproductive

A low-cardinality column has very few distinct values — boolean, status with 3-4 values, country_code with ~250 values.

A B-tree index on status = 'pending' where 30% of rows are pending: the index will return 30% of all rows. PostgreSQL’s planner knows this and will likely choose a seq scan instead (since 30% > the ~15% threshold for preferring a seq scan). The index exists but is rarely used.

What works instead:

-- Partial index: only the rows you actually query
CREATE INDEX idx_orders_pending ON orders (created_at) WHERE status = 'pending';
-- This index is tiny (only pending rows), always selective, always used for pending queries

-- Composite index with high-cardinality leading column
CREATE INDEX idx_orders_user_status ON orders (user_id, status);
-- user_id is high-cardinality — user_id narrows to a manageable set, status filters further

JSONB Indexing: Choosing the Right Approach

JSONB is one of the most commonly over-indexed or wrongly-indexed types. There are three different approaches with very different trade-offs:

-- 1. GIN index on the whole column — supports @>, ?, ?|, ?&
--    Large index, good for existence/containment queries
CREATE INDEX idx_products_attrs_gin ON products USING GIN (attributes);
SELECT * FROM products WHERE attributes @> '{"color": "red"}';  -- uses GIN

-- 2. Functional B-tree index on a specific key — supports = and range on that key
--    Small index, only for that specific key, supports ordering
CREATE INDEX idx_products_color ON products ((attributes->>'color'));
SELECT * FROM products WHERE attributes->>'color' = 'red';  -- uses B-tree
SELECT * FROM products ORDER BY attributes->>'price'::numeric;  -- uses B-tree for sort

-- 3. GIN with jsonb_path_ops — smaller than default GIN, faster @> queries
--    But ONLY supports @>, not ?, ?|, ?&
CREATE INDEX idx_products_attrs_path ON products USING GIN (attributes jsonb_path_ops);
SELECT * FROM products WHERE attributes @> '{"color": "red"}';  -- uses GIN
-- SELECT * FROM products WHERE attributes ? 'color';  -- CANNOT use this index

Decision rule:

  • Querying by key existence (? 'color') or multi-key containment → full GIN.
  • Querying by specific key’s value with equality or range (->>'price' > 100) → functional B-tree on that key.
  • Only need containment (@>) and want a smaller index → jsonb_path_ops GIN.

When NOT to Create an Index

Every index has a cost. Engineers tend to over-index because the benefit (faster reads) is visible immediately and the cost (slower writes, more storage, maintenance overhead) accumulates gradually.

1. Small tables

PostgreSQL’s seq scan on a table that fits in a few pages is faster than consulting an index. The planner will ignore the index anyway. A “small table” threshold is roughly a few hundred rows, but this varies with row width. If EXPLAIN shows Seq Scan despite an index, the table might just be too small.

2. Write-heavy tables with no read latency requirement

Every INSERT updates all indexes on the table. Every UPDATE that changes an indexed column updates those indexes. Every DELETE marks index entries as dead. On a table with 10 indexes receiving 50,000 inserts/sec, 10× the write I/O is going to index maintenance. If you never query some of those indexes directly, drop them.

-- Find indexes that are never used (since last stats reset)
SELECT schemaname, tablename, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY schemaname, tablename;

An index with idx_scan = 0 since the last pg_stat_reset() is a candidate for removal. Verify it’s not a constraint enforcement index (primary key, unique) before dropping.

3. Low-cardinality columns (without a partial index)

A plain B-tree index on is_deleted boolean or status varchar with 3 values provides almost no filtering benefit. The planner will use a seq scan for any query that matches a non-trivial fraction of rows. Use a partial index instead.

4. Columns that are only queried as part of a large scan anyway

If a column is always accessed with other conditions that already narrow the result to a small set, an index on it alone adds no value. The composite index with the narrowing column first is the right answer.

5. During bulk loads

Drop non-constraint indexes before a bulk insert, then rebuild afterward:

-- Before bulk load
DROP INDEX CONCURRENTLY idx_orders_user_id;
DROP INDEX CONCURRENTLY idx_orders_created_at;

-- Load data
\COPY orders FROM 'orders.csv' CSV HEADER;

-- After bulk load — building from scratch is faster than incremental maintenance
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
CREATE INDEX CONCURRENTLY idx_orders_created_at ON orders (created_at);

Building an index from scratch is 3-10x faster than maintaining it through incremental inserts because PostgreSQL can sort the data once and write the B-tree in one pass. Incremental maintenance requires a random insertion for each row.


Index Maintenance

Index Bloat

MVCC keeps multiple versions of every row Postgres has updated or deleted but not yet vacuumed. Older versions (called dead tuples) take space until autovacuum reclaims them. See ACID and Isolation Levels for why MVCC exists in the first place.

Dead tuples in the heap have corresponding dead entries in every index on the table. VACUUM removes dead heap tuples and marks dead index entries as reusable, but it does not compact the index pages or return space to the OS. Over time, especially after bulk deletes or updates, indexes accumulate bloat — pages that are mostly empty but still occupy disk space and memory.

-- Check index bloat (requires pgstattuple extension)
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatindex('idx_orders_user_id');
-- free_percent > 20-30% is a sign of bloat worth addressing

For severe bloat, rebuild the index:

-- Non-concurrent: fast but locks the table (avoid in production)
REINDEX INDEX idx_orders_user_id;

-- Concurrent: safe in production, no table lock
REINDEX INDEX CONCURRENTLY idx_orders_user_id;

-- Or: create a new index, swap, drop old (manual concurrent reindex)
CREATE INDEX CONCURRENTLY idx_orders_user_id_new ON orders (user_id);
DROP INDEX CONCURRENTLY idx_orders_user_id;
ALTER INDEX idx_orders_user_id_new RENAME TO idx_orders_user_id;

CREATE INDEX CONCURRENTLY — Adding Indexes Without Downtime

Always use CONCURRENTLY in production when adding an index to a table that receives active queries:

-- Locks the table during build — ALL reads and writes block until done
CREATE INDEX idx_orders_user_id ON orders (user_id);

-- Takes only brief locks — reads and writes continue the entire time
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

On a table with millions of rows, a non-concurrent build can hold an AccessExclusiveLock for minutes. Every query that touches the table queues behind it. In production, that’s a full outage for that table.

What CONCURRENTLY Actually Does — Three Phases

The implementation is more subtle than “just build it without locking”:

Phase 1 — Snapshot 1 + initial build: PostgreSQL takes a snapshot of the table and builds an index on all rows visible in that snapshot. It holds only a ShareUpdateExclusiveLock during this phase — concurrent reads and writes are fully allowed. New writes that arrive during Phase 1 are not yet in the index.

Phase 2 — Wait for old transactions: PostgreSQL waits until all transactions that started before Phase 1 ended have committed or rolled back. This is the safety gate. Any transaction still running from before the index build started might have a view of the table that doesn’t include the Phase 1 index. PostgreSQL must wait them out.

Phase 3 — Snapshot 2 + catch-up: Takes a second snapshot, scans the heap again, and indexes any rows written between Phase 1 and Phase 2 that were missed. Takes another brief ShareUpdateExclusiveLock.

After Phase 3, the index is fully consistent and the planner can use it.

Why It Can Fail — INVALID Indexes

CREATE INDEX CONCURRENTLY can fail and leave behind an INVALID index in two situations:

  1. A long-running transaction never ends. Phase 2 waits for old transactions to close. If a transaction started before Phase 1 and stays open for hours (e.g., a forgotten BEGIN in a psql session, a hung application connection), the build stalls indefinitely. The build waits indefinitely until you cancel it. There is no automatic timeout — set a lock_timeout or statement_timeout if you need bounded waits. If the build is cancelled or fails, the partially-built index is left in INVALID state and must be REINDEX-ed or DROP-ed.

  2. A deadlock or unique constraint violation. If the index is unique and a duplicate value is inserted during the build, the index is marked INVALID.

An INVALID index is not used by the planner, but it does consume write overhead — every insert/update/delete still maintains it. You must drop and recreate it.

-- Find INVALID indexes
SELECT schemaname, tablename, indexname
FROM pg_indexes
WHERE indexname IN (
    SELECT indexrelid::regclass::text
    FROM pg_index
    WHERE NOT indisvalid
);

-- Drop the invalid index (also concurrent to avoid blocking)
DROP INDEX CONCURRENTLY idx_orders_user_id;

-- Recreate — make sure no long-running transactions are open first
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);

Constraints and Gotchas

Cannot run inside a transaction block. CONCURRENTLY requires its own transaction management across the three phases:

BEGIN;
CREATE INDEX CONCURRENTLY idx_orders_user_id ON orders (user_id);
-- ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
ROLLBACK;

Takes longer overall. The concurrent build does more total work than a non-concurrent one (two heap scans, phase wait). On a large table it may take 2-3x as long wall-clock time, though your application doesn’t feel it.

Unique indexes need extra care. For CREATE UNIQUE INDEX CONCURRENTLY, Phase 2 also checks for duplicates in the rows written during Phase 1. If duplicates exist, the index fails as INVALID — it won’t silently ignore them.

REINDEX CONCURRENTLY for rebuilding existing indexes (PostgreSQL 12+):

-- Rebuild a bloated or corrupt index without locking
REINDEX INDEX CONCURRENTLY idx_orders_user_id;

-- Rebuild all indexes on a table concurrently
REINDEX TABLE CONCURRENTLY orders;

Same three-phase mechanics, same INVALID-index risk on failure.

Pre-flight Checklist Before a Large Concurrent Build

-- 1. Check for long-running transactions that would stall Phase 2
SELECT pid, now() - xact_start AS duration, state, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY duration DESC;
-- Kill any that have been running for hours and are idle:
-- SELECT pg_terminate_backend(pid);

-- 2. Estimate index build time by checking table size
SELECT pg_size_pretty(pg_total_relation_size('orders'));

-- 3. Monitor build progress (PostgreSQL 12+)
SELECT phase, blocks_done, blocks_total,
       round(100.0 * blocks_done / nullif(blocks_total, 0), 1) AS pct
FROM pg_stat_progress_create_index
WHERE relid = 'orders'::regclass;

The pg_stat_progress_create_index view is invaluable for large tables — it shows exactly which phase the build is in and how far through the heap scan it is.

Autovacuum and Index Health

PostgreSQL’s autovacuum process runs automatically when a table accumulates enough dead tuples (controlled by autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor). It cleans up dead heap tuples and index entries. Without autovacuum:

  • Index entries accumulate as dead, causing index scans to become slower (more dead entries to skip).
  • The visibility map becomes stale, preventing index-only scans.
  • Table and index size grow unboundedly.

For high-write tables, autovacuum may not keep up with the default settings. Tune it per table:

ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum when 1% rows are dead (default 20%)
    autovacuum_vacuum_cost_delay = 2         -- less throttling for this table
);

Bloom Filter Indexes

PostgreSQL 9.6+ ships a bloom index access method as a contrib extension. It’s a specialized index type built on the classic Bloom filter data structure — useful in a narrow but real scenario that no other index type handles well.

What a Bloom Filter Is

A Bloom filter is a space-efficient probabilistic data structure that answers one question: “Has this value been seen before?” It can answer definitively “no” (the value is definitely absent) but only probabilistically “yes” (the value might be present — with a configurable false-positive rate).

Internally it’s a bit array. To insert a value, hash it with k different hash functions, and set those k bits. To test membership, hash the query value the same way and check whether all k bits are set. If any bit is unset, the value is definitely not in the set. If all are set, the value is probably in the set — some bits may have been set by other values.

Value "alice" → hash functions → set bits [3, 17, 42, 89]
Value "bob"   → hash functions → set bits [7, 17, 55, 91]

Query "alice": check bits [3, 17, 42, 89] — all set → probably present (fetch heap to confirm)
Query "carol": check bits [2, 18, 43, 89] — bit 2 unset → definitely absent (skip heap entirely)

PostgreSQL’s Bloom Index

CREATE EXTENSION IF NOT EXISTS bloom;

CREATE INDEX idx_orders_bloom ON orders
USING bloom (user_id, status, region)
WITH (length = 80, col1 = 2, col2 = 2, col3 = 3);
  • length — size of the bloom signature per row in bits (8–4096, default 80)
  • col1/col2/col3 — number of hash functions per column (1–4096, default 2)

The index stores a fixed-length bit signature for each row. More bits and more hash functions = fewer false positives, larger index.

The One Scenario Where Bloom Beats Everything Else

A bloom index is specifically designed for tables where you need equality searches across many columns in arbitrary combinations — not a fixed set of columns you always query together.

Consider a user activity log table:

CREATE TABLE activity_log (
    id          bigserial PRIMARY KEY,
    user_id     int,
    device_id   int,
    session_id  int,
    action_type int,
    country_id  int,
    partner_id  int
);

Queries come in any combination:

WHERE user_id = 42 AND country_id = 5
WHERE device_id = 99 AND action_type = 3
WHERE partner_id = 7 AND session_id = 1234 AND country_id = 5
WHERE user_id = 42 AND device_id = 99 AND partner_id = 7

To cover all possible combinations with B-tree composite indexes, you’d need potentially dozens of indexes — one for each common combination, with correct column ordering. That’s enormous write overhead and storage.

With a single bloom index covering all six columns, any equality condition on any subset of those columns can use the one index:

CREATE INDEX idx_activity_bloom ON activity_log
USING bloom (user_id, device_id, session_id, action_type, country_id, partner_id);

The index signature for each row is a bitwise OR of the per-column signatures. A query hashes its filter values, checks whether all resulting bits are set in a row’s signature. Rows that don’t match the bit pattern are skipped without ever reading the heap. Rows whose bit pattern might match are fetched and rechecked.

The Trade-offs

False positives are normal and expected. A bloom index trades a small rate of unnecessary heap fetches for dramatically smaller index size and simpler maintenance. The planner knows this — it always applies a recheck after the bloom scan. You’ll see it in EXPLAIN:

Bitmap Heap Scan on activity_log
  Recheck Cond: ((user_id = 42) AND (country_id = 5))
  Rows Removed by Index Recheck: 23      ← false positives fetched from heap
  ->  Bitmap Index Scan on idx_activity_bloom
        Index Cond: ((user_id = 42) AND (country_id = 5))

Only supports equality (=). No range, no LIKE, no ORDER BY. Bloom is an existence check — it can’t reason about ordering or ranges.

Not useful for high-selectivity single-column queries. A B-tree beats bloom for WHERE user_id = 42 alone. Bloom shines when the combination of multiple equality conditions provides selectivity that no single-column B-tree index can.

Tuning the false-positive rate:

-- More bits per row = fewer false positives, larger index
-- Default (length=80): ~1-3% false positive rate for typical data
-- Higher length trades storage for fewer wasted heap fetches

-- Measure current false-positive rate:
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM activity_log WHERE user_id = 42 AND country_id = 5;
-- Look at "Rows Removed by Index Recheck" vs total rows fetched
-- If removal rate > 20-30%, increase length

When to Reach for Bloom

SituationUse Bloom?
Many columns, all equality, arbitrary combinationsYes — this is exactly what it’s for
High-cardinality single columnNo — B-tree is better
Range queriesNo — bloom can’t help
Fixed set of columns always queried togetherNo — composite B-tree is better
Write-heavy table, need to minimize index countYes — one bloom index vs many B-trees
Need to check “has this user_id + device_id ever appeared together?”Yes

In practice, bloom indexes appear most in analytics tables, event logs, and data warehouse-style schemas where ad-hoc multi-column equality filtering is common and you want to avoid maintaining a combinatorial explosion of B-tree indexes.


Common Mistakes Checklist

  • UUIDv4 as a primary key on a high-write table. Random insertion order causes constant B-tree page splits, hurts insert latency, and inflates the index well past what sequential keys would. Use UUIDv7 (time-prefixed) or BIGSERIAL.
  • CREATE INDEX without CONCURRENTLY in production. Locks the table against writes for the duration of the build. Always use CREATE INDEX CONCURRENTLY on a live system — and watch for invalid indexes if it fails partway through (SELECT … FROM pg_index WHERE NOT indisvalid).
  • Indexing a low-cardinality column on its own. A boolean or 3-value status column has too few distinct values for B-tree to help. Use a partial index on the selective subset instead.
  • Wrapping the indexed column in a function. WHERE LOWER(email) = 'x' won’t use an index on email unless you have a functional index on LOWER(email). Same for created_at::date, EXTRACT(...), type casts, and most function applications.
  • Composite index column order doesn’t match the query. A (a, b, c) composite is usable for WHERE a = …, WHERE a = … AND b = …, and WHERE a = … AND b = … AND c = … — but not for WHERE b = … alone. Lead with the highest-selectivity equality columns; trailing columns are only useful when the leading columns are constrained.
  • Mismatched types in the predicate. WHERE int_col = '42' may not use an int_col index because the comparison forces a cast. Make sure literals match column types — or cast the literal, not the column.
  • Indexing every column “just in case.” Each index adds write amplification, takes up space, and gives the planner more shapes to consider. Index what you query; remove indexes that pg_stat_user_indexes shows idx_scan = 0.
  • Forgetting ANALYZE after a bulk load. Stale statistics lead the planner to ignore newly-useful indexes. Always ANALYZE after big inserts or COPYs.
  • Partial index whose WHERE doesn’t match the query. The query’s predicate must be a logical superset of the partial index’s predicate. WHERE active = true AND created > now() won’t use a WHERE active = true partial index unless the planner can prove the subset relationship.
  • SELECT * defeating index-only scans. Index-only scans require every returned column to be in the index. Listing the specific columns you need (or adding INCLUDE to the index) lets the index-only path light up.
  • BRIN on data with no correlation. BRIN assumes physical row order roughly matches indexed-value order. On a table with random insertion order, BRIN can be worse than a sequential scan. Check correlation in pg_stats before reaching for BRIN.
  • REINDEX as a reflex. Modern Postgres rarely needs manual reindexing. Use it for genuinely bloated indexes (pg_stat_user_indexes + pgstattuple), not as a general fix.


Quick Reference

SituationRecommendation
High-cardinality equality + rangeB-tree
Append-only 100GB+ table, time-range queriesBRIN (check correlation first)
Array or JSONB containmentGIN
Full-text searchGIN on tsvector
Fuzzy / substring string matchingGIN + pg_trgm
Geospatial / range overlapGiST
IP/CIDR subnet queriesSP-GiST
5% of rows are the hot “active” working setPartial index on those rows
UUID primary key, high insert rateSwitch to UUIDv7 or BIGSERIAL
boolean or enum indexPartial index (avoid plain B-tree on low-cardinality column)
LIKE '%substring%' or ILIKEGIN + pg_trgm extension
LIKE 'prefix%' with non-C localeB-tree with text_pattern_ops
JSONB key value equality/rangeFunctional B-tree on (col->>'key')
Multi-condition AND with no composite indexBitmapAnd on individual indexes (but add composite if frequent)
Index not used despite existingCheck: function on column, type mismatch, low selectivity, stale stats, random_page_cost
Bulk insertDrop non-constraint indexes, load, CREATE INDEX CONCURRENTLY
Many columns, equality-only, arbitrary combinationsBloom filter index (CREATE EXTENSION bloom)
Adding index to live production tableAlways CREATE INDEX CONCURRENTLY
Rebuilding bloated/corrupt index in productionREINDEX INDEX CONCURRENTLY
Index build stalling (Phase 2 wait)Check pg_stat_activity for long-running transactions