Part 8 of Database Internals

Postgres Concurrency Internals: Connections, Locks, and MVCC

34 min read

You run ALTER TABLE on the busy orders table and every query in the cluster freezes for forty seconds. A Python script hangs with one open transaction overnight, and the next VACUUM runs for an hour. The dashboard that’s fast for you takes three seconds for the user sitting next to you, on the same data. Most of the day-to-day weirdness in a production database comes from many clients talking to it at once, and someone holding something they shouldn’t be.

The previous post covered the semantics of concurrency — which isolation level to pick, what each one guarantees, where the anomalies come from. This post is the mechanics: what actually happens when your connection hits Postgres, where the locks live, why MVCC lets readers and writers coexist, and how to figure out who’s blocking whom at 3 AM. Same database, different layer.

Part 8 of the Database Internals series.


1. The Connection Model

Every Postgres client connection is a separate OS process on the database server. There is no thread pool inside Postgres. There is no event loop multiplexing connections onto a small number of workers. When psql connects, an entire postgres process is forked for it, lives until the client disconnects, and dies when the client goes away.

The postmaster is the supervisor process. It listens on the TCP port, accepts an incoming connection, authenticates the client, then fork()s a new backend process to serve that client for its lifetime. The backend executes queries, owns its own local memory (work_mem, temp_buffers), and reaches into shared memory (shared_buffers, the WAL buffers, the lock table) for anything cluster-wide.

A backend uses roughly 5–15 MB of resident memory on its own, depending on Postgres version and what the connection has done (cached plan trees, prepared statements, large work_mem for sorts). Multiply by max_connections, add shared_buffers, and you have a hard ceiling on how many backends a box can hold before you swap.

Why this matters

  • Spawning isn’t free. Forking takes single-digit milliseconds. Most short-lived clients (HTTP request handlers, lambdas, CLI scripts) spend more time on connection setup than on the actual query if they reconnect every time.
  • Idle backends still cost RAM. A connection holding open a transaction overnight still pins memory and prevents max_connections from being available to someone else.
  • There’s no in-process query cache. Each backend builds its own plan cache. Two clients running the same query don’t share a parsed plan unless they’re behind a connection pool that shares backends.
  • Killing a backend is the unit of cancellation. pg_terminate_backend(pid) sends SIGTERM; pg_cancel_backend(pid) cancels the current query but keeps the session.

This is unusual. MySQL InnoDB uses thread-per-connection (one OS thread, shared address space — lower memory per connection, but a bug in one thread’s code path can corrupt another’s state). MS SQL Server uses fiber-style scheduling on a thread pool. Postgres’s choice was made in the early 1990s for isolation and portability, and it has shaped almost every operational pattern around the database since.

The practical consequence: max_connections caps the number of backends, not the number of clients you can serve. Each backend is a process tied to one CPU core at a time, so useful work is bounded by your cores — not by how many backends you allow. But most clients spend most of their time not executing a query: they’re waiting on the network, on application code, on a user clicking a button. A small pool of backends, kept busy and shared across many clients, serves far more clients than the pool’s size. That gap — many clients, few backends — is the gap connection pooling exists to close.


2. The Vocabulary

Before any code, the words. Concurrency discussions go sideways quickly when these terms are used loosely.

Backend — the OS process that serves one client connection. One backend per connection, period. Visible in ps as postgres: user dbname host(port) idle.

postmaster — the parent supervisor process. Accepts connections, forks backends, restarts crashed background workers. PID 1 of the cluster.

max_connections — hard upper bound on simultaneous backends. Changing it requires a restart. Default 100. Includes superuser-reserved slots.

Connection pool — a layer between clients and the database that maintains a small pool of long-lived backends and multiplexes many client connections over them. Implementations: PgBouncer, pgcat, Supabase Supavisor, application-side pools (HikariCP, pgx’s built-in pool).

Pool mode — how aggressively the pooler hands out backends. Session = backend held for the whole client session; Transaction = backend assigned per transaction; Statement = backend per statement (rarely used).

Heavyweight lock (or regular lock) — a lock on a database object (a table, a row, an index, an advisory key) that Postgres takes on your behalf when you run a statement. You don’t pick the mode; the statement does. Tracked in the cluster-wide lock table, visible in pg_locks, released at transaction end.

Lightweight lock (LWLock) — an internal lock protecting shared-memory data structures inside Postgres (buffer pool slots, the WAL insert position, the lock table itself). Two modes: SHARED and EXCLUSIVE. Held for microseconds. You don’t take these directly — they’re acquired by Postgres internals on your behalf.

Spinlock — the lowest-level mutex, used when a lock is held for nanoseconds. Busy-waits in a tight loop. Below LWLocks in the stack.

Lock mode — for heavyweight locks, the kind of lock: from the gentlest ACCESS SHARE (a SELECT) to ACCESS EXCLUSIVE (an ALTER TABLE). Eight modes total. A lock conflict matrix says which pairs of modes can coexist on the same object and which block each other.

MVCC (Multi-Version Concurrency Control) — the trick that lets Postgres readers and writers ignore each other. Instead of locking a row while you read it, Postgres keeps multiple versions of the row on disk and serves whichever version was committed when your transaction started. A long-running SELECT doesn’t block UPDATEs, and ongoing UPDATEs don’t block readers — they each see their own consistent snapshot. The cost: outdated versions pile up until vacuum reclaims them. Postgres, Oracle, and MySQL InnoDB all use MVCC. Section 5 unpacks the on-disk mechanics.

Tuple — Postgres’s term for a row. Specifically, a physical row version on a heap page. An UPDATE produces a new tuple; the old one stays around for MVCC visibility until vacuum reclaims it.

xmin / xmax — hidden columns on every tuple. xmin = the transaction ID that inserted it. xmax = the transaction ID that deleted or updated it (zero if neither). Together they decide whether a tuple is visible to a given snapshot.

XID (Transaction ID) — a 32-bit number assigned to every write transaction. Read-only transactions get a “virtual” XID that costs nothing. Wraparound is real, autovacuum manages it.

Snapshot — a frozen record of “which transactions had already committed at the moment I started.” Every transaction takes one. Every row the transaction reads is filtered through it, so the transaction sees a consistent point-in-time view of the database even as other transactions commit changes around it.

Wait event — what a backend is blocked on right now. Column in pg_stat_activity. Values include Lock (waiting on a heavyweight lock), LWLock, BufferPin, IO, Client, Timeout, IPC. The first thing to check when “the database is slow.”

Deadlock — two or more transactions waiting on locks held by each other, forming a cycle. Postgres detects this periodically (every deadlock_timeout, default 1 second) and aborts one transaction with SQLSTATE 40P01.

Advisory lock — a database-hosted mutex that callers compete for using an integer key they pick themselves. Postgres doesn’t connect the key to any row, table, or object — it just guarantees that two callers asking for the same key can’t both hold the lock at once. Used by application code to serialize work across processes (“only one instance runs this cron job,” “elect a leader”) when you already have a database and don’t want to stand up a separate coordinator like Redis or ZooKeeper.


3. Connection Pooling

A backend can run one query at a time, and the OS can only execute as many processes in parallel as the box has cores. So however many backends you allow, only a handful are actually doing work in any given moment — roughly 2 × cores + effective_io_concurrency, give or take, depending on how much your queries wait on disk. Stack more backends on top of that and they start fighting each other for the same CPU and the same internal lock table; throughput stops climbing and starts dropping:

The trick: most applications have thousands of clients but only need dozens of backends at any instant. A web server with 200 worker threads doesn’t issue 200 simultaneous queries — most threads spend most of their time in application logic, I/O to other services, or idle. A connection pooler exploits this.

PgBouncer pool modes

PgBouncer is the industry-standard Postgres pooler. Its three modes differ in how long it holds a backend on behalf of a client.

Session pooling. A client connects, PgBouncer assigns it a backend, the backend stays bound to that client until the client disconnects. PgBouncer is essentially a connection multiplexer for the count, not the time. Useful when you have many short-lived clients (CLI scripts, lambdas) but ordinary session semantics inside each one.

Transaction pooling. A client checks out a backend at BEGIN and returns it at COMMIT/ROLLBACK. Between transactions, the same backend can serve a different client. This is the high-leverage mode — a pool of 50 backends easily serves 5000 clients if average transactions are a few milliseconds. Most production deployments use transaction pooling.

Statement pooling. Backend returned after every statement. Breaks multi-statement transactions entirely. Rarely useful outside read-only pipelines.

Transaction-pooling caveats (the real-world list)

The cost of transaction pooling is that anything that relies on session state between transactions breaks. The exhaustive list:

  • Server-side prepared statements. A prepared statement lives on the backend that prepared it. If the next transaction lands on a different backend, the prepared name doesn’t exist. The fixes: tell your driver to disable server-side prepares (prepareThreshold=0 in JDBC; binary_parameters=yes + prefer_simple_protocol in some Go drivers), or use a pooler that emulates prepared statements (recent PgBouncer ≥ 1.21 and pgcat both do this).
  • SET (without LOCAL). Session-scoped SET timezone = 'UTC' is sticky on the backend. The next client lands on a backend with whatever the previous client did. Use SET LOCAL, which is transaction-scoped and cleared at COMMIT.
  • Cursors declared with DECLARE CURSOR ... WITH HOLD. Cursors don’t survive backend hand-off. Use server-side cursors only in session pooling.
  • LISTEN / NOTIFY. The subscription is on the backend. Once the backend is returned to the pool, you stop receiving notifications.
  • Temporary tables. CREATE TEMP TABLE lives on the backend; the next transaction won’t see it.
  • Advisory session locks (pg_advisory_lock). Released only when the backend session ends. Under transaction pooling, sessions don’t really end on the client’s clock — use pg_advisory_xact_lock instead.

When PgBouncer’s transaction pooling caveats bite, the symptom is usually “works in dev with one client, breaks in prod under load.” The fix is either to disable the offending session-scoped feature or to put PgBouncer in session-pooling mode for the specific path that needs it (two pools, two ports).

Pool sizing

The starting formula for the backend pool:

pool_size = (cpu_cores × 2) + effective_io_concurrency

For a 16-core box on NVMe: 16 × 2 + 4 ≈ 36. Round to 40. The remaining max_connections budget pays for admin sessions, replication slots, and headroom.

The starting formula for the client side (default_pool_size × max_db_connections):

max_client_conn = number of app instances × average client pool size

Set this generously — PgBouncer holds idle clients cheaply (a few KB each).

What’s after PgBouncer

PgBouncer is single-process, single-threaded (which is fine for hundreds of thousands of clients on modern CPUs, but caps single-instance throughput in the millions-of-qps range). The newer options:

  • pgcat — Rust, multi-threaded, supports sharding and read/write splitting natively.
  • Supabase Supavisor — Erlang/Elixir, designed for cloud-scale fan-in.
  • Odyssey (Yandex) — multi-threaded, transaction-pooling pgbouncer drop-in.

For the median application: PgBouncer in transaction-pooling mode, in front of a pool of 30–60 backends, is the right starting point.


4. The Lock Manager

Two layers, very different jobs.

Heavyweight locks — the user-visible layer

Heavyweight locks protect database objects: tables, rows, indexes, advisory keys. They’re tracked in a cluster-wide hash table in shared memory, indexed by (database, relation, object). Every entry records which transaction holds it, in which mode, and which transactions are queued waiting for it.

There are two questions to keep straight, and the post is going to answer them with two separate tables.

  1. Acquisition — given a SQL statement, which mode does it take?
  2. Compatibility — given two locks on the same object, can they coexist or does one wait?

The first is a one-way lookup. The second is a symmetric relation between every pair of modes, and it’s the actual conflict matrix Postgres consults on every request.

Acquisition: which statement takes which mode

Postgres defines eight table-level lock modes. Statements take whichever they need automatically, weakest to strongest:

ModeAcquired by
ACCESS SHARESELECT
ROW SHARESELECT FOR UPDATE / FOR SHARE
ROW EXCLUSIVEINSERT, UPDATE, DELETE
SHARE UPDATE EXCLUSIVEVACUUM (non-FULL), ANALYZE, CREATE INDEX CONCURRENTLY
SHARECREATE INDEX (non-concurrent)
SHARE ROW EXCLUSIVErarely explicit
EXCLUSIVEREFRESH MATERIALIZED VIEW CONCURRENTLY
ACCESS EXCLUSIVEALTER TABLE, DROP TABLE, TRUNCATE, REINDEX, VACUUM FULL

The important rule: you don’t pick the lock — the statement does. SELECT always takes ACCESS SHARE. ALTER TABLE always takes ACCESS EXCLUSIVE. You don’t get a vote; Postgres just does whatever the statement requires before executing it. The three ways to ask for a specific lock yourself:

  • SELECT ... FOR UPDATE / FOR SHARE / FOR NO KEY UPDATE / FOR KEY SHARE — picks a row-level lock mode by name. Covered in the next subsection.
  • LOCK TABLE orders IN ACCESS EXCLUSIVE MODE — takes a table-level lock at exactly the mode you specify, no statement attached. Useful when you’re about to run several ALTER TABLEs and want one acquisition at the start of the transaction instead of each statement re-acquiring it.
  • pg_advisory_lock(key) and friends — application-defined locks keyed by an integer. Covered in section 8.

Compatibility: which modes can share an object

Once a mode is held on an object, the conflict matrix decides whether the next request is granted or made to wait. Rows are what’s already held; columns are what’s being requested. Blank means “compatible — grant immediately”; ✗ means “conflict — wait in queue.”

Held ↓ \ Requested →ASRSRESUESSREEAE
ACCESS SHARE (AS)
ROW SHARE (RS)
ROW EXCLUSIVE (RE)
SHARE UPDATE EXCLUSIVE (SUE)
SHARE (S)
SHARE ROW EXCLUSIVE (SRE)
EXCLUSIVE (E)
ACCESS EXCLUSIVE (AE)

How the two tables connect: the first tells you which mode your statement asks for; the second tells you whether you get it now or queue. A SELECT (asks for AS) running on a table where someone holds AE (a slow ALTER TABLE) — table 1 says ACCESS SHARE is the requested mode, table 2’s AE row says every requested mode is blocked, so your SELECT waits. The pg_locks view is essentially this matrix in motion: at any moment it shows you what’s held, what’s requested, and whether each request was granted (granted = true) or queued.

A few entries in the matrix worth committing to memory:

  • The bottom-right cell — ACCESS EXCLUSIVE blocks ACCESS SHARE — is why a careless ALTER TABLE freezes every SELECT on the table.
  • SHARE UPDATE EXCLUSIVE conflicts with itself, which is why two VACUUMs on the same table can’t run in parallel. CREATE INDEX CONCURRENTLY exists at exactly this mode — gentle enough not to block reads or writes, but two CONCURRENTLY builds on the same table still serialize.
  • SHARE and SHARE don’t conflict, so two non-concurrent CREATE INDEX statements on different indexes of the same table can run in parallel (they both just read the heap).

Row-level locks (a different mechanism)

Postgres does not keep per-row entries in the cluster-wide lock table. There would be far too many of them. Instead, row-level locks live in the row’s own tuple header — specifically, the xmax field gets stamped with the locker’s XID, plus a flag indicating “this is a lock, not a delete.” This is brilliantly cheap (free in storage) but has a consequence: you can’t tell, from the lock table, who is row-locking what without joining against the tuple header. The view pg_locks shows row-level locks only as tuple rows on rows that are currently being waited on; non-contended row locks don’t appear at all.

Four row-lock modes, from weakest to strongest:

  • FOR KEY SHARE — taken by foreign-key checks. Blocks UPDATE of indexed columns.
  • FOR SHARE — explicit shared lock.
  • FOR NO KEY UPDATE — taken by UPDATE that doesn’t touch the primary key.
  • FOR UPDATE — the strong one. Blocks everyone, including FK checks.

Most application code reaches for SELECT ... FOR UPDATE (covered in the ACID post). The weaker modes are mostly internal but visible when foreign keys cause unexpected contention — an UPDATE on the parent row competes with FK checks on every child.

Lightweight locks — the internal layer

Heavyweight locks are transactional — held until COMMIT. LWLocks are operational — held for microseconds while a backend reads or writes a shared-memory structure. Examples:

  • Pinning a buffer in shared_buffers (so it isn’t evicted while you read it).
  • Inserting a record into the WAL.
  • Reading the cluster-wide lock table (yes, the lock table itself is LWLock-protected).
  • Updating the visibility map.

LWLocks have only two modes — SHARED and EXCLUSIVE — and don’t deadlock-check (they’re held for so short a time that detection would cost more than the deadlock). If a backend is spending real time blocked on an LWLock, something is wrong further up: too many backends contending on one buffer, an XLogInsert bottleneck, or a hot lock-table partition.

The wait events distinguish them clearly. In pg_stat_activity.wait_event_type:

  • Lock — waiting for a heavyweight lock. Probably your application code or a long transaction.
  • LWLock — waiting on an internal lock. Probably a Postgres configuration or workload issue.
  • BufferPin — waiting to pin a buffer that something else is preventing eviction of. Rare.
  • IO — waiting on disk or the OS page cache. Not a lock at all.
  • Client — waiting on the client to send the next command (idle in transaction).

Different wait_event_types, completely different fixes.


5. MVCC Under the Hood

The semantics of MVCC live in the ACID post: every transaction sees a snapshot, readers don’t block writers, snapshot isolation. The mechanics are simpler than they sound.

The tuple header

Every row in Postgres carries hidden columns alongside its user-defined columns. You can see them:

SELECT xmin, xmax, ctid, * FROM accounts WHERE id = 42;
  • xmin — the XID of the transaction that inserted this tuple.
  • xmax — the XID of the transaction that deleted or updated it (zero if neither).
  • ctid — the physical address (page, offset) of the tuple on disk.

When you UPDATE a row:

  1. The current tuple’s xmax is set to your XID.
  2. A new tuple is written, with xmin = your XID and xmax = 0.
  3. The new tuple’s ctid points back to where the row physically lives now.

The old tuple is still on disk. It will be visible to any transaction whose snapshot was taken before your commit, and invisible to any started after. Once no snapshot can see the old tuple anymore, vacuum reclaims the space.

Snapshots

A snapshot is taken at the start of a transaction (under READ COMMITTED, also at the start of each statement). Conceptually:

snapshot = {
  xmin: lowest active XID at snapshot time,
  xmax: next XID that will be assigned,
  active: list of XIDs that were in-progress at snapshot time,
}

The visibility rule, in one sentence: a tuple is visible to a snapshot if the transaction that inserted it had committed before the snapshot was taken, and the transaction that deleted it (if any) hadn’t. That’s the whole of MVCC visibility. Everything else — repeatable reads, read-your-writes, why long transactions cause bloat — falls out of that single rule.

In code, this means checking the tuple’s xmin against the snapshot (“did the inserter finish in time?”) and its xmax against the snapshot (“did the deleter, if any, finish in time?”). The Postgres source for the check lives in HeapTupleSatisfiesMVCC; you almost never need to look at it.

HOT updates (the optimization that matters)

If an UPDATE doesn’t change any indexed column, Postgres can perform a HOT (Heap-Only Tuple) update: the new tuple is written to the same page if there’s room, and the indexes don’t need updating because they still point to the old ctid, which forwards to the new one. HOT updates massively reduce index bloat on update-heavy tables.

Concrete consequence: adding an index on a frequently-updated column can dramatically slow down updates, because every update on a row now mutates an indexed column and forces a full index entry for the new tuple. This is one of the silent reasons “we added an index and writes got worse.”

Vacuum: the garbage collector

Dead tuples (xmax < oldest active snapshot’s xmin) can be reclaimed. Autovacuum is the background daemon that does this periodically per table, based on dead-row thresholds. The user-facing consequences:

  • Long-running transactions delay vacuum. A transaction that started two hours ago has a snapshot two hours old. Vacuum can’t reclaim anything newer than that. Dead tuples pile up. Tables bloat.
  • VACUUM (non-FULL) is online. It acquires SHARE UPDATE EXCLUSIVE, which conflicts only with schema changes and other vacuums. Reads and writes continue.
  • VACUUM FULL is offline. It rewrites the entire table, taking ACCESS EXCLUSIVE. Don’t run it on a live production table without a maintenance window. Use pg_repack (or its bundled extensions) instead — it does the same job online, locking only for the final swap.

XID wraparound

XIDs are 32 bits — 4.3 billion of them. Postgres uses a circular comparison: at any moment, half the space is “past” and half is “future.” If a tuple’s xmin would be older than the past half (>2 billion XIDs old), it becomes invisible to everyone — a silent data-loss condition.

Autovacuum prevents this by freezing old tuples: rewriting their xmin as a special “frozen” sentinel that’s visible to every future snapshot. The vacuum_freeze_max_age setting controls how aggressive freezing is. If autovacuum can’t keep up, you eventually see warnings like database "x" must be vacuumed within N transactions. At 1 million transactions remaining, Postgres refuses new writes and enters single-user mode until you run a manual emergency vacuum.

This used to be the most embarrassing Postgres outage class — entire shops taken offline by XID exhaustion. Postgres 14+ defaults plus monitoring of pg_database.datfrozenxid have mostly retired it, but it’s still the reason “autovacuum is healthy” is not a question you skip on a production cluster.


6. Deadlocks and How They’re Resolved

A deadlock is the most boring kind of concurrency bug: two transactions, each waiting on a lock the other one is holding, neither willing to give up. They’d wait forever if nothing intervened.

Postgres notices. Whenever a backend has been waiting on a lock for longer than deadlock_timeout (default 1 second), Postgres walks the “who’s waiting on whom” graph looking for cycles. If it finds one, it picks a victim and kills it:

ERROR:  deadlock detected
DETAIL:  Process 1234 waits for ShareLock on transaction 5678; blocked by process 4321.
         Process 4321 waits for ShareLock on transaction 5679; blocked by process 1234.
HINT:  See server log for query details.
SQLSTATE: 40P01

The other transaction proceeds. The aborted transaction’s client should retry.

The classic cause: inconsistent lock ordering

Both transactions tried to do the same thing — lock two accounts in a money transfer — but each locked them in a different order. The fix is canonical lock ordering: always lock the row with the lower ID first.

-- safe pattern
BEGIN;
  SELECT * FROM accounts WHERE id IN ('A1', 'A2')
   ORDER BY id  -- guarantees consistent lock order
   FOR UPDATE;
  -- transfer logic
COMMIT;

This works because Postgres acquires row locks in row-fetch order. Two transactions hitting the same pair always lock in the same order, so no cycle is possible.

When deadlocks are inevitable

Some workloads will deadlock no matter how you order locks — for example, UPDATE on rows whose order can’t be predicted by the application. The right pattern is: catch the 40P01 error, sleep with jitter, retry the transaction. Most ORMs (and Postgres clients) make this trivial. Treating deadlocks as “fatal application bugs” instead of “expected concurrency outcomes” is a common over-correction.

deadlock_timeout tuning

Lowering deadlock_timeout makes the detector run more often — fewer false-wait stalls before resolution, but more CPU spent checking. The default of 1 second is right for almost all workloads; raise it only if your monitoring shows the detector running frequently on contended tables. Lowering below 100 ms is rarely productive.


7. Observing It Live

When the database “feels slow,” the answer almost always lives in three views.

pg_stat_activity — what every backend is doing

SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       now() - query_start AS query_age,
       LEFT(query, 60) AS query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY xact_age DESC NULLS LAST;
  • stateactive (running a query), idle in transaction (the danger zone), idle.
  • wait_event_type, wait_event — what’s blocking, if anything.
  • xact_age — how long the transaction has been open. Anything over a few seconds on an OLTP system is suspicious; anything over minutes is almost certainly a bug.

idle in transaction is the single most common cause of mystery contention. The application opened a transaction, took some row locks, then went off to do something slow (an external HTTP call, a Thread.sleep) without committing. Every other transaction touching those rows piles up behind it.

pg_locks — every lock in the cluster

SELECT locktype, relation::regclass, mode, granted, pid
FROM pg_locks
WHERE NOT granted;

Ungranted locks (granted = false) are the things waiting. Join pg_locks to pg_stat_activity by pid to see which queries are stuck and on what.

pg_blocking_pids() — the one-shot diagnostic

SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type, wait_event,
       query
FROM pg_stat_activity
WHERE state = 'active' AND pg_blocking_pids(pid) <> '{}';

This returns, for every backend currently waiting, the list of PIDs blocking it. The output reads like a directed graph: “PID 1234 is blocked by 5678 and 5679.” Trace upwards to find the root culprit — usually one idle in transaction session holding everything else hostage.

If you don’t memorize any other query in this post, memorize this one. It is the single highest-leverage observation in Postgres operations.

Not concurrency-specific, but the natural neighbour: query-level totals over time (calls, total time, mean time, rows). Extension, needs CREATE EXTENSION pg_stat_statements; and shared_preload_libraries. When pg_blocking_pids shows a query holding everyone else up, pg_stat_statements shows how often that query runs and how slow it is on average.


8. Advisory Locks

Postgres exposes its lock manager directly to application code via advisory locks. They have no semantic meaning to Postgres — they’re keyed by a 64-bit integer (or a pair of 32-bit integers) that you assign and interpret. Two clients that agree on what key 42 means can use it to serialize work.

-- session-scoped: released when the session ends or you release explicitly
SELECT pg_advisory_lock(42);
-- ... critical section ...
SELECT pg_advisory_unlock(42);

-- transaction-scoped: released automatically at COMMIT/ROLLBACK
SELECT pg_advisory_xact_lock(42);
-- ... critical section ...
COMMIT;

There are blocking and non-blocking variants:

  • pg_advisory_lock(key) — blocks until acquired.
  • pg_try_advisory_lock(key) — returns true if acquired, false if held by someone else. The “I’ll try, but skip if busy” pattern.

When to reach for them

  • Idempotent batch jobs. A nightly cron is invoked by two replicas of your scheduler at the same time. Wrap the job in pg_try_advisory_xact_lock(job_id); the second invocation no-ops.
  • Leader election when you already have a Postgres. One worker holds pg_advisory_lock(leader_key). Whichever connection acquires it is the leader. Drop the connection and another worker picks it up.
  • Coordinating expensive, non-transactional side effects. Sending an email, calling an external API, regenerating a cache — pg_try_advisory_xact_lock ensures only one instance does the work.

What NOT to use them for

  • Replacing row-level locks. Use SELECT ... FOR UPDATE for that. Advisory locks bypass MVCC and don’t protect against your own concurrent transactions writing the same rows.
  • Cross-cluster coordination. Advisory locks live in one Postgres cluster’s lock table. They mean nothing to replicas, other clusters, or anything outside that one shared memory segment.
  • Anything that should survive a network partition. A session that loses its TCP connection has its session-scoped lock released by the postmaster — but only when the backend actually exits, which can take tcp_keepalives_idle + tcp_keepalives_interval × tcp_keepalives_count (minutes by default).

The session-pool footgun

pg_advisory_lock is session-scoped. Under PgBouncer transaction pooling, your “session” is one transaction. The lock is released the moment that transaction ends. Use pg_advisory_xact_lock instead — its lifetime matches the transaction, which is exactly what transaction pooling guarantees you own. This is the most common subtle bug when wiring advisory locks into a pooled application.


9. Buffer Pool Concurrency (Briefly)

shared_buffers is one large block of shared memory divided into 8 KB buffers, each of which can hold one page from a table or index. Every backend reads and writes through this pool — there’s no per-backend page cache.

Concurrency lives at two layers:

  1. The buffer mapping table — which buffer holds which page. A central hash table protected by 128 LWLock partitions. Contention here shows up as BufferMapping waits, usually triggered by very high churn between sets of hot pages and rarely-visited ones.
  2. Per-buffer state — a pin count, a content lock (SHARED for read, EXCLUSIVE for write), an “is dirty” flag. Pinning a buffer prevents eviction while a backend is reading from it. Content locks let multiple readers share a page; writers must be alone.

The everyday consequence: two backends reading the same hot page don’t block each other. Two backends writing to the same page do — for the duration of writing one row, which is microseconds. Most “lock contention” stories in Postgres ops chats are actually buffer-content contention, which is a different fix (sort the inserts, partition the table, add a partial index).

Checkpoints are the other moment when buffer-pool concurrency matters operationally. A checkpoint flushes all dirty buffers to disk. If too many buffers are dirty at once, the I/O storm during the checkpoint causes a latency spike across every query. checkpoint_completion_target (default 0.9) spreads checkpoint writes over 90% of the interval to smooth this.


10. Decision Tree

ScenarioReach forWhy
Many idle clients, few active transactions at any momentPgBouncer in transaction pooling, backend pool sized to ~2× coresMultiplexes thousands of clients onto tens of backends. The default modern Postgres deployment shape.
ORM that relies on server-side prepared statementsSession pooling OR PgBouncer ≥ 1.21 with prepared-statement support OR disable server-side prepares in the driverTransaction pooling breaks server-side prepares unless the pooler emulates them.
Read-heavy analytics that must not block writesRead replica + async streaming (see replication post)Keeps the primary’s snapshot horizon young, prevents long-running readers from blocking vacuum.
Periodic batch job that must not run twicepg_advisory_xact_lock(job_id) at the start of the job’s transactionSingle-cluster mutual exclusion that auto-releases on commit/rollback.
Schema migration on a busy tableSET lock_timeout, then ALTER TABLE ... NOT VALID / VALIDATE CONSTRAINT, CREATE INDEX CONCURRENTLYAvoid acquiring ACCESS EXCLUSIVE for more than the timeout; do the heavy work under weaker locks.
Repeated 40001 serialization failures under SERIALIZABLERetry with backoff, shorten transactions, narrow the rows each transaction touchesSSI assumes you retry. See the ACID post for the retry pattern.
Repeated 40P01 deadlocksCanonical lock order + retry on conflictOrder all multi-row locks the same way across the codebase.
Hot-row contention (a single counter, an inventory total)Atomic UPDATE counter = counter + 1 at READ COMMITTED, or shard the counterAvoid SELECT FOR UPDATE on a row every request hits.
idle in transaction sessions in pg_stat_activityidle_in_transaction_session_timeout + fix the client codeOpen transactions just before the writes; close them immediately after.
LWLock waits dominate wait_event_typeLook upstream: too many backends, hot buffer, WAL bottleneckNot an application fix — usually max_connections is too high or wal_buffers is too small.

The pattern: most concurrency problems are upstream of the database. Pool well, hold transactions briefly, lock in consistent order, retry on conflict, and Postgres handles the rest.


11. Common Mistakes

  • Treating max_connections as a throughput knob. It’s a memory ceiling. Going from 200 to 2000 makes things slower, not faster, because every backend competes for the same CPU and same lock table.
  • Running an ORM with server-side prepared statements through transaction-pooled PgBouncer. Symptom: random prepared statement "..." does not exist errors under load. Fix: disable server-side prepares in the driver, or use a pooler that emulates them.
  • No lock_timeout on migrations. Default is zero (infinite). An ALTER TABLE that can’t get ACCESS EXCLUSIVE because of one slow SELECT will sit and block every other query in the meantime. Always SET lock_timeout = '5s'; before DDL.
  • Long-running transactions on the primary. Each one pins a snapshot and blocks vacuum. Bloat compounds; queries get slower table by table; nobody notices for weeks. Set idle_in_transaction_session_timeout to something like '5min'.
  • Using SERIALIZABLE without a retry loop. SSI aborts conflicting transactions with 40001. Without retry, the application sees random transient errors. (See ACID post.)
  • ALTER TABLE ADD COLUMN ... DEFAULT some_function() on a hot table. Pre-Postgres 11 this rewrote every row under ACCESS EXCLUSIVE. Modern Postgres handles constant defaults instantly; non-constant defaults still rewrite. Test in staging on representative data.
  • Acquiring multiple row locks in different orders across code paths. Deadlocks under load. Standardize a lock order (alphabetical by primary key, ascending by ID — pick one and document it).
  • Forgetting that LISTEN/NOTIFY doesn’t survive transaction pooling. Symptom: notifications work in dev (direct connection), silently disappear in prod (PgBouncer). Either route LISTEN through session pooling or replace with a real queue.
  • Using pg_advisory_lock (session-scoped) under transaction pooling. Lock released between transactions whether you wanted it to be or not. Use pg_advisory_xact_lock.
  • Treating wait_event = Lock and wait_event = LWLock as the same problem. First is an application or transaction-ordering issue; second is usually a configuration or workload-shape issue. Different fixes.
  • Running VACUUM FULL on a production table during traffic. It’s offline (ACCESS EXCLUSIVE) for as long as it takes to rewrite the table. Use pg_repack or pg_squeeze instead.
  • Holding a transaction open across an HTTP call. Common in middleware that wraps the whole request in BEGIN/COMMIT. Every external call’s latency becomes lock duration for everyone behind you on the same rows.
  • Adding an index on a frequently-updated column without thinking about HOT. Every update now touches the index, costs more, and the column is rarely read by the index anyway. Measure before adding.
  • Ignoring pg_stat_activity because “Postgres seems fine.” It’s the cheapest production diagnostic in the stack. Make it one query away from being on a dashboard.

12. Closing

The number of concurrent clients your database supports isn’t max_connections. Backends are bounded by your CPU cores; clients are bounded by how briefly each one actually needs the database. Pool well and a handful of backends serves thousands of clients. The lock manager keeps writers correct; MVCC keeps readers fast; vacuum keeps the past garbage-collected; deadlock detection keeps cycles short; pg_stat_activity tells you which of those failed.

The skill is recognizing which layer a slowdown lives in. Connection-saturation, heavyweight-lock contention, internal-LWLock contention, MVCC bloat, plain old disk I/O — these look identical to the application (“the database is slow”) and have completely different fixes. The diagnostic queries in section 7 are the difference between “we’re rolling back the deploy” and “we know exactly which transaction is holding everyone up.”

The next post in the Database Internals series steps back to the operational layer: configuration, observability, and the production checklist — the settings, dashboards, and alerts that turn the everything-above into something you can run at 3 AM.