Part 1 of Distributed Systems

Sharding Postgres: When One Database Stops Being Enough

34 min read

The hot tables are partitioned. The old data is on cheap storage. billing, analytics, and notifications each live on their own Postgres instances now. Read replicas absorb the analytical load. Autovacuum is keeping up. Every single-node tuning lever has been pulled.

And still — the central tenant data is 12 TB, the write rate exceeds what one disk’s Write-Ahead Log (WAL) can sustain, one machine’s max_connections ceiling caps how many application servers can talk to the database, and the replica is permanently a few minutes behind during peak traffic. You’ve genuinely outgrown a single Postgres cluster.

This is when sharding stops being theoretical and starts being the next thing on the roadmap. Sharding is the horizontal distribution of one dataset across multiple Postgres servers. Each shard is a complete Postgres instance, with its own WAL, its own shared_buffers (Postgres’s in-memory page cache), its own backends, its own backups. They don’t know about each other. A routing layer in the application or in a proxy decides which shard owns which row.

The honest framing — and the thing every public engineering post about sharding eventually says: you are trading one big problem (a database that can’t keep up) for a dozen smaller problems (routing, fan-out, hot shards, cross-shard joins, rebalancing, schema migrations, distributed transactions). The smaller problems are tractable. They don’t disappear because you picked the right tool.

This post is the first in the Distributed Systems series. It assumes you’ve already exhausted what one Postgres can do — partitioned the hot tables, decomposed independent domains into their own instances, added read replicas. (If you haven’t, start with the sibling post: Postgres Partitioning.) From here on, we’re talking about the operational regime that begins when one database stops being enough.


1. The Three Architectural Questions

Every sharded system answers these. The answers determine everything else.

  1. What’s the partition key? The column you hash or range-distribute on. It separates the single-shard queries (cheap, scale linearly with shards added) from the scatter-gather queries (touch every shard, don’t scale). Pick badly and the design fights you for years.
  2. How do clients find the right shard? Three approaches, often combined: a directory service (a metadata DB that maps tenants/users to shards), deterministic hashing (compute the shard from the key with no lookup), or a SQL-aware routing proxy.
  3. How do you reshard? Eventually a shard gets too hot, a region is added, or the original key was wrong. Online rebalancing is the hardest engineering problem in this space.

2. Shard Keys — The Hardest Decision

For a multi-tenant B2B SaaS, the answer is almost always tenant_id. Every tenant’s data lives on one shard. Queries already filter by tenant. No single tenant is large enough to need their data spread across machines. This is the architecture Notion, Figma, Slack (on Vitess), and most B2B systems converge on.

For a B2C product, user_id works for user-centric reads (profile, feed, messages) but breaks down for global features (timelines of accounts you follow, group chats, marketplace browses). Large B2C systems usually shard by user and maintain denormalized cross-cutting structures — fanout tables, materialized timelines — that absorb the scatter-gather cost up front.

For geo-distributed products, region works as a coarse shard key (data sovereignty, latency), with finer-grained tenant or user sharding nested inside each region.

The wrong shard keys are obvious in retrospect:

  • Timestamps for write-heavy workloads. Every write goes to the newest shard. The hot-shard problem in its purest form.
  • Auto-incrementing IDs. Same issue when range-sharded; hashing fixes the imbalance but you still scatter-gather for almost everything.
  • Low-cardinality columns (country code, plan tier). Some shards end up 10× the size of others, and you can’t rebalance.

A useful rule: look at your top 20 query predicates. If 90%+ filter by a particular column, that’s your shard key. If no single column dominates, you don’t have a sharding problem — you have a data-model problem.


3. Consistent Hashing and the Logical/Physical Split

A naive hash(key) % N design hits a wall when N changes. Suppose today N = 3, so tenant 42 resolves to hash(42) % 3 = 0 — shard 0. Tomorrow you add a fourth machine: hash(42) % 4 = 2 — shard 2. The same tenant ID now points to a different shard. Almost every tenant gets a new assignment, and every row in the system has to be physically copied to its new home. That’s called re-keying: the routing function changed, so the answer to “which shard does this row belong to?” changed for nearly everyone.

Consistent hashing fixes this by mapping both keys and nodes onto a circular hash space. Each key is owned by the next node clockwise from it. When you add a node, only the keys in the arc between the new node and its clockwise neighbour need to move — the rest stay where they were.

In a 3-node ring, adding a 4th node moves roughly 1/4 of the keys. With hash(key) % N, adding a 4th node would have moved almost all of them. That’s the whole point.

But most production systems use a simpler trick on top: decouple logical and physical shards.

  • Pick a logical shard count larger than you’ll ever need (256, 1024, 4096).
  • Distribute logical shards across a smaller number of physical instances.
  • Keep a directory tenant_id → logical_shard → physical_instance.
  • When a machine gets hot, move a few logical shards. No re-keying.

To be precise about what does and doesn’t move: the data still gets copied. When LS-3 migrates from pg-02 to pg-03, every row belonging to LS-3 is physically transferred across — there’s no magic that avoids the bytes-on-disk part. That copy uses the online-rebalancing playbook later in this post (double-write → backfill → verify → cut over), and at real scale it can take hours.

What the indirection avoids is re-keying. With naive hash(tenant_id) % N, adding one machine to a 3-node cluster changes N from 3 to 4, which changes the hash result for roughly 3/4 of all tenants, which means 3/4 of the entire dataset has to move. With the logical/physical split, the routing function (hash(tenant_id) % 256) is permanent — every tenant always resolves to the same logical shard, forever. To rebalance, you pick one logical shard, update its directory row, and copy that shard’s data. The difference is “copy 1/256 of the dataset” versus “copy 3/4 of the dataset” — for the same operational outcome of adding one machine.

This is the architecture published by Notion (480 logical shards across 32 instances), Instagram (logical shards as Postgres schemas across a few dozen machines), and most large Postgres shops.


4. Routing Layers

Three styles:

  • Application-level. The app computes the shard and opens a connection to the right Postgres. Maximum control, maximum responsibility. Each shard is vanilla Postgres — same monitoring, same backups, same psql. You write the fan-out logic, the directory cache, the rebalancing tools. Most large Postgres shops do this.
  • SQL-aware proxy (PgCat, Citus’s coordinator, vendor proxies). One endpoint the app talks to; it parses queries, computes shards, fans out reads, merges results. Strengths: app stays simple. Weakness: the proxy has a worldview, and your queries have to fit it.
  • Distributed Postgres-compatible databases (CockroachDB, YugabyteDB). A different system that speaks the Postgres wire protocol. Covered in the Ecosystem section.

5. The Hard Parts (And There Are a Lot of Them)

When teams build sharded Postgres for the first time, the first six months are mostly about the things they didn’t plan for. Here’s the catalogue. Each one is a project, not a paragraph — but you need to know all of them exist before you start.

5.1 Cross-shard JOINs

A single Postgres JOIN runs in microseconds because both sides of the join are on the same machine, in the same buffer cache. The moment you shard, a JOIN between two tables that don’t share the same shard key becomes a fan-out query: the app or proxy asks every shard for matching rows, ships them over the network, and merges the results.

Concrete example. You have:

  • users sharded by tenant_id
  • audit_logs sharded by tenant_id

A query like this stays on a single shard — Postgres handles it the same way it always did:

SELECT u.email, l.action
FROM users u
JOIN audit_logs l ON l.user_id = u.id
WHERE u.tenant_id = 42;

But add a global feature: “show every audit log entry across all tenants for users who signed up in the last 24 hours.” Now the routing layer has to:

  1. Query every shard for users who signed up in the last 24 hours.
  2. Group those user IDs by which shard they came from.
  3. Query every shard again for matching audit log entries.
  4. Merge everything into one result set.

With 32 shards that’s 64 round-trips, network latency added per call, partial-failure modes when one shard times out, and merge logic somewhere in the app. The fix is co-location — design the schema so tables you’d join always share the shard key. The cost of co-location is that occasionally you’ll want a cross-tenant join and simply won’t have it.

5.2 Cross-shard aggregations and reporting

Single-shard reads scale linearly with shard count. Cross-shard aggregations don’t.

“How many active users do we have?” is now SELECT count(*) FROM users WHERE last_active > now() - interval '30 days' on every shard, summed in the app. “Top 10 products globally by revenue this month” is a partial sort on every shard followed by a global merge-sort.

This is why every sharded company eventually builds a parallel analytics path — a data warehouse like Snowflake, BigQuery, Redshift, or ClickHouse, fed by Change Data Capture (CDC) streams from each shard. Operational queries hit Postgres; analytics queries hit the warehouse. You absorb the latency cost (warehouse data is minutes behind real-time) in exchange for queries that can scan globally.

If you don’t plan for the warehouse before sharding, every reporting feature becomes a scatter-gather of doom and the analytics team starts asking why the dashboard takes 90 seconds to load.

5.3 Foreign keys disappear

A FOREIGN KEY in Postgres is the database telling you “this user_id must reference a real row in users.id.” It’s enforced at write time — try to insert an order for a user who doesn’t exist and Postgres refuses.

Across shards, this guarantee doesn’t exist. If users and orders live on different shards, Postgres has no way to check. Two ways to handle it:

  • Co-locate the related tables. If orders and users are both sharded by tenant_id, the FK can be enforced normally within each shard’s vanilla Postgres. The constraint is local — and local is fine because all the rows you care about for that tenant are on one shard.
  • Enforce in application code. Check that the parent exists before the insert. Accept the race condition (the parent could be deleted between check and insert, leaving an orphan) or guard it with application-level locking. Most teams accept the race because the failure mode — an orphan row — is recoverable, and the alternative is much more expensive.

5.4 Globally unique IDs

bigserial and SERIAL columns are backed by a sequence — a per-database counter that increments on every insert. Sequences are local to each Postgres instance. Once you have N shards, two different shards will both happily emit id = 1 for the first inserted row, and now you have collisions everywhere.

Three common fixes:

  • UUIDs. Universally Unique Identifiers — 128-bit random values, unique by construction across any system. 16 bytes instead of 8, and the randomness defeats the locality optimizations that B-tree indexes use for inserts. v7 UUIDs (time-prefixed) are a good middle ground; standard v4 random UUIDs hurt insert performance because each new row lands on a random index page.
  • Snowflake-style IDs. A 64-bit integer composed of (timestamp_ms | shard_id | sequence_within_ms). Fits in a bigint, sorts roughly by time (good for indexes), unique by construction (different shards have different shard_id bits). Originally Twitter’s design, now copied widely.
  • Centralised ID service. A single coordinator hands out ID ranges to each shard. Simple to reason about, but it’s now a hop on every insert and a single point of failure.

The decision matters far beyond ID generation. IDs are everywhere — in URLs, in logs, in cross-system traces. Choosing badly means migrating IDs later, which is much worse than migrating data.

5.5 Distributed transactions

A single Postgres transaction is atomic. This works fine on one machine:

BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;
UPDATE accounts SET balance = balance + 10 WHERE id = 2;
COMMIT;

Either both updates apply or neither does. Across shards (say id = 1 is on shard A and id = 2 is on shard B), you have two bad options.

Two-Phase Commit (2PC). A coordinator asks every participating shard “can you commit?” Each shard prepares the transaction (writes the changes durably but doesn’t make them visible) and replies. Once all shards say yes, the coordinator broadcasts “commit.” Sounds clean. In practice, the prepare phase locks rows on every shard for the round-trip duration; if the coordinator crashes between phases, you have prepared transactions that can’t be resolved without manual intervention. Production sharded systems almost universally avoid 2PC.

Sagas. Break the cross-shard operation into a sequence of single-shard transactions, each with a compensating undo step. If a later step fails, run the inverse of the earlier steps.

Concrete example — transferring $10 between two users on different shards:

Step 1:  shard A → debit user-1 by $10      (commit on shard A)
Step 2:  shard B → credit user-2 by $10     (commit on shard B)

If Step 2 fails:
   Compensation → shard A → credit user-1 by $10

The sequence has to be idempotent (running step 2 twice must be the same as running it once), because retries are unavoidable in distributed systems. It has to handle the case where the compensation itself fails (you ledger the failure and route it to a human-review queue). And it has to be observable, so you can answer “where is transfer #1234 stuck?” three weeks later when a customer complains.

Stripe’s Ledger system is essentially a giant saga orchestrator on top of sharded storage. Done well, sagas are robust. Done badly, they’re a permanent source of data inconsistency.

5.6 Online rebalancing

Sooner or later one shard is too hot, a region is being added, or the original key choice was wrong for a workload that has since changed shape. You need to move data while the application keeps reading and writing. There’s no maintenance window — the system is live.

The pattern that has emerged across published architectures (Notion, Figma, Slack on Vitess, Stripe):

  1. Double-write. Application writes go to both the old shard and the new shard for every row in the migrating range. Reads still come from the old shard.
  2. Backfill. Stream historical data from old to new using logical replication (Postgres’s built-in publication/subscription) or a CDC tool like Debezium.
  3. Verify. Compare random samples between old and new. Run the application against the new shard in shadow mode — every read also queries the new shard, results are compared but the old shard’s response is what’s returned to the user.
  4. Cutover. Flip reads to the new shard. Keep double-writing for a safety window — usually days — in case you need to roll back.
  5. Tear down. Stop writing to the old shard. Eventually drop the data.

Each step is hours-to-days at real scale. Each has subtle failure modes:

  • Replication lag — the new shard is minutes behind because the backfill is competing with live writes.
  • Diverging writes — the double-write got applied to the new shard but not the old (network blip), or vice versa, and a row’s value differs between the two.
  • Idempotency edge cases — the same write got applied twice on one side because of a retry.

Teams that get this right build it once as a generic framework and reuse it for every subsequent rebalance. Teams that don’t end up doing the migration by hand the first three times, with a different bug each time.

5.7 Hot shards and tenant skew

Eventually one shard becomes a latency outlier. The usual causes:

  • One tenant grew 40× the average (Notion’s published numbers describe this happening repeatedly).
  • A region with disproportionate traffic concentration (us-east always lighting up).
  • A “celebrity user” pattern — one row receiving thousands of reads per second because some piece of content is going viral.

Mitigations are uniformly painful:

  • Split the giant tenant across multiple shards. Breaks co-location, requires application changes, makes their data harder to back up coherently.
  • Isolate them on a dedicated shard so their problems don’t affect other tenants. Operationally fine. But you’ve now admitted the architecture isn’t really “shard by tenant” — it’s “shard by tenant, except for the special ones.”
  • Application-level rate limiting for the hot row or hot tenant. Buys time, doesn’t fix the underlying skew.
  • Caching layer in front of the hot reads. Helps for read-heavy hot rows, doesn’t help if writes are the problem.

The lesson Notion and others have written about publicly: you can’t predict which tenant will go viral. Build the shard-migration tooling early, because you’ll need it.

5.8 Schema migrations across N shards

ALTER TABLE on one Postgres is operationally simple. On 32 shards it’s a different problem.

  • The migration has to run on every shard. If it fails halfway through, some shards have the new schema and some don’t.
  • The application code that depends on the new schema can’t roll out before the schema. So the schema change must be deployed first, and it must be backward-compatible — the application still works against shards that haven’t migrated yet.
  • The rollout takes minutes-to-hours, and during that window the application is running against shards in mixed states.

The pattern: every schema change is two PRs.

  1. PR 1 — make the schema change in a way the old code still works against. Add a new nullable column with a default; don’t drop the old column. Deploy. Run the migration on every shard.
  2. PR 2 — once every shard has migrated and you’ve waited a deploy cycle, remove the old column and the now-unused code. Deploy that.

Expensive migrations (rewriting a 500 GB table) need pg_repack or batched UPDATE loops on each shard, with monitoring to catch the one shard where it fails partway through and resume from where it stopped. The thing you might be slightly lazy about on one Postgres becomes religion on N.

5.9 Backups and point-in-time recovery

A single Postgres backup is pg_basebackup plus an archived WAL stream. Point-in-time recovery (PITR) means “restore to the state the database was in at 14:32:11 last Tuesday.” You replay WAL forward from the base backup until you reach that timestamp.

On 32 shards, you have 32 base backups taken at 32 slightly different points in time, with 32 separate WAL streams. If you need to restore the whole system to a globally consistent moment — say, before a bug corrupted data across multiple tenants — you need:

  • Each shard’s base backup, taken near the same wall-clock time.
  • Each shard’s archived WAL.
  • Tooling that orchestrates a parallel restore across all 32 shards to the same target timestamp.

Most teams accept the approximation: shards are restored to near the same point in time, and any cross-shard inconsistencies introduced by the restore are reconciled manually after. True globally-consistent PITR on a sharded Postgres is hard enough that most teams don’t attempt it. They invest instead in audit logs and replay tools that can selectively undo a bad transaction.

5.10 Connection pooling explosion

Your application talks to one Postgres? It opens N database connections. Talks to 32 shards? It potentially needs N × 32 connections, one pool per shard. With pgbouncer (a connection pooler) in front of each shard, the pgbouncer fleet is now managing 32× more connections too.

Concrete numbers. Suppose your application server keeps 50 connections per Postgres. You have 100 application servers. With one database: 5,000 connections — already close to Postgres’s practical ceiling (~10,000) and a clear case for pgbouncer. With 32 shards: potentially 160,000 connections in the worst case. You’ll exhaust file descriptors before you exhaust the database.

Solutions:

  • Lazy pools. Only open a connection to a shard when a request actually needs it. Most requests touch one shard, so most pools stay small.
  • Reverse routing. Application servers are assigned to subsets of shards, so each server only pools connections to its assigned shards.
  • Shared connection pool layer. A central pgbouncer cluster fronting all shards, sized for the global connection count rather than per-app.

You’ll discover this in production when traffic spikes against one shard exhaust its max_connections and the application starts failing every request that needs that shard, regardless of which app server is doing the asking.

5.11 Failover and disaster recovery

One Postgres going down = your app is down. 32 shards = one shard going down means 1/32 of users see errors. Better in some ways, worse in others.

  • Better: the blast radius is smaller. Most users don’t notice. You have time to respond.
  • Worse: failover has to be automated per shard because manual intervention doesn’t scale across 32 instances. Each shard needs a hot standby (a streaming replica ready to take over) and an automated promotion script. Patroni or Stolon are the two common tools for managing this.
  • Worse: “is the database up?” is no longer a single health check. It’s 32. Your dashboards have to handle “27/32 healthy, 5 degraded” — and the routing layer has to route around the failing ones gracefully.

You also need to think about correlated failures. A bad query pattern or a bug in a schema migration can take down many shards simultaneously. “One shard at a time is fine” is a comforting story that doesn’t survive contact with reality.

5.12 Observability becomes 32× harder

A slow query on one shard is invisible on the dashboards for the other 31. The questions you ask change:

  • “What’s the p99 query latency?” (the latency the slowest 1% of queries experience) → “What’s the p99 latency per shard, and which shard is slowest?”
  • “Which queries are the worst offenders?”“Which queries, on which shards, and is it the same workload everywhere?”
  • “Disk usage?” → 32 numbers, hopefully one dashboard.

You need per-shard tagging in your metrics, log aggregation that can group by shard, and dashboards that surface the outlier shard rather than averaging across all 32 (averages hide the one shard that’s on fire). The “show me the worst shard right now” view is the single most valuable dashboard a sharded system has.

5.13 Testing and dev environments

Production has 32 shards. Your laptop has one Postgres. The bugs that only appear in production are exactly the ones related to the sharding boundary — cross-shard joins, routing logic, the saga compensation step you never exercised.

Most teams settle on one of:

  • Run a small number of shards locally (2 or 4) to exercise the routing logic. Not a faithful reproduction, but catches most routing bugs.
  • Maintain a staging environment that mirrors production at smaller scale and is the only place where the sharded code path is genuinely tested before production.
  • Integration tests against a small sharded cluster in CI. Slow but catches real bugs.

Neither is great. Sharded systems have a long tail of “only-fails-in-production” bugs that you pay for through monitoring discipline and quick rollback rather than test coverage.

5.14 Multi-tenant security boundaries

When tenant_id is the shard key, the database physically separates tenants. A query that forgets to filter by tenant_id returns nothing useful, because that tenant’s data isn’t on the connected shard. This is accidentally good for security.

But the inverse: if the routing layer has a bug that sends a query to the wrong shard, you’ve just leaked one tenant’s data to another. Routing-layer bugs become security incidents in sharded systems in a way they aren’t in single-database systems. Teams that take this seriously add:

  • Pre-flight assertions. The application checks that the connection it’s about to use is the right shard for the tenant in the request context. Cheap and catches most bugs immediately.
  • Result validation. Every query result is filtered through a check that all returned rows match the expected tenant_id. A row that doesn’t match means routing failed; the request is rejected and an alert fires.
  • Periodic audits. Random sampling of query results to confirm no cross-tenant leakage in production.

5.15 The cost — in dollars and headcount

Sharding multiplies your infrastructure cost roughly linearly with shard count. 32 shards is 32 Postgres instances, each with its own storage, backups, monitoring, hot standby. The savings come from using smaller machines per shard, but rarely fully offset the multiplier — you usually end up paying 1.5–3× what one giant machine would have cost.

The bigger cost is headcount. A team running one Postgres can treat it as part-time work for a thoughtful backend engineer. A team running 32 shards needs at least one full-time database engineer thinking about capacity, rebalancing, failover, and the migration framework. The platform is now a product — with an on-call rotation, a roadmap, and an expectation that someone owns it 24/7.

This is the hidden tax of premature sharding: you’re permanently adding a person (or three) to your headcount in exchange for a problem you might not have actually had yet.


6. What Sharding Actually Does to Postgres

The two strategies (partitioning and sharding) route around fundamentally different bottlenecks. Partitioning shrinks the active working set and gives operations a smaller object to grip. Sharding adds physical resources — more disks, more cores, more RAM, more max_connections — at the cost of coordination.

WAL — the resource sharding actually adds

Postgres writes every change to the Write-Ahead Log before modifying the heap. All writes across all partitions go to the same WAL on the same physical disk, serialized through XLogInsert and ultimately one fsync per group commit. There is one WAL per Postgres cluster.

This is the throughput ceiling partitioning can’t break. A modern NVMe disk fsyncs in tens of microseconds; with synchronous commit and a busy backend you can push tens of thousands of WAL records per second per instance, but you don’t get to add more WAL by adding partitions. Sharding spreads WAL across N machines and N disks — this is the actual write-scaling mechanism.

Replication

A streaming replica replays WAL serially. When the primary writes WAL faster than the replica can apply, lag expands under load. Postgres 16+ has parallel apply for some WAL record types, but the constraint stands: one replica can’t keep up with a primary running on hardware many times its size. Sharding turns “one fast primary, one struggling replica” into “N modest primaries, each with a comfortable replica.”

Connections

max_connections is per-cluster, and each backend is a heavyweight process. The practical ceiling is a few thousand — beyond which you’re using pgbouncer in transaction-pooling mode and accepting its tradeoffs (no session state, fewer prepared statements, no LISTEN/NOTIFY). Sharding multiplies the ceiling by shard count.

Checkpoints

The checkpointer flushes dirty buffers and trims the WAL every 5 minutes or 1 GB, whichever first. On a single instance with TB of data and a hot working set, every checkpoint is an I/O spike that shows up in tail latency. Sharding spreads checkpoints across machines so they’re no longer correlated.

The shape of this: sharding solves problems that are about per-cluster physical resources. If your problem isn’t physical resources, sharding adds a lot of complexity without buying you anything. Partition first; vertical-decompose second; shard only when nothing cheaper is left.


7. The Postgres Ecosystem

The tools that show up everywhere in production sharded Postgres:

Citus — turns one coordinator and N worker nodes into a distributed Postgres. Define a distribution column; Citus rewrites queries to fan out and merge results. Co-located tables join efficiently. Owned by Microsoft now, powers Azure Cosmos DB for PostgreSQL. Best fit when the workload has a clear distribution column and most queries already filter by it.

Logical replication — Postgres’s publication/subscription system streams row changes (not WAL bytes) to subscribers. The mechanism behind every serious online shard migration, every Debezium-based CDC pipeline, and tools like pg_easy_replicate for upgrades.

Distributed Postgres-compatible databasesCockroachDB, YugabyteDB. Not Postgres, but speak its wire protocol. Strong consistency via Raft, automatic resharding, multi-region support. The catch: write latency is higher (consensus isn’t free), some PG features are missing or partial (certain extensions, advisory locks, custom types, some JSONB operators). Right answer for new systems needing multi-region transactional writes from day one; harder sell as a migration target for a mature Postgres workload.

Patroni / Stolon — automated failover managers for Postgres clusters. Essential for running 32 shards each with a hot standby — manual failover does not scale.


8. Failure Modes Specific to Sharding

Beyond the hard parts above, here are the incident-report patterns you’ll see most often in mature sharded systems.

Uneven shard distribution. You sharded by tenant_id and one tenant is 40× the average. That shard is the bottleneck; everyone else is bored. Splitting the giant tenant breaks co-location; isolating it on a dedicated shard works but admits the architecture isn’t really “shard by tenant” anymore.

Painful rebalancing. Shard 7 moving to a new machine. Double-write on, backfill at 80%, a bug in the comparison job silently passing diverging writes through. You discover it after cutover. Two sources of truth and a Saturday.

Cross-shard transactions, snuck in. A feature needed atomic updates across two tenants. You don’t have 2PC. The code stitched together unrelated commits, and now an orphan in shard B points to a missing parent in shard A. The second-worst class of bug in sharded systems. (The worst: silent divergence during a rebalance.)

Schema migrations that aren’t backward-compatible. You dropped a column and the code that read it in one PR — but the rollout is shard-by-shard, and now some shards have the new schema and some don’t. Always two-step: schema-compatible change first, code change second, after every shard has migrated.

The slow shard nobody noticed. Per-shard metrics exist but the dashboard averages across all of them. One shard is at p99 = 2 seconds; the global average is fine; nobody notices until a customer complains. The single most valuable sharded-system dashboard is the outlier view, not the average.


9. The Full Scaling Journey

What actually happens, in order, in systems that reach interesting scale:

Stage 0 — One Postgres. Tune shared_buffers (~25% of RAM), work_mem, effective_io_concurrency, autovacuum cost limits. Add the right indexes. Use EXPLAIN ANALYZE. Orders of magnitude on the table.

Stage 1 — Read replicas. Send analytical and reporting traffic to followers. Buys read scale, not writes. Replicas aren’t transactionally consistent with the primary; design around it or pay write latency for synchronous replication.

Stage 2 — Partition the hot tables. Range-partition the 4 TB events table by day or month. Drop old partitions on a cron. Same for audit logs, activity streams, anything append-mostly. Often buys 2–5 years of runway. (Covered in detail in the sibling post: Postgres Partitioning.)

Stage 3 — Vertical decomposition. Pull large independent domains into their own Postgres instances: billing, analytics, notifications, search. Multiple connection pools, but each database is still single-node. This is where most companies stop.

Stage 4 — Shard the core. When central tenant/user data still won’t fit, pick a partition key (almost always tenant_id), pick a logical shard count, build the routing layer, migrate over a quarter using double-write. Build the runbooks. Hire a database team.

Stage 5 — Operate the platform. Sharding isn’t a project, it’s a regime. Per-shard observability, capacity planning, rebalancing tooling, migration automation. The sharding system is a product.

The mistake is jumping from Stage 1 to Stage 4. Avoid premature sharding the same way you’d avoid premature microservices: the costs are paid forever, the benefits arrive later than expected, and most wins were available cheaper one layer up.


10. The Insight

Partitioning is usually about making one database manageable. Sharding is about admitting one database is no longer enough. They’re not points on the same continuum — they’re answers to different questions, taken from different toolkits, costing different things, paid forever.

The teams that scale Postgres well are not the teams who shard early. They’re the teams who partition aggressively, decompose their data domains, build serious observability around their one cluster, and only reach for sharding when they’ve genuinely exhausted what one big machine can do. By the time they do shard, the architecture has been forced into a shape that can be sharded — a clear partition key emerging from the query patterns, related data co-located by accident of the data model, transactional boundaries that already respect tenant edges.

The architecture that survives is the one you can still understand at 3 AM with a single psql window open. Postgres makes that achievable longer than most people think, if you use partitioning well, vertical decomposition deliberately, and sharding only when nothing cheaper is left.

This is the first post in the Distributed Systems series. Next up: replication strategies — synchronous vs asynchronous, physical vs logical, and the consistency tradeoffs each forces on the application above it. After that: consensus protocols (Raft and friends), distributed transactions (2PC, sagas, the outbox pattern), CDC pipelines, and distributed locks. Sharding was the easy one; the rest of this is about coordinating machines that don’t share memory and can’t fully trust each other.