#Database

Database Sharding: What They Don't Tell You

November 28, 2024
8 min read

Database sharding sounds simple in theory: split your data across multiple databases to handle more load. In practice, it's one of the most complex architectural decisions you can make.

The Trigger

Our PostgreSQL instance was handling 50,000 queries per second. Response times were creeping up, and we were already on the largest available instance. Sharding seemed inevitable.

The Decision

We considered three sharding strategies:

  1. Range-based sharding: Partition by date or ID ranges
  2. Hash-based sharding: Distribute evenly using a hash function
  3. Directory-based sharding: Use a lookup service

We chose hash-based sharding on user_id for even distribution:

```typescript
import crypto from 'node:crypto';

// Map a user ID onto one of numShards shards via an MD5 hash.
function getShardId(userId: string, numShards: number): number {
  const hash = crypto.createHash('md5').update(userId).digest('hex');
  // Take the first 8 hex characters (32 bits) and reduce modulo the shard count.
  return parseInt(hash.substring(0, 8), 16) % numShards;
}
```
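
The shard ID on its own is only useful once it is mapped to a concrete database. A minimal sketch of that lookup, assuming four shards and hypothetical connection strings (not our real topology):

```typescript
// Hypothetical shard topology: the hostnames and shard count are illustrative.
const NUM_SHARDS = 4;
const SHARD_DSNS = Array.from(
  { length: NUM_SHARDS },
  (_, i) => `postgres://app@shard-${i}.internal:5432/app`
);

// Every user-scoped query gets routed to exactly one shard.
const dsn = SHARD_DSNS[getShardId('user-42', NUM_SHARDS)];
```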

The Hidden Costs

What we didn't anticipate:

  • Cross-shard queries: Aggregations that used to be a single SQL statement became far more complex, fanning out to every shard and merging results in the application (see the sketch after this list)
  • Referential integrity: Foreign keys don't work across shards
  • Operational complexity: Backups, migrations, and monitoring multiplied
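
As a rough illustration of the first point, here is what a simple "count orders by status" aggregation turns into once the data is spread across shards. The shard pools and the orders table below are assumptions for the sketch, not our actual schema:

```typescript
import { Pool } from 'pg';

// Hypothetical: one connection pool per shard.
const shardPools = Array.from(
  { length: 4 },
  (_, i) => new Pool({ database: `app_shard_${i}` })
);

// What used to be "SELECT status, COUNT(*) FROM orders GROUP BY status"
// now fans out to every shard and merges counts in the application.
async function countOrdersByStatus(): Promise<Map<string, number>> {
  const partials = await Promise.all(
    shardPools.map((pool) =>
      pool.query('SELECT status, COUNT(*) AS n FROM orders GROUP BY status')
    )
  );

  const merged = new Map<string, number>();
  for (const { rows } of partials) {
    for (const row of rows) {
      // pg returns COUNT(*) as a string, so coerce before summing.
      merged.set(row.status, (merged.get(row.status) ?? 0) + Number(row.n));
    }
  }
  return merged;
}
```
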
Would I Do It Again?

Yes, but I'd start with read replicas and caching first. Sharding should be a last resort, not a first solution.
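
For comparison, the "replicas and caching first" route is mostly application-level plumbing rather than a re-architecture. A minimal sketch, assuming one primary, one read replica, and an in-process cache (all hypothetical names):

```typescript
import { Pool } from 'pg';

// Hypothetical endpoints; real setups usually have several replicas.
const primary = new Pool({ host: 'db-primary.internal', database: 'app' });
const replica = new Pool({ host: 'db-replica.internal', database: 'app' });
const cache = new Map<string, unknown>();

async function readUser(userId: string) {
  // Hot reads come from the cache, cold reads from a replica.
  if (cache.has(userId)) return cache.get(userId);
  const { rows } = await replica.query('SELECT * FROM users WHERE id = $1', [userId]);
  cache.set(userId, rows[0]);
  return rows[0];
}

async function updateUserEmail(userId: string, email: string) {
  // Writes always hit the primary; invalidate the cached copy afterwards.
  await primary.query('UPDATE users SET email = $1 WHERE id = $2', [email, userId]);
  cache.delete(userId);
}
```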
