#Database

Database Sharding: What They Don't Tell You

November 28, 2024
8 min read

Database sharding sounds simple in theory: split your data across multiple databases to handle more load. In practice, it's one of the most complex architectural decisions you can make.

The Trigger

Our PostgreSQL instance was handling 50,000 queries per second. Response times were creeping up, and we were already on the largest available instance. Sharding seemed inevitable.

The Decision

We considered three sharding strategies:

  1. Range-based sharding: Partition by date or ID ranges
  2. Hash-based sharding: Distribute evenly using a hash function
  3. Directory-based sharding: Use a lookup service

We chose hash-based sharding on user_id for even distribution:

```typescript
import crypto from 'node:crypto';

// Map a user ID onto one of numShards shards via an MD5 hash.
function getShardId(userId: string, numShards: number): number {
  const hash = crypto.createHash('md5').update(userId).digest('hex');
  // Take the first 8 hex characters (32 bits) and reduce modulo the shard count.
  return parseInt(hash.substring(0, 8), 16) % numShards;
}
```
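
The shard ID on its own is only useful once it is mapped to a concrete database. A minimal sketch of that lookup, assuming four shards and hypothetical connection strings (not our real topology):

```typescript
// Hypothetical shard topology: the hostnames and shard count are illustrative.
const NUM_SHARDS = 4;
const SHARD_DSNS = Array.from(
  { length: NUM_SHARDS },
  (_, i) => `postgres://app@shard-${i}.internal:5432/app`
);

// Every user-scoped query gets routed to exactly one shard.
const dsn = SHARD_DSNS[getShardId('user-42', NUM_SHARDS)];
```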

The Hidden Costs

What we didn't anticipate:

  • Cross-shard queries: Aggregations that used to be a single SQL statement became far more complex, fanning out to every shard and merging results in the application (see the sketch after this list)
  • Referential integrity: Foreign keys don't work across shards
  • Operational complexity: Backups, migrations, and monitoring multiplied
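
As a rough illustration of the first point, here is what a simple "count orders by status" aggregation turns into once the data is spread across shards. The shard pools and the orders table below are assumptions for the sketch, not our actual schema:

```typescript
import { Pool } from 'pg';

// Hypothetical: one connection pool per shard.
const shardPools = Array.from(
  { length: 4 },
  (_, i) => new Pool({ database: `app_shard_${i}` })
);

// What used to be "SELECT status, COUNT(*) FROM orders GROUP BY status"
// now fans out to every shard and merges counts in the application.
async function countOrdersByStatus(): Promise<Map<string, number>> {
  const partials = await Promise.all(
    shardPools.map((pool) =>
      pool.query('SELECT status, COUNT(*) AS n FROM orders GROUP BY status')
    )
  );

  const merged = new Map<string, number>();
  for (const { rows } of partials) {
    for (const row of rows) {
      // pg returns COUNT(*) as a string, so coerce before summing.
      merged.set(row.status, (merged.get(row.status) ?? 0) + Number(row.n));
    }
  }
  return merged;
}
```
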
Would I Do It Again?

Yes, but I'd start with read replicas and caching first. Sharding should be a last resort, not a first solution.
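
For comparison, the "replicas and caching first" route is mostly application-level plumbing rather than a re-architecture. A minimal sketch, assuming one primary, one read replica, and an in-process cache (all hypothetical names):

```typescript
import { Pool } from 'pg';

// Hypothetical endpoints; real setups usually have several replicas.
const primary = new Pool({ host: 'db-primary.internal', database: 'app' });
const replica = new Pool({ host: 'db-replica.internal', database: 'app' });
const cache = new Map<string, unknown>();

async function readUser(userId: string) {
  // Hot reads come from the cache, cold reads from a replica.
  if (cache.has(userId)) return cache.get(userId);
  const { rows } = await replica.query('SELECT * FROM users WHERE id = $1', [userId]);
  cache.set(userId, rows[0]);
  return rows[0];
}

async function updateUserEmail(userId: string, email: string) {
  // Writes always hit the primary; invalidate the cached copy afterwards.
  await primary.query('UPDATE users SET email = $1 WHERE id = $2', [email, userId]);
  cache.delete(userId);
}
```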
