Scaling WebSockets to 1M Concurrent Connections
When I first inherited the WebSocket infrastructure at my previous company, we were handling around 10,000 concurrent connections. The system worked, but barely. Today, I'll walk you through the journey of scaling that same system to handle over 1 million concurrent connections.
The Initial Architecture
The original setup was simple—perhaps too simple:
// The 400">"simple" approach that got us to 10K connections
400">"text-primary">const wss = new WebSocket.Server({ port: 8080 });
wss.on(400">'connection', (ws) => {
ws.on(400">'message', (message) => {
// Broadcast to all clients
wss.clients.forEach((client) => {
400">"text-primary">if (client.readyState === WebSocket.OPEN) {
client.send(message);
}
});
});
});This approach has several fundamental problems at scale:
The First Bottleneck: Connection Limits
The first wall we hit was around 65,000 connections. A TCP connection is identified by the 4-tuple of source IP, source port, destination IP, and destination port, so a single source address (such as a load balancer in front of your servers) can open at most ~65,535 connections to one server port. On top of that, Linux defaults to a much narrower ephemeral port range and a low per-process file descriptor limit. The fix was relatively straightforward:
```bash
# Increase the ephemeral port range
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range

# Increase file descriptor limits
ulimit -n 1000000
```

But this only bought us time.
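One caveat worth noting: the `/proc` write and `ulimit` call above only last until the next reboot or shell session. A typical way to make them persistent is something like the following (exact file paths and the service user are assumptions that vary by distribution):

```shell
# /etc/sysctl.d/99-websocket.conf — applied at boot (or via `sysctl --system`)
net.ipv4.ip_local_port_range = 1024 65535
fs.file-max = 2000000

# /etc/security/limits.conf — per-process descriptor limits
# (assumes the WebSocket server runs as user "node")
node soft nofile 1000000
node hard nofile 1000000
```

If the server runs under systemd, `LimitNOFILE=` in the unit file is the equivalent of the `limits.conf` entries.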
Horizontal Scaling with Redis Pub/Sub
The real scaling came from horizontal scaling. We introduced Redis as a message bus:
400">"text-primary">import Redis 400">"text-primary">from 400">'ioredis';
400">"text-primary">const publisher = new Redis();
400">"text-primary">const subscriber = new Redis();
// Each server subscribes to channels
subscriber.subscribe(400">'broadcast');
subscriber.on(400">'message', (channel, message) => {
localClients.forEach(client => client.send(message));
});
// Publishing goes through Redis
400">"text-primary">function broadcast(message: string) {
publisher.publish(400">'broadcast', message);
}This allowed us to scale horizontally, but introduced new challenges around connection affinity and message ordering.
Lessons Learned
The best time to plan for scale is before you need it. The second best time is now.
What's Next
In the next post, I'll cover how we implemented connection draining during deployments—a surprisingly complex problem that took us three attempts to get right.