#SystemDesign

Scaling WebSockets to 1M Concurrent Connections

December 15, 2024
12 min read

When I first inherited the WebSocket infrastructure at my previous company, we were handling around 10,000 concurrent connections. The system worked, but barely. Today, I'll walk you through the journey of scaling that same system to handle over 1 million concurrent connections.

The Initial Architecture

The original setup was simple—perhaps too simple:

```typescript
// The "simple" approach that got us to 10K connections
import WebSocket from 'ws';

const wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', (ws) => {
  ws.on('message', (message) => {
    // Broadcast to all clients
    wss.clients.forEach((client) => {
      if (client.readyState === WebSocket.OPEN) {
        client.send(message);
      }
    });
  });
});
```

This approach has several fundamental problems at scale:

- Single process limitation: Node.js runs a single-threaded event loop, so one process can only use one CPU core (see the sketch after this list)
- Memory pressure: every open connection holds socket buffers and per-connection state in that one process's heap
- Broadcast inefficiency: every incoming message fans out to all n connected clients, so distribution is O(n) per message
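
The single-process ceiling is the most immediate of these. The textbook first step is Node's cluster module, sketched below as a minimal illustration (not what we ultimately shipped; the `ws` package is assumed). It multiplies CPU capacity, but it also exposes the next problem: each worker only sees its own clients.

```typescript
// Minimal sketch: fork one WebSocket server per CPU core.
// Workers share port 8080 (the primary hands connections off),
// but each worker's broadcast reaches only its own clients.
import cluster from 'node:cluster';
import os from 'node:os';
import WebSocket from 'ws';

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) cluster.fork();
} else {
  const wss = new WebSocket.Server({ port: 8080 });
  wss.on('connection', (ws) => {
    ws.on('message', (message) => {
      // Only this worker's clients receive the broadcast
      wss.clients.forEach((client) => {
        if (client.readyState === WebSocket.OPEN) client.send(message);
      });
    });
  });
}
```

That cross-worker blindness is exactly what the message bus further down solves.
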
The First Bottleneck: Connection Limits

The first wall we hit was around 65,000 connections. TCP port numbers are 16 bits, so a single source IP (a load balancer in front of the servers, for example) can open at most ~65K connections to one backend address and port, and Linux's default ephemeral port range is narrower still. The fix was relatively straightforward:

```bash
# Widen the ephemeral port range (as root)
echo "1024 65535" > /proc/sys/net/ipv4/ip_local_port_range

# Raise the file descriptor limit for the current shell
ulimit -n 1000000
```
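
Note that both settings are themselves ephemeral: the sysctl resets on reboot, and `ulimit` only affects the current shell. A minimal sketch of making them persistent (the config file name and `appuser` are placeholders, not from our setup):

```bash
# Persist the port range via sysctl config
echo "net.ipv4.ip_local_port_range = 1024 65535" > /etc/sysctl.d/99-websocket.conf
sysctl --system

# Persist file descriptor limits for the service user
# (or set LimitNOFILE=1000000 in the systemd unit)
echo "appuser soft nofile 1000000" >> /etc/security/limits.conf
echo "appuser hard nofile 1000000" >> /etc/security/limits.conf
```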

But this only bought us time.

Horizontal Scaling with Redis Pub/Sub

The real gains came from scaling horizontally. We introduced Redis as a message bus:

```typescript
import WebSocket from 'ws';
import Redis from 'ioredis';

// A Redis connection in subscriber mode can't issue other commands,
// so publishing and subscribing get separate clients
const publisher = new Redis();
const subscriber = new Redis();

// Track the clients connected to *this* server instance
const localClients = new Set<WebSocket>();

const wss = new WebSocket.Server({ port: 8080 });
wss.on('connection', (ws) => {
  localClients.add(ws);
  ws.on('close', () => localClients.delete(ws));
});

// Each server subscribes to channels
subscriber.subscribe('broadcast');

subscriber.on('message', (channel, message) => {
  // Fan out only to this instance's clients
  localClients.forEach((client) => {
    if (client.readyState === WebSocket.OPEN) client.send(message);
  });
});

// Publishing goes through Redis, so every instance sees the message
function broadcast(message: string) {
  publisher.publish('broadcast', message);
}
```

This allowed us to scale horizontally, but introduced new challenges around connection affinity and message ordering.
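
Ordering is the subtler of the two: two instances can publish interleaved messages, and subscribers may apply them in different orders. One common mitigation, sketched below, is to stamp each message with a global sequence number so clients can at least detect gaps; the key name and helper functions are illustrative, not our production code.

```typescript
import Redis from 'ioredis';

const publisher = new Redis();

// Server side: stamp every broadcast with a global sequence number.
// INCR is atomic, so all instances draw from the same counter.
async function broadcastOrdered(payload: string) {
  const seq = await publisher.incr('broadcast:seq'); // illustrative key name
  await publisher.publish('broadcast', JSON.stringify({ seq, payload }));
}

// Client side: a gap in seq means something was dropped or reordered,
// so re-sync authoritative state instead of trusting the stream.
let lastSeq = 0;
function onBroadcast(raw: string) {
  const { seq, payload } = JSON.parse(raw) as { seq: number; payload: string };
  if (lastSeq !== 0 && seq !== lastSeq + 1) {
    resync();
  }
  lastSeq = seq;
  console.log(payload);
}

function resync() {
  // Re-fetch authoritative state out of band (application-specific)
}
```
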

Lessons Learned

1. Start with observability: we should have instrumented everything from day one (a minimal sketch follows this list)
2. Test at scale early: synthetic load testing saved us from many production incidents
3. Plan for failure: every component will fail; design for graceful degradation
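
On the first lesson, even a single gauge would have helped. A minimal sketch using prom-client (an assumption on my part; any metrics library works):

```typescript
import http from 'node:http';
import WebSocket from 'ws';
import client from 'prom-client';

const wss = new WebSocket.Server({ port: 8080 });

// Track live WebSocket connections as a Prometheus gauge
const activeConnections = new client.Gauge({
  name: 'ws_active_connections',
  help: 'Number of currently open WebSocket connections',
});

wss.on('connection', (ws) => {
  activeConnections.inc();
  ws.on('close', () => activeConnections.dec());
});

// Expose /metrics on a side port for Prometheus to scrape
http
  .createServer(async (_req, res) => {
    res.setHeader('Content-Type', client.register.contentType);
    res.end(await client.register.metrics());
  })
  .listen(9100);
```
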
The best time to plan for scale is before you need it. The second best time is now.

What's Next

In the next post, I'll cover how we implemented connection draining during deployments: a surprisingly complex problem that took us three attempts to get right.
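
As a preview, the rough shape is below; why the naive version fails, and what the three attempts looked like, is the next post's story.

```typescript
import WebSocket from 'ws';

const wss = new WebSocket.Server({ port: 8080 });

process.on('SIGTERM', () => {
  // Stop accepting new connections
  wss.close();
  // 1001 = "going away"; well-behaved clients reconnect to another instance
  wss.clients.forEach((ws) => ws.close(1001, 'server draining'));
  // Give close handshakes a moment to finish before exiting
  setTimeout(() => process.exit(0), 5000);
});
```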
