Massive Scale & SWIM Gossip

Node failure detection and cluster membership at 10,000 players: how the SWIM gossip protocol enables scalable failure detection without centralized coordination.

Overview

This episode dives into the SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) gossip protocol and how it solves the distributed membership problem for Black Skies.

Key Topics

The membership problem: who is in the cluster right now?
Why all-to-all heartbeating doesn’t scale (O(n²) messages per heartbeat interval)
SWIM’s three-phase failure detection: probe → indirect probe → suspicion
Dissemination via gossip: infection-style broadcast
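
To put rough numbers on the O(n²) point above, here is a back-of-envelope sketch. It assumes all-to-all heartbeating (every node sends one heartbeat to every other node per interval) and counts only SWIM's steady-state traffic (one ping and one ack per node per protocol period, with no failed probes). The MessageLoad class and its method names are made up for illustration and are not part of the episode's codebase.

```csharp
using System;

// Illustrative message counts per period; not taken from the episode's code.
public static class MessageLoad
{
    // All-to-all heartbeating: each of n nodes sends to the other n - 1 nodes.
    public static long AllToAllHeartbeats(long n) => n * (n - 1);

    // SWIM with no failures: each node sends one ping and gets one ack back.
    public static long SwimSteadyState(long n) => 2 * n;
}
```

For n = 10,000, AllToAllHeartbeats returns 99,990,000 messages per interval, while SwimSteadyState returns 20,000. Indirect probes add a few extra messages only for the nodes whose direct probe failed.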

Timestamps

0:30 · The membership problem
4:45 · SWIM protocol overview
10:20 · Failure detection mechanics
16:00 · Integration with tile processors

SWIM in Practice

Our implementation uses:

Protocol period: 1 second (configurable)
Suspicion multiplier: 4 (suspect for 4 protocol periods before declaring failed)
Dissemination limit: 9 (each update gossiped to 9 random peers)
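
The three settings above could be grouped into a small options type, sketched below. SwimOptions and its property names are hypothetical, chosen here for illustration; only the default values come from the list above. The derived SuspicionTimeout shows how the multiplier and the protocol period combine.

```csharp
using System;

// Hypothetical options type; names are illustrative, defaults match the list above.
public sealed record SwimOptions
{
    public TimeSpan ProtocolPeriod { get; init; } = TimeSpan.FromSeconds(1);
    public int SuspicionMultiplier { get; init; } = 4;
    public int DisseminationLimit { get; init; } = 9;

    // How long a node stays suspected before it is declared failed:
    // the suspicion multiplier times the protocol period.
    public TimeSpan SuspicionTimeout =>
        TimeSpan.FromTicks(ProtocolPeriod.Ticks * SuspicionMultiplier);
}
```

With the defaults, SuspicionTimeout is 4 seconds. Shortening the protocol period to 500 ms while keeping the multiplier at 4 would bring it down to 2 seconds, trading faster detection for more false suspicions on a slow network.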

Code Walkthrough

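// Simplified SWIM failure detector. NodeId, NodeState, and the helper methods
// (SelectRandomMember, ProbeDirectAsync, and so on) are omitted for brevity.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public class SwimFailureDetector
{
    private readonly ConcurrentDictionary<NodeId, NodeState> _membership = new();
    private readonly TimeSpan _protocolPeriod = TimeSpan.FromSeconds(1);

    // One protocol period: probe a single random member, escalating only on failure.
    public async Task RunProtocolPeriodAsync()
    {
        var probeTarget = SelectRandomMember();
        if (!await ProbeDirectAsync(probeTarget))
        {
            // Direct probe failed: ask k other members to probe the target for us.
            // The relays should not include probeTarget itself.
            var indirectTargets = SelectKRandomMembers(k: 3);
            var indirectResults = await Task.WhenAll(
                indirectTargets.Select(t => ProbeIndirectAsync(t, probeTarget))
            );

            // No ack via any path: mark the node as suspected, not yet failed.
            if (!indirectResults.Any(r => r))
            {
                DeclareSuspected(probeTarget);
            }
        }
    }
}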
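
The walkthrough stops at DeclareSuspected. What might happen next is sketched below: a suspected node is declared failed only if it stays suspected for the whole suspicion timeout. This is a minimal sketch under stated assumptions, not the episode's implementation. SuspicionTracker, MemberStatus, and the string node ids are hypothetical, and incarnation numbers (which SWIM uses to order refutations) and thread safety are left out.

```csharp
using System;
using System.Collections.Generic;

public enum MemberStatus { Alive, Suspected, Failed }

// Hypothetical suspicion tracker; not thread-safe, node ids simplified to strings.
public sealed class SuspicionTracker
{
    private readonly Dictionary<string, DateTimeOffset> _suspectedSince = new();
    private readonly TimeSpan _timeout;

    public SuspicionTracker(TimeSpan timeout) => _timeout = timeout;

    // Record the first time a node became suspected; repeat calls keep the original time.
    public void Suspect(string nodeId, DateTimeOffset now) =>
        _suspectedSince.TryAdd(nodeId, now);

    // Clear the suspicion when an ack or "alive" update for the node arrives.
    public void Refute(string nodeId) => _suspectedSince.Remove(nodeId);

    // A node is failed once it has been suspected for at least the timeout.
    public MemberStatus StatusOf(string nodeId, DateTimeOffset now)
    {
        if (!_suspectedSince.TryGetValue(nodeId, out var since))
            return MemberStatus.Alive;
        return now - since >= _timeout ? MemberStatus.Failed : MemberStatus.Suspected;
    }
}
```

With a 4-second timeout, a node suspected at time t reports Suspected at t + 1 s and Failed at t + 4 s, unless Refute is called in between. Passing the clock in as a parameter keeps the sketch easy to test without waiting on real time.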

Next Episode

Episode 3 covers fan-out architecture: how we deliver events to 10,000 clients efficiently.