BLACK SKIES ARCHITECTURE · EPISODE 2 OF 7

Massive Scale & SWIM Gossip

· 22:00 ·
#distributed-systems #swim #gossip #failure-detection

Node failure detection and cluster membership at 10,000 players: how the SWIM gossip protocol enables scalable failure detection without centralized coordination.

Overview

This episode dives into SWIM (Scalable Weakly-consistent Infection-style Process Group Membership), a gossip-based membership protocol, and how it solves the distributed membership problem for Black Skies.

Key Topics

  • The membership problem: who is in the cluster right now?
  • Why all-to-all heartbeat systems don’t scale (O(n²) messages per heartbeat interval)
  • SWIM’s three-phase failure detection: probe → indirect probe → suspicion
  • Dissemination via gossip: infection-style broadcast
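
To put the scaling argument in numbers, here is a rough back-of-the-envelope sketch. The `MembershipTraffic` class and its method names are hypothetical and not part of the Black Skies codebase. It assumes all-to-all heartbeating (every node sends one heartbeat to every other node per interval) and counts SWIM traffic as one ping plus one ack per node per period, with an upper bound of four extra messages per indirect helper when the direct probe fails.

```csharp
using System;

// Hypothetical helper for rough per-period message counts (illustration only).
public static class MembershipTraffic
{
    // All-to-all heartbeating: each node sends one heartbeat to every other node.
    public static long AllToAllHeartbeats(long nodes) => nodes * (nodes - 1);

    // SWIM: each node sends one direct ping per period and expects one ack.
    // If worstCase is true, we also count an upper bound for indirect probing:
    // ping-req + forwarded ping + ack + relayed ack = 4 messages per helper.
    public static long SwimMessages(long nodes, int indirectProbes, bool worstCase)
    {
        long perNode = 2; // ping + ack
        if (worstCase)
        {
            perNode += 4L * indirectProbes;
        }
        return nodes * perNode;
    }
}
```

At 10,000 nodes this gives 99,990,000 heartbeats per interval for the all-to-all scheme, versus 20,000 messages per period for SWIM in the common case and at most 140,000 if every direct probe fails with 3 indirect helpers. The per-node load stays constant as the cluster grows, which is the property the episode relies on.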

SWIM in Practice

Our implementation uses:

  • Protocol period: 1 second (configurable)
  • Suspicion multiplier: 4 (a node stays suspected for 4 protocol periods, 4 seconds at the default period, before it is declared failed)
  • Dissemination limit: 9 (each update gossiped to 9 random peers)
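
One common way to enforce a dissemination limit is a per-update send budget: each membership update is handed out a fixed number of times and then dropped. The sketch below assumes the limit of 9 works this way. `GossipBuffer`, its members, and the string-encoded updates are hypothetical names for illustration, not the actual Black Skies types.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical gossip buffer: each update is handed out at most
// `disseminationLimit` times, then forgotten.
public sealed class GossipBuffer
{
    private readonly int _disseminationLimit;

    // update -> number of times it has been handed out so far
    private readonly Dictionary<string, int> _sendCounts = new Dictionary<string, int>();

    public GossipBuffer(int disseminationLimit = 9)
    {
        if (disseminationLimit <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(disseminationLimit));
        }
        _disseminationLimit = disseminationLimit;
    }

    public int PendingCount => _sendCounts.Count;

    // Queue an update with a fresh send budget.
    public void Enqueue(string update) => _sendCounts[update] = 0;

    // Pick up to maxUpdates entries for the next outgoing message,
    // preferring the least-gossiped ones so fresh news spreads first.
    public IReadOnlyList<string> TakeForNextMessage(int maxUpdates)
    {
        var batch = _sendCounts
            .OrderBy(kv => kv.Value)
            .Take(maxUpdates)
            .Select(kv => kv.Key)
            .ToList();

        foreach (var update in batch)
        {
            // Spend one unit of budget; drop the update once the limit is reached.
            if (++_sendCounts[update] >= _disseminationLimit)
            {
                _sendCounts.Remove(update);
            }
        }
        return batch;
    }
}
```

With the default limit, an update enqueued once is returned by the next 9 calls to `TakeForNextMessage` and then disappears from the buffer, so stale news stops consuming bandwidth without any explicit acknowledgement.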

Code Walkthrough

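// Simplified SWIM failure detector (helper methods omitted for brevity)
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public class SwimFailureDetector
{
    private readonly ConcurrentDictionary<NodeId, NodeState> _membership = new();

    // The caller is expected to invoke RunProtocolPeriodAsync once per protocol period.
    private readonly TimeSpan _protocolPeriod = TimeSpan.FromSeconds(1);

    // Runs one protocol period: probe a single random member and escalate on failure.
    public async Task RunProtocolPeriodAsync()
    {
        var probeTarget = SelectRandomMember();
        if (!await ProbeDirectAsync(probeTarget))
        {
            // Direct probe failed: ask k other members to probe the target on our behalf.
            // Assumes SelectKRandomMembers excludes both this node and probeTarget.
            var indirectTargets = SelectKRandomMembers(k: 3);
            var indirectResults = await Task.WhenAll(
                indirectTargets.Select(t => ProbeIndirectAsync(t, probeTarget))
            );

            // No helper reached the target either: mark it suspected, not failed.
            if (!indirectResults.Any(r => r))
            {
                DeclareSuspected(probeTarget);
            }
        }
    }
}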
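
The walkthrough stops at `DeclareSuspected`, so here is a minimal sketch of what could happen next: a suspected node is declared failed only after the suspicion timeout (protocol period × suspicion multiplier) elapses without evidence that it is alive. `SuspicionTracker`, `MemberStatus`, and the string node IDs are hypothetical stand-ins. Time is passed in explicitly so the logic can be exercised without real timers, and SWIM's incarnation numbers for refuting suspicion are left out to keep the example short.

```csharp
using System;
using System.Collections.Generic;

public enum MemberStatus { Alive, Suspected, Failed }

// Hypothetical suspicion bookkeeping (illustration only).
public sealed class SuspicionTracker
{
    private readonly Dictionary<string, DateTimeOffset> _suspectedSince =
        new Dictionary<string, DateTimeOffset>();
    private readonly HashSet<string> _failed = new HashSet<string>();

    public SuspicionTracker(TimeSpan protocolPeriod, int suspicionMultiplier)
    {
        SuspicionTimeout = TimeSpan.FromTicks(protocolPeriod.Ticks * suspicionMultiplier);
    }

    public TimeSpan SuspicionTimeout { get; }

    public void DeclareSuspected(string nodeId, DateTimeOffset now)
    {
        if (_failed.Contains(nodeId)) return;

        // Keep the original timestamp if the node is already suspected.
        if (!_suspectedSince.ContainsKey(nodeId))
        {
            _suspectedSince[nodeId] = now;
        }
    }

    // Called when evidence arrives that a suspected node is alive (for example an ack).
    public void MarkAlive(string nodeId) => _suspectedSince.Remove(nodeId);

    // Promotes suspects whose timeout has elapsed and returns the newly failed nodes.
    public IReadOnlyList<string> ExpireSuspects(DateTimeOffset now)
    {
        var expired = new List<string>();
        foreach (var entry in _suspectedSince)
        {
            if (now - entry.Value >= SuspicionTimeout) expired.Add(entry.Key);
        }
        foreach (var nodeId in expired)
        {
            _suspectedSince.Remove(nodeId);
            _failed.Add(nodeId);
        }
        return expired;
    }

    // Nodes that are neither suspected nor failed are reported as alive.
    public MemberStatus StatusOf(string nodeId) =>
        _failed.Contains(nodeId) ? MemberStatus.Failed
        : _suspectedSince.ContainsKey(nodeId) ? MemberStatus.Suspected
        : MemberStatus.Alive;
}
```

With a 1 second protocol period and a multiplier of 4, a node suspected at time t is still only suspected at t + 3 s and becomes failed at t + 4 s, unless `MarkAlive` is called first. The grace period trades slower detection for fewer false positives when a node is merely slow.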

Next Episode

Episode 3 covers fan-out architecture: how we deliver events to 10,000 clients efficiently.