BLACK SKIES ARCHITECTURE · EPISODE 2 OF 7
Massive Scale & SWIM Gossip
· 22:00 ·
#distributed-systems #swim #gossip #failure-detection
Node failure detection and cluster membership at 10,000 players: how the SWIM gossip protocol enables scalable failure detection without centralized coordination.
Overview
This episode dives into the SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) gossip protocol and how it solves the distributed membership problem for Black Skies.
Key Topics
- The membership problem: who is in the cluster right now?
- Why all-to-all heartbeating doesn’t scale (O(n²) messages per round)
- SWIM’s three-phase failure detection: probe → indirect probe → suspicion
- Dissemination via gossip: infection-style broadcast
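To put rough numbers on the scaling argument in the list above, here is a minimal sketch that counts messages per round. It assumes all-to-all heartbeating (every node heartbeats every other node once per round) and a SWIM steady state with no failures (each node sends one ping and gets one ack back per protocol period). The `MessageLoad` class and its method names are illustrative and not taken from the Black Skies codebase.

```csharp
using System;

// Illustrative only: these names are hypothetical, not from the Black Skies codebase.
public static class MessageLoad
{
    // All-to-all heartbeating: every node sends one heartbeat
    // to every other node per round, so n * (n - 1) messages.
    public static long AllToAllPerRound(long n) => n * (n - 1);

    // SWIM steady state (assumption: no failures, so no indirect probes):
    // each node sends one ping and receives one ack per protocol period.
    public static long SwimPerPeriod(long n) => 2 * n;
}
```

For n = 10,000, `AllToAllPerRound` gives 99,990,000 messages per round, while `SwimPerPeriod` gives 20,000. Indirect probes add a small constant factor on top of the SWIM figure only when a direct probe fails.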
Timestamps
- 0:30 · The membership problem
- 4:45 · SWIM protocol overview
- 10:20 · Failure detection mechanics
- 16:00 · Integration with tile processors
SWIM in Practice
Our implementation uses:
- Protocol period: 1 second (configurable)
- Suspicion multiplier: 4 (suspect for 4 protocol periods before declaring failed)
- Dissemination limit: 9 (each update gossiped to 9 random peers)
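The three settings above could be grouped into a single configuration type. The sketch below is one hypothetical way to do that: the `SwimSettings` record and its property names are assumptions, and only the default values come from the list. It also derives how long a node stays suspected, which is the protocol period multiplied by the suspicion multiplier.

```csharp
using System;

// Hypothetical settings type: property names are assumptions,
// default values are the ones listed in the episode notes.
public sealed record SwimSettings
{
    public TimeSpan ProtocolPeriod { get; init; } = TimeSpan.FromSeconds(1);
    public int SuspicionMultiplier { get; init; } = 4;
    public int DisseminationLimit { get; init; } = 9;

    // How long a node stays suspected before it is declared failed:
    // protocol period x suspicion multiplier.
    public TimeSpan SuspicionTimeout =>
        TimeSpan.FromTicks(ProtocolPeriod.Ticks * SuspicionMultiplier);
}
```

With the defaults, `new SwimSettings().SuspicionTimeout` is 4 seconds. Halving the protocol period to 500 ms would shorten it to 2 seconds, so the two values are best tuned together.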
Code Walkthrough
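// Simplified SWIM failure detector
// Helper methods (SelectRandomMember, ProbeDirectAsync, ...) and the
// NodeId/NodeState types are omitted for brevity.
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading.Tasks;

public class SwimFailureDetector
{
    private readonly ConcurrentDictionary<NodeId, NodeState> _membership = new();
    private readonly TimeSpan _protocolPeriod = TimeSpan.FromSeconds(1);

    // Runs once per protocol period: probe one random member and escalate on failure.
    public async Task RunProtocolPeriodAsync()
    {
        var probeTarget = SelectRandomMember();
        if (!await ProbeDirectAsync(probeTarget))
        {
            // Direct probe failed - ask k other members to probe the target for us
            // (the relays should not include the target or this node)
            var indirectTargets = SelectKRandomMembers(k: 3);
            var indirectResults = await Task.WhenAll(
                indirectTargets.Select(t => ProbeIndirectAsync(t, probeTarget))
            );

            // No relay got an ack either: mark the target suspected, not yet failed
            if (!indirectResults.Any(r => r))
            {
                DeclareSuspected(probeTarget);
            }
        }
    }
}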
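The walkthrough stops at `DeclareSuspected`. As a rough sketch of how the suspicion phase might then resolve, the code below assumes a suspected node either shows evidence of life (and returns to alive) or is declared failed once the suspicion timeout elapses. `SuspicionTracker`, `MemberStatus`, and the use of string node IDs are hypothetical and not the episode's implementation; the current time is passed in explicitly so the logic is easy to test.

```csharp
using System;
using System.Collections.Generic;

public enum MemberStatus { Alive, Suspected, Failed }

// Hypothetical helper, not the episode's implementation. Not thread-safe.
public sealed class SuspicionTracker
{
    private readonly TimeSpan _suspicionTimeout;
    private readonly Dictionary<string, DateTimeOffset> _suspectedSince = new();
    private readonly HashSet<string> _failed = new();

    public SuspicionTracker(TimeSpan suspicionTimeout) => _suspicionTimeout = suspicionTimeout;

    // Start the suspicion timer; suspecting the same node again does not restart it.
    public void Suspect(string nodeId, DateTimeOffset now)
    {
        if (_failed.Contains(nodeId)) return;
        _suspectedSince.TryAdd(nodeId, now);
    }

    // Evidence of life (an ack or a refutation) clears a pending suspicion.
    // Nodes already declared failed are not revived here.
    public void MarkAlive(string nodeId) => _suspectedSince.Remove(nodeId);

    // Promote suspects whose timer has expired to Failed; returns the newly failed nodes.
    public IReadOnlyList<string> Expire(DateTimeOffset now)
    {
        var expired = new List<string>();
        foreach (var (nodeId, since) in _suspectedSince)
        {
            if (now - since >= _suspicionTimeout) expired.Add(nodeId);
        }
        foreach (var nodeId in expired)
        {
            _suspectedSince.Remove(nodeId);
            _failed.Add(nodeId);
        }
        return expired;
    }

    public MemberStatus StatusOf(string nodeId) =>
        _failed.Contains(nodeId) ? MemberStatus.Failed
        : _suspectedSince.ContainsKey(nodeId) ? MemberStatus.Suspected
        : MemberStatus.Alive;
}
```

With a 4-second timeout (suspicion multiplier 4 at a 1-second protocol period), a node suspected at time t stays `Suspected` through t + 3 s and is returned by `Expire` at t + 4 s unless `MarkAlive` is called first.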
Next Episode
Episode 3 covers fan-out architecture: how we deliver events to 10,000 clients efficiently.