BLACK SKIES ARCHITECTURE · EPISODE 7 OF 7

Optimizations II: Reconnects, Edge Caching & Polish

· 26:00 ·
#distributed-systems #performance #caching #reliability

Handling thundering herds, passive mode for alt-tabbed clients, and edge caching strategies. The final 20% of performance gains.

Overview

The final episode covers the last mile of optimization: handling failure scenarios gracefully, supporting passive clients, and edge deployment strategies.

Key Topics

  • Reconnect storms: server-guided jittered backoff
  • Passive mode: 90% fan-out reduction for alt-tabbed clients
  • Edge caching: Slack Flannel pattern for entity metadata
  • Multi-region considerations: latency vs consistency trade-offs

Reconnect Storm Handling

When 5,000 clients disconnect simultaneously (cell tower outage, AWS region blip):

Bad: Fixed 1-second retry → synchronized reconnect wave → thundering herd

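Good: Jittered backoff. The snippet below adds random jitter to a fixed 1-second base; a full exponential backoff would also grow the delay on each failed attempt:

var retryDelay = TimeSpan.FromSeconds(1) +
                 TimeSpan.FromMilliseconds(Random.Shared.Next(4000));
// Retries land uniformly between 1 s and 5 s (a 4-second window)
// instead of all at once: 10x spike reduction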
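
The "server-guided" part listed under Key Topics could look roughly like the sketch below. It is an illustration under assumptions, not the episode's implementation: `NextRetryDelay`, the `serverWindow` hint, the doubling rule, and the 60-second cap are all hypothetical. The 1-second floor and the 4-second initial window mirror the snippet above.

```csharp
using System;

// Demo: print a few delays. Values differ from run to run.
var rng = new Random();
for (var attempt = 0; attempt < 4; attempt++)
    Console.WriteLine($"attempt {attempt}: {NextRetryDelay(attempt, null, rng).TotalSeconds:F2} s");

// The server asks clients to spread over 20 s, overriding the client-side window.
Console.WriteLine($"hinted: {NextRetryDelay(0, TimeSpan.FromSeconds(20), rng).TotalSeconds:F2} s");

// Hypothetical helper: the name, parameters, and constants are assumptions.
static TimeSpan NextRetryDelay(int attempt, TimeSpan? serverWindow, Random rng)
{
    var floor = TimeSpan.FromSeconds(1);          // same 1 s floor as the snippet above
    var initialWindow = TimeSpan.FromSeconds(4);  // same 4 s jitter window as the snippet above
    var maxWindow = TimeSpan.FromSeconds(60);     // assumed cap on the jitter window

    // Exponential part: the window doubles per failed attempt (4 s, 8 s, 16 s, ...).
    // The exponent is clamped so the shift cannot overflow.
    var exponent = Math.Clamp(attempt, 0, 16);
    var grownMs = initialWindow.TotalMilliseconds * (1L << exponent);
    var windowMs = Math.Min(grownMs, maxWindow.TotalMilliseconds);

    // Server-guided part: a window sent by the server (for example, sized to the
    // number of dropped clients) replaces the client-side guess.
    if (serverWindow is { } hinted)
        windowMs = Math.Max(0, hinted.TotalMilliseconds);

    // Jitter: pick a point uniformly inside the window, on top of the floor.
    return floor + TimeSpan.FromMilliseconds(rng.NextDouble() * windowMs);
}
```

Letting the server set the window is a design choice worth testing: the server knows how many clients dropped, so it can widen the spread for a 5,000-client outage and keep it narrow for a single disconnect.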

Passive Mode

Most players aren’t actively watching:

  • 90% of connections are alt-tabbed or backgrounded
  • Passive clients: 0.5 Hz updates, delta-only
  • Active clients: 2 Hz updates, full events

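Result: a 90% reduction in fan-out for each passive connection, which is the majority of connections. The rate drop alone (2 Hz to 0.5 Hz) accounts for 75%; the remainder presumably comes from delta-only payloads.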
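
As a back-of-envelope check of the aggregate effect, the sketch below counts messages per second only. It uses the figures from this section and the 10,000-connection target from the series. Payload size is not modeled, so it leaves out whatever delta-only encoding saves. Variable names are illustrative.

```csharp
using System;

// Figures from the episode: 10,000 connections, 90% passive,
// 0.5 Hz for passive clients, 2 Hz for active clients.
const int connections = 10_000;
const double passiveShare = 0.90;
const double passiveHz = 0.5;
const double activeHz = 2.0;

double passiveCount = connections * passiveShare;   // 9,000 connections
double activeCount = connections - passiveCount;    // 1,000 connections

// Baseline: every connection treated as active.
double allActiveRate = connections * activeHz;      // 20,000 msg/s

// With passive mode: 4,500 + 2,000 = 6,500 msg/s.
double mixedRate = passiveCount * passiveHz + activeCount * activeHz;

double perPassiveSaving = 1 - passiveHz / activeHz;   // 0.75 per passive connection
double overallSaving = 1 - mixedRate / allActiveRate; // about 0.675 across all connections

Console.WriteLine($"all active:  {allActiveRate} msg/s");
Console.WriteLine($"passive mode: {mixedRate} msg/s");
Console.WriteLine($"per passive connection: {perPassiveSaving:P1}, overall: {overallSaving:P1}");
```

In message count that is 6,500 msg/s against 20,000 msg/s, a 67.5% aggregate drop. Byte-level savings would be larger if passive deltas are smaller than full events, which is something the load tests would need to confirm.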

Edge Caching (Flannel Pattern)

Entity metadata (names, factions, ship types) changes rarely:

Client ← Relay Pod ← Origin (DynamoDB)

         Edge Cache (CloudFront)
              TTL: 60s + request coalescing

  • Cache hit: <5ms
  • Cache miss: 50-100ms, with concurrent misses for the same key coalesced into a single upstream request
  • Staleness: up to 60s (the TTL), tolerable for metadata but not for combat state
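
To make "TTL + request coalescing" concrete, here is a minimal in-process sketch, assuming .NET 6 or later. It is a stand-in for the idea only, not CloudFront's or Flannel's actual mechanism. `CoalescingTtlCache`, `GetAsync`, and the fake origin loader are hypothetical names. The 60-second TTL comes from the diagram above, and the 50 ms delay imitates the low end of the stated miss latency.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var originCalls = 0;
var cache = new CoalescingTtlCache<string>(
    async key =>
    {
        Interlocked.Increment(ref originCalls);
        await Task.Delay(50);                 // stand-in for the origin round trip
        return $"metadata for {key}";
    },
    TimeSpan.FromSeconds(60));                // TTL from the diagram above

// 100 concurrent requests for the same entity share a single origin call.
var results = await Task.WhenAll(Enumerable.Range(0, 100).Select(_ => cache.GetAsync("ship:42")));
Console.WriteLine($"{results.Length} responses, {originCalls} origin call(s)");

// Hypothetical cache type: caches the in-flight Task so late arrivals await the same load.
public sealed class CoalescingTtlCache<TValue>
{
    private sealed record Entry(Lazy<Task<TValue>> Load, DateTimeOffset ExpiresAt);

    private readonly ConcurrentDictionary<string, Entry> _entries = new();
    private readonly Func<string, Task<TValue>> _loadFromOrigin;
    private readonly TimeSpan _ttl;

    public CoalescingTtlCache(Func<string, Task<TValue>> loadFromOrigin, TimeSpan ttl)
    {
        _loadFromOrigin = loadFromOrigin;
        _ttl = ttl;
    }

    public async Task<TValue> GetAsync(string key)
    {
        var now = DateTimeOffset.UtcNow;

        // Reuse a live entry; otherwise install a fresh one. The factories may run
        // more than once under contention, but only the stored entry's Lazy is ever
        // evaluated, so the origin is hit once per key per TTL window.
        var entry = _entries.AddOrUpdate(
            key,
            k => NewEntry(k, now),
            (k, existing) => existing.ExpiresAt > now ? existing : NewEntry(k, now));

        try
        {
            return await entry.Load.Value;
        }
        catch
        {
            // Do not cache failures: drop this entry so the next call retries the origin.
            _entries.TryRemove(new KeyValuePair<string, Entry>(key, entry));
            throw;
        }
    }

    // The TTL is counted from when the load starts, not from when it completes.
    private Entry NewEntry(string key, DateTimeOffset now) =>
        new(new Lazy<Task<TValue>>(() => _loadFromOrigin(key)), now + _ttl);
}
```

Caching the `Task` rather than the finished value is what gives coalescing for free: a request that arrives mid-load awaits the same task instead of starting a second origin fetch.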

Multi-Region Reality

Tokyo client → us-east-1 relay = 150-200ms RTT

This blows our 100ms hot-delta budget. Solutions explored:

  1. Regional relays: Tokyo player → Tokyo relay → us-east-1 tile processor

    • Adds relay→processor latency (60-80ms)
    • Still misses budget
  2. Regional tile processors: Split world by region

    • Not viable for single contiguous battle
  3. Accept 200ms for cross-region: Documented limitation

    • Current approach

Series Conclusion

Building for 10,000 concurrent players in a single battle requires:

  • Spatial partitioning (H3 hex grid)
  • Interest management (viewport filtering)
  • Sharded pub/sub (Redis 7)
  • Careful serialization (Protobuf + Zstandard)
  • Graceful degradation (passive mode, event prioritization)

The architecture is a hypothesis awaiting load test validation. The load tests needed are documented in the HEX Architecture post.

Thanks for watching.