Optimizations II: Reconnects, Edge Caching & Polish

Handling thundering herds, passive mode for alt-tabbed clients, and edge caching strategies. The final 20% of performance gains.

Overview

The final episode covers the last mile of optimization: handling failure scenarios gracefully, supporting passive clients, and edge deployment strategies.

Key Topics

Reconnect storms: server-guided jittered backoff
Passive mode: 90% fan-out reduction for alt-tabbed clients
Edge caching: Slack Flannel pattern for entity metadata
Multi-region considerations: latency vs consistency trade-offs

Reconnect Storm Handling

When 5,000 clients disconnect simultaneously (cell tower outage, AWS region blip):

Bad: Fixed 1-second retry → synchronized reconnect wave → thundering herd

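Good: Jittered backoff (fixed 1-second base plus up to 4 seconds of random jitter):

// Fixed 1 s base plus uniform random jitter in [0, 4) s
var retryDelay = TimeSpan.FromSeconds(1) +
                 TimeSpan.FromMilliseconds(Random.Shared.Next(4000));
// Reconnects land between 1 s and 5 s instead of all at the 1 s mark: roughly a 10x smaller spike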
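The snippet above uses a fixed 1-second base, so repeated failures keep retrying in the same window. As a minimal sketch of how the delay could also grow per attempt and respect a server-provided hint, the helper below doubles a cap on each attempt and picks a uniform random point under it (often called full jitter). The name `NextRetryDelay`, its parameters, the 30-second ceiling, and the idea that the server sends a minimum wait are all assumptions for illustration, not details from the episode.

```csharp
using System;

// Hypothetical helper: delay before retry number `attempt` (0-based).
// The cap doubles each attempt up to maxDelay; the actual wait is a uniform
// random point in [0, cap). An optional server hint is added as a minimum wait.
static TimeSpan NextRetryDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay,
                               TimeSpan? serverHint, Random rng)
{
    // Clamp the exponent so the shift below cannot overflow.
    int exponent = Math.Clamp(attempt, 0, 20);
    double capMs = Math.Min(maxDelay.TotalMilliseconds,
                            baseDelay.TotalMilliseconds * (1L << exponent));
    double jitteredMs = rng.NextDouble() * capMs;
    double floorMs = serverHint?.TotalMilliseconds ?? 0;
    return TimeSpan.FromMilliseconds(floorMs + jitteredMs);
}

// Usage: print the wait chosen for the first five attempts.
var random = new Random();
for (int i = 0; i < 5; i++)
{
    TimeSpan delay = NextRetryDelay(i, TimeSpan.FromSeconds(1),
                                    TimeSpan.FromSeconds(30), null, random);
    Console.WriteLine($"attempt {i}: wait {delay.TotalMilliseconds:F0} ms");
}
```

Because every client draws its own random point, a crowd that disconnected at the same instant drifts further apart with each failed attempt rather than re-synchronizing.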

Timestamps

0:15 · Reconnect problem
6:30 · Backoff strategies
13:00 · Passive mode
19:45 · Edge architecture

Passive Mode

Most players aren’t actively watching:

90% of connections are alt-tabbed or backgrounded
Passive clients: 0.5 Hz updates, delta-only
Active clients: 2 Hz updates, full events

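Result: about 90% less fan-out for each passive connection, and passive connections are the majority. The rate drop alone (2 Hz → 0.5 Hz) accounts for 75%; the remainder presumably comes from delta-only payloads being smaller than full events.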
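To make the numbers concrete, here is a rough back-of-the-envelope calculation using the rates above and the series target of 10,000 connections. It counts messages per second only; `FanOutPerSecond` is a hypothetical helper, and payload sizes (where delta-only updates would save more) are not modeled.

```csharp
using System;

// Messages per second sent to clients for a given connection mix and tick rates.
static double FanOutPerSecond(int connections, double passiveShare,
                              double activeHz, double passiveHz)
{
    double passive = connections * passiveShare;
    double active = connections - passive;
    return active * activeHz + passive * passiveHz;
}

const int totalConnections = 10_000;  // series target

double allActive = FanOutPerSecond(totalConnections, 0.0, 2.0, 0.5); // 20,000 msg/s
double mixed = FanOutPerSecond(totalConnections, 0.9, 2.0, 0.5);     // 6,500 msg/s
double reduction = 1.0 - mixed / allActive;                          // 0.675

Console.WriteLine($"all active: {allActive} msg/s, 90% passive: {mixed} msg/s");
Console.WriteLine($"message-count reduction: {reduction * 100:F1}%");
```

By message count alone, the 90/10 mix cuts total fan-out by about two thirds. Any saving beyond that has to come from passive updates carrying fewer bytes.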

Edge Caching (Flannel Pattern)

Entity metadata (names, factions, ship types) changes rarely:

Client ← Relay Pod ← Origin (DynamoDB)
              ↑
         Edge Cache (CloudFront)
              TTL: 60s + request coalescing

Cache hit: <5ms
Cache miss: 50-100ms + coalesced upstream
Staleness: Tolerable for metadata (not combat state)
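The two ideas in the diagram, a TTL and request coalescing, can be sketched in-process as follows. This is an illustration of the technique, not CloudFront configuration and not the series' actual code: `FetchFromOriginAsync` is a hypothetical stand-in for the DynamoDB read, and the entity ID and returned string are made up.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

var ttl = TimeSpan.FromSeconds(60);
var cache = new ConcurrentDictionary<string, (string Value, DateTime ExpiresAt)>();
var inFlight = new ConcurrentDictionary<string, Lazy<Task<string>>>();
int originCalls = 0;

// Hypothetical origin lookup; the delay simulates a 50 ms upstream read.
async Task<string> FetchFromOriginAsync(string entityId)
{
    Interlocked.Increment(ref originCalls);
    await Task.Delay(50);
    return $"metadata-for-{entityId}";
}

async Task<string> GetMetadataAsync(string entityId)
{
    // Cache hit: serve the stored value while it is still within its TTL.
    if (cache.TryGetValue(entityId, out var entry) && entry.ExpiresAt > DateTime.UtcNow)
        return entry.Value;

    // Cache miss: concurrent callers for the same key share one Lazy,
    // so only one origin fetch is started (request coalescing).
    var lazy = inFlight.GetOrAdd(entityId,
        id => new Lazy<Task<string>>(() => FetchFromOriginAsync(id)));
    try
    {
        string value = await lazy.Value;
        cache[entityId] = (value, DateTime.UtcNow + ttl);
        return value;
    }
    finally
    {
        // Remove only this in-flight entry so a failed fetch can be retried.
        inFlight.TryRemove(new KeyValuePair<string, Lazy<Task<string>>>(entityId, lazy));
    }
}

// Usage: 100 simultaneous requests for one entity trigger a single origin call.
string[] results = await Task.WhenAll(
    Enumerable.Range(0, 100).Select(_ => GetMetadataAsync("ship-42")));
Console.WriteLine($"{results.Length} requests, {originCalls} origin call(s)");
```

The `Lazy<Task<string>>` wrapper matters: `GetOrAdd` may run its factory more than once under contention, but only the stored `Lazy` is ever evaluated, so the origin sees one request per key per miss.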

Multi-Region Reality

Tokyo client → us-east-1 relay = 150-200ms RTT

This blows our 100ms hot-delta budget. Solutions explored:

Regional relays: Tokyo player → Tokyo relay → us-east-1 tile processor
- Adds relay→processor latency (60-80ms)
- Still misses budget
Regional tile processors: Split world by region
- Not viable for single contiguous battle
Accept 200ms for cross-region: Documented limitation
- Current approach

Series Conclusion

Building for 10,000 concurrent players in a single battle requires:

Spatial partitioning (H3 hex grid)
Interest management (viewport filtering)
Sharded pub/sub (Redis 7)
Careful serialization (Protobuf + Zstandard)
Graceful degradation (passive mode, event prioritization)

The architecture is a hypothesis awaiting load test validation. The load tests needed are documented in the HEX Architecture post.

Thanks for watching.