Optimizations II: Reconnects, Edge Caching & Polish
Handling thundering herds, passive mode for alt-tabbed clients, and edge caching strategies. The final 20% of performance gains.
Overview
The final episode covers the last mile of optimization: handling failure scenarios gracefully, supporting passive clients, and edge deployment strategies.
Key Topics
- Reconnect storms: server-guided jittered backoff
- Passive mode: 90% fan-out reduction for alt-tabbed clients
- Edge caching: Slack Flannel pattern for entity metadata
- Multi-region considerations: latency vs consistency trade-offs
Reconnect Storm Handling
When 5,000 clients disconnect simultaneously (cell tower outage, AWS region blip):
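Bad: Fixed 1-second retry → synchronized reconnect wave → thundering herd
Good: Jittered exponential backoff. The first retry is shown here; the 1-second base should grow exponentially on repeated failures:
var retryDelay = TimeSpan.FromSeconds(1) +
    TimeSpan.FromMilliseconds(Random.Shared.Next(4000));
// 1s base + 0-4s of random jitter: reconnects land between 1s and 5s
// after the drop instead of all at once, 10x spike reduction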
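The snippet above covers a single retry. A minimal sketch of how the exponential and server-guided parts might fit together follows. The helper name `NextRetryDelay`, the 16-second cap, and the `serverHint` parameter (a "do not retry before" value the server could send) are assumptions for illustration, not details from the episode.

```csharp
using System;

// Hypothetical helper: delay before reconnect attempt `attempt` (0-based).
// `serverHint` is an assumed minimum delay supplied by the server.
TimeSpan NextRetryDelay(int attempt, Random random, TimeSpan? serverHint = null)
{
    const double baseMs = 1000;       // same 1s base as the snippet above
    const double maxFloorMs = 16000;  // assumed cap on the exponential floor

    // Exponential floor: 1s, 2s, 4s, ... capped at maxFloorMs.
    double floorMs = Math.Min(baseMs * Math.Pow(2, Math.Min(attempt, 20)), maxFloorMs);

    // A server hint can only raise the floor, never lower it.
    if (serverHint.HasValue)
        floorMs = Math.Max(floorMs, serverHint.Value.TotalMilliseconds);

    // Jitter window is 4x the floor, matching the 1s + 0-4s ratio used above.
    double jitterMs = random.NextDouble() * 4 * floorMs;

    return TimeSpan.FromMilliseconds(floorMs + jitterMs);
}

// Seeded only to make the demo repeatable; real clients might use Random.Shared.
var demoRandom = new Random(42);
for (int attempt = 0; attempt < 5; attempt++)
    Console.WriteLine($"attempt {attempt}: {NextRetryDelay(attempt, demoRandom).TotalSeconds:F2}s");
```

Scaling the jitter window with the floor keeps later attempts spread out as well. A fixed 4-second window would matter less and less as the base delay grows.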
Passive Mode
Most players aren’t actively watching:
- 90% of connections are alt-tabbed or backgrounded
- Passive clients: 0.5 Hz updates, delta-only
- Active clients: 2 Hz updates, full events
Result: 90% reduction in fan-out for the majority of connections.
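As a rough sketch of the two tiers, the code below picks a push interval per client and counts outbound messages per second for a mix of clients. The function names are illustrative, and the 10,000-client figure is borrowed from the series target. It counts messages only: the smaller delta-only payloads sent to passive clients would cut bytes further, and that is not modeled here.

```csharp
using System;

// Update rates from the list above.
const double ActiveHz = 2.0;
const double PassiveHz = 0.5;

// Interval between pushes for one client, based on whether it is backgrounded.
TimeSpan UpdateInterval(bool isPassive) =>
    TimeSpan.FromSeconds(1.0 / (isPassive ? PassiveHz : ActiveHz));

// Total outbound messages per second for a given share of passive clients.
double MessagesPerSecond(int clients, double passiveFraction) =>
    clients * (passiveFraction * PassiveHz + (1 - passiveFraction) * ActiveHz);

double allActive = MessagesPerSecond(10_000, 0.0);    // 20,000 msg/s
double withPassive = MessagesPerSecond(10_000, 0.9);  // about 6,500 msg/s

Console.WriteLine($"active interval:  {UpdateInterval(false).TotalMilliseconds} ms");
Console.WriteLine($"passive interval: {UpdateInterval(true).TotalMilliseconds} ms");
Console.WriteLine($"all active:   {allActive:F0} msg/s");
Console.WriteLine($"90% passive:  {withPassive:F0} msg/s");
```

On message count alone, each passive client drops from 2 Hz to 0.5 Hz, a 75% cut per connection before any payload savings.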
Edge Caching (Flannel Pattern)
Entity metadata (names, factions, ship types) changes rarely:
Client ← Relay Pod ← Origin (DynamoDB)
             ↑
   Edge Cache (CloudFront)
   TTL: 60s + request coalescing
- Cache hit: <5ms
- Cache miss: 50-100ms+, coalesced upstream
- Staleness: Tolerable for metadata (not combat state)
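To make "TTL + request coalescing" concrete, here is a minimal in-process sketch of the two ideas. It is not CloudFront configuration, and where such a cache would live in this architecture is an assumption. `WithCoalescingCache` and `fetchFromOrigin` are hypothetical names, and the lambda with a 50ms delay stands in for the DynamoDB read.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Wraps an origin fetch with a TTL cache. Concurrent requests for the same
// key share one in-flight task, so the origin sees a single call.
Func<string, Task<string>> WithCoalescingCache(
    Func<string, Task<string>> fetchFromOrigin, TimeSpan ttl)
{
    var entries = new ConcurrentDictionary<string, (DateTime ExpiresAt, Lazy<Task<string>> Value)>();

    return key =>
    {
        var now = DateTime.UtcNow;
        // Lazy ensures the fetch starts at most once per stored entry.
        var fresh = (ExpiresAt: now + ttl,
                     Value: new Lazy<Task<string>>(() => fetchFromOrigin(key)));

        // Keep the existing entry while it is within its TTL, else replace it.
        var entry = entries.AddOrUpdate(
            key, fresh, (_, existing) => existing.ExpiresAt > now ? existing : fresh);

        return entry.Value.Value;
    };
}

int originCalls = 0;
var getMetadata = WithCoalescingCache(async id =>
{
    Interlocked.Increment(ref originCalls);
    await Task.Delay(50);                    // simulated origin latency
    return $"metadata for {id}";
}, TimeSpan.FromSeconds(60));

// 100 concurrent requests for the same entity collapse into one origin call.
var results = await Task.WhenAll(
    Enumerable.Range(0, 100).Select(_ => getMetadata("ship-42")));
Console.WriteLine($"requests: {results.Length}, origin calls: {originCalls}");
```

One limitation of this sketch: a failed fetch stays cached until its TTL expires. A production version would likely evict faulted tasks so the next request retries the origin.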
Multi-Region Reality
Tokyo client → us-east-1 relay = 150-200ms RTT
This blows our 100ms hot-delta budget. Solutions explored:
- Regional relays: Tokyo player → Tokyo relay → us-east-1 tile processor
  - Adds relay→processor latency (60-80ms)
  - Still misses budget
- Regional tile processors: Split world by region
  - Not viable for single contiguous battle
- Accept 200ms for cross-region: Documented limitation
  - Current approach
Series Conclusion
Building for 10,000 concurrent players in a single battle requires:
- Spatial partitioning (H3 hex grid)
- Interest management (viewport filtering)
- Sharded pub/sub (Redis 7)
- Careful serialization (Protobuf + Zstandard)
- Graceful degradation (passive mode, event prioritization)
The architecture is a hypothesis awaiting load test validation. The load tests needed are documented in the HEX Architecture post.
Thanks for watching.