Why etcd for Coordination When You Already Have Redis
The distinction between linearizable coordination and high-throughput hot state, and why Black Skies uses both.
A common question about the Black Skies architecture: why do we need both Redis and etcd? Redis is fast and familiar. etcd adds complexity. This post explains why the distinction between coordination and hot state matters.
Two Different Problems
Hot state (Redis):
- High write volume (every tick, every entity)
- Can tolerate brief inconsistency during failover
- Performance is critical (sub-millisecond reads)
Coordination (etcd):
- Low write volume (tile ownership changes, epoch increments)
- Must be strictly consistent
- Latency matters less than correctness
Trying to make one database do both creates problems. Redis Cluster is great for hot state but doesn’t provide linearizability. etcd provides linearizability, but every write has to be agreed on by a majority of members and persisted to disk, so it would choke on the write volume of per-tick state updates.
The CAP Tradeoff
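Redis prioritizes availability during partitions. If the primary goes down, a replica can be promoted quickly. But replication is asynchronous, so the promoted replica may be missing the most recent acknowledged writes. During that brief window, different nodes might have different views of who owns a tile.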
etcd prioritizes consistency. During a partition, members that cannot reach a majority of the cluster refuse writes rather than risk divergence. This is exactly what we want for tile ownership records.
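etcd's refusal to accept writes during a partition follows from majority-based consensus (etcd uses Raft): a write commits only once more than half of the cluster members have acknowledged it. A minimal sketch of that rule, using hypothetical function names rather than anything from etcd's own code:

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of members that forms a majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def can_accept_writes(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition still holds a majority.

    reachable_members counts the local member plus every peer it can reach.
    """
    return reachable_members >= quorum_size(cluster_size)


# A 5-member cluster split 3/2: only the 3-member side keeps accepting writes.
majority_side = can_accept_writes(3, 5)
minority_side = can_accept_writes(2, 5)
```

Because at most one side of any split can hold a majority, two sides can never both commit conflicting ownership records.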
Our Three-Tier Storage
| Tier | Technology | Use Case | Consistency | Latency |
|---|---|---|---|---|
| Hot state | Redis | Entity positions, combat state | Eventual | < 1ms |
| Coordination | etcd | Tile ownership, epochs | Linearizable | ~10ms |
| Durable | DynamoDB | Inventories, player profiles | Strong | ~50ms |
Each tier handles what it’s good at.
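One way to keep these boundaries explicit in code is a single lookup from data kind to tier. The sketch below is illustrative only: the enum, the key names, and `tier_for` are hypothetical and not taken from the Black Skies codebase.

```python
from enum import Enum


class Tier(Enum):
    HOT = "redis"
    COORDINATION = "etcd"
    DURABLE = "dynamodb"


# Data kinds from the table above, mapped to the tier that owns them.
# The key names are made up for this example.
TIER_BY_KIND = {
    "entity_position": Tier.HOT,
    "combat_state": Tier.HOT,
    "tile_ownership": Tier.COORDINATION,
    "epoch": Tier.COORDINATION,
    "inventory": Tier.DURABLE,
    "player_profile": Tier.DURABLE,
}


def tier_for(kind: str) -> Tier:
    """Return the owning tier, failing loudly for unclassified data."""
    try:
        return TIER_BY_KIND[kind]
    except KeyError:
        raise ValueError(f"no storage tier assigned for {kind!r}") from None
```

Raising on unknown kinds forces every new piece of state to be classified deliberately instead of defaulting to whichever store is most convenient.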
When Redis Loses Data
Redis is not durable by default. Depending on the persistence settings, a crash can lose seconds of writes or more. For hot state, we accept this tradeoff:
- Entity positions can be recovered from last known good state
- Combat state can be reconstructed from event logs
- Clients can replay from the last acknowledged tick
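As a rough illustration of this recovery path, the sketch below rebuilds entity positions by replaying logged events on top of the last known good snapshot. The event shape (`tick`, `entity_id`, `dx`, `dy`) and the function name are assumptions made for the example, not the actual Black Skies log format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoveEvent:
    tick: int
    entity_id: str
    dx: float
    dy: float


def rebuild_positions(snapshot, snapshot_tick, events):
    """Replay events newer than the snapshot, in tick order.

    snapshot maps entity_id -> (x, y) as of snapshot_tick.
    """
    positions = dict(snapshot)  # copy so the snapshot stays untouched
    for event in sorted(events, key=lambda e: e.tick):
        if event.tick <= snapshot_tick:
            continue  # already reflected in the snapshot
        x, y = positions.get(event.entity_id, (0.0, 0.0))
        positions[event.entity_id] = (x + event.dx, y + event.dy)
    return positions


snapshot = {"ship-1": (10.0, 5.0)}
log = [
    MoveEvent(tick=99, entity_id="ship-1", dx=1.0, dy=0.0),  # older than snapshot
    MoveEvent(tick=101, entity_id="ship-1", dx=2.0, dy=-1.0),
    MoveEvent(tick=102, entity_id="ship-1", dx=0.5, dy=0.0),
]
recovered = rebuild_positions(snapshot, snapshot_tick=100, events=log)
# recovered == {"ship-1": (12.5, 4.0)}
```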
For coordination, that kind of data loss would be catastrophic. Losing a tile ownership record means two processors could claim the same tile.
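The property that prevents a double claim is an atomic compare-and-swap: a claim succeeds only if the record is still at the epoch the claimant last observed. etcd offers this kind of guarantee through transactions that compare a key's state before writing. The in-memory class below is a toy stand-in for that behavior, not an etcd client. `OwnershipStore` and `claim` are hypothetical names, and the lock simulates the single ordered history that a linearizable store provides.

```python
import threading


class OwnershipStore:
    """Toy linearizable store: every claim is applied one at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._records = {}  # tile_id -> (owner, epoch)

    def get(self, tile_id):
        """Return (owner, epoch); an unclaimed tile is (None, 0)."""
        with self._lock:
            return self._records.get(tile_id, (None, 0))

    def claim(self, tile_id, new_owner, expected_epoch):
        """Take ownership only if the epoch is unchanged since it was read.

        Returns the new epoch on success, or None if another claim won.
        """
        with self._lock:
            _, current_epoch = self._records.get(tile_id, (None, 0))
            if current_epoch != expected_epoch:
                return None  # stale view: someone else claimed first
            new_epoch = current_epoch + 1
            self._records[tile_id] = (new_owner, new_epoch)
            return new_epoch


store = OwnershipStore()
_, seen = store.get("tile-7")                   # both processors observe epoch 0
first = store.claim("tile-7", "proc-a", seen)   # succeeds, epoch becomes 1
second = store.claim("tile-7", "proc-b", seen)  # rejected, epoch has moved on
```

The guarantee holds only while the record and its epoch survive. If the store silently lost the record, the epoch would reset and the stale claim would be accepted.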
Why Not Use DynamoDB for Everything?
DynamoDB is durable and supports strongly consistent reads. But:
- It costs money per request
- It has higher latency than Redis
- It doesn’t do pub/sub efficiently
We use DynamoDB for the “source of truth” that survives crashes: player inventories, unlocked ships, persistent progress. But we don’t hit DynamoDB on every tick.
The Bottom Line
Using both Redis and etcd isn’t over-engineering. It’s using the right tool for each job. The hard part isn’t the technology choice; it’s defining the boundaries between what needs strict consistency and what can tolerate brief inconsistency.
That’s where architecture decisions become load-bearing.