Why etcd for Coordination When You Already Have Redis

A common question about the Black Skies architecture: why do we need both Redis and etcd? Redis is fast and familiar. etcd adds complexity. This post explains why the distinction between coordination and hot state matters.

Two Different Problems

Hot state (Redis):

High write volume (every tick, every entity)
Can tolerate brief inconsistency during failover
Performance is critical (sub-millisecond reads)

Coordination (etcd):

Low write volume (tile ownership changes, epoch increments)
Must be strictly consistent
Latency matters less than correctness

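Trying to make one database do both creates problems. Redis Cluster is great for hot state but doesn’t provide linearizability. etcd provides linearizability, but every write has to be agreed on by a majority of members through Raft and persisted to disk before it is acknowledged, so it would choke on the write volume of per-tick state updates.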

The CAP Tradeoff

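Redis prioritizes availability during partitions. If the primary goes down, a replica can be promoted quickly. But replication is asynchronous, so the promoted replica may be missing writes the old primary had already acknowledged. During that brief window, different nodes might have different views of who owns a tile.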

etcd prioritizes consistency. During a partition, any side that cannot reach a majority of members will refuse writes rather than risk divergence. This is exactly what we want for tile ownership records.
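To make the difference concrete, here is a minimal in-memory sketch of the two failure behaviors. It talks to neither Redis nor etcd; `AsyncReplicatedStore`, `QuorumStore`, and `NoQuorumError` are hypothetical names invented for illustration, and the models are deliberately simplified.

```python
from dataclasses import dataclass, field


class NoQuorumError(Exception):
    """Raised when a write cannot be confirmed by a majority of members."""


@dataclass
class AsyncReplicatedStore:
    """Toy primary/replica pair: writes are acknowledged before they replicate."""
    primary: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)

    def write(self, key, value):
        # Acknowledged as soon as the primary has it; the replica catches up later.
        self.primary[key] = value

    def replicate(self):
        # Stand-in for the asynchronous replication stream.
        self.replica.update(self.primary)

    def fail_over(self):
        # The replica is promoted with whatever it had received so far.
        self.primary = dict(self.replica)


@dataclass
class QuorumStore:
    """Toy consensus-style store: a write needs a majority of members."""
    members: int = 3
    reachable: int = 3
    data: dict = field(default_factory=dict)

    def write(self, key, value):
        if self.reachable <= self.members // 2:
            raise NoQuorumError(
                f"only {self.reachable}/{self.members} members reachable"
            )
        self.data[key] = value


def demo():
    # Availability-first: the write is acknowledged, then lost in failover.
    hot = AsyncReplicatedStore()
    hot.write("tile/7/owner", "processor-a")
    hot.fail_over()  # primary dies before replicate() ever ran
    lost = "tile/7/owner" not in hot.primary

    # Consistency-first: the minority side refuses the write outright.
    coord = QuorumStore(members=3, reachable=1)
    try:
        coord.write("tile/7/owner", "processor-a")
        refused = False
    except NoQuorumError:
        refused = True
    return lost, refused
```

In the first case the caller was told the write succeeded and it later vanished; in the second the caller gets an error and knows to retry. For ownership records, the explicit error is the safer outcome.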

Our Three-Tier Storage

Tier	Technology	Use Case	Consistency	Latency
Hot state	Redis	Entity positions, combat state	Eventual	< 1ms
Coordination	etcd	Tile ownership, epochs	Linearizable	~10ms
Durable	DynamoDB	Inventories, player profiles	Strong	~50ms

Each tier handles what it’s good at.
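One way to read the table is as a small decision rule. The sketch below is only an illustration of that rule, not code from the Black Skies codebase; `Tier`, `DataProfile`, and `choose_tier` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HOT = "redis"
    COORDINATION = "etcd"
    DURABLE = "dynamodb"


@dataclass(frozen=True)
class DataProfile:
    """Hypothetical description of one kind of data."""
    needs_linearizable: bool  # must every reader agree on the latest value?
    must_survive_crash: bool  # is losing it unacceptable?
    written_every_tick: bool  # is it updated at simulation-tick frequency?


def choose_tier(profile: DataProfile) -> Tier:
    if profile.written_every_tick:
        if profile.needs_linearizable:
            # No tier in the table offers both; the data model has to change.
            raise ValueError("per-tick data cannot also require linearizability")
        return Tier.HOT
    if profile.needs_linearizable:
        return Tier.COORDINATION
    if profile.must_survive_crash:
        return Tier.DURABLE
    return Tier.HOT


# Examples drawn from the table above.
entity_position = DataProfile(False, False, True)
tile_ownership = DataProfile(True, True, False)
player_inventory = DataProfile(False, True, False)
```

The `ValueError` branch is the interesting one: if something is both written every tick and needs strict consistency, none of the three tiers fits, and the boundary has to be redrawn.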

When Redis Loses Data

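Redis is not durable by default. Depending on the persistence settings, a crash can lose anything from about a second of writes (append-only file synced once per second) to everything since the last snapshot. For hot state, we accept this tradeoff: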

Entity positions can be recovered from last known good state
Combat state can be reconstructed from event logs
Clients can replay from the last acknowledged tick

For coordination, this would be catastrophic. Losing a tile ownership record means two processors could claim the same tile.
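The guarantee we rely on can be sketched as an epoch-checked compare-and-swap. The code below is an in-memory stand-in, assuming a claim succeeds only if the caller's view of the epoch is current; `CoordinationStore`, `Ownership`, and `claim` are hypothetical names, and a lock plays the role that etcd's atomic transactions (which can compare a key's version or revision before writing) would play in a real deployment.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ownership:
    owner: str
    epoch: int


class CoordinationStore:
    """In-memory stand-in for a linearizable store (not an etcd client)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tiles = {}  # tile_id -> Ownership

    def get(self, tile_id: int) -> Optional[Ownership]:
        with self._lock:
            return self._tiles.get(tile_id)

    def claim(self, tile_id: int, processor: str,
              expected_epoch: int) -> Optional[Ownership]:
        """Claim the tile only if its epoch still equals expected_epoch.

        Returns the new record on success, or None if the caller's view
        was stale. An unowned tile is treated as epoch 0.
        """
        with self._lock:
            current = self._tiles.get(tile_id)
            current_epoch = current.epoch if current else 0
            if current_epoch != expected_epoch:
                return None
            record = Ownership(owner=processor, epoch=current_epoch + 1)
            self._tiles[tile_id] = record
            return record
```

If two processors both observe epoch 0 and both try to claim tile 7, exactly one call returns a record and the other returns `None`. That only holds if the record itself is never silently lost, which is why this state cannot live in the hot tier.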

Why Not Use DynamoDB for Everything?

DynamoDB is durable and supports strongly consistent reads. But:

It costs money per request
It has higher latency than Redis
It doesn’t do pub/sub efficiently

We use DynamoDB for the “source of truth” that survives crashes: player inventories, unlocked ships, persistent progress. But we don’t hit DynamoDB on every tick.
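One common way to keep durable writes off the tick path is to coalesce changes and flush them on a slower cadence. The sketch below assumes that approach purely for illustration; the post does not say how Black Skies schedules its DynamoDB writes. `DurableCheckpointer`, the `flush` callback, and the ten-tick interval in the usage note are all hypothetical.

```python
from typing import Callable, Dict


class DurableCheckpointer:
    """Collects changed records and flushes them every N ticks."""

    def __init__(self, flush: Callable[[Dict[str, dict]], None],
                 every_n_ticks: int):
        if every_n_ticks < 1:
            raise ValueError("every_n_ticks must be at least 1")
        self._flush = flush  # e.g. a function that performs a batched write
        self._every = every_n_ticks
        self._dirty: Dict[str, dict] = {}
        self._ticks = 0

    def mark_dirty(self, key: str, record: dict) -> None:
        # Later changes to the same key overwrite earlier ones, so each
        # flush sends at most one write per key.
        self._dirty[key] = record

    def on_tick(self) -> None:
        self._ticks += 1
        if self._ticks % self._every == 0 and self._dirty:
            batch, self._dirty = self._dirty, {}
            self._flush(batch)
```

With `every_n_ticks=10`, a record that changes on every tick produces one durable write per ten ticks instead of ten. The cost is that a crash can lose up to one interval of changes, so the interval has to be chosen per data type.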

The Bottom Line

Using both Redis and etcd isn’t over-engineering—it’s using the right tool for each job. The hard part isn’t the technology choice; it’s defining the boundaries between what needs strict consistency and what can tolerate brief inconsistency.

That’s where architecture decisions become load-bearing.