
Why etcd for Coordination When You Already Have Redis


The distinction between linearizable coordination and high-throughput hot state, and why Black Skies uses both.

distributed-systems redis etcd architecture

A common question about the Black Skies architecture: why do we need both Redis and etcd? Redis is fast and familiar. etcd adds complexity. This post explains why the distinction between coordination and hot state matters.

Two Different Problems

Hot state (Redis):

  • High write volume (every tick, every entity)
  • Can tolerate brief inconsistency during failover
  • Performance is critical (sub-millisecond reads)

Coordination (etcd):

  • Low write volume (tile ownership changes, epoch increments)
  • Must be strictly consistent
  • Latency matters less than correctness

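Trying to make one database do both creates problems. Redis Cluster is great for hot state, but it replicates asynchronously and so cannot provide linearizability. etcd provides linearizability, but every write has to be agreed on by a quorum and persisted to disk, so it would choke on the write volume of per-tick state updates.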

The CAP Tradeoff

Redis prioritizes availability during partitions. If the primary goes down, a replica can be promoted quickly. But replication is asynchronous, so the promoted replica may be missing the most recent writes, and during that brief window different nodes might have different views of who owns a tile.

etcd prioritizes consistency. During a partition, members that cannot reach a majority of the cluster refuse writes rather than risk divergence, while the majority side keeps working. This is exactly what we want for tile ownership records.
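The majority rule behind that behavior can be sketched in a few lines. This is a simplified model, not etcd's actual implementation: the function names are made up for illustration, and `reachable_members` is assumed to count the member itself.

```python
def has_quorum(reachable_members: int, cluster_size: int) -> bool:
    """True if this side of a partition still holds a strict majority."""
    return reachable_members > cluster_size // 2


def try_write(reachable_members: int, cluster_size: int) -> str:
    # A minority side rejects the write instead of accepting a value
    # that the majority side might contradict.
    if not has_quorum(reachable_members, cluster_size):
        return "rejected: no quorum"
    return "committed"


# A 5-member cluster split 3/2: only the 3-member side can commit.
majority_side = try_write(reachable_members=3, cluster_size=5)
minority_side = try_write(reachable_members=2, cluster_size=5)
```

Because at most one side of any split can hold a strict majority, two sides can never both commit conflicting ownership records. An even 2/2 split of a 4-member cluster leaves neither side able to write.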

Our Three-Tier Storage

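| Tier         | Technology | Use Case                       | Consistency  | Latency |
|--------------|------------|--------------------------------|--------------|---------|
| Hot state    | Redis      | Entity positions, combat state | Eventual     | < 1ms   |
| Coordination | etcd       | Tile ownership, epochs         | Linearizable | ~10ms   |
| Durable      | DynamoDB   | Inventories, player profiles   | Strong       | ~50ms   |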

Each tier handles what it’s good at.
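One way to keep those boundaries explicit in code is a single lookup from data kind to tier. The sketch below is illustrative only: the kind names and the `tier_for` helper are assumptions, not the actual Black Skies code.

```python
from enum import Enum


class Tier(Enum):
    HOT = "redis"
    COORDINATION = "etcd"
    DURABLE = "dynamodb"


# Hypothetical data-kind names, mapped to the three tiers described here.
TIER_FOR_KIND = {
    "entity_position": Tier.HOT,
    "combat_state": Tier.HOT,
    "tile_ownership": Tier.COORDINATION,
    "epoch": Tier.COORDINATION,
    "inventory": Tier.DURABLE,
    "player_profile": Tier.DURABLE,
}


def tier_for(kind: str) -> Tier:
    """Look up the tier for a data kind; unknown kinds fail loudly."""
    try:
        return TIER_FOR_KIND[kind]
    except KeyError:
        raise ValueError(f"no storage tier assigned for {kind!r}") from None
```

Failing on unknown kinds forces every new piece of state to get an explicit consistency decision rather than defaulting to whichever store is most convenient.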

When Redis Loses Data

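Redis is not durable by default. Out of the box it persists only through periodic snapshots, and even with the append-only file enabled the default policy fsyncs once per second, so a crash can lose the most recent seconds of writes or more. For hot state, we accept this tradeoff: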

  • Entity positions can be recovered from last known good state
  • Combat state can be reconstructed from event logs
  • Clients can replay from the last acknowledged tick
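As a rough illustration of the second point, combat state can be rebuilt by replaying logged events on top of the last good snapshot. The event shape below (`DamageEvent` with a tick, target, and amount) is a made-up example, not the real log format.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class DamageEvent:  # hypothetical event shape
    tick: int
    target_id: str
    amount: int


def rebuild_hit_points(
    snapshot: Dict[str, int],
    snapshot_tick: int,
    events: Iterable[DamageEvent],
) -> Dict[str, int]:
    """Replay events newer than the snapshot on top of a copy of it."""
    hit_points = dict(snapshot)
    for event in sorted(events, key=lambda e: e.tick):
        if event.tick <= snapshot_tick:
            continue  # already reflected in the snapshot
        if event.target_id not in hit_points:
            continue  # entity unknown to the snapshot; skipped in this sketch
        remaining = hit_points[event.target_id] - event.amount
        hit_points[event.target_id] = max(0, remaining)
    return hit_points


snapshot = {"ship-1": 100, "ship-2": 40}
log = [
    DamageEvent(tick=9, target_id="ship-1", amount=30),   # before snapshot
    DamageEvent(tick=12, target_id="ship-2", amount=50),
    DamageEvent(tick=11, target_id="ship-1", amount=25),
]
recovered = rebuild_hit_points(snapshot, snapshot_tick=10, events=log)
# recovered == {"ship-1": 75, "ship-2": 0}
```

The replay works only because the event log lives somewhere that survives the Redis crash. That is the assumption that makes losing hot state acceptable.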

For coordination, this would be catastrophic. Losing a tile ownership record means two processors could claim the same tile.
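The guarantee that matters here is an atomic "claim only if unowned" step. The sketch below models it with an in-memory registry guarded by a lock. It is a stand-in for a linearizable store, not an etcd client. The `TileRegistry` class and the rule that each transfer increments the epoch are assumptions for illustration. In etcd, the same check-then-write would be expressed as a transaction whose comparison guards the write.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Ownership:
    owner: str
    epoch: int


class TileRegistry:
    """In-memory stand-in for a linearizable store (hypothetical)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tiles: Dict[str, Ownership] = {}

    def claim(self, tile: str, processor: str) -> Optional[Ownership]:
        """Claim an unowned tile; return None if it already has an owner."""
        with self._lock:  # check and write happen as one atomic step
            if tile in self._tiles:
                return None
            record = Ownership(owner=processor, epoch=1)
            self._tiles[tile] = record
            return record

    def transfer(
        self, tile: str, expected_epoch: int, new_owner: str
    ) -> Optional[Ownership]:
        """Hand the tile over only if the caller saw the latest epoch."""
        with self._lock:
            current = self._tiles.get(tile)
            if current is None or current.epoch != expected_epoch:
                return None  # stale view: the caller must re-read first
            record = Ownership(owner=new_owner, epoch=current.epoch + 1)
            self._tiles[tile] = record
            return record


# Two processors race for the same tile; exactly one claim succeeds.
registry = TileRegistry()
results: List[Optional[Ownership]] = []


def worker(name: str) -> None:
    results.append(registry.claim("tile-7", name))


threads = [threading.Thread(target=worker, args=(n,)) for n in ("proc-a", "proc-b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

winners = [r for r in results if r is not None]
```

If the record could silently disappear, as it can after a Redis failover, the second `claim` would succeed too, and both processors would believe they own `tile-7`.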

Why Not Use DynamoDB for Everything?

DynamoDB is durable and supports strongly consistent reads. But:

  • It costs money per request
  • It has higher latency than Redis
  • It doesn’t do pub/sub efficiently

We use DynamoDB for the “source of truth” that survives crashes: player inventories, unlocked ships, persistent progress. But we don’t hit DynamoDB on every tick.
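A back-of-the-envelope calculation shows why per-tick writes and per-request billing do not mix. Every number below is invented for illustration (entity count, tick rate, player count, save frequency); none of them are real Black Skies figures.

```python
SECONDS_PER_DAY = 86_400


def per_tick_writes_per_day(entities: int, ticks_per_second: int) -> int:
    """Requests per day if every entity were written on every tick."""
    return entities * ticks_per_second * SECONDS_PER_DAY


def progress_writes_per_day(players: int, saves_per_hour: int) -> int:
    """Requests per day if only persistent progress is written."""
    return players * saves_per_hour * 24


# Hypothetical workload: 5,000 entities at 20 ticks/s versus
# 2,000 players who each save progress 6 times per hour.
hot_volume = per_tick_writes_per_day(entities=5_000, ticks_per_second=20)
durable_volume = progress_writes_per_day(players=2_000, saves_per_hour=6)
ratio = hot_volume // durable_volume
# hot_volume == 8_640_000_000, durable_volume == 288_000, ratio == 30_000
```

Under these made-up numbers the per-tick workload is four orders of magnitude larger than the durable one, which is the gap that keeps tick traffic out of a store billed per request.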

The Bottom Line

Using both Redis and etcd isn’t over-engineering—it’s using the right tool for each job. The hard part isn’t the technology choice; it’s defining the boundaries between what needs strict consistency and what can tolerate brief inconsistency.

That’s where architecture decisions become load-bearing.