Split-Brain Resolution with Epoch-Based Fencing Tokens
How CockroachDB's range lease model and Kleppmann's fencing tokens informed our approach to tile processor ownership.
In a distributed game with thousands of tile processors, the question isn’t whether split-brain will happen—it’s when. This post describes how Black Skies uses epoch-based fencing tokens to prevent stale processors from corrupting game state.
The Problem
Each tile in Black Skies has exactly one authoritative processor. But network partitions and process crashes can create situations where two processors both believe they own the same tile.
Without a fencing mechanism, a stale processor could:
- Accept player actions that should be rejected
- Overwrite state that a new processor has modified
- Send conflicting updates to clients
The Solution: Epoch Fencing
Every tile ownership record has three fields:
{tileId, ownerProcessorId, leaseEpoch}
When a tile needs a new processor (crash, scale-out, rebalancing):
- Tile Manager increments the epoch via etcd compare-and-swap
- New processor starts with the higher epoch
- Redis Lua scripts check epoch on every write
- Stale writes are rejected at the storage layer
This is directly inspired by:
Why etcd?
We already use Redis for hot state. Why add etcd?
Redis is AP (availability + partition tolerance). During a network partition, different nodes can have different views of who owns a tile.
etcd is CP (consistency + partition tolerance). It provides linearizable operations—the guarantee that once etcd says “processor X owns tile Y at epoch Z,” that fact is true everywhere.
The Fencing Check
Every state mutation in Redis goes through a Lua script:
local current = redis.call('HGET', 'tile:' .. tileId, 'epoch')
if tonumber(current) > tonumber(providedEpoch) then
return {err = 'stale epoch: ' .. providedEpoch .. ' < ' .. current}
end
-- proceed with write
This check is atomic with the write. Race conditions can’t slip through.
Recovery
When a processor detects its epoch is stale (write rejected), it:
- Sends migration signals to affected entities
- Stops accepting new actions
- Terminates after reference count drops to zero
This graceful degradation prevents the “zombie processor” problem.
What This Doesn’t Solve
Epoch fencing handles the “two processors think they own the same tile” case. It doesn’t handle:
- Client-side confusion (solved by sequence numbers)
- Network partitions between clients and servers (solved by timeouts and reconnect)
- Byzantine failures (we assume processors don’t actively try to break the system)
Status
The fencing implementation is complete and unit-tested. Chaos testing (random process kills, network partitions) is the next step.
Related: Sharding a Real-Time World with H3 Hexagonal Indexing