Skip to main content

Split-Brain Resolution with Epoch-Based Fencing Tokens

3 min read

How CockroachDB's range lease model and Kleppmann's fencing tokens informed our approach to tile processor ownership.

distributed-systems consistency fault-tolerance

In a distributed game with thousands of tile processors, the question isn’t whether split-brain will happen—it’s when. This post describes how Black Skies uses epoch-based fencing tokens to prevent stale processors from corrupting game state.

The Problem

Each tile in Black Skies has exactly one authoritative processor. But network partitions and process crashes can create situations where two processors both believe they own the same tile.

Without a fencing mechanism, a stale processor could:

  • Accept player actions that should be rejected
  • Overwrite state that a new processor has modified
  • Send conflicting updates to clients

The Solution: Epoch Fencing

Every tile ownership record has three fields:

{tileId, ownerProcessorId, leaseEpoch}

When a tile needs a new processor (crash, scale-out, rebalancing):

  1. Tile Manager increments the epoch via etcd compare-and-swap
  2. New processor starts with the higher epoch
  3. Redis Lua scripts check epoch on every write
  4. Stale writes are rejected at the storage layer

This is directly inspired by:

Why etcd?

We already use Redis for hot state. Why add etcd?

Redis is AP (availability + partition tolerance). During a network partition, different nodes can have different views of who owns a tile.

etcd is CP (consistency + partition tolerance). It provides linearizable operations—the guarantee that once etcd says “processor X owns tile Y at epoch Z,” that fact is true everywhere.

The Fencing Check

Every state mutation in Redis goes through a Lua script:

local current = redis.call('HGET', 'tile:' .. tileId, 'epoch')
if tonumber(current) > tonumber(providedEpoch) then
    return {err = 'stale epoch: ' .. providedEpoch .. ' < ' .. current}
end
-- proceed with write

This check is atomic with the write. Race conditions can’t slip through.

Recovery

When a processor detects its epoch is stale (write rejected), it:

  1. Sends migration signals to affected entities
  2. Stops accepting new actions
  3. Terminates after reference count drops to zero

This graceful degradation prevents the “zombie processor” problem.

What This Doesn’t Solve

Epoch fencing handles the “two processors think they own the same tile” case. It doesn’t handle:

  • Client-side confusion (solved by sequence numbers)
  • Network partitions between clients and servers (solved by timeouts and reconnect)
  • Byzantine failures (we assume processors don’t actively try to break the system)

Status

The fencing implementation is complete and unit-tested. Chaos testing (random process kills, network partitions) is the next step.


Related: Sharding a Real-Time World with H3 Hexagonal Indexing