Distribute Raft Workers Across The Cluster by zonotope · Pull Request #1370 · fluree/db

zonotope · 2026-06-24T16:09:17Z

Splits the leader-only CommitWorker into per-branch Stager tasks, distributes them across all cluster members via rendezvous hashing, and adds a cross-node RPC so follower-staged commits route their ApplyHead propose through the current leader. Net effect: followers no longer sit idle while the leader serializes every transaction; each node owns ~1/N of branches and stages in parallel.

Motivation

The previous architecture ran every per-branch commit worker on the raft leader. Followers were ledger replicas that only saw applies — they didn't stage. On a busy ledger this made the leader the bottleneck for every transaction in the cluster, while CPU and IO on followers went unused. The design goal was to fan staging out without losing the consensus guarantees raft already provides (exactly-once apply, ordered queue per branch).

Architecture

Per-branch staging tasks

CommitWorker (one task on the leader, iterating every branch's queue in turn) is replaced by Stager (one task per active RefKey, polls only its own branch). Each stager owns the same end-to-end pipeline as the old worker — load envelope from CAS, stage the commit, write the commit blob, publish the head advance — scoped to a single branch. Cross-branch concurrency is now the default: branch B's first entry no longer waits for branch A's backlog to drain.

The staging code itself is unchanged; this is a refactor of who calls it, not how it works. The retry loop (MAX_STAGE_ATTEMPTS + exponential backoff + poison fallback), the panic-per-entry boundary (catch_unwind around process_entry), and the StagedReceiptMap side-channel all behave identically.

Node-lifetime supervisor

StagerSupervisor runs on every node (leader and followers alike), spawned at process start instead of being bound to leadership transitions. Each tick (250ms) the supervisor reconciles its running stagers against the desired set: branches in NameServiceState::queues whose rendezvous owner under the current voter set is this node. New owner → spawn a stager; ownership moved away → abort the stager. Shutdown is driven by CancellationToken; the supervisor's select! loop catches the cancel, aborts every running stager, and returns.

The supervisor wakes on either raft.metrics().changed() (membership / leader / term updates, so ownership recomputes within one tick) or a 250ms poll (new branches appearing in the queue map).
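
The reconcile step described above amounts to a set difference between the stagers that are running and the branches this node should own. The following is a minimal synchronous sketch, not the PR's code: `RefKey`, `NodeId`, `Action`, and `reconcile` are hypothetical names, ownership is passed in as a closure instead of being computed by rendezvous hashing, and real stagers would be async tasks rather than entries in a set.

```rust
use std::collections::BTreeSet;

type RefKey = String; // hypothetical stand-in for the real branch key type
type NodeId = u64; // hypothetical stand-in for the raft node id type

#[derive(Debug, PartialEq, Eq)]
enum Action {
    Spawn(RefKey),
    Abort(RefKey),
}

/// One reconcile tick: compare the stagers running on `self_id` with the
/// branches it should own and return the actions that close the gap.
fn reconcile(
    self_id: NodeId,
    queued: &BTreeSet<RefKey>,
    running: &mut BTreeSet<RefKey>,
    owner: impl Fn(&RefKey) -> Option<NodeId>,
) -> Vec<Action> {
    // Desired set: queued branches whose owner is this node.
    // An owner of `None` (no voters yet) means nothing is claimed.
    let desired: BTreeSet<RefKey> = queued
        .iter()
        .filter(|k| owner(*k) == Some(self_id))
        .cloned()
        .collect();

    let mut actions = Vec::new();
    // Ownership moved away, or the branch left the queue map: abort.
    for key in running.difference(&desired) {
        actions.push(Action::Abort(key.clone()));
    }
    // Newly owned branch with no stager yet: spawn.
    for key in desired.difference(&*running) {
        actions.push(Action::Spawn(key.clone()));
    }
    *running = desired;
    actions
}
```

Calling `reconcile` twice with the same inputs returns an empty action list the second time, which is what lets the supervisor run it on every wake-up without tracking what changed.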

Rendezvous-hashing assignment

A new raft::ownership::owner(ref_key, voters) function maps each RefKey to the NodeId that should run its stager. Uses xxh64 with a fixed seed (so every node computes identical scores) and Highest Random Weight selection. Per-call cost is one xxh64 over the composed key bytes plus one xxh64 per voter. Empty voter set returns None; the supervisor treats that as "cluster not yet bootstrapped, claim nothing."
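
A rough sketch of Highest Random Weight selection, for readers unfamiliar with it. This is not the code in `raft::ownership`: to stay dependency-free it swaps xxh64 for a hand-written FNV-1a plus a splitmix64-style mixer, takes the key as a plain `&str`, and breaks ties on the larger `NodeId` (the source does not say how ties are broken). All names other than `owner` are assumptions.

```rust
type NodeId = u64; // assumed id type for this sketch

/// FNV-1a over the key bytes. A deterministic stand-in for xxh64 with a
/// fixed seed; it is NOT the hash the PR uses.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for b in bytes {
        h ^= u64::from(*b);
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// splitmix64-style finalizer, used to spread nearby voter ids apart.
fn mix64(mut z: u64) -> u64 {
    z = (z ^ (z >> 30)).wrapping_mul(0xbf58_476d_1ce4_e5b9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94d0_49bb_1331_11eb);
    z ^ (z >> 31)
}

/// Score of one (key, voter) pair. Every node computes the same value.
fn score(key_hash: u64, voter: NodeId) -> u64 {
    mix64(key_hash ^ mix64(voter))
}

/// The voter with the highest score owns the key. The key is hashed once,
/// then one score is computed per voter. An empty voter set yields `None`.
fn owner(ref_key: &str, voters: &[NodeId]) -> Option<NodeId> {
    let key_hash = fnv1a64(ref_key.as_bytes());
    voters
        .iter()
        .copied()
        // Tie-break on the id so the result does not depend on input order.
        .max_by_key(|v| (score(key_hash, *v), *v))
}
```

Because the winner is a maximum over per-voter scores, adding a voter can only move a key to the new voter, and removing a voter only moves the keys that voter owned.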

Rendezvous was chosen over plain hash(ref_key) % N because the latter reshuffles ~N/(N+1) of branches when a cluster grows from N to N+1 nodes (adding the 5th node to a 4-node cluster moves ~80% of branches). Rendezvous moves only ~1/(N+1) of them (~20% in the same example), and only branches that genuinely changed owner, which matters for rolling-restart deployments.
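
The modulo figure for the 4-to-5 node case can be checked by direct counting. The sketch below assumes uniformly distributed hash values and counts, over one full period of residues, how many keys change owner under `hash % n`; `modulo_moved` is a hypothetical helper, not part of the PR.

```rust
/// Count the hash residues whose owner changes when `hash % n` assignment
/// goes from `old_n` to `new_n` nodes. Returns `(moved, total)` over one
/// full period, so `moved / total` is the exact fraction for uniform hashes.
fn modulo_moved(old_n: u64, new_n: u64) -> (u64, u64) {
    // old_n * new_n is a multiple of lcm(old_n, new_n), so the pattern of
    // (h % old_n, h % new_n) pairs repeats exactly after this many values.
    let period = old_n * new_n;
    let moved = (0..period)
        .filter(|h| h % old_n != h % new_n)
        .count() as u64;
    (moved, period)
}
```

`modulo_moved(4, 5)` returns `(16, 20)`: only the residues 0 through 3 keep their owner, so 80% of keys move. Under rendezvous hashing the expected share for the same change is 1/5, because a key moves only when the new node's score beats all four existing ones.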

Voters only, not learners. Learners are full data replicas but typically transient (joining cluster, staging for promotion); promoting one to voter is the operator's signal that it should participate.

Cross-node ApplyHead RPC

When a stager runs on a follower, it can't propose Command::ApplyHead directly — openraft surfaces ForwardToLeader on non-leader proposes. Added RaftNameService::apply_staged_commit(args), a new method exposed at POST /raft/apply_staged_commit (postcard body, no auth, same intra-cluster trust model as the existing openraft RPCs). The follower's publish_commit detects non-leader role, looks up the leader's raft_addr from membership, takes the staged receipt out of its local map, ferries it to the leader via the new endpoint. The leader's handler validates the queue front matches the follower's queue_id, stashes the receipt in the leader's StagedReceiptMap, proposes Command::ApplyHead, returns the outcome.

The endpoint lives on the existing private listener under /raft alongside the openraft RPCs. It is included in RaftIntegration::raft_rpc_router() so any caller using the integration's router (including the multi-node test harness) gets it automatically.

The CommitPublisher trait is untouched. All forwarding logic lives inside the RaftNameService impl — other backends (MemoryNameService, FileNameService, DynamoDbNameService, etc.) are unaffected.

Event-bus reconciliation

AppliedReceipt and its per-op variants are now Clone + Serialize + Deserialize so the typed receipt rides the wire to the leader (lets cross-node-staged transactions resolve with full per-op detail instead of falling back to AppliedReceipt::Minimal).

The state-machine adapter publishes LedgerCommitPublished events on RaftIntegration::event_bus on every ApplyHead apply. The previous local cache event listener subscribed to Fluree's internal LedgerEventBus and only handled LedgerIndexPublished / LedgerRetracted — LedgerCommitPublished was silently dropped. In the old architecture this was fine because the leader's stager called finalize_local_state on the leader's Fluree directly, keeping its cache warm. With distributed stagers, the staging node and the queried node are usually different, so non-staging nodes needed an event-driven refresh path.

Two fixes: the existing listener now handles LedgerCommitPublished (merged with the existing LedgerIndexPublished arm — both call LedgerManager::notify); spawn_local_cache_event_listener is made pub and the server spawns a second instance against the raft integration's bus, so commits published by the state-machine adapter reach the cache. Fluree's internal listener stays put (harmless when raft owns publish — its bus simply sees no events).

Lifecycle integration

RaftIntegration now owns the Arc<RaftNameService> it used to construct externally — the integration builds it in new with staged_receipts and with_forwarding(id, http_client) configured, and exposes it via nameservice(). Both apply_staged_commit_router and raft_network::router mount under /raft via raft_rpc_router().

Server lifecycle gains one new field: raft_stager_supervisor: Option<CancellableTaskHandle>. Spawned at startup as a peer to raft_leader_watcher. Shutdown sequence is supervisor → leader watcher → release task so in-flight stagers drain before the leader-only background tasks (indexer, evictor) stop racing on shared state.

LeaderWatcherHandle and StagerSupervisorHandle were structurally identical (a JoinHandle plus a CancellationToken) so they were consolidated into a single CancellableTaskHandle driven by a private spawn_cancellable(future_fn) helper. The two public spawn functions still exist with their distinct docs; only the handle type is shared.

Failure modes

Mid-stage abort (ownership moves while a stager is partway through staging): the new owner restarts from scratch. Partial CAS writes on the old node are orphaned but safe (content-addressed; the same bytes would have hashed identically). Same lifecycle pattern as today's leader-loss abort.
Transient duplicate stagers during a membership change: ownership decisions on different nodes may briefly disagree because they're computed against different membership_config snapshots. Both nodes may stage the same queue entry. Wasteful but safe — the state machine's queue_id front check serializes ApplyHead, so exactly one apply lands per entry and the loser sees Stale.
Cross-node RPC failures (network / unreachable leader): the follower's stager treats the failure as a publish error, cleans up its local stash, and the outer retry loop re-stages. Idempotent because the queue_id check on the state machine guarantees exactly-once apply regardless of how many times we try.
Leader change mid-apply_staged_commit: the leader receives the ferried receipt, stashes it, calls client_write. If it steps down before commit, openraft returns ForwardToLeader; my handler takes the stash back and returns NotLeader to the follower. The follower retries against the new leader.
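
Several of the cases above rest on the queue_id front check. A minimal sketch of that check, assuming a per-branch FIFO of queue ids: `BranchQueue`, `ApplyOutcome`, and `apply_head` are hypothetical names chosen for illustration, and the real state machine carries more state than this.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
enum ApplyOutcome {
    Applied,
    /// The entry is not at the front; carries the current front, if any.
    Stale { current_front: Option<u64> },
}

/// Hypothetical per-branch queue as the state machine might see it.
struct BranchQueue {
    entries: VecDeque<u64>, // queue ids in submission order
}

impl BranchQueue {
    /// Apply a head advance for `queue_id` only if it is the current front.
    /// Raft applies commands one at a time in log order, so two competing
    /// proposals for the same entry resolve to one Applied and one Stale.
    fn apply_head(&mut self, queue_id: u64) -> ApplyOutcome {
        match self.entries.front().copied() {
            Some(front) if front == queue_id => {
                self.entries.pop_front();
                ApplyOutcome::Applied
            }
            current_front => ApplyOutcome::Stale { current_front },
        }
    }
}
```

With a queue holding ids 7 and 8, two `apply_head(7)` calls in a row return `Applied` and then `Stale { current_front: Some(8) }`, which is the duplicate-stager and retry case.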

bplatz · 2026-06-24T17:00:26Z

 }

-#[derive(Debug)]
+#[derive(Clone, Debug, Serialize, Deserialize)]


TransactApplied embeds Option<TrackingTally>, and TrackingTally has skip_serializing_if fields — those don't round-trip through positional postcard on the apply_staged_commit RPC. Needs a skip-free wire DTO, plus a round-trip test for ApplyStagedCommitArgs (none today).
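
To illustrate why a skipped field breaks a positional format, here is a hand-rolled toy encoding. It is not postcard and not the real `TrackingTally`: `Payload`, `encode`, and `decode` are hypothetical, and the `skip_none` flag stands in for a field marked `skip_serializing_if`. The reader always expects every field in order, so omitting one on the write side shifts everything after it.

```rust
/// Hypothetical two-field payload. `tally` plays the role of a field
/// annotated with `skip_serializing_if = "Option::is_none"`.
#[derive(Debug, PartialEq)]
struct Payload {
    tally: Option<u8>,
    t: u8,
}

/// Positional writer: fields are emitted in declaration order, no names.
/// With `skip_none`, a `None` tally is omitted instead of written as tag 0.
fn encode(p: &Payload, skip_none: bool) -> Vec<u8> {
    let mut out = Vec::new();
    match p.tally {
        Some(v) => {
            out.push(1);
            out.push(v);
        }
        None if skip_none => {} // field silently left out
        None => out.push(0),
    }
    out.push(p.t);
    out
}

/// Positional reader: always expects the option tag first, then `t`.
fn decode(bytes: &[u8]) -> Option<Payload> {
    let mut it = bytes.iter().copied();
    let tally = match it.next()? {
        0 => None,
        1 => Some(it.next()?),
        _ => return None, // not a valid option tag
    };
    let t = it.next()?;
    if it.next().is_some() {
        return None; // trailing bytes
    }
    Some(Payload { tally, t })
}
```

For `Payload { tally: None, t: 7 }`, the skip-free encoding is `[0, 7]` and decodes back to the same value, while the skipping encoding is `[7]`, which the reader rejects because it reads 7 as the option tag. A round-trip test on the wire type catches exactly this class of mismatch.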

bplatz · 2026-06-24T17:00:26Z

        }
+
+        for (_, handle) in stagers.drain() {
+            handle.abort();


Stagers are abort()ed but never awaited here (same at the reassign path ~1040), so an ownership flap can respawn a stager for the same RefKey before the old one yields, racing on the shared staged_receipts. Want the abort_and_await discipline from the leader watcher (reconcile would need to go async).

bplatz · 2026-06-24T17:00:26Z

-/// queue front. Poisoning still goes direct to Raft because there's
-/// no trait surface for "fail this queue entry."
+/// queue front (forwarding to the leader when this stager runs on a
+/// follower). Poisoning still goes direct to Raft because there's no


This bites follower-owned stagers: propose_poison goes through local client_write, which returns ForwardToLeader on a follower → the poison bounces forever and head-of-line-blocks the branch. Deterministic poisons need a leader-forwarded path like apply_staged_commit.

bplatz · 2026-06-24T17:28:35Z

+
+        let resp = forwarding
+            .http_client
+            .post(&target)


No per-request .timeout(...) here — only the shared client's connect_timeout. The other raft RPCs set one (network.rs / forward.rs); without it a connected-but-stalled leader hangs the stager on this send().await before it ever reaches backoff/retry. Thread a timeout through ForwardingConfig.

bplatz · 2026-06-24T17:55:33Z

+                "apply_staged_commit stale: queue_id {queue_id} no longer at front \
+                 (current front: {current_front_queue_id:?})"
+            ))),
+            Err(e) => Err(NameServiceError::storage(format!(


This catch-all flattens every ApplyStagedCommitError into one storage error, so the caller retries them all as transient. Terminal ones (notably InvariantViolated) then spin forever instead of poisoning. Classify off the structured variants here.
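
One possible shape for the classification, sketched under assumptions: the real `ApplyStagedCommitError` variant set is not shown in this thread, so `ApplyError` below is a hypothetical enum with only the cases mentioned in the discussion plus a generic transport failure, and `Disposition` and `classify` are illustrative names.

```rust
/// Hypothetical error type; the real variant set may differ.
#[derive(Debug)]
enum ApplyError {
    /// The receiving node is not (or is no longer) the leader.
    NotLeader,
    /// The entry is no longer at the queue front.
    Stale { current_front: Option<u64> },
    /// The request can never succeed, no matter how often it is retried.
    InvariantViolated(String),
    /// Network failure, timeout, or unreachable leader.
    Transport(String),
}

#[derive(Debug, PartialEq, Eq)]
enum Disposition {
    /// Transient: go back through backoff and try again.
    Retry,
    /// Someone else already resolved this entry: stop working on it.
    Drop,
    /// Terminal: fail the queue entry so the branch can move on.
    Poison,
}

/// Decide what the stager should do, based on the structured variant
/// rather than on a flattened error string.
fn classify(err: &ApplyError) -> Disposition {
    match err {
        ApplyError::NotLeader | ApplyError::Transport(_) => Disposition::Retry,
        ApplyError::Stale { .. } => Disposition::Drop,
        ApplyError::InvariantViolated(_) => Disposition::Poison,
    }
}
```

The match is exhaustive with no catch-all arm, so adding a variant later forces a deliberate choice about how it is handled.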

bplatz · 2026-06-24T17:55:33Z

+/// raft port to peer addresses only.
+pub fn apply_staged_commit_router(ns: Arc<RaftNameService>) -> Router {
+    Router::new()
+        .route(APPLY_STAGED_COMMIT_PATH, post(handle_apply_staged_commit))


No DefaultBodyLimit on this route, so it falls back to axum's 2 MiB default while the sibling raft RPCs set explicit caps — and an oversize receipt 413s into the retry-forever path. Add an explicit cap (reuse the network config).

bplatz · 2026-06-24T17:56:27Z

-//!    work locally, writes the commit blob, stashes the typed receipt
-//!    in [`staged_receipt::StagedReceiptMap`], and proposes
-//!    [`state_machine::Command::ApplyHead`] via the
+//! 3. The leader-only [`commit_worker::StagerSupervisor`] (driven by


Stale now — the StagerSupervisor runs on every node and isn't driven by the leader watcher. The "Submission flow on the leader" framing at the top of this module doc reads the same way.

bplatz · 2026-06-24T17:59:50Z

-            })
+            .keys()
+            .filter(|ref_key| owner(ref_key, &voters) == Some(self.id))
+            .cloned()


Worth calling out as a known liveness limitation: owner() runs over the configured voter set (current_voters() reads membership_config), not the set that's actually up. If a voter is down but still in the membership config, every live node computes that dead node as the owner for ~1/N of branches and declines, so those branches have no stager anywhere and their queues stall — even though consensus still has quorum. It clears only when the voter returns or membership is reconfigured to drop it. This is a real reduction in write availability vs. what raft alone tolerates (the old leader-only worker never hit it, since the leader is always live).

Not suggesting the naive fix — filtering by live voters would break the determinism this relies on (nodes have divergent liveness views → owner disagreement → either gaps or double-ownership on the same RefKey). A real fix has to make liveness a consensus fact (e.g. leader-driven membership eviction of an unreachable voter). For this PR a doc note on the limitation is probably enough; flagging so it's a deliberate choice rather than a surprise.

bplatz · 2026-06-26T12:00:51Z

            }
            Err(err) => {
-                if self.commit_replicated(ref_key, &commit_id).await {
+                if self.commit_replicated(&commit_id).await {


commit_replicated proves our ApplyHead landed only when commit_id is a fresh CID. No-op revert/rebase republish the existing head (current_head_id / pre_rebase_head_id), so head == commit_id holds whether or not this entry's ApplyHead actually applied. On a publish error this then returns Ok(()), run() sets last_committed = queue_id, but the front was never popped — snapshot_front keeps returning the same entry and the queue_id <= last guard sleeps the branch forever. Gate the landed-check on the queue front advancing past queue_id, not on head equality.
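
The difference between the two checks can be shown in a few lines. This sketch assumes queue ids increase monotonically within a branch and that an empty queue means the entry was popped; `landed_by_head` and `landed_by_front` are hypothetical helpers, not functions from the PR.

```rust
/// Head-equality check: true whenever the branch head equals `commit_id`.
/// For a no-op revert/rebase the head already equals `commit_id` before
/// the entry is applied, so this can report success too early.
fn landed_by_head(head: &str, commit_id: &str) -> bool {
    head == commit_id
}

/// Front-advance check: the entry is done only once the queue front has
/// moved past `queue_id` (or the queue has drained).
fn landed_by_front(front: Option<u64>, queue_id: u64) -> bool {
    match front {
        None => true,              // queue drained, so the entry was popped
        Some(f) => f > queue_id,   // front is a later entry
    }
}
```

For a republish of the existing head with the entry still at the front (`head == commit_id`, `front == Some(queue_id)`), `landed_by_head` returns true while `landed_by_front` returns false, which is the stuck-branch case described above.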

bplatz

Looks good, just added one more comment for a type of failure scenario.

zonotope added 13 commits June 23, 2026 15:08

add method to returning full ledger id from ref_key

639e177

add separate worker for each ref queue + supervisor to manage them

fef36a5

Merge remote-tracking branch 'origin/feature/raft' into feature/distr…

c4037a5

…ibuted-raft-workers

add raft branch queue ownership function for distributing to workers

6d134d3

make applied receipts (de)serializable

b00cbc6

add leader forwarding config for publish_commit on a follower/worker

3aa33dd

start stagers with forwarding config + http routes

ca35018

clean up redundant self_id fields

fc467fb

move apply commit route to the rpc writer; forward ledger commit events

068f7ad

clean up doc comments

994ad5a

cleanup

2d8c99a

fmt + clippy

13aeafe

consolidate cancellable tasks

d11a6db

zonotope requested review from aaj3f and bplatz June 24, 2026 16:09

bplatz reviewed Jun 24, 2026

Base automatically changed from feature/raft to main June 24, 2026 18:35

zonotope added 9 commits June 24, 2026 14:45

Merge remote-tracking branch 'origin/main' into feature/distributed-r…

ff8d8c0

…aft-workers

fix broken tests caused by merge

505d2e4

fmt

94b7797

block worker until it's last commit appears in it's replicated state

2e02bc4

wait until every node converges on the same leader in raft test

a1aa999

add postcard round-trippable wire shapes for apply commit rpc

8043654

await on all aborted join handles to ensure teardown

4033219

add trait for more reliable poison message dispatching

b4ab913

add timeouts for rpc apply commit and poison to detect stalled leader

2feba5e

zonotope added 8 commits June 25, 2026 21:32

add max body limit for staged commit and queue poison requests

efc7894

Update submission flow documentation

81f3d62

stager -> worker

bb7050e

Replace WorkerSupervisorParts with logical groupings

0fc586b

group RaftIntegration constructor fields logically

17e841d

add types and commands for storing voters eligible to hosting workers

bdac271

add worker eligible voters to state machine

175ff41

use eligible voter set to compute queue owners

93ace98

bplatz reviewed Jun 26, 2026

bplatz approved these changes Jun 26, 2026

zonotope added 20 commits June 26, 2026 14:06

add peer liveness monitor

e7d06b4

CreateLedgerArgs -> NewLedger, CreateBranchArgs -> NewBranch

2324fbd

EnqueueCommandArgs -> QueueSubmissions

0194aba

better payload names

c06b304

fmt + clippy

71560d5

add end to end integration test that exercises the liveness monitor

c7c5eac

only mutate last proposed when the proposed eligibility landed

b8e0a4d

don't treat stale applied commits as transient. drop and move on

fdc2fc5

retain surviving demoted voter set after membership changes

c2407c3

add ssrf guard for apply_staged_commit and apply_queue_poison

37eb87a

limit per tick allocation for tracking branch ownership

42f2d15

refactor compute_desired_owners to eliminate test-only helper

0d75d90

consolidate raft commit and queue poison errors with trait

1d203dc

use existing current_millis fn instead of reimplementing

e7716d9

store status codes instead of strings; allow Display to format errors

c88c385

descriptive instead of persuasive doc comments

5e04382

hold/borrow read locks serially instead of simultaneously

2c977d2

add stash guard with explicit drop to avoid manual clearing on error

b0a6759

avoid deep-cloning RaftMetrics on every liveness monitor tick

12baff4

detect when membership set hasn't changed to skip voter set collection

cf3f10b

zonotope wants to merge 51 commits into main from feature/distributed-raft-workers

2 participants
