[world-local] Reduce sequential replay I/O#2152
Conversation
🦋 Changeset detectedLatest commit: 0a0264f The changes in this PR will be included in the next version bump. This PR includes changesets to release 18 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results❌ Some tests failed Summary
❌ Failed Tests▲ Vercel Production (1 failed)fastify (1 failed):
Details by Category❌ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
e4860ff to
d9a284c
Compare
Resolve conflicts in @workflow/world-local where main's hook dedup-recovery rework landed on the same event-write paths this branch caches: - fs.ts writeExclusive: combine main's temp-file + hard-link atomic publish (with Windows retry) with this branch's withEnsuredDirectory ENOENT-retry wrapper, mirroring the already-merged write(). - events-storage.ts: keep main's per-instance stepLocks/hookLocks and the writeExclusive-based publish + canonical-eventId dedup recovery, and add the recent-event cache on top. Extract the cache bookkeeping into rememberStoredEvent() and call it after each successful publish (main event path + hook_conflict path) so caching layers onto main's stronger write semantics instead of replacing them. Build + all 427 @workflow/world-local tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
VaguelySerious
left a comment
There was a problem hiding this comment.
Defensive review — approve
I reviewed this with one question in mind: does it meaningfully widen the error surface or hurt debugability? My conclusion is no, and the design goes out of its way to avoid both. Approving.
Why the risk is contained
Blast radius is local-dev only. Every change is in @workflow/world-local, the filesystem backend used for local dev and tests. Production (world-vercel) is untouched, so even a latent bug here cannot reach deployed workflows.
Disk stays the single source of truth for existence. The event cache is only consulted for fileIds that listJSONFiles() actually returns from the directory listing (paginatedFileSystemQuery). So the cache can never resurrect a deleted event or invent one — it only avoids re-reading the content of files that are write-once and append-only. A lingering cache entry for a since-deleted file is simply never looked up.
Cached snapshots are genuinely detached, and proven so. rememberStoredEvent decodes from the exact serialized byte payload captured before the write await, and reads structuredClone the value out again. That closes both the caller-mutation window (test: "reuses locally appended events without exposing cached instances") and the shape-divergence risk — the reparse means a cached event is byte-for-byte what a disk read would have produced, including Date→ISO-string normalization in executionContext (test: "normalizes cached event metadata the same way as disk reads").
Memory is bounded several ways. 4 MiB / 1000 entries with FIFO eviction, oversized events bypass the cache, terminal runs release their history, and clear()/close() release everything. Per-instance, so instances can't pollute each other.
External directory removal is handled, not assumed away. withEnsuredDirectory retries exactly once on ENOENT after forgetting the cached dir — bounded (no loop), and a genuine ENOENT still propagates with its original error. All in-process directory-removal paths (clear()) are paired with clearCreatedFilesCache(); external removal is covered by the retry. Covered by three fs.test.ts regressions.
Debugability
Read-through/write-through to disk is preserved, so the JSON files on disk always reflect runtime state and remain fully inspectable. A cache parse failure silently degrades to a disk read (where the existing malformed-JSON warning surfaces) rather than masking anything.
Verification
pnpm test(world-local): 427 passedpnpm typecheck: clean- CI: Unit Tests pass on ubuntu + windows. The two red checks are
E2E Vercel Prod Tests (fastify)(aworld-verceldeploy-run failure —"status":"failed"on the polling lane) and the Required-Check gate that aggregates it. This PR cannot affect aworld-vercelprod lane; it matches the known prod-lane flakiness. Re-run rather than treat as a regression.
Minor, non-blocking notes
structuredCloneruns per cache hit (≤1000 per large replay). The posted measurements still show a net win, and total retained bytes are capped, so this is fine — just noting it's the cost traded for mutation safety.- One narrow debugability tradeoff: if you hand-edit an event JSON in place for an active (non-terminal) run while the dev server is live, the next
events.list()would serve the cached pre-edit content (deletion is safe — the dir listing drops it). Events are write-once by design so this is exotic, but it's the one case where on-disk state and served state can diverge during an active run. createdDirectoriesCacheis process-global and grows with distinct directories touched, but that set is tiny/fixed and strictly less of a concern than the pre-existingcreatedFilesCache(which grows per file).
Nicely scoped, well-tested, and the correctness footguns of caching are each addressed explicitly. 👍
TooTallNate
left a comment
There was a problem hiding this comment.
Approve — the relative-dataDir cache bug is fixed, and the regression test genuinely guards it
Re-review of d175b349f + f9c47dac9 (plus the fresh main merge). Every item from my prior REQUEST_CHANGES is addressed and verified.
The bug fix is correct and complete
The fix resolves directory once at the top of paginatedFileSystemQuery (const resolvedDirectory = path.resolve(directory)) and uses it for both the file listing and the cache-key path.join(resolvedDirectory, ...). I traced both sides of the cache to confirm the keys now agree:
- Write side populates via
cacheEvent(eventPath, ...)whereeventPath = taggedPath(basedir, 'events', ...)→resolveWithinBase→path.resolve(basedir, ...), which was already absolute. - Read side was the sole culprit: it used
path.join(directory, ...)(relative whendataDiris relative), so.get(filePath)never matched the absolute key the write side set. The fix makes the read key absolute too — so write and read now produce identical keys regardless of relative/absolutedataDir.
So the cache is no longer a silent no-op in the default .workflow-data config, and the perf claim now holds in production.
The regression test actually catches the bug — I verified by reverting
The new reuses sequential-step events with a relative data directory test builds the world with path.relative(process.cwd(), testDir), runs 5 sequential steps, and asserts zero event-file readFile calls. To confirm it's a real guard, I reverted just the path.resolve fix locally and reran it:
× reuses sequential-step events with a relative data directory
AssertionError: expected [...] to have a length of +0 but got 55
55 event-file reads without the fix (the cache-no-op), 0 with it. That's exactly the failure mode I originally reproduced, now pinned.
The optional test gaps and mock cleanup are all addressed
listByCorrelationIdcache (my #2):reuses locally appended events for correlation queries— spiesreadFile, asserts 0 event-file reads on the correlation path.- FIFO/byte eviction (my #3):
evicts old events once the recent-event byte bound is exceededwrites 1 MiB-payload events to overflow the 4 MiB cap, then asserts a subsequent read does hit disk (eventFileReads.length > 0) with correct data — exercising thewhileeviction loop without a slow 1000-event test. Good lightweight approach. - Mock restoration (my #4): the suite has a top-level
afterEach(() => vi.restoreAllMocks())(line 154), so everyvi.spyOn(fs, 'readFile')is restored between tests.
Verified locally
- Full
@workflow/world-localsuite: 427 passed (up from 375 at my first review — the new coverage). - Reverted-fix reproduction confirms the regression test fails without the change; restored → green.
- Clean merge with current
main(0 conflicts).
CI: the fastify Vercel-prod E2E failure is unrelated — this PR touches only world-local, while that lane runs against world-vercel, so it can't be affected (it's the recurring Vercel-prod flake). Worth a re-run.
Everything I flagged is resolved and the non-cache parts (createdDirectoriesCache + ENOENT retry, terminal-run cache release, structuredClone mutation isolation, clear()/close() integration) were already solid. Clean perf improvement now. LGTM.
|
Backport PR opened against |
Summary
mkdirsyscallsRoot cause
The existing
sequentialStepsWorkflow(count, 0)benchmark reproduces the zero-work sequential-step shape. On the PR merge base, its local-world storage work is dominated by three persisted lifecycle events per step and the incrementalevents.list()call used for replay. The listing path rereads append-only event files that the same storage instance just wrote, while the write path repeatedly callsmkdir(..., { recursive: true })for fixed directories.This workload does not exercise streams; the previously landed stream metadata optimization is separate from this path.
Correctness and memory safeguards
Reviewing the caching implementation exposed three correctness issues that are covered here:
EventSchema, so they are detached from caller mutations and have the same normalized shape as disk readsRetention is explicitly bounded:
4 MiBand1000entries across active runs; oversized events are read from disk instead of retainedworld.clear()orworld.close()is calledMeasurement
I modeled the event/replay lifecycle for a no-delay sequential workflow directly through
@workflow/world-localstorage with a relativedataDir, matching normal local-world configuration. Both revisions ran the same probe with one warmup and five measured trials per size; medians are reported. The control is the PR merge base (ae37315cb).For 200 steps, incremental
events.list()time fell from193.35 msto104.86 ms(45.8%lower).A 50-step filesystem-operation trace demonstrates the removed work:
readFilemkdirAn end-to-end workbench probe also showed that most remaining no-delay sequential-workflow latency occurs above this storage path: a 200-step run reported
22.8 sinside/.well-known/workflow/v1/flow.Merge with
mainThis branch has been merged up to the latest
main.mainreworked the local event-write path in the interim (atomic publish viawriteExclusivetemp-file + hard-link, per-instancestepLocks/hookLocks, andhook_createddedup-recovery). Conflicts were resolved so the recent-event cache layers on top of those write semantics rather than replacing them:writeExclusivekeeps main's atomic temp-file + hard-link publish (and Windows retry), now wrapped in the branch'swithEnsuredDirectoryENOENT-retry for externally-removed data dirsrememberStoredEvent()helper invoked after each successful publish (both the main event path and thehook_conflictpath), preserving the pre-awaitserialized-snapshot detachment guaranteeValidation
dataDircache misses; a five-step sequential lifecycle regression test now verifies repeated event lists make zero event-file readsENOENTon the next event write) and verified directory recoverylistByCorrelationId()and FIFO byte-limit eviction without a Windows-hostile high-file-count testpnpm --filter @workflow/world-local typecheckpnpm --filter @workflow/world-local test(427tests passed, post-merge)pnpm --filter '@workflow/world-local...' buildpnpm changeset status --since=origin-https/maingit diff --check