Fail steps when step modules fail to load#2709
Conversation
🦋 Changeset detectedLatest commit: 9ae084b The changes in this PR will be included in the next version bump. This PR includes changesets to release 16 packages
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
🧪 E2E Test Results✅ All tests passed Summary
Details by Category✅ ▲ Vercel Production
✅ 💻 Local Development
✅ 📦 Local Production
✅ 🐘 Local Postgres
✅ 🪟 Windows
✅ 📋 Other
❌ Some E2E test jobs failed:
Check the workflow run for details. |
📊 Benchmark Results
workflow with no steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 1 step💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 25 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 sequential steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.all with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 10 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Promise.race with 25 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) Promise.race with 50 concurrent steps💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 10 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 25 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 50 sequential data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) workflow with 10 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 25 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) workflow with 50 concurrent data payload steps (10KB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Express | Nitro | Next.js (Turbopack) Stream Benchmarks (includes TTFB metrics)workflow with stream💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) stream pipeline with 5 transform steps (1MB)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) 10 parallel streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) fan-out fan-in 10 streams (1MB each)💻 Local Development
▲ Production (Vercel)
🔍 Observability: Nitro | Express | Next.js (Turbopack) SummaryFastest Framework by WorldWinner determined by most benchmark wins
Fastest World by FrameworkWinner determined by most benchmark wins
Column Definitions
Worlds:
|
VaguelySerious
left a comment
There was a problem hiding this comment.
AI review: no blocking issues
| inputFiles, | ||
| stepsOutfile, | ||
| flowOutfile, | ||
| bundleFinalOutput: false, |
There was a problem hiding this comment.
AI Review: Note
The production build path (bundleFinalOutput: true) is never exercised. Both real builders (standalone.ts and vercel-build-output-api.ts) bundle the generated route through esbuild with bundle: true, but every test here and in step-source-registration.test.ts uses bundleFinalOutput: false (raw Node ESM import).
The entire fix depends on esbuild preserving lazy evaluation of the inlined dynamic import('./__step_registrations.js') — if that import were ever evaluated eagerly (e.g. splitting gets enabled, or the bundler changes), the top-level throw would resurface at route-module init and reintroduce the 500, and no test in this PR would catch it.
I reproduced the esbuild path with the exact build settings used here and confirmed it works today: importing the bundled route does not throw, and the failure is correctly deferred to the loader call. But since that lazy-eval guarantee is the crux of the fix, consider adding a case that drives a bundleFinalOutput: true build and asserts the route imports cleanly while the step still fails at execution.
| const normalizedError = await normalizeUnknownError(err); | ||
| const normalizedStack = normalizedError.stack || getErrorStack(err) || ''; | ||
| const wrappedError = new FatalError( | ||
| `Failed to load step "${stepName}": ${normalizedError.message}` |
There was a problem hiding this comment.
AI Review: Note
Wrapping the load error in FatalError makes step-module load failures permanently non-retryable: this step_failed is written before stepFn.maxRetries is ever read, so it bypasses the step's normal retry path entirely.
For a deterministic failure (missing native dep like sharp/libvips) that's exactly right. But a transient import failure (isolate OOM, transient FS error) that the previous route-500 → queue-redelivery path might have recovered from on a fresh isolate is now terminal for the step. Worth confirming this fatal-always semantics is intended rather than a side effect of reusing FatalError — if transient recovery matters, a retryable error here would let the queue redeliver.
| ): Promise<StepFunction | undefined> { | ||
| const loader = getStepIdMatch(registeredStepLoaders, stepId); | ||
| if (loader) { | ||
| await loader(); |
There was a problem hiding this comment.
AI Review: Nit
loadStepFunction invokes the loader on every step execution with no already-registered short-circuit (deliberate per the "Refresh lazy step loaders before lookup" commit). With the generated () => import(...) loader this is a cached, resolved-promise await, so it's negligible in practice — just noting it adds a microtask to the hot step path where getStepFunction used to be synchronous.
Summary
registerStepFunctionLoaderfromworkflow/runtime.step_failedevents and requeue the workflow so user code can observe the failure.run_failedevents.Root Cause
The combined step/workflow route eagerly imported
__step_registrationswhen the route module loaded. If a step module imported a native dependency that could not load in production, such assharpmissinglibvips, the route threw a 500 before the Workflow runtime had claimed the step or written a terminal event. That left the run stuck instead of failing the step.Workflow Bundle Case
The combined route embeds workflow code as VM input instead of host-importing workflow modules during route initialization. The analogous failure happens while evaluating the workflow bundle in the workflow VM. This PR adds regression coverage that a sharp-like top-level workflow bundle failure is caught by
workflowEntrypointand persisted asrun_failedrather than escaping as a route 500.Local Repro
Added a regression test that creates the old eager route shape and a step registration module throwing
Could not load the "sharp" module using the linux-x64 runtime. The legacy route import rejects immediately, matching the reported 500 path. The fixed generated route imports successfully, captures a step loader, and only rejects when that loader is invoked so the runtime can persiststep_failed.E2E Coverage
packages/core/e2e/module-load-failures.test.tsbuilds temporary combined routes from source, imports the generated route module, registers it with a local-world direct queue handler, then starts real runs throughstart(). It asserts:step_startedfollowed bystep_failed, and the workflow can catch the resultingFatalErrorrun_failedwithUSER_ERRORValidation
pnpm format -- packages/core/e2e/module-load-failures.test.ts packages/core/src/runtime.test.tspnpm lint -- packages/core/e2e/module-load-failures.test.tspnpm lint -- packages/core/src/runtime.test.tsexited 0 with existing warning-level findings in unrelated tests in that filepnpm lint -- .changeset/lazy-step-loaders.md packages/workflow/src/runtime.ts packages/workflow/src/runtime.test.ts packages/builders/src/step-source-registration.test.tspnpm vitest run packages/core/e2e/module-load-failures.test.tspnpm vitest run packages/core/e2e/module-load-failures.test.ts packages/core/src/runtime.test.ts packages/workflow/src/runtime.test.ts packages/builders/src/step-source-registration.test.ts packages/core/src/private.test.ts packages/core/src/runtime/step-executor.test.ts packages/core/src/runtime/step-handler.test.tspnpm --filter @workflow/core typecheckcurrently fails in this checkout withTS2688: Cannot find type definition file for "*"from the sharedpackages/tsconfig/base.jsonpnpm --filter @workflow/builders buildfails with the sameTS2688issue