
Add crash-and-resume drama to example 08 #118

Merged: chris-colinsky merged 2 commits into main from feature/example-08-crash-and-resume on Jun 2, 2026

chris-colinsky commented Jun 2, 2026

Summary

Adds a crash-and-resume phase to example 08 that dramatizes the synchronous-checkpoint-by-contract reliability claim from the README. The first invoke of the v1 graph hits a simulated transient failure inside size_crew (a module-level attempt counter raises RuntimeError on the first call only). main() catches NodeException at the invoke() boundary, prints what's durable on disk via checkpointer.load(), then re-invokes with resume_invocation=<id> to complete the pipeline. The existing v1->v2 migration phase rides on the crash-survived checkpoint, so both reliability stories compose in one demo.

Walk-through doc (docs/examples/08-checkpointing-and-migration.md) rewritten to cover both phases. Also fixes a pre-existing documentation bug: final_v1.trace and final_v2.trace are accumulated by the append reducer across resume (the engine starts a resume from the saved state, not a fresh one), so the documented samples and narrative now reflect what the example actually prints. The honest framing is that each node name appears exactly once across both invokes; the absence of duplicates is the engine-side skip-set's signature. Em dashes in pre-existing module comments and docstrings normalized to match example 12's convention.
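
Why the traces accumulate rather than reset can be shown with a small model. The sketch below assumes a per-field reducer table and a skip set of completed node names; `REDUCERS`, `merge`, and `run` are hypothetical names for illustration, not the engine's implementation.

```python
import operator

# Hypothetical reducer table: "trace" appends, every other field is last-write-wins.
REDUCERS = {"trace": operator.add}

V1 = ["define_objective", "size_crew", "draft_timeline"]
V2 = V1 + ["assess_risks"]


def merge(state: dict, update: dict) -> dict:
    """Fold a node's update into the state using each field's reducer."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        merged[key] = reducer(merged[key], value) if reducer and key in merged else value
    return merged


def run(nodes, checkpoint, fail_at=None):
    """Resume from the saved state; nodes in the skip set are not re-run."""
    for name in nodes:
        if name in checkpoint["completed"]:
            continue  # skip set: this node already ran in an earlier attempt
        if name == fail_at:
            raise RuntimeError(f"simulated failure in {name}")
        checkpoint["state"] = merge(checkpoint["state"], {"trace": [name]})
        checkpoint["completed"].add(name)
    return checkpoint["state"]


def demo():
    checkpoint = {"state": {"trace": []}, "completed": set()}
    try:
        run(V1, checkpoint, fail_at="size_crew")  # first attempt crashes
    except RuntimeError:
        pass
    trace_v1 = list(run(V1, checkpoint)["trace"])  # resume on the v1 graph
    trace_v2 = list(run(V2, checkpoint)["trace"])  # continue on the v2 graph
    return trace_v1, trace_v2
```

In this model each node name lands in the trace exactly once across all attempts, because the append reducer keeps earlier entries and the skip set prevents re-execution.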

Test plan

  • LLM_API_KEY=sk-... uv run python examples/08-checkpointing-and-migration/main.py produces output matching the walk-through doc's "Reading the output" block (phase 1 trace: ['define_objective', 'size_crew', 'draft_timeline']; phase 2 trace: ['define_objective', 'size_crew', 'draft_timeline', 'assess_risks'])
  • uv run pytest tests/test_examples_smoke.py green
  • uv run mkdocs build --strict clean

The first invoke of the v1 graph now hits a simulated transient
failure in size_crew (a module-level attempt counter makes the
first call raise RuntimeError; subsequent calls run normally).
main() catches NodeException at the invoke() boundary, loads the
saved record to inspect what's durable, then re-invokes with
resume_invocation=<id> to complete the pipeline. The migration
phase rides on the crash-survived checkpoint, so both reliability
stories compose in one demo.

Walk-through doc rewritten to cover both phases. Trace claims
corrected (final_v1.trace and final_v2.trace are accumulated by
the append reducer across the resume, not reset). Em dashes in
existing module comments and docstrings normalized to match
example 12's convention.
Copilot AI review requested due to automatic review settings June 2, 2026 01:23

Copilot AI left a comment

Pull request overview

Adds a crash-and-resume phase to example 08 to demonstrate checkpoint durability across a mid-run node failure, then composes that with the existing v1→v2 state-migration-on-resume story. Documentation and changelog entries are updated to reflect the new two-phase walkthrough and expected output.

Changes:

  • Simulate a transient node crash in example 08 and resume execution from the saved checkpoint.
  • Update the example 08 walkthrough doc to cover crash/resume + migration phases and revised trace output narrative.
  • Update catalogs/changelog text to reflect the expanded example scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/openarmature/AGENTS.md | Updates example 08 description to mention crash-and-resume. |
| examples/08-checkpointing-and-migration/main.py | Adds simulated node crash, catches NodeException, inspects checkpoint record, resumes, then runs v2 migration phase. |
| docs/examples/08-checkpointing-and-migration.md | Rewrites walkthrough to match the new crash/resume + migration behavior and output. |
| CHANGELOG.md | Notes the example 08 crash-and-resume expansion in Unreleased changes. |

Comment thread examples/08-checkpointing-and-migration/main.py Outdated
Comment thread examples/08-checkpointing-and-migration/main.py Outdated
Phase 2 was passing the pre-crash record's invocation_id to
resume_invocation=, which made the v2 graph re-run size_crew +
draft_timeline rather than resuming from the post-resume
completed record. Each invoke() (including a resume) mints its
own invocation_id; the pre-crash record stays under the original
id, the resumed attempt's checkpoints save under a new id.

Phase 1's resume now re-queries CheckpointFilter(correlation_id=
run_id) after the second invoke completes to capture the new id.
Phase 2 uses that id as resume_invocation= instead of the
original. The walk-through doc grows a new bullet explaining the
two-id semantics; the resumed id is printed in phase 1's output
block alongside the result fields.

Surfaced by Copilot review on PR #118 (line 361 + 403 threads).
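
The two-id semantics described in this commit can be modeled in a few lines. This is a sketch under stated assumptions: `ToyStore`, `Record`, `by_correlation`, and this `invoke` function are hypothetical and only mimic the behavior described above (each call mints its own invocation_id, and records are found again by correlation id); they are not the library's checkpointer or `CheckpointFilter`.

```python
import uuid
from dataclasses import dataclass, field

V1 = ["define_objective", "size_crew", "draft_timeline"]
V2 = V1 + ["assess_risks"]


@dataclass
class Record:
    invocation_id: str
    correlation_id: str
    completed: list = field(default_factory=list)


class ToyStore:
    """In-memory stand-in for a checkpoint store; save order is preserved."""

    def __init__(self):
        self.records = []

    def save(self, record):
        self.records.append(record)

    def load(self, invocation_id):
        return next(r for r in self.records if r.invocation_id == invocation_id)

    def by_correlation(self, correlation_id):
        return [r for r in self.records if r.correlation_id == correlation_id]


def invoke(store, nodes, correlation_id, resume_invocation=None, fail_at=None):
    """Every call mints a fresh invocation_id, including resumes."""
    done = list(store.load(resume_invocation).completed) if resume_invocation else []
    record = Record(uuid.uuid4().hex, correlation_id, done)
    store.save(record)
    ran = []
    for name in nodes:
        if name in record.completed:
            continue
        if name == fail_at:
            raise RuntimeError(f"simulated failure in {name}")
        record.completed.append(name)
        ran.append(name)
    return record.invocation_id, ran


def demo():
    store, run_id = ToyStore(), "run-1"
    try:
        invoke(store, V1, run_id, fail_at="size_crew")
    except RuntimeError:
        pass
    crashed_id = store.by_correlation(run_id)[-1].invocation_id
    invoke(store, V1, run_id, resume_invocation=crashed_id)
    # Re-query after the resume completes to capture the new id.
    resumed_id = store.by_correlation(run_id)[-1].invocation_id
    return store, crashed_id, resumed_id
```

In this model, resuming the v2 graph from `crashed_id` re-runs `size_crew` and `draft_timeline`, while resuming from `resumed_id` runs only `assess_risks`, which is the difference the fix above addresses.
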
@chris-colinsky chris-colinsky merged commit 9835052 into main Jun 2, 2026
6 checks passed
@chris-colinsky chris-colinsky deleted the feature/example-08-crash-and-resume branch June 2, 2026 01:45