Add crash-and-resume drama to example 08 #118
Merged
The first invoke of the v1 graph now hits a simulated transient failure in size_crew (a module-level attempt counter makes the first call raise RuntimeError; subsequent calls run normally). main() catches NodeException at the invoke() boundary, loads the saved record to inspect what's durable, then re-invokes with resume_invocation=<id> to complete the pipeline. The migration phase rides on the crash-survived checkpoint, so both reliability stories compose in one demo. Walk-through doc rewritten to cover both phases. Trace claims corrected (final_v1.trace and final_v2.trace are accumulated by the append reducer across the resume, not reset). Em dashes in existing module comments and docstrings normalized to match example 12's convention.
Pull request overview
Adds a crash-and-resume phase to example 08 to demonstrate checkpoint durability across a mid-run node failure, then composes that with the existing v1→v2 state-migration-on-resume story. Documentation and changelog entries are updated to reflect the new two-phase walkthrough and expected output.
Changes:
- Simulate a transient node crash in example 08 and resume execution from the saved checkpoint.
- Update the example 08 walkthrough doc to cover crash/resume + migration phases and revised trace output narrative.
- Update catalogs/changelog text to reflect the expanded example scenario.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/openarmature/AGENTS.md | Updates example 08 description to mention crash-and-resume. |
| examples/08-checkpointing-and-migration/main.py | Adds simulated node crash, catches NodeException, inspects checkpoint record, resumes, then runs v2 migration phase. |
| docs/examples/08-checkpointing-and-migration.md | Rewrites walkthrough to match the new crash/resume + migration behavior and output. |
| CHANGELOG.md | Notes the example 08 crash-and-resume expansion in Unreleased changes. |
Phase 2 was passing the pre-crash record's invocation_id to resume_invocation=, which made the v2 graph re-run size_crew + draft_timeline rather than resume from the completed record written by the phase 1 resume. Each invoke() (including a resume) mints its own invocation_id: the pre-crash record stays under the original id, and the resumed attempt's checkpoints save under a new id. Phase 1's resume now re-queries CheckpointFilter(correlation_id=run_id) after the second invoke completes to capture the new id. Phase 2 passes that id as resume_invocation= instead of the original. The walk-through doc grows a new bullet explaining the two-id semantics; the resumed id is printed in phase 1's output block alongside the result fields. Surfaced by Copilot review on PR #118 (line 361 + 403 threads).
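The two-id behaviour can be illustrated with a toy in-memory store. This is only a sketch under assumptions: `RECORDS`, `save`, and `latest_for` are hypothetical helpers, not the library's checkpointer or `CheckpointFilter`, and the `inv-N` id format is invented for readability.

```python
import itertools

_ids = itertools.count(1)
RECORDS = {}  # invocation_id -> record; hypothetical in-memory store


def save(correlation_id, completed):
    """Store a record under a freshly minted invocation id and return the id."""
    invocation_id = f"inv-{next(_ids)}"
    RECORDS[invocation_id] = {
        "correlation_id": correlation_id,
        "completed": list(completed),
    }
    return invocation_id


def latest_for(correlation_id):
    """Return the most recently minted id for a run (dicts keep insertion order)."""
    matches = [i for i, r in RECORDS.items() if r["correlation_id"] == correlation_id]
    return matches[-1]


run_id = "run-A"
# The crashed attempt saved one completed node under its own id.
crashed_id = save(run_id, ["define_objective"])
# The resumed attempt finishes the pipeline and saves under a new id.
resumed_id = save(
    run_id, RECORDS[crashed_id]["completed"] + ["size_crew", "draft_timeline"]
)

# Resuming from the stale id would redo work the resumed attempt already did.
remaining_if_stale = 3 - len(RECORDS[crashed_id]["completed"])
remaining_if_fresh = 3 - len(RECORDS[latest_for(run_id)]["completed"])
```

Looking up by the shared run id rather than reusing the first invocation id is what lets a later phase start from the completed record.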
Summary
Adds a crash-and-resume phase to example 08 that dramatizes the synchronous-checkpoint-by-contract reliability claim from the README. First invoke of the v1 graph hits a simulated transient failure inside `size_crew` (a module-level attempt counter raises `RuntimeError` on the first call only); `main()` catches `NodeException` at the `invoke()` boundary, prints what's durable on disk via `checkpointer.load()`, then re-invokes with `resume_invocation=<id>` to complete the pipeline. The existing v1->v2 migration phase rides on the crash-survived checkpoint, so both reliability stories compose in one demo.

Walk-through doc (`docs/examples/08-checkpointing-and-migration.md`) rewritten to cover both phases. Also fixes a pre-existing documentation bug: `final_v1.trace` and `final_v2.trace` are accumulated by the `append` reducer across resume (the engine starts a resume from the saved state, not a fresh one), so the documented samples and narrative now reflect what the example actually prints. The honest framing is that each node name appears exactly once across both invokes; the absence of duplicates is the engine-side skip-set's signature. Em dashes in pre-existing module comments and docstrings normalized to match example 12's convention.

Test plan
- `LLM_API_KEY=sk-... uv run python examples/08-checkpointing-and-migration/main.py` produces output matching the walk-through doc's "Reading the output" block (phase 1 trace: `['define_objective', 'size_crew', 'draft_timeline']`; phase 2 trace: `['define_objective', 'size_crew', 'draft_timeline', 'assess_risks']`)
- `uv run pytest tests/test_examples_smoke.py` green
- `uv run mkdocs build --strict` clean
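The trace behaviour the summary describes (an append-style reducer accumulating across a resume, with a skip set preventing duplicates) can be sketched as follows. This is a toy model, not the engine: `append`, `invoke`, and the `stop_after` parameter are hypothetical, and stopping after `draft_timeline` merely stands in for the v1 graph lacking the `assess_risks` node.

```python
def append(existing, update):
    """Append-style reducer: concatenate instead of overwrite."""
    return existing + update


ORDER = ["define_objective", "size_crew", "draft_timeline", "assess_risks"]


def invoke(saved_trace, completed, stop_after=None):
    """Run pending nodes, merging each node's trace entry through `append`."""
    trace = list(saved_trace)  # a resume starts from the saved state
    done = set(completed)      # stand-in for the engine-side skip set
    for name in ORDER:
        if name in done:
            continue  # already completed in an earlier invoke
        trace = append(trace, [name])
        done.add(name)
        if name == stop_after:
            break
    return trace, done


# Phase 1 stand-in: run up to draft_timeline.
first_trace, first_done = invoke([], set(), stop_after="draft_timeline")
# Phase 2 stand-in: resume from the saved trace and completed set.
second_trace, _ = invoke(first_trace, first_done)
```

Because the second call starts from `first_trace` and skips completed nodes, the final trace holds each node name exactly once rather than restarting or repeating entries.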