Add crash-and-resume drama to example 08 by chris-colinsky · Pull Request #118 · LunarCommand/openarmature-python

chris-colinsky · 2026-06-02T01:23:08Z

Summary

Adds a crash-and-resume phase to example 08 that dramatizes the synchronous-checkpoint-by-contract reliability claim from the README. First invoke of the v1 graph hits a simulated transient failure inside size_crew (a module-level attempt counter raises RuntimeError on the first call only); main() catches NodeException at the invoke() boundary, prints what's durable on disk via checkpointer.load(), then re-invokes with resume_invocation=<id> to complete the pipeline. The existing v1->v2 migration phase rides on the crash-survived checkpoint, so both reliability stories compose in one demo.

Walk-through doc (docs/examples/08-checkpointing-and-migration.md) rewritten to cover both phases. Also fixes a pre-existing documentation bug: final_v1.trace and final_v2.trace are accumulated by the append reducer across resume (the engine starts a resume from the saved state, not a fresh one), so the documented samples and narrative now reflect what the example actually prints. The honest framing is that each node name appears exactly once across both invokes; the absence of duplicates is the engine-side skip-set's signature. Em dashes in pre-existing module comments and docstrings normalized to match example 12's convention.
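A minimal sketch of why the trace accumulates rather than resets, assuming an append-style reducer and a skip-set of completed nodes. The helpers `merge` and `invoke` and the `REDUCERS` mapping are hypothetical illustrations, not the engine's implementation.

```python
from operator import add

# Hypothetical reducer table: "trace" updates are appended (list + list).
REDUCERS = {"trace": add}

NODES = ["define_objective", "size_crew", "draft_timeline"]


def merge(state, update, reducers):
    """Apply a node's update, using a reducer when one is registered."""
    out = dict(state)
    for key, value in update.items():
        reducer = reducers.get(key)
        out[key] = reducer(out[key], value) if reducer and key in out else value
    return out


def invoke(saved_state, completed, fail_at=None):
    """Run nodes not in the skip-set; stop early at `fail_at` to mimic a crash."""
    state = dict(saved_state)
    for name in NODES:
        if name in completed:
            continue  # skip-set: already-finished nodes are not re-run
        if name == fail_at:
            return state, completed, False
        state = merge(state, {"trace": [name]}, REDUCERS)
        completed = completed | {name}
    return state, completed, True


# First invoke crashes at size_crew; only define_objective is recorded.
state1, done1, ok1 = invoke({"trace": []}, frozenset(), fail_at="size_crew")

# Resume starts from the SAVED state, so the reducer appends to the old trace.
state2, done2, ok2 = invoke(state1, done1)

# Counterfactual: a fresh state on resume would lose the pre-crash entry.
fresh, _, _ = invoke({"trace": []}, done1)
```

In this toy, `state2["trace"]` lists each node once in execution order, while `fresh["trace"]` is missing `define_objective`, which is the difference between resuming from saved state and resuming from a blank one.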

Test plan

- `LLM_API_KEY=sk-... uv run python examples/08-checkpointing-and-migration/main.py` produces output matching the walk-through doc's "Reading the output" block (phase 1 trace: `['define_objective', 'size_crew', 'draft_timeline']`; phase 2 trace: `['define_objective', 'size_crew', 'draft_timeline', 'assess_risks']`)
- `uv run pytest tests/test_examples_smoke.py` green
- `uv run mkdocs build --strict` clean

The first invoke of the v1 graph now hits a simulated transient failure in size_crew (a module-level attempt counter makes the first call raise RuntimeError; subsequent calls run normally). main() catches NodeException at the invoke() boundary, loads the saved record to inspect what's durable, then re-invokes with resume_invocation=<id> to complete the pipeline. The migration phase rides on the crash-survived checkpoint, so both reliability stories compose in one demo. Walk-through doc rewritten to cover both phases. Trace claims corrected (final_v1.trace and final_v2.trace are accumulated by the append reducer across the resume, not reset). Em dashes in existing module comments and docstrings normalized to match example 12's convention.

Copilot

Pull request overview

Adds a crash-and-resume phase to example 08 to demonstrate checkpoint durability across a mid-run node failure, then composes that with the existing v1→v2 state-migration-on-resume story. Documentation and changelog entries are updated to reflect the new two-phase walkthrough and expected output.

Changes:

- Simulate a transient node crash in example 08 and resume execution from the saved checkpoint.
- Update the example 08 walkthrough doc to cover crash/resume + migration phases and revised trace output narrative.
- Update catalogs/changelog text to reflect the expanded example scenario.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/openarmature/AGENTS.md | Updates example 08 description to mention crash-and-resume. |
| examples/08-checkpointing-and-migration/main.py | Adds simulated node crash, catches `NodeException`, inspects checkpoint record, resumes, then runs v2 migration phase. |
| docs/examples/08-checkpointing-and-migration.md | Rewrites walkthrough to match the new crash/resume + migration behavior and output. |
| CHANGELOG.md | Notes the example 08 crash-and-resume expansion in Unreleased changes. |

Phase 2 was passing the pre-crash record's invocation_id to resume_invocation=, which made the v2 graph re-run size_crew and draft_timeline rather than resume from the completed record written by the phase 1 resume. Each invoke() (including a resume) mints its own invocation_id: the pre-crash record stays under the original id, and the resumed attempt's checkpoints are saved under a new id. Phase 1's resume now re-queries CheckpointFilter(correlation_id=run_id) after the second invoke completes to capture the new id, and phase 2 passes that id as resume_invocation= instead of the original. The walk-through doc gains a new bullet explaining the two-id semantics; the resumed id is printed in phase 1's output block alongside the result fields. Surfaced by Copilot review on PR #118 (threads on lines 361 and 403).
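The two-id behavior described in this commit can be illustrated with a toy record store. Everything here is a hypothetical sketch: `STORE`, `toy_invoke`, and `latest_for` are made-up names, and the real checkpointer and `CheckpointFilter` query are not modeled. Only the idea that each invoke mints its own `invocation_id` while sharing a `correlation_id` is taken from the text above.

```python
import itertools

_ids = itertools.count(1)
STORE = []  # toy checkpoint records, in save order


def toy_invoke(correlation_id, nodes, resume_invocation=None, fail_at=None):
    """Run `nodes`, saving one record per completed node under a fresh id."""
    invocation_id = f"inv-{next(_ids)}"  # every invoke mints a new id
    completed = []
    if resume_invocation is not None:
        prior = [r for r in STORE if r["invocation_id"] == resume_invocation]
        completed = list(prior[-1]["completed"]) if prior else []
    ran = []
    for name in nodes:
        if name in completed:
            continue
        if name == fail_at:
            raise RuntimeError(f"simulated failure in {name}")
        completed.append(name)
        ran.append(name)
        STORE.append({
            "correlation_id": correlation_id,
            "invocation_id": invocation_id,
            "completed": list(completed),
        })
    return invocation_id, ran


def latest_for(correlation_id):
    """Most recent record for a run, regardless of which invoke wrote it."""
    return [r for r in STORE if r["correlation_id"] == correlation_id][-1]


V1 = ["define_objective", "size_crew", "draft_timeline"]
V2 = V1 + ["assess_risks"]

# Phase 1: crash, then resume from the pre-crash record.
try:
    toy_invoke("run-1", V1, fail_at="size_crew")
except RuntimeError:
    pass
pre_crash_id = latest_for("run-1")["invocation_id"]
toy_invoke("run-1", V1, resume_invocation=pre_crash_id)

# Re-query after the resume completes to capture the NEW id.
resumed_id = latest_for("run-1")["invocation_id"]

# Phase 2: the original id re-runs finished work; the resumed id does not.
_, ran_with_old_id = toy_invoke("run-1", V2, resume_invocation=pre_crash_id)
_, ran_with_new_id = toy_invoke("run-1", V2, resume_invocation=resumed_id)
```

In this sketch, resuming from `pre_crash_id` only knows about `define_objective`, so it repeats the later nodes, whereas resuming from `resumed_id` runs just the new `assess_risks` node.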

Copilot AI review requested due to automatic review settings June 2, 2026 01:23

Copilot started reviewing on behalf of chris-colinsky June 2, 2026 01:23

Copilot AI reviewed Jun 2, 2026

Comment thread examples/08-checkpointing-and-migration/main.py Outdated

Comment thread examples/08-checkpointing-and-migration/main.py Outdated

chris-colinsky merged commit 9835052 into main Jun 2, 2026
6 checks passed

chris-colinsky deleted the feature/example-08-crash-and-resume branch June 2, 2026 01:45

chris-colinsky merged 2 commits into main from feature/example-08-crash-and-resume
