Add entity-table checkpoint save/load primitives #8

Merged: MaxGhenis merged 5 commits into main from entity-table-checkpoint on Apr 25, 2026
@MaxGhenis
Contributor

Summary

  • Country-agnostic pipeline checkpoint primitives that persist a named dict of entity DataFrames to parquet + metadata.json.
  • save_entity_table_checkpoint(tables, path, *, stage, extra_metadata) / load_entity_table_checkpoint(path, *, expected_stage=None).
  • extra_metadata round-trips unchanged under an "extra" key — callers use this for config/source fingerprints to decide if a checkpoint is stale.

Motivation

microplex-us has save_us_pipeline_checkpoint / load_us_pipeline_checkpoint that reimplement the same parquet-dir + metadata layout. Hoisting the generic primitive here avoids duplicating it across every country package (UK / US / others) and keeps the save/load format consistent. Enables post_imputation / post_microsim recalibration flows that skip the ~11 h synthesis step.

Test plan

  • 7 new tests in tests/core/test_checkpoints.py (round-trip, None tables, metadata schema, extra_metadata, missing-path, expected_stage mismatch, overwrite, empty-stage rejection).
  • All 20 tests/core/* pass locally.
  • CI run on this PR.
  • Follow-up PR in microplex-us swapping save_us_pipeline_checkpoint to delegate to save_entity_table_checkpoint.

MaxGhenis and others added 5 commits April 21, 2026 22:38
Country-agnostic pipeline checkpoint primitives that persist a named
dict of entity DataFrames to parquet + ``metadata.json`` and read them
back. Country-specific wrappers (microplex-us, microplex-uk, ...) can
use these to serialize their bundle dataclasses at a named pipeline
stage — ``post_imputation``, ``post_microsim``, etc. — so a downstream
rerun can resume from a saved state without repeating synthesis,
donor imputation, or tax-benefit microsim.

- ``save_entity_table_checkpoint(tables, path, *, stage, extra_metadata)``
- ``load_entity_table_checkpoint(path, *, expected_stage=None)``
- ``extra_metadata`` is attached under ``"extra"`` and round-trips
  unchanged — useful for config/source fingerprints that a caller
  wants to check for cache invalidation.
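The cache-invalidation use of ``extra_metadata`` described above could look roughly like the caller-side sketch below. The helper names ``config_fingerprint`` and ``checkpoint_is_stale`` and the ``"config_fingerprint"`` key are hypothetical and not part of this PR; the only detail taken from the description is that caller metadata sits under ``"extra"`` in the loaded metadata.

```python
import hashlib
import json


def config_fingerprint(config):
    """Stable SHA-256 over a JSON-serializable config (hypothetical helper)."""
    # sort_keys makes the hash independent of dict insertion order.
    payload = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def checkpoint_is_stale(metadata, config):
    """True when the stored fingerprint is missing or differs from the current one."""
    # "config_fingerprint" is an assumed key chosen by the caller at save time.
    stored = metadata.get("extra", {}).get("config_fingerprint")
    return stored != config_fingerprint(config)
```

A caller would pass ``{"config_fingerprint": config_fingerprint(config)}`` as ``extra_metadata`` when saving, then rerun the expensive stage only when ``checkpoint_is_stale`` returns ``True`` for the loaded metadata.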

Tests: 7 round-trip cases covering multi-table dicts, ``None`` entries,
metadata schema, ``expected_stage`` mismatch, overwrite behavior, and
invalid-stage rejection. All 20 ``tests/core/*`` pass.

Motivation: microplex-us currently holds an equivalent pattern
(``save_us_pipeline_checkpoint``, ``load_us_pipeline_checkpoint``) that
reimplements the same parquet-dir + metadata layout. Hoisting the
generic primitive here avoids duplicating it across every country
package and makes the save/load format consistent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 7947259 into main Apr 25, 2026
4 checks passed