
Start spec-based-ecps-rewire: v6 post-mortem + calibrator decision #3

Merged
MaxGhenis merged 62 commits into main from spec-based-ecps-rewire
Apr 25, 2026

Conversation

@MaxGhenis
Contributor

Summary

Opens the spec-based-ecps-rewire workstream with two anchor docs before any code lands:

  • docs/v6-postmortem.md — localizes today's OOM kill to calibrate_policyengine_tables(backend="entropy") at 1.5M households × ~1.2k constraints. The post-donor stage instrumentation (commit 960ac2f) did its job: this is the first run to identify the specific stage that also killed v4.
  • docs/calibrator-decision.md — picks microcalibrate as the production mainline, microplex.reweighting.Reweighter as the optional sparse deployment post-step, and retires Calibrator(backend="entropy") at scales > ~200k records. Revises migration step 2 of the core-wiring audit.

Depends on #2 (core-wiring-audit) for the migration-order context but does not require it to merge first.

Test plan

  • Review the v6 evidence against the memory signature claim (macOS time -l rusage vs v4).
  • Review the calibrator decision against any production-blocker on microcalibrate (availability, licensing, missing features needed by the SS-model longitudinal extension).
  • Confirm the hierarchical-calibration-is-deferred split: this PR picks the backend; hierarchical structure is decided separately at the G2 local-area gate.

Not in this PR

  • No code yet. First code commit will be the calibration backend swap (migration step 2 revised) behind a small-scale smoke harness.
  • No changes to the current pe_us_data_rebuild_checkpoint path. That stays live for historical comparison runs until the rewired pipeline clears G1.

MaxGhenis and others added 30 commits April 16, 2026 23:10
Two docs that anchor the rewire direction with specific evidence from
today's run:

docs/v6-postmortem.md
  - Timeline of v6 from launch to OOM kill
  - Stage-marker localization of the killer:
    calibrate_policyengine_tables with backend=entropy on
    1.5M households × ~1.2k constraints on a 48 GB workstation
  - rusage comparison to v4 (nearly identical signature: 22 GB max RSS,
    293 GB peak phys_footprint)
  - What v6 ruled IN as working at scale (donor integration, tables build)
  - What v6 ruled OUT as the killer (synthesis, support enforcement,
    tables build)
  - How this becomes evidence for the rewire rather than against it

docs/calibrator-decision.md
  - Mainline: microcalibrate (gradient-descent chi-squared, identity
    preserving, production-proven by PE-US-data, aligns with SS-model
    longitudinal plan)
  - Optional sparse deployment step after mainline: microplex.reweighting
    (L0 / HardConcrete, for web-app-sized subsamples only)
  - Retire Calibrator(backend=entropy) at scales above ~200k records
  - Revises migration step 2 of core-wiring-audit accordingly

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… plan

calibrator-decision.md:
  - Cites microplex/benchmarks/results/sparse_coverage.csv as empirical
    support: sparse L0 drives rare-subpopulation ratios to 0.0 at 10%,
    2%, and 1% sparsity (elderly_selfemp, young_dividend both zero),
    while generative synthesis preserves them at 7-30x oracle ratio.
  - Adds an explicit scale caveat: sparse_coverage evidence is from
    10k-row synthetic data; the structural pattern (L0 zeros records
    exactly) survives scale-up on mathematical grounds even if
    absolute numbers shift.

synthesizer-benchmark-scale-up.md (new):
  - Records what the existing benchmark_multi_seed.json measures:
    10k rows x 7 columns of SYNTHETIC data. The cps/sipp/psid labels
    are partial-observation schemas over one synthetic population, not
    real sources.
  - Production gap: 3,000-7,000x on (rows x columns) plus the
    synthetic-to-real jump.
  - Predicted failure modes per method at scale (QRF compute-bound
    above 1M rows, MAF tail-coverage risk on top income, QDNN needs
    joint zero-mask head at 150 zero-capable vars, PRDC metric
    degenerates in 150D without embedding).
  - Three-stage scale-up protocol (100k x 50, 1M x 50, 3.4M x 155)
    with matched holdouts, rare-cell preservation tracking, and
    wall-time / RSS measurements per method.
  - Ballpark runtime expectations per method per stage on a 48 GB M3.
  - Diagnoses PSID coverage = 0 as unresolved and must-fix before
    any SS-model longitudinal work commits to PSID as the backbone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
First real code on spec-based-ecps-rewire. Wraps microcalibrate (gradient-
descent chi-squared) behind the same fit_transform / validate interface as
the legacy microplex.calibration.Calibrator — drop-in replacement for the
entropy calibration step that killed v6.

Interface contract (tested):
  - Same fit_transform signature: data, marginal_targets, weight_col,
    linear_constraints
  - Same validate() output keys: converged, max_error, sparsity,
    linear_errors
  - Identity preservation: every input record survives with a
    non-negative weight (v4/v6 entropy path does not guarantee this)
  - Empty constraints returns copy of input unchanged
  - Constraint shape and weight-column existence validated up front
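The interface contract above can be illustrated with a minimal stand-in. This is a sketch only: it uses simple iterative proportional scaling instead of microcalibrate's gradient-descent chi-squared, the class name is hypothetical, and the (name, column, target-sum) constraint tuples are illustrative rather than the real linear-constraint type.

```python
import pandas as pd

class MiniCalibratorSketch:
    """Illustrative stand-in for the adapter contract -- NOT the real
    MicrocalibrateAdapter. Same fit_transform / validate shape, but uses
    iterative proportional scaling instead of gradient descent."""

    def __init__(self, max_iter=50, tol=1e-6):
        self.max_iter = max_iter
        self.tol = tol
        self._report = None

    def fit_transform(self, data, marginal_targets=None, weight_col="weight",
                      linear_constraints=None):
        # marginal_targets is unused in this sketch
        if weight_col not in data.columns:
            raise ValueError(f"missing weight column {weight_col!r}")
        out = data.copy()
        constraints = linear_constraints or []
        if not constraints:
            # contract: empty constraints returns a copy of the input unchanged
            self._report = {"converged": True, "max_error": 0.0,
                            "sparsity": 0.0, "linear_errors": {}}
            return out
        errors = {}
        for _ in range(self.max_iter):
            for name, col, target in constraints:  # (name, column, target sum)
                total = (out[col] * out[weight_col]).sum()
                if total > 0:
                    out.loc[out[col] != 0, weight_col] *= target / total
                achieved = (out[col] * out[weight_col]).sum()
                errors[name] = abs(achieved - target) / abs(target)
            if max(errors.values()) < self.tol:
                break
        # identity preservation: every record survives, weights non-negative
        out[weight_col] = out[weight_col].clip(lower=0)
        self._report = {"converged": max(errors.values()) < self.tol,
                        "max_error": max(errors.values()),
                        "sparsity": float((out[weight_col] == 0).mean()),
                        "linear_errors": errors}
        return out

    def validate(self):
        return self._report
```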

Smoke tests (tests/calibration/test_microcalibrate_adapter.py, 8 tests,
5.2 s):
  - Interface contract coverage
  - Single age-band count constraint converges within 5 % relative error
    on 200 records
  - Two orthogonal constraints (count + income-sum) both reach within
    10 % relative error on 300 records
  - Validation output shape matches legacy contract

Packaging:
  - microcalibrate >= 0.21 added to required dependencies
  - requires-python bumped to >= 3.13 to match microcalibrate's lower
    bound

Not in this commit (deliberate):
  - No changes to pe_us_data_rebuild / us.py pipeline yet — adapter is
    standalone so it can be wired incrementally
  - No scale-up validation — that goes through the protocol in
    docs/synthesizer-benchmark-scale-up.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause: the multi-source fusion benchmark harness in microplex
(scripts/run_benchmark.py + src/microplex/eval/benchmark.py) collapses
the shared-column pool across sipp/cps/psid to exactly 2 variables
(is_male, age) because of a <5% NaN filter applied per-source before
intersection. PSID has the highest ratio of non-shared columns (13
of 15) and the smallest row count (9,207), so its per-column models
are the most under-conditioned. PRDC k-NN coverage collapses to 0
because synthetic records cluster around model means and miss the
real holdout neighborhoods.

Key facts:
  - shared_cols intersection for the benchmark is literally
    ['is_male', 'age']
  - SIPP (9 cols, 7 non-shared, 476k rows): coverage 0.29-0.95
  - CPS (10 cols, 8 non-shared, 144k rows): coverage 0.34-0.50
  - PSID (15 cols, 13 non-shared, 9k rows): coverage 0.00 uniformly
  - Pattern tracks non-shared-ratio and row count, not method choice

Implications:
  - G1 cross-section synthesizer choice: unaffected, continue with
    ZI-MAF for CPS-style, ZI-QRF for panel
  - SS-model longitudinal work: PSID is NOT ruled out as trajectory
    training backbone; the benchmark verdict is not the relevant
    evaluation. A PSID-only benchmark is needed before committing.
  - Paper claims depending on PSID=0 need qualification: valid claim
    is "cross-source fusion with 2 shared vars fails on PSID" not
    "all methods fail on PSID"

Reproduction script included in the doc (runs in seconds).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Implements the stage-1/2/3 protocol from docs/synthesizer-benchmark-scale-up.md
as a real runnable harness.

Components:
  - src/microplex_us/bakeoff/scale_up.py
      * ScaleUpStageConfig: frozen dataclass with curated 50-column default
        (14 demographics + 36 income/wealth/benefit targets)
      * ScaleUpRunner: load_frame, split, fit_and_generate, run
      * _load_enhanced_cps: entity-aware loader that broadcasts
        household / SPM-unit / tax-unit / family / marital-unit variables
        down to person level via person_<entity>_id -> <entity>_id lookups
      * Per-method metrics: PRDC precision/density/coverage (via prdc
        library), wall time, peak RSS, rare-cell preservation ratios
        (elderly self-employed, young dividend, disabled SSDI,
        top-1 % employment), zero-rate MAE
      * CLI: python -m microplex_us.bakeoff.scale_up --stage stage1 ...
      * Stage configs: stage1 (~77k from ECPS), stage2 (1M, needs larger
        source), stage3 (v6 seed-ready 3.4M x 155)

  - tests/bakeoff/test_scale_up.py
      * Smoke tests on a 500-row, 5-column, ZI-QRF-only slice
      * Entity-broadcast verification via real ECPS loading
      * Column-missing error path
      * Default column-set sanity check
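The entity-aware broadcast can be sketched with a pandas merge. Column names and values here are illustrative, not the real ECPS schema; the real loader does this per entity via the person_<entity>_id -> <entity>_id lookup.

```python
import pandas as pd

# Hypothetical frames mimicking the person / household layout described above.
person = pd.DataFrame({
    "person_id": [1, 2, 3, 4],
    "person_household_id": [10, 10, 20, 20],
    "age": [34, 6, 71, 68],
})
household = pd.DataFrame({
    "household_id": [10, 20],
    "state_fips": [6, 36],
    "snap_reported": [0.0, 1200.0],
})

# Broadcast household-level variables down to person level via the
# person_<entity>_id -> <entity>_id lookup, yielding a flat person frame.
flat = person.merge(
    household,
    left_on="person_household_id",
    right_on="household_id",
    how="left",
).drop(columns=["household_id"])
```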

Notes and limitations recorded for follow-up:
  - state_fips / snap_reported / net_worth / housing_assistance and other
    non-person entity variables are now correctly broadcast to person
    level via ID lookup. This was the blocker for a flat DataFrame.
  - enhanced_cps_2024 has 77k persons, not the 100k stage-1 target.
    n_rows=None now uses all available.
  - is_household_head is not in ECPS; replaced with is_separated.

Not in this commit (deliberate):
  - No execution of stage1 / stage2 / stage3 runs yet
  - No CTGAN / TVAE support (present in registry, not in default method set)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier heuristic flipped the unit on Darwin and reported 892 GB for an
actual 0.87 GB process. Cross-checked ru_maxrss against
psutil.Process().memory_info().rss on Python 3.14 / macOS: 190_873_600
raw = 0.18 GB matches psutil exactly. Platform-conditional: darwin uses
bytes, Linux and other BSDs use kilobytes.

Smoke tests unaffected (they only asserted peak_rss > 0).
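A sketch of the platform-conditional conversion (the helper name is hypothetical; `resource` is Unix-only). With the commit's cross-check figure, 190_873_600 raw on darwin converts to 0.18 GB.

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak RSS of the current process in GB.

    ru_maxrss units are platform-dependent: bytes on macOS (darwin),
    kilobytes on Linux and the other BSDs -- the unit flip described above.
    """
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return raw / 1024**3        # bytes -> GB
    return raw * 1024 / 1024**3     # kilobytes -> GB
```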

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous harness wrote all results atomically at the end of the run. If
ZI-QDNN crashed after ZI-QRF and ZI-MAF had completed, their numbers
were lost.

Now ScaleUpRunner.run() takes an optional incremental_path and appends
each ScaleUpResult as a JSONL line immediately after it completes. The
final atomic JSON is still written at the end as before; the JSONL is
supplementary and survives mid-run kills.

CLI adds --incremental-jsonl; defaults to <output>.partial.jsonl so the
feature is on by default.
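The append-per-result pattern is roughly the following (a sketch: the real ScaleUpRunner serializes ScaleUpResult dataclasses, not plain dicts, and the function name is hypothetical).

```python
import json
from pathlib import Path

def append_result(incremental_path: Path, result: dict) -> None:
    """Flush one per-method result to disk as a JSONL line immediately,
    so a mid-run kill cannot lose earlier methods' numbers."""
    with incremental_path.open("a") as fh:
        fh.write(json.dumps(result) + "\n")
        fh.flush()  # line is on disk before the next method starts
```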

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- __main__.py so `python -m microplex_us.bakeoff` works without the
  runpy.RuntimeWarning about package double-import. The existing
  `python -m microplex_us.bakeoff.scale_up` still works for callers
  who want to pin to the submodule path.
- test_incremental_jsonl_persists_each_method: verifies that each
  method's result is flushed to JSONL before the next method starts,
  so an interrupted run preserves earlier methods' numbers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran ZI-QRF, ZI-MAF, ZI-QDNN on 40,000 rows x 50 columns of real
enhanced_cps_2024 and compared against the existing 10k x 7 synthetic
benchmark_multi_seed result.

  Small (10k x 7 synthetic CPS)   Stage 1 (40k x 50 real ECPS)
  ZI-MAF   0.499 (winner)          ZI-MAF   0.054 (near-collapsed)
  ZI-QDNN  0.406                   ZI-QDNN  0.306 (mid-pack)
  ZI-QRF   0.347                   ZI-QRF   0.465 (winner)

Rare-cell preservation:
  ZI-QRF:  modest over-sampling (2-4x), disabled_ssdi -> 0.0
  ZI-MAF:  elderly_self_employed -> 103x (zero-inflation classifier
           miscalibrated on real data), disabled_ssdi -> 0.0
  ZI-QDNN: elderly_self_employed -> 116x, disabled_ssdi -> 0.0

RSS cost:
  ZI-QRF   3.5 GB   (production-workable on a 48 GB machine)
  ZI-MAF  23.5 GB   (marginal)
  ZI-QDNN 32.5 GB   (marginal; 1.6 TB naive extrapolation at 3.4M rows)

Harness fix: cast loaded DataFrame to float32. Column dtype mix (bool /
int32 / float32) previously caused torch-based methods to fail with
"can't convert np.ndarray of type numpy.object_".

Implications:
- Revises the G1 cross-section synthesizer default: ZI-QRF, not ZI-MAF
  (the small-benchmark winner).
- SS-model methodology doc's "production direction: ZI-QDNN" claim does
  not survive this stage. Needs revision.
- ZI-MAF + ZI-QDNN might recover with hyperparameter tuning, but at the
  default settings in the benchmark classes they are not competitive.

Not resolved:
- 61k rows OOM-kills ZI-QRF (SIGKILL, no output); scaling is clean up to
  40k. The cause is likely loky worker accumulation across 36 target columns.
- PRDC in 50D may be degenerate — the scale-up doc flagged this as a
  risk. Needs embedding-based PRDC to confirm or deny the ordering.

uv.lock regenerated after the earlier Python >= 3.13 bump.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier 77k attempts died during PRDC computation, not during
synthesizer fitting. PRDC on 15k real x 61k synthetic x 50 features
materialized ~7 GB-per-copy distance matrices and OOM'd.

Fix: add prdc_max_samples to ScaleUpStageConfig (default 20k). Both
real and synthetic are sub-sampled before PRDC. The coverage metric is
stable well below the capped size; more synthetic records don't
improve it, they only cost memory.
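The prdc_max_samples cap amounts to the following (a sketch; the real field lives on ScaleUpStageConfig and the function name is hypothetical).

```python
import numpy as np

def cap_for_prdc(X: np.ndarray, max_samples: int, seed: int = 0) -> np.ndarray:
    """Sub-sample rows before PRDC so the pairwise distance matrix stays
    bounded -- an n_real x n_synth float64 matrix at tens of thousands of
    rows per side is the multi-GB allocation described above."""
    if len(X) <= max_samples:
        return X
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=max_samples, replace=False)
    return X[idx]
```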

Stage 1 at 77k x 50:
  ZI-QRF:   cov=0.256 fit= 36s RSS= 6.0 GB (winner, production-workable)
  ZI-QDNN:  cov=0.147 fit= 95s RSS=11.0 GB
  ZI-MAF:   cov=0.014 fit=216s RSS=11.0 GB (near-collapsed)

Ordering (ZI-QRF > ZI-QDNN > ZI-MAF) matches the 40k run.
Absolute coverage differs because the 40k run used uncapped PRDC
(8k x 32k) while the 77k run used capped PRDC (15k x 15k); both are
internally consistent, and the doc notes this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reran 40k x 50 x 3 methods with the same 15k PRDC cap as 77k so
cross-scale comparison is directly interpretable.

40k capped:   ZI-QRF 0.352 > ZI-QDNN 0.222 > ZI-MAF 0.029
77k capped:   ZI-QRF 0.256 > ZI-QDNN 0.147 > ZI-MAF 0.014

Coverage drops with scale but the ordering is invariant. PRDC's k-NN
radius is set on the real data, so a larger real sample tightens the
radius and absolute coverage drops even if synthesizer quality is
unchanged. The ordering is the production-relevant signal; it's stable.
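Why the radius tightens can be seen from the coverage definition. Here is a numpy-only sketch of PRDC-style coverage (the prdc library's implementation differs in detail): a real point is "covered" if some synthetic point falls within its k-th-nearest-real-neighbor radius, and denser real samples shrink those radii.

```python
import numpy as np

def knn_coverage(real: np.ndarray, synth: np.ndarray, k: int = 5) -> float:
    """Fraction of real points with at least one synthetic point inside
    their k-NN radius, where the radius is computed on the real sample
    itself -- hence larger real samples tighten radii and lower coverage."""
    # k-NN radius per real point, on the real sample (diagonal excluded)
    d_rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    np.fill_diagonal(d_rr, np.inf)
    radii = np.sort(d_rr, axis=1)[:, k - 1]
    # distance from each real point to its nearest synthetic point
    d_rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    nearest_synth = d_rs.min(axis=1)
    return float((nearest_synth <= radii).mean())
```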

overnight-session-2026-04-16.md consolidates the full night's work:
11 commits, the scale-up finding, architecture decisions locked in,
and explicit follow-ups for the next session (embedding PRDC,
ZI-MAF hyperparameter tuning, MicrocalibrateAdapter wiring into
us.py, per-column zero-rate breakdown, PSID-only benchmark).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ScaleUpResult now includes zero_rate_per_column: for every column, the
real zero-rate, synthetic zero-rate, and absolute difference. Lets the
stage-1 doc identify which specific columns drive each method's
overall zero-rate MAE — the pilot/stage-1 result showed every method
drives disabled_ssdi to 0, but aggregate MAE of 0.18+ implies many
other columns also diverge.
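The per-column diagnostic reduces to the following (a sketch of the field's shape, not the real ScaleUpResult code; the function name is hypothetical).

```python
import pandas as pd

def zero_rate_per_column(real: pd.DataFrame, synth: pd.DataFrame, cols):
    """Per-column real zero-rate, synthetic zero-rate, and absolute
    difference -- identifies which columns drive the aggregate MAE."""
    out = {}
    for c in cols:
        r = float((real[c] == 0).mean())
        s = float((synth[c] == 0).mean())
        out[c] = {"real": r, "synth": s, "abs_diff": abs(r - s)}
    return out
```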

scripts/embedding_prdc_compare.py: one-off validation script that
fits a 16-dim autoencoder on the holdout, encodes real and synthetic
to latent space, and reports PRDC both in the raw 50-dim feature
space and in the learned 16-dim embedding. Settles whether the
stage-1 ordering (ZI-QRF > ZI-QDNN > ZI-MAF) is a metric artifact
from PRDC-in-high-dimensions or a genuine method difference.

Usage:
    uv run python scripts/embedding_prdc_compare.py --n-rows 40000

Tests still pass (7/7).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds "microcalibrate" to the calibration_backend literal and to
_build_weight_calibrator's dispatch in USMicroplexPipeline. The existing
_apply_policyengine_constraint_stage call site needs no change because
MicrocalibrateAdapter.fit_transform / .validate match the legacy
Calibrator interface exactly.

Usage in the checkpoint pipeline:

  uv run python -m microplex_us.pipelines.pe_us_data_rebuild_checkpoint \
    ... \
    --calibration-backend microcalibrate

Effect:
  - Replaces the entropy-backend solve that killed v4 and v6 (1.5M
    households x ~1.2k constraints on a 48 GB workstation) with
    microcalibrate's gradient-descent chi-squared, which is
    identity-preserving and what PE-US-data uses in production.
  - No other pipeline changes. Backend swap only.

Tests:
  - tests/calibration/test_us_pipeline_dispatch.py (3 tests):
      * backend string resolves to MicrocalibrateAdapter instance
      * end-to-end fit_transform + validate through the pipeline path
      * unknown backend still raises ValueError
  - All 18 calibration + bakeoff tests pass.

Docs:
  - docs/microcalibrate-wiring-plan.md: rationale, contract-compat
    checks, validation plan, risk register, rollout order.

Not in this commit:
  - No v7 run. Full-scale validation is the next production run.
  - No benchmark comparison of microcalibrate vs entropy numerical
    accuracy. v6 evidence is that entropy can't even complete, so
    microcalibrate is not competing for accuracy — it's the only
    backend that gets us past the OOM.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds coverage for the per-column zero-rate field added earlier. Verifies:
  - every target column is present
  - real / synth / abs_diff entries are shaped and bounded correctly
  - abs_diff is consistent with the real/synth difference
  - scalar zero_rate_mae is in the same ballpark as per-column diffs

All 8 bakeoff tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds method_kwargs: dict[str, dict] to ScaleUpStageConfig so the
harness can dispatch method constructors with custom settings. Replaces
the one-off ZI-MAF tuning script pattern with a config-level knob that
works for every method in the registry.

Example use:
    cfg = ScaleUpStageConfig(
        stage="stage1_tuned",
        methods=("ZI-MAF",),
        method_kwargs={"ZI-MAF": {"n_layers": 8, "hidden_dim": 128, "epochs": 200}},
        ...
    )

Makes the ZI-MAF hyperparameter search (currently running as a
standalone script) repeatable through the normal harness path and
keeps stage-1 / stage-2 / stage-3 comparisons explicit about which
hyperparameters each method used.

All 9 bakeoff tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/quickstart-rewire.md: ordered walkthrough of everything that
landed on spec-based-ecps-rewire overnight, starting with the G1
unblocker (--calibration-backend microcalibrate) and working through
the scale-up bakeoff harness, the embedding-PRDC validation script,
and the diagnostics that identify which cells / columns each method
breaks on.

Readable cold. Assumes only git + uv installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tests whether MicrocalibrateAdapter on top of a weak synthesizer
recovers weighted aggregate accuracy. Stage-1 PRDC measured
un-weighted coverage; the actual production pipeline is
synthesize -> calibrate, so a method that produces biased samples may
still produce accurate WEIGHTED aggregates after calibration.

Procedure for each method:
  1. Fit synthesizer on train, generate synthetic with unit weights.
  2. Rescale initial weights so synth totals match holdout-scale
     (moves gradient descent's starting point close to the target).
  3. Build per-target-column sum LinearConstraints with holdout totals.
  4. Run MicrocalibrateAdapter.
  5. Report pre- and post-calibration relative error per target.
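Steps 2-3 above can be sketched as follows (the tuple-shaped constraints are illustrative rather than real LinearConstraints, and holdout_weight_total is an assumed input; the function name is hypothetical).

```python
import pandas as pd

def rescale_and_build_constraints(synth, weight_col, target_cols,
                                  holdout_totals, holdout_weight_total):
    """Rescale unit weights so the synthetic total weight matches the
    holdout scale (moving gradient descent's starting point close to the
    target), then build one (name, column, target-sum) entry per target."""
    out = synth.copy()
    out[weight_col] *= holdout_weight_total / out[weight_col].sum()
    constraints = [(col, col, holdout_totals[col]) for col in target_cols]
    return out, constraints
```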

Usage:
    uv run python scripts/calibrate_on_synthesizer.py --n-rows 20000

Interpretation:
  - If post-cal error converges to near-zero across methods, choice of
    synthesizer matters less than PRDC alone suggested. The weights
    carry the accuracy signal.
  - If ZI-MAF / ZI-QDNN can't be calibrated (gradient descent diverges
    or leaves huge residuals), the PRDC verdict stands and the
    synthesizer choice is load-bearing.

Output: artifacts/calibrate_on_synthesizer.json with per-target
pre/post errors, calibration wall time, weight distribution summary.

Not run tonight — deferred to Max's morning after the ZI-MAF tuning
job completes (both would contend for CPU otherwise).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Four ZI-MAF configurations ran at 40k x 50 real ECPS:

  default    (4L, 32h, 50e):  coverage=0.026  fit=124s
  wide       (4L, 128h, 50e): coverage=0.029  fit=228s
  long       (4L, 32h, 200e): coverage=0.032  fit=467s
  wide+long  (8L, 128h, 200e, lr=5e-4): coverage=0.033 fit=1711s

ZI-QRF on the same data at the same PRDC cap: coverage=0.352 in 19s.

14x the compute budget moves ZI-MAF from 0.026 -> 0.033 -- a ~27% relative
improvement that does not close the 10x gap to ZI-QRF. Stage-1 verdict
stands: ZI-QRF is the production synthesizer, ZI-MAF is confirmed
non-competitive at this scale with the current method-class architecture.

Diagnosis (docs/zi-maf-hyperparameter-search.md):
  - Per-column independent flows can't capture cross-target correlations.
  - Zero-inflation RF classifier + MAF combination is biased on rare cells.
  - Log-transform + standardization compresses heavy tails.
  - Rescuing ZI-MAF plausibly requires joint-target architecture, which
    is a week of implementation that may still not close the gap.

SS-model methodology doc's "production direction: ZI-QDNN" claim remains
overturned; stage-1 ZI-QDNN was mid-pack (0.147 at 77k) and this tuning
exercise doesn't revisit it.

Artifact: artifacts/zi_maf_tuning.json

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fit a 16-dim autoencoder on the 40k x 50 holdout and re-computed PRDC
in both raw 50-dim space and the learned 16-dim latent space. The
concern from docs/synthesizer-benchmark-scale-up.md was that raw-feature
PRDC in 50 dimensions might be noise-dominated.

Raw 50-dim PRDC coverage:
  ZI-QRF   0.348
  ZI-QDNN  0.219
  ZI-MAF   0.025

Embed 16-dim PRDC coverage:
  ZI-QRF   0.309
  ZI-QDNN  0.222
  ZI-MAF   0.038

Ordering preserved. ZI-QRF > ZI-QDNN > ZI-MAF in both spaces. The 10x
gap between ZI-QRF and ZI-MAF narrows modestly (to ~8x) in the embedding
but does not invert.

Combined with the ZI-MAF tuning result (coverage only bumps from 0.026
to 0.033 with 14x the compute), this is the fourth independent
robustness check confirming stage-1: small-scale synth, 5k real, 40k
real, 77k real, embedding-16.

G1 cross-section synthesizer default: ZI-QRF. Stage-1 finding is robust.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ran microcalibrate on top of each method's synthetic output, using
holdout target-sums as calibration targets. Tests whether calibration
compensates for weak synthesis (earlier hope) or requires structurally
sound inputs.

Mean relative error across 36 target columns, pre- vs post-cal:
  ZI-QRF   0.256 -> 0.141  (cal halves error)
  ZI-QDNN  0.388 -> 0.327  (modest help)
  ZI-MAF   17.98 -> 15.08  (synthesis so broken cal can't save it)

Clear finding: calibration refines structurally sound output (ZI-QRF,
ZI-QDNN) but cannot rescue a structurally broken synthesizer (ZI-MAF).
Falsifies the hope that weighting could compensate for weak synthesis.

Fourth independent robustness check on the synthesizer ordering:
  1. Raw 50-d PRDC at 40k real      ZI-QRF 0.348 > QDNN 0.219 > MAF 0.025
  2. Raw 50-d PRDC at 77k real      ZI-QRF 0.256 > QDNN 0.147 > MAF 0.014
  3. Embed 16-d PRDC at 40k real    ZI-QRF 0.309 > QDNN 0.222 > MAF 0.038
  4. Calibrate-on-synth at 20k      ZI-QRF 0.141 < QDNN 0.327 < MAF 15.08
     (post-calibration error; lower is better)

Every axis, every scale, every metric: ZI-QRF wins. Finding is locked.

Follow-up note on production calibration settings:
  - MicrocalibrateAdapter at 200 epochs still improves per-epoch at the
    end of training; bump to 500-1000 in production to reach the
    adapter's 5% relative-error convergence bar.
  - `us.py` wiring uses `calibration_max_iter=100` by default; bump to
    `--calibration-max-iter 500` or higher for the v7 production run.

Artifacts: artifacts/calibrate_on_synthesizer.json (full per-target
errors), artifacts/calibrate_on_synthesizer.log (cal loss trajectory).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Found: upstream microplex.eval.benchmark._MultiSourceBase.generate adds
Gaussian sigma=0.1 noise to EVERY shared-column value, including binary
and categorical ones. is_military=1 becomes 1.04; state_fips=6 becomes
6.11; cps_race=3 becomes 2.97.

Impact:
  - Per-column zero-rate breakdown is dominated by shared-col noise
    pollution, not by synthesizer target-column quality.
  - PRDC coverage is reduced uniformly across methods (so ordering is
    preserved) but absolute numbers understate how good the methods
    actually are.

Local mitigation (in harness, not in microplex core):
  _snap_categorical_shared_cols runs after method.generate() and, for
  every shared column whose training values are all integer-valued,
  snaps synthetic values back to the nearest training-pool value.

Heuristic: integer-valued in training == categorical. Catches is_*
flags, cps_race, state_fips, own_children_in_household. Leaves
continuous cols (fractional floats like pre_tax_contributions) with
their noise.

Verified on a 5k probe:
  is_military: 3999 synth uniques -> 2 (matches train)
  cps_race:    ~3500 synth uniques -> 14 (train has 16)
  state_fips:  3999 synth uniques -> 51 (matches train's 51)
  age:         3999 synth uniques -> 86 (matches train's 86)
  pre_tax_contributions: 3994 synth uniques -> 3994 (left alone, non-integer)
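A sketch of the snap (the function name here is hypothetical; the real harness method is _snap_categorical_shared_cols, which operates on the shared columns after method.generate()).

```python
import numpy as np
import pandas as pd

def snap_integer_cols(train: pd.DataFrame, synth: pd.DataFrame,
                      shared_cols: list) -> pd.DataFrame:
    """For every shared column whose training values are all
    integer-valued (the heuristic for 'categorical'), snap noisy
    synthetic values back to the nearest value seen in training.
    Continuous columns (fractional floats) are left alone."""
    out = synth.copy()
    for col in shared_cols:
        pool = np.sort(train[col].dropna().unique())
        if len(pool) < 2 or not np.all(pool == np.round(pool)):
            continue  # fractional training values -> treat as continuous
        # nearest training-pool value via searchsorted on the sorted pool
        pos = np.clip(np.searchsorted(pool, out[col].to_numpy()),
                      1, len(pool) - 1)
        left, right = pool[pos - 1], pool[pos]
        out[col] = np.where(np.abs(out[col] - left) <= np.abs(right - out[col]),
                            left, right)
    return out
```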

docs/per-column-zero-rate-bug.md captures the bug, why the stage-1
ordering still held despite it, and the recommended upstream fix.

All 9 bakeoff tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
After the categorical-snap mitigation for the upstream shared-col
noise bug, re-ran stage-1 at both 40k and 77k scales:

  40k × 50:
    ZI-QRF   coverage 0.979 (pre-snap: 0.352, +0.627)
    ZI-QDNN  coverage 0.796 (pre-snap: 0.222, +0.574)
    ZI-MAF   coverage 0.168 (pre-snap: 0.029, +0.139)

  77k × 50:
    ZI-QRF   coverage 0.928 (pre-snap: 0.256, +0.672)
    ZI-QDNN  coverage 0.707 (pre-snap: 0.147, +0.560)
    ZI-MAF   coverage 0.106 (pre-snap: 0.014, +0.092)

Ordering preserved (ZI-QRF > ZI-QDNN > ZI-MAF). Absolute numbers are
meaningfully higher because the pre-snap numbers were dragged down
uniformly by the shared-col noise on binary/categorical conditioning
vars (is_military, cps_race, state_fips etc).

Headline story changes:
  - ZI-QRF quality is far better than pilot suggested -- 92.8%
    coverage at 77k is production-credible.
  - ZI-QDNN is legitimately competitive (0.707) though ZI-QRF still
    wins by 31% and runs 3x faster.
  - ZI-MAF at 0.106 is still the worst but not "entirely broken" as
    the pre-snap 0.014 suggested.

All other findings (ordering, calibrate-on-synth, embedding-PRDC,
ZI-MAF hyperparameter-tuning verdict) hold. This snap is a measurement
improvement, not a direction change. G1 next-action playbook unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ession summary

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…t CLI

Previously the checkpoint runner defaulted to calibration_backend='entropy'
with no way to switch from the command line. The microcalibrate backend
is wired into USMicroplexBuildConfig but there was no way to activate
it without code changes.

CLI now accepts:
  --calibration-backend {entropy,ipf,chi2,sparse,hardconcrete,pe_l0,microcalibrate,none}
  --calibration-max-iter <int>

Both feed into config_overrides and route through to _build_weight_calibrator.
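The CLI surface can be sketched with argparse (flag names from this commit; the defaults and the config_overrides shape are illustrative).

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--calibration-backend",
    choices=["entropy", "ipf", "chi2", "sparse", "hardconcrete", "pe_l0",
             "microcalibrate", "none"],
    default="entropy",
)
parser.add_argument("--calibration-max-iter", type=int, default=100)

# Parse an explicit argv for illustration (normally taken from sys.argv).
args = parser.parse_args(["--calibration-backend", "microcalibrate",
                          "--calibration-max-iter", "500"])
config_overrides = {
    "calibration_backend": args.calibration_backend,
    "calibration_max_iter": args.calibration_max_iter,
}
```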

Usage (the G1 run):
  uv run python -m microplex_us.pipelines.pe_us_data_rebuild_checkpoint \
    --calibration-backend microcalibrate \
    --calibration-max-iter 500 \
    ...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paper/
  _quarto.yml          project config, HTML + PDF targets
  AFFILIATION.md       hard rule: Cosilico-only, independent of PolicyEngine
  README.md            build + citation-style notes
  references.bib       37 confirmed BibTeX entries from four parallel lit searches
  literature-review.qmd    standalone survey of tabular synth, calibration,
                           evaluation metrics, and US tax microsim literature
  index.qmd            main manuscript — intro, related work, architecture
                       outline, methods outline, results tables for stage-1
                       ordering and upstream-bug correction, limitations;
                       Architecture / Methods / Discussion / Conclusion
                       sections marked to-draft
  _output/             quarto build outputs (gitignored)

Four claim axes the paper will defend:
  1. Head-to-head QRF vs neural synth on real US tax microdata (novel cell)
  2. Identity-preserving calibration as explicit architectural requirement
     (novel framing; precedents cited)
  3. Chained QRF + microcalibrate composition (novel composition; components
     cited)
  4. Benchmark noise-injection bug diagnosis + upstream fix (real finding,
     corrected results published)

Cosilico-only affiliation: all author / institutional framing scrubbed of
PolicyEngine co-authorship per explicit requirement. PolicyEngine data
products and microcalibrate cited as prior work, not co-products.

Quarto renders both files cleanly to HTML (53 KB / 65 KB) with pandoc's
default citation style (chicago-author-date); swap in a journal CSL in
_quarto.yml once a target venue is chosen.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Five subagent reviewers (citation, methodology, domain, stylistic,
reproducibility) ran in parallel on the paper scaffold. Four of five
returned Major Revisions; one returned Minor. Consensus verdict: the
draft has good bones but is not submittable in current state.

Five BLOCKER findings that must land before any review circulation:

B1. Two of four "independent robustness checks" were generated before
    the snap fix (embedding_prdc_compare.json Apr 17 08:03 and
    calibrate_on_synthesizer.json Apr 17 08:06 both predate the
    snap-fix commits at 12:06 / 12:20). Must rerun the scripts through
    ScaleUpRunner.fit_and_generate or with the upstream fix applied.

B2. The 36 "target columns" are CPS-reported inputs, not policy
    outputs. Tax-microsim reviewers expect targets = federal tax,
    EITC, CTC, etc. Fix: rename at minimum; ideally add a downstream
    tax-aggregate validation running policyengine-us (or Tax-Calculator
    / TAXSIM) on microplex-us output and compare against IRS SOI /
    USDA / SSA / CBO administrative totals.

B3. The Architecture, Methods, rare-cell, Discussion, and Conclusion
    body sections are stubs. Submission-blocking.

B4. No Code and Data Availability statement. Required at every target
    venue; HuggingFace URL with pinned revision + license + software
    versions + hardware.

B5. No Conflicts of Interest disclosure. Author founded PolicyEngine
    and led Enhanced CPS work cited extensively. Silence reads worse
    than acknowledgement given the field size.

High-priority (H1-H7): first-person conversion, self-contain Related
Work, strip documentation register, table captions, at least one
figure, "widely-used" claim, citation form audit.

Medium-priority (M1-M10): uncertainty quantification, calibration
convergence, formal identity-preservation definition, embedding-PRDC
circularity, Forbes claim softening, cross-sectional identity-
preservation motivation, substrate circularity, target-set expansion,
snap cardinality guard, PRDC/split seed decoupling.

Low-priority (L1-L8): Synthcity citation error, TabPFGen / CTAB-GAN+ /
Auten-Splinter / Meyer-Mok-Sullivan / Czajka additions, URL/DOI
completeness, bibliography cleanup, table formatting, abstract
cleanup, unused-ref removal, data-product citations, LICENSE file,
regression test for ordering, Quarto-chunk-ified tables.

Revision order and time budget: ~2-3 weeks to submittable draft,
with the downstream tax-output validation as the main bottleneck.
Detailed sequence in the doc.

Noted two places where reviewers over-called:
  - zi_maf_tuning.json exists (reproducibility reviewer missed it)
  - Identity-preservation framing is defensible if scoped to the
    cross-section calibration layer (citation reviewer cited Dekkers
    2015, which is about ageing, not calibration)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
B1 from paper/REVIEW-RESPONSE.md: both scripts predated the upstream
shared-col noise fix (Apr 17 08:03-08:06 vs snap commits at 12:06/12:20).
With microplex installed editable from the repaired upstream sibling,
rerunning both scripts now exercises the fixed generate() method.

embedding-PRDC (40k x 50 real ECPS, AE latent dim 16):
               raw-50             embed-16
  ZI-QRF       0.348 -> 0.982     0.309 -> 0.984  (post-snap)
  ZI-QDNN      0.219 -> 0.791     0.222 -> 0.819
  ZI-MAF       0.025 -> 0.183     0.038 -> 0.201
Ordering preserved in both spaces; absolute PRDC coverage rises
substantially for every method because noise on binary/categorical
conditioning variables is no longer forcing synthetic values off the
training support. ZI-QRF is near-ceiling (0.98+) in both spaces.
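
For readers unfamiliar with the metric: PRDC coverage (Naeem et al.
2020) is the fraction of real points whose k-NN ball contains at least
one synthetic point. A minimal numpy sketch, not the evaluation code
used for these runs:

```python
import numpy as np

def coverage(real: np.ndarray, synth: np.ndarray, k: int = 5) -> float:
    # k-th nearest-neighbor radius per real point (column 0 is self, distance 0)
    rr = np.linalg.norm(real[:, None, :] - real[None, :, :], axis=-1)
    radii = np.sort(rr, axis=1)[:, k]
    # distance from each real point to its nearest synthetic point
    rs = np.linalg.norm(real[:, None, :] - synth[None, :, :], axis=-1)
    d = rs.min(axis=1)
    return float((d < radii).mean())

rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
# Synthetic data hugging the real support scores near 1.0 ...
assert coverage(real, real + rng.normal(scale=0.01, size=(200, 4))) > 0.9
# ... while off-support synthetic data scores near 0.
assert coverage(real, real + 100.0) == 0.0
```

This is why the post-snap coverage jump above is expected: once noise no
longer pushes synthetic values off the training support, more real
points find a synthetic neighbor inside their k-NN radius.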

calibrate-on-synth (20k x 50, 500 epochs microcalibrate):
  ZI-QRF   pre 0.317 -> post 0.105
  ZI-QDNN  pre 0.386 -> post 0.251
  ZI-MAF   pre 17.51 -> post 11.86
Bumped from 200 to 500 epochs per reviewer's convergence concern.
Ordering unchanged. ZI-MAF still ~50x worse than ZI-QDNN post-cal,
consistent with the "calibration cannot rescue broken synthesis" story.

Pre-snap artifacts preserved as artifacts/*.pre-snap.json for audit trail.
Docs (embedding-prdc-validation.md, calibrate-on-synthesizer-result.md)
and paper/index.qmd §5.4 updated with post-snap numbers. Pre-snap
numbers kept inline as archived comparison for transparency.

Note: artifacts/ is .gitignore'd so the JSON files live on disk but
not in the repo. Log files also gitignore'd. This is intentional
per the repo's earlier cleanup; result tables in docs and paper
are the canonical record.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Landed from paper/REVIEW-RESPONSE.md:

B4 — Code and data availability section: repo URLs, HF dataset pointer,
   license, rebuildability note, reproduction environment (Python
   3.14.0, macOS 14, M3, 48 GB RAM, CPU-only, ~6 min wall time).

B5 — Disclosures section: explicit statement that I founded
   PolicyEngine, led the @ghenis2024ecps work, and am conducting this
   research at Cosilico independent of PolicyEngine. Closes the COI
   gap the domain and methodology reviewers both flagged.

H1 — First-person voice: converted "we"→"I"/"this paper" throughout
   abstract, §1, §2, §5. Literature-review.qmd still needs a pass
   (tracked in REVIEW-RESPONSE.md).

H4 — Table captions and cross-ref labels: added for all three main
   tables (Table {#tbl-stage1}, {#tbl-prefix}, {#tbl-calibrate}).
   Expanded abbreviations (Fit→Fit time, Pre/Post-cal→Before/After
   calibration). Applied consistent bolding (all-best-in-column).

H6 — Softened "widely-used upstream benchmark base class" claim to
   "Synthesizer benchmarks that used the same microplex.eval.benchmark
   base class before the correction landed." Removed the [report low]
   placeholder in the same sentence.

Misc — also:
   - Fixed Synthcity citation author list (Qian, Davis, van der Schaar
     for the NeurIPS 2023 D&B paper, not Cebere).
   - Added Ruggles 2025 citation in Related Work (domain reviewer M9).
   - Removed unused @zhang2017privbayes entry.
   - Rewrote noise-injection paragraph to drop backticked code-token
     lists in favor of English (per stylistic reviewer L6): "sex,
     military-service, state FIPS, and CPS race indicators."
   - Results-section prose rewritten from dashboard-caption sentence
     fragment into full prose referencing the tables.

Quarto renders both files cleanly (index.html + literature-review.html
in paper/_output/).

Remaining work from REVIEW-RESPONSE.md:
   - B2: rename target columns + downstream tax-output validation
     (several days)
   - B3: draft §3 Architecture, §4 Methods, §5.3 rare-cell,
     §6 Discussion, §8 Conclusion (still stubs)
   - H1 literature-review.qmd voice pass
   - H2 self-contain Related Work (400-600 words lifted from lit
     review into index.qmd §2)
   - H3 strip remaining engineering register
   - H5 add pipeline schematic figure
   - Plus M-tier and L-tier items per REVIEW-RESPONSE.md

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis and others added 29 commits April 18, 2026 07:06
Tested five zero-classifiers on ZI-QDNN at 77k x 50 (seed 42):
  RF default         coverage 0.7081 (baseline)
  HistGradientBoost  coverage 0.7017
  MLP (64x32, DNN)   coverage 0.6984
  RF + isotonic      coverage 0.6983
  Logistic           coverage 0.6941

All within 0.014 coverage points, a spread only a few times our
multi-seed std of ~0.002-0.003. The RF default is effectively optimal
among alternatives tested; no classifier swap meaningfully improves
ZI-QDNN.

Interpretation: a 50-tree RF already captures all the information
content of P(y>0|x) that cross-sectional classification can extract
from 14 conditioning variables at 61k training rows. More sophisticated
classifiers (HistGB, DNN) don't extract additional signal.

What WOULD lift ZI-QDNN above 0.71 is architectural, not a classifier
swap:
- Joint zero-mask model (predict full 36-dim zero pattern jointly so
  cross-target zero correlations are captured)
- Joint quantile output (shared-backbone multivariate QDNN)
- Post-hoc calibration on the QDNN draw itself (Platt / conformal)

Implementation:
- Added _patch_zi_classifier in local_methods.py that rewrites a ZI
  method instance's fit() to use a configurable classifier_factory
- Added four classifier factories: logistic, hgb, calibrated, dnn
- Added guard for single-class training data (prevents logistic crash
  on columns with zero positive samples)

Full writeup in docs/zi-factorial.md (appended §"ZI classifier
comparison (QDNN)").

Artifact: artifacts/zi_classifier_comparison.json (not git-tracked,
artifacts/ is gitignore'd; see docs for the table).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gnal

Run per-column 80/20 fit/val splits on the 26 ZI-eligible target columns
(zero_frac >= 10%) and score each of the 5 classifiers on log-loss, Brier,
ECE, and ROC-AUC without the downstream QDNN draw in the loop.

Outcome flips the coverage story cleanly:

  classifier       ll_mean  ll_med  brier    ece   auc
  HistGB            0.2252  0.1712  0.0707  0.005  0.809  <-- best
  DNN               0.2337  0.1956  0.0732  0.007  0.748
  RF_calibrated     0.2343  0.1834  0.0739  0.008  0.763
  Logistic          0.2468  0.2028  0.0770  0.018  0.756
  RF_default        0.3095  0.2523  0.0810  0.039  0.737  <-- worst

Log-loss spread 0.085 (~6x the coverage spread); ECE gap ~8x; AUC gap 7
points. Seven points of AUC is far outside noise. The classifiers are
NOT equivalent — the downstream QDNN non-zero draw swamps the signal,
so coverage reports a tie.
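
ECE is the one metric in the table without a standard sklearn function;
a minimal binned implementation, illustrative rather than necessarily
the one the eval script used:

```python
import numpy as np

def expected_calibration_error(y_true, p, n_bins=10):
    # Bin predictions; ECE is the bin-weighted gap between mean
    # predicted probability and empirical positive rate per bin.
    bins = np.minimum((p * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(p[mask].mean() - y_true[mask].mean())
    return ece

# A perfectly calibrated predictor scores 0; a confident-but-wrong one does not.
assert expected_calibration_error(np.array([1, 0, 1, 0]), np.full(4, 0.5)) == 0.0
assert expected_calibration_error(np.array([0, 0, 0, 0]), np.full(4, 0.9)) == 0.9
```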

Implication: swapping classifiers alone cannot lift ZI-QDNN past 0.71
coverage. The binding constraint is the non-zero quantile output, not
the zero gate. This is exactly hypothesis (b) from the methodology
discussion.

Secondary: if P(y=0|x) is ever surfaced as a diagnostic or subgroup-level
signal, prefer HistGB (or a calibrated RF) over the RF default. The
calibration gap invisible on coverage is directly user-visible on
calibration plots and top-k retrieval.

Artifact: artifacts/zi_classifier_isolated_eval.json (config, per-column
metrics, aggregate). Script: scripts/zi_classifier_isolated_eval.py.
Doc: appended section to docs/zi-factorial.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The isolated per-column evaluation in commit cbf1258 showed HistGB
Pareto-dominates the 50-tree RF default on every intrinsic classifier
metric (log-loss 0.225 vs 0.310, ECE 0.005 vs 0.039, AUC 0.809 vs 0.737)
across the 26 ZI-eligible target columns. PRDC coverage is insensitive
to the swap (0.7017 vs 0.7081) because the downstream QDNN draw swamps
the gap, but the classifier is chosen on intrinsic quality: if the
component's job is to predict P(y > 0 | x), HistGB does it better.

Changes:

- local_methods.py: ZIQDNNHistGBMethod exported as the deployment default,
  built via _make_zi_variant + _hgb_factory. Drop the placeholder
  ZIQDNN{Logistic,HGB,Calibrated}Method stubs that were never instantiated.
- scale_up.py registry: "ZI-QDNN" now resolves to HistGB-backed variant.
  The upstream RF-backed ZIQDNNMethod is kept under "ZI-QDNN-RF" so prior
  artifacts (produced with RF) remain exactly reproducible — just pass
  --methods ZI-QDNN-RF at the CLI.
- paper/index.qmd §4: add one paragraph explaining the default shift and
  that the §5 numbers were generated with the RF default. The benchmark
  is not re-run.

Rationale for swap despite coverage-level indifference:

- HistGB is strictly better at the quantity the ZI component is
  ostensibly predicting (P(y > 0 | x)).
- If P(y=0|x) is ever surfaced as a user-visible diagnostic signal
  (subgroup top-k retrieval, calibration plots, "household likely to
  have zero capital gains"), RF's ECE=0.039 won't hold up.
- Runtime cost is ~13x (2.8s → 36s for 26 columns at 77k × 50);
  projects to ~30 min at v7's 3.4M rows. Not a blocker.

Regression testing: ZI-QDNN-RF preserves bit-reproducibility of earlier
coverage artifacts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The embedding_prdc_compare and calibrate_on_synthesizer artifacts were
re-run on 2026-04-17 21:15/21:17 against post-fix upstream microplex
(commit 81a5e10 at 12:20). The pre-snap versions are preserved as
.pre-snap.json for audit; paper §5 references the post-snap numbers.
No further rerun needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Root cause of the 2026-04-18 01:57 v7 OOM: adapter built a float64
DataFrame for the estimate_matrix (6 GB at 1.5M x ~500), then
microcalibrate allocated an independent float32 torch copy. With no
upstream change, the duplicate alone crossed the macOS jetsam kill
threshold on the 48 GB workstation.

Fix on this side: build the DataFrame directly from float32 columns.
Downstream torch layer was already casting to float32, so this is a
free precision-compatible win that drops the adapter's peak allocation
from 6 GB to 3 GB.

Upstream microcalibrate PR in flight to (a) release the pandas
DataFrame reference after __init__, and (b) add batch_size gradient
accumulation so the per-epoch activation is O(batch * targets) instead
of O(n_records * targets). Those two combined with this adapter change
should let v7 complete at k >= 4,000 constraints.

TDD: test_microcalibrate_adapter_memory.py::test_estimate_matrix_passed_to_calibration_is_float32
spies on Calibration.__init__ and asserts every column dtype is float32.
Adds a convergence regression test (300 records, 400 epochs, 3 age-band
constraints) to catch any precision loss from the dtype change.
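
A minimal sketch of the adapter-side change, assuming a hypothetical
build_estimate_matrix helper (the real adapter code differs): construct
the DataFrame from float32 columns up front, so pandas never holds a
float64 copy alongside torch's float32 tensor.

```python
import numpy as np
import pandas as pd

def build_estimate_matrix(columns):
    # Cast each column to float32 before the DataFrame is assembled;
    # no float64 intermediate is ever materialized.
    return pd.DataFrame(
        {name: np.asarray(vals, dtype=np.float32) for name, vals in columns.items()}
    )

m = build_estimate_matrix({"age": np.arange(4), "income": np.ones(4)})
assert all(dt == np.float32 for dt in m.dtypes)
# 2 columns x 4 rows x 4 bytes; float64 would double this.
assert m.memory_usage(index=False).sum() == 32
```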

Also drop unused `field` import from dataclasses and two non-load-bearing
`assert ... is not None` checks in validate() (flagged by code-simplifier
subagent review).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
microcalibrate 0.22.0 ships the gradient-accumulation batch_size
parameter and the pandas-release-after-init memory fix from PR #99.
With batch_size=100_000 on a 1.5M-household frame at k ≈ 500
constraints, per-batch activation is ~200 MB instead of ~3 GB. Combined
with the adapter's float32 matrix (commit 6ffdb06) and the upstream
DataFrame release, the v7 pipeline should complete under the 48 GB
workstation budget.
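
The activation arithmetic behind those numbers, as a quick check
(float32, 4 bytes per cell):

```python
def activation_bytes(n_records: int, n_targets: int, itemsize: int = 4) -> int:
    # Per-epoch activation is one value per (record, target) pair.
    return n_records * n_targets * itemsize

assert activation_bytes(1_500_000, 500) == 3_000_000_000  # ~3 GB unbatched
assert activation_bytes(100_000, 500) == 200_000_000      # ~200 MB per batch
```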

- pyproject.toml: microcalibrate>=0.22
- adapter config: batch_size=100_000 default on MicrocalibrateAdapterConfig
- adapter fit_transform: forwards batch_size into Calibration

Next: rerun v7 with microcalibrate backend and feed output to
policyengine-us for tax-aggregate downstream validation (REVIEW-RESPONSE B2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commit (704ff77) bumped uv.lock and the adapter config,
but the pyproject.toml pin was left at >=0.21 by mistake. Fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The adapter moved to upstream microplex (see CosilicoAI/microplex#6)
so every country package shares one identity-preserving calibrator
instead of duplicating the glue. This commit:

- Swaps pyproject dependency `microcalibrate>=0.22` for `microplex[calibrate]`,
  picking up the torch/optuna/l0 stack transitively via the extra.
- Deletes `src/microplex_us/calibration/microcalibrate_adapter.py`;
  the source of truth is now `microplex.calibration.microcalibrate_adapter`.
- Rewrites `src/microplex_us/calibration/__init__.py` to re-export the
  adapter classes from upstream so existing
  `from microplex_us.calibration import MicrocalibrateAdapter` imports
  keep working — bit-for-bit backward-compatible for downstream pipelines.

All 13 microplex-us calibration tests pass against the re-exported
adapter (identical behavior, upstream-hosted implementation).

Next: once microplex#6 merges, this PR can merge too; pipelines using
MicrocalibrateAdapter get the batched calibration transparently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small but important truth-in-text updates to the Identity-preserving
calibration section:

1. The production default now explicitly references `MicrocalibrateAdapter`
   as a country-agnostic adapter shipped from upstream `microplex` under
   the `calibrate` extra. This matches the structure after the 2026-04-18
   relocation (microplex PR #6, merged as 254114d) and makes the paper
   accurate for reproducibility: country packages inherit the calibrator
   rather than duplicating it.

2. The OOM-completion claim now acknowledges the two fixes that made the
   production run at 1.5M-household scale actually feasible: the adapter's
   float32 estimate matrix (microplex-us commit 6ffdb06) and upstream
   microcalibrate 0.22's batched gradient accumulation (PolicyEngine/
   microcalibrate#99). Before both landed, the gradient-descent chi-
   squared backend OOM'd too — replacing "avoids the dense materialization
   and completes in minutes" with the honest version.

These update the paper's architectural prose to match the stack that the
v7 rerun actually uses.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v7 uses donor_imputer_backend='qrf' (default), which leaves
ColumnwiseQRFDonorImputer.zero_inflated_vars empty and runs QRF
predict() over all 3.37M rows for every column — including
columns that are 99% zero. v8 flips to --donor-imputer-backend zi_qrf
for a ~5-10x speedup on zero-heavy columns via predict-skipping.

Added tests (all pass):

- test_zi_whitelist_produces_zero_classifier: whitelist + heavy-zero
  → RF gate is fitted; dense columns don't get a gate.
- test_empty_whitelist_means_no_gates: pins v7 semantics; empty
  whitelist → no gates ever.
- test_generate_calls_qrf_only_on_predicted_positive_rows: proves
  QRF predict is called on a strict subset (not all rows). Uses a
  97%-zero column + 10k generate rows; asserts predict_rows < 50%
  of generate size. This is the wall-clock optimization v8 depends on.
- test_zi_qrf_backend_populates_whitelist: factory wires the
  ZERO_INFLATED_POSITIVE-family variables into the whitelist when
  backend='zi_qrf'.
- test_qrf_backend_leaves_whitelist_empty: regression-pin for the
  v7 default behavior so the switch doesn't silently regress.

Added docs/next-run-plan.md with:
- exact launch command for v8
- list of what zi_qrf actually covers (PUF tax vars only; benefit
  vars like SSI/TANF/SNAP are CONTINUOUS in variables.py and need
  a one-line reclassification to get the same optimization)
- pre-launch verification instructions (5-test smoke check)
- subtle consequence note: post-ZI QRF can't return zero (trained
  on y>0 subset); zeros come from gate path only — sharp boundary.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two linked changes:

1. pe_l0.py: PolicyEngineL0Calibrator.fit now calls
   _build_sparse_constraint_system from microplex.calibration directly,
   skipping the dense np.vstack + sp.csr_matrix(A) round-trip. At v7
   scale (1.5M records × ~4k constraints) this avoids the ~24 GB dense
   intermediate that macOS memorystatus killed the v7 microcalibrate
   rerun over on 2026-04-18 (python3.14 [28015] grew to 172 GB
   compressed). Requires microplex from the sparse-constraint-builder
   branch (CosilicoAI/microplex#7). Residual computation also switched
   from `A @ weights - b` to `X_sparse @ weights - b`; identical
   numerics, no dense matrix ever materialized.

2. paper/index.qmd §3.3 / §3.4: weaken the identity-preservation
   definition from strict positivity (∀i: w_i' > 0) to row-set
   preservation (∀i: w_i' >= 0 AND id(r_i') = id(r_i)). Max's point
   in conversation: a record with w_i = 0 still has its entity
   identifier and row position in the HDF5 dataset — it's just
   excluded from the current year's weighted aggregates, and is
   available for year Y+1's calibration to re-weight up. This is
   consistent with CBOLT / DYNASIM's equal-per-person frozen-weight
   convention; zero-sparsity is a strict superset of that flexibility.

   §3.4 (Sparse L0) rewritten accordingly: L0 is now framed as a
   first-class calibrator alongside chi-squared, not as "optional
   post-processing." Both backends are identity-preserving under the
   corrected definition. The chi-squared vs L0 trade-off is now
   "deployment artifact size vs rare-subpopulation coverage audit
   burden" rather than "identity vs size."
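
The corrected invariant can be stated as a few-line check, with
illustrative values:

```python
import numpy as np

# Row-set preservation: calibration may zero a weight but never drops
# or reorders records, so ids and row positions survive.
ids_before = np.array([101, 102, 103, 104])
w_after = np.array([1.2, 0.0, 3.4, 0.7])   # post-calibration weights
ids_after = ids_before.copy()               # calibration never drops rows

assert (w_after >= 0).all()                 # weakened condition: w_i' >= 0
assert np.array_equal(ids_after, ids_before)  # id(r_i') == id(r_i)
# The zero-weight record contributes nothing to this year's weighted
# aggregates but stays addressable for year Y+1 recalibration.
assert (w_after == 0).sum() == 1
```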

Consequence for v8: the pe_l0 backend is now recommended for
memory-constrained runs on the 48 GB workstation. Next launch should
use --calibration-backend pe_l0 alongside --donor-imputer-backend zi_qrf
(see docs/next-run-plan.md).
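
The dense/sparse residual equivalence claimed in change 1 is easy to
verify in isolation (toy sizes; scipy assumed available):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
A = rng.random((6, 10))
A[A < 0.7] = 0.0                 # mostly-zero constraint coefficients
X_sparse = sp.csr_matrix(A)      # CSR stores only the nonzeros
w = rng.random(10)
b = rng.random(6)

# Identical numerics; once X_sparse exists, no dense matrix is needed.
assert np.allclose(A @ w - b, X_sparse @ w - b)
assert X_sparse.nnz < A.size
```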

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Needed to launch v8 with zi_qrf (the ZI predict-skip path). The config
field already exists at USMicroplexBuildConfig.donor_imputer_backend
but wasn't reachable from the command line — only the default (qrf)
ran for v7. Adds the `--donor-imputer-backend` flag with choices
{maf, qrf, zi_qrf} and wires it into config_overrides like the
sibling --calibration-backend flag.
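
A minimal argparse sketch of the wiring (the real CLI builds its parser
elsewhere and threads config_overrides differently):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--donor-imputer-backend",
    choices=["maf", "qrf", "zi_qrf"],
    default="qrf",   # v7 behavior when the flag is omitted
)

args = parser.parse_args(["--donor-imputer-backend", "zi_qrf"])
config_overrides = {"donor_imputer_backend": args.donor_imputer_backend}
assert config_overrides["donor_imputer_backend"] == "zi_qrf"
assert parser.parse_args([]).donor_imputer_backend == "qrf"
```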

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ColumnwiseQRFDonorImputer previously trained its zero-inflation
classifier with label `(y > 0).astype(int)` and filtered the
downstream QRF training set to `y > 0`. For any target that can be
negative (short_term_capital_gains, partnership_s_corp_income,
farm_income, rental_income, self_employment_income, etc.), the QRF
only ever saw positive training rows and could therefore never emit
a negative value at generate time — the entire negative tail of the
synthetic frame was blanked out.

Minimal fix:

- Label the classifier as `(y != 0).astype(int)` so the positive
  class is "nonzero (either sign)" rather than "positive only".
- Filter the QRF training set to `y != 0`, mixing positives and
  negatives so the QRF learns the full nonzero conditional
  distribution.
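
The two-line core of the fix, on toy data:

```python
import numpy as np

y = np.array([-5.0, 0.0, 3.0, 0.0, -2.0, 7.0])

# Before: (y > 0) dropped every negative from both the gate's positive
# class and the QRF training set. After:
gate_label = (y != 0).astype(int)   # "nonzero, either sign"
qrf_train = y[y != 0]               # mixes positives and negatives

assert gate_label.tolist() == [1, 0, 1, 0, 1, 1]
assert (qrf_train < 0).any() and (qrf_train > 0).any()
```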

Test (TDD):

tests/pipelines/test_donor_imputer_negative_preservation.py fits on
a synthetic frame with ~40% negatives, ~20% zeros, ~40% positives,
generates 2000 synthetic rows, asserts at least 5% of the generated
values are negative. Pre-fix: 0 negatives produced. Post-fix: passes.

Scope:

This is the minimal fix. The full upgrade is to replace
`ColumnwiseQRFDonorImputer`'s ad-hoc gate entirely with
`microimpute.models.ZeroInflatedImputer` (PolicyEngine/microimpute#186,
merged), which auto-detects the three-sign regime on each target and
routes nonzero-positive and nonzero-negative predictions through
separate QRFs. That gives a structural guarantee against
interior-band leakage in addition to the drop-negatives fix — see
the holdout experiment in PolicyEngine/microimpute@a13b1f4 for the
quantitative comparison. Tracked for v9 as a standalone refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Imputer

Introduces a new donor_imputer_backend option, `regime_aware`, that
wraps microimpute.ZeroInflatedImputer (PolicyEngine/microimpute#186,
merged) per target column. ZeroInflatedImputer auto-detects the
three-sign regime on the training distribution and routes predictions
through sign-specific QRFs, giving a structural guarantee that no
prediction lands in the interior band between max(train_negatives)
and min(train_positives).

Differences from the existing backends:

- `qrf`: single QRF, no gate. Zeros come out as whatever the QRF
  happens to predict near zero. Interior-band violations typical.
- `zi_qrf`: ad-hoc `y > 0` gate (since commit 8c88277, `y != 0` — keeps
  negatives). Binary gate + single QRF on the mixed nonzero subset.
  Interior-band violations still possible because one QRF trained on
  both signs interpolates near zero.
- `regime_aware` (new): ZeroInflatedImputer auto-detects one of seven
  regimes (THREE_SIGN / ZI_POSITIVE / ZI_NEGATIVE / SIGN_ONLY /
  POSITIVE_ONLY / NEGATIVE_ONLY / DEGENERATE_ZERO) per target, and
  for three-sign variables routes to separate positive and negative
  QRFs. Interior-band violations structurally impossible.

Tests (6 pass):

- `tests/pipelines/test_regime_aware_donor_imputer.py`:
  - Class importable from microplex_us.pipelines.us
  - Factory dispatches `backend='regime_aware'` to the new class
  - Fit+generate preserves negatives, positives, and exact zeros
  - **Zero interior-band violations** on a three-sign fixture with a
    designed (-100, 100) empty band in training data — the structural
    guarantee the upstream PR provides

CLI flag `--donor-imputer-backend` now accepts `regime_aware` alongside
maf / qrf / zi_qrf. Ready to launch v9 once v8 completes.

Known upstream issue: microimpute 2.x's
ZeroInflatedImputer._fit_base_single hardcodes log_level="ERROR" and
conflicts with any caller that passes log_level via base_imputer_kwargs.
Worked around here by leaving base_imputer_kwargs={}. Will file
follow-up PR to microimpute to make the hardcode conditional.

v8 pipeline unaffected: its in-memory process imported the pre-edit
modules at start and is still running on the `zi_qrf` backend with the
v7-era `ColumnwiseQRFDonorImputer`. This change lands cleanly for v9
without interfering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Finding: the v8 calibration-stage jetsam kill at 197 GB compressed
memory was NOT caused by the L0 fit itself (isolated measurement:
1.5M × 4000 × 5% density in 23s at 13.5 GB peak RSS). It was caused
by retained state around the fit — in particular the pre-filter
``compiled_constraints`` set holding ~4,000 × 1.5M × float64 dense
arrays (~48 GB) while an in-line PolicyEngine Microsimulation
(25–35 GB) and the entity table bundle (10 GB) are simultaneously
alive.

This commit addresses the ~30 GB of *transient* memory churn
inside the 48 GB baseline: ``_build_policyengine_constraint_records``
scans every constraint's coefficient array three separate times
during ledger + deferred-stage selection, and each scan allocates a
full-length ``np.abs(...)`` intermediate. At v7/v8 scale that's 3 ×
48 GB of transient allocations the macOS compressor was counting.

Fix: precompute ``active_households`` and ``coefficient_mass`` once
per constraint, pass a ``metadata_lookup`` dict through the ledger
and deferred-stage-selection call chain, and use the cached scalars
instead of rescanning. Two existing helpers gain optional
``metadata_lookup`` kwargs:

- ``_constraint_active_household_count(constraint, *, metadata_lookup=None)``
- ``_build_policyengine_constraint_records(targets, constraints, *, metadata_lookup=None)``

New helpers:

- ``_precompute_constraint_metadata(constraints)``: a single pass over
  the constraint list extracting the cached scalars.
- ``_strip_constraint_coefficients(constraints)``: future-use
  helper that replaces coefficient arrays with empty sentinels;
  staged here but not yet wired — doing a full strip needs
  reconciling with ``_subset_policyengine_linear_constraints`` and
  the deferred-stage solver, both of which consume coefficients.

The ``_build_policyengine_calibration_target_ledger`` and
``_select_policyengine_deferred_stage_constraints`` signatures now
accept ``compiled_constraint_metadata`` as an optional kwarg.
``calibrate_policyengine_tables`` precomputes the metadata once
and threads it through both.
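
A hedged sketch of the precompute-once pattern (field names follow the
commit text; the real helper returns richer metadata):

```python
import numpy as np

def precompute_constraint_metadata(constraints):
    # One abs() temporary per constraint, computed exactly once, instead
    # of re-allocating np.abs(coeffs) on every later scan.
    meta = {}
    for name, coeffs in constraints.items():
        abs_c = np.abs(coeffs)
        meta[name] = {
            "active_households": int(np.count_nonzero(abs_c)),
            "coefficient_mass": float(abs_c.sum()),
        }
    return meta

meta = precompute_constraint_metadata({"c1": np.array([0.0, 2.0, -3.0])})
assert meta["c1"]["active_households"] == 2
assert meta["c1"]["coefficient_mass"] == 5.0
```

Downstream consumers then read the cached scalars from the lookup
instead of rescanning the full-length coefficient arrays.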

Tests (5 new, all pass):

- ``test_precomputed_scalars_match_direct_computation``
- ``test_empty_constraints_produce_empty_metadata``
- ``test_active_household_count_uses_lookup``
- ``test_build_records_uses_lookup_when_coefficients_stripped``
  (proves the lookup path produces identical records to the
  coefficient-scan path)
- ``test_records_without_lookup_still_work`` (backward compat)

Expected impact on v9 run memory: ~30 GB saved vs v8, plus any
compressor-overhead multiplier. Alone this probably isn't enough to
fit v9 in 48 GB; the remaining ~50 GB of PE tables + oracle Microsim
+ baseline compiled_constraints still dominate. But it's a safe
first step while the batched-Microsim utility (needed next) gets
built.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ause

Corrected diagnosis from the v9 jetsam kill (203 GB compressed):

- L0 fit itself is fine: isolated script materializes 1.5M × 4000 ×
  5%-density CSR + runs L0 for 2 epochs at 13.5 GB peak RSS in 23 s.
- v9's OOM occurred AFTER "calibration start" logged but before
  "calibration complete" — inside `_resolve_policyengine_calibration_targets`,
  during variable materialization (not the fit).
- Variable materialization runs a full-dataset Microsimulation at
  1.5M-household scale (~25–35 GB) while simultaneously building ~4k
  dense 1.5M-length float64 coefficient arrays (~48 GB). Together
  this is the actual peak.

Fix: add `batch_size` to `materialize_policyengine_us_variables`.
When set, the function loops over disjoint household chunks
(default `None` preserves legacy single-pass path). Each chunk runs
its own Microsimulation (~2–3 GB) and contributes its rows to the
concat'd output. Correct by construction for per-household scalar
variables (all our calibration targets), documented as unsafe for
population-quantile-dependent variables (not targets we use).

Wiring:

- `materialize_policyengine_us_variables(…, batch_size=None)` — new
  kwarg; recurses on chunks when set.
- `_subset_bundle_by_households` / `_concat_bundles` helpers added
  alongside.
- `materialize_policyengine_us_variables_safely(…, batch_size=None)`
  forwards the kwarg.
- `USMicroplexBuildConfig.policyengine_materialize_batch_size` exposes
  it at the top-level config (default `None`).
- Pipeline call site at `us.py:3789` threads
  `self.config.policyengine_materialize_batch_size` into the safely-
  materialize call.
- CLI: new `--policyengine-materialize-batch-size` flag on the
  rebuild-checkpoint runner.
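
The equivalence the tests pin down, sketched with a stand-in for the
Microsimulation stage (function names hypothetical):

```python
import numpy as np
import pandas as pd

def materialize(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for an expensive per-household Microsimulation run.
    out = df.copy()
    out["tax"] = out["income"] * 0.2
    return out

def materialize_batched(df: pd.DataFrame, batch_size: int) -> pd.DataFrame:
    # Disjoint household chunks, each materialized independently, then
    # concatenated. Correct by construction for per-row variables.
    chunks = [df.iloc[i:i + batch_size] for i in range(0, len(df), batch_size)]
    return pd.concat([materialize(c) for c in chunks])

df = pd.DataFrame({"income": np.arange(50, dtype=float)})
# 50 records / batch 17 -> chunks of 17, 17, 16; identical output.
pd.testing.assert_frame_equal(materialize(df), materialize_batched(df, 17))
```

As the commit notes, this equivalence holds only for per-household
scalar variables; a population-quantile-dependent variable computed per
chunk would differ from the single-pass result.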

Tests (3 new, all pass):

- `test_single_pass_vs_batched_equivalent` — full-dataset and
  5-chunk paths produce identical attached variable values.
- `test_batch_size_larger_than_data_is_noop` — batch_size > n is a
  no-op.
- `test_uneven_batch_split` — 50 records / batch 17 → chunks 17, 17,
  16; values correct.

Expected impact on v10 peak: ~48 GB (coefficients) + ~3 GB (per-batch
Microsim) + ~10 GB (entity tables) + ~5 GB (Python accumulated state)
≈ 66 GB. Still over the 48 GB workstation budget unless we ALSO
reduce the coefficient-array baseline — but it's a reasonable next
step and removes the largest Microsim transient. If 66 GB is still
too much, the next lever is switching coefficient storage from dense
np.float64 to float32 (halves) or sparse (likely 10×).

Launch v10 with `--policyengine-materialize-batch-size 100000`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two follow-ups to the batched-materialize commit, per code-simplifier
review:

1. **Duplicate subset helper consolidated.**
   ``_subset_policyengine_tables_by_households`` in
   ``pipelines/us.py`` and ``_subset_bundle_by_households`` in
   ``policyengine/us.py`` were 95% the same logic with cosmetic
   differences. Promoted the canonical version to
   ``policyengine/us.py`` as the public-ish
   ``subset_policyengine_tables_by_households`` (module boundary:
   pipelines depends on policyengine, so the helper belongs there),
   and imported it under the old private name in ``pipelines/us.py``
   for backward-compat with the three existing call sites. The
   duplicate body is gone; ~30 lines deleted, no behavior change.

2. **Redundant "why 48 GB" docstrings trimmed.**
   ``_constraint_active_household_count`` and
   ``_precompute_constraint_metadata`` had 8-line commit-message-
   style docstrings; the commit log already carries that rationale.
   Trimmed to a single sentence each.

3. ``_strip_constraint_coefficients`` kept and tightened to a
   single-pass generator expression — the test at
   ``test_constraint_metadata_lookup.py`` exercises it to pin the
   metadata-lookup fallback path, so it's not dead.

35 regression tests still green. No functional change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
v10's L0 calibration collapsed weights from 442k to 1,511 active
across three stages because stages 2+ reapplied `lambda_l0=1e-4` on
warm-started (already-sparse) weights, compounding pruning past the
useful sparse support. Stage 2+ now drops the sparsity penalty and
only refines residuals; stage 1 still selects the sparse support.
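
The schedule change reduces to a stage-conditional penalty (sketch; the
real config plumbing differs):

```python
def stage_lambda_l0(stage: int, lambda_l0: float = 1e-4) -> float:
    # Stage 1 selects the sparse support with the full penalty; later
    # warm-started stages refine residuals without re-pruning.
    return lambda_l0 if stage == 1 else 0.0

assert stage_lambda_l0(1) == 1e-4
assert stage_lambda_l0(2) == 0.0
assert stage_lambda_l0(3) == 0.0
```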

Adds post-imputation and post-microsim pipeline checkpoints so a
rerun can skip the ~11 h synthesis + imputation + PE-tables build
(loading from post-imputation) or additionally the ~30 min microsim
materialization (loading from post-microsim), leaving only the fit
loop to tune. Wired as `--pipeline-checkpoint-save-post-imputation-path`
and `--pipeline-checkpoint-save-post-microsim-path`. Resume support
lands in a follow-up; saves are sufficient to prevent loss if a late
pipeline stage (write, OOM, sparsity collapse) fails.

Tests:
- `test_pe_l0_deferred_stage_disables_sparsity_penalty`
- `test_hardconcrete_deferred_stage_disables_sparsity_penalty`
- `tests/policyengine/test_us_pipeline_checkpoint.py` (8 tests)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to per-stage lambda_l0 + checkpoint saves. Resume from a
post-imputation checkpoint skips the ~11 h synthesis/imputation +
PE-tables build and reruns only the ~30 min calibration (microsim +
fit), enabling rapid iteration on calibration backends / lambda
schedules / target sets.

- ``recalibrate_policyengine_us_from_checkpoint(config, path)``: load
  a saved post-imputation bundle and dispatch to
  ``pipeline.calibrate_policyengine_tables``. Returns a
  ``USMicroplexRecalibrateResult`` narrower than a full build result —
  synthesis state is unavailable when resuming.
- ``pe_us_recalibrate_from_checkpoint`` CLI: writes parquet for the
  calibrated bundle + a JSON summary. Supports optional post-microsim
  checkpoint save on the recalibration pass.
- v1 only accepts ``post_imputation`` checkpoints. Resume from a
  post-microsim checkpoint requires pickled compiled constraints
  (follow-up).
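The v1 stage guard could look like the following (a hedged sketch — the accepted-stage set and error wording are illustrative, not copied from the repo):

```python
# Hypothetical v1 stage guard: only post_imputation checkpoints resume;
# post_microsim is rejected with a pointer to the follow-up.
ACCEPTED_STAGES = {"post_imputation"}

def validate_checkpoint_stage(stage: str) -> None:
    if stage not in ACCEPTED_STAGES:
        raise ValueError(
            f"cannot resume from {stage!r} checkpoint; v1 only accepts "
            f"{sorted(ACCEPTED_STAGES)} (post_microsim resume needs "
            "pickled compiled constraints, planned as a follow-up)"
        )
```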

Tests: 3 new tests in ``test_recalibrate_from_checkpoint.py``
exercising dispatch, the post-microsim rejection, and the missing-path
error. 34 tests pass in the affected suites.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Realization: post-microsim resume doesn't need pickled constraints.
The bundle saved at that stage already has the materialized target
variables as columns, so ``infer_policyengine_us_variable_bindings``
picks them up, ``policyengine_us_variables_to_materialize`` returns an
empty set, and ``_resolve_policyengine_calibration_targets``
short-circuits past the microsim call. Skipping microsim and going
straight to the L0 fit reduces the wall time to the calibration fit
alone (~1-3 min), instead of the full ~30 min that microsim
materialization would require.

- ``recalibrate_policyengine_us_from_checkpoint`` now accepts both
  ``post_imputation`` and ``post_microsim`` stages.
- CLI help text and module docstring updated.
- Parametrized dispatch test covers both stages; a new test rejects
  unknown stages loaded from a hand-crafted metadata.json.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses the reviewer's B2 ask for downstream-policy-output
validation, not just input-target validation. After calibration the
``policyengine_us.h5`` artifact is ingested by
``policyengine_us.Microsimulation``; this module computes a canonical
set of 2024 aggregates (income_tax, eitc, ctc, snap, ssi, aca_ptc) and
compares them against IRS/USDA/SSA/CMS published totals. Each
benchmark has a cited source — no magic numbers.

- ``DownstreamBenchmark`` record carrying computed, benchmark,
  unit, source, and derived abs/rel error.
- ``DOWNSTREAM_BENCHMARKS_2024`` canonical 2024 benchmark set
  (six headline aggregates, each sourced).
- ``compute_downstream_aggregates(dataset_path, period)`` runs
  ``policyengine_us.Microsimulation`` on an h5 and returns per-
  variable weighted sums.
- ``compute_downstream_comparison(aggs, benchmarks)`` joins
  computed values to their benchmarks with signed relative error.
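The record shape above can be sketched as a small dataclass (field names follow the description; the real class and its derived-error implementation may differ):

```python
from dataclasses import dataclass

# Minimal sketch of a benchmark record carrying computed, benchmark,
# unit, source, and derived abs/rel error, as described above.
@dataclass(frozen=True)
class DownstreamBenchmark:
    variable: str
    computed: float   # model aggregate, $B
    benchmark: float  # published admin total, $B
    unit: str
    source: str

    @property
    def abs_error(self) -> float:
        return abs(self.computed - self.benchmark)

    @property
    def rel_error(self) -> float:
        """Signed relative error against the published total."""
        return (self.computed - self.benchmark) / self.benchmark

snap = DownstreamBenchmark("snap", 101.8, 100.0, "USD billions", "USDA FY2024")
# snap.rel_error is +1.8% — matching the comparison-join semantics above
```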

Tests: 7 new unit tests covering record fields, JSON serialization,
zero-benchmark guard, canonical-set completeness, source-presence
invariant, and the comparison join.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The one-shot ``python -c '...'`` run on the v11 output got SIGKILL'd
before producing output — Python buffered stdout was lost on signal,
and no per-variable state was saved to disk. This script runs the
same computation with ``python -u`` for line-buffered stdout and
writes a ``<output>.partial.json`` after each variable so a late
kill still leaves N-of-6 aggregates recoverable.
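The crash-tolerant pattern is simple enough to sketch (the loop structure below is an illustration; the script's actual variable list and compute call differ):

```python
import json
from pathlib import Path

# Flush a partial JSON file after every variable so a SIGKILL mid-run
# still leaves the completed aggregates recoverable on disk.
def run_aggregates(compute, variables, output: Path):
    results = {}
    partial = output.with_suffix(".partial.json")
    for var in variables:
        results[var] = compute(var)
        partial.write_text(json.dumps(results))  # survives a later kill
    output.write_text(json.dumps(results))       # final full write
    return results
```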

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
``scripts/run_b2_batched.py`` computes an aggregate by subsetting the
PE-US h5 into household-size chunks, running a fresh
``Microsimulation`` per chunk, and summing. Works around the
``income_tax`` / ``aca_ptc`` OOM at 1.5M households where deep
dependency chains materialize too many intermediate arrays. Correct
entity subsetting: for each group entity (tax_unit, spm_unit, family,
marital_unit), the chunk's group-unit set is derived from
``person_<entity>_id`` of persons in the chunk's households, then
masked back onto the group-entity id array.

Validated end-to-end on ``ssi``: batched 4×500k households
reproduces the unbatched aggregate exactly ($108.23B).
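The entity-subsetting rule can be illustrated on toy arrays (ids below are invented; only the masking logic mirrors the description above):

```python
import numpy as np

# Sketch of the group-entity subsetting rule: the chunk's tax-unit set is
# whatever tax units the chunk's persons belong to, then masked back onto
# the tax-unit id array.
person_household_id = np.array([0, 0, 1, 2, 2])
person_tax_unit_id  = np.array([10, 10, 11, 12, 12])
tax_unit_id         = np.array([10, 11, 12, 13])

chunk_households = [0, 2]                       # households in this chunk
person_in_chunk = np.isin(person_household_id, chunk_households)
chunk_tax_units = np.unique(person_tax_unit_id[person_in_chunk])
tax_unit_mask = np.isin(tax_unit_id, chunk_tax_units)
# tax_unit_mask → [True, False, True, False]: tax units 10 and 12 travel
# with the chunk; 11 and 13 belong to other chunks
```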

``scripts/run_b2_validation_single_var.py`` is a thinner runner that
assumes the variable fits in one pass; used for the cheap aggregates
(eitc, snap, ssi, ctc).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Full set of six 2024 tax-benefit aggregates computed on the
v11-per-stage-lambda calibrated frame against published IRS / USDA /
SSA / CMS benchmarks:

- income_tax: $2,089.7B vs $2,400B benchmark (-12.9%)
- eitc:      $64.2B  vs $64B   benchmark ( +0.3%)
- snap:      $101.8B vs $100B  benchmark ( +1.8%)
- ctc:       $151.9B vs $115B  benchmark (+32.1%)
- ssi:       $108.2B vs $66B   benchmark (+64.0%)
- aca_ptc:   $14.1B  vs $60B   benchmark (-76.4%)

Three headline aggregates (income_tax, eitc, snap) reconcile to the
admin totals within single-digit-to-low-teens relative error; three
don't, and each points to a specific synthesis-step shortfall that a
follow-up calibration pass can address by adding direct targets on
the disbursed aggregate.
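The table's percentages follow from the signed relative error `(computed - benchmark) / benchmark`; a quick check on four of the rows (ssi and aca_ptc are omitted here because their quoted percentages were computed from unrounded aggregates, not the $B figures shown):

```python
# Signed relative error in percent, rounded to one decimal.
def rel_error_pct(computed: float, benchmark: float) -> float:
    return round(100 * (computed - benchmark) / benchmark, 1)

assert rel_error_pct(2089.7, 2400) == -12.9   # income_tax
assert rel_error_pct(64.2, 64) == 0.3         # eitc
assert rel_error_pct(101.8, 100) == 1.8       # snap
assert rel_error_pct(151.9, 115) == 32.1      # ctc
```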

Addresses paper reviewer B2 (add downstream-tax-output validation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
downstream.py
- Replace reliance on MicroSeries ``.sum()`` semantics with an
  explicit ``compute_downstream_weighted_aggregate`` helper that pulls
  the correct entity weight variable (tax_unit_weight /
  spm_unit_weight / person_weight / ...) from PE's variable metadata
  and takes the numpy dot product. Same numerics as ``.sum()`` on the
  v11 artifact, but test-covered and robust to simulator changes.
- ``ENTITY_WEIGHT_VARIABLES`` table maps PE entity keys to weight
  variable names.
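The explicit-weighting idea reduces to a dot product against the entity's weight column; a sketch (the table entries and arrays below are illustrative, and the real helper resolves the weight variable from PE's metadata rather than a literal dict):

```python
import numpy as np

# Map PE entity keys to their weight variable names, then take an
# explicit weighted sum instead of relying on MicroSeries .sum().
ENTITY_WEIGHT_VARIABLES = {
    "tax_unit": "tax_unit_weight",
    "spm_unit": "spm_unit_weight",
    "household": "household_weight",
    "person": "person_weight",
}

def weighted_aggregate(values: np.ndarray, weights: np.ndarray) -> float:
    return float(np.dot(values, weights))

eitc = np.array([0.0, 1200.0, 450.0])   # per-tax-unit values
w = np.array([150.0, 200.0, 100.0])     # tax_unit_weight
total = weighted_aggregate(eitc, w)     # 0*150 + 1200*200 + 450*100
```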

RegimeAwareDonorImputer
- Add ``seed`` constructor arg and deterministic
  ``_reset_prediction_rngs`` during ``generate`` so repeated calls
  with the same seed produce byte-identical output.
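The determinism pattern amounts to re-seeding the prediction RNGs at the top of every `generate` call; a minimal sketch (the class below is a stand-in, not `RegimeAwareDonorImputer` itself):

```python
import numpy as np

# Reset the prediction RNG from the stored seed on every generate() call
# so repeated calls with the same seed produce identical output.
class SeededImputer:
    def __init__(self, seed: int = 0):
        self.seed = seed

    def _reset_prediction_rngs(self) -> np.random.Generator:
        return np.random.default_rng(self.seed)

    def generate(self, n: int) -> np.ndarray:
        rng = self._reset_prediction_rngs()  # fresh, seed-determined RNG
        return rng.normal(size=n)

imp = SeededImputer(seed=42)
a, b = imp.generate(4), imp.generate(4)
# a and b are identical because the RNG is reset per call
```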

scripts/run_b2_batched.py
- Classify each h5 variable by PE's variable metadata first, then
  fall back to length matching; raises on ambiguous length matches
  rather than silently picking one. Added structural-variable
  overrides for IDs / weights / link columns.
- Wire batched runner's per-chunk aggregate through
  ``compute_downstream_weighted_aggregate``.

scripts/run_b2_validation.py / run_b2_validation_single_var.py
- Use ``compute_downstream_weighted_aggregate`` for consistency with
  the other callers and explicit weighting.

Tests: 3 new entity-resolution tests in test_run_b2_batched.py; 3 new
weighted-aggregate tests in test_downstream.py; 2 new
seed-determinism tests in test_regime_aware_donor_imputer.py. 21
tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit b2b6830 into main Apr 25, 2026
1 check passed