Add batch_size gradient accumulation; release pandas matrix after init#99

Merged

MaxGhenis merged 1 commit into main from batch-calibration-memory-fix, Apr 18, 2026

Conversation

@MaxGhenis (Contributor)

Summary

Two memory-scale improvements to Calibration for workloads with >1 million records (microplex-us v7 pipeline: 1.5M households × 4,000+ constraints):

  1. Release the pandas DataFrame after __init__. Calibration now builds a single float32 torch copy of the user's estimate_matrix (self.estimate_matrix_tensor) during __init__ and sets self.original_estimate_matrix = None. Downstream code in hyperparameter_tuning, evaluation, exclude_targets, and assess_analytical_solution reads the cached tensor instead of re-materializing from .values. Saves ~3 GB at v7 scale.

  2. Add batch_size for gradient accumulation. When batch_size is set, the chi-squared loss gradient is computed via a two-pass scheme: accumulate $S_j$ under torch.no_grad(), compute a per-target coefficient $c_j = 2\,(S_j - t_j + 1)\,/\,\mathrm{denom}_j^2\,/\,n_{\mathrm{targets}} \cdot \nu_j$ using the same clamped denominator as metrics.loss(), then for each batch call virtual_loss = (coef · (exp(w[b]) @ A[b])).sum().backward(). Peak activation memory drops from $O(N \cdot k)$ to $O(B \cdot k)$.
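A minimal sketch of the first change (constructor simplified; the attribute names follow the PR, everything else is illustrative):

```python
import numpy as np
import pandas as pd
import torch


class Calibration:
    """Sketch of the release-after-__init__ pattern."""

    def __init__(self, estimate_matrix: pd.DataFrame):
        # Build the single float32 torch copy that downstream code
        # (hyperparameter_tuning, evaluation, ...) reads instead of
        # re-materializing from .values.
        self.estimate_matrix_tensor = torch.tensor(
            estimate_matrix.values, dtype=torch.float32
        )
        self.target_names = list(estimate_matrix.columns)
        # Release the float64 pandas copy so only one matrix stays live.
        self.original_estimate_matrix = None


df = pd.DataFrame(
    np.random.default_rng(0).random((1_000, 8)),
    columns=[f"t{j}" for j in range(8)],
)
cal = Calibration(df)
```

This avoids holding both copies simultaneously (the commit message puts the figures at 6 GB float64 plus 3 GB float32 at 1.5M × 500).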

Together with a float32-at-construction change in microplex-us (CosilicoAI/microplex-us#TODO), this unblocks calibration at the v7 scale, where the previous RSS trajectory triggered macOS jetsam kills.

Backward compatibility

  • Calibration.original_estimate_matrix is now None after construction. The equivalent float32 data is on Calibration.estimate_matrix_tensor. I verified no readers of original_estimate_matrix exist outside microcalibrate's own code (searched policyengine-us-data, policyengine-uk-data, policyengine-data, microplex-us).
  • batch_size=None (default) preserves the existing full-batch path bit-for-bit. batch_size >= n_records degenerates to full-batch.
  • batch_size with regularize_with_l0=True raises ValueError — the L0 sparse loop uses a different objective and is not yet batched.

Correctness

The batched gradient is exactly equal to the full-batch gradient (not approximate) under the same dropout realization. Derivation, conditions, and reviewer sign-off captured in the commit message. coef uses _safe_denominator(targets) so batched and full-batch agree on targets near $-1$.
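For readers checking the claim, the chain-rule argument is short (a sketch in the summary's notation, with $c_j$ the coefficient defined above, held constant during phase 2):

$$\frac{\partial L}{\partial w_i} \;=\; \sum_j \frac{\partial L}{\partial S_j}\,\frac{\partial S_j}{\partial w_i} \;=\; \sum_j c_j\, e^{w_i} A_{ij},$$

which is exactly the gradient of the linear virtual loss $\sum_j c_j \,(e^{w} A)_j$; and since $S_j$ decomposes as a sum over disjoint batches, per-batch backward calls accumulate the same total.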

Test plan

  • pytest tests/ — 49 passed (38 pre-existing + 9 new in test_batch_reweight.py + 2 new in test_memory.py).
  • New equivalence tests cover: batch_size=None determinism; batch_size=100 vs full-batch within 1e-4; ragged tail (batch_size=333, n=1000); batch_size >= n degeneracy; batch_size=1 extreme; equivalence under dropout_rate=0.3; equivalence under non-trivial normalization_factor; equivalence under excluded_targets; L0+batch guardrail.
  • black . -l 79 and isort --profile black src/ clean.
  • Downstream: rerun the microplex-us v7 pipeline with batch_size=100_000 and confirm RSS peak stays under 40 GB. (Will verify in a follow-up and post numbers in this PR.)

🤖 Generated with Claude Code

Commit message:

Two memory-scale improvements for Calibration at >1e6 records:

1. Release the pandas DataFrame after __init__. Calibration now
   builds a single float32 torch copy of the user's estimate_matrix
   (self.estimate_matrix_tensor) during __init__ and sets
   self.original_estimate_matrix = None. Downstream code in
   hyperparameter_tuning, evaluation, exclude_targets, and
   assess_analytical_solution reads the cached tensor instead of
   re-materializing from .values. This avoids holding both a float64
   pandas (6 GB at 1.5M x 500) and a float32 torch (3 GB) copy
   simultaneously.

2. Add batch_size to Calibration and reweight(). When set, the
   chi-squared loss gradient is accumulated over disjoint record
   batches via a two-pass scheme:
   - Phase 1: accumulate S_j = sum_i w_i * A_{ij} per target under
     torch.no_grad() (peak memory O(batch_size * n_targets)).
   - Compute per-target coefficient
     c_j = d(loss)/d(S_j) = 2 * ((S_j - t_j + 1) / denom_j^2) / n_targets
           * normalization_factor_j
     using the same clamped denominator as metrics.loss(), so batched
     and full-batch agree on targets near -1.
   - Phase 2: per batch, compute
     virtual_loss_b = coef · (exp(weights_log[b]) @ A[b])
     and call .backward() to accumulate gradients into weights_log.
     retain_graph=True on all but the last batch preserves the
     weights_log dropout computation graph across the inner loop.

   Single-mask semantics: dropout is applied once to the full
   weights_log tensor before the phase 2 loop, then sliced per batch,
   so batched gradient equals full-batch gradient exactly (not
   approximately) under the same dropout realization.

   Not supported: regularize_with_l0=True with batch_size (the L0
   sparse loop uses a different objective and is not yet batched).
   Raises ValueError if both are set.
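The two phases can be sketched end to end. This is an illustration, not microcalibrate's code: _safe_denominator is a guessed stand-in, and dropout and normalization_factor are omitted (so no retain_graph is needed here, since each batch rebuilds its own small graph):

```python
import torch


def _safe_denominator(targets: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # Assumed stand-in: keep |t_j + 1| away from zero so targets near
    # -1 behave identically on both paths.
    return torch.clamp((targets + 1).abs(), min=eps)


def full_batch_grad(w_log, A, targets):
    # Reference path: materialize the full (n x k) graph at once.
    w = w_log.clone().requires_grad_(True)
    S = torch.exp(w) @ A
    loss = (((S - targets + 1) / _safe_denominator(targets)) ** 2).mean()
    loss.backward()
    return w.grad


def batched_grad(w_log, A, targets, batch_size):
    w = w_log.clone().requires_grad_(True)
    n, k = A.shape
    with torch.no_grad():
        # Phase 1: accumulate S_j per target at O(batch_size * k) peak.
        S = torch.zeros(k)
        for b in range(0, n, batch_size):
            S += torch.exp(w[b : b + batch_size]) @ A[b : b + batch_size]
        # c_j = d(loss)/d(S_j), evaluated once at the full S.
        coef = 2.0 * (S - targets + 1) / _safe_denominator(targets) ** 2 / k
    for b in range(0, n, batch_size):
        # Phase 2: a linear virtual loss whose gradient w.r.t. w is
        # exactly coef_j * exp(w_i) * A_ij, accumulated into w.grad.
        virtual = (
            coef * (torch.exp(w[b : b + batch_size]) @ A[b : b + batch_size])
        ).sum()
        virtual.backward()
    return w.grad


torch.manual_seed(0)
A = torch.rand(100, 6)
w_log = torch.randn(100) * 0.1
targets = torch.rand(6)
g_full = full_batch_grad(w_log, A, targets)
g_batch = batched_grad(w_log, A, targets, batch_size=33)  # ragged tail of 1
```

Because coef is computed from the same full-data S, the two gradients agree up to float32 rounding, not just in expectation.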

Tests (TDD):
- tests/test_memory.py: original_estimate_matrix is None post-init;
  estimate_matrix_tensor is present with correct dtype/shape;
  calibrate() still converges after the release.
- tests/test_batch_reweight.py: 9 tests covering
  (a) full-batch determinism,
  (b) batch_size=100 matches full-batch within 1e-4 rel err,
  (c) ragged tail (batch_size=333 with n=1000),
  (d) batch_size >= n degenerates exactly,
  (e) batch_size=1 extreme,
  (f) equivalence under dropout_rate=0.3 (single-mask invariant),
  (g) equivalence under a non-trivial normalization_factor,
  (h) equivalence under excluded_targets,
  (i) batch_size + regularize_with_l0 raises.
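Item (c)'s ragged-tail behavior is easy to see with a boundary helper (illustrative only; the real code keeps a batch_starts list, per the reviewer notes):

```python
def batch_slices(n_records: int, batch_size: int) -> list[tuple[int, int]]:
    # Half-open [start, stop) slices covering every record exactly once;
    # the final slice is short when batch_size does not divide n_records.
    starts = list(range(0, n_records, batch_size))
    return [(s, min(s + batch_size, n_records)) for s in starts]


# batch_size=333, n=1000: three full batches plus a tail of one record.
print(batch_slices(1000, 333))   # [(0, 333), (333, 666), (666, 999), (999, 1000)]
# batch_size >= n degenerates to a single full batch.
print(batch_slices(1000, 4096))  # [(0, 1000)]
```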

Reviewer feedback addressed (three subagent reviews, 2026-04-18):
- Methodology (accept): math derivation verified. E1 fix —
  coef uses _safe_denominator(targets) identical to metrics.loss,
  so targets = -1 no longer diverges between paths.
- Reproducibility (minor revisions): changelog fragments added;
  _full_estimate_matrix_tensor renamed to estimate_matrix_tensor
  (public-ish, since cross-module); L0+batch_size interaction now
  raises instead of silently running full-batch; edge-case tests
  added (dropout, normalization, excluded, batch=1); weakref-based
  memory test replaced with direct attribute-state assertions.
- Code-simplifier: unused `field` import dropped in adapter,
  DataFrame construction collapsed to dict comprehension,
  redundant isinstance guard removed, batch_starts list replaces
  recomputed n_batches.

All 49 upstream tests pass under `pytest tests/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel bot commented Apr 18, 2026

microcalibrate deployment: Ready (Preview), updated Apr 18, 2026 3:16pm (UTC)

MaxGhenis merged commit 7e31c28 into main, Apr 18, 2026; 6 checks passed.
MaxGhenis deleted the batch-calibration-memory-fix branch, April 18, 2026 15:20.