Add batch_size gradient accumulation; release pandas matrix after init#99

Merged

MaxGhenis merged 1 commit into main from batch-calibration-memory-fix, Apr 18, 2026

Conversation

@MaxGhenis (Contributor)

Summary

Two memory-scale improvements to Calibration for workloads with >1 million records (microplex-us v7 pipeline: 1.5M households × 4,000+ constraints):

  1. Release the pandas DataFrame after __init__. Calibration now builds a single float32 torch copy of the user's estimate_matrix (self.estimate_matrix_tensor) during __init__ and sets self.original_estimate_matrix = None. Downstream code in hyperparameter_tuning, evaluation, exclude_targets, and assess_analytical_solution reads the cached tensor instead of re-materializing from .values. Saves ~3 GB at v7 scale.

  2. Add batch_size for gradient accumulation. When batch_size is set, the chi-squared loss gradient is computed via a two-pass scheme: accumulate $S_j$ under torch.no_grad(), compute a per-target coefficient $c_j = 2\,(S_j - t_j + 1)\,/\,\mathrm{denom}_j^2\,/\,n_{\mathrm{targets}} \cdot \nu_j$ using the same clamped denominator as metrics.loss(), then for each batch call virtual_loss = (coef · (exp(w[b]) @ A[b])).sum().backward(). Peak activation memory drops from $O(N \cdot k)$ to $O(B \cdot k)$.
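A minimal sketch of the first change (constructor simplified; the attribute names follow the PR, everything else is illustrative):

```python
import numpy as np
import pandas as pd
import torch


class Calibration:
    """Sketch of the release-after-__init__ pattern."""

    def __init__(self, estimate_matrix: pd.DataFrame):
        # Build the single float32 torch copy that downstream code
        # (hyperparameter_tuning, evaluation, ...) reads instead of
        # re-materializing from .values.
        self.estimate_matrix_tensor = torch.tensor(
            estimate_matrix.values, dtype=torch.float32
        )
        self.target_names = list(estimate_matrix.columns)
        # Release the float64 pandas copy so only one matrix stays live.
        self.original_estimate_matrix = None


df = pd.DataFrame(
    np.random.default_rng(0).random((1_000, 8)),
    columns=[f"t{j}" for j in range(8)],
)
cal = Calibration(df)
```

This avoids holding both copies simultaneously (the commit message puts the figures at 6 GB float64 plus 3 GB float32 at 1.5M × 500).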

Together with a float32-at-construction change in microplex-us (CosilicoAI/microplex-us#TODO), this unblocks calibration at the v7 scale, where the previous RSS trajectory triggered macOS jetsam kills.

Backward compatibility

  • Calibration.original_estimate_matrix is now None after construction. The equivalent float32 data is on Calibration.estimate_matrix_tensor. I verified no readers of original_estimate_matrix exist outside microcalibrate's own code (searched policyengine-us-data, policyengine-uk-data, policyengine-data, microplex-us).
  • batch_size=None (default) preserves the existing full-batch path bit-for-bit. batch_size >= n_records degenerates to full-batch.
  • batch_size with regularize_with_l0=True raises ValueError — the L0 sparse loop uses a different objective and is not yet batched.

Correctness

The batched gradient is exactly equal to the full-batch gradient (not approximate) under the same dropout realization. Derivation, conditions, and reviewer sign-off captured in the commit message. coef uses _safe_denominator(targets) so batched and full-batch agree on targets near $-1$.
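For readers checking the claim, the chain-rule argument is short (a sketch in the summary's notation, with $c_j$ the coefficient defined above, held constant during phase 2):

$$\frac{\partial L}{\partial w_i} \;=\; \sum_j \frac{\partial L}{\partial S_j}\,\frac{\partial S_j}{\partial w_i} \;=\; \sum_j c_j\, e^{w_i} A_{ij},$$

which is exactly the gradient of the linear virtual loss $\sum_j c_j \,(e^{w} A)_j$; and since $S_j$ decomposes as a sum over disjoint batches, per-batch backward calls accumulate the same total.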

Test plan

  • pytest tests/ — 49 passed (38 pre-existing + 9 new in test_batch_reweight.py + 2 new in test_memory.py).
  • New equivalence tests cover: batch_size=None determinism; batch_size=100 vs full-batch within 1e-4; ragged tail (batch_size=333, n=1000); batch_size >= n degeneracy; batch_size=1 extreme; equivalence under dropout_rate=0.3; equivalence under non-trivial normalization_factor; equivalence under excluded_targets; L0+batch guardrail.
  • black . -l 79 and isort --profile black src/ clean.
  • Downstream: rerun the microplex-us v7 pipeline with batch_size=100_000 and confirm RSS peak stays under 40 GB. (Will verify in a follow-up and post numbers in this PR.)

🤖 Generated with Claude Code

Commit message:

Two memory-scale improvements for Calibration at >1e6 records:

1. Release the pandas DataFrame after __init__. Calibration now
   builds a single float32 torch copy of the user's estimate_matrix
   (self.estimate_matrix_tensor) during __init__ and sets
   self.original_estimate_matrix = None. Downstream code in
   hyperparameter_tuning, evaluation, exclude_targets, and
   assess_analytical_solution reads the cached tensor instead of
   re-materializing from .values. This avoids holding both a float64
   pandas (6 GB at 1.5M x 500) and a float32 torch (3 GB) copy
   simultaneously.

2. Add batch_size to Calibration and reweight(). When set, the
   chi-squared loss gradient is accumulated over disjoint record
   batches via a two-pass scheme:
   - Phase 1: accumulate S_j = sum_i w_i * A_{ij} per target under
     torch.no_grad() (peak memory O(batch_size * n_targets)).
   - Compute per-target coefficient
     c_j = d(loss)/d(S_j) = 2 * ((S_j - t_j + 1) / denom_j^2) / n_targets
           * normalization_factor_j
     using the same clamped denominator as metrics.loss(), so batched
     and full-batch agree on targets near -1.
   - Phase 2: per batch, compute
     virtual_loss_b = coef · (exp(weights_log[b]) @ A[b])
     and call .backward() to accumulate gradients into weights_log.
     retain_graph=True on all but the last batch preserves the
     weights_log dropout computation graph across the inner loop.

   Single-mask semantics: dropout is applied once to the full
   weights_log tensor before the phase 2 loop, then sliced per batch,
   so batched gradient equals full-batch gradient exactly (not
   approximately) under the same dropout realization.

   Not supported: regularize_with_l0=True with batch_size (the L0
   sparse loop uses a different objective and is not yet batched).
   Raises ValueError if both are set.
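The two phases can be sketched end to end. This is an illustration, not microcalibrate's code: _safe_denominator is a guessed stand-in, and dropout and normalization_factor are omitted (so no retain_graph is needed here, since each batch rebuilds its own small graph):

```python
import torch


def _safe_denominator(targets: torch.Tensor, eps: float = 1e-2) -> torch.Tensor:
    # Assumed stand-in: keep |t_j + 1| away from zero so targets near
    # -1 behave identically on both paths.
    return torch.clamp((targets + 1).abs(), min=eps)


def full_batch_grad(w_log, A, targets):
    # Reference path: materialize the full (n x k) graph at once.
    w = w_log.clone().requires_grad_(True)
    S = torch.exp(w) @ A
    loss = (((S - targets + 1) / _safe_denominator(targets)) ** 2).mean()
    loss.backward()
    return w.grad


def batched_grad(w_log, A, targets, batch_size):
    w = w_log.clone().requires_grad_(True)
    n, k = A.shape
    with torch.no_grad():
        # Phase 1: accumulate S_j per target at O(batch_size * k) peak.
        S = torch.zeros(k)
        for b in range(0, n, batch_size):
            S += torch.exp(w[b : b + batch_size]) @ A[b : b + batch_size]
        # c_j = d(loss)/d(S_j), evaluated once at the full S.
        coef = 2.0 * (S - targets + 1) / _safe_denominator(targets) ** 2 / k
    for b in range(0, n, batch_size):
        # Phase 2: a linear virtual loss whose gradient w.r.t. w is
        # exactly coef_j * exp(w_i) * A_ij, accumulated into w.grad.
        virtual = (
            coef * (torch.exp(w[b : b + batch_size]) @ A[b : b + batch_size])
        ).sum()
        virtual.backward()
    return w.grad


torch.manual_seed(0)
A = torch.rand(100, 6)
w_log = torch.randn(100) * 0.1
targets = torch.rand(6)
g_full = full_batch_grad(w_log, A, targets)
g_batch = batched_grad(w_log, A, targets, batch_size=33)  # ragged tail of 1
```

Because coef is computed from the same full-data S, the two gradients agree up to float32 rounding, not just in expectation.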

Tests (TDD):
- tests/test_memory.py: original_estimate_matrix is None post-init;
  estimate_matrix_tensor is present with correct dtype/shape;
  calibrate() still converges after the release.
- tests/test_batch_reweight.py: 9 tests covering
  (a) full-batch determinism,
  (b) batch_size=100 matches full-batch within 1e-4 rel err,
  (c) ragged tail (batch_size=333 with n=1000),
  (d) batch_size >= n degenerates exactly,
  (e) batch_size=1 extreme,
  (f) equivalence under dropout_rate=0.3 (single-mask invariant),
  (g) equivalence under a non-trivial normalization_factor,
  (h) equivalence under excluded_targets,
  (i) batch_size + regularize_with_l0 raises.
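Item (c)'s ragged-tail behavior is easy to see with a boundary helper (illustrative only; the real code keeps a batch_starts list, per the reviewer notes):

```python
def batch_slices(n_records: int, batch_size: int) -> list[tuple[int, int]]:
    # Half-open [start, stop) slices covering every record exactly once;
    # the final slice is short when batch_size does not divide n_records.
    starts = list(range(0, n_records, batch_size))
    return [(s, min(s + batch_size, n_records)) for s in starts]


# batch_size=333, n=1000: three full batches plus a tail of one record.
print(batch_slices(1000, 333))   # [(0, 333), (333, 666), (666, 999), (999, 1000)]
# batch_size >= n degenerates to a single full batch.
print(batch_slices(1000, 4096))  # [(0, 1000)]
```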

Reviewer feedback addressed (three subagent reviews, 2026-04-18):
- Methodology (accept): math derivation verified. E1 fix —
  coef uses _safe_denominator(targets) identical to metrics.loss,
  so targets = -1 no longer diverges between paths.
- Reproducibility (minor revisions): changelog fragments added;
  _full_estimate_matrix_tensor renamed to estimate_matrix_tensor
  (public-ish, since cross-module); L0+batch_size interaction now
  raises instead of silently running full-batch; edge-case tests
  added (dropout, normalization, excluded, batch=1); weakref-based
  memory test replaced with direct attribute-state assertions.
- Code-simplifier: unused `field` import dropped in adapter,
  DataFrame construction collapsed to dict comprehension,
  redundant isinstance guard removed, batch_starts list replaces
  recomputed n_batches.

All 49 upstream tests pass under `pytest tests/`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel bot commented Apr 18, 2026

microcalibrate deployment: Ready (Preview), updated Apr 18, 2026 3:16pm (UTC)

MaxGhenis merged commit 7e31c28 into main, Apr 18, 2026; 6 checks passed.
MaxGhenis deleted the batch-calibration-memory-fix branch, April 18, 2026 15:20.