Add batch_size gradient accumulation; release pandas matrix after init#99
Merged
Add batch_size gradient accumulation; release pandas matrix after init#99
Conversation
Two memory-scale improvements for Calibration at >1e6 records:
1. Release the pandas DataFrame after __init__. Calibration now
builds a single float32 torch copy of the user's estimate_matrix
(self.estimate_matrix_tensor) during __init__ and sets
self.original_estimate_matrix = None. Downstream code in
hyperparameter_tuning, evaluation, exclude_targets, and
assess_analytical_solution reads the cached tensor instead of
re-materializing from .values. This avoids holding both a float64
pandas (6 GB at 1.5M x 500) and a float32 torch (3 GB) copy
simultaneously.
2. Add batch_size to Calibration and reweight(). When set, the
chi-squared loss gradient is accumulated over disjoint record
batches via a two-pass scheme:
- Phase 1: accumulate S_j = sum_i w_i * A_{ij} per target under
torch.no_grad() (peak memory O(batch_size * n_targets)).
- Compute per-target coefficient
c_j = d(loss)/d(S_j) = 2 * ((S_j - t_j + 1) / denom_j^2) / n_targets
* normalization_factor_j
using the same clamped denominator as metrics.loss(), so batched
and full-batch agree on targets near -1.
- Phase 2: per batch, compute
virtual_loss_b = coef · (exp(weights_log[b]) @ A[b])
and call .backward() to accumulate gradients into weights_log.
retain_graph=True on all but the last batch preserves the
weights_log dropout computation graph across the inner loop.
Single-mask semantics: dropout is applied once to the full
weights_log tensor before the phase 2 loop, then sliced per batch,
so batched gradient equals full-batch gradient exactly (not
approximately) under the same dropout realization.
Not supported: regularize_with_l0=True with batch_size (the L0
sparse loop uses a different objective and is not yet batched).
Raises ValueError if both are set.
Tests (TDD):
- tests/test_memory.py: original_estimate_matrix is None post-init;
estimate_matrix_tensor is present with correct dtype/shape;
calibrate() still converges after the release.
- tests/test_batch_reweight.py: 9 tests covering
(a) full-batch determinism,
(b) batch_size=100 matches full-batch within 1e-4 rel err,
(c) ragged tail (batch_size=333 with n=1000),
(d) batch_size >= n degenerates exactly,
(e) batch_size=1 extreme,
(f) equivalence under dropout_rate=0.3 (single-mask invariant),
(g) equivalence under a non-trivial normalization_factor,
(h) equivalence under excluded_targets,
(i) batch_size + regularize_with_l0 raises.
Reviewer feedback addressed (three subagent reviews, 2026-04-18):
- Methodology (accept): math derivation verified. E1 fix —
coef uses _safe_denominator(targets) identical to metrics.loss,
so targets = -1 no longer diverges between paths.
- Reproducibility (minor revisions): changelog fragments added;
_full_estimate_matrix_tensor renamed to estimate_matrix_tensor
(public-ish, since cross-module); L0+batch_size interaction now
raises instead of silently running full-batch; edge-case tests
added (dropout, normalization, excluded, batch=1); weakref-based
memory test replaced with direct attribute-state assertions.
- Code-simplifier: unused `field` import dropped in adapter,
DataFrame construction collapsed to dict comprehension,
redundant isinstance guard removed, batch_starts list replaces
recomputed n_batches.
All 49 upstream tests pass under `pytest tests/`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Two memory-scale improvements to
Calibrationfor workloads with >1 million records (microplex-us v7 pipeline: 1.5M households × 4,000+ constraints):Release the pandas DataFrame after
__init__.Calibrationnow builds a singlefloat32torch copy of the user'sestimate_matrix(self.estimate_matrix_tensor) during__init__and setsself.original_estimate_matrix = None. Downstream code inhyperparameter_tuning,evaluation,exclude_targets, andassess_analytical_solutionreads the cached tensor instead of re-materializing from.values. Saves ~3 GB at v7 scale.Add$O(N \cdot k)$ to $O(B \cdot k)$ .
batch_sizefor gradient accumulation. Whenbatch_sizeis set, the chi-squared loss gradient is computed via a two-pass scheme: accumulateS_jundertorch.no_grad(), compute a per-target coefficient $c_j = 2 (S_j - t_j + 1) / \text{denom}j^2 / n\text{targets} \cdot \nu_j$ using the same clamped denominator asmetrics.loss(), then for each batch callvirtual_loss = (coef · (exp(w[b]) @ A[b])).sum().backward(). Peak activation drops fromTogether with a float32-at-construction change in
microplex-us(CosilicoAI/microplex-us#TODO), this unblocks calibration at the v7 scale where the previous RSS trajectory crashed macOS jetsam.Backward compatibility
Calibration.original_estimate_matrixis nowNoneafter construction. The equivalent float32 data is onCalibration.estimate_matrix_tensor. I verified no readers oforiginal_estimate_matrixexist outsidemicrocalibrate's own code (searchedpolicyengine-us-data,policyengine-uk-data,policyengine-data,microplex-us).batch_size=None(default) preserves the existing full-batch path bit-for-bit.batch_size >= n_recordsdegenerates to full-batch.batch_sizewithregularize_with_l0=TrueraisesValueError— the L0 sparse loop uses a different objective and is not yet batched.Correctness
The batched gradient is exactly equal to the full-batch gradient (not approximate) under the same dropout realization. Derivation, conditions, and reviewer sign-off captured in the commit message.$-1$ .
coefuses_safe_denominator(targets)so batched and full-batch agree on targets nearTest plan
pytest tests/— 49 passed (40 pre-existing + 9 new intest_batch_reweight.py+ 2 new intest_memory.py).batch_size=Nonedeterminism;batch_size=100vs full-batch within 1e-4; ragged tail (batch_size=333,n=1000);batch_size >= ndegeneracy;batch_size=1extreme; equivalence underdropout_rate=0.3; equivalence under non-trivialnormalization_factor; equivalence underexcluded_targets; L0+batch guardrail.black . -l 79andisort --profile black src/clean.batch_size=100_000and confirm RSS peak stays under 40 GB. (Will verify in a follow-up and post numbers in this PR.)🤖 Generated with Claude Code