Add _build_sparse_constraint_system for O(nnz) calibration build #7
Merged
Conversation
The existing `_build_linear_constraint_system` materializes a dense numpy array of shape (n_targets, n_records) via `np.vstack(rows)`. At microplex-us v7 scale (~1.5M records × ~4k constraints, mostly marginal indicators that are 95%+ zero), this is ~24 GB of dense float64 allocated just to represent ~100-500 MB of nonzero data. The pipeline's L0 calibrator (microplex_us.pipelines.pe_l0) wraps the dense matrix immediately in `sp.csr_matrix(A)`: it already wants sparse; we just got there through a dense intermediate that wastes the memory and caused the OOM that macOS memorystatus killed as the "172 GB compressed process" in the v7 rerun on 2026-04-18.

This commit adds a sparse-native builder that produces a CSR matrix row-by-row via `(indptr, indices, data)` construction, never allocating the full dense intermediate (a sketch follows after this message):

- Marginal targets: each category produces a CSR row via `np.flatnonzero(column == category)`, storing only the matching row indices with value 1.0.
- Continuous targets: flatnonzero on the column values, storing only nonzero entries.
- LinearConstraint rows: flatnonzero on the coefficients, same idea.

Semantics match `_build_linear_constraint_system` exactly: both return (matrix, b, names, n_categorical), and the sparse version returns a CSR whose `.toarray()` equals the dense version's A (up to float64 rounding).

Tests in tests/test_sparse_constraint_system.py pin:

1. `_build_sparse_constraint_system` is importable from microplex.calibration.
2. Sparse output == dense output for a marginal-only problem.
3. Sparse output == dense output for a mixed marginal + continuous + LinearConstraint problem.
4. Actual sparsity: density < 0.45 on a realistic 4-state × 3-age marginal problem (the point of the refactor).

36 existing calibration tests are unchanged; 40 total now pass.

Downstream wiring: microplex-us.pipelines.pe_l0.PolicyEngineL0Calibrator calls this directly in a companion commit, bypassing the dense path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
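For orientation, here is a minimal sketch of that row-by-row construction, assuming a pandas DataFrame, a dict of marginal targets, and a list of continuous columns. `build_sparse_rows` is a hypothetical name, not the actual microplex.calibration code, and the real builder also handles LinearConstraint rows plus the b / n_categorical bookkeeping:

```python
import numpy as np
import scipy.sparse as sp

def build_sparse_rows(data, marginal_targets, continuous_targets):
    indptr = [0]      # row boundaries into the flat index/value arrays
    col_indices = []  # per-row column indices (record positions)
    values = []       # per-row nonzero values
    names = []

    # Marginal targets: one indicator row per category, storing only the
    # positions where the column equals that category, with value 1.0.
    for col, categories in marginal_targets.items():
        for cat in categories:
            nz = np.flatnonzero(data[col].to_numpy() == cat)
            col_indices.append(nz)
            values.append(np.ones(nz.size, dtype=np.float64))
            indptr.append(indptr[-1] + nz.size)
            names.append(f"{col}={cat}")

    # Continuous targets: store only the nonzero entries of the column.
    for col in continuous_targets:
        v = data[col].to_numpy(dtype=np.float64)
        nz = np.flatnonzero(v)
        col_indices.append(nz)
        values.append(v[nz])
        indptr.append(indptr[-1] + nz.size)
        names.append(col)

    # Assemble CSR directly from (data, indices, indptr); peak memory is
    # proportional to nnz, never to n_targets * n_records.
    X = sp.csr_matrix(
        (np.concatenate(values), np.concatenate(col_indices), np.asarray(indptr)),
        shape=(len(indptr) - 1, len(data)),
    )
    return X, names
```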
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request on Apr 22, 2026:
Two linked changes:

1. pe_l0.py: `PolicyEngineL0Calibrator.fit` now calls `_build_sparse_constraint_system` from microplex.calibration directly, skipping the dense `np.vstack` + `sp.csr_matrix(A)` round-trip. At v7 scale (1.5M records × ~4k constraints) this avoids the ~24 GB dense intermediate that macOS memorystatus killed the v7 microcalibrate rerun over on 2026-04-18 (python3.14 [28015] grew to 172 GB compressed). Requires microplex from the sparse-constraint-builder branch (CosilicoAI/microplex#7). Residual computation also switched from `A @ weights - b` to `X_sparse @ weights - b`: identical numerics, and no dense matrix is ever materialized.

2. paper/index.qmd §3.3 / §3.4: weaken the identity-preservation definition from strict positivity (∀i: w_i' > 0) to row-set preservation (∀i: w_i' >= 0 AND id(r_i') = id(r_i)); see the sketch after this message. Max's point in conversation: a record with w_i = 0 still has its entity identifier and row position in the HDF5 dataset; it is merely excluded from the current year's weighted aggregates, and remains available for year Y+1's calibration to re-weight up. This is consistent with CBOLT / DYNASIM's equal-per-person frozen-weight convention; zero-sparsity is a strict superset of that flexibility. §3.4 (Sparse L0) is rewritten accordingly: L0 is now framed as a first-class calibrator alongside chi-squared, not as "optional post-processing." Both backends are identity-preserving under the corrected definition. The chi-squared vs. L0 trade-off is now "deployment artifact size vs. rare-subpopulation coverage audit burden" rather than "identity vs. size."

Consequence for v8: the pe_l0 backend is now recommended for memory-constrained runs on the 48 GB workstation. The next launch should use --calibration-backend pe_l0 alongside --donor-imputer-backend zi_qrf (see docs/next-run-plan.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
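A minimal reading of the corrected definition, with a hypothetical helper name (not pe_l0.py's actual code):

```python
import numpy as np

def preserves_row_set(ids_before, ids_after, new_weights):
    # Zero weights are allowed: a record with w_i = 0 keeps its entity
    # identifier and row position, so a later year can re-weight it up.
    # The old, stronger notion required strict positivity (w_i > 0 for all i).
    return bool(
        np.array_equal(ids_before, ids_after) and np.all(new_weights >= 0)
    )
```

The residual swap in change 1 needs no more than `X_sparse @ weights - b`: scipy's CSR matmul against a dense weight vector returns a dense (n_targets,) array without ever densifying the matrix.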
Summary
Avoid dense materialization when building the calibration linear system. At microplex-us v7 scale (~1.5M records × ~4k constraints, mostly marginal indicators that are 95%+ zero), the existing `_build_linear_constraint_system` allocates a dense numpy array of ~24 GB, of which only ~100-500 MB is nonzero. Downstream L0 calibrators immediately convert to CSR anyway; the dense intermediate is waste.

Downstream evidence: on 2026-04-18 at 23:40:50, macOS memorystatus killed the microplex-us pipeline mid-calibration as `python3.14 [28015] 172343 MB` (compressed). Root cause was this dense build path, not microcalibrate's own internals.

What changes

New sibling of `_build_linear_constraint_system` in microplex.calibration:

`_build_sparse_constraint_system(data, marginal_targets, continuous_targets, linear_constraints) -> (X_csr, b, names, n_categorical)`

Builds the matrix row-by-row via `(indptr, indices, data)` construction. For each marginal category, it stores only the `np.flatnonzero(column == category)` entries with value 1.0. Continuous columns and `LinearConstraint` coefficients are stored only at their nonzero positions.
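Schematically, both paths end at the same CSR matrix; only the peak allocation differs. A toy contrast (illustrative code, not the library's):

```python
import numpy as np
import scipy.sparse as sp

# Toy two-category marginal over six records.
col = np.array([0, 1, 0, 0, 1, 0])

# Dense path: np.vstack materializes the full (n_targets, n_records)
# float64 array before csr_matrix compresses it; peak memory is dense-sized.
rows = [(col == cat).astype(np.float64) for cat in (0, 1)]
X_dense_path = sp.csr_matrix(np.vstack(rows))

# Sparse-native path: assemble the CSR triple straight from the nonzero
# positions; peak memory scales with the number of nonzeros.
col_indices = np.concatenate([np.flatnonzero(col == cat) for cat in (0, 1)])
indptr = np.array([0, int(np.sum(col == 0)), col.size])
values = np.ones(col_indices.size)
X_sparse_path = sp.csr_matrix((values, col_indices, indptr), shape=(2, 6))

assert (X_dense_path != X_sparse_path).nnz == 0  # same matrix, cheaper build
```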
Equivalence guarantee

`_build_sparse_constraint_system(...).toarray() == _build_linear_constraint_system(...)[0]` up to float64 rounding. Tests in tests/test_sparse_constraint_system.py pin this on three fixtures (marginal-only, continuous + linear + marginal, and a high-cardinality marginal for the density check), along the lines of the sketch below.
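A sketch of one equivalence pin, assuming illustrative test and argument names rather than the actual contents of tests/test_sparse_constraint_system.py (the dense builder's argument list is also an assumption, mirrored from the sparse one):

```python
import numpy as np
from microplex.calibration import (
    _build_linear_constraint_system,
    _build_sparse_constraint_system,
)

def test_sparse_matches_dense(data, marginal_targets, continuous_targets,
                              linear_constraints):
    # Both builders return (matrix, b, names, n_categorical).
    args = (data, marginal_targets, continuous_targets, linear_constraints)
    X_sparse, b_s, names_s, ncat_s = _build_sparse_constraint_system(*args)
    A_dense, b_d, names_d, ncat_d = _build_linear_constraint_system(*args)
    # Densified sparse matrix must match the dense build to float64 rounding.
    np.testing.assert_allclose(X_sparse.toarray(), A_dense)
    np.testing.assert_allclose(b_s, b_d)
    assert names_s == names_d
    assert ncat_s == ncat_d
```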
Test plan

- `pytest tests/test_sparse_constraint_system.py`: 4 new tests pass.
- `pytest tests/test_calibration.py tests/test_microcalibrate_adapter.py`: 36 existing tests unchanged.
- `PolicyEngineL0Calibrator` swapped to call the sparse builder directly; 2 existing regression tests still pass (test_pe_l0.py).
Why sparse-native and not sparse-coercion

`sp.csr_matrix(dense_array)` would still require the dense array to exist first. The point is to avoid allocating 24 GB in the first place; building directly into CSR storage does that.

🤖 Generated with Claude Code