
Add _build_sparse_constraint_system for O(nnz) calibration build#7

Merged
MaxGhenis merged 1 commit into main from sparse-constraint-builder on Apr 19, 2026

Conversation

@MaxGhenis
Contributor

Summary

Avoid dense materialization when building the calibration linear system. At microplex-us v7 scale (~1.5M records × ~4k constraints, mostly marginal indicators that are 95%+ zero), the existing _build_linear_constraint_system allocates a dense numpy array of ~24 GB, of which only ~100-500 MB is non-zero. Downstream L0 calibrators immediately convert to CSR anyway, so the dense intermediate is pure waste.

Downstream evidence: on 2026-04-18 at 23:40:50, macOS memorystatus killed the microplex-us pipeline mid-calibration, logging python3.14 [28015] 172343 MB (compressed). Root cause was this dense build path, not microcalibrate's own internals.

What changes

New sibling of _build_linear_constraint_system in microplex.calibration:

  • _build_sparse_constraint_system(data, marginal_targets, continuous_targets, linear_constraints) -> (X_csr, b, names, n_categorical)

The builder constructs the matrix row-by-row via (indptr, indices, data) construction, as sketched below. For each marginal category, it stores only the np.flatnonzero(column == category) entries, each with value 1.0. Continuous columns and LinearConstraint coefficients are likewise stored only at their nonzero positions.
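A minimal sketch of that construction, assuming data is a pandas DataFrame, marginal_targets maps column → {category: total}, continuous_targets maps column → total, and LinearConstraint objects expose coefficients, target, and name; the actual microplex.calibration internals may differ:

```python
# Sketch only: input shapes and the names format are assumptions made
# for illustration, not the exact microplex.calibration implementation.
import numpy as np
import scipy.sparse as sp


def build_sparse_constraint_system(data, marginal_targets,
                                   continuous_targets, linear_constraints):
    n_records = len(data)
    indptr, indices, values = [0], [], []
    b, names = [], []

    def push_row(nz_idx, nz_val):
        # Append one CSR row: column indices and values of its nonzeros.
        indices.extend(nz_idx.tolist())
        values.extend(np.asarray(nz_val, dtype=np.float64).tolist())
        indptr.append(len(indices))

    # Marginal indicators: one row per category, value 1.0 wherever the
    # record's column equals the category.
    for column, targets in marginal_targets.items():
        col = data[column].to_numpy()
        for category, total in targets.items():
            nz = np.flatnonzero(col == category)
            push_row(nz, np.ones(nz.size))
            b.append(total)
            names.append(f"{column}=={category}")  # assumed name format
    n_categorical = len(names)

    # Continuous targets: keep only the nonzero column values.
    for column, total in continuous_targets.items():
        col = data[column].to_numpy(dtype=np.float64)
        nz = np.flatnonzero(col)
        push_row(nz, col[nz])
        b.append(total)
        names.append(column)

    # LinearConstraint rows: same idea on the coefficient vector
    # (coefficients/target/name attributes are assumed here).
    for lc in linear_constraints:
        coef = np.asarray(lc.coefficients, dtype=np.float64)
        nz = np.flatnonzero(coef)
        push_row(nz, coef[nz])
        b.append(lc.target)
        names.append(lc.name)

    X_csr = sp.csr_matrix(
        (np.asarray(values), np.asarray(indices), np.asarray(indptr)),
        shape=(len(b), n_records),
    )
    return X_csr, np.asarray(b, dtype=np.float64), names, n_categorical
```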

Equivalence guarantee

_build_sparse_constraint_system(...).toarray() == _build_linear_constraint_system(...)[0] up to float64 rounding. Tests in tests/test_sparse_constraint_system.py pin this equivalence on three fixtures (marginal-only; continuous + linear + marginal; high-cardinality marginal for the density check).
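A sketch of the equivalence check, with fixture construction elided; the actual tests in the repo may be structured differently:

```python
# Both builders return (matrix, b, names, n_categorical); the sparse
# matrix, densified, must match the dense A up to float64 rounding.
import numpy as np
from microplex.calibration import (
    _build_linear_constraint_system,
    _build_sparse_constraint_system,
)

def check_equivalence(data, marginal_targets, continuous_targets,
                      linear_constraints):
    X, b_s, names_s, ncat_s = _build_sparse_constraint_system(
        data, marginal_targets, continuous_targets, linear_constraints)
    A, b_d, names_d, ncat_d = _build_linear_constraint_system(
        data, marginal_targets, continuous_targets, linear_constraints)
    np.testing.assert_allclose(X.toarray(), A)  # float64-level agreement
    np.testing.assert_allclose(b_s, b_d)
    assert names_s == names_d and ncat_s == ncat_d
```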

Test plan

  • pytest tests/test_sparse_constraint_system.py — 4 new tests pass.
  • pytest tests/test_calibration.py tests/test_microcalibrate_adapter.py — 36 existing tests unchanged.
  • Downstream: microplex-us's PolicyEngineL0Calibrator swapped to call the sparse builder directly; 2 existing regression tests still pass (test_pe_l0.py).

Why sparse-native and not sparse-coercion

sp.csr_matrix(dense_array) would still require the dense array to exist first. The point is to avoid allocating 24 GB in the first place; building directly into CSR storage does that.
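For intuition, a back-of-envelope comparison of peak allocation on the two paths; the counts and density are placeholders, not the exact v7 figures:

```python
# Placeholder arithmetic, not the exact v7 numbers; the point is the ratio.
def dense_peak_bytes(n_targets, n_records):
    # np.vstack(rows) holds every entry, zero or not, as float64.
    return n_targets * n_records * 8

def csr_peak_bytes(n_targets, n_records, density):
    # CSR keeps one float64 value and one int32 column index per
    # nonzero, plus one int64 indptr offset per row.
    nnz = int(density * n_targets * n_records)
    return nnz * (8 + 4) + (n_targets + 1) * 8
```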

🤖 Generated with Claude Code

The existing _build_linear_constraint_system materializes a dense
numpy array of shape (n_targets, n_records) via np.vstack(rows). At
microplex-us v7 scale (~1.5M records × ~4k constraints, mostly
marginal indicators that are 95%+ zero), this is ~24 GB of dense
float64 allocated just to represent ~100-500 MB of nonzero data.

The pipeline's L0 calibrator (microplex_us.pipelines.pe_l0) immediately
wraps the dense matrix in sp.csr_matrix(A) — it already wants sparse;
we just got there through a dense intermediate that wastes the memory.
That intermediate is what macOS memorystatus killed as the "172 GB
compressed process" in the v7 rerun on 2026-04-18.

This commit adds a sparse-native builder that produces a CSR matrix
row-by-row via (indptr, indices, data) construction, never allocating
the full dense intermediate:

- Marginal targets: each category produces a CSR row by
  np.flatnonzero(column == category), storing only the matching row
  indices with value 1.0.
- Continuous targets: flatnonzero on the column values, storing only
  nonzero entries.
- LinearConstraint rows: flatnonzero on coefficients, same idea.

Semantics match _build_linear_constraint_system exactly: both return
(matrix, b, names, n_categorical); the sparse version returns a CSR
whose .toarray() equals the dense version's A (up to float64
rounding).

Tests in tests/test_sparse_constraint_system.py pin:
1. _build_sparse_constraint_system importable from microplex.calibration.
2. Sparse output == dense output for marginal-only problem.
3. Sparse output == dense output for mixed marginal + continuous +
   LinearConstraint problem.
4. Actual sparsity: density < 0.45 on a realistic 4-state × 3-age
   marginal problem (the point of the refactor; sketched below).
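An illustrative version of that density assertion, using a random stand-in fixture rather than the exact one in the test file (the dict-of-dicts marginal_targets layout is also an assumption):

```python
import numpy as np
import pandas as pd
from microplex.calibration import _build_sparse_constraint_system

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "state": rng.choice(["CA", "TX", "NY", "FL"], size=10_000),
    "age_band": rng.choice(["0-17", "18-64", "65+"], size=10_000),
})
marginal_targets = {  # dict-of-dicts layout is an assumption
    "state": {s: 2_500.0 for s in ("CA", "TX", "NY", "FL")},
    "age_band": {a: 3_400.0 for a in ("0-17", "18-64", "65+")},
}
X, b, names, n_cat = _build_sparse_constraint_system(
    data, marginal_targets, {}, [])
# Each record contributes exactly one nonzero per marginal, so density
# here is 2 / 7 — well under the 0.45 bound.
density = X.nnz / (X.shape[0] * X.shape[1])
assert density < 0.45
```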

36 existing calibration tests unchanged; 40 total now pass.

Downstream wiring: microplex_us.pipelines.pe_l0.PolicyEngineL0Calibrator
calls this directly in a companion commit, bypassing the dense path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis merged commit 5c358e3 into main on Apr 19, 2026
3 checks passed
MaxGhenis deleted the sparse-constraint-builder branch on April 19, 2026 at 12:12
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request Apr 22, 2026
Two linked changes:

1. pe_l0.py: PolicyEngineL0Calibrator.fit now calls
   _build_sparse_constraint_system from microplex.calibration directly,
   skipping the dense np.vstack + sp.csr_matrix(A) round-trip. At v7
   scale (1.5M records × ~4k constraints) this avoids the ~24 GB dense
   intermediate over which macOS memorystatus killed the v7
   microcalibrate rerun on 2026-04-18 (python3.14 [28015] grew to 172 GB
   compressed). Requires microplex from the sparse-constraint-builder
   branch (CosilicoAI/microplex#7). Residual computation also switched
   from `A @ weights - b` to `X_sparse @ weights - b`; identical
   numerics, no dense matrix ever materialized.

2. paper/index.qmd §3.3 / §3.4: weaken the identity-preservation
   definition from strict positivity (∀i: w_i' > 0) to row-set
   preservation (∀i: w_i' >= 0 AND id(r_i') = id(r_i)). Max's point
   in conversation: a record with w_i = 0 still has its entity
   identifier and row position in the HDF5 dataset — it's just
   excluded from the current year's weighted aggregates, and is
   available for year Y+1's calibration to re-weight up. This is
   consistent with CBOLT / DYNASIM's equal-per-person frozen-weight
   convention; allowing zero weights is a strict superset of that
   flexibility (see the predicate sketch after this list).

   §3.4 (Sparse L0) rewritten accordingly: L0 is now framed as a
   first-class calibrator alongside chi-squared, not as "optional
   post-processing." Both backends are identity-preserving under the
   corrected definition. The chi-squared vs L0 trade-off is now
   "deployment artifact size vs rare-subpopulation coverage audit
   burden" rather than "identity vs size."

Consequence for v8: the pe_l0 backend is now recommended for
memory-constrained runs on the 48 GB workstation. Next launch should
use --calibration-backend pe_l0 alongside --donor-imputer-backend zi_qrf
(see docs/next-run-plan.md).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>