feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309) by breimanntools · Pull Request #315 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:29:34Z

Status: SOLID — works, green local tests, full ripple done. Part of #305 / prototype for #309.

What to review

New public scorer aa.eval_features(X, labels, model=None, cv=None, metric="balanced_accuracy", mask_known_pos=None, random_state=None) returning a percentage score.
The default (linear SVM + LeaveOneOut + balanced accuracy) reproduces the γ-secretase notebook's hand-rolled balanced_acc helper exactly on the same feature matrix (golden test asserts isclose vs the manual cross_val_predict(SVC(kernel="linear"), ...) recipe).
The PU mask_known_pos variant (port of cv_leave_one_out_masked from the original project's scripts/ml_benchmarking.py): masked known positives stay in every training fold but are never scored as test points.

Design notes / open API questions

Percentage scaling. Returns the bounded metric × 100 uniformly (matches the notebook's * 100). For matthews_corrcoef (range [-1, 1]) this yields [-100, 100] — documented, but a maintainer may prefer to restrict the metric registry to [0, 1] metrics, or add a percentage=True toggle.
Masked-CV predictions. The port uses model.predict (not the original's predict_proba > 0.5) to stay deterministic and avoid requiring probability=True on SVC (Platt scaling is stochastic). For a linear SVM these agree on the decision boundary. The default (unmasked) path uses cross_val_predict to byte-match the notebook.
Metric registry is a small fixed dict (balanced_accuracy, accuracy, f1, precision, recall, matthews_corrcoef). Could be widened to all sklearn.metrics.get_scorer_names() if desired.
Coordinate with Model evaluation & comparison: repeated-CV + bootstrap CIs + paired ΔMCC #91 / feat: aa.plot_eval_heatmap — house-preset evaluation heatmap (consolidate duplicated seaborn blocks) #310 so the eval/comparison surface is one API (the eval heatmap consumes scores like these).

Ripple done

numpydoc docstring with named Returns + .. include:: examples/eval_features.rst.
Executed example notebook examples/metrics/eval_features.ipynb (default / model+cv+metric swap / PU mask; display_df table + embedded outputs).
29 unit tests (tests/unit/metrics_tests/test_eval_features.py): positive + negative per parameter, reproducibility (same random_state → identical score), golden vs manual recipe. All green locally.
Wired into __init__/__all__ (CONFIRM-FIRST; feat: aa.eval_features — model + CV + metric feature-set scorer (incl. PU mask-CV) #309 is on the epic wire-to-public-API list), metrics/__init__.py, docs/source/api.rst autosummary, release notes (Unreleased · Metrics), cheat sheet.
Meta-tests green: test_param_coverage, test_utils_barrel, test_backend_import_hygiene, full api_tests (175) + metrics_tests (112).

KPIs (#309)

Default reproduces notebook bACC numbers within tolerance (golden test).
Same random_state → identical score.
≥1 unit test per parameter, positive + negative.
No new dependency (sklearn already a dep).

🤖 Generated with Claude Code

Critical self-review

Adversarial review pass (not a rubber-stamp). Defects found and fixed:

BUG — mask_known_pos + integer cv crashed. The masked path called
cv.split(...) on the raw argument, so eval_features(X, y, cv=5, mask_known_pos=mask)
raised AttributeError: 'int' object has no attribute 'split' (reproduced concretely).
Fixed by normalizing cv once with sklearn.model_selection.check_cv
(None -> LeaveOneOut, int -> StratifiedKFold, splitter passed through) so both the
default and masked paths get a real splitter. The default/int paths stay byte-identical
(this is exactly what cross_val_predict does internally). Added a mask + int cv
regression test.
Percentage/MCC inconsistency (the flagged design question) — RESOLVED by restricting
the registry to [0, 1] metrics. matthews_corrcoef ([-1, 1]) made the *100
scaling land in [-100, 100], breaking the "percentage" contract. I removed it, so every
registry metric (balanced_accuracy, accuracy, f1, precision, recall) is bounded
in [0, 1] and the returned score is a genuine percentage in [0, 100]. This is the
smallest, most honest fix and keeps a single return semantics; a signed metric can be
re-added later behind a scaling-aware design if a use case appears. Docstring, Returns,
and the supported-metric parametrization updated accordingly.
Golden test was near-tautological. It re-ran balanced_accuracy_score against itself.
Added an independent hand-rolled balanced accuracy (mean per-class recall) as a second
cross-check, so the assertion actually pins the metric semantics, not just the sklearn call.
PU-mask invariant was only checked indirectly. Added a rigorous test with a recording
estimator (class-level storage survives sklearn.clone) that asserts the masked known
positives train every scored fold and are never handed to predict (scored).
Convention/hygiene. Moved the mid-file sklearn imports in _utils/metrics.py to the
module top. Moved the all-samples-masked rejection into the frontend Validate block
(backend now trusts the frontend), keeping a defensive backend guard — with a clearer
message — only for exotic custom splitters that never cover an unmasked sample.

Verified: random_state is forwarded only to estimators exposing it (guarded by
get_params()), so the default SVM and param-less estimators never crash; same
random_state -> identical score. No new dependency, no print, bare ValueError only.
Tests: metrics_tests + api_tests green (288), example notebook re-executes under nbmake,
docstring checker clean for eval_features (only a pre-existing comp_kld advisory remains).

Residual note: the golden test runs on a synthetic make_classification matrix, not the
actual γ-secretase feature matrix (that data isn't a lightweight unit-test fixture); it proves
the default reproduces the notebook recipe exactly, which is the testable contract.

Iterative review log

Round 2 (correctness & edge cases): Found + fixed a real PU-mask bug — with a non-LOO splitter (e.g. StratifiedKFold), a masked known positive that landed in a test partition was dropped from both scoring and that fold's training, violating the documented "known positives still inform every training fold" contract (concretely reproduced: 3 of 4 folds were missing masked positives from training). Fix folds the masked test indices back into train_full per fold; LOO stays byte-identical (single-sample masked test folds are skipped outright, so the default path and golden/notebook reproduction are unchanged). Added a k-fold masked-invariant regression test (31 tests). Also verified single-class labels raise a friendly check_labels ValueError, not a cryptic sklearn error.
Round 3 (numerical robustness & API consistency): Verified adversarially that all five registry metrics return a genuine percentage in [0, 100] even on degenerate/near-random fixtures that trigger zero-division folds, and that the default + RandomForest paths are deterministic under random_state. One API-consistency wart: f1/precision/recall (binary-averaged) raise a cryptic sklearn ValueError on multi-class labels while balanced_accuracy/accuracy accept them. It already fails loud (a ValueError), so the slim honest fix is documentation, not cross-layer validation plumbing: clarified in the metric docstring which metrics are multi-class vs binary, and added a positive multi-class test for the supported metrics (33 tests). No behavior change.
Round 4 (efficiency & simplicity): Dropped two needless per-fold re-wraps in the masked-CV loop (np.asarray(y)[keep] -> y[keep], np.asarray(preds) -> preds; both are already ndarrays), a recompute-inside-loop the round-4 lens targets — output byte-identical (33 tests green). Audited the rest (frontend Validate block, eval_features_, the .astype(int) safety cast, default cross_val_predict path): no dead code, no redundant validation, no other in-loop recompute, and the per-fold refit loop is inherently non-vectorizable. Otherwise converged.
Round 5 (guides, docs & tests completeness): Fixed one convention item — metric validation now reads the registry via ut.LIST_METRICS_EVAL (barrel access) instead of a bare top-level import, matching the house ut.X constant rule. Verified: numpydoc named Returns + one-line summary; .. include:: examples/eval_features.rst backed by a present, nbmake-passing examples/metrics/eval_features.ipynb (same generation pattern as the sibling v1.1 metrics); no citations to resolve; no ADR refs introduced by this PR; frontend Validate covers every public param and the backend trusts it; bare ValueError only, no print. Every param has a positive + negative test. Docstring checker + doc/signature-drift clean for eval_features (only the pre-existing comp_kld RAISES advisory remains). Full local gate green: metrics_tests + api_tests (param_coverage, backend_import_hygiene, utils_barrel, return-contract) = 301 passed.

…ype #309) Top-level eval_features(X, labels, model=None, cv=None, metric='balanced_accuracy', mask_known_pos=None, random_state=None) returning a percentage score. The default (linear SVM + LeaveOneOut + balanced accuracy) reproduces the gamma-secretase notebook's hand-rolled balanced_acc helper. Accepts any sklearn estimator, CV splitter/int, and metric name; mask_known_pos selects the PU mask-known-positives CV variant (port of cv_leave_one_out_masked). Deterministic given random_state. Backend eval_features_ in _utils/metrics.py re-exported via utils; frontend in metrics/_metrics.py with full Validate block. Re-exported in __init__/__all__ (wire-to-public-API per epic #305). 29 unit tests (positive+negative per param, reproducibility, golden vs manual recipe). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…le notebook (prototype #309) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…gistry, PU-mask invariant test Critical-review fixes for #309: - Fix AttributeError when mask_known_pos is combined with an integer cv: the masked path called cv.split on a raw int. Normalize cv once via sklearn.check_cv (None -> LeaveOneOut, int -> StratifiedKFold, splitter as-is) so both the default and masked paths receive a real splitter; byte-identical for the existing default/int paths. - Resolve the percentage-scaling inconsistency: drop matthews_corrcoef so every registry metric is bounded in [0, 1] and the *100 scaling is a genuine percentage in [0, 100]. Docstring/Returns updated to match. - Move the mid-file sklearn imports to the module top (file-skeleton hygiene). - Move the all-samples-masked rejection to the frontend validator (backend now trusts the frontend); keep a defensive backend guard for exotic custom splitters with a clearer message. - Strengthen the golden test with an independent hand-rolled balanced accuracy (mean per-class recall) so it is not a re-run of balanced_accuracy_score against itself. - Add a rigorous PU-mask invariant test (recording estimator) asserting masked positives train every scored fold and are never scored, plus a mask+int-cv regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ing under non-LOO CV The PU mask-known-positives path dropped a masked positive from BOTH scoring and training whenever it fell in a k-fold test partition, breaking the documented 'known positives still inform every training fold' contract for any non-LOO splitter. Fold the masked test indices back into each fold's training set; leave-one-out is unaffected (single-sample masked test folds are skipped), so the default path and the notebook reproduction stay byte-identical. Add a StratifiedKFold masked-invariant regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…oundary f1/precision/recall are binary-averaged (raise on multi-class labels) while balanced_accuracy/accuracy accept multi-class. Clarify the metric docstring and add a positive multi-class test for the supported metrics. Verified all five metrics return a true [0,100]% even on degenerate zero-division folds and that the default + RandomForest paths are deterministic. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

y and preds are already ndarrays in the masked-CV fold loop, so the extra np.asarray(...) calls per fold were a needless recompute. Output byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Match the house ut.X constant-access convention: validate 'metric' against ut.LIST_METRICS_EVAL through the utils barrel instead of a bare top-level import. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

breimanntools and others added 7 commits July 1, 2026 01:29

docs: eval_features release note, API autosummary, cheat sheet, examp…

134c54e

…le notebook (prototype #309) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315

feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315
breimanntools wants to merge 7 commits into
masterfrom
feat/eval-features

breimanntools commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

breimanntools commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What to review

Design notes / open API questions

Ripple done

KPIs (#309)

Critical self-review

Iterative review log

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

breimanntools commented Jun 30, 2026 •

edited

Loading