Skip to content

feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315

Draft
breimanntools wants to merge 7 commits into
masterfrom
feat/eval-features
Draft

feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315
breimanntools wants to merge 7 commits into
masterfrom
feat/eval-features

Conversation

@breimanntools

@breimanntools breimanntools commented Jun 30, 2026

Copy link
Copy Markdown
Owner

Status: SOLID — works, green local tests, full ripple done. Part of #305 / prototype for #309.

What to review

  • New public scorer aa.eval_features(X, labels, model=None, cv=None, metric="balanced_accuracy", mask_known_pos=None, random_state=None) returning a percentage score.
  • The default (linear SVM + LeaveOneOut + balanced accuracy) reproduces the γ-secretase notebook's hand-rolled balanced_acc helper exactly on the same feature matrix (golden test asserts isclose vs the manual cross_val_predict(SVC(kernel="linear"), ...) recipe).
  • The PU mask_known_pos variant (port of cv_leave_one_out_masked from the original project's scripts/ml_benchmarking.py): masked known positives stay in every training fold but are never scored as test points.

Design notes / open API questions

  1. Percentage scaling. Returns the bounded metric × 100 uniformly (matches the notebook's * 100). For matthews_corrcoef (range [-1, 1]) this yields [-100, 100] — documented, but a maintainer may prefer to restrict the metric registry to [0, 1] metrics, or add a percentage=True toggle.
  2. Masked-CV predictions. The port uses model.predict (not the original's predict_proba > 0.5) to stay deterministic and avoid requiring probability=True on SVC (Platt scaling is stochastic). For a linear SVM these agree on the decision boundary. The default (unmasked) path uses cross_val_predict to byte-match the notebook.
  3. Metric registry is a small fixed dict (balanced_accuracy, accuracy, f1, precision, recall, matthews_corrcoef). Could be widened to all sklearn.metrics.get_scorer_names() if desired.
  4. Coordinate with Model evaluation & comparison: repeated-CV + bootstrap CIs + paired ΔMCC #91 / feat: aa.plot_eval_heatmap — house-preset evaluation heatmap (consolidate duplicated seaborn blocks) #310 so the eval/comparison surface is one API (the eval heatmap consumes scores like these).

Ripple done

  • numpydoc docstring with named Returns + .. include:: examples/eval_features.rst.
  • Executed example notebook examples/metrics/eval_features.ipynb (default / model+cv+metric swap / PU mask; display_df table + embedded outputs).
  • 29 unit tests (tests/unit/metrics_tests/test_eval_features.py): positive + negative per parameter, reproducibility (same random_state → identical score), golden vs manual recipe. All green locally.
  • Wired into __init__/__all__ (CONFIRM-FIRST; feat: aa.eval_features — model + CV + metric feature-set scorer (incl. PU mask-CV) #309 is on the epic wire-to-public-API list), metrics/__init__.py, docs/source/api.rst autosummary, release notes (Unreleased · Metrics), cheat sheet.
  • Meta-tests green: test_param_coverage, test_utils_barrel, test_backend_import_hygiene, full api_tests (175) + metrics_tests (112).

KPIs (#309)

  • Default reproduces notebook bACC numbers within tolerance (golden test).
  • Same random_state → identical score.
  • ≥1 unit test per parameter, positive + negative.
  • No new dependency (sklearn already a dep).

🤖 Generated with Claude Code


Critical self-review

Adversarial review pass (not a rubber-stamp). Defects found and fixed:

  1. BUG — mask_known_pos + integer cv crashed. The masked path called
    cv.split(...) on the raw argument, so eval_features(X, y, cv=5, mask_known_pos=mask)
    raised AttributeError: 'int' object has no attribute 'split' (reproduced concretely).
    Fixed by normalizing cv once with sklearn.model_selection.check_cv
    (None -> LeaveOneOut, int -> StratifiedKFold, splitter passed through) so both the
    default and masked paths get a real splitter. The default/int paths stay byte-identical
    (this is exactly what cross_val_predict does internally). Added a mask + int cv
    regression test.

  2. Percentage/MCC inconsistency (the flagged design question) — RESOLVED by restricting
    the registry to [0, 1] metrics.
    matthews_corrcoef ([-1, 1]) made the *100
    scaling land in [-100, 100], breaking the "percentage" contract. I removed it, so every
    registry metric (balanced_accuracy, accuracy, f1, precision, recall) is bounded
    in [0, 1] and the returned score is a genuine percentage in [0, 100]. This is the
    smallest, most honest fix and keeps a single return semantics; a signed metric can be
    re-added later behind a scaling-aware design if a use case appears. Docstring, Returns,
    and the supported-metric parametrization updated accordingly.

  3. Golden test was near-tautological. It re-ran balanced_accuracy_score against itself.
    Added an independent hand-rolled balanced accuracy (mean per-class recall) as a second
    cross-check, so the assertion actually pins the metric semantics, not just the sklearn call.

  4. PU-mask invariant was only checked indirectly. Added a rigorous test with a recording
    estimator (class-level storage survives sklearn.clone) that asserts the masked known
    positives train every scored fold and are never handed to predict (scored).

  5. Convention/hygiene. Moved the mid-file sklearn imports in _utils/metrics.py to the
    module top. Moved the all-samples-masked rejection into the frontend Validate block
    (backend now trusts the frontend), keeping a defensive backend guard — with a clearer
    message — only for exotic custom splitters that never cover an unmasked sample.

Verified: random_state is forwarded only to estimators exposing it (guarded by
get_params()), so the default SVM and param-less estimators never crash; same
random_state -> identical score. No new dependency, no print, bare ValueError only.
Tests: metrics_tests + api_tests green (288), example notebook re-executes under nbmake,
docstring checker clean for eval_features (only a pre-existing comp_kld advisory remains).

Residual note: the golden test runs on a synthetic make_classification matrix, not the
actual γ-secretase feature matrix (that data isn't a lightweight unit-test fixture); it proves
the default reproduces the notebook recipe exactly, which is the testable contract.

Iterative review log

  • Round 2 (correctness & edge cases): Found + fixed a real PU-mask bug — with a non-LOO splitter (e.g. StratifiedKFold), a masked known positive that landed in a test partition was dropped from both scoring and that fold's training, violating the documented "known positives still inform every training fold" contract (concretely reproduced: 3 of 4 folds were missing masked positives from training). Fix folds the masked test indices back into train_full per fold; LOO stays byte-identical (single-sample masked test folds are skipped outright, so the default path and golden/notebook reproduction are unchanged). Added a k-fold masked-invariant regression test (31 tests). Also verified single-class labels raise a friendly check_labels ValueError, not a cryptic sklearn error.
  • Round 3 (numerical robustness & API consistency): Verified adversarially that all five registry metrics return a genuine percentage in [0, 100] even on degenerate/near-random fixtures that trigger zero-division folds, and that the default + RandomForest paths are deterministic under random_state. One API-consistency wart: f1/precision/recall (binary-averaged) raise a cryptic sklearn ValueError on multi-class labels while balanced_accuracy/accuracy accept them. It already fails loud (a ValueError), so the slim honest fix is documentation, not cross-layer validation plumbing: clarified in the metric docstring which metrics are multi-class vs binary, and added a positive multi-class test for the supported metrics (33 tests). No behavior change.
  • Round 4 (efficiency & simplicity): Dropped two needless per-fold re-wraps in the masked-CV loop (np.asarray(y)[keep] -> y[keep], np.asarray(preds) -> preds; both are already ndarrays), a recompute-inside-loop the round-4 lens targets — output byte-identical (33 tests green). Audited the rest (frontend Validate block, eval_features_, the .astype(int) safety cast, default cross_val_predict path): no dead code, no redundant validation, no other in-loop recompute, and the per-fold refit loop is inherently non-vectorizable. Otherwise converged.
  • Round 5 (guides, docs & tests completeness): Fixed one convention item — metric validation now reads the registry via ut.LIST_METRICS_EVAL (barrel access) instead of a bare top-level import, matching the house ut.X constant rule. Verified: numpydoc named Returns + one-line summary; .. include:: examples/eval_features.rst backed by a present, nbmake-passing examples/metrics/eval_features.ipynb (same generation pattern as the sibling v1.1 metrics); no citations to resolve; no ADR refs introduced by this PR; frontend Validate covers every public param and the backend trusts it; bare ValueError only, no print. Every param has a positive + negative test. Docstring checker + doc/signature-drift clean for eval_features (only the pre-existing comp_kld RAISES advisory remains). Full local gate green: metrics_tests + api_tests (param_coverage, backend_import_hygiene, utils_barrel, return-contract) = 301 passed.

breimanntools and others added 7 commits July 1, 2026 01:29
…ype #309)

Top-level eval_features(X, labels, model=None, cv=None, metric='balanced_accuracy',
mask_known_pos=None, random_state=None) returning a percentage score. The default
(linear SVM + LeaveOneOut + balanced accuracy) reproduces the gamma-secretase
notebook's hand-rolled balanced_acc helper. Accepts any sklearn estimator, CV
splitter/int, and metric name; mask_known_pos selects the PU mask-known-positives
CV variant (port of cv_leave_one_out_masked). Deterministic given random_state.

Backend eval_features_ in _utils/metrics.py re-exported via utils; frontend in
metrics/_metrics.py with full Validate block. Re-exported in __init__/__all__
(wire-to-public-API per epic #305). 29 unit tests (positive+negative per param,
reproducibility, golden vs manual recipe).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le notebook (prototype #309)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gistry, PU-mask invariant test

Critical-review fixes for #309:
- Fix AttributeError when mask_known_pos is combined with an integer cv: the
  masked path called cv.split on a raw int. Normalize cv once via
  sklearn.check_cv (None -> LeaveOneOut, int -> StratifiedKFold, splitter as-is)
  so both the default and masked paths receive a real splitter; byte-identical
  for the existing default/int paths.
- Resolve the percentage-scaling inconsistency: drop matthews_corrcoef so every
  registry metric is bounded in [0, 1] and the *100 scaling is a genuine
  percentage in [0, 100]. Docstring/Returns updated to match.
- Move the mid-file sklearn imports to the module top (file-skeleton hygiene).
- Move the all-samples-masked rejection to the frontend validator (backend now
  trusts the frontend); keep a defensive backend guard for exotic custom
  splitters with a clearer message.
- Strengthen the golden test with an independent hand-rolled balanced accuracy
  (mean per-class recall) so it is not a re-run of balanced_accuracy_score
  against itself.
- Add a rigorous PU-mask invariant test (recording estimator) asserting masked
  positives train every scored fold and are never scored, plus a mask+int-cv
  regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing under non-LOO CV

The PU mask-known-positives path dropped a masked positive from BOTH scoring
and training whenever it fell in a k-fold test partition, breaking the
documented 'known positives still inform every training fold' contract for any
non-LOO splitter. Fold the masked test indices back into each fold's training
set; leave-one-out is unaffected (single-sample masked test folds are skipped),
so the default path and the notebook reproduction stay byte-identical. Add a
StratifiedKFold masked-invariant regression test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oundary

f1/precision/recall are binary-averaged (raise on multi-class labels) while
balanced_accuracy/accuracy accept multi-class. Clarify the metric docstring and
add a positive multi-class test for the supported metrics. Verified all five
metrics return a true [0,100]% even on degenerate zero-division folds and that
the default + RandomForest paths are deterministic. No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
y and preds are already ndarrays in the masked-CV fold loop, so the extra
np.asarray(...) calls per fold were a needless recompute. Output byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Match the house ut.X constant-access convention: validate 'metric' against
ut.LIST_METRICS_EVAL through the utils barrel instead of a bare top-level import.
No behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant