feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315
Draft
breimanntools wants to merge 7 commits into
Draft
feat: aa.eval_features — model+CV+metric feature-set scorer (prototype #309)#315breimanntools wants to merge 7 commits into
breimanntools wants to merge 7 commits into
Conversation
…ype #309) Top-level eval_features(X, labels, model=None, cv=None, metric='balanced_accuracy', mask_known_pos=None, random_state=None) returning a percentage score. The default (linear SVM + LeaveOneOut + balanced accuracy) reproduces the gamma-secretase notebook's hand-rolled balanced_acc helper. Accepts any sklearn estimator, CV splitter/int, and metric name; mask_known_pos selects the PU mask-known-positives CV variant (port of cv_leave_one_out_masked). Deterministic given random_state. Backend eval_features_ in _utils/metrics.py re-exported via utils; frontend in metrics/_metrics.py with full Validate block. Re-exported in __init__/__all__ (wire-to-public-API per epic #305). 29 unit tests (positive+negative per param, reproducibility, golden vs manual recipe). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…le notebook (prototype #309) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…gistry, PU-mask invariant test Critical-review fixes for #309: - Fix AttributeError when mask_known_pos is combined with an integer cv: the masked path called cv.split on a raw int. Normalize cv once via sklearn.check_cv (None -> LeaveOneOut, int -> StratifiedKFold, splitter as-is) so both the default and masked paths receive a real splitter; byte-identical for the existing default/int paths. - Resolve the percentage-scaling inconsistency: drop matthews_corrcoef so every registry metric is bounded in [0, 1] and the *100 scaling is a genuine percentage in [0, 100]. Docstring/Returns updated to match. - Move the mid-file sklearn imports to the module top (file-skeleton hygiene). - Move the all-samples-masked rejection to the frontend validator (backend now trusts the frontend); keep a defensive backend guard for exotic custom splitters with a clearer message. - Strengthen the golden test with an independent hand-rolled balanced accuracy (mean per-class recall) so it is not a re-run of balanced_accuracy_score against itself. - Add a rigorous PU-mask invariant test (recording estimator) asserting masked positives train every scored fold and are never scored, plus a mask+int-cv regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ing under non-LOO CV The PU mask-known-positives path dropped a masked positive from BOTH scoring and training whenever it fell in a k-fold test partition, breaking the documented 'known positives still inform every training fold' contract for any non-LOO splitter. Fold the masked test indices back into each fold's training set; leave-one-out is unaffected (single-sample masked test folds are skipped), so the default path and the notebook reproduction stay byte-identical. Add a StratifiedKFold masked-invariant regression test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oundary f1/precision/recall are binary-averaged (raise on multi-class labels) while balanced_accuracy/accuracy accept multi-class. Clarify the metric docstring and add a positive multi-class test for the supported metrics. Verified all five metrics return a true [0,100]% even on degenerate zero-division folds and that the default + RandomForest paths are deterministic. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
y and preds are already ndarrays in the masked-CV fold loop, so the extra np.asarray(...) calls per fold were a needless recompute. Output byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Match the house ut.X constant-access convention: validate 'metric' against ut.LIST_METRICS_EVAL through the utils barrel instead of a bare top-level import. No behavior change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: SOLID — works, green local tests, full ripple done. Part of #305 / prototype for #309.
What to review
aa.eval_features(X, labels, model=None, cv=None, metric="balanced_accuracy", mask_known_pos=None, random_state=None)returning a percentage score.LeaveOneOut+ balanced accuracy) reproduces the γ-secretase notebook's hand-rolledbalanced_acchelper exactly on the same feature matrix (golden test assertsisclosevs the manualcross_val_predict(SVC(kernel="linear"), ...)recipe).mask_known_posvariant (port ofcv_leave_one_out_maskedfrom the original project'sscripts/ml_benchmarking.py): masked known positives stay in every training fold but are never scored as test points.Design notes / open API questions
* 100). Formatthews_corrcoef(range [-1, 1]) this yields [-100, 100] — documented, but a maintainer may prefer to restrict the metric registry to [0, 1] metrics, or add apercentage=Truetoggle.model.predict(not the original'spredict_proba > 0.5) to stay deterministic and avoid requiringprobability=Trueon SVC (Platt scaling is stochastic). For a linear SVM these agree on the decision boundary. The default (unmasked) path usescross_val_predictto byte-match the notebook.balanced_accuracy,accuracy,f1,precision,recall,matthews_corrcoef). Could be widened to allsklearn.metrics.get_scorer_names()if desired.Ripple done
Returns+.. include:: examples/eval_features.rst.examples/metrics/eval_features.ipynb(default / model+cv+metric swap / PU mask;display_dftable + embedded outputs).tests/unit/metrics_tests/test_eval_features.py): positive + negative per parameter, reproducibility (samerandom_state→ identical score), golden vs manual recipe. All green locally.__init__/__all__(CONFIRM-FIRST; feat: aa.eval_features — model + CV + metric feature-set scorer (incl. PU mask-CV) #309 is on the epic wire-to-public-API list),metrics/__init__.py,docs/source/api.rstautosummary, release notes (Unreleased · Metrics), cheat sheet.test_param_coverage,test_utils_barrel,test_backend_import_hygiene, fullapi_tests(175) +metrics_tests(112).KPIs (#309)
random_state→ identical score.🤖 Generated with Claude Code
Critical self-review
Adversarial review pass (not a rubber-stamp). Defects found and fixed:
BUG —
mask_known_pos+ integercvcrashed. The masked path calledcv.split(...)on the raw argument, soeval_features(X, y, cv=5, mask_known_pos=mask)raised
AttributeError: 'int' object has no attribute 'split'(reproduced concretely).Fixed by normalizing
cvonce withsklearn.model_selection.check_cv(
None -> LeaveOneOut,int -> StratifiedKFold, splitter passed through) so both thedefault and masked paths get a real splitter. The default/int paths stay byte-identical
(this is exactly what
cross_val_predictdoes internally). Added amask + int cvregression test.
Percentage/MCC inconsistency (the flagged design question) — RESOLVED by restricting
the registry to
[0, 1]metrics.matthews_corrcoef([-1, 1]) made the*100scaling land in
[-100, 100], breaking the "percentage" contract. I removed it, so everyregistry metric (
balanced_accuracy,accuracy,f1,precision,recall) is boundedin
[0, 1]and the returned score is a genuine percentage in[0, 100]. This is thesmallest, most honest fix and keeps a single return semantics; a signed metric can be
re-added later behind a scaling-aware design if a use case appears. Docstring,
Returns,and the supported-metric parametrization updated accordingly.
Golden test was near-tautological. It re-ran
balanced_accuracy_scoreagainst itself.Added an independent hand-rolled balanced accuracy (mean per-class recall) as a second
cross-check, so the assertion actually pins the metric semantics, not just the sklearn call.
PU-mask invariant was only checked indirectly. Added a rigorous test with a recording
estimator (class-level storage survives
sklearn.clone) that asserts the masked knownpositives train every scored fold and are never handed to
predict(scored).Convention/hygiene. Moved the mid-file
sklearnimports in_utils/metrics.pyto themodule top. Moved the all-samples-masked rejection into the frontend Validate block
(backend now trusts the frontend), keeping a defensive backend guard — with a clearer
message — only for exotic custom splitters that never cover an unmasked sample.
Verified:
random_stateis forwarded only to estimators exposing it (guarded byget_params()), so the default SVM and param-less estimators never crash; samerandom_state-> identical score. No new dependency, noprint, bareValueErroronly.Tests:
metrics_tests+api_testsgreen (288), example notebook re-executes undernbmake,docstring checker clean for
eval_features(only a pre-existingcomp_kldadvisory remains).Residual note: the golden test runs on a synthetic
make_classificationmatrix, not theactual γ-secretase feature matrix (that data isn't a lightweight unit-test fixture); it proves
the default reproduces the notebook recipe exactly, which is the testable contract.
Iterative review log
StratifiedKFold), a masked known positive that landed in a test partition was dropped from both scoring and that fold's training, violating the documented "known positives still inform every training fold" contract (concretely reproduced: 3 of 4 folds were missing masked positives from training). Fix folds the masked test indices back intotrain_fullper fold; LOO stays byte-identical (single-sample masked test folds are skipped outright, so the default path and golden/notebook reproduction are unchanged). Added a k-fold masked-invariant regression test (31 tests). Also verified single-class labels raise a friendlycheck_labelsValueError, not a cryptic sklearn error.random_state. One API-consistency wart:f1/precision/recall(binary-averaged) raise a cryptic sklearnValueErroron multi-class labels whilebalanced_accuracy/accuracyaccept them. It already fails loud (aValueError), so the slim honest fix is documentation, not cross-layer validation plumbing: clarified in themetricdocstring which metrics are multi-class vs binary, and added a positive multi-class test for the supported metrics (33 tests). No behavior change.np.asarray(y)[keep]->y[keep],np.asarray(preds)->preds; both are already ndarrays), a recompute-inside-loop the round-4 lens targets — output byte-identical (33 tests green). Audited the rest (frontend Validate block,eval_features_, the.astype(int)safety cast, defaultcross_val_predictpath): no dead code, no redundant validation, no other in-loop recompute, and the per-fold refit loop is inherently non-vectorizable. Otherwise converged.metricvalidation now reads the registry viaut.LIST_METRICS_EVAL(barrel access) instead of a bare top-level import, matching the houseut.Xconstant rule. Verified: numpydoc namedReturns+ one-line summary;.. include:: examples/eval_features.rstbacked by a present, nbmake-passingexamples/metrics/eval_features.ipynb(same generation pattern as the sibling v1.1 metrics); no citations to resolve; no ADR refs introduced by this PR; frontend Validate covers every public param and the backend trusts it; bareValueErroronly, noprint. Every param has a positive + negative test. Docstring checker + doc/signature-drift clean foreval_features(only the pre-existingcomp_kldRAISES advisory remains). Full local gate green: metrics_tests + api_tests (param_coverage, backend_import_hygiene, utils_barrel, return-contract) = 301 passed.