feat(prediction): AAPred + AAPredPlot — evaluate & deploy prediction models#332

Open: breimanntools wants to merge 14 commits into master from feat/prediction-class

breimanntools (Owner) commented Jul 1, 2026

Adds a new core subpackage, aaanalysis/prediction/: the prediction layer that consolidates the model and plot children of epic #305. It sits downstream of CPP/CPPGrid, which optimize the feature space: AAPred trains and deploys real models, and AAPredPlot is the single home for every prediction figure.

AAPred(Wrapper)

  • models= — select models by registry name ("svm", "rf", "extra_trees", "log_reg") and/or configured sklearn estimator instances, in any mix. The default (svm) is the linear-SVM recipe. Any other estimator (MLP, gradient boosting, xgboost, a voting/stacking ensemble, a Pipeline, ...) is used by passing an instance — cloned at fit time, so nested/**kwargs params and random_state are handled correctly.
  • fit(..., optimize_hyperparams=, param_grids=, n_cv=) — optional GridSearchCV tuning (core sklearn, no new dep).
  • eval → tidy df_eval (model × metric × principle, cv + holdout); predict_proba / predict.
  • Three prediction levels: predict_seq / predict_domain (boundary-sensitivity scan) / predict_window, binding df_feat→X internally.
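The mixed `models=` input described above could be normalized roughly as follows. This is a minimal sketch, not the PR's implementation: `REGISTRY` and `resolve_models` are hypothetical names, and the estimator settings behind each registry name are assumptions. Only the four names, the use of `clone`, and the "fill `random_state` only when unset" rule come from the PR text.

```python
from sklearn.base import BaseEstimator, clone
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Hypothetical registry: the names are from the PR, the settings are assumed.
REGISTRY = {
    "svm": lambda: SVC(kernel="linear", probability=True),
    "rf": lambda: RandomForestClassifier(),
    "extra_trees": lambda: ExtraTreesClassifier(),
    "log_reg": lambda: LogisticRegression(max_iter=1000),
}


def resolve_models(models=("svm",), random_state=None):
    """Turn names and/or estimator instances into fresh, unfitted estimators."""
    resolved = []
    for model in models:
        if isinstance(model, str):
            if model not in REGISTRY:
                raise ValueError(f"Unknown model name: {model!r}")
            estimator = REGISTRY[model]()
        elif isinstance(model, BaseEstimator):
            # clone() copies constructor params (including nested ones)
            # and drops any fitted state.
            estimator = clone(model)
        else:
            raise TypeError("models must contain names or sklearn estimators")
        # Set random_state only if the estimator has it and it is still None.
        params = estimator.get_params(deep=False)
        if random_state is not None and params.get("random_state", 0) is None:
            estimator.set_params(random_state=random_state)
        resolved.append(estimator)
    return resolved


user_rf = RandomForestClassifier(n_estimators=50, random_state=7)
models = resolve_models(["svm", user_rf], random_state=42)
```

Here the registry model receives the shared seed, while the user-configured forest keeps its own `random_state=7` and is returned as a separate, unfitted copy.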

AAPredPlot — one figure home, three types (each returns FigAxResult)

Consolidation

Tests / verification

  • 200+ new prediction + api-meta tests green (backend-import-hygiene, plot-return-contract, param-coverage, abbreviation registry). Adversarial-review fixes included (ensemble/instance handling, clustermap NaN/labels). Example notebooks executed with outputs.

Relates to #305 (no closing keyword).

🤖 Generated with Claude Code

breimanntools and others added 10 commits July 2, 2026 01:24
…models

Add a new core subpackage `aaanalysis/prediction/` closing the gap left by
feature engineering: a thin, opinionated evaluate-and-deploy wrapper plus its
plot pair. Downstream of CPP/CPPGrid (which optimize the *feature space*),
AAPred takes a fixed feature matrix and trains real models kept for deployment
— it deliberately does not do hyperparameter search.

AAPred(Wrapper):
- eval(X, labels, X_holdout=, labels_holdout=, metrics=, n_cv=) -> df_eval
  (model x metric x principle; principle = cv and/or holdout — two evaluation
  principles in one tidy table)
- fit / predict_proba / predict for deployment (models kept in list_models_)
- binary-only, reproducible (random_state threaded to models + StratifiedKFold)
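To illustrate the "two evaluation principles in one tidy table" idea, here is a rough sketch built on plain scikit-learn. The `evaluate` helper and the row keys (`model`, `metric`, `principle`, `score`, `std`) are assumptions chosen for illustration; the actual `df_eval` columns are not spelled out in this message.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import (StratifiedKFold, cross_val_score,
                                     train_test_split)


def evaluate(models, X, y, X_holdout=None, y_holdout=None,
             metrics=("accuracy",), n_cv=5, random_state=0):
    """Return one row per (model, metric, principle)."""
    cv = StratifiedKFold(n_splits=n_cv, shuffle=True, random_state=random_state)
    rows = []
    for name, model in models.items():
        # Fit once on the training data for the holdout principle.
        fitted = clone(model).fit(X, y) if X_holdout is not None else None
        for metric in metrics:
            scores = cross_val_score(model, X, y, cv=cv, scoring=metric)
            rows.append({"model": name, "metric": metric, "principle": "cv",
                         "score": float(scores.mean()),
                         "std": float(scores.std())})
            if fitted is not None:
                score = get_scorer(metric)(fitted, X_holdout, y_holdout)
                rows.append({"model": name, "metric": metric,
                             "principle": "holdout",
                             "score": float(score), "std": None})
    return rows


X, y = make_classification(n_samples=120, n_features=8, random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)
rows = evaluate({"log_reg": LogisticRegression(max_iter=1000)},
                X_train, y_train, X_hold, y_hold, metrics=("accuracy", "f1"))
```

The list of dicts can be passed to `pandas.DataFrame(rows)` to obtain the long-format table.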

AAPredPlot (1:1 plot pair, each returns ut.FigAxResult):
- eval  — grouped model x metric bars (CV error bars + hatched holdout, baseline)
- hist  — class-separated score distribution
- scatter — two-predictor per-sample comparison with y=x line
- cutoff — % of samples above each cutoff (survival curve)
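The quantity behind the `cutoff` plot can be sketched in a few lines. `pct_above_cutoffs` is a hypothetical helper, and treating "above" as a strict `>` comparison is an assumption.

```python
def pct_above_cutoffs(scores, cutoffs):
    """Return the percentage of scores strictly above each cutoff."""
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    return [100.0 * sum(s > c for s in scores) / n for c in cutoffs]


scores = [0.1, 0.4, 0.6, 0.9]
curve = pct_above_cutoffs(scores, cutoffs=[0.0, 0.5, 1.0])
# curve == [100.0, 50.0, 0.0]
```

Plotting `curve` against the cutoffs gives a monotonically non-increasing survival-style curve.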

Supporting: 5 new COL_/LIST_ constants; 10 executed example notebooks;
63 unit tests; abbreviation registry + style-guide table (aapred / aapred_plot);
backend-hygiene guard; docs API "Prediction" section.

Complements TreeModel (tree-ensemble importance) as a sibling; Pareto-over-
df_feat and a per-residue panel are intentionally deferred. Related to #305.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… plots

Bind a CPP feature definition to AAPred (df_feat + df_scales) so the model can
featurize raw sequences internally, and add the three prediction granularities
plus their plots.

AAPred:
- predict_seq(df_seq)      -> one score per protein
- predict_domain(df_seq, window=) -> domain score + boundary-sensitivity scan
  (shifts tmd_start/tmd_stop over [-window, +window], flags the best definition)
- predict_window(df_seq, tmd_len, step=) -> per-residue score profile
  (anchors a length-tmd_len window at every valid position)
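The boundary-sensitivity scan can be pictured with a toy example. This sketch assumes 1-based inclusive positions and a single offset applied to both boundaries. `scan_domain` and `hydrophobic_fraction` are hypothetical stand-ins, with the toy scorer replacing the trained model.

```python
def scan_domain(sequence, tmd_start, tmd_stop, window, score_fn):
    """Score the domain at every offset in [-window, +window]."""
    rows = []
    for offset in range(-window, window + 1):
        start, stop = tmd_start + offset, tmd_stop + offset
        if start < 1 or stop > len(sequence):
            continue  # shifted domain would fall outside the sequence
        segment = sequence[start - 1:stop]
        rows.append({"offset": offset, "start": start, "stop": stop,
                     "score": score_fn(segment)})
    if not rows:
        raise ValueError("no valid domain definition within the window")
    best = max(rows, key=lambda row: row["score"])
    for row in rows:
        row["is_best"] = row is best  # flag the best-scoring definition
    return rows


def hydrophobic_fraction(segment):
    """Toy score: fraction of hydrophobic residues in the segment."""
    return sum(aa in "AILMFVW" for aa in segment) / len(segment)


rows = scan_domain("KKKKAILLVVGGKKK", tmd_start=6, tmd_stop=11, window=2,
                   score_fn=hydrophobic_fraction)
best = next(row for row in rows if row["is_best"])
# best["offset"] == -1: shifting the annotated domain by one residue scores higher
```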

AAPredPlot:
- window(df_window, threshold=, list_annotations=) -> per-residue profile with
  optional user-provided per-residue annotation tracks (topology/pLDDT/...)
- domain(df_domain) -> boundary-sensitivity curve, best offset starred,
  annotated boundary (offset 0) marked

X is computed internally via SequenceFeature.get_df_parts + feature_matrix
(Position-based for seq/domain, anchor/pos-based for window). New COL_OFFSET /
COL_RESIDUE_POS constants. 31 new tests (94 total for the subpackage); 5 new
executed example notebooks. Heavy annotation-track auto-fill (AnnotationPreprocessor),
SHAP importance track (ShapModel), and an interactive viewer are deferred follow-ups.
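For the window level, the enumeration of valid anchor positions might look like this. `iter_windows` is a hypothetical helper and assumes 1-based inclusive positions. How each window score is mapped back to a residue is not stated here and is left out.

```python
def iter_windows(sequence, tmd_len, step=1):
    """Yield (start, stop, segment) for every window that fits the sequence."""
    if tmd_len < 1 or step < 1:
        raise ValueError("tmd_len and step must be positive")
    for i in range(0, len(sequence) - tmd_len + 1, step):
        yield i + 1, i + tmd_len, sequence[i:i + tmd_len]


windows = list(iter_windows("ABCDEFG", tmd_len=3, step=2))
# windows == [(1, 3, "ABC"), (3, 5, "CDE"), (5, 7, "EFG")]
```

A sequence shorter than `tmd_len` yields no windows rather than raising an error.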

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
One name->estimator factory (ut.get_cv_model_) covering ~10 sklearn families plus
voting/stacking meta-ensembles, so the mapping does not drift across AAPred /
predict_samples / find_features. Default 'svm' is the linear-SVM recipe. 'xgboost'
is a reserved name that raises an install hint if the optional dep is missing;
never imported at module load. A configured sklearn estimator instance may be
passed instead of a name. Adds MODEL_* constants + LIST_PRED_MODELS.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- models= is the new primary selection API: registry name strings (ut.get_cv_model_)
  and/or configured sklearn estimator instances, in any mix; normalized to the
  existing (class, kwargs) representation so the legacy list_model_classes path and
  its tests are unchanged. Mutually exclusive with list_model_classes.
- fit() gains optimize_hyperparams / param_grids / n_cv: each model tuned via
  GridSearchCV over its grid (or a built-in default), keeping the best estimator.
- Docstrings + 5 new tests (param-coverage green).
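For reference, the underlying scikit-learn pattern (tune one model over a grid, keep the best estimator) looks roughly like this. The grid values and synthetic data are illustrative assumptions, not the built-in defaults mentioned above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=80, n_features=10, random_state=0)

# Seeded, stratified folds keep the search reproducible.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(
    estimator=SVC(kernel="linear", random_state=0),
    param_grid={"C": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="accuracy",
    cv=cv,
)
search.fit(X, y)

# With the default refit=True, best_estimator_ is refit on all of X.
best_model = search.best_estimator_
```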

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
# Conflicts:
#	docs/source/index/docstring_guide.rst
…rts_kws=)

Use the single-call df_seq path (merged in #316) instead of the manual
get_df_parts + feature_matrix pair. Behaviour-preserving; prediction suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…AAPred for deploy

Adds a note on TreeModel.predict_proba directing users to AAPred as the recommended
train/deploy/predict path (sequence/domain/window). No behavior change.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…barplot

Fold the grouped comparison barplot (value labels + chance/baseline line) from the
notebook and the #305 plot-comparison prototype into AAPredPlot as a method (not a
loose top-level plot_* symbol). Backend renderer under _backend/aa_pred/; returns
FigAxResult. Adds an executed example notebook + 9 tests; _aa_pred_plot co-owns the
aa_pred backend (like aaclust/_aaclust_plot).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…olors, cut-offs)

Fold the notebook's ranked-candidate figure into AAPredPlot: horizontal bars ranked
by prediction score, colored by class, with optional error bars and confidence cut-off
lines. Figure height scales with the number of items (0.22*n+1). Backend renderer,
executed example notebook, 9 tests.
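The ranking and the height rule can be sketched as below. Both helpers are hypothetical. Only the `0.22*n+1` formula comes from this message, and reading the result as inches (matplotlib's figure-size unit) is an assumption.

```python
def rank_candidates(names, scores):
    """Sort (name, score) pairs from highest to lowest score."""
    return sorted(zip(names, scores), key=lambda item: item[1], reverse=True)


def figure_height(n_items):
    """Height grows linearly with the number of bars: 0.22 * n + 1."""
    return 0.22 * n_items + 1


ranked = rank_candidates(["P1", "P2", "P3"], [0.2, 0.9, 0.5])
height = figure_height(len(ranked))
# ranked == [("P2", 0.9), ("P3", 0.5), ("P1", 0.2)]
```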

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Add clustermap: cluster samples by explanation similarity (Pearson correlation of
per-sample importance/SHAP vectors) with class-color sidebars. Core drawing from
provided vectors (no pro dep); SHAP computation stays in the pro lane. Reframe the
AAPredPlot docstring into the 3 figure types (positional / cohort / evaluation).
Backend renderer, executed example notebook, 7 tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
breimanntools and others added 4 commits July 2, 2026 12:32
Match the plot-return contract (ADR-0039) annotation used by the other AAPredPlot
methods; the body still returns ut.FigAxResult (a Figure/Axes tuple).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…/random_state)

Adversarial review fixes:
- Store configured estimator INSTANCES and clone() them at fit/eval time instead of
  round-tripping through get_params()+type(). Fixes construction crashes for meta-
  ensembles (voting/stacking, nested params) and **kwargs estimators (xgboost), and
  propagates random_state to user-passed instances (only when they left it unset).
- clustermap: sanitize NaN correlations from zero-variance rows (fill diag=1) so
  linkage doesn't crash; accept arbitrary (string) cosmetic class labels.
- comparison: guard np.nanmax against an all-NaN grid (no RuntimeWarning).
- drop dead use_label_encoder kwarg from the xgboost factory.
+7 regression tests (meta-ensembles, instance random_state, constant-row clustermap).
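The two NaN guards can be illustrated with NumPy. This is a sketch under assumptions: `safe_correlation` and `safe_nanmax` are hypothetical names, and filling undefined off-diagonal correlations with 0 is an assumed choice. Only the diagonal fill of 1 is stated above.

```python
import numpy as np


def safe_correlation(vectors):
    """Pearson correlation between rows, with NaNs from constant rows removed."""
    vectors = np.asarray(vectors, dtype=float)
    # A zero-variance row makes the correlation undefined (0/0 -> NaN).
    with np.errstate(divide="ignore", invalid="ignore"):
        corr = np.corrcoef(vectors)
    corr = np.nan_to_num(corr, nan=0.0)  # assumed fill for undefined pairs
    np.fill_diagonal(corr, 1.0)          # each sample correlates with itself
    return corr


def safe_nanmax(values, default=0.0):
    """np.nanmax that returns a default for an empty or all-NaN input."""
    values = np.asarray(values, dtype=float)
    if values.size == 0 or np.isnan(values).all():
        return default
    return float(np.nanmax(values))


corr = safe_correlation([[1.0, 2.0, 3.0],
                         [3.0, 2.0, 1.0],
                         [5.0, 5.0, 5.0]])  # last row has zero variance
```

With the NaNs removed, the matrix is finite and can be converted to a distance matrix for hierarchical linkage.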

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Drop the speculative registry entries (mlp/tree/lda/gradient_boosting/knn) and the
fragile meta-ensembles + optional xgboost — none were used by name anywhere, and the
meta/xgboost paths were the bug-prone ones. Keep svm/rf/extra_trees/log_reg, matching
pipe.predict_samples' established default set. Any other estimator (including a voting/
stacking ensemble or xgboost) is used by passing a configured sklearn instance, which
the clone-based path handles — so no capability is lost, just the maintenance surface.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…one refactor

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>