feat: ShapModelPlot.clustermap + shap_to_feat_imp + CPPPlot sample= (prototype for #313) by breimanntools · Pull Request #323 · breimanntools/aaanalysis

breimanntools · 2026-07-01T03:03:06Z

Part of #305 / prototype for #313.

Status

Draft, functionally complete and green locally. New unit tests (56 in the pro
plot suite + CPPPlot sample=) pass; the full tests/unit/cpp_plot_tests/ suite
(356) stays green; docstring, doc/signature-drift, param-coverage, backend
import-hygiene, utils-barrel and agentic-docs checkers all pass. ShapModelPlot /
shap_to_feat_imp are intentionally not wired into the top-level aaanalysis
namespace (see the TODO below) — they are reachable via
aaanalysis.explainable_ai_pro.

What this adds

- ShapModelPlot.clustermap: clusters samples by explanation similarity
  (Pearson correlation of per-sample SHAP vectors), with row/col class-color
  sidebars, a class legend, a labelled horizontal colorbar and font via
  plot_gco. Returns the seaborn ClusterGrid (a multi-axes object; see the
  residual note).
- ShapModelPlot.get_clusters: deterministic dendrogram cut
  (n_clusters / color_threshold), the library-grade replacement for the
  original project's dendrogram-color parsing.
- shap_to_feat_imp: SHAP vector → signed feature impact (reusing the
  ShapModel per-sample backend, so it never diverges) / absolute importance,
  both normalized so sum(|.|) == 100.
- CPPPlot.ranking/profile/feature_map sample=: resolves
  col_imp="feat_impact_<entry>" (+ TMD-JMD parts from df_parts via
  SequenceFeature.get_seq_kws for the sequence-level plots) and sets
  shap_plot=True, replacing the manual col_imp=f"..." + **seq_kws plumbing
  from γ-secretase cells 30/32.
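
The sum(|.|) == 100 normalization named above can be illustrated with a minimal NumPy sketch. It is not the PR's implementation: the real shap_to_feat_imp reuses the ShapModel per-sample backend, whereas the function name normalize_to_100 and its signed flag are hypothetical stand-ins. The all-zero guard mirrors the behaviour described in the review log.

```python
import numpy as np


def normalize_to_100(shap_vector, signed=True):
    """Scale a SHAP vector so that the absolute values sum to 100.

    Hypothetical helper for illustration only; not the aaanalysis API.
    """
    values = np.asarray(shap_vector, dtype=float)
    total = np.abs(values).sum()
    if total == 0:
        # An all-zero vector has no defined relative impact.
        raise ValueError("Feature impact is undefined for an all-zero SHAP vector.")
    scaled = values / total * 100
    # Signed impact keeps the direction; absolute importance drops it.
    return scaled if signed else np.abs(scaled)


impact = normalize_to_100([1.0, -3.0, 0.0, 4.0])
importance = normalize_to_100([1.0, -3.0, 0.0, 4.0], signed=False)
```

Both results satisfy the same invariant, which is what makes signed impact and absolute importance directly comparable across samples.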

Critical self-review

Defects found and fixed while reviewing the prior (uncommitted, untested) code:

- sample as an int position produced a wrong col_imp. The old
  resolve_sample_kws built col_imp=f"feat_impact_{sample}" from the raw
  value, so sample=0 looked for feat_impact_0, but impact columns are keyed
  by entry name (feat_impact_APP). Fixed: for profile/feature_map an
  int position is mapped to its entry name via df_parts.index; ranking has
  no df_parts to map a position, so it now accepts the entry name (str) only
  (annotation tightened to Optional[str], clear error otherwise). Covered by
  test_int_position_equals_name and test_int_position_rejected.
- Misleading heatmap x-axis label. The clustermap set the heatmap's x-axis
  label to "Pearson correlation (r)", but that axis is the sample list, not a
  correlation scale. Removed; the correlation label now lives only on the
  colorbar.
- Colorbar clipped / no sidebar legend (plot quality). The horizontal
  colorbar rendered in the tiny default vertical slot and its centered label
  overflowed the figure's left edge; the class-color sidebars had no legend.
  Fixed with an explicit cbar_pos, a left-aligned colorbar label, and a
  house-style class legend (ut.plot_legend_) top-right. Rendered to PNG and
  inspected: block structure clear, dendrograms sane, nothing clipped.
- df_seq docstring diverged from the canonical baseline (docstring checker
  DFSEQ-BASELINE). Reworded to the canonical "DataFrame containing an entry
  column …".
- Minor: dropped an unused numpy import / redundant np.asarray coercion in
  the backend (frontend already hands it a validated float array).

Verified byte-identical default behaviour: sample=None reproduces the
existing ranking/profile/feature_map PNG exactly, and sample="<entry>"
equals the explicit col_imp=... (+ **seq_kws) call (golden PNG-equality tests).
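
The int-position fix from the self-review can be sketched roughly as follows. This is an illustration under assumptions, not the PR's resolve_sample_kws: the function name resolve_col_imp and the inline range check are hypothetical (the PR states that out-of-range positions are rejected upstream by get_seq_kws), and df_parts here is a toy frame whose index holds the entry names.

```python
import pandas as pd


def resolve_col_imp(sample, df_parts=None):
    """Map sample= (entry name or int position) to a feat_impact_<entry> column name.

    Hypothetical helper for illustration only; not the aaanalysis API.
    """
    if sample is None:
        return None  # default behaviour: no sample-specific column
    if isinstance(sample, bool):
        raise ValueError("'sample' must be an entry name (str) or an int position.")
    if isinstance(sample, int):
        if df_parts is None:
            # Mirrors ranking(): without df_parts a position cannot be mapped.
            raise ValueError("An int position needs 'df_parts'; pass the entry name instead.")
        if not 0 <= sample < len(df_parts):
            raise ValueError(f"'sample' position {sample} is out of range.")
        entry = df_parts.index[sample]  # position -> entry name
    elif isinstance(sample, str):
        entry = sample
    else:
        raise ValueError("'sample' must be an entry name (str) or an int position.")
    return f"feat_impact_{entry}"


# Toy parts table; the column content is made up for the example.
df_parts = pd.DataFrame({"tmd": ["AAA", "CCC"]}, index=["APP", "NOTCH1"])
by_position = resolve_col_imp(0, df_parts)
by_name = resolve_col_imp("APP")
```

Resolving the position to a name before building the column string is what keeps sample=0 and sample="APP" equivalent.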

Residual concerns

- Return type diverges from the FigAxResult family. clustermap returns a
  seaborn ClusterGrid, not (fig, ax), because the figure has several axes
  (heatmap, two dendrograms, colorbar, two sidebars) and callers need
  grid.fig / grid.ax_heatmap / grid.dendrogram_*.linkage. Documented in the
  Returns section; flagging it as a deliberate, sibling-inconsistent choice.
- get_clusters is beyond the three named #313 deliverables. It is a small,
  deterministic complement to clustermap (and its See Also); drop it if you
  prefer to keep the PR to the exact issue scope.
- method is validated only as a str (scipy raises on an unknown linkage
  method) rather than against an allow-list; kept thin on purpose.
- shap-availability: ShapModelPlot / shap_to_feat_imp don't need shap
  themselves, but importing them goes through explainable_ai_pro.__init__ →
  _shap_model (which imports shap), so they require the pro extra in
  practice. Core import aaanalysis is unaffected (ShapModel is try/except-stubbed
  and these two symbols are not wired top-level). Tests use
  pytest.importorskip("shap").
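
For readers unfamiliar with the try/except stubbing mentioned in the last concern, a generic sketch of the pattern follows. How aaanalysis actually stubs ShapModel is not shown in this PR, so everything here is an assumption: the helper make_stub, the error wording, and the deliberately non-existent module name used to force the fallback path.

```python
import importlib


def make_stub(name, extra="pro"):
    """Build a placeholder class that fails loudly only when it is used.

    Hypothetical helper for illustration only; not the aaanalysis API.
    """
    class _Stub:
        def __init__(self, *args, **kwargs):
            raise ImportError(
                f"{name} requires optional dependencies; "
                f"install them with 'pip install aaanalysis[{extra}]'."
            )

    _Stub.__name__ = name
    return _Stub


try:
    # Deliberately missing module, so the example always takes the fallback.
    _backend = importlib.import_module("a_hypothetical_missing_backend")
    ShapModel = _backend.ShapModel
except ImportError:
    # Importing the package still succeeds; only instantiation raises.
    ShapModel = make_stub("ShapModel")
```

The point of the pattern is that the import of the core package never fails; the error is deferred to the moment a pro-only class is instantiated.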

TODO (not in this PR)

- Wire ShapModelPlot / shap_to_feat_imp into top-level aaanalysis.__init__
  with pro-gating (CONFIRM-FIRST; __init__.py / __all__ are confirm-first
  surfaces); kept as a TODO in explainable_ai_pro/__init__.py.
- Add examples/*.rst for the new methods and switch the inline doctests to
  .. include::.

Iterative review log

- Round 2 (correctness & edge cases): a sample with a constant/zero-variance SHAP
  vector produced an opaque ValueError: The condensed distance matrix must contain
  only finite values deep inside seaborn. Now comp_shap_correlation detects the
  undefined correlation and raises a clear message naming the offending sample(s)
  (covers both clustermap and get_clusters). shap_to_feat_imp on an all-zero vector
  silently returned nan; now raises a clear "undefined" ValueError. Added a
  regression test proving get_clusters uses the exact same linkage the clustermap
  dendrogram draws (KPI: linkage + cluster assignment match), plus negative tests
  for both new guards (70 tests, all green). Verified out-of-range int sample=
  positions are already rejected upstream by get_seq_kws (friendly ValueError, not
  IndexError).
- Round 3 (plot quality): rendered the clustermap to PNG under both default
  Matplotlib rcParams and aa.plot_settings(). Found the class legend overflowed the
  figure's right edge by ~35px under default rcParams (title clipped to "Cl"), so it
  was cut on a plain fig.savefig without bbox_inches="tight". Fixed by reserving a
  right margin (grid.gs.update(right=0.80)) so the legend fits inside the canvas;
  verified that it now fits under both default and plot_settings() (font 18).
  Colorbar (horizontal, left-aligned "Correlation (r)" label, -1/0/1 ticks),
  dendrograms, block structure and fonts (plot_gco) all inspected clean. Added a
  regression test asserting the legend stays within the figure bounds under default
  rcParams.
- Round 4 (efficiency & simplicity): removed dead n_features unpacking in three
  spots (check_match_shap_values_labels, clustermap, get_clusters; only n_samples
  is used); replaced list(df_parts.index)[int(sample)] (materialized the whole index
  to read one label) with a direct df_parts.index[int(sample)]; simplified the
  duplicate-name detection from an O(n^2) list(names).count(...) to a single
  collections.Counter pass. Output byte-identical (all 71 tests green, including the
  sample= golden PNG-equality tests).
- Round 5 (guides, docs & tests): the auto-discovering pro-contract meta-test
  (test_docstring_contracts.py::test_pro_marker_in_summary) was failing for the new
  shap_to_feat_imp: its one-line summary lacked the required [pro] / aaanalysis[pro]
  install marker (Round 1 never ran this test). Fixed the summary. Re-verified the
  whole area: docstring checker + doc/signature-drift clean; no ADR refs / print() in
  the new .py; all raises are bare ValueError; every public param of clustermap /
  get_clusters / shap_to_feat_imp and the CPPPlot sample= / df_seq / df_parts
  shortcut has a Validate check and a positive+negative test; plot_gco fonts. Full
  local gate green: shap_model_plot + full cpp_plot suite + param-coverage +
  import-hygiene + return-contract + docstring-contracts + utils-barrel +
  agentic-docs (492 passed). The two internal symbols remain out of the top-level
  __all__ (TODO kept, CONFIRM-FIRST).
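
The linkage-consistency KPI from Round 2 rests on a simple idea: compute one linkage matrix and derive both the dendrogram and the cluster labels from it. The SciPy sketch below illustrates that idea only. The function names, the 1 - r distance, and the "average" method are assumptions; the PR does not state which distance the clustermap uses, and in the real code the linkage would presumably be read from the ClusterGrid (grid.dendrogram_*.linkage).

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def correlation_linkage(shap_values, method="average"):
    """Linkage over samples, using 1 - Pearson r as an assumed distance."""
    corr = np.corrcoef(np.asarray(shap_values, dtype=float))
    dist = np.clip(1.0 - corr, 0.0, 2.0)
    dist = (dist + dist.T) / 2.0   # enforce exact symmetry
    np.fill_diagonal(dist, 0.0)    # remove floating-point noise on the diagonal
    return linkage(squareform(dist, checks=False), method=method)


def cut_clusters(Z, n_clusters):
    """Deterministic cut: the same linkage always yields the same labels."""
    return fcluster(Z, t=n_clusters, criterion="maxclust")


# Toy SHAP matrix: rows 0/1 rise together, rows 2/3 fall together.
shap_values = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.0, 4.5, 2.0],
]
Z = correlation_linkage(shap_values)
labels = cut_clusters(Z, n_clusters=2)
```

Because the labels come from the linkage rather than from parsed dendrogram colors, they do not depend on any rendering detail.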

🤖 Generated with Claude Code

…CPPPlot sample= shortcut

Ports the explanation-similarity clustermap from the original gamma-secretase project into a library-grade pro API, adds the shap_to_feat_imp normalization helper, and lets CPPPlot.ranking/profile/feature_map resolve a sample by name.

- ShapModelPlot.clustermap: correlation-of-SHAP-vectors clustermap with row/col class-color sidebars, a class legend, a labelled horizontal colorbar, and font via plot_gco; returns the seaborn ClusterGrid.
- ShapModelPlot.get_clusters: deterministic dendrogram cut (n_clusters / color_threshold), replacing the original dendrogram-color parsing.
- shap_to_feat_imp: signed impact (reusing the ShapModel backend) / absolute importance, both normalized to sum(|.|)=100.
- CPPPlot sample=: resolves col_imp=feat_impact_<entry> (+ TMD-JMD parts from df_parts for profile/feature_map) and sets shap_plot=True; default output unchanged when sample is None.

ShapModelPlot / shap_to_feat_imp stay unwired at the top level (TODO #305, CONFIRM-FIRST). pro-gated; tests skip cleanly when shap is absent.

Part of #305 / prototype for #313.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e-consistency test

- comp_shap_correlation: raise a clear ValueError naming samples with a constant (zero-variance) SHAP vector instead of an opaque scipy non-finite-distance error (covers clustermap + get_clusters).
- shap_to_feat_imp: raise on an all-zero vector instead of silently returning nan.
- Add a regression test proving get_clusters uses the same linkage the clustermap dendrogram draws, plus negative tests for both new guards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
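
The zero-variance guard described in this commit can be sketched as below. This is an illustration rather than the PR's comp_shap_correlation: the function name, the names argument, and the message wording are assumptions. The underlying fact is standard: Pearson correlation divides by the standard deviation, so a constant vector yields an undefined (nan) correlation.

```python
import numpy as np


def shap_correlation_sketch(shap_values, names=None):
    """Sample-by-sample Pearson correlation with an explicit zero-variance guard.

    Hypothetical helper for illustration only; not the aaanalysis backend.
    """
    X = np.asarray(shap_values, dtype=float)
    if X.ndim != 2:
        raise ValueError("'shap_values' must be 2D (n_samples, n_features).")
    n_samples = X.shape[0]
    if names is None:
        names = [str(i) for i in range(n_samples)]
    # A row whose max equals its min is constant, so its correlation is undefined.
    constant = (X.max(axis=1) - X.min(axis=1)) == 0
    if constant.any():
        bad = [names[i] for i in np.flatnonzero(constant)]
        raise ValueError(
            f"Pearson correlation is undefined for constant SHAP vectors: {bad}"
        )
    return np.corrcoef(X)


corr = shap_correlation_sketch([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]])
```

Checking before calling np.corrcoef lets the error name the offending samples instead of surfacing later as a non-finite distance.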

Under default Matplotlib rcParams the class legend overflowed the figure's right edge (~35px) and was clipped on a plain savefig without bbox_inches='tight'. Reserve a right margin via grid.gs.update(right=0.80) so the legend fits inside the canvas; verified under both default rcParams and plot_settings(). Add a regression test asserting the legend stays within the figure bounds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Counter dedup

- Remove unused n_features unpacking (3 spots; only n_samples is used).
- df_parts.index[int(sample)] instead of list(df_parts.index)[int(sample)].
- Duplicate-name detection via collections.Counter (single pass) instead of an O(n^2) list(names).count(...) scan.

Output byte-identical; all tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
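
The duplicate-name change in this commit follows a common pattern, sketched here with a hypothetical helper name and toy entry names. Calling list.count once per element rescans the list each time, which is quadratic, whereas collections.Counter tallies every name in one pass.

```python
from collections import Counter


def find_duplicates(names):
    """Return names that occur more than once, in first-seen order.

    Hypothetical helper for illustration only; not the aaanalysis code.
    """
    counts = Counter(names)  # single pass over the names
    return [name for name, n in counts.items() if n > 1]


duplicates = find_duplicates(["APP", "NOTCH1", "APP", "CD44", "CD44"])
```

Since Counter preserves insertion order, the reported duplicates come out in the order they first appear, which keeps error messages stable.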

The auto-discovering pro-contract meta-test requires every public *_pro symbol's one-line summary to carry the [pro] / aaanalysis[pro] install marker; shap_to_feat_imp lacked it and was failing test_pro_marker_in_summary. Add the marker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov · 2026-07-01T04:10:59Z

Codecov Report

❌ Patch coverage is 98.63014% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.95%. Comparing base (f1cfa72) to head (f4a21ba).

Files with missing lines	Patch %	Lines
aaanalysis/feature_engineering/_cpp_plot.py	93.75%	0 Missing and 2 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #323      +/-   ##
==========================================
+ Coverage   94.93%   94.95%   +0.01%     
==========================================
  Files         185      187       +2     
  Lines       17862    18008     +146     
  Branches     3032     3054      +22     
==========================================
+ Hits        16957    17099     +142     
- Misses        598      599       +1     
- Partials      307      310       +3

Files with missing lines	Coverage Δ
aaanalysis/explainable_ai_pro/__init__.py	`100.00% <100.00%> (ø)`
.../explainable_ai_pro/_backend/shap_model/sm_plot.py	`100.00% <100.00%> (ø)`
aaanalysis/explainable_ai_pro/_shap_model_plot.py	`100.00% <100.00%> (ø)`
aaanalysis/feature_engineering/_cpp_plot.py	`97.96% <93.75%> (-0.38%)`	⬇️

... and 1 file with indirect coverage changes

Components	Coverage Δ
cpp_core	`94.95% <ø> (ø)`


breimanntools

The clustermap is not only for SHAP values but also for features or other numerical representations. Perhaps we need a plot_clustermap util instead of assigning it to ShapModel. Should we make a general plotting class called AAPlot (aap), so that the prediction plots can be assigned to it as well (or AAPredPlot)? I do not know right now.

breimanntools and others added 5 commits July 1, 2026 05:02

breimanntools commented Jul 1, 2026

Merge remote-tracking branch 'origin/master' into feat/shap-clustermap

550471a

breimanntools wants to merge 6 commits into master from feat/shap-clustermap
