feat: ShapModelPlot.clustermap + shap_to_feat_imp + CPPPlot sample= (prototype for #313)#323

Draft
breimanntools wants to merge 6 commits into master from feat/shap-clustermap

@breimanntools breimanntools commented Jul 1, 2026

Part of #305 / prototype for #313.

Status

Draft, functionally complete and green locally. New unit tests (56 in the pro
plot suite + CPPPlot sample=) pass; the full tests/unit/cpp_plot_tests/ suite
(356) stays green; docstring, doc/signature-drift, param-coverage, backend
import-hygiene, utils-barrel and agentic-docs checkers all pass. ShapModelPlot /
shap_to_feat_imp are intentionally not wired into the top-level aaanalysis
namespace (see the TODO below) — they are reachable via
aaanalysis.explainable_ai_pro.

What this adds

  • ShapModelPlot.clustermap — clusters samples by explanation similarity
    (Pearson correlation of per-sample SHAP vectors), with row/col class-color
    sidebars, a class legend, a labelled horizontal colorbar and font via
    plot_gco. Returns the seaborn ClusterGrid (multi-axes object — see the
    residual note).
  • ShapModelPlot.get_clusters — deterministic dendrogram cut
    (n_clusters / color_threshold), the library-grade replacement for the
    original project's dendrogram-color parsing.
  • shap_to_feat_imp — SHAP vector → signed feature impact (reusing the
    ShapModel per-sample backend, so it never diverges) / absolute importance,
    both normalized so sum(|.|) == 100.
  • CPPPlot.ranking/profile/feature_map sample= — resolves
    col_imp="feat_impact_<entry>" (+ TMD-JMD parts from df_parts via
    SequenceFeature.get_seq_kws for the sequence-level plots) and sets
    shap_plot=True, replacing the manual col_imp=f"..." + **seq_kws plumbing
    from γ-secretase cells 30/32.
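
The normalization described for shap_to_feat_imp can be sketched as follows. This is a minimal stdlib-only illustration, not the library's implementation: the function name normalize_to_100 and its signed parameter are hypothetical, and the all-zero guard mirrors the behaviour described in Round 2 of the review log.

```python
from typing import List, Sequence


def normalize_to_100(shap_values: Sequence[float], signed: bool = True) -> List[float]:
    """Scale a SHAP vector so that sum(|.|) == 100 (hypothetical helper).

    signed=True keeps the sign (feature impact);
    signed=False returns absolute values (feature importance).
    """
    total = sum(abs(v) for v in shap_values)
    if total == 0:
        # An all-zero vector has no defined normalization; fail loudly
        # instead of returning nan.
        raise ValueError("Normalization is undefined for an all-zero SHAP vector.")
    scaled = [100.0 * v / total for v in shap_values]
    return scaled if signed else [abs(v) for v in scaled]


impact = normalize_to_100([2.0, -1.0, 1.0])                    # signed impact
importance = normalize_to_100([2.0, -1.0, 1.0], signed=False)  # absolute importance
```

Both outputs satisfy the same invariant, so the two can be compared on one scale.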

Critical self-review

Defects found and fixed while reviewing the prior (uncommitted, untested) code:

  1. sample as an int position produced a wrong col_imp. The old
    resolve_sample_kws built col_imp=f"feat_impact_{sample}" from the raw
    value, so sample=0 looked for feat_impact_0 — but impact columns are keyed
    by entry name (feat_impact_APP). Fixed: for profile/feature_map an
    int position is mapped to its entry name via df_parts.index; ranking has
    no df_parts to map a position, so it now accepts the entry name (str) only
    (annotation tightened to Optional[str], clear error otherwise). Covered by
    test_int_position_equals_name and test_int_position_rejected.
  2. Misleading heatmap x-axis label. The clustermap set the heatmap's x-axis
    label to "Pearson correlation (r)" — but that axis is the sample list, not a
    correlation scale. Removed; the correlation label now lives only on the
    colorbar.
  3. Colorbar clipped / no sidebar legend (plot quality). The horizontal
    colorbar rendered in the tiny default vertical slot and its centered label
    overflowed the figure's left edge; the class-color sidebars had no legend.
    Fixed with an explicit cbar_pos, a left-aligned colorbar label, and a
    house-style class legend (ut.plot_legend_) top-right. Rendered to PNG and
    inspected: block structure clear, dendrograms sane, nothing clipped.
  4. df_seq docstring diverged from the canonical baseline (docstring checker
    DFSEQ-BASELINE). Reworded to the canonical "DataFrame containing an entry
    column …".
  5. Minor: dropped an unused numpy import / redundant np.asarray coercion in
    the backend (frontend already hands it a validated float array).
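
The fix in item 1 can be illustrated with a small sketch. It assumes a plain list of entry names standing in for df_parts.index; resolve_col_imp and the entry name "ENTRY_B" are hypothetical and only show the position-to-name mapping, not the actual resolve_sample_kws code.

```python
from typing import List, Optional, Union


def resolve_col_imp(sample: Union[int, str],
                    entries: Optional[List[str]] = None) -> str:
    """Build the impact column name for a sample (hypothetical helper).

    entries stands in for df_parts.index. When it is None (as for ranking),
    only an entry name (str) can be resolved.
    """
    if isinstance(sample, bool) or not isinstance(sample, (int, str)):
        raise ValueError("'sample' must be an entry name (str) or an int position.")
    if isinstance(sample, int):
        if entries is None:
            raise ValueError("'sample' must be an entry name (str) here; "
                             "no entry index is available to map a position.")
        if not 0 <= sample < len(entries):
            raise ValueError(f"'sample' position {sample} is out of range.")
        # Impact columns are keyed by entry name, never by position.
        sample = entries[sample]
    return f"feat_impact_{sample}"


entries = ["APP", "ENTRY_B"]  # "ENTRY_B" is a made-up entry name
by_position = resolve_col_imp(0, entries)
by_name = resolve_col_imp("APP", entries)
```

With this mapping, a position and its entry name resolve to the same column, which is the property test_int_position_equals_name is described as covering.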

Verified byte-identical default behaviour: sample=None reproduces the
existing ranking/profile/feature_map PNG exactly, and sample="<entry>"
equals the explicit col_imp=... (+ **seq_kws) call (golden PNG-equality tests).

Residual concerns

  • Return type diverges from the FigAxResult family. clustermap returns a
    seaborn ClusterGrid, not (fig, ax), because the figure has several axes
    (heatmap, two dendrograms, colorbar, two sidebars) and callers need
    grid.fig / grid.ax_heatmap / grid.dendrogram_*.linkage. Documented in the
    Returns section; flagging it as a deliberate, sibling-inconsistent choice.
  • get_clusters is beyond the three named #313 deliverables. It is a small,
    deterministic complement to clustermap (and its See Also); drop it if you
    prefer to keep the PR to the exact issue scope.
  • method is validated only as a str (scipy raises on an unknown linkage
    method) rather than against an allow-list — kept thin on purpose.
  • shap-availability: ShapModelPlot / shap_to_feat_imp don't need shap
    themselves, but importing them goes through explainable_ai_pro.__init__ →
    _shap_model (which imports shap), so they require the pro extra in
    practice. Core import aaanalysis is unaffected (ShapModel is try/except-stubbed
    and these two symbols are not wired top-level). Tests use
    pytest.importorskip("shap").
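
For readers unfamiliar with the try/except stubbing mentioned in the last bullet, a generic sketch of the pattern follows. How aaanalysis implements its own stub is not shown in this PR; make_stub and the module name hypothetical_missing_backend are invented for illustration.

```python
def make_stub(symbol: str, extra: str):
    """Return a placeholder that raises a helpful error when called."""
    def _stub(*args, **kwargs):
        raise ImportError(
            f"'{symbol}' requires the optional '{extra}' extra to be installed."
        )
    return _stub


try:
    # Hypothetical module that is not installed, standing in for a backend
    # that imports an optional dependency such as shap.
    from hypothetical_missing_backend import ShapModel
except ImportError:
    # The core package still imports; the error surfaces only on use.
    ShapModel = make_stub("ShapModel", "pro")
```

The point of the pattern is that a missing optional dependency fails at call time with an actionable message rather than at package import.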

TODO (not in this PR)

  • Wire ShapModelPlot / shap_to_feat_imp into top-level aaanalysis.__init__
    with pro-gating (CONFIRM-FIRST; __init__.py / __all__ are confirm-first
    surfaces) — kept as a TODO in explainable_ai_pro/__init__.py.
  • Add examples/*.rst for the new methods and switch the inline doctests to
    .. include::.

Iterative review log

  • Round 2 (correctness & edge cases): a sample with a constant/zero-variance SHAP vector
    produced an opaque "ValueError: The condensed distance matrix must contain only finite
    values" deep inside seaborn; now comp_shap_correlation detects the undefined
    correlation and raises a clear message naming the offending sample(s) (covers both
    clustermap and get_clusters). shap_to_feat_imp on an all-zero vector silently
    returned nan; now raises a clear "undefined" ValueError. Added a regression test
    proving get_clusters uses the exact same linkage the clustermap dendrogram draws
    (KPI: linkage + cluster assignment match), plus negative tests for both new guards
    (70 tests, all green). Verified out-of-range int sample= positions are already
    rejected upstream by get_seq_kws (friendly ValueError, not IndexError).

  • Round 3 (plot quality): rendered the clustermap to PNG under both default Matplotlib
    rcParams and aa.plot_settings(). Found the class legend overflowed the figure's
    right edge by ~35px under default rcParams (title clipped to "Cl"), so it was cut on a
    plain fig.savefig without bbox_inches="tight". Fixed by reserving a right margin
    (grid.gs.update(right=0.80)) so the legend fits inside the canvas — verified it now
    fits under both default and plot_settings() (font 18). Colorbar (horizontal, left-
    aligned "Correlation (r)" label, -1/0/1 ticks), dendrograms, block structure and fonts
    (plot_gco) all inspected clean. Added a regression test asserting the legend stays
    within the figure bounds under default rcParams.

  • Round 4 (efficiency & simplicity): removed dead n_features unpacking in three spots
    (check_match_shap_values_labels, clustermap, get_clusters — only n_samples is
    used); replaced list(df_parts.index)[int(sample)] (materialized the whole index to
    read one label) with a direct df_parts.index[int(sample)]; simplified the
    duplicate-name detection from an O(n^2) list(names).count(...) to a single
    collections.Counter pass. Output byte-identical (all 71 tests green, including the
    sample= golden PNG-equality tests).

  • Round 5 (guides, docs & tests): the auto-discovering pro-contract meta-test
    (test_docstring_contracts.py::test_pro_marker_in_summary) was failing for the new
    shap_to_feat_imp — its one-line summary lacked the required [pro] / aaanalysis[pro]
    install marker (Round 1 never ran this test). Fixed the summary. Re-verified the whole
    area: docstring checker + doc/signature-drift clean; no ADR refs / print() in the new
    .py; all raises are bare ValueError; every public param of clustermap /
    get_clusters / shap_to_feat_imp and the CPPPlot sample= / df_seq / df_parts
    shortcut has a Validate check and a positive+negative test; plot_gco fonts. Full local
    gate green: shap_model_plot + full cpp_plot suite + param-coverage + import-hygiene +
    return-contract + docstring-contracts + utils-barrel + agentic-docs (492 passed). The two
    internal symbols remain out of the top-level __all__ (TODO kept, CONFIRM-FIRST).
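
The Round 2 zero-variance guard can be sketched in plain Python. This is an illustrative stand-in for comp_shap_correlation, not its actual code: the function names and the dict-of-vectors input are assumptions.

```python
import math
from typing import Dict, List, Sequence


def find_constant_samples(shap_by_sample: Dict[str, Sequence[float]]) -> List[str]:
    """Names of samples whose SHAP vector has zero variance."""
    return [name for name, vec in shap_by_sample.items() if max(vec) == min(vec)]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def correlation_matrix(shap_by_sample: Dict[str, Sequence[float]]) -> List[List[float]]:
    """Sample-by-sample correlation matrix with an explicit guard."""
    constant = find_constant_samples(shap_by_sample)
    if constant:
        # Correlation with a constant vector divides by zero; name the
        # offending samples instead of letting nan reach the clustering step.
        raise ValueError(
            f"Pearson correlation is undefined for constant SHAP vectors: {constant}"
        )
    names = list(shap_by_sample)
    return [[pearson(shap_by_sample[a], shap_by_sample[b]) for b in names]
            for a in names]


matrix = correlation_matrix({"s1": [1.0, 2.0, 3.0], "s2": [3.0, 2.0, 1.0]})
```

Checking before computing keeps the error message close to the cause, which is the stated motivation for the guard.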

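The Round 4 change to duplicate-name detection amounts to the following sketch; find_duplicates is a hypothetical name, and the example input is made up.

```python
from collections import Counter
from typing import List, Sequence


def find_duplicates(names: Sequence[str]) -> List[str]:
    """Names that occur more than once, found in a single counting pass."""
    counts = Counter(names)  # one O(n) pass instead of names.count(...) per name
    return [name for name, n in counts.items() if n > 1]


duplicates = find_duplicates(["a", "b", "a", "c", "b"])
```

Counter preserves first-seen order, so the result lists duplicates in the order they first appear.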
🤖 Generated with Claude Code

breimanntools and others added 5 commits July 1, 2026 05:02
…CPPPlot sample= shortcut

Ports the explanation-similarity clustermap from the original gamma-secretase
project into a library-grade pro API, adds the shap_to_feat_imp normalization
helper, and lets CPPPlot.ranking/profile/feature_map resolve a sample by name.

- ShapModelPlot.clustermap: correlation-of-SHAP-vectors clustermap with row/col
  class-color sidebars, a class legend, a labelled horizontal colorbar, and font
  via plot_gco; returns the seaborn ClusterGrid.
- ShapModelPlot.get_clusters: deterministic dendrogram cut (n_clusters /
  color_threshold), replacing the original dendrogram-color parsing.
- shap_to_feat_imp: signed impact (reusing the ShapModel backend) / absolute
  importance, both normalized to sum(|.|)=100.
- CPPPlot sample=: resolves col_imp=feat_impact_<entry> (+ TMD-JMD parts from
  df_parts for profile/feature_map) and sets shap_plot=True; default output
  unchanged when sample is None.

ShapModelPlot / shap_to_feat_imp stay unwired at the top level (TODO #305,
CONFIRM-FIRST). pro-gated; tests skip cleanly when shap is absent.

Part of #305 / prototype for #313.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-consistency test

- comp_shap_correlation: raise a clear ValueError naming samples with a constant
  (zero-variance) SHAP vector instead of an opaque scipy non-finite-distance error
  (covers clustermap + get_clusters).
- shap_to_feat_imp: raise on an all-zero vector instead of silently returning nan.
- Add a regression test proving get_clusters uses the same linkage the clustermap
  dendrogram draws, plus negative tests for both new guards.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under default Matplotlib rcParams the class legend overflowed the figure's right
edge (~35px) and was clipped on a plain savefig without bbox_inches='tight'.
Reserve a right margin via grid.gs.update(right=0.80) so the legend fits inside
the canvas; verified under both default rcParams and plot_settings(). Add a
regression test asserting the legend stays within the figure bounds.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Counter dedup

- Remove unused n_features unpacking (3 spots; only n_samples is used).
- df_parts.index[int(sample)] instead of list(df_parts.index)[int(sample)].
- Duplicate-name detection via collections.Counter (single pass) instead of
  an O(n^2) list(names).count(...) scan.
Output byte-identical; all tests green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The auto-discovering pro-contract meta-test requires every public *_pro symbol's
one-line summary to carry the [pro] / aaanalysis[pro] install marker; shap_to_feat_imp
lacked it and was failing test_pro_marker_in_summary. Add the marker.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codecov Bot commented Jul 1, 2026

Codecov Report

❌ Patch coverage is 98.63014% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 94.95%. Comparing base (f1cfa72) to head (f4a21ba).

Files with missing lines                       Patch %   Lines
aaanalysis/feature_engineering/_cpp_plot.py    93.75%    0 Missing and 2 partials ⚠️
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #323      +/-   ##
==========================================
+ Coverage   94.93%   94.95%   +0.01%     
==========================================
  Files         185      187       +2     
  Lines       17862    18008     +146     
  Branches     3032     3054      +22     
==========================================
+ Hits        16957    17099     +142     
- Misses        598      599       +1     
- Partials      307      310       +3     
Files with missing lines                                 Coverage Δ
aaanalysis/explainable_ai_pro/__init__.py                100.00% <100.00%> (ø)
.../explainable_ai_pro/_backend/shap_model/sm_plot.py    100.00% <100.00%> (ø)
aaanalysis/explainable_ai_pro/_shap_model_plot.py        100.00% <100.00%> (ø)
aaanalysis/feature_engineering/_cpp_plot.py              97.96% <93.75%> (-0.38%) ⬇️

... and 1 file with indirect coverage changes

Components    Coverage Δ
cpp_core      94.95% <ø> (ø)

@breimanntools breimanntools left a comment

The clustermap is not only for SHAP values but also for features or other numerical representations. Perhaps we need a plot_clustermap utility instead of assigning it to ShapModel. Should we make a general plotting class called AAPlot (aap), to which the prediction plots could be assigned as well? Or AAPredPlot? I do not know right now.
