feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307) by breimanntools · Pull Request #319 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:32:07Z

Status: SOLID — works end-to-end, green local tests, full ripple done.

SequenceFeature.scale_mean(df_seq, df_scales=None, list_parts=None, return_df=False)
adds the no-positional-split scale-based baseline featurizer: per sequence, each
scale is averaged over the residues of a span, giving a (n_seq, n_scales) matrix —
the counterpart to feature_matrix for the standard scale-vs-CPP comparison.

What to review

API shape on SequenceFeature (method on the existing class, no __init__/__all__
change) and the list_parts=None → whole TMD-JMD span (jmd_n+tmd+jmd_c, the
tmd_jmd part) default.
Missing/non-canonical handling: a residue is averaged only if it is in
df_scales.index; gaps (-) and other non-canonical symbols (e.g. X) are dropped.
An all-non-canonical / empty span yields an all-NaN row plus a verbose-gated
UserWarning. (Matches the gamma-secretase notebook cell 27 scale_X comprehension
verbatim on DOM_GSEC.)
Backend split: compute lives in get_scale_mean_ (_backend/cpp/sequence_feature.py);
frontend keeps the Validate block.

Ripple done

numpydoc with named Returns + Examples include; executed example notebook
examples/feature_engineering/sf_scale_mean.ipynb (aa.display_df).
18 unit tests (positive+negative per param; golden test vs the manual comprehension;
empty/all-non-canonical edge case; frequency-weighting + non-canonical-drop goldens).
Unreleased release-notes entry under Feature Engineering.
Local gates green: new tests (18), full sequence_feature_tests (451),
test_param_coverage, test_backend_import_hygiene, test_utils_barrel;
check_docstrings = 0 defects.

Open API question

df_scales is kept Optional=None (loads the bundled default, mirroring
feature_matrix) rather than required-positional as sketched in the issue. Easy to flip
to required if you prefer the issue's exact signature.

Part of #305 / prototype for #307.

Critical self-review (review-and-improve pass)

Adversarial re-read of the whole diff against .claude/rules/*. Focus was efficiency, edge-case correctness, and guide compliance.

Defect found + fixed — efficiency (headline). The backend get_scale_mean_ used a per-sequence Python comprehension with df_scales.loc[kept].mean(axis=0) per row — exactly the slow path the reference notebook used, with a pandas .loc build + .mean for every sequence. Rewrote it fully vectorized:

flatten all residues of all spans into one uint8 byte array (str.encode("latin-1","replace"), so any stray >255 codepoint is treated as non-canonical and dropped, keeping per-residue alignment),
map bytes → scale-row via a 256-entry lookup table (built only from single-character index labels, so multi-char labels can never spuriously match a residue — matches the notebook's a in df_scales.index character test),
tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and get per-sequence sums with one BLAS matmul counts @ scales_arr.

Measured on DOM_GSEC-scale inputs (n=5040): ~15x faster and, crucially, flat across scale width (5 / 50 / 586 scales all ~11–16x) — an intermediate np.add.at scatter version I tried first was only ~1.2x at the default 586 scales because the wide unbuffered scatter dominated, so the matmul formulation is what actually fixes the wide case. Output matches the notebook cell-27 comprehension within float tolerance (max abs diff ~5.5e-16); the golden test (np.allclose, equal_nan=True) stays green.

Edge cases verified (already covered by the tests, re-checked against the new code path). Empty span and all-non-canonical span → 0/0 under np.errstate(invalid="ignore") → all-NaN row, n_kept==0, verbose-gated UserWarning; residues absent from df_scales.index (gaps -, X, or a canonical letter missing from a scales subset) are dropped; frequency weighting preserved (repeated residues counted by frequency, since bincount tallies multiplicity). No divide-by-zero warning leaks (errstate-guarded).

Guide compliance re-checked. Frontend Validate block covers every public param (ut.check_df_seq, check_df_scales, ut.check_bool for return_df, ut.check_list_parts for list_parts); backend trusts the frontend (no re-validation). Heavy logic in _backend, frontend thin. No print(), no ADR refs, bare ValueErrors via the ut.check_* family, imports only through aaanalysis.utils as ut. return_df=True returns a frame indexed like df_parts with scale-name columns (parity-tested vs the array form). numpydoc has a named Returns and the Examples include resolves (the executed sf_scale_mean.ipynb exists).

Residual / not changed (intentional).

scale_mean validates df_seq/list_parts and then calls self.get_df_parts(...), which re-validates both — a minor double-validation. Left as-is: it reuses the public part-slicing method (the documented way to build a span) rather than reaching into the backend get_df_parts_ with manual jmd_*_len handling; the redundant checks are cheap and match how sibling methods layer.
The open API question from the original body stands (df_scales kept Optional=None to mirror feature_matrix, vs the issue's required-positional sketch) — a design call for the maintainer, not a defect.

Iterative review log

Round 2 (correctness & edge cases): adversarially probed the vectorized LUT path — multi-char scale-index labels, lowercase residues, codepoints >255 (latin-1 'replace'), empty/all-NaN spans, 0/0 guard, return_df index alignment. All behave correctly and match the notebook comprehension; validators (check_df_scales unique/numeric/no-NaN index) rule out duplicate-index divergence. Added 2 regression tests locking the single-char-LUT invariant (multi-char label never matches; non-latin-1 residue dropped, no crash). 20 tests green.
Round 3 (numerical robustness & API/param-name consistency): measured the vectorized output vs the notebook comprehension on full DOM_GSEC — max abs diff 5.6e-16 (~half the values bit-exact, all within 1e-12), so the count-matrix + matmul path is numerically equivalent to per-row .mean. Confirmed 0/0 → NaN guard is the only division edge (empty rows have zero sums, never inf). API cross-checked against siblings: df_scales Optional default, list_parts/return_df naming, and the self.verbose-gated warnings.warn all match feature_matrix/get_df_parts; get_df_parts defaults (no gap-removal / no non-canonical replacement) leave the LUT as the sole residue filter. Converged — no new defect. Residual (design, not defect): scale_mean doesn't expose jmd_n_len/jmd_c_len (uses the 10/10 get_df_parts defaults), matching the issue's 3-arg signature.
Round 4 (efficiency & simplicity): removed the redundant df_parts.astype(str) full-frame copy in the backend — get_df_parts already guarantees str columns, so the coercion only allocated a copy and duplicated a frontend guarantee (backend-trusts-frontend). Output byte-identical (join verified equal with/without astype across part sets). Rest of the vectorized kernel (single LUT build, one bincount, one matmul) already minimal; no other dead code / needless copy / over-nesting found.
Round 5 (guides, docs & tests completeness): docstring checker = 0 defects for scale_mean (named Returns, one-line summary, Examples include target examples/sf_scale_mean.rst matches the sibling feature_matrix pattern; the only class advisories are pre-existing class-wide RAISES-UNDOCUMENTED, and scale_mean is not among them). Verified: Validate block covers all four public params (check_df_seq/check_df_scales/check_bool/check_list_parts), backend carries no validation, no print/ADR refs, positive+negative test per param, release-notes entry present, example notebook uses display_df with executed outputs and no bare prints. Full area gate green: 458 sequence_feature_tests + test_param_coverage + test_backend_import_hygiene + test_utils_barrel. Converged — no new defect.

…#307) Add a first-class no-positional-split scale-average featurizer: SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False) returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span. list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing / non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped before averaging; an all-non-canonical span yields an all-NaN row (verbose warn). Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float tolerance. Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with Examples include, an executed example notebook, 18 unit tests (positive+negative per param, golden vs the manual comprehension, empty/all-non-canonical edge case), and an Unreleased release-notes entry. No __init__ change (method on existing class). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…atmul Replace the per-sequence Python comprehension (DataFrame.loc + mean per row) with a fully vectorized backend: flatten all residues into one byte array, map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and obtain per-sequence sums via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale inputs across scale widths; output matches the notebook comprehension within float tolerance (max abs diff ~5e-16, golden test green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…+ non-latin1 residue) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… trusts frontend) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

breimanntools and others added 4 commits July 1, 2026 01:31

round2(scale_mean): lock single-char-LUT invariant (multi-char label …

06160a0

…+ non-latin1 residue) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

round4(scale_mean): drop redundant df_parts.astype(str) copy (backend…

69bca17

… trusts frontend) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319

feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319
breimanntools wants to merge 4 commits into
masterfrom
feat/scale-mean-baseline

breimanntools commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

breimanntools commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Critical self-review (review-and-improve pass)

Iterative review log

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

breimanntools commented Jun 30, 2026 •

edited

Loading