Skip to content

feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319

Draft
breimanntools wants to merge 4 commits into
masterfrom
feat/scale-mean-baseline
Draft

feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319
breimanntools wants to merge 4 commits into
masterfrom
feat/scale-mean-baseline

Conversation

@breimanntools

@breimanntools breimanntools commented Jun 30, 2026

Copy link
Copy Markdown
Owner

Status: SOLID — works end-to-end, green local tests, full ripple done.

SequenceFeature.scale_mean(df_seq, df_scales=None, list_parts=None, return_df=False)
adds the no-positional-split scale-based baseline featurizer: per sequence, each
scale is averaged over the residues of a span, giving a (n_seq, n_scales) matrix —
the counterpart to feature_matrix for the standard scale-vs-CPP comparison.

What to review

  • API shape on SequenceFeature (method on the existing class, no __init__/__all__
    change) and the list_parts=None → whole TMD-JMD span (jmd_n+tmd+jmd_c, the
    tmd_jmd part) default.
  • Missing/non-canonical handling: a residue is averaged only if it is in
    df_scales.index; gaps (-) and other non-canonical symbols (e.g. X) are dropped.
    An all-non-canonical / empty span yields an all-NaN row plus a verbose-gated
    UserWarning. (Matches the gamma-secretase notebook cell 27 scale_X comprehension
    verbatim on DOM_GSEC.)
  • Backend split: compute lives in get_scale_mean_ (_backend/cpp/sequence_feature.py);
    frontend keeps the Validate block.

Ripple done

  • numpydoc with named Returns + Examples include; executed example notebook
    examples/feature_engineering/sf_scale_mean.ipynb (aa.display_df).
  • 18 unit tests (positive+negative per param; golden test vs the manual comprehension;
    empty/all-non-canonical edge case; frequency-weighting + non-canonical-drop goldens).
  • Unreleased release-notes entry under Feature Engineering.
  • Local gates green: new tests (18), full sequence_feature_tests (451),
    test_param_coverage, test_backend_import_hygiene, test_utils_barrel;
    check_docstrings = 0 defects.

Open API question

  • df_scales is kept Optional=None (loads the bundled default, mirroring
    feature_matrix) rather than required-positional as sketched in the issue. Easy to flip
    to required if you prefer the issue's exact signature.

Part of #305 / prototype for #307.


Critical self-review (review-and-improve pass)

Adversarial re-read of the whole diff against .claude/rules/*. Focus was efficiency, edge-case correctness, and guide compliance.

Defect found + fixed — efficiency (headline). The backend get_scale_mean_ used a per-sequence Python comprehension with df_scales.loc[kept].mean(axis=0) per row — exactly the slow path the reference notebook used, with a pandas .loc build + .mean for every sequence. Rewrote it fully vectorized:

  • flatten all residues of all spans into one uint8 byte array (str.encode("latin-1","replace"), so any stray >255 codepoint is treated as non-canonical and dropped, keeping per-residue alignment),
  • map bytes → scale-row via a 256-entry lookup table (built only from single-character index labels, so multi-char labels can never spuriously match a residue — matches the notebook's a in df_scales.index character test),
  • tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and get per-sequence sums with one BLAS matmul counts @ scales_arr.

Measured on DOM_GSEC-scale inputs (n=5040): ~15x faster and, crucially, flat across scale width (5 / 50 / 586 scales all ~11–16x) — an intermediate np.add.at scatter version I tried first was only ~1.2x at the default 586 scales because the wide unbuffered scatter dominated, so the matmul formulation is what actually fixes the wide case. Output matches the notebook cell-27 comprehension within float tolerance (max abs diff ~5.5e-16); the golden test (np.allclose, equal_nan=True) stays green.

Edge cases verified (already covered by the tests, re-checked against the new code path). Empty span and all-non-canonical span → 0/0 under np.errstate(invalid="ignore") → all-NaN row, n_kept==0, verbose-gated UserWarning; residues absent from df_scales.index (gaps -, X, or a canonical letter missing from a scales subset) are dropped; frequency weighting preserved (repeated residues counted by frequency, since bincount tallies multiplicity). No divide-by-zero warning leaks (errstate-guarded).

Guide compliance re-checked. Frontend Validate block covers every public param (ut.check_df_seq, check_df_scales, ut.check_bool for return_df, ut.check_list_parts for list_parts); backend trusts the frontend (no re-validation). Heavy logic in _backend, frontend thin. No print(), no ADR refs, bare ValueErrors via the ut.check_* family, imports only through aaanalysis.utils as ut. return_df=True returns a frame indexed like df_parts with scale-name columns (parity-tested vs the array form). numpydoc has a named Returns and the Examples include resolves (the executed sf_scale_mean.ipynb exists).

Residual / not changed (intentional).

  • scale_mean validates df_seq/list_parts and then calls self.get_df_parts(...), which re-validates both — a minor double-validation. Left as-is: it reuses the public part-slicing method (the documented way to build a span) rather than reaching into the backend get_df_parts_ with manual jmd_*_len handling; the redundant checks are cheap and match how sibling methods layer.
  • The open API question from the original body stands (df_scales kept Optional=None to mirror feature_matrix, vs the issue's required-positional sketch) — a design call for the maintainer, not a defect.

Iterative review log

  • Round 2 (correctness & edge cases): adversarially probed the vectorized LUT path — multi-char scale-index labels, lowercase residues, codepoints >255 (latin-1 'replace'), empty/all-NaN spans, 0/0 guard, return_df index alignment. All behave correctly and match the notebook comprehension; validators (check_df_scales unique/numeric/no-NaN index) rule out duplicate-index divergence. Added 2 regression tests locking the single-char-LUT invariant (multi-char label never matches; non-latin-1 residue dropped, no crash). 20 tests green.
  • Round 3 (numerical robustness & API/param-name consistency): measured the vectorized output vs the notebook comprehension on full DOM_GSEC — max abs diff 5.6e-16 (~half the values bit-exact, all within 1e-12), so the count-matrix + matmul path is numerically equivalent to per-row .mean. Confirmed 0/0 → NaN guard is the only division edge (empty rows have zero sums, never inf). API cross-checked against siblings: df_scales Optional default, list_parts/return_df naming, and the self.verbose-gated warnings.warn all match feature_matrix/get_df_parts; get_df_parts defaults (no gap-removal / no non-canonical replacement) leave the LUT as the sole residue filter. Converged — no new defect. Residual (design, not defect): scale_mean doesn't expose jmd_n_len/jmd_c_len (uses the 10/10 get_df_parts defaults), matching the issue's 3-arg signature.
  • Round 4 (efficiency & simplicity): removed the redundant df_parts.astype(str) full-frame copy in the backend — get_df_parts already guarantees str columns, so the coercion only allocated a copy and duplicated a frontend guarantee (backend-trusts-frontend). Output byte-identical (join verified equal with/without astype across part sets). Rest of the vectorized kernel (single LUT build, one bincount, one matmul) already minimal; no other dead code / needless copy / over-nesting found.
  • Round 5 (guides, docs & tests completeness): docstring checker = 0 defects for scale_mean (named Returns, one-line summary, Examples include target examples/sf_scale_mean.rst matches the sibling feature_matrix pattern; the only class advisories are pre-existing class-wide RAISES-UNDOCUMENTED, and scale_mean is not among them). Verified: Validate block covers all four public params (check_df_seq/check_df_scales/check_bool/check_list_parts), backend carries no validation, no print/ADR refs, positive+negative test per param, release-notes entry present, example notebook uses display_df with executed outputs and no bare prints. Full area gate green: 458 sequence_feature_tests + test_param_coverage + test_backend_import_hygiene + test_utils_barrel. Converged — no new defect.

breimanntools and others added 4 commits July 1, 2026 01:31
…#307)

Add a first-class no-positional-split scale-average featurizer:
SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False)
returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span.
list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing /
non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped
before averaging; an all-non-canonical span yields an all-NaN row (verbose warn).
Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float
tolerance.

Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with
Examples include, an executed example notebook, 18 unit tests (positive+negative
per param, golden vs the manual comprehension, empty/all-non-canonical edge case),
and an Unreleased release-notes entry. No __init__ change (method on existing class).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atmul

Replace the per-sequence Python comprehension (DataFrame.loc + mean per row)
with a fully vectorized backend: flatten all residues into one byte array,
map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters)
residue-count matrix with a single np.bincount, and obtain per-sequence sums
via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale
inputs across scale widths; output matches the notebook comprehension within
float tolerance (max abs diff ~5e-16, golden test green).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+ non-latin1 residue)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… trusts frontend)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant