feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319
Draft
breimanntools wants to merge 4 commits into
Draft
feat: SequenceFeature.scale_mean — scale-average baseline (prototype #307)#319breimanntools wants to merge 4 commits into
breimanntools wants to merge 4 commits into
Conversation
…#307) Add a first-class no-positional-split scale-average featurizer: SequenceFeature.scale_mean(df_seq, df_scales, list_parts=None, return_df=False) returns a (n_seq, n_scales) matrix by averaging each scale over a sequence span. list_parts=None uses the whole TMD-JMD span (jmd_n + tmd + jmd_c). Missing / non-canonical residues (gaps, 'X', anything not in df_scales.index) are dropped before averaging; an all-non-canonical span yields an all-NaN row (verbose warn). Matches the gamma-secretase notebook cell 27 `scale_X` comprehension within float tolerance. Backend: get_scale_mean_ in _backend/cpp/sequence_feature.py. Adds numpydoc with Examples include, an executed example notebook, 18 unit tests (positive+negative per param, golden vs the manual comprehension, empty/all-non-canonical edge case), and an Unreleased release-notes entry. No __init__ change (method on existing class). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atmul Replace the per-sequence Python comprehension (DataFrame.loc + mean per row) with a fully vectorized backend: flatten all residues into one byte array, map to scale rows through a 256-entry lookup, tally a small (n_seq, n_letters) residue-count matrix with a single np.bincount, and obtain per-sequence sums via one BLAS matmul against the scale matrix. ~15x faster on DOM_GSEC-scale inputs across scale widths; output matches the notebook comprehension within float tolerance (max abs diff ~5e-16, golden test green). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+ non-latin1 residue) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… trusts frontend) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Status: SOLID — works end-to-end, green local tests, full ripple done.
SequenceFeature.scale_mean(df_seq, df_scales=None, list_parts=None, return_df=False)adds the no-positional-split scale-based baseline featurizer: per sequence, each
scale is averaged over the residues of a span, giving a
(n_seq, n_scales)matrix —the counterpart to
feature_matrixfor the standard scale-vs-CPP comparison.What to review
SequenceFeature(method on the existing class, no__init__/__all__change) and the
list_parts=None→ whole TMD-JMD span (jmd_n+tmd+jmd_c, thetmd_jmdpart) default.df_scales.index; gaps (-) and other non-canonical symbols (e.g.X) are dropped.An all-non-canonical / empty span yields an all-NaN row plus a verbose-gated
UserWarning. (Matches the gamma-secretase notebook cell 27scale_Xcomprehensionverbatim on
DOM_GSEC.)get_scale_mean_(_backend/cpp/sequence_feature.py);frontend keeps the Validate block.
Ripple done
Returns+Examplesinclude; executed example notebookexamples/feature_engineering/sf_scale_mean.ipynb(aa.display_df).empty/all-non-canonical edge case; frequency-weighting + non-canonical-drop goldens).
sequence_feature_tests(451),test_param_coverage,test_backend_import_hygiene,test_utils_barrel;check_docstrings= 0 defects.Open API question
df_scalesis keptOptional=None(loads the bundled default, mirroringfeature_matrix) rather than required-positional as sketched in the issue. Easy to flipto required if you prefer the issue's exact signature.
Part of #305 / prototype for #307.
Critical self-review (review-and-improve pass)
Adversarial re-read of the whole diff against
.claude/rules/*. Focus was efficiency, edge-case correctness, and guide compliance.Defect found + fixed — efficiency (headline). The backend
get_scale_mean_used a per-sequence Python comprehension withdf_scales.loc[kept].mean(axis=0)per row — exactly the slow path the reference notebook used, with a pandas.locbuild +.meanfor every sequence. Rewrote it fully vectorized:uint8byte array (str.encode("latin-1","replace"), so any stray >255 codepoint is treated as non-canonical and dropped, keeping per-residue alignment),a in df_scales.indexcharacter test),(n_seq, n_letters)residue-count matrix with a singlenp.bincount, and get per-sequence sums with one BLAS matmulcounts @ scales_arr.Measured on
DOM_GSEC-scale inputs (n=5040): ~15x faster and, crucially, flat across scale width (5 / 50 / 586 scales all ~11–16x) — an intermediatenp.add.atscatter version I tried first was only ~1.2x at the default 586 scales because the wide unbuffered scatter dominated, so the matmul formulation is what actually fixes the wide case. Output matches the notebook cell-27 comprehension within float tolerance (max abs diff ~5.5e-16); the golden test (np.allclose,equal_nan=True) stays green.Edge cases verified (already covered by the tests, re-checked against the new code path). Empty span and all-non-canonical span →
0/0undernp.errstate(invalid="ignore")→ all-NaNrow,n_kept==0, verbose-gatedUserWarning; residues absent fromdf_scales.index(gaps-,X, or a canonical letter missing from a scales subset) are dropped; frequency weighting preserved (repeated residues counted by frequency, sincebincounttallies multiplicity). No divide-by-zero warning leaks (errstate-guarded).Guide compliance re-checked. Frontend Validate block covers every public param (
ut.check_df_seq,check_df_scales,ut.check_boolforreturn_df,ut.check_list_partsforlist_parts); backend trusts the frontend (no re-validation). Heavy logic in_backend, frontend thin. Noprint(), no ADR refs, bareValueErrors via theut.check_*family, imports only throughaaanalysis.utils as ut.return_df=Truereturns a frame indexed likedf_partswith scale-name columns (parity-tested vs the array form). numpydoc has a namedReturnsand theExamplesinclude resolves (the executedsf_scale_mean.ipynbexists).Residual / not changed (intentional).
scale_meanvalidatesdf_seq/list_partsand then callsself.get_df_parts(...), which re-validates both — a minor double-validation. Left as-is: it reuses the public part-slicing method (the documented way to build a span) rather than reaching into the backendget_df_parts_with manualjmd_*_lenhandling; the redundant checks are cheap and match how sibling methods layer.df_scaleskeptOptional=Noneto mirrorfeature_matrix, vs the issue's required-positional sketch) — a design call for the maintainer, not a defect.Iterative review log
check_df_scalesunique/numeric/no-NaN index) rule out duplicate-index divergence. Added 2 regression tests locking the single-char-LUT invariant (multi-char label never matches; non-latin-1 residue dropped, no crash). 20 tests green.DOM_GSEC— max abs diff 5.6e-16 (~half the values bit-exact, all within 1e-12), so the count-matrix + matmul path is numerically equivalent to per-row.mean. Confirmed 0/0 → NaN guard is the only division edge (empty rows have zero sums, never inf). API cross-checked against siblings:df_scalesOptional default,list_parts/return_dfnaming, and theself.verbose-gatedwarnings.warnall matchfeature_matrix/get_df_parts;get_df_partsdefaults (no gap-removal / no non-canonical replacement) leave the LUT as the sole residue filter. Converged — no new defect. Residual (design, not defect):scale_meandoesn't exposejmd_n_len/jmd_c_len(uses the 10/10get_df_partsdefaults), matching the issue's 3-arg signature.df_parts.astype(str)full-frame copy in the backend —get_df_partsalready guarantees str columns, so the coercion only allocated a copy and duplicated a frontend guarantee (backend-trusts-frontend). Output byte-identical (join verified equal with/without astype across part sets). Rest of the vectorized kernel (single LUT build, one bincount, one matmul) already minimal; no other dead code / needless copy / over-nesting found.scale_mean(namedReturns, one-line summary,Examplesinclude targetexamples/sf_scale_mean.rstmatches the siblingfeature_matrixpattern; the only class advisories are pre-existing class-wide RAISES-UNDOCUMENTED, andscale_meanis not among them). Verified: Validate block covers all four public params (check_df_seq/check_df_scales/check_bool/check_list_parts), backend carries no validation, noprint/ADR refs, positive+negative test per param, release-notes entry present, example notebook usesdisplay_dfwith executed outputs and no bare prints. Full area gate green: 458sequence_feature_tests+test_param_coverage+test_backend_import_hygiene+test_utils_barrel. Converged — no new defect.