feat: feature_matrix accepts df_seq + list_parts (prototype #306) by breimanntools · Pull Request #316 · breimanntools/aaanalysis

breimanntools · 2026-06-30T23:29:49Z

Morning status — SOLID. Works end to end with green local tests and the full ripple done. df_seq= / list_parts= added to SequenceFeature.feature_matrix; the legacy df_parts= path is byte-identical (golden + dataset-fingerprint regression tests). Review focus: the Validate-block wording and the choice to reject batch=True with df_seq (single-DataFrame input). One open API question below.

Part of #305 / prototype for #306.

What changed

SequenceFeature.feature_matrix(features, df_parts=None, ..., df_seq=None, list_parts=None):
- Exactly one of df_parts / df_seq must be supplied → else bare ValueError.
- When df_seq is given, df_parts is built internally via get_df_parts(df_seq, list_parts).
- When df_parts is given the code path is unchanged (byte-identical).
- batch=True together with df_seq raises ValueError; list_parts without df_seq raises ValueError.
- df_parts default flipped to None (additive; existing positional/keyword calls unaffected).
numpydoc: df_seq / list_parts params (df_seq uses the canonical baseline phrasing), a versionchanged note, df_parts marked optional.
Example notebook examples/feature_engineering/sf_feature_matrix.ipynb: new df_seq cell, re-executed with embedded outputs (nbmake-green).
Unit tests (test_sf_feature_matrix.py): TestFeatureMatrixDfSeq (positive + negative per new param) and TestFeatureMatrixRegression (byte-identical legacy df_parts path: hand-computed 1.5 golden + dataset fingerprint).
Release notes: Unreleased entry under Feature Engineering.

Local gates (all green, run against the worktree)

test_sf_feature_matrix.py — 29 passed (was 18 before the new tests).
tests/unit/sequence_feature_tests/ + test_param_coverage.py — 444 passed.
cpp feature-matrix parity + aap.predict_samples pipe — 70 passed.
nbmake on the example notebook — 1 passed.
docstring structural checker — 0 defects.

KPIs (issue #306)

Existing df_parts= output byte-identical (regression test on canonical fixture).
feature_matrix(df_seq=..., list_parts=..., ...) equals the two-step result exactly.
≥1 negative test (both / neither / batch+df_seq / list_parts w/o df_seq → ValueError).

Open API question

df_seq / list_parts are appended after batch to preserve positional compatibility, so they read out of logical order in the signature. Acceptable, or should they sit next to df_parts (a positional-compat break for the rare positional batch/n_jobs caller)?
Should list_parts passed alongside df_parts be a hard ValueError (current) or a silently-ignored no-op? Chose ValueError for clarity.

🤖 Generated with Claude Code

Critical self-review (reviewer pass)

Adversarial review by a second agent. Findings and fixes:

Defect found + fixed — property tests silently reparented. The new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes were inserted between TestFeatureMatrixGoldenValues and its two trailing methods (test_property_shape_rows_cols, test_property_parallel_equals_serial), so those two pre-existing property tests were moved under the regression class — where they don't belong semantically ("byte-identical drift lock" vs. shape/parallel invariants). Restored them to TestFeatureMatrixGoldenValues and left the two new classes cleanly at the end of the file. No test lost; 29 pass, and collection now shows the property tests back under GoldenValues.

Verified (no change needed):

Byte-identical df_parts= path is genuinely proven, not a tautology. Recomputed the regression fingerprint on the independent master install: nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 — matches the pinned constants exactly. The anchor would fail if the legacy path drifted, and it agrees with master, so byte-identity is real.
Mutual exclusion is complete and correct. (df_parts is None) == (df_seq is None) rejects both-given and neither-given; batch=True + df_seq, list_parts without df_seq, and non-DataFrame df_seq (via get_df_parts→check_df_seq) all raise clear ValueErrors. Confirmed each with concrete inputs.
Internal df_parts construction reuses get_df_parts (single call, no duplicated slicing logic; downstream re-validation is cheap and harmless).
Docstring — param order matches the signature, two distinct versionchanged:: 1.1.0 notes, Examples include resolves to an existing file. check_docstrings = 0 defects, doc_signature_drift = 0 candidates.
Efficiency — no needless recompute/copy; the df_seq path adds exactly one get_df_parts call.

Residual concerns (not blocking, not fixed):

test_df_parts_path_byte_identical_golden duplicates the existing test_golden_segment_whole_mean (both assert AC Segment(1,1) → 1.5). Intentional redundancy as an explicit regression lock; left as-is.
The two open API questions in the original body (param position after batch; hard-error vs. no-op for list_parts with df_parts) remain design calls for the maintainer.

Iterative review log

Round 2 (correctness & edge cases): found + fixed a misleading error path — a non-bool truthy batch (e.g. batch=1) with df_seq raised the "batch=True not supported with df_seq" message instead of the type error, and behaved inconsistently vs. the df_parts path. Hoisted ut.check_bool(name="batch") above the mutual-exclusion resolution (removed the now-duplicate later check) so batch is type-validated first; added test_invalid_batch_type_with_df_seq. 30 tests pass.
Round 3 (numerical robustness & API/param-name consistency with get_df_parts): re-proved the legacy df_parts= fingerprint is non-tautological by recomputing it on the independent master install (no df_seq param) — nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 match the pinned constants exactly, so the anchor would fail on any drift. Confirmed mutual-exclusion is complete (both/neither/df_seq+batch/list_parts-without-df_seq) and part-building is a single get_df_parts delegation (no duplicated slicing). df_seq/list_parts names + phrasing mirror the sibling get_df_parts. Fixed one docstring-accuracy gap: documented that the shortcut only forwards list_parts while other get_df_parts options (jmd_n_len/jmd_c_len/tmd_len) use defaults, pointing to the two-step path for full control. 30 tests pass; check_docstrings 0 defects.
Round 4 (efficiency & simplicity): converged — no new defects. The df_seq path adds exactly one get_df_parts delegation (no loops, no needless copies, no recompute); the only redundancy (a duplicate check_bool(batch)) was already removed in round 2. The shared check_df_parts re-validation of the built df_parts is the common path for both input sources and is cheap — branching to skip it would add complexity, so it stays.
Round 5 (guides, docs & tests completeness): converged — no new defects. numpydoc has a one-line summary, named Returns, and an Examples include that resolves; no ADR refs / no print / ut.check_* in the diff. Frontend validates every public param (batch type-checked, df_seq/list_parts via get_df_parts delegation + mutual-exclusion), each new/changed param has positive+negative tests. Full area gate green: sequence_feature_tests 440 passed, param_coverage 5, import-hygiene 2, docstring-contracts 13, doc-signature-drift 0, example nbmake 1 passed.

Add df_seq= and list_parts= to SequenceFeature.feature_matrix so the get_df_parts -> feature_matrix two-step collapses into one call. When df_seq is given, df_parts is built internally via get_df_parts; when df_parts is given the path is unchanged (byte-identical). Exactly one of df_parts / df_seq must be supplied (bare ValueError in the Validate block); batch=True and list_parts-without-df_seq are rejected. Ripple: numpydoc (df_seq/list_parts params + versionchanged + canonical df_seq baseline), example notebook df_seq cell (re-executed with outputs), unit tests (positive+negative per new param + byte-identical regression on the legacy df_parts path), release-notes Unreleased entry. Part of #305. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ted) The df_seq PR inserted the new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes between TestFeatureMatrixGoldenValues and its two trailing property tests (test_property_shape_rows_cols / test_property_parallel_equals_serial), silently moving them under the regression class. Restore them to GoldenValues where they belong; the new classes now sit cleanly at the end of the file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Hoist ut.check_bool(name='batch') above the mutual-exclusion block so a non-bool truthy batch (e.g. batch=1) with df_seq raises the type error, consistent with the df_parts path, instead of the df_seq-incompat message. Remove the now-duplicate later check; add test_invalid_batch_type_with_df_seq. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Clarify that the df_seq path uses get_df_parts defaults for its other options (jmd_n_len/jmd_c_len/tmd_len) and point to the two-step get_df_parts path for full control. API-consistency/docstring-accuracy only; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

breimanntools and others added 4 commits July 1, 2026 01:29

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316
breimanntools wants to merge 4 commits into
masterfrom
feat/feature-matrix-df-seq

breimanntools commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

breimanntools commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changed

Local gates (all green, run against the worktree)

KPIs (issue #306)

Open API question

Critical self-review (reviewer pass)

Iterative review log

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

breimanntools commented Jun 30, 2026 •

edited

Loading