Skip to content

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316

Draft
breimanntools wants to merge 4 commits into
masterfrom
feat/feature-matrix-df-seq
Draft

feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316
breimanntools wants to merge 4 commits into
masterfrom
feat/feature-matrix-df-seq

Conversation

@breimanntools

@breimanntools breimanntools commented Jun 30, 2026

Copy link
Copy Markdown
Owner

Morning status — SOLID. Works end to end with green local tests and the full ripple done. df_seq= / list_parts= added to SequenceFeature.feature_matrix; the legacy df_parts= path is byte-identical (golden + dataset-fingerprint regression tests). Review focus: the Validate-block wording and the choice to reject batch=True with df_seq (single-DataFrame input). One open API question below.

Part of #305 / prototype for #306.

What changed

  • SequenceFeature.feature_matrix(features, df_parts=None, ..., df_seq=None, list_parts=None):
    • Exactly one of df_parts / df_seq must be supplied → else bare ValueError.
    • When df_seq is given, df_parts is built internally via get_df_parts(df_seq, list_parts).
    • When df_parts is given the code path is unchanged (byte-identical).
    • batch=True together with df_seq raises ValueError; list_parts without df_seq raises ValueError.
    • df_parts default flipped to None (additive; existing positional/keyword calls unaffected).
  • numpydoc: df_seq / list_parts params (df_seq uses the canonical baseline phrasing), a versionchanged note, df_parts marked optional.
  • Example notebook examples/feature_engineering/sf_feature_matrix.ipynb: new df_seq cell, re-executed with embedded outputs (nbmake-green).
  • Unit tests (test_sf_feature_matrix.py): TestFeatureMatrixDfSeq (positive + negative per new param) and TestFeatureMatrixRegression (byte-identical legacy df_parts path: hand-computed 1.5 golden + dataset fingerprint).
  • Release notes: Unreleased entry under Feature Engineering.

Local gates (all green, run against the worktree)

  • test_sf_feature_matrix.py — 29 passed (was 18 before the new tests).
  • tests/unit/sequence_feature_tests/ + test_param_coverage.py — 444 passed.
  • cpp feature-matrix parity + aap.predict_samples pipe — 70 passed.
  • nbmake on the example notebook — 1 passed.
  • docstring structural checker — 0 defects.

KPIs (issue #306)

  • Existing df_parts= output byte-identical (regression test on canonical fixture).
  • feature_matrix(df_seq=..., list_parts=..., ...) equals the two-step result exactly.
  • ≥1 negative test (both / neither / batch+df_seq / list_parts w/o df_seqValueError).

Open API question

  • df_seq / list_parts are appended after batch to preserve positional compatibility, so they read out of logical order in the signature. Acceptable, or should they sit next to df_parts (a positional-compat break for the rare positional batch/n_jobs caller)?
  • Should list_parts passed alongside df_parts be a hard ValueError (current) or a silently-ignored no-op? Chose ValueError for clarity.

🤖 Generated with Claude Code


Critical self-review (reviewer pass)

Adversarial review by a second agent. Findings and fixes:

Defect found + fixed — property tests silently reparented. The new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes were inserted between TestFeatureMatrixGoldenValues and its two trailing methods (test_property_shape_rows_cols, test_property_parallel_equals_serial), so those two pre-existing property tests were moved under the regression class — where they don't belong semantically ("byte-identical drift lock" vs. shape/parallel invariants). Restored them to TestFeatureMatrixGoldenValues and left the two new classes cleanly at the end of the file. No test lost; 29 pass, and collection now shows the property tests back under GoldenValues.

Verified (no change needed):

  • Byte-identical df_parts= path is genuinely proven, not a tautology. Recomputed the regression fingerprint on the independent master install: nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 — matches the pinned constants exactly. The anchor would fail if the legacy path drifted, and it agrees with master, so byte-identity is real.
  • Mutual exclusion is complete and correct. (df_parts is None) == (df_seq is None) rejects both-given and neither-given; batch=True + df_seq, list_parts without df_seq, and non-DataFrame df_seq (via get_df_partscheck_df_seq) all raise clear ValueErrors. Confirmed each with concrete inputs.
  • Internal df_parts construction reuses get_df_parts (single call, no duplicated slicing logic; downstream re-validation is cheap and harmless).
  • Docstring — param order matches the signature, two distinct versionchanged:: 1.1.0 notes, Examples include resolves to an existing file. check_docstrings = 0 defects, doc_signature_drift = 0 candidates.
  • Efficiency — no needless recompute/copy; the df_seq path adds exactly one get_df_parts call.

Residual concerns (not blocking, not fixed):

  • test_df_parts_path_byte_identical_golden duplicates the existing test_golden_segment_whole_mean (both assert AC Segment(1,1) → 1.5). Intentional redundancy as an explicit regression lock; left as-is.
  • The two open API questions in the original body (param position after batch; hard-error vs. no-op for list_parts with df_parts) remain design calls for the maintainer.

Iterative review log

  • Round 2 (correctness & edge cases): found + fixed a misleading error path — a non-bool truthy batch (e.g. batch=1) with df_seq raised the "batch=True not supported with df_seq" message instead of the type error, and behaved inconsistently vs. the df_parts path. Hoisted ut.check_bool(name="batch") above the mutual-exclusion resolution (removed the now-duplicate later check) so batch is type-validated first; added test_invalid_batch_type_with_df_seq. 30 tests pass.
  • Round 3 (numerical robustness & API/param-name consistency with get_df_parts): re-proved the legacy df_parts= fingerprint is non-tautological by recomputing it on the independent master install (no df_seq param) — nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 match the pinned constants exactly, so the anchor would fail on any drift. Confirmed mutual-exclusion is complete (both/neither/df_seq+batch/list_parts-without-df_seq) and part-building is a single get_df_parts delegation (no duplicated slicing). df_seq/list_parts names + phrasing mirror the sibling get_df_parts. Fixed one docstring-accuracy gap: documented that the shortcut only forwards list_parts while other get_df_parts options (jmd_n_len/jmd_c_len/tmd_len) use defaults, pointing to the two-step path for full control. 30 tests pass; check_docstrings 0 defects.
  • Round 4 (efficiency & simplicity): converged — no new defects. The df_seq path adds exactly one get_df_parts delegation (no loops, no needless copies, no recompute); the only redundancy (a duplicate check_bool(batch)) was already removed in round 2. The shared check_df_parts re-validation of the built df_parts is the common path for both input sources and is cheap — branching to skip it would add complexity, so it stays.
  • Round 5 (guides, docs & tests completeness): converged — no new defects. numpydoc has a one-line summary, named Returns, and an Examples include that resolves; no ADR refs / no print / ut.check_* in the diff. Frontend validates every public param (batch type-checked, df_seq/list_parts via get_df_parts delegation + mutual-exclusion), each new/changed param has positive+negative tests. Full area gate green: sequence_feature_tests 440 passed, param_coverage 5, import-hygiene 2, docstring-contracts 13, doc-signature-drift 0, example nbmake 1 passed.

breimanntools and others added 4 commits July 1, 2026 01:29
Add df_seq= and list_parts= to SequenceFeature.feature_matrix so the
get_df_parts -> feature_matrix two-step collapses into one call. When
df_seq is given, df_parts is built internally via get_df_parts; when
df_parts is given the path is unchanged (byte-identical). Exactly one of
df_parts / df_seq must be supplied (bare ValueError in the Validate block);
batch=True and list_parts-without-df_seq are rejected.

Ripple: numpydoc (df_seq/list_parts params + versionchanged + canonical
df_seq baseline), example notebook df_seq cell (re-executed with outputs),
unit tests (positive+negative per new param + byte-identical regression on
the legacy df_parts path), release-notes Unreleased entry.

Part of #305.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted)

The df_seq PR inserted the new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression
classes between TestFeatureMatrixGoldenValues and its two trailing property tests
(test_property_shape_rows_cols / test_property_parallel_equals_serial), silently
moving them under the regression class. Restore them to GoldenValues where they
belong; the new classes now sit cleanly at the end of the file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hoist ut.check_bool(name='batch') above the mutual-exclusion block so a
non-bool truthy batch (e.g. batch=1) with df_seq raises the type error,
consistent with the df_parts path, instead of the df_seq-incompat message.
Remove the now-duplicate later check; add test_invalid_batch_type_with_df_seq.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clarify that the df_seq path uses get_df_parts defaults for its other
options (jmd_n_len/jmd_c_len/tmd_len) and point to the two-step get_df_parts
path for full control. API-consistency/docstring-accuracy only; no code change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant