feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316
Draft
breimanntools wants to merge 4 commits into
Draft
feat: feature_matrix accepts df_seq + list_parts (prototype #306)#316breimanntools wants to merge 4 commits into
breimanntools wants to merge 4 commits into
Conversation
Add df_seq= and list_parts= to SequenceFeature.feature_matrix so the get_df_parts -> feature_matrix two-step collapses into one call. When df_seq is given, df_parts is built internally via get_df_parts; when df_parts is given the path is unchanged (byte-identical). Exactly one of df_parts / df_seq must be supplied (bare ValueError in the Validate block); batch=True and list_parts-without-df_seq are rejected. Ripple: numpydoc (df_seq/list_parts params + versionchanged + canonical df_seq baseline), example notebook df_seq cell (re-executed with outputs), unit tests (positive+negative per new param + byte-identical regression on the legacy df_parts path), release-notes Unreleased entry. Part of #305. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ted) The df_seq PR inserted the new TestFeatureMatrixDfSeq / TestFeatureMatrixRegression classes between TestFeatureMatrixGoldenValues and its two trailing property tests (test_property_shape_rows_cols / test_property_parallel_equals_serial), silently moving them under the regression class. Restore them to GoldenValues where they belong; the new classes now sit cleanly at the end of the file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Hoist ut.check_bool(name='batch') above the mutual-exclusion block so a non-bool truthy batch (e.g. batch=1) with df_seq raises the type error, consistent with the df_parts path, instead of the df_seq-incompat message. Remove the now-duplicate later check; add test_invalid_batch_type_with_df_seq. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Clarify that the df_seq path uses get_df_parts defaults for its other options (jmd_n_len/jmd_c_len/tmd_len) and point to the two-step get_df_parts path for full control. API-consistency/docstring-accuracy only; no code change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Morning status — SOLID. Works end to end with green local tests and the full ripple done.
df_seq=/list_parts=added toSequenceFeature.feature_matrix; the legacydf_parts=path is byte-identical (golden + dataset-fingerprint regression tests). Review focus: the Validate-block wording and the choice to rejectbatch=Truewithdf_seq(single-DataFrame input). One open API question below.Part of #305 / prototype for #306.
What changed
SequenceFeature.feature_matrix(features, df_parts=None, ..., df_seq=None, list_parts=None):df_parts/df_seqmust be supplied → else bareValueError.df_seqis given,df_partsis built internally viaget_df_parts(df_seq, list_parts).df_partsis given the code path is unchanged (byte-identical).batch=Truetogether withdf_seqraisesValueError;list_partswithoutdf_seqraisesValueError.df_partsdefault flipped toNone(additive; existing positional/keyword calls unaffected).df_seq/list_partsparams (df_sequses the canonical baseline phrasing), aversionchangednote,df_partsmarked optional.examples/feature_engineering/sf_feature_matrix.ipynb: newdf_seqcell, re-executed with embedded outputs (nbmake-green).test_sf_feature_matrix.py):TestFeatureMatrixDfSeq(positive + negative per new param) andTestFeatureMatrixRegression(byte-identical legacydf_partspath: hand-computed 1.5 golden + dataset fingerprint).Local gates (all green, run against the worktree)
test_sf_feature_matrix.py— 29 passed (was 18 before the new tests).tests/unit/sequence_feature_tests/+test_param_coverage.py— 444 passed.aap.predict_samplespipe — 70 passed.KPIs (issue #306)
df_parts=output byte-identical (regression test on canonical fixture).feature_matrix(df_seq=..., list_parts=..., ...)equals the two-step result exactly.batch+df_seq/list_partsw/odf_seq→ValueError).Open API question
df_seq/list_partsare appended afterbatchto preserve positional compatibility, so they read out of logical order in the signature. Acceptable, or should they sit next todf_parts(a positional-compat break for the rare positionalbatch/n_jobscaller)?list_partspassed alongsidedf_partsbe a hardValueError(current) or a silently-ignored no-op? Chose ValueError for clarity.🤖 Generated with Claude Code
Critical self-review (reviewer pass)
Adversarial review by a second agent. Findings and fixes:
Defect found + fixed — property tests silently reparented. The new
TestFeatureMatrixDfSeq/TestFeatureMatrixRegressionclasses were inserted betweenTestFeatureMatrixGoldenValuesand its two trailing methods (test_property_shape_rows_cols,test_property_parallel_equals_serial), so those two pre-existing property tests were moved under the regression class — where they don't belong semantically ("byte-identical drift lock" vs. shape/parallel invariants). Restored them toTestFeatureMatrixGoldenValuesand left the two new classes cleanly at the end of the file. No test lost; 29 pass, and collection now shows the property tests back underGoldenValues.Verified (no change needed):
df_parts=path is genuinely proven, not a tautology. Recomputed the regression fingerprint on the independentmasterinstall:nansum=329.66268,X[0,0]=0.7,X[-1,-1]=0.16033— matches the pinned constants exactly. The anchor would fail if the legacy path drifted, and it agrees with master, so byte-identity is real.(df_parts is None) == (df_seq is None)rejects both-given and neither-given;batch=True + df_seq,list_partswithoutdf_seq, and non-DataFramedf_seq(viaget_df_parts→check_df_seq) all raise clearValueErrors. Confirmed each with concrete inputs.df_partsconstruction reusesget_df_parts(single call, no duplicated slicing logic; downstream re-validation is cheap and harmless).versionchanged:: 1.1.0notes,Examplesinclude resolves to an existing file.check_docstrings= 0 defects,doc_signature_drift= 0 candidates.df_seqpath adds exactly oneget_df_partscall.Residual concerns (not blocking, not fixed):
test_df_parts_path_byte_identical_goldenduplicates the existingtest_golden_segment_whole_mean(both assertACSegment(1,1) → 1.5). Intentional redundancy as an explicit regression lock; left as-is.batch; hard-error vs. no-op forlist_partswithdf_parts) remain design calls for the maintainer.Iterative review log
batch(e.g.batch=1) withdf_seqraised the "batch=True not supported with df_seq" message instead of the type error, and behaved inconsistently vs. thedf_partspath. Hoistedut.check_bool(name="batch")above the mutual-exclusion resolution (removed the now-duplicate later check) sobatchis type-validated first; addedtest_invalid_batch_type_with_df_seq. 30 tests pass.get_df_parts): re-proved the legacydf_parts=fingerprint is non-tautological by recomputing it on the independent master install (nodf_seqparam) — nansum=329.66268, X[0,0]=0.7, X[-1,-1]=0.16033 match the pinned constants exactly, so the anchor would fail on any drift. Confirmed mutual-exclusion is complete (both/neither/df_seq+batch/list_parts-without-df_seq) and part-building is a singleget_df_partsdelegation (no duplicated slicing).df_seq/list_partsnames + phrasing mirror the siblingget_df_parts. Fixed one docstring-accuracy gap: documented that the shortcut only forwardslist_partswhile otherget_df_partsoptions (jmd_n_len/jmd_c_len/tmd_len) use defaults, pointing to the two-step path for full control. 30 tests pass;check_docstrings0 defects.df_seqpath adds exactly oneget_df_partsdelegation (no loops, no needless copies, no recompute); the only redundancy (a duplicatecheck_bool(batch)) was already removed in round 2. The sharedcheck_df_partsre-validation of the builtdf_partsis the common path for both input sources and is cheap — branching to skip it would add complexity, so it stays.Returns, and anExamplesinclude that resolves; no ADR refs / noprint/ut.check_*in the diff. Frontend validates every public param (batchtype-checked,df_seq/list_partsviaget_df_partsdelegation + mutual-exclusion), each new/changed param has positive+negative tests. Full area gate green: sequence_feature_tests 440 passed, param_coverage 5, import-hygiene 2, docstring-contracts 13, doc-signature-drift 0, example nbmake 1 passed.