feat: dPULearn.mine_negatives + get_labels + named sample colors (prototype #308)#321
Draft
breimanntools wants to merge 7 commits into
…totype #308)
Three additive conveniences for the positive/unlabelled -> mined-negatives flow, removing the recurring vstack/label-vector/color-lookup plumbing in the gamma-secretase use case. All existing APIs stay byte-identical.
- dPULearn.mine_negatives(X_pos, X_unlabelled, ...): one-call sugar over fit that returns the reliable-negative boolean mask over X_unlabelled. Equals the manual labels_[len(X_pos):]==0 result exactly (regression-tested). fit(X, labels=...) unchanged.
- get_labels(df, positive_label=1, col_label="label"): binary int label vector, the single-call form of (df[col]==x).astype(int).to_numpy().
- COLOR_SAMPLES_POS/NEG/UNL/REL_NEG: public named aliases for the canonical sample colors, equal to plot_get_cdict("DICT_COLOR")["SAMPLES_*"] (golden-tested).
Wired get_labels + the 4 color constants into __init__/__all__ (on the #308 wire-to-public-API list).
Ripple: numpydoc + 2 executed example notebooks (get_labels, dpul_mine_negatives), 39 unit tests (positive+negative+regression), release-notes Unreleased entries, cheat-sheet rows.
Part of #305 / prototype for #308
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #321 +/- ##
=======================================
Coverage 94.93% 94.94%
=======================================
Files 185 186 +1
Lines 17862 17897 +35
Branches 3032 3034 +2
=======================================
+ Hits 16957 16992 +35
Misses 598 598
Partials 307 307
… in get_labels Reorder the get_labels Validate block so check_str(col_label) runs before check_df(cols_required=col_label). A non-str col_label now surfaces a clear 'col_label' error instead of an internal 'cols_required' one. No behaviour change on valid input. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
mine_negatives validated X_pos and X_unlabelled separately with the default check_X min_n_samples=3, so it rejected n_pos<3 inputs that the manual stacking path accepts (the >=3 floor belongs to the stacked matrix, which fit enforces). Relax the per-matrix check to min_n_samples=1 to restore exact equivalence; add tests for the small-positive-set equivalence and get_labels single-class/NaN mapping. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sistency The rest of the package spells it 'unlabeled' (American, 85 uses) and abbreviates the marker as label_unl / n_unl_to_neg; the new public mine_negatives parameter used the British two-L 'X_unlabelled'. Rename the new/unreleased parameter, its match helper, docstrings, tests, cheat-sheet and release-notes entries, and re-execute the example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With no pre-labeled negatives, n_neg (total) and n_unl_to_neg (from the pool) are always equivalent in mine_negatives, so exposing both was redundant. Replace them with a single required n_neg (the method is new/unreleased, so non-breaking); it calls fit(n_unl_to_neg=n_neg) internally. Update docstring, tests, and the re-executed example notebook. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d error mine_negatives delegated n_neg validation to fit (which sees it as n_unl_to_neg), so an invalid n_neg raised an error naming the internal parameter. Validate n_neg explicitly in the frontend so the message names n_neg, and assert the name in the negative test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…-mask # Conflicts: # docs/source/index/release_notes.rst
Status: SOLID
Works end to end, all new local tests green, full ripple done. Three additive
conveniences; every existing API stays byte-identical. Ready for maintainer review.
What to review
- dPULearn.mine_negatives design: a thin wrapper that stacks X_pos over X_unlabelled, fits with 1/2 markers, and returns the boolean mask (labels_[len(X_pos):] == 0). Returns the mask only (not a subframe); the issue said "mask and/or subframe", and X_unlabelled[mask] gives the subframe, so the mask is the minimal primitive. Open question: do you also want a return_df/subframe option?
- get_labels lives in data_handling (re-exported top-level). Added an optional col_label="label" beyond the issue's (df, positive_label=1) signature for multi-column frames; drop it if you want the exact issue signature.
- get_labels raises if positive_label is absent from the column (friendlier than the silent all-zeros of the manual expression). Flag if you'd rather match the manual expression exactly (no raise).
- Color constants are aliases (COLOR_SAMPLES_POS = COLOR_POS, ...) in _constants.py; DICT_COLOR values are untouched, golden test asserts equality.
Part of #305 / prototype for #308. Intentionally has no issue-closing keyword; #308 stays open.
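For reviewers who want the wrapper shape at a glance, here is a minimal sketch of the stack / label / fit / slice flow described above. It is not the code in this PR: fit_stub is a hypothetical stand-in for dPULearn.fit (it picks the unlabeled rows farthest from the positive centroid, which is not the real PCA-based selection), and mine_negatives_sketch is an illustrative name. Only the plumbing around the fit call is the point.

```python
import numpy as np


def fit_stub(X, labels, n_unl_to_neg, label_pos=1, label_unl=2):
    """Hypothetical stand-in for dPULearn.fit (NOT the real algorithm).

    Relabels the n_unl_to_neg unlabeled rows farthest from the positive
    centroid as 0 and returns the updated label vector.
    """
    labels = np.asarray(labels).copy()
    centroid = X[labels == label_pos].mean(axis=0)
    dist = np.linalg.norm(X - centroid, axis=1)
    dist[labels != label_unl] = -np.inf  # only unlabeled rows are eligible
    idx = np.argsort(dist)[-n_unl_to_neg:]
    labels[idx] = 0
    return labels


def mine_negatives_sketch(X_pos, X_unl, n_neg):
    """Stack positives over unlabeled rows, fit with 1/2 markers, return the mask."""
    X = np.vstack([X_pos, X_unl])
    labels = np.array([1] * len(X_pos) + [2] * len(X_unl))
    labels_ = fit_stub(X, labels, n_unl_to_neg=n_neg)
    return labels_[len(X_pos):] == 0


rng = np.random.default_rng(0)
X_pos = rng.normal(size=(5, 4))
X_unl = rng.normal(size=(20, 4))

# One-call form
mask = mine_negatives_sketch(X_pos, X_unl, n_neg=3)

# Manual form the wrapper is meant to equal
X = np.vstack([X_pos, X_unl])
labels = np.array([1] * len(X_pos) + [2] * len(X_unl))
manual_mask = fit_stub(X, labels, n_unl_to_neg=3)[len(X_pos):] == 0

# The subframe is one indexing step away from the mask
X_rel_neg = X_unl[mask]
```

Because the mask indexes X_unl directly, a return_df/subframe option would add only that last indexing step.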
Deliverables (#308)
- dPULearn.mine_negatives(X_pos, X_unlabelled, n_neg=None, n_unl_to_neg=None, metric=None, n_components=0.80) → boolean mask over X_unlabelled rows flagging the mined reliable negatives. Equals the notebook cell 18/24 manual vstack + [1]*.. + [2]*.. + labels_[len(X_pos):]==0 path exactly (regression-tested, incl. on the real DOM_GSEC_PU feature matrix in the example notebook: 49 of 631). The pre-existing fit(X, labels=...) path is untouched (byte-identical, regression-tested).
- aa.get_labels(df, positive_label=1, col_label="label") → binary int label vector; the single-call form of (df[col]==x).astype(int).to_numpy(). Golden-tested on PU / binary / multi-class / string encodings.
- aa.COLOR_SAMPLES_POS / NEG / UNL / REL_NEG: named constants equal to plot_get_cdict("DICT_COLOR")["SAMPLES_*"] (golden test).
Public API
Wired get_labels + the 4 color constants into __init__.py / __all__ (#308 is on the wire-to-public-API list). mine_negatives is a method on the already-public dPULearn.
Ripple
- Executed example notebooks (examples/data_handling/get_labels.ipynb, examples/pu_learning/dpul_mine_negatives.ipynb): pass nbmake
Local verification
- tests/unit/api_tests (param-coverage, abbreviation registry, barrel, ...): 175 passed
🤖 Generated with Claude Code
Critical self-review
Adversarial review-and-improve pass (issue #308). Verdict: implementation is
correct and the three additive surfaces behave as specified. One clarity fix
applied; the rest verified, not rubber-stamped.
Fixed
get_labels validation order. check_df(cols_required=col_label) ran before check_str(col_label), so a non-string col_label surfaced an internal 'cols_required' error rather than a clear 'col_label' one. Reordered so the parameter is validated before it is used as a required-column key. No behaviour change on valid input.
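The ordering fix can be sketched as follows. This is an illustration only: check_str_stub and check_df_stub are hypothetical stand-ins for the package's check_str / check_df helpers (their real signatures and messages are not shown in this PR text), and get_labels_sketch is an assumed shape for the function, built around the manual expression (df[col]==x).astype(int).to_numpy() plus the raise on an absent positive_label.

```python
import pandas as pd


def check_str_stub(name, val):
    """Hypothetical stand-in for check_str: the error names the public parameter."""
    if not isinstance(val, str):
        raise ValueError(f"'{name}' ({val!r}) should be a string")


def check_df_stub(df, cols_required):
    """Hypothetical stand-in for check_df: the error names the internal argument."""
    if not isinstance(df, pd.DataFrame):
        raise ValueError("'df' should be a pandas DataFrame")
    if cols_required not in df.columns:
        raise ValueError(f"'cols_required' ({cols_required!r}) is missing from 'df'")


def get_labels_sketch(df, positive_label=1, col_label="label"):
    # Validate the parameter itself before using it as a required-column key
    check_str_stub("col_label", col_label)
    check_df_stub(df, cols_required=col_label)
    labels = (df[col_label] == positive_label).astype(int).to_numpy()
    # Raise instead of silently returning all zeros
    if labels.sum() == 0:
        raise ValueError(
            f"'positive_label' ({positive_label!r}) not found in column '{col_label}'"
        )
    return labels


df_pu = pd.DataFrame({"label": [1, 0, 2, 1]})
labels_pu = get_labels_sketch(df_pu)

df_str = pd.DataFrame({"label": ["pos", "neg", "unl"]})
labels_str = get_labels_sketch(df_str, positive_label="pos")

# NaN compares unequal to the positive label, so it maps to 0
df_nan = pd.DataFrame({"label": [1, 0, float("nan")]})
labels_nan = get_labels_sketch(df_nan)
```

With this order, passing col_label=3 fails in the first check with a message that mentions col_label, and the column lookup is never reached.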
Verified correct (with a concrete check)
- mine_negatives is purely additive: it vstacks X_pos / X_unlabelled, builds the [1]*n_pos + [2]*n_unl vector, calls the untouched fit(label_pos=1, label_unl=2), and returns labels_[n_pos:] == 0. Golden tests assert byte-equality vs the manual stacking path across seeds {0,1,7} and the cosine metric, and that n_neg == n_unl_to_neg with no pre-labeled negatives.
- fit unchanged. The diff to _dpulearn.py is strictly additive (one helper + one method); the existing fit(X, labels=...) path is not touched, so byte-identity is structural, and the equivalence tests pin the sugar to it.
- COLOR_SAMPLES_* are defined in _constants.py, re-exported via from ._constants import * in utils.py (so ut.COLOR_SAMPLES_* resolves), and a golden test asserts equality vs plot_get_cdict("DICT_COLOR")["SAMPLES_*"].
- __all__ has no duplicates, every entry resolves, and the data_handling front-door docstring lists get_labels. Meta-tests pass: test_param_coverage, test_class_abbreviation_registry, test_backend_import_hygiene, test_docstring_contracts, test_utils_barrel.
- No print() in library code, bare ValueError only, constants via ut.X, # I / # II skeleton, numpydoc with named Returns, no ADR refs. check_docstrings.py reports 0 defects on both changed modules.
- Notebooks pass --nbmake, ship executed outputs, and use display_df for DataFrames.
Notes for the maintainer (not changed; design calls, per spec)
- mine_negatives name/shape. The method both mutates the instance (sets labels_ / df_pu_) and returns a mask, so it is neither a pure .fit (returns self) nor a pure query. The name reads well and the dual behaviour is documented, but it is a third public verb on a Wrapper whose template is .fit / .eval. Worth a conscious sign-off on the name and the mutate+return shape.
- n_neg vs n_unl_to_neg redundancy in mine_negatives. With no pre-labeled-negative path exposed, the two arguments are always equivalent here (unlike fit, where label_neg distinguishes them). Both are kept for parity with fit; you may prefer to expose only one to reduce surface.
Residual concern I did not "fix"
The TestFitUnchanged regression asserts contract properties (positives stay 1, exactly N mined 0, values ⊆ {0,1,2}) rather than a pinned golden vector. It does not, on its own, catch a future algorithm change to fit (the equivalence tests would move with it). I left it as-is: byte-identity vs
master is already guaranteed by the additive diff, and pinning a random
PCA-selection vector is brittle and belongs in the dedicated exact-value
regression suite, not here.
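To make the gap concrete, here is a small sketch of what a contract-property check of this kind accepts. check_fit_contract is a hypothetical helper written for this comment, not the TestFitUnchanged code, and the two output vectors are hand-made examples rather than real fit results.

```python
import numpy as np


def check_fit_contract(labels_in, labels_out, n_mined, label_pos=1, label_unl=2):
    """Hypothetical contract check: properties only, no pinned golden vector."""
    labels_in = np.asarray(labels_in)
    labels_out = np.asarray(labels_out)
    assert labels_in.shape == labels_out.shape
    # Positives stay positive
    assert np.all(labels_out[labels_in == label_pos] == label_pos)
    # Exactly n_mined rows were relabeled as reliable negatives
    assert int(np.sum(labels_out == 0)) == n_mined
    # No label outside {0, positive, unlabeled}
    assert set(np.unique(labels_out).tolist()) <= {0, label_pos, label_unl}
    return True


labels_in = [1, 1, 2, 2, 2, 2]
out_a = [1, 1, 0, 0, 2, 2]  # one possible selection
out_b = [1, 1, 2, 2, 0, 0]  # a different selection

both_pass = (
    check_fit_contract(labels_in, out_a, n_mined=2)
    and check_fit_contract(labels_in, out_b, n_mined=2)
)
```

Both selections satisfy the contract even though they disagree on which rows were mined, which is exactly why a change in the selection algorithm would go unnoticed by this test alone.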
Iterative review log
- mine_negatives validated X_pos / X_unlabelled separately with the default check_X min_n_samples=3, so it raised for n_pos<3 while the manual stacked path accepts it (the >=3 floor belongs to the stacked matrix, enforced by fit), breaking the "equals the manual path exactly" contract. Relaxed per-matrix validation to min_n_samples=1 (float coercion + feature-dim check kept; stacked fit still enforces >=3). Added tests: n_pos=1 mask-equals-manual; get_labels single-class column maps to all-ones (no spurious raise), NaN maps to 0. (39 -> 42 new tests green; full dpulearn suite 95 green.)
- The new public mine_negatives parameter was spelled X_unlabelled (British, two L), but the entire codebase uses American "unlabeled" (85 uses, plus the label_unl / n_unl_to_neg markers). Renamed the new/unreleased parameter, helper check_match_X_pos_X_unlabeled, docstrings, tests, cheat-sheet and release-notes entries to X_unlabeled for sibling consistency; re-executed the example notebook (49 of 631, mask-equals-manual: True). No numerical-robustness defects found (seeded determinism, feature-dim + count validation all correct). Maintainer note: an even terser X_unl would pair most tightly with X_pos / X_neg / label_unl, but I kept the full word per the issue author's explicit X_unlabelled naming.
- In mine_negatives there are never pre-labeled negatives, so n_neg (total) and n_unl_to_neg (from-pool) were always mathematically equivalent: a confusing always-equivalent duplicate. Collapsed the two count arguments into a single required n_neg (the method is new/unreleased, so non-breaking); it calls fit(n_unl_to_neg=n_neg) internally. Updated docstring, tests (dropped the now-moot both/neither/equivalence cases, added an n_neg<1 negative test), and the example notebook (re-executed: 49 of 631, mask-equals-manual: True). The body-and-return shape (fits the instance and returns the mask) is intentional/documented and kept as-is. No loop/vectorization defects (already np.vstack + boolean-mask, no iterrows).
- An invalid n_neg raised 'n_unl_to_neg' should be an integer with n >= 1 (naming an internal parameter mine_negatives users never see). Added an explicit ut.check_number_range(name="n_neg", ...) in the frontend so the message names n_neg, closing the "frontend validates every public param" gap; strengthened the negative test with match="n_neg". Verified: numpydoc named Returns + one-line summaries, Examples includes resolve (both notebooks exist and pass nbmake), no ADR refs in any new .py, bare ValueError only, no print, citations resolve. Full area gate green: 917 unit tests (dpulearn / data_handling / plotting / api meta-tests incl. param-coverage, import-hygiene, barrel), docstring checker 0 defects, doc/signature-drift 0 candidates, nbmake 2/2.
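The last log entry, validating n_neg in the frontend so the error names the public parameter, can be illustrated with a dependency-free sketch. Every name here is hypothetical: check_number_range_stub only imitates the kind of check the log attributes to ut.check_number_range (whose real signature is not shown here), and the two mine_negatives_* functions model the before/after behaviour without any fitting.

```python
def check_number_range_stub(name, val, min_val=1):
    """Hypothetical stand-in for the range check; the message names `name`."""
    if isinstance(val, bool) or not isinstance(val, int) or val < min_val:
        raise ValueError(f"'{name}' should be an integer with n >= {min_val}")


def fit_stub(n_unl_to_neg):
    """Stand-in for fit: it only knows its own parameter name."""
    check_number_range_stub("n_unl_to_neg", n_unl_to_neg)
    return n_unl_to_neg


def mine_negatives_delegating(n_neg):
    # Before: validation is left to fit, so the error names n_unl_to_neg
    return fit_stub(n_unl_to_neg=n_neg)


def mine_negatives_frontend(n_neg):
    # After: the frontend validates its own public parameter first
    check_number_range_stub("n_neg", n_neg)
    return fit_stub(n_unl_to_neg=n_neg)


def error_message(func, value):
    """Return the ValueError message raised by func(value), or None."""
    try:
        func(value)
    except ValueError as exc:
        return str(exc)
    return None


msg_before = error_message(mine_negatives_delegating, 0)
msg_after = error_message(mine_negatives_frontend, 0)
```

A negative test that matches on "n_neg" passes against the frontend version and fails against the delegating one, which is the behaviour the strengthened test is described as pinning.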