Restore scorer-training (.param) workflow on this fork#27
Open
Restore scorer-training (.param) workflow on this fork#27
Conversation
Recovers the MS-GF+ scorer training entry points stripped in c400668, 9bf01c8, and 657cc5e so a developer can produce a .param model file again without restoring the legacy mzid/ package or params/ParamManager. Files added: - ui/ScoringParamGen.java with a self-contained args[] CLI (replaces the deleted ParamManager-based one). Accepts pre-converted .tsv training results plus a spectrum directory; mzID input is rejected with a clear message because mzid/MzIDParser is gone in this fork. - msscorer/ScoringParameterGenerator.java and msscorer/ScoringParameterGeneratorWithErrors.java with imports rewired (parser.MgfSpectrumParser -> mgf.MgfSpectrumParser) and the protected-table assignments switched from Hashtable to HashMap to match the current NewRankScorer field types. - msutil/AnnotatedSpectra.java with parser.BufferedLineReader rewired to mgf.BufferedLineReader and the mzID branch in parseFile() turned into an explicit "not supported in this build" error path. - misc/TrainScoringParameters.java restored verbatim (its imports already match the current tree). The MzMLAdapter.turnOffLogs() call from upstream ScoringParamGen is dropped since mzml/MzMLAdapter no longer exists; the non-Stax mzml adapter was deleted in the cleanup. No pom.xml or non-trainer files were modified. mvn -B -o compile passes; ScoringParamGen prints help/usage when launched with no args.
…atedSpectra
Two test classes covering the recovery surface only — argument parsing,
TSV parsing, error paths, and integration with the SpectraAccessor /
mgf reader. The scoring-parameter trainer itself
(ScoringParameterGeneratorWithErrors.generateParameters) is NOT
exercised; that needs thousands of confidently-annotated PSMs and is
out of scope for a recovery smoke test.
TestScoringParamGenRecovery (17 tests):
- Empty / malformed argv (lone "-i", bare positional, unknown option)
- Each required-arg-missing path (-i, -d, -m), each invalid-enum path
(activation method, instrument, enzyme, protocol)
- Existence checks for the input TSV and the spectrum directory
- "-thread abc" rejected with a typed message
- Happy-path arg parsing followed by an empty TSV: error bubbles up
from the parser
- Happy-path arg parsing with a valid-shape TSV whose only row is
above the FDR floor: returns "No results to train on. Exiting."
*without* invoking the trainer (proves the early-exit gate works)
- main() with empty args / -h / --help / -help does not throw
TestAnnotatedSpectraRecovery (11 tests):
- .mzid files rejected with the explicit "not supported in this build"
message (the mzid/ package was removed; recovery cuts that input
path)
- Unknown extensions rejected
- TSV header validation: must start with "#"; required columns
(#SpecFile, SpecID, Peptide) checked; FDR column aliases (FDR,
EFDR, QValue, SpecQValue) accepted interchangeably
- Rows above the FDR threshold are dropped silently
- Real MS2 spectrum integration via the existing
src/test/resources/iprg-2013/F13.mgf fixture: TSV references a
SpecID that doesn't exist in the mgf, parser returns
"F13.mgf:index=99999999 is not available!" cleanly (no crash)
- High-level parse() aggregates multiple files; fails fast on first
error when dropErrors is disabled
The F13.mgf fixture is real spectrum data (~146K lines of MS2). Astral
mzML files are not shipped as test fixtures (size); the integration
boundary is the same — SpectraAccessor lookup against a real spec file —
so testing against F13 proves the recovered code reads spectra
correctly. The recovered trainer is instrument-agnostic by construction.
Result: 28/28 new tests pass; combined scoped suite (recovery +
calibrator + scaffolding + pin writer + concurrent runner) is 79/79.
Closes the gap left by the existing recovery tests, which only cover CLI arg parsing and TSV-error paths. This test feeds a synthetic in-memory SpectraContainer of annotated spectra straight into ScoringParameterGeneratorWithErrors.generateParameters, asserts the .param binary and .param.txt debug dump are produced, and round-trips the binary through NewRankScorer to prove the on-disk format matches what MS-GF+ loads at search time. Three @test methods: - generatesParamFileFromSyntheticAnnotatedSpectra: 200 unique tryptic peptides * 3 reps = 600 charge-2 spectra; HCD/LowRes/Tryp; verifies file production, non-empty content, and full round-trip via NewRankScorer (activation method, instrument, enzyme, partitions). - debugTextDumpReferencesActivationAndEnzyme: re-runs at smaller scale with CID and inspects the human-readable .param.txt header lines. - trainerHandlesInputSizeJustAboveMinimumPartitionThreshold: documents the trainer's effective input-size floor (~360 spectra at one charge after the (peptide, charge) dedup to numSpecsPerPeptide=3) so future readers don't accidentally shrink fixtures below it. Runs in ~1.3s. Uses LOW_RESOLUTION_LTQ to keep errorScalingFactor=0 and skip the optional ion-error / noise-error stages while still exercising charge histogram, partitioning, precursor OFF, fragment OFF, and rank distribution writes.
Adds a user-facing documentation page covering the recovered scorer
training workflow:
- When to train a new model (vs. relying on the bundled .param files).
- Required inputs: spectra, FASTA, mods file, and a target/decoy
MS-GF+ search at 1% FDR providing annotated PSMs.
- End-to-end three-step recipe: search -> ScoringParamGen -> use the
.param.
- The TSV input format AnnotatedSpectra expects (#SpecFile, SpecID,
Peptide, Charge, FDR/EFDR/QValue/SpecQValue), which matches what
DirectTSVWriter already emits, so no converter is needed.
- The CLI reference for the recovered ScoringParamGen entry point.
- Caveats specific to this fork: mzID input was removed (commit
9bf01c8); pre-convert to TSV. The legacy params/ParamManager CLI
framework is gone; the recovered entry point parses args manually.
- Empirical minimum-input note pinned by TestScoringParamGenSmoke
(~360 unique-peptide-tagged spectra after the trainer's internal
dedup; ~500 unique peptides recommended for usable rank
distributions).
The doc lives under docs/ rather than .claude/plans/ so it ships in
the public repo and is reachable from the project root.
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.paramscoring models on this fork. Five files brought back from upstreamMSGFPlus/msgfplus@masterand rewired to the post-cleanup tree (no mzID/ParamManagerdependencies revived)..paramfile viaNewRankScorer.docs/training-scoring-models.mdexplaining the workflow.What's restored
ui/ScoringParamGen.javaargs[]parser, picocli-free)msscorer/ScoringParameterGeneratorWithErrors.javamsscorer/ScoringParameterGenerator.javamsutil/AnnotatedSpectra.javamisc/TrainScoringParameters.javaUpstream removal commits:
9bf01c8(Apr 19 — mzID +ScoringParamGen+AnnotatedSpectra),657cc5e(Apr 27 — bothScoringParameterGenerator*),c400668(earlier —TrainScoringParameters). Files fetched from upstreammasterand rewired:parser.MgfSpectrumParsermgf.MgfSpectrumParserdfa5dd9parser.BufferedLineReadermgf.BufferedLineReadermzid.MzIDParser.mzidinputs return explicit error; pre-convert to TSVparams.ParamManagerargs[]parsingmzml.MzMLAdapter.turnOffLogs()StaxMzMLParserexists in this forkHashtable<…>HashMap<…>NewRankScorer's narrowed signaturesTests added (31 total, all <2s combined)
TestScoringParamGenRecoveryTestAnnotatedSpectraRecovery.mzidrejection ("not supported in this build"), unknown extensions, TSV header validation, FDR column aliases, real MS2 integration viaiprg-2013/F13.mgfTestScoringParamGenSmokegenerateParameters→ assert.paramexists, round-trip-load viaNewRankScorer, verify activation/instrument/enzyme/partition set. Documents the partition-step minimum (~360 dedup-survived spectra).Combined regression suite (recovery + Phase B calibrator + scaffolding + PIN writer + isolation): 88/88 pass.
Documentation
docs/training-scoring-models.mdcovers:.param).ScoringParamGen→ use the.param.DirectTSVWriteroutput by construction; no converter needed).ParamManagerexcised, manual-i -d -m -inst -earg parsing).TestScoringParamGenSmoke.What's NOT in scope
MSGFPlusCLI as a subcommand. It's a standalone entry point invoked viajava -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen ..., matching the upstream UX..mzidPSM input. Pre-convert to TSV upstream; the existingDirectTSVWriteroutput is already trainer-ready.params/ParamManagerframework restoration.Test plan
mvn -B verifypasses ondevafter merge.TestScoringParamGenSmokedirectly:mvn -B -o test -Dtest=TestScoringParamGenSmoke→ 3/3 pass in <2s,.paramproduced and round-trippable.-tda 1 -addFeatures 1 -o train.tsv, thenjava -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen -i train.tsv -d <specdir> -m HCD -inst HighRes -e Tryp→.paramlands in cwd.