Skip to content

Restore scorer-training (.param) workflow on this fork#27

Open
ypriverol wants to merge 4 commits intodevfrom
feat/scorer-trainer-recovery
Open

Restore scorer-training (.param) workflow on this fork#27
ypriverol wants to merge 4 commits intodevfrom
feat/scorer-trainer-recovery

Conversation

@ypriverol
Copy link
Copy Markdown
Member

Summary

  • Restores the upstream MS-GF+ scorer-training entry point so users can train new .param scoring models on this fork. Five files brought back from upstream MSGFPlus/msgfplus@master and rewired to the post-cleanup tree (no mzID/ParamManager dependencies revived).
  • Adds 31 tests covering CLI argument validation, TSV parsing, error paths, and an end-to-end smoke test that produces and round-trips an actual .param file via NewRankScorer.
  • Adds a user-facing docs/training-scoring-models.md explaining the workflow.

What's restored

File LOC Role
ui/ScoringParamGen.java 205 CLI front door (manual args[] parser, picocli-free)
msscorer/ScoringParameterGeneratorWithErrors.java 841 High-res Astral-suitable trainer
msscorer/ScoringParameterGenerator.java 718 Parent class
msutil/AnnotatedSpectra.java 325 TSV PSM loader
misc/TrainScoringParameters.java 164 Alt entry point

Upstream removal commits: 9bf01c8 (Apr 19 — mzID + ScoringParamGen + AnnotatedSpectra), 657cc5e (Apr 27 — both ScoringParameterGenerator*), c400668 (earlier — TrainScoringParameters). Files fetched from upstream master and rewired:

Adapter From To Note
parser.MgfSpectrumParser gone mgf.MgfSpectrumParser Renamed in dfa5dd9
parser.BufferedLineReader gone mgf.BufferedLineReader Same rename
mzid.MzIDParser gone cut .mzid inputs return explicit error; pre-convert to TSV
params.ParamManager gone cut Manual args[] parsing
mzml.MzMLAdapter.turnOffLogs() gone dropped Only StaxMzMLParser exists in this fork
Field types Hashtable<…> HashMap<…> Match NewRankScorer's narrowed signatures

Tests added (31 total, all <2s combined)

Class Tests Coverage
TestScoringParamGenRecovery 17 CLI argument parsing, required-arg validation, invalid-enum rejection (activation/instrument/enzyme/protocol), existence checks, happy-path early exit when no PSMs pass FDR floor
TestAnnotatedSpectraRecovery 11 .mzid rejection ("not supported in this build"), unknown extensions, TSV header validation, FDR column aliases, real MS2 integration via iprg-2013/F13.mgf
TestScoringParamGenSmoke 3 End-to-end: synthesize 200 unique tryptic peptides × 3 reps = 600 charge-2 annotated spectra → generateParameters → assert .param exists, round-trip-load via NewRankScorer, verify activation/instrument/enzyme/partition set. Documents the partition-step minimum (~360 dedup-survived spectra).

Combined regression suite (recovery + Phase B calibrator + scaffolding + PIN writer + isolation): 88/88 pass.

Documentation

docs/training-scoring-models.md covers:

  • When to train a new model (vs. using the bundled .param).
  • Required inputs and their formats.
  • Three-step recipe: MS-GF+ search → ScoringParamGen → use the .param.
  • TSV column spec (matches DirectTSVWriter output by construction; no converter needed).
  • CLI reference for the recovered entry point.
  • Caveats specific to this fork (mzID excised, ParamManager excised, manual -i -d -m -inst -e arg parsing).
  • Empirical minimum-input note from TestScoringParamGenSmoke.

What's NOT in scope

  • The trainer is not wired into the main MSGFPlus CLI as a subcommand. It's a standalone entry point invoked via java -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen ..., matching the upstream UX.
  • .mzid PSM input. Pre-convert to TSV upstream; the existing DirectTSVWriter output is already trainer-ready.
  • params/ParamManager framework restoration.

Test plan

  • CI: mvn -B verify passes on dev after merge.
  • Run TestScoringParamGenSmoke directly: mvn -B -o test -Dtest=TestScoringParamGenSmoke → 3/3 pass in <2s, .param produced and round-trippable.
  • Invoke the CLI on a real run: search a small mzML with -tda 1 -addFeatures 1 -o train.tsv, then java -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen -i train.tsv -d <specdir> -m HCD -inst HighRes -e Tryp.param lands in cwd.
  • OFF-mode parity unaffected — recovered files are pure additions; no calibrator/scanner/CLI-search-path code touched.

ypriverol added 4 commits May 1, 2026 15:24
Recovers the MS-GF+ scorer training entry points stripped in c400668,
9bf01c8, and 657cc5e so a developer can produce a .param model file
again without restoring the legacy mzid/ package or params/ParamManager.

Files added:
- ui/ScoringParamGen.java with a self-contained args[] CLI (replaces
  the deleted ParamManager-based one). Accepts pre-converted .tsv
  training results plus a spectrum directory; mzID input is rejected
  with a clear message because mzid/MzIDParser is gone in this fork.
- msscorer/ScoringParameterGenerator.java and
  msscorer/ScoringParameterGeneratorWithErrors.java with imports
  rewired (parser.MgfSpectrumParser -> mgf.MgfSpectrumParser) and the
  protected-table assignments switched from Hashtable to HashMap to
  match the current NewRankScorer field types.
- msutil/AnnotatedSpectra.java with parser.BufferedLineReader rewired
  to mgf.BufferedLineReader and the mzID branch in parseFile() turned
  into an explicit "not supported in this build" error path.
- misc/TrainScoringParameters.java restored verbatim (its imports
  already match the current tree).

The MzMLAdapter.turnOffLogs() call from upstream ScoringParamGen is
dropped since mzml/MzMLAdapter no longer exists; the non-Stax mzml
adapter was deleted in the cleanup. No pom.xml or non-trainer files
were modified.

mvn -B -o compile passes; ScoringParamGen prints help/usage when
launched with no args.
…atedSpectra

Two test classes covering the recovery surface only — argument parsing,
TSV parsing, error paths, and integration with the SpectraAccessor /
mgf reader. The scoring-parameter trainer itself
(ScoringParameterGeneratorWithErrors.generateParameters) is NOT
exercised; that needs thousands of confidently-annotated PSMs and is
out of scope for a recovery smoke test.

TestScoringParamGenRecovery (17 tests):
  - Empty / malformed argv (lone "-i", bare positional, unknown option)
  - Each required-arg-missing path (-i, -d, -m), each invalid-enum path
    (activation method, instrument, enzyme, protocol)
  - Existence checks for the input TSV and the spectrum directory
  - "-thread abc" rejected with a typed message
  - Happy-path arg parsing followed by an empty TSV: error bubbles up
    from the parser
  - Happy-path arg parsing with a valid-shape TSV whose only row is
    above the FDR floor: returns "No results to train on. Exiting."
    *without* invoking the trainer (proves the early-exit gate works)
  - main() with empty args / -h / --help / -help does not throw

TestAnnotatedSpectraRecovery (11 tests):
  - .mzid files rejected with the explicit "not supported in this build"
    message (the mzid/ package was removed; recovery cuts that input
    path)
  - Unknown extensions rejected
  - TSV header validation: must start with "#"; required columns
    (#SpecFile, SpecID, Peptide) checked; FDR column aliases (FDR,
    EFDR, QValue, SpecQValue) accepted interchangeably
  - Rows above the FDR threshold are dropped silently
  - Real MS2 spectrum integration via the existing
    src/test/resources/iprg-2013/F13.mgf fixture: TSV references a
    SpecID that doesn't exist in the mgf, parser returns
    "F13.mgf:index=99999999 is not available!" cleanly (no crash)
  - High-level parse() aggregates multiple files; fails fast on first
    error when dropErrors is disabled

The F13.mgf fixture is real spectrum data (~146K lines of MS2). Astral
mzML files are not shipped as test fixtures (size); the integration
boundary is the same — SpectraAccessor lookup against a real spec file —
so testing against F13 proves the recovered code reads spectra
correctly. The recovered trainer is instrument-agnostic by construction.

Result: 28/28 new tests pass; combined scoped suite (recovery +
calibrator + scaffolding + pin writer + concurrent runner) is 79/79.
Closes the gap left by the existing recovery tests, which only cover CLI
arg parsing and TSV-error paths. This test feeds a synthetic in-memory
SpectraContainer of annotated spectra straight into
ScoringParameterGeneratorWithErrors.generateParameters, asserts the .param
binary and .param.txt debug dump are produced, and round-trips the binary
through NewRankScorer to prove the on-disk format matches what MS-GF+
loads at search time.

Three @test methods:
 - generatesParamFileFromSyntheticAnnotatedSpectra: 200 unique tryptic
   peptides * 3 reps = 600 charge-2 spectra; HCD/LowRes/Tryp; verifies
   file production, non-empty content, and full round-trip via
   NewRankScorer (activation method, instrument, enzyme, partitions).
 - debugTextDumpReferencesActivationAndEnzyme: re-runs at smaller scale
   with CID and inspects the human-readable .param.txt header lines.
 - trainerHandlesInputSizeJustAboveMinimumPartitionThreshold: documents
   the trainer's effective input-size floor (~360 spectra at one charge
   after the (peptide, charge) dedup to numSpecsPerPeptide=3) so future
   readers don't accidentally shrink fixtures below it.

Runs in ~1.3s. Uses LOW_RESOLUTION_LTQ to keep errorScalingFactor=0 and
skip the optional ion-error / noise-error stages while still exercising
charge histogram, partitioning, precursor OFF, fragment OFF, and rank
distribution writes.
Adds a user-facing documentation page covering the recovered scorer
training workflow:

  - When to train a new model (vs. relying on the bundled .param files).
  - Required inputs: spectra, FASTA, mods file, and a target/decoy
    MS-GF+ search at 1% FDR providing annotated PSMs.
  - End-to-end three-step recipe: search -> ScoringParamGen -> use the
    .param.
  - The TSV input format AnnotatedSpectra expects (#SpecFile, SpecID,
    Peptide, Charge, FDR/EFDR/QValue/SpecQValue), which matches what
    DirectTSVWriter already emits, so no converter is needed.
  - The CLI reference for the recovered ScoringParamGen entry point.
  - Caveats specific to this fork: mzID input was removed (commit
    9bf01c8); pre-convert to TSV. The legacy params/ParamManager CLI
    framework is gone; the recovered entry point parses args manually.
  - Empirical minimum-input note pinned by TestScoringParamGenSmoke
    (~360 unique-peptide-tagged spectra after the trainer's internal
    dedup; ~500 unique peptides recommended for usable rank
    distributions).

The doc lives under docs/ rather than .claude/plans/ so it ships in
the public repo and is reachable from the project root.
@qodo-code-review
Copy link
Copy Markdown

ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 1, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ab441ae5-4cf8-43b4-9042-2fdb3fbbf4a4

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/scorer-trainer-recovery

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant