bigbio · ypriverol · May 1, 2026
diff --git a/docs/training-scoring-models.md b/docs/training-scoring-models.md
@@ -0,0 +1,181 @@
+# Training MS-GF+ scoring models
+
+MS-GF+ ships with a set of pre-trained scoring models (`.param` files in
+`src/main/resources/`) that cover the common combinations of activation
+method, instrument type, enzyme, and protocol. The bundled set includes
+HCD/QExactive/Tryp, HCD/HighRes/Tryp/TMT, and so on. If your data does
+not match any of the bundled combinations, or if you want a model
+specifically tuned for your instrument, you can train your own.
+
+This page describes the recovered training entry point on this fork and
+the end-to-end workflow.
+
+## When to train a new model
+
+You need a custom model only when:
+
+- Your activation/instrument/enzyme/protocol combination has no bundled
+  `.param` file. Run a search with your own settings (for example
+  `-inst HighRes -m HCD -e Tryp`) and watch the startup log: if MS-GF+
+  falls back to a generic model, that's the signal.
+- Your instrument's fragmentation pattern differs materially from the
+  bundled training data (e.g. a recent-generation Astral vs. a
+  2014-era Q Exactive).
+- You want to compare an in-house trained model against the bundled one
+  to quantify the gain.
+
+For most users on standard tryptic HCD runs, the bundled model is fine
+and Phase B's calibrated precursor-window tightening (the
+`-precursorCal auto` flag) is the bigger lever.
+
+## What you need
+
+1. **Spectra**: one or more mzML or MGF files of MS/MS data from the
+   instrument and acquisition mode you want to train for.
+2. **A protein database**: a FASTA covering the species in your
+   training data.
+3. **A modifications file**: standard MS-GF+ `Mods.txt` format.
+4. **A target/decoy MS-GF+ search of (1) against (2) at standard 1%
+   FDR**: this provides the annotated PSMs the trainer learns from.
+   The trainer needs **a few hundred** confidently identified PSMs to
+   produce a stable model; thousands is better.
+
+The mzID input that upstream MS-GF+ supported was removed from this
+fork. The trainer now accepts only TSV PSM lists. The standard MS-GF+
+TSV writer (`DirectTSVWriter`, the one you get with the default search
+output) produces TSVs that already match the trainer's expected format.
+
+## Workflow
+
+### Step 1 — Search your training data
+
+Run an MS-GF+ search the same way you would for any project, with
+`-tda 1` (target-decoy on) so the output has a `QValue` column the
+trainer can filter on. Use a precursor tolerance appropriate for the
+instrument; for high-resolution data, `-precursorCal auto` is
+recommended so Phase B's calibration tightens the window before the
+search.
+
+Example:
+
+```sh
+java -Xmx16G -jar MSGFPlus.jar \
+  -s training.mzML \
+  -d training.fasta \
+  -mod Mods.txt \
+  -t 10ppm \
+  -tda 1 \
+  -inst HighRes \
+  -m HCD \
+  -e Tryp \
+  -precursorCal auto \
+  -addFeatures 1 \
+  -o training.tsv
+```
+
+The output `training.tsv` is the input to the trainer.
+
+### Step 2 — Train the scoring model
+
+Invoke `ScoringParamGen` with the same activation method, instrument
+type, enzyme, and protocol you used in step 1. The trainer pulls the
+high-confidence PSMs (default `QValue ≤ 0.01`), looks up each PSM's
+spectrum in the directory you supply, and writes a `.param` file in
+the current working directory.
+
+```sh
+java -Xmx4G -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen \
+  -i training.tsv \
+  -d /path/to/spectra-directory \
+  -m HCD \
+  -inst HighRes \
+  -e Tryp \
+  -protocol Standard
+```
+
+The output filename is derived from the data type: e.g. for the
+arguments above, `HCD_HighRes_Tryp_Standard.param`.
+
+### Step 3 — Use the model
+
+Drop the `.param` file into `src/main/resources/` (rebuilding the JAR)
+or supply it via the appropriate parameter file. MS-GF+ will pick it up
+when the search's `(activationMethod, instrumentType, enzyme, protocol)`
+tuple matches the filename.
+
+## TSV input format
+
+`AnnotatedSpectra` (the trainer's TSV reader) requires these columns,
+identified by header name (case-insensitive):
+
+| Column | Required | Description |
+|---|---|---|
+| `#SpecFile` | yes | Spectrum file name (matched by basename against `-d`). |
+| `SpecID` | yes | Index or scan identifier; passed to `SpectraAccessor.getSpectrumById`. |
+| `Peptide` | yes | Peptide sequence. May include `K.PEPTIDE.K` flanking residues; flankers are stripped. |
+| `Charge` | yes | Integer charge state. |
+| `FDR` / `EFDR` / `QValue` / `SpecQValue` | yes (any one) | Used to filter rows; default threshold is 0.01. |
+
+Extra columns are ignored. The `Peptide` field accepts standard MS-GF+
+modification syntax (e.g. `K.PEP+57.021M+15.995IDE.K`). The trainer
+matches each PSM's `(SpecID, Charge)` against the spectrum file and
+verifies that the peptide's theoretical mass is within 5 Da of the
+spectrum's precursor mass; mismatches are reported and abort the run
+(unless `-dropErrors 1` is supplied).
+
+## CLI reference
+
+```
+java -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen [options]
+
+Required:
+  -i  <tsv1[,tsv2,...]>  Training result TSV files (mzID input not supported in this build)
+  -d  <specDir>          Directory holding the spectrum files referenced by the TSVs
+  -m  <activation>       Activation method (CID, ETD, HCD, UVPD, etc.)
+  -inst <instrument>     Instrument type (LowRes, HighRes, QExactive, etc.)
+  -e  <enzyme>           Enzyme name (Tryp, Chymotryp, LysC, AspN, etc.)
+
+Optional:
+  -protocol <name>       Protocol (default: NoProtocol/automatic)
+  -thread <int>          Worker threads for parsing PSMs (default: 1)
+  -dropErrors 0|1        Drop datasets with errors instead of failing (default: 0)
+  -mgf 0|1               Also emit aggregated <dataType>.mgf (default: 0)
+```
+
+## Notes and limitations
+
+- **mzID input was removed**: the upstream MS-GF+ `ui/ScoringParamGen`
+  CLI accepted `.mzid` files via `MzIDParser`. That class was deleted
+  from this fork in commit `9bf01c8`. If your existing pipeline produces
+  mzID, pre-convert to TSV (e.g. with the upstream MS-GF+ JAR's
+  `MzIDToTsv`) before passing to the trainer.
+- **No `params/ParamManager`**: the upstream CLI used the (now-removed)
+  `ParamManager` framework. The recovered entry point parses arguments
+  manually; option semantics match the upstream CLI, but the help text
+  is shorter.
+- **Output goes to the current directory**: the `.param` file lands
+  wherever you launched the JVM. There is no `-o` option; `cd` into
+  the intended output directory before invoking, or copy the file
+  afterwards.
+- **Minimum training data**: the trainer applies an internal dedup
+  (keyed on `(peptide, charge)`, capped at 3 spectra per pair) before
+  the partition step. Empirically (see `TestScoringParamGenSmoke`), the
+  partition step refuses to fit with fewer than ~360 spectra surviving
+  dedup at a single charge; it silently emits an empty partition set
+  and the model write aborts. To stay above the floor, plan on **≥ 200
+  unique peptide identifications at the dominant charge state**, and
+  preferably ≥ 500 unique peptides for a model with usable rank
+  distributions.
+
+## See also
+
+- The recovered code lives at:
+  - `src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java`
+  - `src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java`
+  - `src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java`
+  - `src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java`
+  - `src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java`
+- The original upstream documentation page is still live at
+  <https://msgfplus.github.io/msgfplus/ScoringParamGen.html>; the
+  command-line semantics here match it apart from the mzID input
+  caveat above.
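
The column rules in the "TSV input format" section of `docs/training-scoring-models.md` can be illustrated with a small standalone reader. This is only a sketch under stated assumptions: `TrainingTsvSketch`, `findColumn`, `stripFlanks`, and `filter` are hypothetical names, not the fork's `AnnotatedSpectra` API, and the flank-stripping rule (a `.` in the second and second-to-last positions) is an assumed heuristic chosen so that decimal points inside modification masses are left alone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration of the documented TSV rules; not the fork's AnnotatedSpectra code.
class TrainingTsvSketch {

    // Any one of these columns (matched case-insensitively) can drive the filter.
    static final List<String> SCORE_COLUMNS = Arrays.asList("fdr", "efdr", "qvalue", "specqvalue");

    // Index of the first header cell whose lower-cased name is in `wanted`, or -1 if absent.
    static int findColumn(String[] header, List<String> wanted) {
        for (int i = 0; i < header.length; i++) {
            if (wanted.contains(header[i].trim().toLowerCase(Locale.ROOT))) {
                return i;
            }
        }
        return -1;
    }

    // Assumed heuristic: "K.PEPTIDE.R" -> "PEPTIDE"; strings without both flank dots pass through.
    static String stripFlanks(String peptide) {
        int n = peptide.length();
        if (n >= 5 && peptide.charAt(1) == '.' && peptide.charAt(n - 2) == '.') {
            return peptide.substring(2, n - 2);
        }
        return peptide;
    }

    // Returns {specFile, specId, peptide, charge} for rows scoring at or below the threshold.
    static List<String[]> filter(List<String> lines, double threshold) {
        String[] header = lines.get(0).split("\t", -1);
        int specFile = findColumn(header, Arrays.asList("#specfile"));
        int specId = findColumn(header, Arrays.asList("specid"));
        int peptide = findColumn(header, Arrays.asList("peptide"));
        int charge = findColumn(header, Arrays.asList("charge"));
        int score = findColumn(header, SCORE_COLUMNS);
        if (specFile < 0 || specId < 0 || peptide < 0 || charge < 0 || score < 0) {
            throw new IllegalArgumentException("Missing a required column in header: " + lines.get(0));
        }
        List<String[]> kept = new ArrayList<>();
        for (String line : lines.subList(1, lines.size())) {
            if (line.isEmpty()) {
                continue; // tolerate blank lines
            }
            String[] cell = line.split("\t", -1);
            if (Double.parseDouble(cell[score]) <= threshold) {
                kept.add(new String[] {cell[specFile], cell[specId], stripFlanks(cell[peptide]), cell[charge]});
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration; column order differs from the table on purpose.
        List<String> lines = Arrays.asList(
                "#SpecFile\tSpecID\tCharge\tPeptide\tQValue",
                "run1.mzML\tindex=10\t2\tK.PEP+57.021M+15.995IDE.K\t0.002",
                "run1.mzML\tindex=11\t3\tR.AAAAK.L\t0.25");
        for (String[] psm : filter(lines, 0.01)) {
            System.out.println(String.join("\t", psm));
        }
    }
}
```

With the sample rows in `main`, only the first PSM passes the 0.01 threshold. Columns are located by name rather than position, which is why extra or reordered columns do not matter.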
diff --git a/src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java b/src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java
@@ -0,0 +1,164 @@
+package edu.ucsd.msjava.misc;
+
+import edu.ucsd.msjava.msscorer.NewRankScorer;
+import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
+import edu.ucsd.msjava.msscorer.ScoringParameterGeneratorWithErrors;
+import edu.ucsd.msjava.msutil.ActivationMethod;
+import edu.ucsd.msjava.msutil.AminoAcidSet;
+import edu.ucsd.msjava.msutil.Enzyme;
+import edu.ucsd.msjava.msutil.InstrumentType;
+import edu.ucsd.msjava.msutil.Protocol;
+
+import java.io.BufferedInputStream;
+import java.io.File;
+import java.io.FileInputStream;
+import java.io.InputStream;
+import java.util.Calendar;
+
+/**
+ * Internal harness for batch-training MS-GF+ param files from a fixed
+ * directory layout. Restored verbatim from upstream; not intended as a
+ * customer-facing entry point.
+ */
+public class TrainScoringParameters {
+
+    private static final String PARAM_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/new";
+    private static final String BACKUP_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/backup";
+    private static final String SPEC_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/AnnotatedSpectra";
+
+    public static void main(String argv[]) throws Exception {
+//		backup();
+//		createParamFiles();
+        testParamFiles();
+    }
+
+    public static void backup() throws Exception {
+        File paramDir = new File(PARAM_DIR);
+        boolean paramExists = false;
+        for (File paramFile : paramDir.listFiles()) {
+            if (paramFile.getName().endsWith(".param"))
+                paramExists = true;
+        }
+        if (!paramExists) {
+            System.out.println("No param file to backup.");
+            return;
+        }
+        Calendar calendar = Calendar.getInstance();
+        String dateStr = (calendar.get(Calendar.MONTH) + 1) + "_" + calendar.get(Calendar.DAY_OF_MONTH) + "_" + calendar.get(Calendar.YEAR); // Calendar.MONTH is zero-based
+        String backupDirName = "ParamBackup_" + dateStr;
+        File backupDir = new File(BACKUP_DIR + "/" + backupDirName);
+        if (backupDir.exists()) {
+            System.out.println("Backup directory already exists: " + backupDir.getPath());
+            System.exit(-1);
+        }
+        backupDir.mkdirs();
+        System.out.println(backupDir.getPath() + " is created.");
+
+        boolean backupSuccess = true;
+        for (File paramFile : paramDir.listFiles()) {
+            if (paramFile.getName().endsWith(".param")) {
+                File newFile = new File(backupDir, paramFile.getName());
+                boolean isBackupSuccessful = paramFile.renameTo(newFile);
+                System.out.println("Moving " + paramFile.getPath() + " to " + newFile.getPath() + (isBackupSuccessful ? " succeeded." : " failed."));
+                if (!isBackupSuccessful) {
+                    backupSuccess = false;
+                    break;
+                }
+            }
+        }
+        if (backupSuccess)
+            System.out.println("Backup complete.");
+        else {
+            backupDir.delete();
+            System.out.println(backupDir.getPath() + " is deleted.");
+            System.out.println("Backup failed.");
+            System.exit(-1);
+        }
+    }
+
+    public static void createParamFiles() throws Exception {
+        File specDir = new File(SPEC_DIR);
+        if (!specDir.exists()) {
+            System.err.println("Training spectra directory doesn't exist: " + specDir.getPath());
+            System.exit(-1);
+        }
+
+        AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
+        for (File specFile : specDir.listFiles()) {
+            String specFileName = specFile.getName();
+            if (specFileName.endsWith(".mgf")) {
+                String id = specFileName.substring(0, specFileName.lastIndexOf('.'));
+                String[] token = id.split("_");
+                if (token.length != 3 && token.length != 4) {
+                    System.err.println("Wrong file name: " + specFile.getName());
+                    System.exit(-1);
+                }
+                String actMethodStr = token[0];
+                String instTypeStr = token[1];
+                String enzymeStr = token[2];
+                String protocolStr = null;
+                if (token.length == 4)
+                    protocolStr = token[3];
+
+                ActivationMethod actMethod = ActivationMethod.get(actMethodStr);
+                if (actMethod == null) {
+                    System.err.println("Unrecognized ActivationMethod: " + actMethodStr + "(" + specFileName + ")");
+                    System.exit(-1);
+                }
+                InstrumentType instType = InstrumentType.get(instTypeStr);
+                if (instType == null) {
+                    System.err.println("Unrecognized InstrumentType: " + instTypeStr + "(" + specFileName + ")");
+                    System.exit(-1);
+                }
+                Enzyme enzyme = Enzyme.getEnzymeByName(enzymeStr);
+                if (enzyme == null) {
+                    System.err.println("Unrecognized Enzyme: " + enzymeStr + "(" + specFileName + ")");
+                    System.exit(-1);
+                }
+
+                Protocol protocol = null;
+                if (protocolStr != null) {
+                    protocol = Protocol.get(protocolStr);
+                    if (protocol == null) {
+                        System.err.println("Unrecognized Protocol: " + protocolStr + "(" + specFileName + ")");
+                        System.exit(-1);
+                    }
+                } else
+                    protocol = Protocol.AUTOMATIC;
+
+                if (actMethod == null || instType == null || enzyme == null || protocol == null) {
+                    System.err.println("Wrong file name: " + specFile.getName());
+                    System.exit(-1);
+                }
+
+                SpecDataType dataType = new SpecDataType(actMethod, instType, enzyme, protocol);
+                System.out.println("Processing " + dataType.toString());
+                ScoringParameterGeneratorWithErrors.generateParameters(
+                        specFile,
+                        dataType,
+                        aaSet,
+                        new File(PARAM_DIR),
+                        false,
+                        false,
+                        false);
+            }
+        }
+        System.out.println("Successfully generated parameters!");
+    }
+
+    public static void testParamFiles() throws Exception {
+        for (File f : new File(PARAM_DIR).listFiles()) {
+            if (f.getName().endsWith(".param")) {
+                System.out.println("Reading " + f.getName());
+                NewRankScorer scorer;
+                try (InputStream is = new BufferedInputStream(new FileInputStream(f))) { scorer = new NewRankScorer(is); } // buffer once, close the stream when done
+                System.out.println(scorer.getSpecDataType());
+                if (!f.getName().substring(0, f.getName().lastIndexOf('.')).equals(scorer.getSpecDataType().toString())) {
+                    System.out.println(f.getName().substring(0, f.getName().lastIndexOf('.')) + " != " + scorer.getSpecDataType().toString());
+                    System.out.println("********* Mismatch **********");
+                }
+            }
+        }
+        System.out.println("Read Success");
+    }
+}
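
`createParamFiles` derives the data type from each training file's name, which must consist of three or four underscore-separated tokens (activation, instrument, enzyme, and an optional protocol). The naming rule on its own can be sketched without any MS-GF+ classes. `DataTypeNameSketch` and `parse` are hypothetical names used only for illustration, and the sketch deliberately throws an exception instead of calling `System.exit`, which is a design choice of the example rather than the harness's behaviour.

```java
// Hypothetical illustration of the file-naming convention used by the harness.
class DataTypeNameSketch {

    // Splits e.g. "HCD_HighRes_Tryp_TMT.mgf" into {activation, instrument, enzyme, protocol}.
    // The protocol slot is null when the name has only three tokens.
    static String[] parse(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String id = dot < 0 ? fileName : fileName.substring(0, dot);
        String[] token = id.split("_");
        if (token.length != 3 && token.length != 4) {
            throw new IllegalArgumentException("Wrong file name: " + fileName);
        }
        return new String[] {token[0], token[1], token[2], token.length == 4 ? token[3] : null};
    }

    public static void main(String[] args) {
        for (String name : new String[] {"HCD_HighRes_Tryp_TMT.mgf", "HCD_QExactive_Tryp.mgf"}) {
            String[] t = parse(name);
            System.out.println(name + " -> activation=" + t[0] + ", instrument=" + t[1]
                    + ", enzyme=" + t[2] + ", protocol=" + (t[3] == null ? "(none)" : t[3]));
        }
    }
}
```

Throwing `IllegalArgumentException` lets a caller decide whether one badly named file should stop a whole batch, whereas the harness exits the JVM on the first one.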