diff --git a/docs/training-scoring-models.md b/docs/training-scoring-models.md
new file mode 100644
index 00000000..cc12e827
--- /dev/null
+++ b/docs/training-scoring-models.md
@@ -0,0 +1,181 @@
+# Training MS-GF+ scoring models
+
+MS-GF+ ships with a set of pre-trained scoring models (`.param` files in
+`src/main/resources/`) that cover the common combinations of activation
+method, instrument type, enzyme, and protocol. The bundled set includes
+HCD/QExactive/Tryp, HCD/HighRes/Tryp/TMT, and so on. If your data does
+not match any of the bundled combinations, or if you want a model
+specifically tuned for your instrument, you can train your own.
+
+This page describes the recovered training entry point on this fork and
+the end-to-end workflow.
+
+## When to train a new model
+
+You need a custom model only when:
+
+- Your activation/instrument/enzyme/protocol combination has no bundled
+  `.param` file. Run a search with `-inst HighRes -m HCD -e Tryp` and
+  watch the startup log: if MS-GF+ falls back to a generic model, that's
+  the signal.
+- Your instrument's fragmentation pattern differs materially from the
+  bundled training data (e.g. a new-generation Astral run vs. a
+  2014-era Q Exactive).
+- You want to compare an in-house trained model against the bundled one
+  to quantify the gain.
+
+For most users on standard tryptic HCD runs, the bundled model is fine,
+and Phase B's calibrated precursor-window tightening (the
+`-precursorCal auto` flag) is the bigger lever.
+
+## What you need
+
+1. **Spectra**: one or more mzML or MGF files of MS/MS data from the
+   instrument and acquisition mode you want to train for.
+2. **A protein database**: a FASTA file covering the species in your
+   training data.
+3. **A modifications file**: standard MS-GF+ `Mods.txt` format.
+4. **A target/decoy MS-GF+ search of (1) against (2) at standard 1%
+   FDR**: this provides the annotated PSMs the trainer learns from.
+   The trainer needs **a few hundred** confidently identified PSMs to
+   produce a stable model; thousands are better.
+
+The mzID input that upstream MS-GF+ supported was removed from this
+fork. The trainer now accepts only TSV PSM lists. The standard MS-GF+
+TSV writer (`DirectTSVWriter`, the one you get with the default search
+output) produces TSVs that already match the trainer's expected format.
+
+## Workflow
+
+### Step 1 — Search your training data
+
+Run an MS-GF+ search the same way you would for any project, with
+`-tda 1` (target-decoy on) so the output has a `QValue` column the
+trainer can filter on. Use a precursor mass tolerance appropriate for
+the instrument; for high-resolution data, `-precursorCal auto` is
+recommended so Phase B's calibration tightens the window before the
+search.
+
+Example:
+
+```sh
+java -Xmx16G -jar MSGFPlus.jar \
+    -s training.mzML \
+    -d training.fasta \
+    -mod Mods.txt \
+    -t 10ppm \
+    -tda 1 \
+    -inst HighRes \
+    -m HCD \
+    -e Tryp \
+    -precursorCal auto \
+    -addFeatures 1 \
+    -o training.tsv
+```
+
+The output `training.tsv` is the input to the trainer.
+
+### Step 2 — Train the scoring model
+
+Invoke `ScoringParamGen` with the same activation method, instrument
+type, enzyme, and protocol you used in step 1. The trainer pulls the
+high-confidence PSMs (default `QValue ≤ 0.01`), looks up each PSM's
+spectrum in the directory you supply, and writes a `.param` file in
+the current working directory.
+
+```sh
+java -Xmx4G -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen \
+    -i training.tsv \
+    -d /path/to/spectra-directory \
+    -m HCD \
+    -inst HighRes \
+    -e Tryp \
+    -protocol Standard
+```
+
+The output filename is derived from the data type: e.g. for the
+arguments above, `HCD_HighRes_Tryp_Standard.param`.
+
+### Step 3 — Use the model
+
+Drop the `.param` file into `src/main/resources/` (and rebuild the JAR)
+or supply it via the appropriate parameter file.
+MS-GF+ will pick it up when the search's
+`(activationMethod, instrumentType, enzyme, protocol)` tuple matches
+the filename.
+
+## TSV input format
+
+`AnnotatedSpectra` (the trainer's TSV reader) requires these columns,
+identified by header name (case-insensitive):
+
+| Column | Required | Description |
+|---|---|---|
+| `#SpecFile` | yes | Spectrum file name (matched by basename against `-d`). |
+| `SpecID` | yes | Index or scan identifier; passed to `SpectraAccessor.getSpectrumById`. |
+| `Peptide` | yes | Peptide sequence. May include `K.PEPTIDE.K` flanking residues; flankers are stripped. |
+| `Charge` | yes | Integer charge state. |
+| `FDR` / `EFDR` / `QValue` / `SpecQValue` | yes (any one) | Used to filter rows; default threshold is 0.01. |
+
+Extra columns are ignored. The `Peptide` field accepts standard MS-GF+
+modification syntax (e.g. `K.PEP+57.021M+15.995IDE.K`). The trainer
+matches each PSM's `(SpecID, Charge)` against the spectrum file and
+verifies that the peptide's theoretical mass is within 5 Da of the
+spectrum's precursor mass; mismatches are reported and abort the run
+(unless `-dropErrors 1` is supplied).
+
+## CLI reference
+
+```
+java -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen [options]
+
+Required:
+  -i               Training result TSV files (mzID input not supported in this build)
+  -d               Directory holding the spectrum files referenced by the TSVs
+  -m               Activation method (CID, ETD, HCD, UVPD, etc.)
+  -inst            Instrument type (LowRes, HighRes, QExactive, etc.)
+  -e               Enzyme name (Tryp, Chymotryp, LysC, AspN, etc.)
+
+Optional:
+  -protocol        Protocol (default: NoProtocol/automatic)
+  -thread          Worker threads for parsing PSMs (default: 1)
+  -dropErrors 0|1  Drop datasets with errors instead of failing (default: 0)
+  -mgf 0|1         Also emit aggregated .mgf (default: 0)
+```
+
+## Notes and limitations
+
+- **mzID input was removed**: the upstream MS-GF+ `ui/ScoringParamGen`
+  CLI accepted `.mzid` files via `MzIDParser`.
+  That class was deleted from this fork in commit `9bf01c8`. If your
+  existing pipeline produces mzID, pre-convert to TSV (e.g. with the
+  upstream MS-GF+ JAR's `MzIDToTsv`) before passing to the trainer.
+- **No `params/ParamManager`**: the upstream CLI used the (now-removed)
+  `ParamManager` framework. The recovered entry point parses arguments
+  manually; option semantics match the upstream CLI, but the help text
+  is shorter.
+- **Output goes to the current directory**: the `.param` file lands
+  wherever you launched the JVM. There is no `-o` option; `cd` to the
+  intended output directory before invoking, or copy the file
+  afterwards.
+- **Minimum training data**: the trainer applies an internal dedup
+  (keyed on `(peptide, charge)`, capped at 3 spectra per pair) before
+  the partition step. Empirically (see `TestScoringParamGenSmoke`), the
+  partition step refuses to fit under ~360 dedup-surviving spectra at a
+  single charge, silently emitting an empty partition set and aborting
+  the model write. To stay above the floor, plan on **≥ 200 unique
+  peptide identifications in the dominant charge state**, and
+  preferably ≥ 500 unique peptides for a model with usable rank
+  distributions.
+
+## See also
+
+- The recovered code lives at:
+  - `src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java`
+  - `src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java`
+  - `src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java`
+  - `src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java`
+  - `src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java`
+- The original upstream documentation page is still live at
+  ; the command-line semantics here match it apart from the mzID input
+  caveat above.
diff --git a/src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java b/src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java new file mode 100644 index 00000000..466e3619 --- /dev/null +++ b/src/main/java/edu/ucsd/msjava/misc/TrainScoringParameters.java @@ -0,0 +1,164 @@ +package edu.ucsd.msjava.misc; + +import edu.ucsd.msjava.msscorer.NewRankScorer; +import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType; +import edu.ucsd.msjava.msscorer.ScoringParameterGeneratorWithErrors; +import edu.ucsd.msjava.msutil.ActivationMethod; +import edu.ucsd.msjava.msutil.AminoAcidSet; +import edu.ucsd.msjava.msutil.Enzyme; +import edu.ucsd.msjava.msutil.InstrumentType; +import edu.ucsd.msjava.msutil.Protocol; + +import java.io.BufferedInputStream; +import java.io.File; +import java.io.FileInputStream; +import java.io.InputStream; +import java.util.Calendar; + +/** + * Internal harness for batch-training MS-GF+ param files from a fixed + * directory layout. Restored verbatim from upstream; not intended as a + * customer-facing entry point. 
+ */ +public class TrainScoringParameters { + + private static final String PARAM_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/new"; + private static final String BACKUP_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/backup"; + private static final String SPEC_DIR = System.getProperty("user.home") + "/Research/Data/TrainingMSGFPlus/AnnotatedSpectra"; + + public static void main(String argv[]) throws Exception { +// backup(); +// createParamFiles(); + testParamFiles(); + } + + public static void backup() throws Exception { + File paramDir = new File(PARAM_DIR); + boolean paramExists = false; + for (File paramFile : paramDir.listFiles()) { + if (paramFile.getName().endsWith(".param")) + paramExists = true; + } + if (!paramExists) { + System.out.println("No param file to backup."); + return; + } + Calendar calendar = Calendar.getInstance(); + String dateStr = calendar.get(Calendar.MONTH) + "_" + calendar.get(Calendar.DAY_OF_MONTH) + "_" + calendar.get(Calendar.YEAR); + String backupDirName = "ParamBackup_" + dateStr; + File backupDir = new File(BACKUP_DIR + "/" + backupDirName); + if (backupDir.exists()) { + System.out.println("Backup directory already exists: " + backupDir.getPath()); + System.exit(-1); + } + backupDir.mkdirs(); + System.out.println(backupDir.getPath() + " is created."); + + boolean backupSuccess = true; + for (File paramFile : paramDir.listFiles()) { + if (paramFile.getName().endsWith(".param")) { + File newFile = new File(backupDir, paramFile.getName()); + boolean isBackupSuccessful = paramFile.renameTo(newFile); + System.out.println("Moving " + paramFile.getPath() + " to " + newFile.getPath() + (isBackupSuccessful ? " succeeded." 
: " failed.")); + if (!isBackupSuccessful) { + backupSuccess = false; + break; + } + } + } + if (backupSuccess) + System.out.println("Backup complete."); + else { + backupDir.delete(); + System.out.println(backupDir.getPath() + " is deleted."); + System.out.println("Backup failed."); + System.exit(0); + } + } + + public static void createParamFiles() throws Exception { + File specDir = new File(SPEC_DIR); + if (!specDir.exists()) { + System.err.println("Training spectra directory doesn't exist:" + specDir.getPath()); + System.exit(-1); + } + + AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys(); + for (File specFile : specDir.listFiles()) { + String specFileName = specFile.getName(); + if (specFileName.endsWith(".mgf")) { + String id = specFileName.substring(0, specFileName.lastIndexOf('.')); + String[] token = id.split("_"); + if (token.length != 3 && token.length != 4) { + System.err.println("Wrong file name: " + specFile.getName()); + System.exit(-1); + } + String actMethodStr = token[0]; + String instTypeStr = token[1]; + String enzymeStr = token[2]; + String protocolStr = null; + if (token.length == 4) + protocolStr = token[3]; + + ActivationMethod actMethod = ActivationMethod.get(actMethodStr); + if (actMethod == null) { + System.err.println("Unrecognized ActivationMethod: " + actMethodStr + "(" + specFileName + ")"); + System.exit(-1); + } + InstrumentType instType = InstrumentType.get(instTypeStr); + if (instType == null) { + System.err.println("Unrecognized InstrumentType: " + instTypeStr + "(" + specFileName + ")"); + System.exit(-1); + } + Enzyme enzyme = Enzyme.getEnzymeByName(enzymeStr); + if (enzyme == null) { + System.err.println("Unrecognized Enzyme: " + enzymeStr + "(" + specFileName + ")"); + System.exit(-1); + } + + Protocol protocol = null; + if (protocolStr != null) { + protocol = Protocol.get(protocolStr); + if (protocol == null) { + System.err.println("Unrecognized Protocol: " + protocolStr + "(" + 
specFileName + ")"); + System.exit(-1); + } + } else + protocol = Protocol.AUTOMATIC; + + if (actMethod == null || instType == null || enzyme == null || protocol == null) { + System.err.println("Wrong file name: " + specFile.getName()); + System.exit(-1); + } + + SpecDataType dataType = new SpecDataType(actMethod, instType, enzyme, protocol); + System.out.println("Processing " + dataType.toString()); + ScoringParameterGeneratorWithErrors.generateParameters( + specFile, + dataType, + aaSet, + new File(PARAM_DIR), + false, + false, + false); + } + } + System.out.println("Successfully generated parameters!"); + } + + public static void testParamFiles() throws Exception { + for (File f : new File(PARAM_DIR).listFiles()) { + if (f.getName().endsWith(".param")) { + System.out.println("Reading " + f.getName()); + InputStream is = new BufferedInputStream(new FileInputStream(f)); + NewRankScorer scorer = new NewRankScorer(new BufferedInputStream(is)); + System.out.println(scorer.getSpecDataType()); + if (!f.getName().substring(0, f.getName().lastIndexOf('.')).equals(scorer.getSpecDataType().toString())) { + System.out.println(f.getName().substring(0, f.getName().lastIndexOf('.')) + " != " + scorer.getSpecDataType().toString()); + System.out.println("********* Mismatch **********"); + } + } + } + System.out.println("Read Success"); + } +} diff --git a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java new file mode 100644 index 00000000..9cef6e1a --- /dev/null +++ b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java @@ -0,0 +1,718 @@ +package edu.ucsd.msjava.msscorer; + +import edu.ucsd.msjava.mgf.MgfSpectrumParser; +import edu.ucsd.msjava.msgf.Histogram; +import edu.ucsd.msjava.msgf.NominalMass; +import edu.ucsd.msjava.msgf.Tolerance; +import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType; +import edu.ucsd.msjava.msutil.*; + +import java.io.File; +import 
java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.Hashtable; +import java.util.TreeSet; + +/** + * This only supports low accuracy fragment ions. + * + * @author sangtaekim + */ +public class ScoringParameterGenerator extends NewRankScorer { + private static final float MIN_OFFSET_MASS = -120; // for ion types + private static final float MAX_OFFSET_MASS = 38; + private static final float MIN_PRECURSOR_OFFSET = -300; // for precursors + private static final float MAX_PRECURSOR_OFFSET = 30; + private static final int MIN_NUM_SPECTRA_PER_PARTITION = 400; // 400 + private static final int MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF = 150; + + private static final float MIN_PRECURSOR_OFFSET_PROBABILITY = 0.15f; // 0.15 + private static final float MIN_ION_OFFSET_PROBABILITY = 0.15f; // 0.15, for ion types + private static final int MAX_RANK = 150; + private static final int NUM_SEGMENTS_PER_SPECTRUM = 2; // 2 + + + private static final int[] smoothingRanks = {3, 5, 10, 20, 50, Integer.MAX_VALUE}; //Ranks around which smoothing occurs + private static final int[] smoothingWindowSize = {0, 1, 2, 3, 4, 5}; //Smoothing windows for each smoothing rank + + private static final int NUM_NOISE_IONS = 10; + protected static final int MAX_CHARGE = 20; + + public static void main(String argv[]) { + File specFile = null; + File outputFile = null; + boolean isText = false; + AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys(); + int numSpecsPerPeptide = 1; + int errorScalingFactor = 10; + + // Fragmentation method + ActivationMethod activationMethod = null; + InstrumentType instType = null; + Enzyme enzyme = null; + + for (int i = 0; i < argv.length; i += 2) { + if (!argv[i].startsWith("-") || i + 1 >= argv.length) + printUsageAndExit("Invalid parameter!"); + if (argv[i].equalsIgnoreCase("-i")) { + specFile = new File(argv[i + 1]); + if (!specFile.exists()) { + 
printUsageAndExit(argv[i + 1] + " doesn't exist."); + } + int posDot = specFile.getName().lastIndexOf('.'); + if (posDot >= 0) { + String extension = specFile.getName().substring(posDot); + if (!extension.equalsIgnoreCase(".mgf")) + printUsageAndExit("Invalid spectrum format: " + argv[i + 1]); + } else + printUsageAndExit("Invalid spectrum format: " + argv[i + 1]); + } else if (argv[i].equalsIgnoreCase("-o")) { + outputFile = new File(argv[i + 1]); + } else if (argv[i].equalsIgnoreCase("-t")) { + outputFile = new File(argv[i + 1]); + isText = true; + } else if (argv[i].equalsIgnoreCase("-fixMod")) { + if (argv[i + 1].equalsIgnoreCase("0")) + aaSet = AminoAcidSet.getStandardAminoAcidSet(); + else if (argv[i + 1].equalsIgnoreCase("1")) + aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys(); + else if (argv[i + 1].equalsIgnoreCase("2")) + aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarboxymethylatedCys(); + else + printUsageAndExit("Invalid -fixMod parameter: " + argv[i + 1]); + } else if (argv[i].equalsIgnoreCase("-pep")) { + numSpecsPerPeptide = Integer.parseInt(argv[i + 1]); + } else if (argv[i].equalsIgnoreCase("-err")) { + errorScalingFactor = Integer.parseInt(argv[i + 1]); + } else if (argv[i].equalsIgnoreCase("-m")) { + if (argv[i + 1].equalsIgnoreCase("1")) { + activationMethod = ActivationMethod.CID; + } else if (argv[i + 1].equalsIgnoreCase("2")) { + activationMethod = ActivationMethod.ETD; + } else if (argv[i + 1].equalsIgnoreCase("3")) { + activationMethod = ActivationMethod.HCD; + } else if (argv[i + 1].equalsIgnoreCase("4")) { + activationMethod = ActivationMethod.UVPD; + } else { + printUsageAndExit("Invalid activation method: " + argv[i + 1]); + } + } else if (argv[i].equalsIgnoreCase("-inst")) { + if (argv[i + 1].equalsIgnoreCase("0")) { + instType = InstrumentType.LOW_RESOLUTION_LTQ; + } else if (argv[i + 1].equalsIgnoreCase("1")) { + instType = InstrumentType.TOF; + } else if (argv[i + 1].equalsIgnoreCase("2")) { + 
instType = InstrumentType.HIGH_RESOLUTION_LTQ; + } else { + printUsageAndExit("Invalid instrument type: " + argv[i + 1]); + } + } else if (argv[i].equalsIgnoreCase("-e")) { + if (argv[i + 1].equalsIgnoreCase("0")) + enzyme = null; + else if (argv[i + 1].equalsIgnoreCase("1")) + enzyme = Enzyme.TRYPSIN; + else if (argv[i + 1].equalsIgnoreCase("2")) + enzyme = Enzyme.CHYMOTRYPSIN; + else if (argv[i + 1].equalsIgnoreCase("3")) + enzyme = Enzyme.LysC; + else if (argv[i + 1].equalsIgnoreCase("4")) + enzyme = Enzyme.LysN; + else if (argv[i + 1].equalsIgnoreCase("5")) + enzyme = Enzyme.GluC; + else if (argv[i + 1].equalsIgnoreCase("6")) + enzyme = Enzyme.ArgC; + else if (argv[i + 1].equalsIgnoreCase("7")) + enzyme = Enzyme.AspN; + else + printUsageAndExit("Invalid enzyme: " + argv[i + 1]); + } else + printUsageAndExit("Invalid parameters!"); + } + if (specFile == null) + printUsageAndExit("missing annotatedMgfFileName!"); + if (outputFile == null) + printUsageAndExit("missing outputFileName!"); + if (activationMethod == null) + printUsageAndExit("missing activationMethod!"); + if (instType == null) + printUsageAndExit("missing instrumentType!"); + + generateParameters(specFile, activationMethod, instType, enzyme, Protocol.AUTOMATIC, numSpecsPerPeptide, errorScalingFactor, outputFile, aaSet, isText, false); + } + + public static void printUsageAndExit(String message) { + System.err.println(message); + System.out.println("usage: java -Xmx2000M -cp MSGF.jar edu.ucsd.msjava.msscorer.ScoringParameterGenerator\n" + + "\t-i annotatedMgfFileName (*.mgf)\n" + + "\t-o outputFileName (e.g. 
CID_Tryp.param)\n" + + "\t-m FragmentationMethodID (1: CID, 2: ETD, 3: HCD, 4: UVPD)\n" + + "\t-inst InstrumentID (0: Low-res LCQ/LTQ, 1: TOF , 2: High-res LTQ)\n" + + "\t-e EnzymeID (0: No enzyme, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N)\n" + + "\t[-fixMod 0/1/2] (0: NoCysteineProtection, 1: CarbamidomethyC (default), 2: CarboxymethylC)\n" + + "\t[-pep numPeptidesPerSpec] (default: 1)\n" + + "\t[-err errorScalingFactor] (default: 10)" + ); + System.exit(0); + } + + public static void generateParameters( + File specFile, + ActivationMethod activationMethod, + InstrumentType instType, + Enzyme enzyme, + Protocol protocol, + int numSpecsPerPeptide, + int errorScalingFactor, + File outputFile, + AminoAcidSet aaSet, + boolean isText, + boolean verbose) { + SpectraContainer container = new SpectraContainer(specFile.getPath(), new MgfSpectrumParser().aaSet(aaSet)); + + // multiple spectra with the same peptide -> one spec per peptide + HashMap> pepSpecMap = new HashMap>(); + SpectraContainer specContOnePerPep = new SpectraContainer(); + for (Spectrum spec : container) { + String pep = spec.getAnnotationStr() + ":" + spec.getCharge(); + if (pep != null && pep.length() > 0) { + ArrayList specList = pepSpecMap.get(pep); + if (specList == null) { + specList = new ArrayList(); + pepSpecMap.put(pep, specList); + } + if (specList.size() < numSpecsPerPeptide) + specList.add(spec); + } + } + for (ArrayList specList : pepSpecMap.values()) + for (Spectrum spec : specList) + specContOnePerPep.add(spec); + + SpecDataType dataType = new SpecDataType(activationMethod, instType, enzyme, protocol); + ScoringParameterGenerator gen = new ScoringParameterGenerator(specContOnePerPep, dataType); + + gen.tolerance(new Tolerance(1 / Constants.INTEGER_MASS_SCALER / 2)); + + gen.partition(NUM_SEGMENTS_PER_SPECTRUM); + if (verbose) + System.out.println("Partition: " + gen.partitionSet.size()); + + gen.precursorOFF(MIN_PRECURSOR_OFFSET_PROBABILITY); + 
if (verbose) + System.out.println("PrecursorOFF Done."); + + gen.filterPrecursorPeaks(); + if (verbose) + System.out.println("Filtering Done."); + + gen.selectIonTypes(MIN_ION_OFFSET_PROBABILITY); + if (verbose) + System.out.println("Ion types selected."); + + gen.generateRankDist(MAX_RANK); + if (verbose) + System.out.println("Rank distribution computed."); + + gen.smoothing(); + if (verbose) + System.out.println("Smoothing complete."); + + if (!isText) + gen.writeParameters(outputFile); + else + gen.writeParametersPlainText(outputFile); + + if (verbose) + System.out.println("Writing Done."); + } + + private SpectraContainer specContainer; + + public ScoringParameterGenerator(SpectraContainer specContainer, SpecDataType dataType) { + this.specContainer = specContainer; + super.dataType = dataType; + } + + public void partition(int numSegments) { + super.numSegments = numSegments; + chargeHist = new Histogram(); + partitionSet = new TreeSet(); + + Hashtable> parentMassMap = new Hashtable>(); + for (Spectrum spec : specContainer) { + int charge = spec.getCharge(); + if (charge <= 0) + continue; + chargeHist.add(charge); + if (spec.getAnnotation() != null) { + ArrayList precursorList = parentMassMap.get(charge); + if (precursorList == null) { + precursorList = new ArrayList(); + parentMassMap.put(charge, precursorList); + } + precursorList.add(spec.getPrecursorMass()); + } + } + + for (int c = chargeHist.minKey(); c <= chargeHist.maxKey(); c++) { + ArrayList parentMassList = parentMassMap.get(c); + if (parentMassList == null) + continue; + + int numSpec = parentMassList.size(); + if (numSpec < Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f)) + continue; + + Collections.sort(parentMassList); + int bestSetSize = 0; + int smallestRemainder = MIN_NUM_SPECTRA_PER_PARTITION; + for (int i = Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f); i <= Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 1.1f); i++) { + int remainder = numSpec % i; + if (i - remainder < remainder) + 
remainder = i - remainder; + if (remainder < smallestRemainder || (remainder == smallestRemainder && Math.abs(MIN_NUM_SPECTRA_PER_PARTITION - i) < Math.abs(MIN_NUM_SPECTRA_PER_PARTITION - bestSetSize))) { + bestSetSize = i; + smallestRemainder = remainder; + } + } + int num = 0; + for (int i = 0; i == 0 || i < Math.round(numSpec / (float) bestSetSize); i++) { + if (num != 0) { + for (int seg = 0; seg < numSegments; seg++) + partitionSet.add(new Partition(c, parentMassList.get(num), seg)); + } else { + for (int seg = 0; seg < numSegments; seg++) + partitionSet.add(new Partition(c, 0f, seg)); + } + num += bestSetSize; + } + } + } + + private void precursorOFF(float minProbThreshold) { + if (chargeHist == null) { + assert (false) : "partition() must have been called before"; + return; + } + precursorOFFMap = new java.util.TreeMap>(); + numPrecurOFF = 0; + + for (int charge = chargeHist.minKey(); charge <= chargeHist.maxKey(); charge++) { + if (chargeHist.get(charge) < MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF) + continue; + ArrayList precursorOffsetList = new ArrayList(); + int numSpecs = 0; + Hashtable> histList = new Hashtable>(); + for (int c = charge; c >= 2; c--) + histList.put(c, new Histogram()); + + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + numSpecs++; + spec = filter.apply(spec); + float precursorNeutralMass = spec.getPrecursorMass(); + for (int c = charge; c >= 2; c--) { + float precursorMz = (precursorNeutralMass + c * (float) Composition.ChargeCarrierMass()) / c; + ArrayList peakList = spec.getPeakListByMassRange( + precursorMz + MIN_PRECURSOR_OFFSET / (float) c - mme.getToleranceAsDa(precursorMz + MIN_PRECURSOR_OFFSET / (float) c) / 2, + precursorMz + MAX_PRECURSOR_OFFSET / (float) c + mme.getToleranceAsDa(precursorMz + MAX_PRECURSOR_OFFSET / (float) c) / 2); + + int prevMassIndexDiff = Integer.MIN_VALUE; + for (Peak p : peakList) { + float peakMass = p.getMz(); + int 
massIndexDiff = NominalMass.toNominalMass(peakMass - precursorMz); + if (massIndexDiff > prevMassIndexDiff) { + histList.get(c).add(massIndexDiff); + prevMassIndexDiff = massIndexDiff; + } + } + } + } + + for (int c = charge; c >= 2; c--) { + ArrayList keyList = new ArrayList(histList.get(c).keySet()); + Collections.sort(keyList); + for (Integer key : keyList) { + float prob = (histList.get(c).get(key)) / (float) numSpecs; + if (prob > minProbThreshold) { + precursorOffsetList.add(new PrecursorOffsetFrequency((charge - c), NominalMass.getMassFromNominalMass(key), prob)); + } + } + } + precursorOFFMap.put(charge, precursorOffsetList); + numPrecurOFF += precursorOffsetList.size(); + } + } + + private void filterPrecursorPeaks() { + if (this.precursorOFFMap == null) + return; + for (Spectrum spec : specContainer) { + for (PrecursorOffsetFrequency off : this.getPrecursorOFF(spec.getCharge())) + spec.filterPrecursorPeaks(mme, off.getReducedCharge(), off.getOffset()); + } + } + + private Pair getPrecursorMassRange(Partition partition) { + float minParentMass = partition.getParentMass(); + float maxParentMass = Float.MAX_VALUE; + Partition higherPartition = partitionSet.higher(partition); + if (higherPartition != null) { + if (higherPartition.getCharge() == partition.getCharge() && higherPartition.getSegNum() == partition.getSegNum()) { + maxParentMass = higherPartition.getParentMass(); + } + } + return new Pair(minParentMass, maxParentMass); + } + + private void selectIonTypes(float minProbThreshold) { + if (partitionSet == null) { + assert (false) : "partition() must have been called before!"; + return; + } + + fragOFFTable = new HashMap>(); + insignificantFragOFFTable = new HashMap>(); + + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + Pair parentMassRange = getPrecursorMassRange(partition); + int seg = partition.getSegNum(); + + int numSpec = 0; + Hashtable> prefixIonFreq = new Hashtable>(); + Hashtable> suffixIonFreq = new 
Hashtable>(); + for (int c = 1; c <= charge; c++) { + prefixIonFreq.put(c, new Histogram()); + suffixIonFreq.put(c, new Histogram()); + } + + int numCleavages = 0; + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + Peptide annotation = spec.getAnnotation(); + numCleavages += annotation.size() - 1; + numSpec++; + spec = filter.apply(spec); + + for (int c = 1; c <= charge; c++) { + for (int direction = 0; direction < 2; direction++) { + double accurateMass = 0; + Hashtable> ionFreq = null; + for (int i = 0; i < annotation.size() - 1; i++) { + if (direction == 0) { + accurateMass += annotation.get(i).getAccurateMass(); + ionFreq = prefixIonFreq; + } else if (direction == 1) { + accurateMass += annotation.get(annotation.size() - 1 - i).getAccurateMass(); + ionFreq = suffixIonFreq; + } + float mass = (float) (accurateMass / c); + ArrayList peakList = spec.getPeakListByMassRange( + mass + MIN_OFFSET_MASS / (float) c - mme.getToleranceAsDa(mass), + mass + MAX_OFFSET_MASS / (float) c + mme.getToleranceAsDa(mass)); + int prevIntOffset = Integer.MIN_VALUE; + for (Peak p : peakList) { + float peakMz = p.getMz(); + int segNum = getSegmentNum(peakMz, curParentMass); + if (segNum != seg) + continue; + float offset = peakMz - mass; + int intOffset = NominalMass.toNominalMass(offset); + if (intOffset > prevIntOffset) { + ionFreq.get(c).add(intOffset); + prevIntOffset = intOffset; + } + } + } + } + } + } + + float maxProb = 0; + int maxCharge = 0; + int maxDirection = 0; + float maxOffset = 0; + + ArrayList fragmentOffsetFrequencyList = new ArrayList(); + ArrayList insignificantFragmentOffsetFrequencyList = new ArrayList(); + for (int c = 1; c <= charge; c++) { + for (int direction = 0; direction < 2; direction++) { + ArrayList keyList; + 
if (direction == 0) + keyList = new ArrayList(prefixIonFreq.get(c).keySet()); + else + keyList = new ArrayList(suffixIonFreq.get(c).keySet()); + + Collections.sort(keyList); + for (Integer key : keyList) { + float offset = NominalMass.getMassFromNominalMass(key); + int freq; + if (direction == 0) + freq = prefixIonFreq.get(c).get(key); + else + freq = suffixIonFreq.get(c).get(key); + float prob = freq / (float) numCleavages * numSegments; + if (prob > maxProb) { + maxProb = prob; + maxCharge = c; + maxDirection = direction; + maxOffset = offset; + } + if (prob > minProbThreshold) { + if (direction == 0) + fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(c, offset), prob)); + else + fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(c, offset), prob)); + } else { + if (direction == 0) + insignificantFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(c, offset), prob)); + else + insignificantFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(c, offset), prob)); + } + } + } + } + + if (fragmentOffsetFrequencyList.size() == 0) { + if (maxDirection == 0) + fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(maxCharge, maxOffset), maxProb)); + else + fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(maxCharge, maxOffset), maxProb)); + } + + Collections.sort(insignificantFragmentOffsetFrequencyList); + ArrayList noiseOffsetFrequencyList = new ArrayList(NUM_NOISE_IONS); + + int numNoise = 0; + for (FragmentOffsetFrequency off : insignificantFragmentOffsetFrequencyList) { + if (off.getIonType().getCharge() == 1) + noiseOffsetFrequencyList.add(off); + if (++numNoise >= NUM_NOISE_IONS) + break; + } + Collections.sort(fragmentOffsetFrequencyList, Collections.reverseOrder()); + fragOFFTable.put(partition, fragmentOffsetFrequencyList); + insignificantFragOFFTable.put(partition, 
noiseOffsetFrequencyList); + } + } + + private void generateRankDist(int maxRank) { + if (partitionSet == null) { + assert (false) : "partition() must have been called!"; + return; + } + + rankDistTable = new HashMap>(); + this.maxRank = maxRank; + + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + IonType[] ionTypes = getIonTypes(partition); + if (ionTypes == null || ionTypes.length == 0) + continue; + Pair parentMassRange = getPrecursorMassRange(partition); + int seg = partition.getSegNum(); + + int numSpec = 0; + Hashtable> rankDist = new Hashtable>(); + Hashtable rankDistMaxRank = new Hashtable(); + Hashtable rankDistUnexplained = new Hashtable(); + + for (IonType ion : ionTypes) { + rankDist.put(ion, new Histogram()); + rankDistMaxRank.put(ion, 0f); + rankDistUnexplained.put(ion, 0f); + } + rankDist.put(IonType.NOISE, new Histogram()); + + float[] noiseDist = new float[maxRank + 2]; + int numMaxRankPeaks = 0; + int totalCleavageSites = 0; + + for (Spectrum spec : specContainer) { + int numExplainedPeaks = 0; + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + Peptide annotation = spec.getAnnotation(); + spec.setRanksOfPeaks(); + numSpec++; + numMaxRankPeaks += spec.size() - maxRank + 1; + totalCleavageSites += annotation.size() - 1; + int prmMassIndex = 0; + int srmMassIndex = 0; + + HashSet explainedPeakSet = new HashSet(); + Hashtable numExplainedMaxRankPeaks = new Hashtable(); + for (IonType ion : ionTypes) { + numExplainedMaxRankPeaks.put(ion, 0); + } + + int numSignalBinsAtThisSegment = 0; + for (int i = 0; i < annotation.size() - 1; i++) { + prmMassIndex += NominalMass.toNominalMass(annotation.get(i).getMass()); + srmMassIndex += NominalMass.toNominalMass(annotation.get(annotation.size() - 1 - i).getMass()); + + float 
prm = NominalMass.getMassFromNominalMass(prmMassIndex); + float srm = NominalMass.getMassFromNominalMass(srmMassIndex); + for (IonType ion : ionTypes) { + float theoMass; + if (ion instanceof IonType.PrefixIon) + theoMass = ion.getMz(prm); + else + theoMass = ion.getMz(srm); + + int segNum = super.getSegmentNum(theoMass, curParentMass); + if (segNum == seg) { + numSignalBinsAtThisSegment++; + Peak p = spec.getPeakByMass(theoMass, mme); + if (p != null) { + numExplainedPeaks++; + int rank = p.getRank(); + if (rank >= maxRank) { + rank = maxRank; + numExplainedMaxRankPeaks.put(ion, numExplainedMaxRankPeaks.get(ion) + 1); + } + explainedPeakSet.add(p); + rankDist.get(ion).add(rank); + } else { + rankDist.get(ion).add(maxRank + 1); + } + } + } + } + + ArrayList unexplainedPeaksAtThisSegment = new ArrayList(); + int numPeaksAtThisSegment = 0; + int numMaxRankPeaksAtThisSegment = 0; + for (Peak p : spec) { + if (super.getSegmentNum(p.getMz(), curParentMass) == seg) { + numPeaksAtThisSegment++; + if (p.getRank() >= maxRank) + numMaxRankPeaksAtThisSegment++; + if (!explainedPeakSet.contains(p)) + unexplainedPeaksAtThisSegment.add(p); + } + } + + float midMassThisSegment = (1f / numSegments * seg + 1f / numSegments / 2) * annotation.getParentMass(); + float numBinsAtThisSegment = annotation.getParentMass() / numSegments / mme.getToleranceAsDa(midMassThisSegment) / 2; + + for (Peak p : unexplainedPeaksAtThisSegment) { + int rank = p.getRank(); + float noiseFreq = (annotation.size() - 1) / numSegments / numBinsAtThisSegment; + if (rank >= maxRank) + noiseDist[maxRank] += noiseFreq / numMaxRankPeaksAtThisSegment; + else + noiseDist[rank] += noiseFreq; + } + + for (IonType ion : ionTypes) { + if (numMaxRankPeaksAtThisSegment > 0) { + Float prevSumFreq = rankDistMaxRank.get(ion); + float curFreq = numExplainedMaxRankPeaks.get(ion) / (float) numMaxRankPeaksAtThisSegment; + rankDistMaxRank.put(ion, prevSumFreq + curFreq); + } + } + + noiseDist[maxRank + 1] += (numBinsAtThisSegment 
- numPeaksAtThisSegment) * (annotation.size() - 1) / numSegments / numBinsAtThisSegment; + } + + HashMap freqDist = new HashMap(); + for (IonType ion : ionTypes) { + Float[] dist = new Float[maxRank + 1]; + Histogram hist = rankDist.get(ion); + for (int i = 1; i <= maxRank - 1; i++) { + Integer num = hist.get(i); + dist[i - 1] = (num / (float) numSpec); + } + dist[maxRank - 1] = rankDistMaxRank.get(ion) / numSpec; + dist[maxRank] = hist.get(maxRank + 1) / (float) numSpec; + freqDist.put(ion, dist); + } + + // noise + Float[] dist = new Float[maxRank + 1]; + for (int i = 1; i <= maxRank + 1; i++) + dist[i - 1] = noiseDist[i] / numSpec; + freqDist.put(IonType.NOISE, dist); + + rankDistTable.put(partition, freqDist); + } + } + + protected void smoothing() { + smoothingRankDistTable(); + } + + protected void smoothingRankDistTable() { + if (rankDistTable == null) + return; + assert (smoothingRanks.length == smoothingWindowSize.length); + for (Partition partition : rankDistTable.keySet()) { + HashMap table = this.rankDistTable.get(partition); + for (IonType ion : table.keySet()) { + Float[] freq = table.get(ion); + Float[] smoothedFreq = new Float[freq.length]; + int smoothingIndex = 0; + for (int i = 0; i < freq.length - 2; i++) { + if (smoothingIndex < smoothingRanks.length - 1 && + i == smoothingRanks[smoothingIndex]) + smoothingIndex++; + int windowSize = smoothingWindowSize[smoothingIndex]; + float sumFrequencies = 0; + int numIndicesSummed = 0; + for (int d = -windowSize; d <= windowSize; d++) { + int index = i + d; + if (index < 0 || index > freq.length - 3) + continue; + sumFrequencies += freq[index]; + numIndicesSummed++; + } + while (sumFrequencies == 0 && windowSize < freq.length - 4) { + windowSize++; + int index = i - windowSize; + if (index >= 0) { + sumFrequencies += freq[index]; + numIndicesSummed++; + } + index = i + windowSize; + if (index <= freq.length - 3) { + sumFrequencies += freq[index]; + numIndicesSummed++; + } + } + if (sumFrequencies != 0) + 
smoothedFreq[i] = sumFrequencies / numIndicesSummed; + else + assert (false); + } + for (int i = 0; i < freq.length - 2; i++) + freq[i] = smoothedFreq[i]; + if (freq[freq.length - 1] == 0) + freq[freq.length - 1] = Float.MIN_VALUE; + if (freq[freq.length - 2] == 0) + freq[freq.length - 2] = freq[freq.length - 3]; + } + } + } +} diff --git a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java new file mode 100644 index 00000000..3c1c6e6e --- /dev/null +++ b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java @@ -0,0 +1,841 @@ +package edu.ucsd.msjava.msscorer; + +import edu.ucsd.msjava.mgf.MgfSpectrumParser; +import edu.ucsd.msjava.msgf.Histogram; +import edu.ucsd.msjava.msgf.IntHistogram; +import edu.ucsd.msjava.msgf.NominalMass; +import edu.ucsd.msjava.msgf.Tolerance; +import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType; +import edu.ucsd.msjava.msutil.*; +import edu.ucsd.msjava.msutil.IonType.PrefixIon; + +import java.io.File; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.HashSet; +import java.util.Hashtable; +import java.util.TreeMap; +import java.util.TreeSet; + +/** + * This only supports low accuracy fragment ions. 
+ * + * @author sangtaekim + */ +public class ScoringParameterGeneratorWithErrors extends NewRankScorer { + private static final float MIN_PRECURSOR_OFFSET = -300; // for precursors + private static final float MAX_PRECURSOR_OFFSET = 30; + private static final int MIN_NUM_SPECTRA_PER_PARTITION = 400; + private static final int MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF = 150; + private static final int MAX_NUM_PARTITIONS_PER_CHARGE = 30; + + private static final float MIN_PRECURSOR_OFFSET_PROBABILITY = 0.15f; + private static final float MIN_ION_OFFSET_PROBABILITY = 0.15f; + private static final float MIN_MAIN_ION_OFFSET_PROBABILITY = 0.01f; + + private static final int MAX_RANK = 150; + private static final int NUM_SEGMENTS_PER_SPECTRUM = 2; + + private static final int[] smoothingRanks = {3, 5, 10, 20, 50, Integer.MAX_VALUE}; + private static final int[] smoothingWindowSize = {0, 1, 2, 3, 4, 5}; + + private static final float DECONVOLUTION_MASS_TOLERANCE = 0.02f; + protected static final int MAX_CHARGE = 20; + + public static void generateParameters( + File specFile, + SpecDataType dataType, + AminoAcidSet aaSet, + File outputDir, + boolean isText, + boolean verbose, + boolean singlePartition + ) { + SpectraContainer container = new SpectraContainer(specFile.getPath(), new MgfSpectrumParser().aaSet(aaSet)); + generateParameters(container, dataType, aaSet, outputDir, isText, verbose, singlePartition); + } + + public static void generateParameters( + SpectraContainer container, + SpecDataType dataType, + AminoAcidSet aaSet, + File outputDir, + boolean isText, + boolean verbose) { + generateParameters(container, dataType, aaSet, outputDir, isText, verbose, false); + } + + public static void generateParameters( + SpectraContainer container, + SpecDataType dataType, + AminoAcidSet aaSet, + File outputDir, + boolean isText, + boolean verbose, + boolean singlePartition) { + if (verbose) + System.out.println("Number of annotated PSMs: " + container.size()); + + String 
paramFileName = dataType.toString() + ".param"; + + File outputFile; + if (outputDir != null) + outputFile = new File(outputDir, paramFileName); + else + outputFile = new File(paramFileName); + + if (verbose) + System.out.println("Output file name: " + outputFile.getAbsolutePath()); + int errorScalingFactor = 0; + boolean applyDeconvolution = false; + + if (dataType.getInstrumentType() == InstrumentType.HIGH_RESOLUTION_LTQ + || dataType.getInstrumentType() == InstrumentType.TOF + || dataType.getInstrumentType().isHighResolution()) { + errorScalingFactor = 100; + applyDeconvolution = true; + if (verbose) + System.out.println("High-precision MS/MS data: " + + "errorScalingFactor(" + errorScalingFactor + ") " + + "chargeDeconvolution(" + applyDeconvolution + ")"); + } + + boolean considerPhosLoss = false; + if (dataType.getProtocol().getName().equals("Phosphorylation")) { + considerPhosLoss = true; + if (verbose) + System.out.println("Consider H3PO4 loss."); + } + + boolean consideriTRAQLoss = false; + if (dataType.getProtocol().getName().equals("iTRAQ")) { + consideriTRAQLoss = true; + if (verbose) + System.out.println("Consider iTRAQ loss."); + } + + boolean considerTMTLoss = false; + if (dataType.getProtocol().getName().equals("TMT")) { + considerTMTLoss = true; + if (verbose) + System.out.println("Consider TMT loss."); + } + + if (dataType.getProtocol().getName().equals("iTRAQPhospho")) { + considerPhosLoss = true; + consideriTRAQLoss = true; + if (verbose) + System.out.println("Consider iTRAQ and H3PO4 loss."); + } + + HashSet<String> pepSet = new HashSet<String>(); + for (Spectrum spec : container) + pepSet.add(spec.getAnnotationStr()); + + if (verbose) + System.out.println("Number of unique peptides: " + pepSet.size()); + int numSpecsPerPeptide; + if (pepSet.size() < 2000) { + numSpecsPerPeptide = 3; + } else { + numSpecsPerPeptide = 1; + } + if (verbose) + System.out.println("Consider " + numSpecsPerPeptide + " spectra per peptide."); + + HashMap<String, ArrayList<Spectrum>> pepSpecMap = new HashMap<String, ArrayList<Spectrum>>(); + for 
(Spectrum spec : container) { + if (spec.getAnnotationStr() == null) + continue; + String pep = spec.getAnnotationStr() + ":" + spec.getCharge(); + if (pep != null && pep.length() > 0) { + ArrayList<Spectrum> specList = pepSpecMap.get(pep); + if (specList == null) { + specList = new ArrayList<Spectrum>(); + pepSpecMap.put(pep, specList); + } + if (specList.size() < numSpecsPerPeptide) + specList.add(spec); + } + } + + SpectraContainer specContOnePerPep = new SpectraContainer(); + for (ArrayList<Spectrum> specList : pepSpecMap.values()) { + for (Spectrum spec : specList) { + specContOnePerPep.add(spec); + } + } + + ScoringParameterGeneratorWithErrors gen = new ScoringParameterGeneratorWithErrors( + specContOnePerPep, + dataType, + considerPhosLoss, + consideriTRAQLoss, + considerTMTLoss, + applyDeconvolution); + + gen.tolerance(new Tolerance(0.5f)); + + if (singlePartition) + gen.partition(2, true); + else + gen.partition(NUM_SEGMENTS_PER_SPECTRUM, false); + if (verbose) + System.out.println("Partition: " + gen.partitionSet.size()); + + gen.precursorOFF(MIN_PRECURSOR_OFFSET_PROBABILITY); + if (verbose) + System.out.println("PrecursorOFF Done."); + + gen.filterPrecursorPeaks(); + if (verbose) + System.out.println("Filtering Done."); + + if (applyDeconvolution) { + gen.deconvoluteSpectra(); + if (verbose) + System.out.println("Deconvolution Done."); + } + + gen.selectIonTypes(); + if (verbose) + System.out.println("Ion types selected."); + + gen.generateRankDist(MAX_RANK); + if (verbose) + System.out.println("Rank distribution computed."); + + gen.generateErrorDist(errorScalingFactor); + if (verbose) + System.out.println("Error distribution computed."); + + gen.smoothing(); + if (verbose) + System.out.println("Smoothing complete."); + + gen.writeParameters(outputFile); + gen.writeParametersPlainText(new File(outputFile.getPath() + ".txt")); + + if (verbose) + System.out.println("Writing Done."); + } + + private SpectraContainer specContainer; + private final boolean considerPhosLoss; + private 
final boolean consideriTRAQLoss; + private final boolean considerTMTLoss; + + public ScoringParameterGeneratorWithErrors(SpectraContainer specContainer, SpecDataType dataType, boolean considerPhosLoss, boolean consideriTRAQLoss, boolean considerTMTLoss, boolean applyDeconvolution) { + this.specContainer = specContainer; + this.considerPhosLoss = considerPhosLoss; + this.consideriTRAQLoss = consideriTRAQLoss; + this.considerTMTLoss = considerTMTLoss; + super.dataType = dataType; + super.applyDeconvolution = applyDeconvolution; + super.deconvolutionErrorTolerance = DECONVOLUTION_MASS_TOLERANCE; + } + + public void partition(int numSegments, boolean singlePartition) { + super.numSegments = numSegments; + chargeHist = new Histogram(); + partitionSet = new TreeSet(); + + Hashtable> parentMassMap = new Hashtable>(); + for (Spectrum spec : specContainer) { + int charge = spec.getCharge(); + if (charge <= 0) + continue; + chargeHist.add(charge); + if (spec.getAnnotation() != null) { + ArrayList precursorList = parentMassMap.get(charge); + if (precursorList == null) { + precursorList = new ArrayList(); + parentMassMap.put(charge, precursorList); + } + precursorList.add(spec.getPrecursorMass()); + } + } + + for (int c = chargeHist.minKey(); c <= chargeHist.maxKey(); c++) { + ArrayList parentMassList = parentMassMap.get(c); + if (parentMassList == null) + continue; + + int numSpec = parentMassList.size(); + if (numSpec < Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f)) + continue; + + int partitionSize = Math.max(numSpec / MAX_NUM_PARTITIONS_PER_CHARGE, MIN_NUM_SPECTRA_PER_PARTITION); + + Collections.sort(parentMassList); + int bestSetSize = 0; + + if (singlePartition) + bestSetSize = numSpec; + else { + int smallestRemainder = partitionSize; + for (int i = Math.round(partitionSize * 0.9f); i <= Math.round(partitionSize * 1.1f); i++) { + int remainder = numSpec % i; + if (i - remainder < remainder) + remainder = i - remainder; + if (remainder < smallestRemainder || 
(remainder == smallestRemainder && Math.abs(partitionSize - i) < Math.abs(partitionSize - bestSetSize))) { + bestSetSize = i; + smallestRemainder = remainder; + } + } + } + int num = 0; + for (int i = 0; i == 0 || i < Math.round(numSpec / (float) bestSetSize); i++) { + if (num != 0) { + for (int seg = 0; seg < numSegments; seg++) + partitionSet.add(new Partition(c, parentMassList.get(num), seg)); + } else { + for (int seg = 0; seg < numSegments; seg++) + partitionSet.add(new Partition(c, 0f, seg)); + } + num += bestSetSize; + } + } + } + + private void precursorOFF(float minProbThreshold) { + if (chargeHist == null) { + assert (false) : "partition() must have been called before"; + return; + } + precursorOFFMap = new TreeMap>(); + numPrecurOFF = 0; + + for (int charge = chargeHist.minKey(); charge <= chargeHist.maxKey(); charge++) { + if (chargeHist.get(charge) < MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF) + continue; + ArrayList precursorOffsetList = new ArrayList(); + int numSpecs = 0; + Hashtable> histList = new Hashtable>(); + for (int c = charge; c >= 2; c--) + histList.put(c, new Histogram()); + + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + numSpecs++; + spec = filter.apply(spec); + float precursorNeutralMass = spec.getPrecursorMass(); + for (int c = charge; c >= 2; c--) { + float precursorMz = (precursorNeutralMass + c * (float) Composition.ChargeCarrierMass()) / c; + ArrayList peakList = spec.getPeakListByMassRange( + precursorMz + MIN_PRECURSOR_OFFSET / (float) c - mme.getToleranceAsDa(precursorMz + MIN_PRECURSOR_OFFSET / (float) c) / 2, + precursorMz + MAX_PRECURSOR_OFFSET / (float) c + mme.getToleranceAsDa(precursorMz + MAX_PRECURSOR_OFFSET / (float) c) / 2); + + int prevMassIndexDiff = Integer.MIN_VALUE; + for (Peak p : peakList) { + float peakMass = p.getMz(); + int massIndexDiff = NominalMass.toNominalMass(peakMass - precursorMz); + if (massIndexDiff > prevMassIndexDiff) 
{ + histList.get(c).add(massIndexDiff); + prevMassIndexDiff = massIndexDiff; + } + } + } + } + + for (int c = charge; c >= 2; c--) { + ArrayList keyList = new ArrayList(histList.get(c).keySet()); + Collections.sort(keyList); + for (Integer key : keyList) { + float prob = (histList.get(c).get(key)) / (float) numSpecs; + if (prob > minProbThreshold) { + precursorOffsetList.add(new PrecursorOffsetFrequency((charge - c), NominalMass.getMassFromNominalMass(key), prob)); + } + } + } + precursorOFFMap.put(charge, precursorOffsetList); + numPrecurOFF += precursorOffsetList.size(); + } + } + + private void filterPrecursorPeaks() { + if (this.precursorOFFMap == null) + return; + for (Spectrum spec : specContainer) { + for (PrecursorOffsetFrequency off : this.getPrecursorOFF(spec.getCharge())) + spec.filterPrecursorPeaks(mme, off.getReducedCharge(), off.getOffset()); + } + } + + private void deconvoluteSpectra() { + SpectraContainer newSpecContainer = new SpectraContainer(); + for (Spectrum spec : specContainer) { + newSpecContainer.add(spec.getDeconvolutedSpectrum(DECONVOLUTION_MASS_TOLERANCE)); + } + specContainer = newSpecContainer; + } + + private Pair getPrecursorMassRange(Partition partition) { + float minParentMass = partition.getParentMass(); + float maxParentMass = Float.MAX_VALUE; + Partition higherPartition = partitionSet.higher(partition); + if (higherPartition != null) { + if (higherPartition.getCharge() == partition.getCharge() && higherPartition.getSegNum() == partition.getSegNum()) { + maxParentMass = higherPartition.getParentMass(); + } + } + return new Pair(minParentMass, maxParentMass); + } + + private void selectIonTypes() { + if (partitionSet == null) { + assert (false) : "partition() must have been called before!"; + return; + } + + fragOFFTable = new HashMap>(); + + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + Pair parentMassRange = getPrecursorMassRange(partition); + + SpectraContainer curPartContainer = new 
SpectraContainer(); + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + curPartContainer.add(spec); + } + + ArrayList signalFragmentOffsetFrequencyList = new ArrayList(); + + int seg = partition.getSegNum(); + IonType[] allIonTypes = IonType.getAllKnownIonTypes(Math.min(charge, 3), true, considerPhosLoss, consideriTRAQLoss, considerTMTLoss).toArray(new IonType[0]); + + IonProbability probGen = new IonProbability( + curPartContainer.iterator(), + allIonTypes, + mme) + .filter(filter) + .segment(seg, numSegments); + + float[] ionProb = probGen.getIonProb(); + + float signalThreshold = MIN_ION_OFFSET_PROBABILITY; + for (int i = 0; i < allIonTypes.length; i++) { + if (ionProb[i] >= signalThreshold) + signalFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(allIonTypes[i], ionProb[i])); + } + + if (signalFragmentOffsetFrequencyList.size() == 0) { + int maxIndex = -1; + float maxIonProb = Float.MIN_VALUE; + for (int i = 0; i < allIonTypes.length; i++) { + if (ionProb[i] > MIN_MAIN_ION_OFFSET_PROBABILITY && ionProb[i] > maxIonProb) { + maxIndex = i; + maxIonProb = ionProb[i]; + } + } + if (maxIndex >= 0) + signalFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(allIonTypes[maxIndex], maxIonProb)); + } + + Collections.sort(signalFragmentOffsetFrequencyList, Collections.reverseOrder()); + fragOFFTable.put(partition, signalFragmentOffsetFrequencyList); + } + super.determineIonTypes(); + } + + private void generateRankDist(int maxRank) { + if (partitionSet == null) { + assert (false) : "partition() must have been called!"; + return; + } + + rankDistTable = new HashMap>(); + this.maxRank = maxRank; + + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + IonType[] ionTypes = 
getIonTypes(partition); + if (ionTypes == null || ionTypes.length == 0) + continue; + + Pair parentMassRange = getPrecursorMassRange(partition); + int seg = partition.getSegNum(); + + int numSpec = 0; + Hashtable> rankDist = new Hashtable>(); + Hashtable rankDistMaxRank = new Hashtable(); + Hashtable rankDistUnexplained = new Hashtable(); + + for (IonType ion : ionTypes) { + rankDist.put(ion, new Histogram()); + rankDistMaxRank.put(ion, 0f); + rankDistUnexplained.put(ion, 0f); + } + rankDist.put(IonType.NOISE, new Histogram()); + + float[] noiseDist = new float[maxRank + 2]; + int numMaxRankPeaks = 0; + int totalCleavageSites = 0; + + for (Spectrum spec : specContainer) { + int numExplainedPeaks = 0; + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + Peptide annotation = spec.getAnnotation(); + spec.setRanksOfPeaks(); + numSpec++; + numMaxRankPeaks += spec.size() - maxRank + 1; + totalCleavageSites += annotation.size() - 1; + int prmMassIndex = 0; + int srmMassIndex = 0; + + HashSet explainedPeakSet = new HashSet(); + Hashtable numExplainedMaxRankPeaks = new Hashtable(); + for (IonType ion : ionTypes) { + numExplainedMaxRankPeaks.put(ion, 0); + } + + int numSignalBinsAtThisSegment = 0; + for (int i = 0; i < annotation.size() - 1; i++) { + prmMassIndex += NominalMass.toNominalMass(annotation.get(i).getMass()); + srmMassIndex += NominalMass.toNominalMass(annotation.get(annotation.size() - 1 - i).getMass()); + + float prm = NominalMass.getMassFromNominalMass(prmMassIndex); + float srm = NominalMass.getMassFromNominalMass(srmMassIndex); + for (IonType ion : ionTypes) { + float theoMass; + if (ion instanceof IonType.PrefixIon) + theoMass = ion.getMz(prm); + else + theoMass = ion.getMz(srm); + + int segNum = super.getSegmentNum(theoMass, curParentMass); + if (segNum == 
seg) { + numSignalBinsAtThisSegment++; + Peak p = spec.getPeakByMass(theoMass, mme); + if (p != null) { + numExplainedPeaks++; + int rank = p.getRank(); + if (rank >= maxRank) { + rank = maxRank; + numExplainedMaxRankPeaks.put(ion, numExplainedMaxRankPeaks.get(ion) + 1); + } + explainedPeakSet.add(p); + rankDist.get(ion).add(rank); + } else { + rankDist.get(ion).add(maxRank + 1); + } + } + } + } + + ArrayList unexplainedPeaksAtThisSegment = new ArrayList(); + int numPeaksAtThisSegment = 0; + int numMaxRankPeaksAtThisSegment = 0; + for (Peak p : spec) { + if (super.getSegmentNum(p.getMz(), curParentMass) == seg) { + numPeaksAtThisSegment++; + if (p.getRank() >= maxRank) + numMaxRankPeaksAtThisSegment++; + if (!explainedPeakSet.contains(p)) + unexplainedPeaksAtThisSegment.add(p); + } + } + + float midMassThisSegment = (1f / numSegments * seg + 1f / numSegments / 2) * annotation.getParentMass(); + float numBinsAtThisSegment = annotation.getParentMass() / numSegments / mme.getToleranceAsDa(midMassThisSegment) / 2; + + for (Peak p : unexplainedPeaksAtThisSegment) { + int rank = p.getRank(); + float noiseFreq = (annotation.size() - 1) / numSegments / numBinsAtThisSegment; + if (rank >= maxRank) + noiseDist[maxRank] += noiseFreq / numMaxRankPeaksAtThisSegment; + else + noiseDist[rank] += noiseFreq; + } + + for (IonType ion : ionTypes) { + if (numMaxRankPeaksAtThisSegment > 0) { + Float prevSumFreq = rankDistMaxRank.get(ion); + float curFreq = numExplainedMaxRankPeaks.get(ion) / (float) numMaxRankPeaksAtThisSegment; + rankDistMaxRank.put(ion, prevSumFreq + curFreq); + } + } + + noiseDist[maxRank + 1] += (numBinsAtThisSegment - numPeaksAtThisSegment) * (annotation.size() - 1) / numSegments / numBinsAtThisSegment; + } + + HashMap freqDist = new HashMap(); + for (IonType ion : ionTypes) { + Float[] dist = new Float[maxRank + 1]; + Histogram hist = rankDist.get(ion); + for (int i = 1; i <= maxRank - 1; i++) { + Integer num = hist.get(i); + dist[i - 1] = (num / (float) 
numSpec); + } + dist[maxRank - 1] = rankDistMaxRank.get(ion) / numSpec; + dist[maxRank] = hist.get(maxRank + 1) / (float) numSpec; + freqDist.put(ion, dist); + } + + // noise + Float[] dist = new Float[maxRank + 1]; + for (int i = 1; i <= maxRank + 1; i++) + dist[i - 1] = noiseDist[i] / numSpec; + freqDist.put(IonType.NOISE, dist); + + rankDistTable.put(partition, freqDist); + } + } + + private void generateErrorDist(int errorScalingFactor) { + this.errorScalingFactor = errorScalingFactor; + if (errorScalingFactor > 0) { + generateIonErrorDist(); + generateNoiseErrorDist(); + } + } + + private void generateIonErrorDist() { + ionErrDistTable = new HashMap(); + ionExistenceTable = new HashMap(); + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + Pair parentMassRange = getPrecursorMassRange(partition); + int seg = partition.getSegNum(); + if (seg != super.getNumSegments() - 1) + continue; + IonType mainIon = this.getMainIonType(partition); + IntHistogram errHist = new IntHistogram(); + int[] edgeCount = new int[4]; + int numSpecs = 0; + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + numSpecs++; + Peptide peptide; + + peptide = spec.getAnnotation(); + + int intResidueMass = 0; + float[] obsMass = new float[peptide.size() + 1]; + + obsMass[0] = mainIon.getOffset(); + for (int i = 0; i < peptide.size() - 1; i++) { + if (mainIon instanceof PrefixIon) + intResidueMass += peptide.get(i).getNominalMass(); + else + intResidueMass += peptide.get(peptide.size() - 1 - i).getNominalMass(); + + float theoMass = mainIon.getMz(NominalMass.getMassFromNominalMass(intResidueMass)); + Peak p = spec.getPeakByMass(theoMass, mme); + if (p != null) + obsMass[i + 1] = p.getMz(); + else + obsMass[i + 1] = -1; + } 
+ + obsMass[peptide.size()] = mainIon.getMz(peptide.getMass()); + for (int i = 1; i <= peptide.size(); i++) { + if (obsMass[i] >= 0) { + if (obsMass[i - 1] >= 0) { + AminoAcid aa; + if (mainIon instanceof PrefixIon) + aa = peptide.get(i - 1); + else + aa = peptide.get(peptide.size() - i); + + float expMass = obsMass[i] - obsMass[i - 1]; + float theoMass = aa.getMass() / mainIon.getCharge(); + float diff = expMass - theoMass; + int diffIndex = Math.round(diff * errorScalingFactor); + if (diffIndex > errorScalingFactor) + diffIndex = errorScalingFactor; + else if (diffIndex < -errorScalingFactor) + diffIndex = -errorScalingFactor; + errHist.add(diffIndex); + edgeCount[3]++; + } else + edgeCount[1]++; + } else { + if (obsMass[i - 1] >= 0) + edgeCount[2]++; + else + edgeCount[0]++; + } + } + } + + Float[] ionErrHist = new Float[2 * errorScalingFactor + 1]; + float[] smoothedHist = errHist.getSmoothedHist(errorScalingFactor); + for (int i = -errorScalingFactor; i <= errorScalingFactor; i++) + ionErrHist[i + errorScalingFactor] = smoothedHist[i + errorScalingFactor] / (float) errHist.totalCount(); + + Float[] ionExistence = new Float[edgeCount.length]; + int sumEdgeCount = 0; + for (int i = 0; i < edgeCount.length; i++) + sumEdgeCount += edgeCount[i]; + for (int i = 0; i < edgeCount.length; i++) + ionExistence[i] = edgeCount[i] / (float) sumEdgeCount; + + for (int i = 0; i < this.numSegments; i++) { + Partition part = new Partition(partition.getCharge(), partition.getParentMass(), i); + if (partitionSet.contains(part)) { + ionErrDistTable.put(part, ionErrHist); + ionExistenceTable.put(part, ionExistence); + } + } + } + } + + private void generateNoiseErrorDist() { + this.noiseErrDistTable = new HashMap(); + AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys(); + AminoAcid aaK = aaSet.getAminoAcid('K'); + AminoAcid aaQ = aaSet.getAminoAcid('Q'); + int heaviestAANominalMass = aaSet.getMaxNominalMass(); + float[] nominalMass = new 
float[heaviestAANominalMass + 1]; + for (AminoAcid aa : aaSet) + nominalMass[aa.getNominalMass()] = aa.getMass(); + + for (Partition partition : partitionSet) { + int charge = partition.getCharge(); + Pair parentMassRange = getPrecursorMassRange(partition); + int seg = partition.getSegNum(); + if (seg != super.getNumSegments() - 1) + continue; + + IntHistogram errHist = new IntHistogram(); + int numSpecs = 0; + for (Spectrum spec : specContainer) { + if (spec.getAnnotation() == null) + continue; + if (spec.getCharge() != charge) + continue; + + float curParentMass = spec.getPrecursorMass(); + if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond()) + continue; + + Spectrum noiseSpec = (Spectrum) spec.clone(); + + numSpecs++; + + for (int i = 0; i < noiseSpec.size() - 1; i++) { + Peak p1 = noiseSpec.get(i); + float p1Mass = p1.getMz(); + int nominalP1 = NominalMass.toNominalMass(p1.getMz()); + for (int j = i + 1; j < noiseSpec.size(); j++) { + Peak p2 = noiseSpec.get(j); + float p2Mass = p2.getMz(); + int nominalP2 = NominalMass.toNominalMass(p2.getMz()); + int nominalDiff = nominalP2 - nominalP1; + if (nominalDiff > heaviestAANominalMass) + break; + if (nominalMass[nominalDiff] == 0) + continue; + + float diff = p2Mass - p1Mass; + float aaMass = nominalMass[nominalDiff]; + if (nominalDiff == 128) { + if (Math.abs(diff - aaQ.getMass()) > Math.abs(diff - aaK.getMass())) + aaMass = aaK.getMass(); + else + aaMass = aaQ.getMass(); + } + float err = diff - aaMass; + errHist.add(Math.round(err * errorScalingFactor)); + } + } + } + Float[] noiseErrHist = new Float[2 * errorScalingFactor + 1]; + float[] smoothedHist = errHist.getSmoothedHist(errorScalingFactor); + for (int i = -errorScalingFactor; i <= errorScalingFactor; i++) + noiseErrHist[i + errorScalingFactor] = smoothedHist[i + errorScalingFactor] / (float) errHist.totalCount(); + + for (int i = 0; i < this.numSegments; i++) { + Partition part = new 
Partition(partition.getCharge(), partition.getParentMass(), i); + if (partitionSet.contains(part)) { + noiseErrDistTable.put(part, noiseErrHist); + } + } + } + } + + protected void smoothing() { + smoothingRankDistTable(); + } + + protected void smoothingRankDistTable() { + if (rankDistTable == null) + return; + assert (smoothingRanks.length == smoothingWindowSize.length); + for (Partition partition : rankDistTable.keySet()) { + HashMap table = this.rankDistTable.get(partition); + for (IonType ion : table.keySet()) { + Float[] freq = table.get(ion); + Float[] smoothedFreq = new Float[freq.length]; + int smoothingIndex = 0; + for (int i = 0; i < freq.length - 2; i++) { + if (smoothingIndex < smoothingRanks.length - 1 && + i == smoothingRanks[smoothingIndex]) + smoothingIndex++; + int windowSize = smoothingWindowSize[smoothingIndex]; + float sumFrequencies = 0; + int numIndicesSummed = 0; + for (int d = -windowSize; d <= windowSize; d++) { + int index = i + d; + if (index < 0 || index > freq.length - 3) + continue; + sumFrequencies += freq[index]; + numIndicesSummed++; + } + while (sumFrequencies == 0 && windowSize < freq.length - 4) { + windowSize++; + int index = i - windowSize; + if (index >= 0) { + sumFrequencies += freq[index]; + numIndicesSummed++; + } + index = i + windowSize; + if (index <= freq.length - 3) { + sumFrequencies += freq[index]; + numIndicesSummed++; + } + } + if (sumFrequencies != 0) + smoothedFreq[i] = sumFrequencies / numIndicesSummed; + else + assert (false); + } + for (int i = 0; i < freq.length - 2; i++) + freq[i] = smoothedFreq[i]; + if (freq[freq.length - 1] == 0) + freq[freq.length - 1] = Float.MIN_VALUE; + if (freq[freq.length - 2] == 0) + freq[freq.length - 2] = freq[freq.length - 3]; + } + } + } +} diff --git a/src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java b/src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java new file mode 100644 index 00000000..8a39b7ac --- /dev/null +++ 
b/src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java @@ -0,0 +1,325 @@ +package edu.ucsd.msjava.msutil; + +import edu.ucsd.msjava.mgf.BufferedLineReader; +import edu.ucsd.msjava.misc.ExceptionCapturer; +import edu.ucsd.msjava.misc.ProgressData; +import edu.ucsd.msjava.misc.ProgressReporter; +import edu.ucsd.msjava.misc.ThreadPoolExecutorWithExceptions; + +import java.io.File; +import java.io.IOException; +import java.io.PrintStream; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.Iterator; +import java.util.List; +import java.util.concurrent.TimeUnit; + +/** + * Restored trainer support class. mzID input has been excised in this fork + * (the {@code mzid/} package and its {@code MzIDParser} were removed); only + * pre-converted TSV result files are accepted. + */ +public class AnnotatedSpectra implements ProgressReporter, ExceptionCapturer { + private File[] resultFiles; + private File specDir; + private AminoAcidSet aaSet; + private float fdrThreshold = 0.01f; + private ProgressData progress; + private boolean dropErrorDatasets = false; + private Throwable exception = null; + + @Override + public void setProgressData(ProgressData data) { + progress = data; + } + + @Override + public ProgressData getProgressData() { + return progress; + } + + @Override + public boolean hasException() { + return exception != null; + } + + @Override + public Throwable getException() { + return exception; + } + + public void setDropErrorDatasets(boolean dropErrors) { + dropErrorDatasets = dropErrors; + } + + private SpectraContainer annotatedSpectra; + + public AnnotatedSpectra(File[] resultFiles, File specDir, AminoAcidSet aaSet) { + this.resultFiles = resultFiles; + this.specDir = specDir; + this.aaSet = aaSet; + this.progress = null; + } + + public AnnotatedSpectra(File[] resultFiles, File specDir, AminoAcidSet aaSet, float fdrThreshold, boolean dropErrors) { + this.resultFiles = resultFiles; + this.specDir = specDir; 
+        this.aaSet = aaSet;
+        this.fdrThreshold = fdrThreshold;
+        this.dropErrorDatasets = dropErrors;
+        this.progress = null;
+    }
+
+    public static class ConcurrentAnnotatedSpectraParser extends AnnotatedSpectra implements Runnable {
+        private List<Spectrum> results;
+        private List<String> errors;
+
+        public ConcurrentAnnotatedSpectraParser(File[] resultFiles, File specDir, AminoAcidSet aaSet, float fdrThreshold, boolean dropErrors, List<Spectrum> resultList, List<String> errorList) {
+            super(resultFiles, specDir, aaSet, fdrThreshold, dropErrors);
+            results = resultList;
+            errors = errorList;
+        }
+
+        @Override
+        public void run() {
+            String result = parse();
+            results.addAll(getAnnotatedSpecContainer());
+            if (result != null) {
+                errors.add(result);
+            }
+        }
+    }
+
+    public AnnotatedSpectra fdrThreshold(float fdrThreshold) {
+        this.fdrThreshold = fdrThreshold;
+        return this;
+    }
+
+    public SpectraContainer getAnnotatedSpecContainer() {
+        return annotatedSpectra;
+    }
+
+    public String parse(int numThreads, boolean dropErrors) {
+        if (numThreads <= 1) {
+            // Honor the dropErrors flag on the single-threaded path too:
+            // parse() reads the instance field, not this method's argument.
+            dropErrorDatasets = dropErrors;
+            return parse();
+        }
+
+        List<Spectrum> results = Collections.synchronizedList(new ArrayList<Spectrum>());
+        List<String> errors = Collections.synchronizedList(new ArrayList<String>());
+
+        // Thread pool
+        ThreadPoolExecutorWithExceptions executor = ThreadPoolExecutorWithExceptions.newFixedThreadPool(numThreads);
+        executor.setTaskName("Parse");
+
+        try {
+            List<List<File>> taskFiles = new ArrayList<List<File>>();
+            for (int i = 0; i < numThreads; i++) {
+                taskFiles.add(new ArrayList<File>());
+            }
+            for (int i = 0; i < resultFiles.length; i++) {
+                taskFiles.get(i % numThreads).add(resultFiles[i]);
+            }
+            for (int i = 0; i < numThreads; i++) {
+                List<File> thisTaskFiles = taskFiles.get(i);
+                System.out.println("Task " + (i + 1) + ": " + thisTaskFiles.size() + " files.");
+                ConcurrentAnnotatedSpectraParser parser = new ConcurrentAnnotatedSpectraParser(thisTaskFiles.toArray(new File[0]), specDir, aaSet, fdrThreshold, dropErrors, results, errors);
+                executor.execute(parser);
+            }
+            taskFiles.clear();
+
+            
executor.outputProgressReport(); + executor.shutdown(); + + try { + executor.awaitTerminationWithExceptions(Long.MAX_VALUE, TimeUnit.NANOSECONDS); + } catch (InterruptedException e) { + if (!executor.HasThrownData()) { + e.printStackTrace(); + } + } + + executor.outputProgressReport(); + } catch (OutOfMemoryError ex) { + ex.printStackTrace(); + executor.shutdownNow(); + return "Task terminated; results incomplete. Please run again with a greater amount of memory, using \"-Xmx4G\", for example."; + } catch (Exception ex) { + ex.printStackTrace(); + executor.shutdownNow(); + return "Task terminated; results incomplete. Please run again."; + } catch (Throwable ex) { + ex.printStackTrace(); + executor.shutdownNow(); + return "Task terminated; results incomplete. Please run again."; + } + annotatedSpectra = new SpectraContainer(); + annotatedSpectra.addAll(results); + String errorList = null; + for (String error : errors) { + if (errorList == null) { + errorList = error; + } else { + errorList += "\n" + error; + } + } + return errorList; + } + + public String parse() { + if (progress == null) { + progress = new ProgressData(); + } + annotatedSpectra = new SpectraContainer(); + + System.out.println("Using " + resultFiles.length + " result files:"); + for (File resultFile : resultFiles) + System.out.println("\t" + resultFile.getName()); + + int count = 0; + int total = resultFiles.length; + String aggErrs = null; + for (File resultFile : resultFiles) { + String errMsg = parseFile(resultFile); + count++; + progress.report(count, total); + if (errMsg != null) { + String msg = "Error while parsing " + resultFile.getName() + ": " + errMsg; + if (dropErrorDatasets) { + System.out.println(msg); + if (aggErrs == null) { + aggErrs = msg; + } else { + aggErrs += "\n" + msg; + } + } else { + exception = new Exception(msg); + return msg; + } + } + } + return null; + } + + public void writeToMgf(PrintStream out) { + if (annotatedSpectra != null) { + for (Spectrum spec : 
annotatedSpectra)
+                spec.outputMgf(out);
+        }
+    }
+
+    public String parseFile(File resultFile) {
+        System.out.println("Parsing " + resultFile.getName());
+        File tsvResultFile;
+        if (resultFile.getName().endsWith(".mzid")) {
+            // mzID input was supported by upstream via MzIDParser; the mzid/ package
+            // was removed in this fork. Pre-convert .mzid to .tsv and pass the .tsv.
+            return "mzid input is not supported in this build; pre-convert to .tsv and supply that instead.";
+        } else if (resultFile.getName().endsWith(".tsv")) {
+            tsvResultFile = resultFile;
+        } else {
+            return "Unrecognized result file extension (expected .tsv): " + resultFile.getName();
+        }
+
+        BufferedLineReader in;
+        try {
+            in = new BufferedLineReader(tsvResultFile.getPath());
+        } catch (IOException e) {
+            e.printStackTrace();
+            return "Unable to open " + tsvResultFile.getPath();
+        }
+
+        String s = in.readLine();
+        if (s == null || !s.startsWith("#")) {
+            return "Not a valid tsv result file";
+        }
+
+        int specIdCol = -1;
+        int specFileCol = -1;
+        int pepCol = -1;
+        int fdrCol = -1;
+        int chargeCol = -1;
+
+        String[] label = s.split("\t");
+        for (int i = 0; i < label.length; i++) {
+            if (label[i].equalsIgnoreCase("#SpecFile"))
+                specFileCol = i;
+            else if (label[i].equalsIgnoreCase("SpecID"))
+                specIdCol = i;
+            else if (label[i].equalsIgnoreCase("Peptide"))
+                pepCol = i;
+            else if (label[i].equalsIgnoreCase("FDR") || label[i].equalsIgnoreCase("EFDR") || label[i].equalsIgnoreCase("QValue") || label[i].equalsIgnoreCase("SpecQValue"))
+                fdrCol = i;
+            else if (label[i].equalsIgnoreCase("Charge"))
+                chargeCol = i;
+        }
+        if (specIdCol < 0 || specFileCol < 0 || pepCol < 0)
+            return "Not a valid mzid file";
+        if (fdrCol < 0)
+            return "QValue is missing";
+
+        ArrayList<String> resultList = new ArrayList<String>();
+        while ((s = in.readLine()) != null) {
+            String[] token = s.split("\t");
+            if (token.length <= specIdCol || token.length <= specFileCol || token.length <= pepCol || token.length <= fdrCol)
+                continue;
+
+            float fdr 
= Float.parseFloat(token[fdrCol]);
+
+            if (fdr <= fdrThreshold) {
+                resultList.add(s);
+            }
+        }
+
+        Iterator<String> itr = resultList.iterator();
+        List<Spectrum> annotatedResults = new ArrayList<Spectrum>();
+
+        HashMap<String, SpectraAccessor> specAccessorMap = new HashMap<String, SpectraAccessor>();
+        while (itr.hasNext()) {
+            String str = itr.next();
+            String[] token = str.split("\t");
+
+            String pep = token[pepCol];
+            if (pep.matches(".\\..+\\.."))
+                pep = pep.substring(pep.indexOf('.') + 1, pep.lastIndexOf('.'));
+
+            String specFileName = token[specFileCol];
+            specFileName = new File(specFileName).getName();
+
+            int charge = Integer.parseInt(token[chargeCol]);
+
+            SpectraAccessor specAccessor = specAccessorMap.get(specFileName);
+            if (specAccessor == null) {
+                File specFile = new File(specDir.getPath() + File.separator + specFileName);
+                specAccessor = new SpectraAccessor(specFile);
+                specAccessorMap.put(specFileName, specAccessor);
+            }
+
+            String specId = token[specIdCol];
+            Spectrum spec = specAccessor.getSpectrumById(specId);
+
+            if (spec == null)
+                return specFileName + ":" + specId + " is not available!";
+            else {
+                Peptide peptide = new Peptide(pep, aaSet);
+                spec.setCharge(charge);
+
+                if (Math.abs(spec.getPeptideMass() - peptide.getMass()) < 5) {
+                    spec.setAnnotation(peptide);
+                    annotatedResults.add(spec);
+                } else {
+                    return "parent mass doesn't match " + specFileName + ":" + specId + " " + peptide.toString() + " " + spec.getPeptideMass() + " != " + peptide.getMass();
+                }
+            }
+        }
+
+        try {
+            in.close();
+        } catch (IOException e) {
+            e.printStackTrace();
+        }
+        annotatedSpectra.addAll(annotatedResults);
+        return null;
+    }
+}
diff --git a/src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java b/src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java
new file mode 100644
index 00000000..b335342f
--- /dev/null
+++ b/src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java
@@ -0,0 +1,205 @@
+package edu.ucsd.msjava.ui;
+
+import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
+import 
edu.ucsd.msjava.msscorer.ScoringParameterGeneratorWithErrors;
+import edu.ucsd.msjava.msutil.ActivationMethod;
+import edu.ucsd.msjava.msutil.AminoAcidSet;
+import edu.ucsd.msjava.msutil.AnnotatedSpectra;
+import edu.ucsd.msjava.msutil.Enzyme;
+import edu.ucsd.msjava.msutil.InstrumentType;
+import edu.ucsd.msjava.msutil.Protocol;
+
+import java.io.BufferedOutputStream;
+import java.io.File;
+import java.io.FileNotFoundException;
+import java.io.FileOutputStream;
+import java.io.PrintStream;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Trainer entry point. The MS-GF+ {@code params/ParamManager} CLI framework
+ * was removed from this fork; this restored entry point uses a simple
+ * {@code args[]} parser instead. Inputs that depended on the deleted
+ * {@code mzid/} package are not accepted - supply pre-converted .tsv files.
+ */
+public class ScoringParamGen {
+
+    public static final int VERSION = 8831;
+    public static final String DATE = "02/04/2013";
+
+    public static void main(String argv[]) {
+        if (argv.length == 0 || hasHelpFlag(argv)) {
+            printUsageInfo(null);
+            return;
+        }
+
+        String errMsg = run(argv);
+        if (errMsg != null) {
+            System.err.println("[Error] " + errMsg);
+            System.out.println();
+            printUsageInfo(null);
+        }
+    }
+
+    private static boolean hasHelpFlag(String[] argv) {
+        for (String a : argv) {
+            if (a.equalsIgnoreCase("-h") || a.equalsIgnoreCase("--help") || a.equalsIgnoreCase("-help"))
+                return true;
+        }
+        return false;
+    }
+
+    static String run(String[] argv) {
+        // -i <results.tsv>[,results2.tsv,...]  (one or more comma-separated training result TSVs)
+        // -d <specDir>             (directory holding the spectrum files referenced by the TSVs)
+        // -m <activationMethod>    (e.g. CID, ETD, HCD, UVPD)
+        // -inst <instrumentType>   (e.g. LowRes, HighRes)
+        // -e <enzymeName>          (e.g. Tryp)
+        // -protocol <protocol>     (e.g. 
NoProtocol, optional, default automatic)
+        // -thread <numThreads>     (default 1)
+        // -dropErrors 0|1          (default 0)
+        // -mgf 0|1                 (default 0; if 1, also writes .mgf)
+        File[] resultFiles = null;
+        File specDir = null;
+        ActivationMethod activationMethod = null;
+        InstrumentType instType = null;
+        Enzyme enzyme = null;
+        Protocol protocol = null;
+        int numThreads = 1;
+        boolean dropErrors = false;
+        boolean createMgf = false;
+
+        for (int i = 0; i < argv.length; i += 2) {
+            if (!argv[i].startsWith("-") || i + 1 >= argv.length) {
+                return "Invalid parameter: " + argv[i];
+            }
+            String key = argv[i];
+            String val = argv[i + 1];
+            if (key.equalsIgnoreCase("-i")) {
+                String[] paths = val.split(",");
+                List<File> files = new ArrayList<File>(paths.length);
+                for (String p : paths) {
+                    File f = new File(p);
+                    if (!f.exists())
+                        return "Input file does not exist: " + p;
+                    files.add(f);
+                }
+                resultFiles = files.toArray(new File[0]);
+            } else if (key.equalsIgnoreCase("-d")) {
+                specDir = new File(val);
+                if (!specDir.exists() || !specDir.isDirectory())
+                    return "Spectrum directory does not exist or is not a directory: " + val;
+            } else if (key.equalsIgnoreCase("-m")) {
+                activationMethod = ActivationMethod.get(val);
+                if (activationMethod == null)
+                    return "Unrecognized activation method: " + val;
+            } else if (key.equalsIgnoreCase("-inst")) {
+                instType = InstrumentType.get(val);
+                if (instType == null)
+                    return "Unrecognized instrument type: " + val;
+            } else if (key.equalsIgnoreCase("-e")) {
+                enzyme = Enzyme.getEnzymeByName(val);
+                if (enzyme == null)
+                    return "Unrecognized enzyme: " + val;
+            } else if (key.equalsIgnoreCase("-protocol")) {
+                protocol = Protocol.get(val);
+                if (protocol == null)
+                    return "Unrecognized protocol: " + val;
+            } else if (key.equalsIgnoreCase("-thread")) {
+                try {
+                    numThreads = Integer.parseInt(val);
+                } catch (NumberFormatException e) {
+                    return "-thread must be an integer";
+                }
+            } else if (key.equalsIgnoreCase("-dropErrors")) {
+                dropErrors = "1".equals(val);
+            } else if 
(key.equalsIgnoreCase("-mgf")) { + createMgf = "1".equals(val); + } else { + return "Unknown option: " + key; + } + } + + if (resultFiles == null) return "missing -i (training result TSV files)"; + if (specDir == null) return "missing -d (spectrum directory)"; + if (activationMethod == null) return "missing -m (activation method)"; + if (instType == null) return "missing -inst (instrument type)"; + if (enzyme == null) return "missing -e (enzyme)"; + if (protocol == null) protocol = Protocol.AUTOMATIC; + + long time = System.currentTimeMillis(); + AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSet(); + + AnnotatedSpectra annotatedSpec = new AnnotatedSpectra(resultFiles, specDir, aaSet); + System.out.println("Reading training PSMs..."); + String parseErr = annotatedSpec.parse(numThreads, dropErrors); + if (parseErr != null) { + if (dropErrors) { + System.out.println("Datasets with errors (dropped): " + parseErr); + } else { + return parseErr; + } + } + if (annotatedSpec.getAnnotatedSpecContainer() == null + || annotatedSpec.getAnnotatedSpecContainer().isEmpty()) { + return "No results to train on. 
Exiting."; + } + System.out.println("Done."); + + SpecDataType dataType = new SpecDataType(activationMethod, instType, enzyme, protocol); + + if (createMgf) { + String mgfFileName = dataType.toString() + ".mgf"; + File mgfFile = new File(mgfFileName); + System.out.println("Creating " + mgfFile.getPath()); + try { + PrintStream mgfOut = new PrintStream(new BufferedOutputStream(new FileOutputStream(mgfFile))); + annotatedSpec.writeToMgf(mgfOut); + mgfOut.close(); + } catch (FileNotFoundException e) { + e.printStackTrace(); + } + } + + ScoringParameterGeneratorWithErrors.generateParameters( + annotatedSpec.getAnnotatedSpecContainer(), + dataType, + aaSet, + new File("."), + false, + true); + + System.out.format("ScoringParamGen complete (total elapsed time: %.2f sec)\n", + (System.currentTimeMillis() - time) / (float) 1000); + return null; + } + + public static void printUsageInfo(String message) { + if (message != null) { + System.err.println(message); + } + System.out.println("ScoringParamGen v" + VERSION + " (" + DATE + ")"); + System.out.println("Trains MS-GF+ scoring parameters (.param) from annotated PSM training data."); + System.out.println(); + System.out.println("Usage: java -Xmx2000M -cp MSGFPlus.jar edu.ucsd.msjava.ui.ScoringParamGen [options]"); + System.out.println(); + System.out.println("Required:"); + System.out.println(" -i Training result TSV files (mzID input not supported in this build)"); + System.out.println(" -d Directory holding the spectrum files referenced by the TSVs"); + System.out.println(" -m Activation method (e.g. CID, ETD, HCD, UVPD)"); + System.out.println(" -inst Instrument type (e.g. LowRes, HighRes, QExactive)"); + System.out.println(" -e Enzyme name (e.g. 
Tryp, Chymotryp, LysC, AspN)"); + System.out.println(); + System.out.println("Optional:"); + System.out.println(" -protocol Protocol (default: NoProtocol/automatic)"); + System.out.println(" -thread Worker threads for parsing PSMs (default: 1)"); + System.out.println(" -dropErrors 0|1 Drop datasets with errors instead of failing (default: 0)"); + System.out.println(" -mgf 0|1 Also emit aggregated .mgf (default: 0)"); + System.out.println(); + System.out.println("Notes:"); + System.out.println(" * The output .param file is written to the current working directory."); + System.out.println(" * mzID input was supported by upstream MS-GF+ but the mzid/ package was removed"); + System.out.println(" from this fork; pre-convert .mzid to .tsv if needed."); + } +} diff --git a/src/test/java/edu/ucsd/msjava/msutil/TestAnnotatedSpectraRecovery.java b/src/test/java/edu/ucsd/msjava/msutil/TestAnnotatedSpectraRecovery.java new file mode 100644 index 00000000..b5fc5001 --- /dev/null +++ b/src/test/java/edu/ucsd/msjava/msutil/TestAnnotatedSpectraRecovery.java @@ -0,0 +1,232 @@ +package edu.ucsd.msjava.msutil; + +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; +import java.nio.file.Path; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; + +/** + * Integration tests for the restored {@link AnnotatedSpectra} TSV/spectrum + * loader. Covers: + * + * - Extension validation: .mzid is rejected with the explicit "not supported + * in this build" message (the mzid/ package was removed); .tsv is the + * only accepted format. + * - TSV header validation: required columns must be present; missing columns + * produce specific errors. 
+ * - End-to-end TSV + spectrum-file integration using the existing + * src/test/resources/iprg-2013/F13.mgf fixture (real MS2 data, 146K lines). + * Exercises SpectraAccessor lookup against an MGF spec dir without + * requiring any specific instrument data; the recovered code is + * instrument-agnostic. Astral-format mzML files are too large to ship as + * test fixtures, but the integration boundary is identical: parse a TSV + * row, look up the named SpecID via SpectraAccessor, return a graceful + * error if not found. + * + * Does NOT exercise the actual scoring-parameter trainer + * (ScoringParameterGeneratorWithErrors.generateParameters); that needs + * thousands of confidently-annotated PSMs and is out of scope for a recovery + * smoke test. + */ +public class TestAnnotatedSpectraRecovery { + + @Rule + public TemporaryFolder tmp = new TemporaryFolder(); + + private static final String F13_MGF_RESOURCE_PATH = "src/test/resources/iprg-2013/F13.mgf"; + + private File writeTsv(String content) throws IOException { + File f = tmp.newFile("results.tsv"); + Files.write(f.toPath(), content.getBytes(StandardCharsets.UTF_8)); + return f; + } + + private AnnotatedSpectra newWith(File... 
resultFiles) { + return new AnnotatedSpectra(resultFiles, tmp.getRoot(), AminoAcidSet.getStandardAminoAcidSet()); + } + + @Test + public void rejectsMzidExtensionWithExplicitMessage() throws IOException { + File mzid = tmp.newFile("results.mzid"); + Files.write(mzid.toPath(), "".getBytes(StandardCharsets.UTF_8)); + + AnnotatedSpectra parser = newWith(mzid); + String err = parser.parseFile(mzid); + + assertNotNull("mzid input should be rejected", err); + assertTrue("error should call out that mzid is unsupported in this build; got: " + err, + err.contains("mzid input is not supported in this build")); + } + + @Test + public void rejectsUnknownExtension() throws IOException { + File txt = tmp.newFile("results.txt"); + Files.write(txt.toPath(), "irrelevant".getBytes(StandardCharsets.UTF_8)); + + AnnotatedSpectra parser = newWith(txt); + String err = parser.parseFile(txt); + + assertNotNull(err); + assertTrue("error should mention the unrecognized extension; got: " + err, + err.startsWith("Unrecognized result file extension")); + } + + @Test + public void rejectsTsvWithoutHeaderHash() throws IOException { + // Header line must start with "#" per MS-GF+ result format. A blank + // first line (or a line that doesn't start with #) is rejected. + File tsv = writeTsv("SpecFile\tSpecID\tPeptide\tCharge\tQValue\nfoo\t1\tK.PEPTIDE.K\t2\t0.001\n"); + + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parseFile(tsv); + + assertEquals("Not a valid tsv result file", err); + } + + @Test + public void rejectsTsvMissingRequiredColumns() throws IOException { + // Header has only #SpecFile and Charge; SpecID and Peptide are required. + File tsv = writeTsv("#SpecFile\tCharge\nfoo\t2\n"); + + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parseFile(tsv); + + // Note: the upstream error message reads "Not a valid mzid file" even + // though the input is .tsv — that string predates the mzID excision. 
+ // Keeping it as-is to stay faithful to the upstream parser; the test + // documents the actual behavior so future readers aren't surprised. + assertNotNull(err); + assertTrue("missing required columns should be flagged; got: " + err, + err.contains("Not a valid")); + } + + @Test + public void rejectsTsvMissingFdrColumn() throws IOException { + // Header has SpecID/SpecFile/Peptide but no QValue/EFDR/SpecQValue/FDR. + File tsv = writeTsv("#SpecFile\tSpecID\tPeptide\tCharge\nfoo\t1\tK.PEPTIDE.K\t2\n"); + + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parseFile(tsv); + + assertEquals("QValue is missing", err); + } + + @Test + public void acceptsAnyOfTheFdrColumnAliases() throws IOException { + // The parser recognises FDR / EFDR / QValue / SpecQValue interchangeably + // as the FDR column. Verify each alias produces a passing header check + // (no error). Use parse() (not parseFile() directly) because parseFile + // expects the container to have been initialised by parse() first. + for (String fdrAlias : new String[]{"FDR", "EFDR", "QValue", "SpecQValue"}) { + File tsv = tmp.newFile("results-" + fdrAlias + ".tsv"); + Files.write(tsv.toPath(), + ("#SpecFile\tSpecID\tPeptide\tCharge\t" + fdrAlias + "\n") + .getBytes(StandardCharsets.UTF_8)); + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parse(); + // No data rows -> no lookups -> parser returns null (success). + assertNull("FDR column alias '" + fdrAlias + "' should be accepted; got: " + err, err); + } + } + + @Test + public void filtersRowsAboveFdrThreshold() throws IOException { + // Default fdrThreshold is 0.01. A row with QValue=0.99 is dropped; + // no spectrum lookup is attempted, so the parser succeeds without + // touching the spec dir. 
+ File tsv = writeTsv( + "#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n" + + "irrelevant.mgf\tindex=0\tK.PEPTIDE.K\t2\t0.99\n"); + + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parse(); + + assertNull("rows above FDR threshold should be dropped silently; got: " + err, err); + assertNotNull(parser.getAnnotatedSpecContainer()); + assertTrue("no PSMs should have passed the filter", + parser.getAnnotatedSpecContainer().isEmpty()); + } + + @Test + public void looksUpSpectraInRealMgfFixtureAndReportsMissingId() throws IOException { + // Drive the full TSV-row parse path against the real F13.mgf fixture. + // The TSV references a SpecID that does not exist in F13.mgf, so the + // SpectraAccessor returns null and the parser produces a clean + // "is not available!" error rather than crashing. + Path mgfSrc = new File(F13_MGF_RESOURCE_PATH).toPath(); + assertTrue("test fixture missing: " + F13_MGF_RESOURCE_PATH, Files.exists(mgfSrc)); + + File specDir = tmp.newFolder("specs"); + Path mgfDest = new File(specDir, "F13.mgf").toPath(); + Files.copy(mgfSrc, mgfDest); + + File tsv = writeTsv( + "#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n" + + "F13.mgf\tindex=99999999\tK.PEPTIDE.K\t2\t0.001\n"); + + AnnotatedSpectra parser = new AnnotatedSpectra( + new File[]{tsv}, specDir, AminoAcidSet.getStandardAminoAcidSet()); + String err = parser.parseFile(tsv); + + assertNotNull("missing spec ID should yield an error", err); + assertTrue("error should name the file and id; got: " + err, + err.contains("F13.mgf:index=99999999") && err.contains("is not available")); + } + + @Test + public void parseTopLevelHandlesValidEmptyTsvCleanly() throws IOException { + // The high-level parse() method aggregates per-file results; an empty + // (header-only) TSV produces no PSMs and no error. 
+ File tsv = writeTsv("#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n"); + + AnnotatedSpectra parser = newWith(tsv); + String err = parser.parse(); + + assertNull("empty but well-formed TSV should not error; got: " + err, err); + assertNotNull(parser.getAnnotatedSpecContainer()); + assertTrue(parser.getAnnotatedSpecContainer().isEmpty()); + } + + @Test + public void parseAggregatesMultipleFiles() throws IOException { + // Two empty-but-valid TSVs — both succeed, container is empty, + // no exceptions propagate. Proves the multi-file iteration loop works. + File tsv1 = writeTsv("#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n"); + File tsv2 = tmp.newFile("results2.tsv"); + Files.write(tsv2.toPath(), + "#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n".getBytes(StandardCharsets.UTF_8)); + + AnnotatedSpectra parser = newWith(tsv1, tsv2); + String err = parser.parse(); + + assertNull("aggregating empty TSVs should not error; got: " + err, err); + assertTrue(parser.getAnnotatedSpecContainer().isEmpty()); + } + + @Test + public void parseFailFastOnFirstErrorWhenDropErrorsDisabled() throws IOException { + // Two TSVs, first is invalid (.mzid) → parse() returns the first + // error and stops. 
+ File mzid = tmp.newFile("first.mzid"); + Files.write(mzid.toPath(), "".getBytes(StandardCharsets.UTF_8)); + File tsv = writeTsv("#SpecFile\tSpecID\tPeptide\tCharge\tQValue\n"); + + AnnotatedSpectra parser = new AnnotatedSpectra( + new File[]{mzid, tsv}, tmp.getRoot(), AminoAcidSet.getStandardAminoAcidSet()); + String err = parser.parse(); + + assertNotNull(err); + assertTrue("should surface the mzid-rejection message; got: " + err, + err.contains("mzid input is not supported in this build")); + } +} diff --git a/src/test/java/edu/ucsd/msjava/ui/TestScoringParamGenRecovery.java b/src/test/java/edu/ucsd/msjava/ui/TestScoringParamGenRecovery.java new file mode 100644 index 00000000..2ad6d563 --- /dev/null +++ b/src/test/java/edu/ucsd/msjava/ui/TestScoringParamGenRecovery.java @@ -0,0 +1,221 @@ +package edu.ucsd.msjava.ui; + +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.io.IOException; +import java.nio.charset.StandardCharsets; +import java.nio.file.Files; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertNull; +import static org.junit.Assert.assertTrue; + +/** + * Integration tests for the restored {@link ScoringParamGen} CLI front door. + * + * Scope: the recovery surface only — argument parsing, validation of required + * options, error paths for invalid enums (activation method, instrument, + * enzyme, protocol), and the TSV/spectrum-dir existence checks. The actual + * .param model generation is not exercised; that requires real annotated + * training PSMs which are not part of this fixture set. 
+ */ +public class TestScoringParamGenRecovery { + + @Rule + public TemporaryFolder tmp = new TemporaryFolder(); + + private File makeEmptyTsv() throws IOException { + File f = tmp.newFile("results.tsv"); + Files.write(f.toPath(), new byte[0]); + return f; + } + + private File makeSpecDir() throws IOException { + return tmp.newFolder("spectra"); + } + + @Test + public void runWithoutArgsReportsMissingI() { + assertEquals("missing -i (training result TSV files)", + ScoringParamGen.run(new String[]{})); + } + + @Test + public void runWithSingleDanglingArgReportsInvalidParameter() { + // The arg parser requires key/value pairs; a lone "-i" is malformed. + assertEquals("Invalid parameter: -i", + ScoringParamGen.run(new String[]{"-i"})); + } + + @Test + public void runWithBareTokenReportsInvalidParameter() { + // First arg must start with "-". "junk" is a bare positional, rejected. + assertEquals("Invalid parameter: junk", + ScoringParamGen.run(new String[]{"junk", "value"})); + } + + @Test + public void runWithUnknownOptionRejected() throws IOException { + File tsv = makeEmptyTsv(); + String err = ScoringParamGen.run(new String[]{"-i", tsv.getPath(), "-bogus", "x"}); + assertEquals("Unknown option: -bogus", err); + } + + @Test + public void runWithMissingInputFileRejected() { + String err = ScoringParamGen.run(new String[]{"-i", "/nonexistent/path/to/results.tsv"}); + assertNotNull(err); + assertTrue("error should mention the bad path; got: " + err, + err.startsWith("Input file does not exist:")); + } + + @Test + public void runWithMissingSpecDirRejected() throws IOException { + File tsv = makeEmptyTsv(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", "/nonexistent/directory"}); + assertNotNull(err); + assertTrue("error should mention the missing dir; got: " + err, + err.startsWith("Spectrum directory does not exist")); + } + + @Test + public void runWithInvalidActivationMethodRejected() throws IOException { + File tsv = makeEmptyTsv(); 
+ File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "FAKE_ACTIVATION"}); + assertEquals("Unrecognized activation method: FAKE_ACTIVATION", err); + } + + @Test + public void runWithInvalidInstrumentRejected() throws IOException { + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "FAKE_INSTRUMENT"}); + assertEquals("Unrecognized instrument type: FAKE_INSTRUMENT", err); + } + + @Test + public void runWithInvalidEnzymeRejected() throws IOException { + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "QExactive", + "-e", "FAKE_ENZYME"}); + assertEquals("Unrecognized enzyme: FAKE_ENZYME", err); + } + + @Test + public void runWithInvalidProtocolRejected() throws IOException { + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "QExactive", + "-e", "Tryp", + "-protocol", "FAKE_PROTOCOL"}); + assertEquals("Unrecognized protocol: FAKE_PROTOCOL", err); + } + + @Test + public void runWithNonNumericThreadRejected() throws IOException { + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "QExactive", + "-e", "Tryp", + "-thread", "abc"}); + assertEquals("-thread must be an integer", err); + } + + @Test + public void runWithMissingDReportedAfterIIsValid() throws IOException { + File tsv = makeEmptyTsv(); + String err = ScoringParamGen.run(new String[]{"-i", tsv.getPath()}); + assertEquals("missing -d (spectrum directory)", err); + } + + @Test + public void runWithMissingMReportedAfterIDIsValid() throws 
IOException { + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath()}); + assertEquals("missing -m (activation method)", err); + } + + @Test + public void runWithAllRequiredArgsButEmptyTsvFailsAtParse() throws IOException { + // All CLI args validate; the empty TSV (no header line) fails inside + // AnnotatedSpectra.parseFile with "Not a valid tsv result file". + // This proves the CLI hands off to the parser correctly post-recovery. + File tsv = makeEmptyTsv(); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "QExactive", + "-e", "Tryp"}); + assertNotNull("expected a parse error to bubble up; got null", err); + assertTrue("error should come from TSV parsing; got: " + err, + err.contains("Not a valid tsv result file") + || err.startsWith("Error while parsing")); + } + + @Test + public void runWithNoMatchingPsmsExitsCleanlyBeforeTraining() throws IOException { + // Build a valid-shaped TSV whose only row has FDR > 0.01 (the trainer's + // FDR floor). AnnotatedSpectra accepts the file, filters the row out, + // and the CLI returns "No results to train on" *without* invoking the + // ScoringParameterGeneratorWithErrors trainer. Proves the full + // arg-parse + handoff + early-exit chain works end to end. + File tsv = tmp.newFile("noresults.tsv"); + String content = String.join("\t", "#SpecFile", "SpecID", "Peptide", "Charge", "QValue") + "\n" + + String.join("\t", "synth.mgf", "index=0", "K.PEPTIDE.K", "2", "0.99") + "\n"; + Files.write(tsv.toPath(), content.getBytes(StandardCharsets.UTF_8)); + File dir = makeSpecDir(); + String err = ScoringParamGen.run(new String[]{ + "-i", tsv.getPath(), + "-d", dir.getPath(), + "-m", "HCD", + "-inst", "QExactive", + "-e", "Tryp"}); + assertEquals("No results to train on. 
Exiting.", err); + } + + @Test + public void mainWithEmptyArgsDoesNotThrow() { + // Smoke check: empty args triggers printUsageInfo and returns. Should + // not throw or call System.exit. + ScoringParamGen.main(new String[]{}); + } + + @Test + public void mainWithHelpFlagDoesNotThrow() { + ScoringParamGen.main(new String[]{"-h"}); + ScoringParamGen.main(new String[]{"--help"}); + ScoringParamGen.main(new String[]{"-help"}); + } +} diff --git a/src/test/java/msgfplus/TestScoringParamGenSmoke.java b/src/test/java/msgfplus/TestScoringParamGenSmoke.java new file mode 100644 index 00000000..259ea5d6 --- /dev/null +++ b/src/test/java/msgfplus/TestScoringParamGenSmoke.java @@ -0,0 +1,387 @@ +package msgfplus; + +import edu.ucsd.msjava.msscorer.NewRankScorer; +import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType; +import edu.ucsd.msjava.msscorer.ScoringParameterGeneratorWithErrors; +import edu.ucsd.msjava.msutil.ActivationMethod; +import edu.ucsd.msjava.msutil.AminoAcidSet; +import edu.ucsd.msjava.msutil.Composition; +import edu.ucsd.msjava.msutil.Enzyme; +import edu.ucsd.msjava.msutil.InstrumentType; +import edu.ucsd.msjava.msutil.Peak; +import edu.ucsd.msjava.msutil.Peptide; +import edu.ucsd.msjava.msutil.Protocol; +import edu.ucsd.msjava.msutil.SpectraContainer; +import edu.ucsd.msjava.msutil.Spectrum; +import org.junit.Rule; +import org.junit.Test; +import org.junit.rules.TemporaryFolder; + +import java.io.File; +import java.util.ArrayList; +import java.util.List; +import java.util.Random; + +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertNotNull; +import static org.junit.Assert.assertTrue; + +/** + * End-to-end smoke test for the recovered scorer trainer. 
+ *
+ * The other recovery tests
+ * ({@link edu.ucsd.msjava.ui.TestScoringParamGenRecovery},
+ * {@link edu.ucsd.msjava.msutil.TestAnnotatedSpectraRecovery})
+ * cover argument parsing and TSV-error paths but stop short of actually
+ * invoking
+ * {@link ScoringParameterGeneratorWithErrors#generateParameters(SpectraContainer,
+ * SpecDataType, AminoAcidSet, File, boolean, boolean)}.
+ *
+ * This test closes that gap. It builds a synthetic in-memory
+ * {@link SpectraContainer} of annotated spectra (no MGF/mzML on disk, no
+ * MS-GF+ search), feeds it directly to the trainer, and verifies that:
+ *
+ * 1. {@code generateParameters} runs to completion without throwing.
+ * 2. A {@code .param} file is produced on disk and is non-empty.
+ * 3. The companion {@code .param.txt} debug dump is also produced.
+ * 4. The binary {@code .param} parses cleanly via {@link NewRankScorer},
+ *    proving the on-disk format is round-trippable by the same code that
+ *    MS-GF+ uses to load shipped scoring models at search time.
+ *
+ * Approach rationale (Approach A in the planning notes): synthetic spectra
+ * are sufficient because the trainer's invariants are all numeric --- it
+ * computes b/y ion frequencies, rank distributions, and precursor offset
+ * frequencies from (peptide, peak list) pairs without verifying that the
+ * peptide is the "correct" assignment. A grid of distinct tryptic peptides
+ * with realistic b/y peak placement at varied intensities exercises every
+ * code path in the trainer (partition, precursorOFF, ion-type selection,
+ * rank distribution, smoothing, write) at low cost.
+ *
+ * The {@code MIN_NUM_SPECTRA_PER_PARTITION = 400} threshold inside the
+ * trainer is the lower bound on input size, but the trainer also dedupes
+ * to {@code numSpecsPerPeptide = 3} spectra per (peptide, charge) pair
+ * before partitioning. So the unique-peptide count is the dominant knob:
+ * 200 unique peptides * 3 reps = 600 spectra retained, comfortably above
+ * the 360 floor. The {@code IonProbability.getIonProb} path further
+ * requires {@code numObservedPeaks + numMissingPeaks > 1000} per ion type
+ * per segment for that ion type to be selected; with peptides of length
+ * 8-12 and full b+y ion ladders, 200 unique peptides produce well over
+ * that threshold.
+ *
+ * We use {@link InstrumentType#LOW_RESOLUTION_LTQ} ("LowRes") because:
+ * - it sets {@code errorScalingFactor = 0} inside the trainer, skipping
+ *   the (slow, optional) ion-error and noise-error distribution stages,
+ * - it sets {@code applyDeconvolution = false}, skipping a full container
+ *   rebuild,
+ * - the resulting .param file still exercises every required write path
+ *   (charge histogram, partition table, precursor OFF, fragment OFF, rank
+ *   distributions, no error dist).
+ *
+ * Runtime target: under 60s on a developer laptop.
+ */
+public class TestScoringParamGenSmoke {
+
+    @Rule
+    public TemporaryFolder tmp = new TemporaryFolder();
+
+    /**
+     * Residues used as the variable "body" of synthetic tryptic peptides.
+     * Excludes K and R (those are reserved for the C-terminus to keep the
+     * generated sequences tryptic) and excludes I/L collisions / U/O
+     * non-standard codes. Cysteine is also omitted to avoid the standard
+     * AA set's lack of fixed carbamidomethylation distorting masses in a
+     * way that the trainer's nominal-mass binning might handle awkwardly.
+     */
+    private static final char[] BODY_RESIDUES =
+            {'A', 'D', 'E', 'F', 'G', 'H', 'L', 'M', 'N', 'P',
+             'Q', 'S', 'T', 'V', 'W', 'Y'};
+
+    /**
+     * Generates a deterministic pool of {@code count} distinct synthetic
+     * tryptic peptides. Each peptide has length 8-12, body residues drawn
+     * from {@link #BODY_RESIDUES}, and ends in K or R.
+     *
+     * The trainer keeps at most {@code numSpecsPerPeptide} (= 3 for
+     * fixtures with under 2000 unique peptides) {@code (peptide, charge)}
+     * pairs in its training set, so the unique-peptide count is the real
+     * upper bound on usable training spectra after dedup. To clear the
+     * {@code MIN_NUM_SPECTRA_PER_PARTITION} = 400 partitioning floor we
+     * need ~140+ unique peptides (140 * 3 = 420), which this generator
+     * supplies easily.
+     */
+    private static List<String> generateSyntheticTrypticPeptides(int count, long seed) {
+        Random rnd = new Random(seed);
+        java.util.LinkedHashSet<String> peps = new java.util.LinkedHashSet<String>();
+        // Defensive guard: we ask for distinct peptides, so make sure the
+        // alphabet is large enough that collisions don't loop forever.
+        // 16^8 ~= 4e9 distinct length-8 bodies; we won't realistically
+        // exceed that.
+        while (peps.size() < count) {
+            int len = 8 + rnd.nextInt(5); // 8..12 residues total
+            StringBuilder sb = new StringBuilder(len);
+            for (int i = 0; i < len - 1; i++) {
+                sb.append(BODY_RESIDUES[rnd.nextInt(BODY_RESIDUES.length)]);
+            }
+            sb.append(rnd.nextBoolean() ? 'K' : 'R');
+            peps.add(sb.toString());
+        }
+        return new ArrayList<String>(peps);
+    }
+
+    /**
+     * Builds a single synthetic annotated MS2 spectrum for the given peptide.
+     *
+     * @param peptideStr     the unmodified peptide sequence (e.g.
+     *                       "AAGDPLQNK")
+     * @param charge         precursor charge state (typically 2)
+     * @param aaSet          amino acid set used for residue masses
+     * @param noisePeakCount number of random noise peaks to scatter alongside
+     *                       the b/y ions; small (3-5) is enough to keep the
+     *                       trainer's noise-distribution stage exercised
+     * @param rnd            shared RNG for reproducible noise placement
+     */
+    private static Spectrum buildSyntheticSpectrum(String peptideStr,
+                                                   int charge,
+                                                   AminoAcidSet aaSet,
+                                                   int noisePeakCount,
+                                                   Random rnd) {
+        Peptide pep = new Peptide(peptideStr, aaSet);
+        float parentMass = pep.getParentMass(); // M = sum(residues) + H2O
+        float proton = (float) Composition.ChargeCarrierMass();
+        float precursorMz = (parentMass + charge * proton) / charge;
+
+        Spectrum spec = new Spectrum();
+        spec.setPrecursor(new Peak(precursorMz, 1.0f, charge));
+        spec.setActivationMethod(ActivationMethod.HCD);
+        spec.setAnnotation(pep);
+
+        // b ion ladder: prefix mass + proton.
+        float prefix = 0f;
+        for (int i = 0; i < pep.size() - 1; i++) {
+            prefix += pep.get(i).getMass();
+            float bMz = prefix + proton;
+            // Vary intensities so setRanksOfPeaks produces a non-trivial ranking.
+            float intensity = 100f + (i * 17f) + rnd.nextFloat() * 50f;
+            spec.add(new Peak(bMz, intensity, 1));
+        }
+
+        // y ion ladder: suffix mass + H2O + proton.
+        float suffix = 0f;
+        float h2oPlusProton = (float) Composition.H2O + proton;
+        for (int i = 0; i < pep.size() - 1; i++) {
+            suffix += pep.get(pep.size() - 1 - i).getMass();
+            float yMz = suffix + h2oPlusProton;
+            float intensity = 200f + (i * 13f) + rnd.nextFloat() * 50f;
+            spec.add(new Peak(yMz, intensity, 1));
+        }
+
+        // A few low-intensity noise peaks to keep the noise distribution
+        // (rankDistTable's IonType.NOISE entry, generated by generateRankDist)
+        // populated with non-zero counts.
+        for (int i = 0; i < noisePeakCount; i++) {
+            float mz = 100f + rnd.nextFloat() * (precursorMz - 100f);
+            float intensity = 5f + rnd.nextFloat() * 15f;
+            spec.add(new Peak(mz, intensity, 1));
+        }
+
+        spec.setRanksOfPeaks();
+        // Peaks must be in ascending m/z order for getPeakByMass binary search.
+        java.util.Collections.sort(spec, new Peak.MassComparator());
+        return spec;
+    }
+
+    /**
+     * Build {@code uniquePeptides * spectraPerPeptide} synthetic annotated
+     * spectra at the given charge state.
+     *
+     * The trainer dedupes to {@code numSpecsPerPeptide} per (peptide,
+     * charge) pair internally (= 3 for fixtures under 2000 unique
+     * peptides), so {@code spectraPerPeptide} should be 3 to ensure all
+     * generated spectra are retained after dedup. {@code uniquePeptides}
+     * therefore directly determines how many spectra survive into
+     * partition().
+     */
+    private static SpectraContainer buildSyntheticContainer(int uniquePeptides,
+                                                            int spectraPerPeptide,
+                                                            int charge,
+                                                            AminoAcidSet aaSet,
+                                                            long seed) {
+        // Deterministic RNG so the test is reproducible across runs / CI.
+        Random rnd = new Random(seed);
+        List<String> peptides = generateSyntheticTrypticPeptides(uniquePeptides, seed);
+        SpectraContainer container = new SpectraContainer();
+        for (String pep : peptides) {
+            for (int rep = 0; rep < spectraPerPeptide; rep++) {
+                container.add(buildSyntheticSpectrum(pep, charge, aaSet, 4, rnd));
+            }
+        }
+        return container;
+    }
+
+    @Test
+    public void generatesParamFileFromSyntheticAnnotatedSpectra() throws Exception {
+        // The trainer's MIN_NUM_SPECTRA_PER_PARTITION (400) is the dominant
+        // input-size constraint, and the trainer dedupes to 3 spectra per
+        // (peptide, charge) pair before partitioning. 200 unique peptides
+        // * 3 reps = 600 retained spectra, comfortably above the floor and
+        // well under 60s end-to-end.
+        final int uniquePeptides = 200;
+        final int spectraPerPeptide = 3;
+        final int charge = 2;
+        final int totalSpectra = uniquePeptides * spectraPerPeptide;
+
+        AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSet();
+        SpectraContainer container = buildSyntheticContainer(
+                uniquePeptides, spectraPerPeptide, charge, aaSet, 42L);
+
+        // Sanity: the container is well-populated and every spectrum is annotated.
+        assertEquals(totalSpectra, container.size());
+        for (Spectrum s : container) {
+            assertNotNull("every synthetic spectrum must carry a Peptide annotation", s.getAnnotation());
+            assertEquals(charge, s.getCharge());
+            assertTrue("every synthetic spectrum must have b/y peaks", s.size() > 0);
+        }
+
+        // LowRes: errorScalingFactor=0, no deconvolution. Tryp + Standard
+        // protocol = the simplest happy path through the trainer.
+        SpecDataType dataType = new SpecDataType(
+                ActivationMethod.HCD,
+                InstrumentType.LOW_RESOLUTION_LTQ,
+                Enzyme.TRYPSIN,
+                Protocol.STANDARD);
+
+        File outDir = tmp.newFolder("trainerOut");
+
+        long t0 = System.currentTimeMillis();
+        ScoringParameterGeneratorWithErrors.generateParameters(
+                container,
+                dataType,
+                aaSet,
+                outDir,
+                /* isText = */ false,
+                /* verbose = */ false);
+        long elapsedMs = System.currentTimeMillis() - t0;
+
+        // Soft runtime guard: if we ever drift over ~45s on this 600-spectrum
+        // input something has likely regressed badly. The constraint in the
+        // test plan is <60s end-to-end.
+        assertTrue("trainer took too long: " + elapsedMs + " ms (expected < 45000)",
+                elapsedMs < 45_000);
+
+        // The .param file is written as "<dataType.toString()>.param" inside
+        // outDir. For non-Standard protocols, dataType.toString() includes a
+        // trailing "_" suffix; the Standard protocol omits it per
+        // SpecDataType.toString().
+        String expectedName = dataType.toString() + ".param";
+        File paramFile = new File(outDir, expectedName);
+        assertTrue("expected .param file was not produced at " + paramFile.getAbsolutePath(),
+                paramFile.exists() && paramFile.isFile());
+        assertTrue(".param file is empty", paramFile.length() > 0);
+
+        File paramTextFile = new File(outDir, expectedName + ".txt");
+        assertTrue("companion .param.txt debug dump was not produced",
+                paramTextFile.exists() && paramTextFile.length() > 0);
+
+        // Round-trip: reload via NewRankScorer. This exercises the same code
+        // path MS-GF+ uses at search time to read a shipped .param resource,
+        // so a successful constructor here proves the trainer's binary
+        // output is structurally valid (charge histogram, partitions,
+        // precursor OFF, fragment OFF, rank distributions, error dist
+        // section absent because errorScalingFactor=0).
+        NewRankScorer reloaded = new NewRankScorer(paramFile.getPath());
+        assertNotNull(reloaded);
+        assertNotNull("reloaded scorer must expose its dataType", reloaded.getSpecDataType());
+        assertEquals("activation method round-tripped",
+                ActivationMethod.HCD, reloaded.getSpecDataType().getActivationMethod());
+        assertEquals("instrument type round-tripped",
+                InstrumentType.LOW_RESOLUTION_LTQ, reloaded.getSpecDataType().getInstrumentType());
+        assertEquals("enzyme round-tripped",
+                Enzyme.TRYPSIN, reloaded.getSpecDataType().getEnzyme());
+        assertNotNull("reloaded scorer must expose a non-null partition set",
+                reloaded.getParitionSet());
+        assertTrue("reloaded scorer must have at least one partition",
+                reloaded.getParitionSet().size() > 0);
+    }
+
+    @Test
+    public void debugTextDumpReferencesActivationAndEnzyme() throws Exception {
+        // Smaller second @Test that re-runs the trainer with a slightly
+        // different shape and inspects the human-readable .param.txt to
+        // confirm the trainer wrote sensible header lines.
+        // This run is kept cheap with a deliberately smaller input, since
+        // the trainer dominates the suite's runtime and we already paid for
+        // a full-size run above.
+        final int uniquePeptides = 180;
+        final int spectraPerPeptide = 3;
+        final int charge = 2;
+        AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSet();
+        SpectraContainer container = buildSyntheticContainer(
+                uniquePeptides, spectraPerPeptide, charge, aaSet, 1337L);
+
+        SpecDataType dataType = new SpecDataType(
+                ActivationMethod.CID,
+                InstrumentType.LOW_RESOLUTION_LTQ,
+                Enzyme.TRYPSIN,
+                Protocol.STANDARD);
+        File outDir = tmp.newFolder("trainerOut2");
+
+        ScoringParameterGeneratorWithErrors.generateParameters(
+                container, dataType, aaSet, outDir, false, false);
+
+        File paramTextFile = new File(outDir, dataType.toString() + ".param.txt");
+        assertTrue("companion .param.txt debug dump was not produced",
+                paramTextFile.exists() && paramTextFile.length() > 0);
+
+        String contents = new String(java.nio.file.Files.readAllBytes(paramTextFile.toPath()));
+        assertTrue(".param.txt should mention the activation method; got first 200 chars: "
+                        + contents.substring(0, Math.min(200, contents.length())),
+                contents.contains("CID"));
+        assertTrue(".param.txt should mention the enzyme",
+                contents.contains("Tryp"));
+        assertTrue(".param.txt should include a charge histogram section",
+                contents.contains("ChargeHistogram"));
+    }
+
+    /**
+     * Documents the trainer's effective lower bound on input size. After the
+     * (peptide, charge) dedup ({@code numSpecsPerPeptide = 3} for fixtures
+     * under 2000 unique peptides), the trainer needs ~360 surviving spectra
+     * (= {@code MIN_NUM_SPECTRA_PER_PARTITION * 0.9}) at one charge to
+     * produce any partition at all; below that, partition() skips the
+     * charge entirely and {@code writeParameters} dies via assert.
+     *
+     * 130 unique peptides * 3 reps = 390 retained spectra -- just clears
+     * the floor with margin for the rounding in the {@code 0.9f} factor.
+     * The test makes the implicit contract explicit so future readers don't
+     * accidentally shrink fixtures below it.
+     */
+    @Test
+    public void trainerHandlesInputSizeJustAboveMinimumPartitionThreshold() throws Exception {
+        final int uniquePeptides = 130;
+        final int spectraPerPeptide = 3;
+        final int charge = 2;
+        AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSet();
+        SpectraContainer container = buildSyntheticContainer(
+                uniquePeptides, spectraPerPeptide, charge, aaSet, 7L);
+
+        SpecDataType dataType = new SpecDataType(
+                ActivationMethod.HCD,
+                InstrumentType.LOW_RESOLUTION_LTQ,
+                Enzyme.TRYPSIN,
+                Protocol.STANDARD);
+        File outDir = tmp.newFolder("trainerOut3");
+
+        ScoringParameterGeneratorWithErrors.generateParameters(
+                container, dataType, aaSet, outDir, false, false);
+
+        File paramFile = new File(outDir, dataType.toString() + ".param");
+        assertTrue("trainer should still emit a .param at the input-size boundary",
+                paramFile.exists() && paramFile.length() > 0);
+
+        NewRankScorer reloaded = new NewRankScorer(paramFile.getPath());
+        assertNotNull(reloaded);
+        assertNotNull(reloaded.getParitionSet());
+        // With 390 spectra at one charge and numSegments=2, the partition
+        // routine produces 2 partitions (one per segment).
+        assertTrue("expected at least one partition; got " + reloaded.getParitionSet().size(),
+                reloaded.getParitionSet().size() >= 1);
+    }
+}