diff --git a/.claude/plans/parameter-modernization-flag-inventory.md b/.claude/plans/parameter-modernization-flag-inventory.md
new file mode 100644
index 00000000..68ac2d6d
--- /dev/null
+++ b/.claude/plans/parameter-modernization-flag-inventory.md
@@ -0,0 +1,90 @@
+# MS-GF+ flag inventory (Phase 1 input)
+
+Snapshot of every flag registered by `ParamManager.addMSGFPlusParams()`
+plus the parsing semantics each one currently relies on. This is the
+foundation document for the Phase 1 picocli rewrite described in
+`parameter-modernization.md`. Total: 39 flags (31 visible + 8 hidden,
+counting the legacy `-u`). Required: `-s`, `-d`.
+
+## Visible flags
+
+| Short | Canonical name | Type | Default | Bounds | Notes |
+|---|---|---|---|---|---|
+| `-conf` | `ConfigurationFile` | file | — | exists | Config file; CLI overrides config |
+| `-s` | `SpectrumFile` | file/dir | — | exists | **Required.** mzML/mzXML/mgf/ms2/pkl/_dta.txt or directory |
+| `-d` | `DatabaseFile` | file | — | exists | **Required.** *.fasta / *.fa / *.faa |
+| `-decoy` | `DecoyPrefix` | string | `DECOY_` | — | Decoy protein prefix |
+| `-o` | `OutputFile` | file | `.pin` | — | *.pin (default) or *.tsv |
+| `-t` | `PrecursorMassTolerance` | tolerance | `20ppm` | ≥0 | Symmetric (`20ppm`) or asymmetric (`0.5Da,2.5Da`); units must match |
+| `-ti` | `IsotopeErrorRange` | int range | `0,1` | ≥0, max-incl | Isotope-error window, both ends inclusive |
+| `-m` | `FragmentationMethodID` | dyn-enum | `ASWRITTEN` | — | 0=as-written, 1=CID, 2=ETD, 3=HCD |
+| `-inst` | `InstrumentID` | dyn-enum | `LOW_RES_LTQ` | registry | `InstrumentType` registry-driven |
+| `-e` | `EnzymeID` | dyn-enum | `TRYPSIN` | registry | `Enzyme` registry-driven |
+| `-protocol` | `ProtocolID` | dyn-enum | `AUTOMATIC` | registry | `Protocol` registry-driven |
+| `-ntt` | `NTT` | enum | `2` | 0..2 | Number of tolerable termini |
+| `-mod` | `ModificationFile` | file | built-in (C+57) | exists | Mod file; config-file path also accepts `StaticMod=`/`DynamicMod=`/`CustomAA=` |
+| `-minLength` | `MinPepLength` | int | `6` | ≥1 | |
+| `-maxLength` | `MaxPepLength` | int | `40` | ≥1 | |
+| `-minCharge` | `MinCharge` | int | `2` | ≥1 | |
+| `-maxCharge` | `MaxCharge` | int | `3` | ≥1 | |
+| `-n` | `NumMatchesPerSpec` | int | `1` | ≥1 | |
+| `-thread` | `NumThreads` | int | `Runtime.getRuntime().availableProcessors()` | ≥1 | |
+| `-tasks` | `NumTasks` | int | `0` (auto) | ≥-10 | 0=auto, >0=fixed, <0=N×threads |
+| `-minSpectraPerThread` | `MinSpectraPerThread` | int | `250` | ≥1 | |
+| `-verbose` | `Verbose` | enum | `0` | 0..1 | 0=total, 1=per-thread |
+| `-tda` | `TDA` | enum | `0` | 0..1 | 0=no decoy, 1=concat decoy search |
+| `-addFeatures` | `AddFeatures` | enum | `0` | 0..1 | Percolator extra features |
+| `-outputFormat` | `OutputFormat` | enum | `pin` | pin/tsv | mzIdentML removed |
+| `-precursorCal` | `PrecursorCal` | string | `auto` | auto/on/off | Case-insensitive |
+| `-ccm` | `ChargeCarrierMass` | double | `1.00727649` | >0.1 | Proton mass default |
+| `-maxMissedCleavages` | `MaxMissedCleavages` | int | `-1` | ≥-1 | -1 = unlimited |
+| `-numMods` | `NumMods` | int | `3` | ≥0 | Max dynamic mods per peptide |
+| `-allowDenseCentroidedPeaks` | `AllowDenseCentroidedPeaks` | enum | `0` | 0..1 | |
+| `-msLevel` | `MSLevel` | int range | `2,2` | ≥1, max-incl | `min,max` or single |
+| `-u` | `PrecursorMassToleranceUnits` | enum | `2` | 0..2 | **Hidden** — legacy; 0=Da, 1=ppm, 2=as-written |
+
+## Hidden flags
+
+| Short | Canonical name | Type | Default | Notes |
+|---|---|---|---|---|
+| `-dd` | `DBIndexDir` | dir | — | Database index dir |
+| `-index` | `SpecIndex` | int range | `1,INT_MAX-1` | Spectrum index range, both inclusive |
+| `-edgeScore` | `EdgeScore` | enum | `0` | 0=use, 1=skip |
+| `-minNumPeaks` | `MinNumPeaksPerSpectrum` | int | `Constants.MIN_NUM_PEAKS_PER_SPECTRUM` | |
+| `-iso` | `NumIsoforms` | int | `Constants.NUM_VARIANTS_PER_PEPTIDE` | |
+| `-ignoreMetCleavage` | `IgnoreMetCleavage` | enum | `0` | 0=consider, 1=ignore |
+| `-minDeNovoScore` | `MinDeNovoScore` | int | `Constants.MIN_DE_NOVO_SCORE` | |
+
+## Sharp edges the picocli rewrite must preserve
+
+1. **Asymmetric tolerance.** `-t 0.5Da,2.5Da` → left tolerance (observed < theoretical) ≠ right tolerance. Both sides must use the same unit. Numeric-only value (e.g. `20`) defaults to Da. Trailing unit suffix is case-insensitive (`Da`/`ppm`/`Th`).
+2. **Range inclusivity is per-flag.** `IntRangeParameter` defaults to `min` inclusive / `max` exclusive, but `-ti`, `-index`, `-msLevel` flip max to inclusive via `.setMaxInclusive()`.
+3. **Dynamic enums.** `-inst`, `-e`, `-protocol`, `-m` are registry-driven (`InstrumentType`, `Enzyme`, `Protocol`, `ActivationMethod`). Numeric indices depend on registry load order; help text is generated at startup. Picocli converters must read from the same registries, not hardcode indices.
+4. **`OutputFormat` legacy mapping is gone.** Old `0=mzIdentML`, `2=both` are no longer accepted; only `pin` (0) and `tsv` (1) remain. Numeric indices are deprecated but still parse internally.
+5. **`-precursorCal` is a string, not an enum class.** Values: `auto` / `on` / `off` (case-insensitive, `.trim()`-ed). `auto` means "run pre-pass, apply only if ≥200 confident PSMs collected".
+6. **Trailing `!` on numbers.** `IntParameter` and `DoubleParameter` strip trailing `!` (legacy DMS config-file integration). Decide if Phase 1 keeps this quirk.
+7. **`-tasks` semantics.** `0` = auto, `>0` = fixed, `<0` = `N × threads`. Range allows down to `-10`.
+8. **Config-file-only entries.** `StaticMod=`, `DynamicMod=`, `CustomAA=` are not CLI flags. They're parsed from `-mod` file and `-conf` config file only. Repeated entries are *expected* (each line is a separate mod). Config parser preserves order.
+9. **Config-file aliases (canonical-name normalization in `ParamNameEnum.getParamNameFromLine()`).** Auto-renames at least 13 deprecated keys:
+ - `IsotopeError` → `IsotopeErrorRange`
+ - `TargetDecoyAnalysis` → `TDA`
+ - `FragmentationMethod` → `FragmentationMethodID`
+ - `Instrument` → `InstrumentID`
+ - `Enzyme` → `EnzymeID`
+ - `Protocol` → `ProtocolID`
+ - `NumTolerableTermini` → `NTT`
+ - `MinNumPeaks` → `MinNumPeaksPerSpectrum`
+ - `MaxNumMods` / `MaxNumModsPerPeptide` → `NumMods`
+ - `minLength` / `MinPeptideLength` → `MinPepLength`
+ - `maxLength` / `MaxPeptideLength` → `MaxPepLength`
+ - `PMTolerance` / `ParentMassTolerance` → `PrecursorMassTolerance`
+10. **File-format validation chain.** Order: directory-vs-file → format-suffix match → existence → no-reuse. Suffix matching is case-insensitive for `.pin`/`.tsv`/`.fasta`. Spec parameter auto-allows directories.
+11. **Defaults that depend on runtime.** `-thread` defaults to `Runtime.getRuntime().availableProcessors()` (includes hyperthreading; per CLAUDE.md, physical cores often give better wall-time).
+12. **Help-text drift.** Existing tests likely compare exact `--help` output. picocli's formatter is different. Decide: snapshot-update vs. custom renderer that mimics current format.
+
+## Out-of-scope reminders for Phase 1
+
+- `MSGFDB`, `MSGF`, `MSGFLib` entry points share `ParamManager`. Phase 1 only modernizes `MSGFPlus`; the other three keep using `ParamManager.parseParams()` until Phase 4.
+- Config-file parsing is Phase 2. Phase 1 covers CLI only.
+- The `Parameter` / `IntParameter` / `IntRangeParameter` / `ToleranceParameter` / etc. hierarchy is **not** removed in Phase 1. Removal is Phase 3.
+- `ParamManager` itself stays. Phase 1 adds an adapter that produces a populated `ParamManager` from the typed `MSGFPlusOptions`, so `SearchParams.parse(ParamManager)` is unchanged.
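The asymmetric-tolerance rules in sharp edge 1 of the flag inventory (same unit on both sides, numeric-only input defaulting to Da, case-insensitive `Da`/`ppm`/`Th` suffix, non-negative values) can be sketched in plain Java as below. This is a minimal illustration under those stated rules, not the existing `ToleranceParameter` and not a picocli converter; `ToleranceSketch`, `Side`, and `Tolerance` are hypothetical names introduced only for this example.

```java
import java.util.Locale;

// Hypothetical sketch of the "-t" parsing rules; not MS-GF+'s ToleranceParameter.
final class ToleranceSketch {
    /** One side of the window: a non-negative value plus its unit. */
    static final class Side {
        final double value;
        final String unit; // "Da", "ppm", or "Th"
        Side(double value, String unit) { this.value = value; this.unit = unit; }
    }

    /** Left/right pair; a symmetric input yields the same side twice. */
    static final class Tolerance {
        final Side left;
        final Side right;
        Tolerance(Side left, Side right) { this.left = left; this.right = right; }
    }

    private static final String[] UNITS = {"ppm", "Da", "Th"};

    static Side parseSide(String text) {
        String s = text.trim();
        String lower = s.toLowerCase(Locale.ROOT);
        String unit = "Da";   // numeric-only input defaults to Da
        String number = s;
        for (String u : UNITS) {
            if (lower.endsWith(u.toLowerCase(Locale.ROOT))) {
                unit = u;     // suffix match ignores case
                number = s.substring(0, s.length() - u.length());
                break;
            }
        }
        double value = Double.parseDouble(number.trim()); // NumberFormatException on junk
        if (!(value >= 0)) { // also rejects NaN
            throw new IllegalArgumentException("Tolerance must be >= 0: " + text);
        }
        return new Side(value, unit);
    }

    static Tolerance parse(String arg) {
        String[] parts = arg.split(",", -1); // -1 keeps empty parts, so "20ppm," fails
        if (parts.length == 1) {
            Side both = parseSide(parts[0]);
            return new Tolerance(both, both);
        }
        if (parts.length != 2) {
            throw new IllegalArgumentException("Expected 'tol' or 'left,right': " + arg);
        }
        Side left = parseSide(parts[0]);
        Side right = parseSide(parts[1]);
        if (!left.unit.equals(right.unit)) {
            throw new IllegalArgumentException("Units must match: " + arg);
        }
        return new Tolerance(left, right);
    }

    public static void main(String[] args) {
        Tolerance t = parse("0.5Da,2.5Da");
        System.out.println(t.left.value + t.left.unit + " / " + t.right.value + t.right.unit);
    }
}
```

A picocli converter for Phase 1 could delegate to a method shaped like `parse` and translate `IllegalArgumentException` into a user-facing message; whether the trailing-`!` quirk from sharp edge 6 should also apply here is left open, as in the inventory.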
diff --git a/.claude/plans/parameter-modernization.md b/.claude/plans/parameter-modernization.md
new file mode 100644
index 00000000..19a6961f
--- /dev/null
+++ b/.claude/plans/parameter-modernization.md
@@ -0,0 +1,159 @@
+# Plan: modernize MS-GF+ parameter handling
+
+**Status: proposed**
+Branch: `perf/search-sync-cleanup` (worktree at
+`/Users/yperez/work/msgfplus-workspace/search-sync-cleanup`).
+
+## Why this exists
+
+The current parameter stack under `edu.ucsd.msjava.params` is doing
+several jobs at once:
+- command-line parsing
+- type conversion
+- validation
+- help/usage rendering
+- config-file alias handling
+- backward-compatibility shims
+
+That works, but it spreads option behavior across many small classes
+(`Parameter`, `NumberParameter`, `RangeParameter`, `ToleranceParameter`,
+`FileParameter`, enum wrappers, and `ParamManager`). The result is more
+code than we need for a solved problem and a higher risk of subtle
+parsing drift when new flags are added.
+
+## Goals
+
+- Reduce the amount of custom CLI parsing code.
+- Keep existing MS-GF+ command-line behavior stable where practical.
+- Preserve current config-file semantics in the first migration step.
+- Keep `SearchParams` as the internal domain model for search settings.
+- Improve help/usage generation and validation error consistency.
+
+## Non-goals
+
+- No search algorithm changes.
+- No performance claim for the search itself; parsing happens once at
+ startup and is not a runtime hotspot.
+- No forced removal of legacy config-file aliases in phase 1.
+- No broad package cleanup bundled into this effort.
+
+## Recommended direction
+
+Adopt `picocli` for command-line parsing and help generation, while
+keeping a thin MSGF+-specific compatibility layer for:
+- legacy option names and aliases
+- config-file parsing
+- repeated modification/custom-AA entries
+- conversion into `SearchParams`, `AminoAcidSet`, `Tolerance`, and
+ related domain objects
+
+## Proposed migration shape
+
+### Phase 1: introduce a typed CLI model beside `ParamManager`
+
+- Add a new options class for `MSGFPlus` under `edu.ucsd.msjava.cli`.
+- Represent flags as typed fields with defaults, required markers,
+ and descriptions.
+- Add custom `picocli` converters for:
+ - precursor mass tolerance
+ - integer and float ranges
+ - output format
+ - precursor calibration mode
+ - file/directory validation
+- Keep `ParamManager` intact during this phase.
+- Add an adapter that maps parsed CLI options into the current
+ `SearchParams` inputs.
+
+Success criteria:
+- `MSGFPlus` can parse its current CLI arguments through the new path.
+- Generated help text is complete and readable.
+- Existing tests for parameter behavior still pass or are updated
+ mechanically where output formatting differs.
+
+### Phase 2: preserve config-file compatibility explicitly
+
+- Keep `ParamParser` or replace it with a thinner reader that still
+ accepts the current `key=value` format.
+- Centralize legacy config-name alias resolution in one place instead
+ of scattering it through `ParamNameEnum`.
+- Support repeated config entries for:
+ - `DynamicMod`
+ - `StaticMod`
+ - `CustomAA`
+- Feed config values into the same typed options model used by CLI.
+
+Success criteria:
+- Existing example parameter files still load.
+- Duplicate-entry behavior for mods/custom amino acids is preserved.
+- Command-line values continue to override config-file values.
+
+### Phase 3: move validation out of the custom parameter hierarchy
+
+- Replace per-type `parse()` methods with:
+ - `picocli` conversion
+ - explicit validation methods on the typed options object
+ - targeted domain-level validation while building `SearchParams`
+- Collapse or remove custom classes that are no longer needed:
+ - `Parameter`
+ - `NumberParameter`
+ - `RangeParameter`
+ - `IntParameter`
+ - `FloatParameter`
+ - `DoubleParameter`
+ - `IntRangeParameter`
+ - `FloatRangeParameter`
+ - enum parameter wrappers
+
+Success criteria:
+- No user-visible behavior regressions on required flags, defaults,
+ range checks, or enum choices.
+- Validation failures still produce actionable messages.
+
+### Phase 4: reduce `ParamManager` to compatibility-only or retire it
+
+- If any remaining tools still depend on `ParamManager`, keep it only as
+ a compatibility facade over the new parser.
+- Otherwise remove `ParamManager` from the active CLI path.
+- Decide whether `MSGFDB` migrates in the same PR series or follows
+ after `MSGFPlus` is stable.
+
+## Main risks
+
+- Help text and error messages may change in ways that break tests or
+ documentation.
+- Config-file behavior is more important than it looks; it includes
+ legacy aliases and repeated entries that generic CLI libraries do not
+ model by default.
+- `MSGFDB` and `MSGFPlus` share parts of the current stack, so an
+ incomplete migration could increase duplication before it decreases.
+
+## Validation plan
+
+- Add focused tests for:
+ - required arguments
+ - default values
+ - bad range syntax
+ - enum parsing
+ - file existence checks
+ - config-file override precedence
+ - repeated modification/custom-AA entries
+- Keep existing `SearchParams` tests green.
+- Run at least one end-to-end `MSGFPlus` smoke test on a known fixture.
+- Compare old vs new parser outcomes for a representative set of real
+ command lines and config files.
+
+## Suggested implementation order
+
+1. Add `picocli` dependency.
+2. Build a typed `MSGFPlusOptions` class and converters.
+3. Parse CLI into the new options class without removing `ParamManager`.
+4. Add an adapter into the current `SearchParams` build path.
+5. Port config-file handling.
+6. Remove unused custom parameter classes.
+7. Migrate `MSGFDB` only after `MSGFPlus` is stable.
+
+## Recommendation on branch strategy
+
+Do this in a dedicated refactor branch, not as part of a performance
+cleanup PR. The expected win is maintainability and correctness, not
+search throughput, and the surface area touches the public CLI.
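The Phase 2 requirements in the plan above (a `key=value` reader, one central alias table, repeatable `StaticMod` / `DynamicMod` / `CustomAA` entries kept in file order, and CLI values overriding config values) can be sketched as follows. This is an illustration only, not the existing `ParamParser`: `ConfigSketch` is a hypothetical name, the alias table is a small subset of the aliases listed in the flag inventory, keys are matched exactly, and stripping everything after `#` as a comment is an assumption based on the example parameter file.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a thin config reader; not MS-GF+'s ParamParser.
final class ConfigSketch {
    // Subset of the deprecated-key aliases from the flag inventory.
    static final Map<String, String> ALIASES = Map.of(
            "IsotopeError", "IsotopeErrorRange",
            "TargetDecoyAnalysis", "TDA",
            "NumTolerableTermini", "NTT",
            "PMTolerance", "PrecursorMassTolerance",
            "ParentMassTolerance", "PrecursorMassTolerance");

    // Keys where every line is a separate entry rather than an override.
    static final Set<String> REPEATABLE = Set.of("StaticMod", "DynamicMod", "CustomAA");

    final Map<String, String> scalars = new LinkedHashMap<>();
    final List<String> mods = new ArrayList<>(); // "Key=value", in file order

    static ConfigSketch parse(List<String> lines) {
        ConfigSketch cfg = new ConfigSketch();
        for (String raw : lines) {
            int hash = raw.indexOf('#'); // assumption: '#' starts a comment
            String line = (hash >= 0 ? raw.substring(0, hash) : raw).trim();
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // blank, comment-only, or malformed line
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            key = ALIASES.getOrDefault(key, key); // single place for alias resolution
            if (REPEATABLE.contains(key)) {
                cfg.mods.add(key + "=" + value);
            } else {
                cfg.scalars.put(key, value);
            }
        }
        return cfg;
    }

    /** Config supplies the base values; anything set on the CLI wins. */
    static Map<String, String> merge(Map<String, String> cli, ConfigSketch cfg) {
        Map<String, String> out = new LinkedHashMap<>(cfg.scalars);
        out.putAll(cli);
        return out;
    }

    public static void main(String[] args) {
        ConfigSketch cfg = parse(List.of(
                "PMTolerance=20ppm",
                "StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl",
                "DynamicMod=O1, M, opt, any, Oxidation"));
        System.out.println(merge(Map.of("NTT", "1"), cfg));
        System.out.println(cfg.mods);
    }
}
```

Keeping the repeatable entries in a separate ordered list, rather than in the scalar map, is what preserves duplicate-entry behavior; a scalar map alone would silently keep only the last `StaticMod` line.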
diff --git a/.claude/plans/search-sync-cleanup.md b/.claude/plans/search-sync-cleanup.md
new file mode 100644
index 00000000..bf7ec3e6
--- /dev/null
+++ b/.claude/plans/search-sync-cleanup.md
@@ -0,0 +1,133 @@
+# Plan: search-path sync cleanup + per-task result buffers
+
+**Status: SHIPPED in PR #25** (https://github.com/bigbio/msgfplus/pull/25)
+Branch: `perf/search-sync-cleanup` (worktree at
+`/Users/yperez/work/msgfplus-workspace/search-sync-cleanup`).
+
+Successor to PR #24. Pure refactor + instrumentation — no scoring,
+parser, or `.pin` feature changes. Output bit-identical to dev's tip
+on every measurable axis.
+
+## What shipped (6 commits)
+
+1. **T1 — per-task wall stats + tail-imbalance summary**
+ `RunMSGFPlus` captures preprocess / db-search / compute-evalue /
+ total wall into a `TaskWallStats` accessor; `MSGFPlus.runMSGFPlus`
+ prints a one-line summary at end of search:
+ ```
+ Task wall summary (n=12): min=101.7s median=224.2s p95=246.4s
+ max=246.4s total=2356.7s tail_gap=22.2s (10% of median)
+ ```
+ On Astral the measured `tail_gap` is **10 % of median**, which means
+ T2 and T3 can't deliver substantial wins on this workload.
+
+2. **Drop dead `synchronized` wrappers in DBScanner + ScoredSpectraMap.**
+ Each instance is task-local (verified: no internal fork-out in
+ `dbSearch`, no shared instance across threads). Plain `HashMap` /
+ `TreeMap` replace the `Collections.synchronizedMap` /
+ `synchronizedSortedMap` wrappers; `synchronized` modifier dropped
+ from `addDBMatches`, `generateSpecIndexDBMatchMap`,
+ `addResultsToList`, `addDBSearchResults`. Memory-visibility safety
+ preserved via `awaitTermination`'s happens-before.
+
+3. **Per-task local result buffers + final merge.**
+ Replaced the global `Collections.synchronizedList`
+ with a per-task `ArrayList`. Each `RunMSGFPlus` owns its own buffer;
+ main thread drains all buffers after `awaitTermination`.
+ `RunMSGFPlus`'s constructor drops the `resultList` parameter; new
+ `getResults()` accessor.
+
+4. **T2 — `-Dmsgfplus.numTasksPerThread=N`** (default 3, unchanged).
+ Lets operators raise the multiplier on datasets where T1's
+ `tail_gap` shows real imbalance.
+
+5. **T3 — `-Dmsgfplus.useForkJoin=true`** (default false, unchanged).
+ Opt-in `ForkJoinPool` swap. Default keeps
+ `ThreadPoolExecutorWithExceptions` (which retains progress
+ reporting + exception-capture-via-afterExecute). FJP path uses
+ `Future.get()` for exception propagation.
+
+6. **Polish — tighter result-buffer merge + `drainResultsTo` + reused
+ null sink.** Static `NULL_PRINT_STREAM` cached instead of allocated
+ per `run()`; `drainResultsTo(dest)` clears per-task buffers
+ immediately after merge so heap is collectible; pre-size merged
+ `ArrayList` to `sum(t.getResultCount())` to avoid resize-and-copy;
+ `submittedTasks.clear()` after summary drops strong refs to all 12
+ task instances before the FDR / write phase.
+
+## Validation gate cleared (Astral 3-arm + Percolator)
+
+Astral 3-arm cold, 8 GB heap, 4 threads, default sysprops.
+**All 8 parity numbers bit-identical to dev's tip:**
+
+| Metric | dev | this branch |
+|---|---:|---:|
+| armB raw targets | 89,479 | 89,479 ✓ |
+| armB raw decoys | 46,792 | 46,792 ✓ |
+| armB 1 % FDR targets | 35,818 | 35,818 ✓ |
+| armB 5 % FDR targets | 40,408 | 40,408 ✓ |
+| armC raw targets | 89,360 | 89,360 ✓ |
+| armC raw decoys | 46,913 | 46,913 ✓ |
+| armC 1 % FDR targets | 35,767 | 35,767 ✓ |
+| armC 5 % FDR targets | 40,426 | 40,426 ✓ |
+
+Walltime delta vs master in the same run:
+- armB: 752.2s vs 848.8s = **−11.4 %**
+- armC: 798.2s vs 848.8s = **−5.9 %**
+
+(First run came in with armC at 6298s; root-caused to OS thrashing —
+load avg 5-8, ~120 MB free RAM, 165M page reclaims, Rancher VM eating
+1 GB. Re-ran after stopping Rancher; wall normalized. Not a code
+issue. Documented in PR #25 description.)
+
+## What we learned vs. expected wins
+
+The plan predicted:
+- Step 1 (sync removal): 0–2 % wall. Possibly negative if biased
+ locking was helping. Code clarity is the more reliable win.
+- Step 2 (per-task buffers): 2–8 % wall, scaling with PSM count.
+- T2 / T3: only worth doing if profiler shows real tail-imbalance.
+
+What we measured:
+- Combined wall improvement: **11.4 % on armB, 5.9 % on armC**. The armB
+ figure exceeds the summed upper end of the per-step predictions (10 %),
+ suggesting the gains compound (less monitor traffic + cheaper drain phase).
+- T1's measured tail_gap on Astral: **10 % of median** — small enough
+ that T2/T3 default-on would give marginal wins. They ship as opt-in
+ knobs precisely so they don't gate the default behavior.
+
+## What this branch is NOT
+
+Not a fragment-index revival. Not a primitive mass-window port. Not
+a peak-storage refactor (`Peak` → `float[]`). Not a CLI / format
+change. Originated from a third-party review of PR #24.
+
+## Follow-ups (out of scope for this PR)
+
+- **Profile on TMT and a metaproteomic FASTA** with the new T1
+ summary. Astral's 10 % tail_gap might not represent uneven
+ workloads — homolog-rich DBs are the place T2/T3 should bite.
+- **`DatabaseMatch.indices` from `TreeSet` to primitive
+ `int[]`** (M1 from the broader memory-roadmap discussion). Highest
+ expected impact for homolog-heavy databases (5-12× memory reduction
+ per match); needs a metaproteomic test fixture to validate.
+- **Parser cache stores raw `float[] mz, float[] intensity`** (M3),
+ with a fresh `Spectrum` built per `getSpectrumBySpecIndex`. Side
+ benefit: cache-layer immutability instead of cloneSpectrum.
+- **`Peak`/`Spectrum` storage refactor** (M2). Multi-PR. Big surface
+ area. Defer until M1 + M3 land.
+
+## Open questions resolved
+
+- **Did the custom `ThreadPoolExecutorWithExceptions` preserve
+ awaitTermination's happens-before on the exception path?** Yes, on the
+ evidence available: armB / armC results were bit-identical across the
+ 3-arm benchmark, which a visibility bug would be unlikely to produce.
+
+- **Was HotSpot already eliding the uncontended monitors?** Probably
+ partially. Commit 2 (sync removal) was not benchmarked on its own;
+ combined with commits 3–6 the total is 11.4 % on armB. We can't
+ attribute that 11.4 % to any single commit without per-commit
+ benchmarks, but the polish commit (#6) likely contributes
+ meaningfully via the pre-sized `ArrayList` and immediate
+ per-task-buffer release.
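The per-task buffer pattern described in commits 3 and 6 above (each task owns a plain `ArrayList`, the main thread drains every buffer after `awaitTermination`, and the merged list is pre-sized from the per-task counts) can be sketched as below. This is a simplified illustration, not the shipped `RunMSGFPlus` code: `BufferMergeSketch` and `Task` are hypothetical stand-ins, results are plain strings, and a stock fixed thread pool replaces `ThreadPoolExecutorWithExceptions`. Only `getResultCount` and `drainResultsTo` reuse names from the plan.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of per-task result buffers with a final merge.
final class BufferMergeSketch {
    /** Stand-in for a search task: writes only to its own buffer. */
    static final class Task implements Runnable {
        private final int id;
        private final int count;
        private final List<String> results = new ArrayList<>(); // task-local, no locking

        Task(int id, int count) { this.id = id; this.count = count; }

        @Override
        public void run() {
            for (int i = 0; i < count; i++) {
                results.add("task" + id + "-psm" + i);
            }
        }

        int getResultCount() { return results.size(); }

        /** Copies results into dest, then drops the buffer's element references. */
        void drainResultsTo(List<String> dest) {
            dest.addAll(results);
            results.clear();
        }
    }

    static List<String> runAndMerge(int[] countsPerTask, int threads) {
        List<Task> tasks = new ArrayList<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < countsPerTask.length; i++) {
            Task t = new Task(i, countsPerTask[i]);
            tasks.add(t);
            pool.execute(t);
        }
        pool.shutdown();
        try {
            // Buffers are read only after the pool has terminated.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("Tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for tasks", e);
        }
        int total = 0;
        for (Task t : tasks) {
            total += t.getResultCount();
        }
        List<String> merged = new ArrayList<>(total); // pre-sized: no resize-and-copy
        for (Task t : tasks) {
            t.drainResultsTo(merged); // submission order, so the merge is deterministic
        }
        tasks.clear(); // drop strong references to finished tasks
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(runAndMerge(new int[] {3, 0, 2, 1}, 2).size());
    }
}
```

Merging in submission order rather than completion order keeps the output independent of thread scheduling, which is one plausible reason a change of this shape can stay bit-identical across runs.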
diff --git a/README.md b/README.md
index 14d0e2fa..ac2748d6 100644
--- a/README.md
+++ b/README.md
@@ -12,8 +12,8 @@
MS-GF+ (aka MSGF+ or MSGFPlus) performs peptide identification by scoring
MS/MS spectra against peptides derived from a protein sequence database.
-It supports the HUPO PSI standard input file (mzML) and additional legacy spectrum inputs, and saves results in
-the mzIdentML format, though results can easily be transformed to TSV.
+It supports the HUPO PSI standard input file (mzML) plus MGF, and writes
+Percolator `.pin` (default) or TSV output.
ProteomeXchange supports Complete data submissions using MS-GF+ search results.
MS-GF+ is developed by Sangtae Kim and the PNNL Proteomics team at the
@@ -22,10 +22,11 @@ Center for Computational Mass Spectrometry, University of California, San Diego.
## What is different in this fork?
- **Streaming mzML parser** -- replaces the in-memory preload with a single-pass StAX parser, significantly reducing memory usage for large files
-- **Primary maintained formats: mzML and MGF** -- mzXML is not available in this fork
+- **Spectrum input narrowed to mzML and MGF** -- mzXML, MS2, PKL, and `_dta.txt` are not supported in this fork
+- **mzIdentML output removed** -- output is Percolator `.pin` (default) or TSV; feed `.pin` straight into Percolator for rescoring
+- **Picocli-based CLI** -- declarative typed flags with auto-generated `-h/--help`
- **Java 17 minimum** -- updated from Java 8
- **CI/CD** -- GitHub Actions for automated testing and releases
-- **Direct TSV output** -- optional TSV output alongside mzIdentML
## Requirements
@@ -39,13 +40,13 @@ Download the latest release from the [Releases page](https://github.com/bigbio/m
## Quick Start
```bash
-# Basic search
+# Basic search (writes results.pin in Percolator format)
java -Xmx4G -jar MSGFPlus.jar \
-s spectra.mzML \
-d database.fasta \
- -o results.mzid
+ -o results.pin
-# TMT search with target-decoy analysis
+# TMT search with target-decoy analysis, Percolator-ready output
java -Xmx8G -jar MSGFPlus.jar \
-s spectra.mzML \
-d database.fasta \
@@ -56,11 +57,13 @@ java -Xmx8G -jar MSGFPlus.jar \
-e 1 \
-protocol 4 \
-mod mods.txt \
- -o results.mzid
+ -o results.pin
-# Convert mzid output to TSV
-java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv \
- -i results.mzid \
+# Direct TSV output (skip Percolator)
+java -Xmx4G -jar MSGFPlus.jar \
+ -s spectra.mzML \
+ -d database.fasta \
+ -outputFormat tsv \
-o results.tsv
```
@@ -70,14 +73,14 @@ java -cp MSGFPlus.jar edu.ucsd.msjava.ui.MzIDToTsv \
| Flag | Name | Description |
|------|------|-------------|
-| `-s` | SpectrumFile | Input spectrum file (`*.mzML`, `*.mgf`, `*.ms2`, `*.pkl`, `*_dta.txt`). Spectra should be centroided. |
+| `-s` | SpectrumFile | Input spectrum file (`*.mzML`, `*.mgf`). Spectra should be centroided. |
| `-d` | DatabaseFile | Protein sequence database (`*.fasta`, `*.fa`, `*.faa`). |
### Core Search Parameters
| Flag | Name | Default | Description |
|------|------|---------|-------------|
-| `-o` | OutputFile | `[input].mzid` | Output file path (`.mzid` format). |
+| `-o` | OutputFile | `[input].pin` | Output file path (`.pin` Percolator format, default; `.tsv` if `-outputFormat tsv`). |
| `-conf` | ConfigurationFile | — | Configuration file; command-line options override config file settings. |
| `-t` | PrecursorMassTolerance | `20ppm` | Precursor mass tolerance (e.g., `2.5Da`, `20ppm`, or `0.5Da,2.5Da` for asymmetric). |
| `-ti` | IsotopeErrorRange | `0,1` | Range of allowed isotope peak errors (e.g., `-1,2`). |
diff --git a/docs/changelog.md b/docs/changelog.md
index 29d94d53..713e0bd5 100644
--- a/docs/changelog.md
+++ b/docs/changelog.md
@@ -2,15 +2,136 @@
[MS-GF+ Documentation home](readme.md)
-**vNEXT — Unreleased (breaking change)**
-
-- **BREAKING:** mzIdentML (`.mzid`) support fully removed — no backward-compatibility shim. MS-GF+ now writes only Percolator `.pin` (default) or TSV, and every `.mzid`-related utility has been deleted:
- - **Output:** `MZIdentMLGen`, `AnalysisProtocolCollectionGen`, `MzIDTest` — deleted. `-o foo.mzid` is rejected at argument-parse time (no extension rewrite). `Unimod` and `UnimodComposition` are retained for future PTM-aware enhancements to `DirectPinWriter` — they carry the modification-accession + mass tables that a richer pin output would need to populate the `Peptide` column with proper Unimod references.
- - **Input / legacy tools:** `MzIDToTsv` (CLI `edu.ucsd.msjava.ui.MzIDToTsv`), `MzIDParser`, `AnnotatedSpectra`, `ScoringParamGen` — deleted. Users who need to post-process legacy `.mzid` files must use MS-GF+ v2026.03.25 or earlier, or an external mzid converter.
- - `-outputFormat` now accepts only `pin` (default) and `tsv`. Integer aliases: `0=pin, 1=tsv`. Previous `0=mzid, 2=both, 3=pin` layout is rejected.
- - `-o OutputFile` must end in `.pin` or `.tsv`. `.mzid` paths are rejected.
-- Added precursor mass calibration: `-precursorCal auto|on|off` (default `auto`). Merged via PR #22.
-- Added Percolator `.pin` output with OpenMS-parity features (`enzN`, `enzC`, `enzInt`, `mass`, `lnDeltaSpecEValue`, `matchedIonRatio`, `longest_b`, `longest_y`, `longest_y_pct`) and lowercase renames (`peplen`, `charge2/3/4`, `dm`, `absdm`, `isotope_error`) for OpenMS `PercolatorAdapter` interoperability. Merged via PR #22 + this PR.
+**vNEXT — Unreleased (multiple breaking changes)**
+
+This release modernises the CLI surface and trims a large amount of
+legacy code. Net change: roughly **−2,400 LOC** vs the previous
+release, with the CLI flag contract preserved for normal users and a
+few deliberate breaking changes called out below.
+
+### CLI parser modernisation
+
+- The CLI is now driven by [picocli](https://picocli.info) via
+ `edu.ucsd.msjava.cli.MSGFPlusOptions`. All flags are declared once
+ with typed Java fields; help (`-h`/`--help`) and version (`-V`) are
+ auto-generated.
+- `-conf` config-file inputs flow through the same path: any field
+ the CLI did not set is filled in from the config file (CLI takes
+ precedence). Legacy aliases continue to be recognised, including
+ `IsotopeError` → `IsotopeErrorRange`, `FragmentationMethod` →
+ `FragmentationMethodID`, `Instrument` → `InstrumentID`, `Enzyme`
+ → `EnzymeID`, `Protocol` → `ProtocolID`, `NumTolerableTermini` →
+ `NTT`, `MinNumPeaks` → `MinNumPeaksPerSpectrum`, `MaxNumMods` /
+ `MaxNumModsPerPeptide` → `NumMods`, `MinPeptideLength` /
+ `minLength` → `MinPepLength`, `MaxPeptideLength` / `maxLength` →
+ `MaxPepLength`, `PMTolerance` / `ParentMassTolerance` →
+ `PrecursorMassTolerance`. Config-file keys are matched
+ case-insensitively (so `minCharge=`, `MinCharge=`, and `MINCHARGE=`
+ all work).
+- `DynamicMod=`, `StaticMod=`, and `CustomAA=` config-file entries
+ continue to be repeatable; each line is collected into the AA set.
+- Validation surface restored: invalid numeric values (e.g.
+ `-thread 0`, `-ntt 5`, `-tda 2`) and out-of-range enum-like IDs
+ (e.g. `-m 99`, `-inst 99`) now produce a clean user-facing error
+ string instead of a stack trace.
+
+### Breaking changes
+
+- **`-outputFormat` accepts only named values.** `pin` (default) and
+ `tsv` are the supported forms (case-insensitive). The legacy
+ numeric aliases `0` and `1` are no longer accepted; users on those
+ invocations should switch to the named values.
+- **`-precursorCal` is now a typed enum.** `auto` (default), `on`,
+ and `off` are still the only valid values; invalid values now fail
+ fast at parse time instead of silently mapping to `auto`.
+- **Spectrum input narrowed to `*.mzML` and `*.mgf`.** Support for
+ `*.mzXML`, `*.ms2`, `*.pkl`, and `*_dta.txt` has been removed
+ along with their parsers. `MgfSpectrumParser`, `BufferedLineReader`,
+ `BufferedRandomAccessLineReader`, and the shared `LineReader` /
+ `SpectrumParser` interfaces moved from `edu.ucsd.msjava.parser` to
+ `edu.ucsd.msjava.mgf` to reflect the trimmed scope.
+- **Deprecated `MSGFDB` entry point removed.** `cli.MSGFDB` (legacy
+ v8091, "08/06/2012") and `docs/ms-gfdb.md` have been deleted, along
+ with `ParamManager.addMSGFDBParams` / `addMSGFParams` /
+ `addMSGFLibParams` (the latter two were dead — no entry points
+ existed). The MSGFDB-only `ParamNameEnum` entries `C13`, `NNET`,
+ `UNIFORM_AA_PROBABILITY`, and `OUTPUT_FILE` are gone, as are the
+ `showFDR`, `showDecoy`, and `replicate` config-file keys.
+- mzIdentML (`.mzid`) support remains fully removed (the removal landed
+ in a prior commit on this branch). MS-GF+ writes only `.pin` (default)
+ or `.tsv`. Every `.mzid`-related utility has been deleted:
+ - **Output:** `MZIdentMLGen`, `AnalysisProtocolCollectionGen`,
+ `MzIDTest`. `Unimod` and `UnimodComposition` are retained for
+ future PTM-aware enhancements to `DirectPinWriter` — they carry
+ the modification-accession + mass tables a richer pin output
+ would need.
+ - **Input / legacy tools:** `MzIDToTsv` (CLI
+ `edu.ucsd.msjava.ui.MzIDToTsv`), `MzIDParser`,
+ `AnnotatedSpectra`, `ScoringParamGen`. Users who need to
+ post-process legacy `.mzid` files must use MS-GF+ v2026.03.25
+ or earlier, or an external mzid converter.
+
+### Internal refactor
+
+- The entire `edu.ucsd.msjava.params` package has been deleted
+ (~2,100 LOC across 18 classes including `ParamManager`, the
+ `Parameter` / `IntParameter` / `FloatParameter` / `IntRangeParameter`
+ / `ToleranceParameter` / `EnumParameter` / `FileParameter` /
+ `StringParameter` hierarchy, and `ParamParser`). Two small helpers
+ (`ParamObject`, `UserParam`) moved to `edu.ucsd.msjava.msutil`
+ where their `ActivationMethod` / `Enzyme` / `InstrumentType` /
+ `Protocol` consumers already live.
+- Top-level package reorganisation:
+ - `edu.ucsd.msjava.ui.MSGFPlus` → `edu.ucsd.msjava.cli.MSGFPlus`.
+ - `edu.ucsd.msjava.mzid.{DirectPinWriter,DirectTSVWriter,Unimod,UnimodComposition}`
+ → `edu.ucsd.msjava.output.*`.
+ - `edu.ucsd.msjava.parser.*` → `edu.ucsd.msjava.mgf.*` (after
+ dropping the legacy-format parsers).
+ - `net.pempek.unicode.UnicodeBOMInputStream` →
+ `edu.ucsd.msjava.mgf.UnicodeBOMInputStream`.
+ - `edu.ucsd.msjava.mslibsearch.ProcessedSpectrum` deleted (no
+ references).
+- New typed value classes in `cli/`:
+ - `MSGFPlusOptions` — picocli `@Command` with all MSGFPlus flags.
+ - `PrecursorTolerance` — symmetric or asymmetric tolerance with
+ matching-unit + non-negative validation.
+ - `IntRange` — inclusive integer range used by `-ti`, `-msLevel`,
+ `-index`.
+ - `OutputFormat` — enum (`PIN`, `TSV`).
+- `picocli` 4.7.6 added as a runtime dependency.
+- New regression tests covering the `CustomAA=` config-file path,
+ the `-m 4 = UVPD` mapping, case-insensitive config keys, and
+ out-of-range flag rejection. The full scoped test sweep includes
+ 78 tests.
+
+### Bench gate
+
+The Astral 3-arm correctness gate (`benchmark/run_astral_3arm.sh`,
+ProteoBench Module 8) on the prior modernisation pass confirmed
+**bit-identical PSM target/decoy counts** to the pre-PR #22 baseline
+JAR when `-precursorCal off` is supplied:
+
+| Arm | JAR | -precursorCal | targets | decoys |
+|---|---|---|---|---|
+| A | baseline (pre-PR #22) | n/a | 89,479 | 46,792 |
+| B | new branch | off | **89,479** | **46,792** |
+| C | new branch | auto | 89,360 | 46,913 |
+
+Arm C's small delta is the calibrator's expected effect when AUTO
+collects ≥200 confident PSMs. The CLI rewrite does not touch the
+search hot path, so this gate continues to apply for the additional
+fixes layered on top.
+
+### Earlier in this release cycle
+
+- Added precursor mass calibration: `-precursorCal auto|on|off`
+ (default `auto`). Merged via PR #22.
+- Added Percolator `.pin` output with OpenMS-parity features
+ (`enzN`, `enzC`, `enzInt`, `mass`, `lnDeltaSpecEValue`,
+ `matchedIonRatio`, `longest_b`, `longest_y`, `longest_y_pct`) and
+ lowercase column renames (`peplen`, `charge2/3/4`, `dm`, `absdm`,
+ `isotope_error`) for OpenMS `PercolatorAdapter` interoperability.
+ Merged via PR #22 + this PR.
**v2026.03.25**
diff --git a/docs/examples/MSGFPlus_Params.txt b/docs/examples/MSGFPlus_Params.txt
index c1ef196e..66282805 100644
--- a/docs/examples/MSGFPlus_Params.txt
+++ b/docs/examples/MSGFPlus_Params.txt
@@ -1,139 +1,139 @@
-# SpectrumFile
-# *.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt
-# Spectra should be centroided (see below for MSConvert example). Profile spectra will be ignored.
-# Use of -s at the command line will override this filename
-#SpectrumFile=InstrumentFile.mzML
-
-# FASTA file
-# "*.fasta or *.fa or *.faa
-# Use of -d at the command line will override this filename
-#DatabaseFile=Proteins.fasta
-
-# Prefix for decoy proteins in the FASTA file
-#DecoyPrefix=XXX
-
-# Precursor mass tolerance
-# Examples: 2.5Da or 30ppm
-# Use comma to set asymmetric values, for example "0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass)
-PrecursorMassTolerance=20ppm
-
-# Max Number of Dynamic (Variable) Modifications per peptide
-# Default: 3
-# If this value is large, the search will be slow
-NumMods=3
-
-# Modifications (see below for examples)
-StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C (alkylation)
-StaticMod=229.1629, *, fix, N-term, TMT6plex
-StaticMod=229.1629, K, fix, any, TMT6plex
-
-DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
-DynamicMod=-187.152366, K, opt, any, AcNoTMT # Residue tagged by MSGF+ with static TMT6, but is actually acetylated and does not have TMT
-
-# Custom AA specification
-#CustomAA=C3H5NO, U, custom, U, Selenocysteine # Custom amino acids can only have C, H, N, O, and S
-#CustomAA=C6H11NO, X, custom, X, Leu_Ile # Leucine or Isoleucine
-
-# Fragmentation Method
-# 0 means as written in the spectrum or CID if no info (Default)
-# 1 means CID
-# 2 means ETD
-# 3 means HCD
-FragmentationMethodID=0
-
-# Instrument ID
-# 0 means Low-res LCQ/LTQ (Default for CID and ETD); use InstrumentID=0 if analyzing a dataset with low-res CID and high-res HCD spectra
-# 1 means High-res LTQ (Default for HCD; also appropriate for high res CID); use InstrumentID=1 for Orbitrap, Lumos, and QEHFX instruments
-# 2 means TOF
-# 3 means Q-Exactive
-InstrumentID=1
-
-# Enzyme ID
-# 0 means unspecific cleavage (cleave after any residue)
-# 1 means Trypsin (Default); optionally use this along with NTT=0 for a no-enzyme-specificity search of a tryptically digested sample
-# 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Cleavage (for peptidomics), 10: TrypPlusC (cleave after K, R, or C)
-EnzymeID=1
-
-# Isotope error range
-# Takes into account of the error introduced by choosing non-monoisotopic peak for fragmentation.
-# Useful for accurate precursor ion masses
-# Ignored if the parent mass tolerance is > 0.5Da or 500ppm
-# The combination of -t and -ti determins the precursor mass tolerance.
-# e.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
-IsotopeErrorRange=-1,2
-
-# Number of tolerable termini
-# The number of peptide termini that must have been cleaved by the enzyme (default 1)
-# For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
-NTT=2
-
-# Control N-terminal methionine cleavage
-# 0 means to consider protein N-term Met cleavage (Default)
-# 1 means to ignore protein N-term Met cleavage
-IgnoreMetCleavage=0
-
-# Target/Decoy search mode
-# 0 means don't search decoy database (default)
-# 1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
-TDA=1
-
-# Number of concurrent threads to be executed
-# Default: Number of available cores
-# To use three threads use NumThreads=3
-NumThreads=All
-
-# Minimum peptide length to consider
-# Default: 6
-MinPepLength=6
-
-# Maximum peptide length to consider
-# Default: 40
-MaxPepLength=50
-
-# Minimum precursor charge to consider (if not specified in the spectrum file)
-# Default: 2
-MinCharge=2
-
-# Maximum precursor charge to consider (if not specified in the spectrum file)
-# Default: 3
-MaxCharge=5
-
-# Number of matches per spectrum to be reported
-# If this value is greater than 1, the FDR values computed by MS-GF+ will be skewed by high-scoring 2nd and 3rd hits
-NumMatchesPerSpec=1
-
-# Mass of charge carrier
-# Default: mass of proton
-#ChargeCarrierMass=1.00727649
-
-# Maximum missed cleavages
-# Exclude peptides with more than this number of missed cleavages from the search, Default: -1 (no limit)
-#MaxMissedCleavages=-1
-
-# Minimum number of peaks per spectrum, Default:
-# Default: 10
-#MinNumPeaksPerSpectrum=10
-
-# Number of isoforms to consider per peptide
-# Default: 128
-#NumIsoforms=128
-
-# Amino Acid Modification Examples
-# Specify static modifications using one or more StaticMod= entries
-# Specify dynamic modifications using one or more DynamicMod= entries
-# Modification format is:
-# Mass or CompositionString, Residues, ModType, Position, Name (all five fields are required).
-# CompositionString can only contain a limited set of elements, primarily C H N O S or P
-#
-# Examples:
-# C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C (alkylation)
-# O1, M, opt, any, Oxidation # Oxidation M
-# 15.994915, M, opt, any, Oxidation # Oxidation M (mass is used instead of CompositionString)
-# H-1N-1O1, NQ, opt, any, Deamidated # Negative numbers are allowed.
-# CH2, K, opt, any, Methyl # Methylation K
-# C2H2O1, K, opt, any, Acetyl # Acetylation K
-# HO3P, STY,opt, any, Phospho # Phosphorylation STY
-# C2H3NO, *, opt, N-term, Carbamidomethyl # Variable Carbamidomethyl N-term
-# H-2O-1, E, opt, N-term, Glu->pyro-Glu # Pyro-glu from E
-# H-3N-1, Q, opt, N-term, Gln->pyro-Glu # Pyro-glu from Q
-# C2H2O, *, opt, Prot-N-term, Acetyl # Acetylation Protein N-term
+# SpectrumFile
+# *.mzML or *.mgf
+# Spectra should be centroided (see below for MSConvert example). Profile spectra will be ignored.
+# Use of -s at the command line will override this filename
+#SpectrumFile=InstrumentFile.mzML
+
+# FASTA file
+# *.fasta or *.fa or *.faa
+# Use of -d at the command line will override this filename
+#DatabaseFile=Proteins.fasta
+
+# Prefix for decoy proteins in the FASTA file
+#DecoyPrefix=XXX
+
+# Precursor mass tolerance
+# Examples: 2.5Da or 30ppm
+# Use comma to set asymmetric values, for example "0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass)
+PrecursorMassTolerance=20ppm
+
+# Max Number of Dynamic (Variable) Modifications per peptide
+# Default: 3
+# If this value is large, the search will be slow
+NumMods=3
+
+# Modifications (see below for examples)
+StaticMod=C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C (alkylation)
+StaticMod=229.1629, *, fix, N-term, TMT6plex
+StaticMod=229.1629, K, fix, any, TMT6plex
+
+DynamicMod=O1, M, opt, any, Oxidation # Oxidized methionine
+DynamicMod=-187.152366, K, opt, any, AcNoTMT # Residue tagged by MSGF+ with static TMT6, but is actually acetylated and does not have TMT
+
+# Custom AA specification
+#CustomAA=C3H5NO, U, custom, U, Selenocysteine # Custom amino acids can only have C, H, N, O, and S
+#CustomAA=C6H11NO, X, custom, X, Leu_Ile # Leucine or Isoleucine
+
+# Fragmentation Method
+# 0 means as written in the spectrum or CID if no info (Default)
+# 1 means CID
+# 2 means ETD
+# 3 means HCD
+FragmentationMethodID=0
+
+# Instrument ID
+# 0 means Low-res LCQ/LTQ (Default for CID and ETD); use InstrumentID=0 if analyzing a dataset with low-res CID and high-res HCD spectra
+# 1 means High-res LTQ (Default for HCD; also appropriate for high res CID); use InstrumentID=1 for Orbitrap, Lumos, and QEHFX instruments
+# 2 means TOF
+# 3 means Q-Exactive
+InstrumentID=1
+
+# Enzyme ID
+# 0 means unspecific cleavage (cleave after any residue)
+# 1 means Trypsin (Default); optionally use this along with NTT=0 for a no-enzyme-specificity search of a tryptically digested sample
+# 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: No Cleavage (for peptidomics), 10: TrypPlusC (cleave after K, R, or C)
+EnzymeID=1
+
+# Isotope error range
+# Takes into account the error introduced by choosing a non-monoisotopic peak for fragmentation.
+# Useful for accurate precursor ion masses
+# Ignored if the parent mass tolerance is > 0.5Da or 500ppm
+# The combination of -t and -ti determines the precursor mass tolerance.
+# e.g. "-t 20ppm -ti -1,2" tests abs(exp-calc-n*1.00335Da)<20ppm for n=-1, 0, 1, 2.
+IsotopeErrorRange=-1,2
+
+# Number of tolerable termini
+# The number of peptide termini that must have been cleaved by the enzyme (default 1)
+# For trypsin, 2 means fully tryptic only, 1 means partially tryptic, and 0 means no-enzyme search
+NTT=2
+
+# Control N-terminal methionine cleavage
+# 0 means to consider protein N-term Met cleavage (Default)
+# 1 means to ignore protein N-term Met cleavage
+IgnoreMetCleavage=0
+
+# Target/Decoy search mode
+# 0 means don't search decoy database (default)
+# 1 means search decoy database to compute FDR (source FASTA file must be forward-only proteins)
+TDA=1
+
+# Number of concurrent threads to be executed
+# Default: Number of available cores
+# To use three threads use NumThreads=3
+NumThreads=All
+
+# Minimum peptide length to consider
+# Default: 6
+MinPepLength=6
+
+# Maximum peptide length to consider
+# Default: 40
+MaxPepLength=50
+
+# Minimum precursor charge to consider (if not specified in the spectrum file)
+# Default: 2
+MinCharge=2
+
+# Maximum precursor charge to consider (if not specified in the spectrum file)
+# Default: 3
+MaxCharge=5
+
+# Number of matches per spectrum to be reported
+# If this value is greater than 1, the FDR values computed by MS-GF+ will be skewed by high-scoring 2nd and 3rd hits
+NumMatchesPerSpec=1
+
+# Mass of charge carrier
+# Default: mass of proton
+#ChargeCarrierMass=1.00727649
+
+# Maximum missed cleavages
+# Exclude peptides with more than this number of missed cleavages from the search, Default: -1 (no limit)
+#MaxMissedCleavages=-1
+
+# Minimum number of peaks per spectrum
+# Default: 10
+#MinNumPeaksPerSpectrum=10
+
+# Number of isoforms to consider per peptide
+# Default: 128
+#NumIsoforms=128
+
+# Amino Acid Modification Examples
+# Specify static modifications using one or more StaticMod= entries
+# Specify dynamic modifications using one or more DynamicMod= entries
+# Modification format is:
+# Mass or CompositionString, Residues, ModType, Position, Name (all five fields are required).
+# CompositionString can only contain a limited set of elements, primarily C H N O S or P
+#
+# Examples:
+# C2H3N1O1, C, fix, any, Carbamidomethyl # Fixed Carbamidomethyl C (alkylation)
+# O1, M, opt, any, Oxidation # Oxidation M
+# 15.994915, M, opt, any, Oxidation # Oxidation M (mass is used instead of CompositionString)
+# H-1N-1O1, NQ, opt, any, Deamidated # Negative numbers are allowed.
+# CH2, K, opt, any, Methyl # Methylation K
+# C2H2O1, K, opt, any, Acetyl # Acetylation K
+# HO3P, STY,opt, any, Phospho # Phosphorylation STY
+# C2H3NO, *, opt, N-term, Carbamidomethyl # Variable Carbamidomethyl N-term
+# H-2O-1, E, opt, N-term, Glu->pyro-Glu # Pyro-glu from E
+# H-3N-1, Q, opt, N-term, Gln->pyro-Glu # Pyro-glu from Q
+# C2H2O, *, opt, Prot-N-term, Acetyl # Acetylation Protein N-term
diff --git a/docs/ms-gfdb.md b/docs/ms-gfdb.md
deleted file mode 100644
index aed3fc5d..00000000
--- a/docs/ms-gfdb.md
+++ /dev/null
@@ -1,210 +0,0 @@
-# MS-GFDB
-
-[MS-GF+ Documentation home](readme.md)
-
-MS-GFDB is an old application that is no longer under development. It was supserseded by [MS-GF+](msgfplus.md).
-MS-GF+ has all the functionalities provided by MS-GFDB, plus numerous improvements.
-
-### Differences between MS-GF+ and MS-GFDB
-
-- **Input**
- - MS-GF+ supports mzML in addition to mzXML, mgf, ms2, pkl and \_dta.txt
- - "-t PrecursorMassTolerance" is optional with MS-GF+ (default 20ppm)
- - "-c13 0/1/2" was changed to "-ti IsotopeErrorRange" in MS-GF+
- - IsotopeErrorRange: MinIsotopeError,MaxIsotopeError (both are inclusive)
- - -c13 x == -ti 0,x
- - "-nnet" was changed to "-ntt" in MS-GF+
- - -nnet 0 == -ntt 2, -nnet 1 == -ntt 1, -nnet 2 == -ntt 0
- - Modification file format change
- - In MS-GF+, the name of the modification should match the PSI-MS name (accessible from [http://www.unimod.org](http://www.unimod.org/))
- - CompositionStr can take Br, Cl, Fe, Se in addition to C, H, N, O, S, and P
- - The sequence of the atoms can be arbitrary.
- - Previously C2H2O was valid but OH2C2 was invalid
- - With MS-GF+, both are valid
- - "-uniformAAProb 0/1" was deleted in MS-GF+
- - "-addFeatures 0/1" was added to MS-GF+; "-addFeatures 1" will output the following extra features for each PSM (will be useful to downstream tools like Percolator or IDPicker):
- - MS2IonCurrent: Summed intensity of all product ions
- - ExplainedIonCurrentRatio: Summed intensity of all matched product ions (e.g. b, b-H2O, y, etc.) divided by MS2IonCurrent
- - NTermIonCurrentRatio: Summed intensity of all matched prefix ions (e.g. b, b-H2O, etc.) divided by MS2IonCurrent
- - CTermIonCurrentRatio: Summed intensity of all matched suffix ions (e.g. y, y-H2O, etc.) divided by MS2IonCurrent
- - "-showQValue 0/1" was added to MS-GF+
- - "-showDecoy 0/1" was added to MS-GF+
-- **Output**
- - Output format for MS-GF+ is the HUPO PSI mzIdentML version 1.1 (\*.mzid); see for details.
- - Decoy protein prefix is "XXX" in MS-GF+ (vs. "REV" in MS-GFDB)
- - MS-GF+ provides a converter from mzIdentML to tsv (the resulting tsv file will be similar to the MS-GFDB output file).
- - The converter is included in the MSGFPlus.jar file
- - It can be run by "java -Xmx2000M edu.ucsd.msjava.ui.MzIDToTsv"
- - A faster converter that supports larger result files is the Mzid-To-Tsv-Converter, [available on GitHub](https://github.com/PNNL-Comp-Mass-Spec/Mzid-To-Tsv-Converter/releases). This is a C# application that works under Windows or on Linux with mono.
- - Difference between the MS-GFDB output and the MS-GF+ TSV output
- - MS-GF+ includes SpecID (native spectrum ID) instead of SpecIndex
- - MS-GF+ reports IsotopeError
- - When a peptide matches to multiple proteins, all protein accessions will be reported by MS-GF+
- - SpecProb was renamed to SpecEValue in MS-GF+
- - MS-GF+ reports EValue (database-level E-value) instead of PValue (database-level P-value)
- - FDR and PepFDR were renamed to QValue and PepQValue in MS-GF+
-
-# MS-GFDB
-
-
-```text
-Usage: java -Xmx2000M -jar MSGFDB.jar
- -s SpectrumFile (*.mzXML, *.mzML, *.mgf, *.ms2, *.pkl or *_dta.txt)
- -d DatabaseFile (*.fasta or .fa)
- -t ParentMassTolerance (e.g. 2.5Da, 30ppm, or 0.5Da,2.5Da)
- Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass).
- [-o outputFileName] (Default: stdout)
- [-thread NumOfThreads] (Number of concurrent threads to be executed, Default: Number of available cores)
- [-tda 0/1] (0: don't search decoy database (default), 1: search decoy database to compute FDR)
- [-m FragmentationMethodID] (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
- [-inst InstrumentID] (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: High-res LTQ (Default for HCD), 2: TOF)
- [-e EnzymeID] (0: No enzyme, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: aLP, 9: Endogenous peptides)
- [-c13 0/1/2] (Number of allowed C13, Default: 1)
- [-nnet 0/1/2] (Number of allowed non-enzymatic termini, Default: 1)
- [-mod ModificationFileName] (Modification file, Default: standard amino acids with fixed C+57)
- [-minLength MinPepLength] (Minimum peptide length to consider, Default: 6)
- [-maxLength MaxPepLength] (Maximum peptide length to consider, Default: 40)
- [-minCharge MinPrecursorCharge] (Minimum precursor charge to consider if not specified in the spectrum file, Default: 2)
- [-maxCharge MaxPrecursorCharge] (Maximum precursor charge to consider if not specified in the spectrum file, Default: 3)
- [-n NumMatchesPerSpec] (Number of matches per spectrum to be reported, Default: 1)
- [-uniformAAProb 0/1] (0: use amino acid probabilities computed from the input database (default), 1: use probability 0.05 for all amino acids)
-```
-
-
-### Parameters:
-
-- **-s SpectrumFile** (\*.mzXML, \*.mzML, \*.mgf, \*.ms2, \*.pkl or \*\_dta.txt) - Required
- - Spectrum file name. Currently, MS-GFDB supports the following file formats: mzXML, mzML, mgf, ms2, pkl and \_dta.txt.
-- **-d DatabaseFile** (\*.fasta or \*.fa) - Required
- - Path to the protein database file. If the database file does not have auxiliary index files (\*.canno, \*.cnlcp, \*.csarr, and \*.cseq), MS-GFDB will create them.
- - When "-tda 1" option is used, the database must contain only target protein sequences.
-
-If multiple MS-GFDB processes access the same database file, it is strongly recommended to index the database prior to the database search by running BuildSA (see below).
-
-- **-t ParentMassTolerance** - Required
- - Parent mass tolerance in Da. or ppm. There must be no space between the number and the unit. E.g. 2.5Da, 30ppm
- - To set asymmetric tolerances, use comma to separate left (experimental mass \< theoretical mass) or right (experimental mass \> theoretical mass) tolerances. E.g. 0.5Da,2.5Da
-- **-o OutputFile** (Default: stdout)
- - Filename where the output will be written.
- - The output will be printed to standard out by default.
-- **-thread NumOfThreads** (Number of concurrent threads to be executed, Default: Number of available cores)
- - Number of concurrent threads to be executed together.
- - Default value is the number of available logical cores (e.g. 8 for quad-core processor with hyper-threading support).
-- **-tda 0/1** (0: don't search decoy database (default), 1: search decoy database to compute FDR)
- - Indicates whether to search the decoy database or not.
- - If 0, the decoy database is not searched and FDRs are theoretically derived from P-values (EFDR).
- - If 1, FDRs are computed based on the target-decoy approach (i.e. reversed database is appended to the target database and MS-GFDB searches the combined database)
- - FDR(t) = \#(DecoyPSMs with score equal or above t) / \#(TargetPSMs with score equal or above t).
- - PSM: Peptide-Spectrum Match
- - -log(SpecProb) is used as the score to compute FDR.
-
-If -tda 1 is specified, MS-GFDB automatically creates a combined target/reversed database file (DBFileName.revConcat.fasta). Thus, when specifying "-d" parameter, DatabaseFile must contain only target proteins.
-
-- **-m FragmentationMethodID** (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
- - Fragmentation method identifier (used to determine the scoring model).
- - If the identifier is 0 and fragmentation method is written in the spectrum file (e.g. activationMethod field in mzXML files), MS-GFDB will recognize the fragmentation method and use a relevant scoring model.
- - If the identifier is 0 and there is no fragmentation method information in the spectrum (e.g. mgf files), CID model will be used by default.
- - If the identifier is non-zero and the spectrum has fragmentation method information, only the spectra that match with the identifier will be processed.
- - If the identifier is non-zero and the spectrum has no fragmentation method information, MS-GFDB will process all spectra assuming the specified fragmentation method.
- - If the identifier is 4, MS/MS spectra from the same precursor ion (e.g. CID/ETD pairs, CID/HCD/ETD triplets) will be merged and the "merged" spectrum will be used for searching instead of individual spectra. See Kim et al., MCP 2010 for details.
-- **-inst InstrumentID** (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: TOF , 2: High-res LTQ (Default for HCD))
- - Identifier of the instrument to generate MS/MS spectra (used to determine the scoring model).
- - For "hybrid" spectra with high-precision MS1 and low-precision MS2, use 0.
- - For usual low-precision instruments (e.g. Thermo LTQ), use 0.
- - For TOF instruments, use 1.
- - If MS/MS fragment ion peaks are of high-precision (e.g. tolerance = 10ppm), use 2.
-- **-e EnzymeID** (Default: 1)
- - Enzyme identifier. Trypsin (1) will be used by default.
- - 0: No enzyme, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: Endogenous peptides
-- **-c13 0/1/2** (Number of allowed isotope errors, Default: 1)
- - Instruments often choose 2nd or 3rd isotope peak instead of mono-isotope peak from MS1 spectrum.
- - If this value is non-zero, expPeptideMass-1.00335 (i.e. mass(13C)-mass(12C)) and expPeptideMass-2.00671 (i.e. 2\*(mass(C13)-mass(C12)) (only if -c13 2) will be considered along with expPeptideMass.
- - If accurate precursor ion mass is available (e.g. LTQ-Orbitrap), it is better to set a narrow parent mass tolerance and non-zero -c13 value (e.g. -t 30ppm -c13 1) than to set a wide tolerance (e.g. -t 0.5Da,2.5Da).
- - If the parent mass tolerance is equal to or larger than 0.5Da or 500ppm, this parameter will be ignored.
-- **-nnet 0/1/2** (Number of allowed non-enzymatic termini, Default: 1)
- - This parameter is used to determine the enzyme cleavage rule.
- - Specifies the maximum number of peptide termini that are not cleaved by the enzyme.
- - For example, for trypsin, K.ACDEFGHR.C, G.ACDEFGHR.C, K.ACDEFGHI.C and G.ACDEFGHR.C have 0, 1, 1 and 2 non-enzymatic termini, accordingly.
- - By default, -nnet 1 will be used. Using -nnet 0 (or 2) will make the search significantly faster (slower).
-- **-mod ModificationFile** (Default: standard amino acids with fixed C+57)\]
- - Modification file name. ModificationFile contains the modifications to be considered in the search.
- - If -mod option is not specified, standard amino acids with fixed Carbamidomethylation C will be used.
- - See an [example MS-GFDB modification file](msgfdb_modfile.md).
-- **-minLength MinPepLength** (Default: 6)
- - Minimum length of the peptide to be considered.
-- **-maxLength MaxPepLength** (Default: 40)
- - Maximum length of the peptide to be considered.
-- **-minCharge MinPrecursorCharge** (Default: 2)
- - Minimum precursor charge to consider. This parameter is used only for spectra with no charge.
-- **-maxCharge MinPrecursorCharge** (Default: 3)
- - Maximum precursor charge to consider. This parameter is used only for spectra with no charge.
-- **-n NumMatchesPerSpec** (Default: 1)
- - Number of peptide matches per spectrum to report.
- - Expected false discovery rates (EFDRs) will be reported only when this value is 1.
-- **-uniformAAProb** 0/1 (Default: 0)
- - If 0, compute amino acid frequencies from the input database and use them as amino acid probabilities.
- - If 1, use uniform amino acid probability (preferable when the database size is small).
-
-### MS-GFDB output
-
-MS-GFDB outputs a tab-delimited file with the following columns: \#SpecFile, Scan#, FragMethod, Precursor, PMError, Charge, Peptide, Protein, DeNovoScore, MSGFScore, SpecProb, P-value, EFDR.
-
-- **SpecFile**: spectrum file name
-- **SpecIndex**: spectrum index (1-based) in the file. The first spectrum has index 1, the second has index 2, and so on. For mzXML files this value is same as the scan number.
-- **Scan#**: scan number of the spectrum. If the scan number is not available, the value will be -1.
-- **FragMethod**: fragmentation method used to generate the spectrum (e.g. CID, ETD, etc.). When spectra from the same precursor are merged, fragmentation methods of merged spectra will be shown as a form "FragMethod1/FragMethod2/..." (e.g. CID/ETD, CID/HCD/ETD).
-- **Precursor**: precursor mass in m/z or ppm
-- **Charge**: precursor ion charge
-- **Peptide**: peptide sequence with neighboring amino acids
-- **Protein**: protein name
-- **DeNovoScore**: the score of the optimal scoring peptide (not necessary in the database)
-- **MSGFScore**: MS-GF raw score of the peptide-spectrum match (MSGFScore \<= DeNovoScore)
-- **SpecProb**: spectral probability (spectrum level p-value) of the peptide-spectrum match
-- **P-value**: database level p-value (probability that a random PSM have an equal or better score against a random database of the same size)
-- **EFDR** or **FDR**: false discovery rate
- - If "-tda 1" is specified, FDRs are estimated using the target-decoy approach using the spectral probability (SpecProb) as the score (the lower, the better).
- - Otherwise, FDRs are estimated using P-values without searching the decoy database (EFDR). See Gupta et al., JASMS 2011 for details.
- - MS-GFDB reports EFDR only when it is configured to report 1 peptide match per spectrum (i.e. -n 1).
- - EFDR accurately estimates FDR when the parent mass tolerance is equal or larger than 0.5.
- - EFDR conservatively estimates FDR when the parent mass tolerance is small.
- - E.g. When parent mass tolerance is 30ppm, at EFDR 1% threshold, one identifies approximately 7% less peptide-spectrum matches (PSMs) compared to the case when the target-decoy approach is used to estimate the FDR.
-- **PepFDR**
- - Peptide-level FDR estimated using the target-decoy approach.
- - Reported only if "-tda 1" is specified.
- - If multiple spectra are matched to the same peptide, only the best scoring PSM (lowest SpecProb) is retained. After that, PepFDR is calculated as \#DecoyPSMs\>s / \#TargetPSMs\>s among the retained PSMs. This approximates the FDR of the set of unique peptides. In the MS-GFDB output, the same PepFDR value is given to all PSMs sharing the peptide. So, even a low-quality PSM may get a low PepFDR value (if it has a high-quality "sibling" PSM sharing the peptide). Note that this should not be used to count the number of identified PSMs.
-
-### MS-GFDB output example
-
-
-| \#SpecFile | SpecIndex | Scan# | FragMethod | Precursor | PMError(ppm) | Charge | Peptide | Protein | DeNovoScore | MSGFScore | SpecProb | P-value | FDR | PepFDR |
-|----|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
-| 090121_NM_Trypsin_20.mzXML | 2838 | 2838 | CID | 964.7707 | 1.5199227 | 3 | K.TIQNSSVSPTSSSSSSSSTGETQTQSSSR.L | IPI:IPI00002349.2\|SWISS-PROT:Q7Z417\|TREMBL:A1L3A7\|ENSEMBL:ENSP00000225388\|REFSEQ:NP_065823\|H-INV:HIT000001036\|VEGA:OTTHUMP00000181037 Tax_Id=9606 Gene_Symbol=NUFIP2 Nuclear fragile X mental retardation-interacting protein 2 | 190 | 181 | 9.380133E-30 | 2.9333857E-22 | 0.0 | 0.0 |
-| 090121_NM_Trypsin_20.mzXML | 3671 | 3671 | ETD | 1113.4758 | 0.6583758 | 2 | R.VGPADDGPAPSGEEEGEGGGEAGGK.E | IPI:IPI00016725.2\|SWISS-PROT:Q9UKN8\|TREMBL:B3KNH2;Q05CN7\|ENSEMBL:ENSP00000361219\|REFSEQ:NP_036336\|H-INV:HIT000071196\|VEGA:OTTHUMP00000022434 Tax_Id=9606 Gene_Symbol=GTF3C4 General transcription factor 3C polypeptide 4 | 162 | 158 | 1.9912463E-28 | 6.0892146E-21 | 0.0 | 0.0 |
-| 090121_NM_Trypsin_20.mzXML | 3031 | 3031 | ETD | 651.64874 | 1.7510794 | 3 | K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L | IPI:IPI00644073.1\|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein | 214 | 202 | 6.7318633E-28 | 2.093763E-20 | 0.0 | 0.0 |
-| 090121_NM_Trypsin_20.mzXML | 19088 | 19088 | CID | 1199.0916 | 10.392676 | 2 | K.VNFSPPGDTNSLFPGTWYLER.V | IPI:IPI00945760.1\|TREMBL:B7Z784;B7Z7M8;B7Z8R3\|REFSEQ:NP_001159579 Tax_Id=9606 Gene_Symbol=HMGCS2 hydroxymethylglutaryl-CoA synthase, mitochondrial isoform 2 precursor | 243 | 243 | 2.9611275E-27 | 8.838129E-20 | 0.0 | 0.0 |
-| 090121_NM_Trypsin_20.mzXML | 3030 | 3030 | CID/ETD | 651.64874 | 1.7510794 | 3 | K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L | IPI:IPI00644073.1\|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein | 389 | 389 | 7.508096E-33 | 2.335189E-25 | 0.0 | 0.0 |
-
-
-# BuildSA
-
-Index a protein database for fast searching.
-
-
-```text
-Usage: java -cp MSGFDB.jar msdbsearch.BuildSA
- -d DatabaseFile (*.fasta or *.fa)
- [-tda 0/1/2] (0: target only, 1: target-decoy database only, 2: both)
-```
-
-
-**Parameters:**
-
-- **-d DbPath**
- - Name of a protein database (\*.fasta or \*.fa)
- - Database file must ends with ".fasta" or ".fa".
-- **-tda 0/1/2**
- - If 0, only "DatabaseFile" will be indexed.
- - If 1, a new database file (\*.revConcat.fasta) will be generated by appending reversed proteins. This forward-reverse database will be indexed.
- - If 2, both the original database and the forward-reverse database file will be indexed.
-
-BuildSA creates a suffix array of the protein database. For an input database file DBFileName.fasta, BuildSA will generate 4 auxiliary files (DbFileName.canno, DBFileName.cnlcp, DBFileName.csarr, DBFileName.cseq).It needs to be executed only once per each database file.
diff --git a/docs/msgfplus.md b/docs/msgfplus.md
index 19117e5b..d3a8b3aa 100644
--- a/docs/msgfplus.md
+++ b/docs/msgfplus.md
@@ -10,7 +10,7 @@ Usage: java -Xmx3500M -jar MSGFPlus.jar
An example parameter file is at https://github.com/MSGFPlus/msgfplus/blob/master/docs/examples/MSGFPlus_Params.txt
Additional parameter files are at https://github.com/MSGFPlus/msgfplus/tree/master/docs/parameterfiles
-[-s SpectrumFile] (*.mzML, *.mzXML, *.mgf, *.ms2, *.pkl or *_dta.txt)
+[-s SpectrumFile] (*.mzML or *.mgf)
Spectra should be centroided (see below for MSConvert example). Profile spectra will be ignored.
[-d DatabaseFile] (*.fasta or *.fa or *.faa)
@@ -123,9 +123,9 @@ Usage: java -Xmx3500M -jar MSGFPlus.jar
[-numMods Count] (Maximum number of dynamic (variable) modifications per peptide; Default: 3)
-[-allowDenseCentroidedPeaks 0/1] (Default: 0 (disabled); 1: (for mzML/mzXML input only) allows inclusion of spectra with high-density centroid data in the search)
- MS-GF+ checks the distance between consecutive peaks in the spectrum, and if the median distance is less than 50 ppm, they are considered profile spectra regardless of the value provided in mzML and mzXML files.
- This parameter allows overriding this check when the mzML/mzXML file says the spectrum is centroided.
+[-allowDenseCentroidedPeaks 0/1] (Default: 0 (disabled); 1: (for mzML input only) allows inclusion of spectra with high-density centroid data in the search)
+ MS-GF+ checks the distance between consecutive peaks in the spectrum, and if the median distance is less than 50 ppm, they are considered profile spectra regardless of the value provided in the mzML file.
+ This parameter allows overriding this check when the mzML file says the spectrum is centroided.
```
@@ -146,10 +146,10 @@ Example command (low-precision spectra):
### Parameters:
-- **-s SpectrumFile** (.mzML\*, \*.mzXML, \*.mgf, \*.ms2, \*.pkl or \*\_dta.txt) - Required
+- **-s SpectrumFile** (\*.mzML or \*.mgf) - Required
- - Spectrum file name. Currently, MS-GF+ supports the following file formats: mzML, mzXML, mzML, mgf, ms2, pkl and \_dta.txt.
- - We recommend to use mzML, whenever possible.
+ - Spectrum file name. This fork supports two spectrum file formats: `mzML` and `mgf`. Legacy formats (`mzXML`, `ms2`, `pkl`, `_dta.txt`) are not supported.
+ - We recommend `mzML` whenever possible.
- For Thermo .raw files, obtain a centroided .mzML using MSConvert, which is part of [ProteoWizard](http://proteowizard.sourceforge.net/).
`MSConvert.exe --mzML --32 --filter "peakPicking true 1-" DatasetName.raw`
diff --git a/docs/output.md b/docs/output.md
index bb840273..f091479c 100644
--- a/docs/output.md
+++ b/docs/output.md
@@ -8,8 +8,10 @@ Select the format with `-outputFormat`:
| Flag | Format | Extension | Typical use |
|---|---|---|---|
-| `-outputFormat 0` (default) | Percolator `.pin` | `.pin` | Feed to Percolator / MS²Rescore / Mokapot for FDR-calibrated rescoring |
-| `-outputFormat 1` | Tab-separated values | `.tsv` | Direct inspection / downstream tools that consume TSV |
+| `-outputFormat pin` (default) | Percolator `.pin` | `.pin` | Feed to Percolator / MS²Rescore / Mokapot for FDR-calibrated rescoring |
+| `-outputFormat tsv` | Tab-separated values | `.tsv` | Direct inspection / downstream tools that consume TSV |
+
+`-outputFormat` accepts the named values `pin` and `tsv` (case-insensitive). Numeric forms (`0`, `1`) accepted by older releases are no longer recognised — pass the named value instead.
The output path (`-o`) must use the matching extension. If `-o` is omitted, MS-GF+ writes `.pin` (or `.tsv`) in the spectrum file's directory.
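The named-value rule for `-outputFormat` can be sketched as a small enum parser. The names below (`OutputFormatSketch`, `parse`) are hypothetical and only illustrate the documented behaviour; the real option is handled through picocli in `MSGFPlusOptions`, which this hunk does not show.

```java
import java.util.Locale;

// Hypothetical sketch: case-insensitive named values, numeric forms rejected.
enum OutputFormatSketch {
    PIN(".pin"),
    TSV(".tsv");

    final String extension;

    OutputFormatSketch(String extension) {
        this.extension = extension;
    }

    static OutputFormatSketch parse(String value) {
        switch (value.trim().toLowerCase(Locale.ROOT)) {
            case "pin":
                return PIN;
            case "tsv":
                return TSV;
            default:
                // Covers the legacy numeric forms "0" and "1" as well as typos.
                throw new IllegalArgumentException(
                        "unknown -outputFormat value '" + value + "' (expected pin or tsv)");
        }
    }
}
```

Lower-casing with `Locale.ROOT` keeps the comparison independent of the JVM's default locale.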
diff --git a/docs/readme.md b/docs/readme.md
index 3f58ab68..14fc1ecc 100644
--- a/docs/readme.md
+++ b/docs/readme.md
@@ -10,8 +10,8 @@ Static HTML under `docs/` was replaced with these Markdown pages so they read we
### Summary
- MS-GF+ is an MS/MS database search tool that is sensitive (it identifies more peptides than other database search tools and as many peptides as spectral library search tools) and universal (works well for diverse types of spectra, different configurations of MS instruments and different experimental protocols).
-- Input: HUPO PSI standard mzML (also mzXML / MGF / MS2 / PKL).
-- Output: Percolator `.pin` (default, for rescoring) or TSV. **mzIdentML (`.mzid`) output has been removed as of the next release** — MS-GF+ now feeds downstream Percolator pipelines directly via `.pin`. See [Changelog](changelog.md) for migration notes.
+- Input: HUPO PSI standard mzML, or MGF (mzXML, MS2, PKL, and `_dta.txt` are not supported in this fork).
+- Output: Percolator `.pin` (default, for rescoring) or TSV. mzIdentML (`.mzid`) output has been removed — MS-GF+ now feeds downstream Percolator pipelines directly via `.pin`. See [Changelog](changelog.md) for migration notes.
### Usage and help
@@ -21,7 +21,6 @@ Static HTML under `docs/` was replaced with these Markdown pages so they read we
- [Suffix array builder (BuildSA)](buildsa.md)
- [Isobaric labelling: TMT / TMTpro / iTRAQ recipes](isobariclabeling.md)
- [Troubleshooting & common errors](troubleshooting.md)
-- [MS-GFDB (obsolete)](ms-gfdb.md)
### Publications
diff --git a/docs/troubleshooting.md b/docs/troubleshooting.md
index fa6ce803..1d499daa 100644
--- a/docs/troubleshooting.md
+++ b/docs/troubleshooting.md
@@ -50,7 +50,7 @@ MS-GF+ currently uses `int`-indexed suffix-array and byte-array structures for t
Affected workflows: metaproteomics, proteogenomics, antibody-repertoire searches, and pan-microbial databases.
-**Workaround today** — split the FASTA into chunks ≤ 250 MB, run one MS-GF+ search per chunk, and merge the resulting mzIdentML files. [MzidMerger](https://github.com/PNNL-Comp-Mass-Spec/Mzid-Merger) is the standard tool for the merge step.
+**Workaround today** — split the FASTA into chunks ≤ 250 MB, run one MS-GF+ search per chunk, and concatenate the resulting `.pin` (or `.tsv`) files. For `.pin` outputs the header line repeats per chunk; drop duplicate header rows after the first, then feed the merged file to Percolator.
**Planned fix** — 64-bit indexed FASTA storage is tracked as Priority 1 in the `bigbio/msgfplus` performance roadmap. See the investigation note in `.claude/investigations/` (not shipped).
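The concatenation step of the workaround in this hunk could look roughly like the sketch below. It assumes each chunk's output starts with exactly one header line that is identical across chunks; `PinChunkMergeSketch` and `merge` are hypothetical names, not part of MS-GF+ or Percolator.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: concatenate per-chunk outputs, keeping only the first header row.
final class PinChunkMergeSketch {
    static List<String> merge(List<List<String>> chunks) {
        List<String> merged = new ArrayList<>();
        String header = null;
        for (List<String> chunk : chunks) {
            if (chunk.isEmpty()) {
                continue; // nothing to merge from an empty chunk
            }
            int start = 0;
            if (header == null) {
                header = chunk.get(0);   // first header seen is kept
                merged.add(header);
                start = 1;
            } else if (chunk.get(0).equals(header)) {
                start = 1;               // drop the repeated header
            }
            merged.addAll(chunk.subList(start, chunk.size()));
        }
        return merged;
    }
}
```

In practice each chunk would be read with `java.nio.file.Files.readAllLines` and the result written with `Files.write`. A chunk whose first line differs from the kept header is passed through unchanged here, so mismatched column sets should be checked before feeding the merged file to Percolator.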
@@ -107,7 +107,7 @@ Related issue: [#52](https://github.com/MSGFPlus/msgfplus/issues/52).
Reported in [OpenMS #1764](https://github.com/OpenMS/OpenMS/issues/1764). The command line works; TOPPAS fails because of how it passes environment and quoted arguments.
-**Workaround** — run MS-GF+ directly from the command line and import the resulting mzIdentML into OpenMS.
+**Workaround** — run MS-GF+ directly from the command line and feed the resulting `.pin` (or `.tsv`) into OpenMS via `PercolatorAdapter` / `MSGFPlusAdapter`.
---
diff --git a/pom.xml b/pom.xml
index 935a2624..0256882d 100644
--- a/pom.xml
+++ b/pom.xml
@@ -38,7 +38,7 @@
true
- edu.ucsd.msjava.ui.MSGFPlus
+ edu.ucsd.msjava.cli.MSGFPlus
@@ -80,7 +80,7 @@
MSGFPlus
- edu.ucsd.msjava.ui.MSGFPlus
+ edu.ucsd.msjava.cli.MSGFPlus
@@ -113,13 +113,10 @@
test
jar
-
- org.apache.commons
- commons-text
- 1.11.0
-
+
+
it.unimi.dsi
fastutil
@@ -136,9 +133,9 @@
1.2.12
- commons-io
- commons-io
- 2.15.1
+ info.picocli
+ picocli
+ 4.7.6
@@ -159,4 +156,4 @@
https://proteomics.ucsd.edu
MSGF+
-
\ No newline at end of file
+
diff --git a/src/main/java/edu/ucsd/msjava/cli/IntRange.java b/src/main/java/edu/ucsd/msjava/cli/IntRange.java
new file mode 100644
index 00000000..7a8cd369
--- /dev/null
+++ b/src/main/java/edu/ucsd/msjava/cli/IntRange.java
@@ -0,0 +1,51 @@
+package edu.ucsd.msjava.cli;
+
+import picocli.CommandLine.ITypeConverter;
+import picocli.CommandLine.TypeConversionException;
+
+/**
+ * Inclusive integer range parsed from CLI/config-file syntax
+ * {@code "min,max"} or single value {@code "n"} (interpreted as
+ * {@code n,n}). Used by {@code -ti}, {@code -msLevel}, {@code -index}.
+ */
+public record IntRange(int min, int max) {
+
+ public IntRange {
+ if (min > max) {
+ throw new IllegalArgumentException("min (" + min + ") > max (" + max + ")");
+ }
+ }
+
+ public static IntRange parse(String value) {
+ String[] tok = value.split(",");
+ try {
+ if (tok.length == 1) {
+ int v = Integer.parseInt(tok[0].trim());
+ return new IntRange(v, v);
+ }
+ if (tok.length == 2) {
+ return new IntRange(
+ Integer.parseInt(tok[0].trim()),
+ Integer.parseInt(tok[1].trim()));
+ }
+ } catch (NumberFormatException e) {
+ throw new IllegalArgumentException("invalid range: " + value, e);
+ }
+ throw new IllegalArgumentException("invalid range syntax (expected 'min,max' or single int): " + value);
+ }
+
+ @Override public String toString() {
+ return min == max ? Integer.toString(min) : min + "," + max;
+ }
+
+ /** picocli {@link ITypeConverter} that wraps {@link #parse(String)}. */
+ public static final class Converter implements ITypeConverter<IntRange> {
+ @Override public IntRange convert(String value) {
+ try {
+ return parse(value);
+ } catch (IllegalArgumentException e) {
+ throw new TypeConversionException(e.getMessage());
+ }
+ }
+ }
+}
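To make the accepted syntax of `IntRange.parse` concrete, here is a standalone sketch that repeats the same parsing logic without the picocli dependency. `IntRangeSketch` is a hypothetical stand-in written as a plain class rather than a record; it is meant for experimenting with inputs, not as a replacement for the class in this diff.

```java
// Hypothetical standalone copy of the parsing rules in IntRange.parse.
final class IntRangeSketch {
    final int min;
    final int max;

    IntRangeSketch(int min, int max) {
        if (min > max) {
            throw new IllegalArgumentException("min (" + min + ") > max (" + max + ")");
        }
        this.min = min;
        this.max = max;
    }

    static IntRangeSketch parse(String value) {
        String[] tok = value.split(",");
        try {
            if (tok.length == 1) {
                int v = Integer.parseInt(tok[0].trim());
                return new IntRangeSketch(v, v);
            }
            if (tok.length == 2) {
                return new IntRangeSketch(
                        Integer.parseInt(tok[0].trim()),
                        Integer.parseInt(tok[1].trim()));
            }
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("invalid range: " + value, e);
        }
        throw new IllegalArgumentException("invalid range syntax: " + value);
    }

    @Override public String toString() {
        return min == max ? Integer.toString(min) : min + "," + max;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0,1", "2", " 1 , 3 ", "3,1", "a,b", "1,2,3", "1,"}) {
            try {
                System.out.println("'" + s + "' -> " + parse(s));
            } catch (IllegalArgumentException e) {
                System.out.println("'" + s + "' -> rejected: " + e.getMessage());
            }
        }
    }
}
```

One behaviour worth noting: `String.split(",")` discards trailing empty tokens, so an input such as `1,` yields a single token and is accepted as the range `1,1` rather than rejected. If that is not intended, `value.split(",", -1)` would keep the empty token and route the input to the error path.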
diff --git a/src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java b/src/main/java/edu/ucsd/msjava/cli/MSGFPlus.java
similarity index 76%
rename from src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java
rename to src/main/java/edu/ucsd/msjava/cli/MSGFPlus.java
index 3a10c9bc..31b7188e 100644
--- a/src/main/java/edu/ucsd/msjava/ui/MSGFPlus.java
+++ b/src/main/java/edu/ucsd/msjava/cli/MSGFPlus.java
@@ -1,542 +1,642 @@
-package edu.ucsd.msjava.ui;
-
-import edu.ucsd.msjava.fdr.ComputeFDR;
-import edu.ucsd.msjava.misc.MSGFLogger;
-import edu.ucsd.msjava.misc.RunManifestWriter;
-import edu.ucsd.msjava.misc.ThreadPoolExecutorWithExceptions;
-import edu.ucsd.msjava.msdbsearch.*;
-import edu.ucsd.msjava.msgf.Tolerance;
-import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
-import edu.ucsd.msjava.msutil.*;
-import edu.ucsd.msjava.mzid.DirectPinWriter;
-import edu.ucsd.msjava.mzid.DirectTSVWriter;
-import edu.ucsd.msjava.mzml.StaxMzMLParser;
-import edu.ucsd.msjava.params.ParamManager;
-import edu.ucsd.msjava.sequences.Constants;
-
-import java.io.File;
-import java.io.IOException;
-import java.nio.file.Paths;
-import java.util.ArrayList;
-import java.util.Collections;
-import java.util.List;
-import java.util.concurrent.TimeUnit;
-import java.util.logging.Level;
-import java.util.logging.Logger;
-
-
-public class MSGFPlus {
- public static final String VERSION = "Release (v2026.03.25)";
- public static final String RELEASE_DATE = "25 March 2026";
-
- public static final String DECOY_DB_EXTENSION = ".revCat.fasta";
- public static final String DEFAULT_DECOY_PROTEIN_PREFIX = "XXX";
-
- // Set this to true when debugging
- private static final boolean DISABLE_THREADING = false;
-
- // Snapshot of the original CLI argv, captured in main() so that
- // RunManifestWriter can record it alongside the mzid without
- // threading argv through runMSGFPlus's many call sites.
- private static volatile String[] argvSnapshot = new String[0];
-
- public static void main(String argv[]) {
- long startTime = System.currentTimeMillis();
- argvSnapshot = argv == null ? new String[0] : argv.clone();
-
- ParamManager paramManager = new ParamManager("MS-GF+", MSGFPlus.VERSION, MSGFPlus.RELEASE_DATE, "java -Xmx3500M -jar MSGFPlus.jar");
- paramManager.addMSGFPlusParams();
-
- if (argv.length == 0) {
- paramManager.printUsageInfo();
- return;
- }
-
- StaxMzMLParser.turnOffLogs();
-
- // Parse parameters
- String errMessage = paramManager.parseParams(argv);
- if (errMessage != null) {
- MSGFLogger.error(errMessage);
- System.out.println();
- paramManager.printUsageInfo();
- System.exit(-1);
- }
-
- // Propagate verbose flag to the shared logger before any downstream code logs.
- MSGFLogger.setVerbose(paramManager.getVerboseFlag() == 1);
-
- // Running MS-GF+
- paramManager.printToolInfo();
- paramManager.printJVMInfo();
- String errorMessage = null;
- try {
- errorMessage = runMSGFPlus(paramManager);
- } catch (Exception e) {
- e.printStackTrace();
- System.exit(-1);
- }
-
- if (errorMessage != null) {
- MSGFLogger.error(errorMessage);
- System.out.println();
- System.exit(-1);
- } else
- MSGFLogger.info("MS-GF+ complete (total elapsed time: %.2f sec)", (System.currentTimeMillis() - startTime) / (float) 1000);
- }
-
- public static String runMSGFPlus(ParamManager paramManager) {
- SearchParams params = new SearchParams();
- String errorMessage = params.parse(paramManager);
-
- if (errorMessage != null) {
- return errorMessage;
- }
-
- List<DBSearchIOFiles> ioList = params.getDBSearchIOList();
- boolean multiFiles = false;
- if (ioList.size() >= 2) {
- MSGFLogger.info("Processing " + ioList.size() + " spectra");
- for (DBSearchIOFiles ioFiles : ioList) {
- MSGFLogger.debug("\t" + ioFiles.getSpecFile().getName());
- }
- multiFiles = true;
- }
-
- int ioIndex = -1;
- for (DBSearchIOFiles ioFiles : ioList) {
- ++ioIndex;
- File specFile = ioFiles.getSpecFile();
- SpecFileFormat specFormat = ioFiles.getSpecFileFormat();
- File outputFile = ioFiles.getOutputFile();
-
- if (multiFiles) {
- if (!outputFile.exists()) {
- MSGFLogger.info("\nProcessing " + specFile.getPath());
- MSGFLogger.debug("Writing results to " + outputFile.getPath());
- String errMsg = runMSGFPlus(ioIndex, specFormat, outputFile, params);
- if (errMsg != null) {
- return errMsg;
- }
- RunManifestWriter.write(ioFiles, params, VERSION, argvSnapshot);
- } else {
- MSGFLogger.info("\nIgnoring " + specFile.getPath());
- MSGFLogger.debug("Output file " + outputFile.getPath() + " exists.");
- }
- } else {
- String errMsg = runMSGFPlus(ioIndex, specFormat, outputFile, params);
- if (errMsg != null) {
- return errMsg;
- }
- RunManifestWriter.write(ioFiles, params, VERSION, argvSnapshot);
- }
- }
-
- return null;
- }
-
- private static String runMSGFPlus(int ioIndex, SpecFileFormat specFormat, File outputFile, SearchParams params) {
- long startTime = System.currentTimeMillis();
-
- // Verify that the output directory exists and can be written to
- File outputDirectory = outputFile.getParentFile();
- if (outputDirectory != null) {
- if (!outputDirectory.exists()) {
- System.out.println("Creating directory " + outputDirectory.getPath());
- boolean success = outputDirectory.mkdirs();
- if (!success) {
- return "Unable to create the missing directory: " + outputDirectory.getPath();
- }
- } else if (!outputDirectory.isDirectory()) {
- return "Invalid output file path (file path instead of directory path?): " + outputDirectory.getPath();
- }
-
- // An easy way to test for write access is outputDirectory.canWrite()
- // However, on Windows this is not always accurate
- // Thus, create a temporary file then delete it
- try {
- File testFile = File.createTempFile("MSGFPlus", ".tmp", outputDirectory);
- testFile.delete();
- } catch (java.io.IOException e) {
- return "Cannot create files in the output directory: " + e.getMessage();
- } catch (SecurityException e) {
- return "Cannot create files in the output directory; permission denied for: " + outputDirectory.getPath();
- }
- }
-
- // DB file
- File databaseFile = params.getDatabaseFile();
-
- if (databaseFile == null) {
- return "Database file is not defined; use -d at the command line or DatabaseFile in a config file";
- }
-
- if (!databaseFile.exists()) {
- return "Database file not found: " + databaseFile.getPath();
- }
-
- // Precursor mass tolerance
- Tolerance leftPrecursorMassTolerance = params.getLeftPrecursorMassTolerance();
- Tolerance rightPrecursorMassTolerance = params.getRightPrecursorMassTolerance();
-
- int minIsotopeError = params.getMinIsotopeError(); // inclusive
- int maxIsotopeError = params.getMaxIsotopeError(); // inclusive
-
- Enzyme enzyme = params.getEnzyme();
-
- ActivationMethod activationMethod = params.getActivationMethod();
- InstrumentType instType = params.getInstType();
- Protocol protocol = params.getProtocol();
-
- AminoAcidSet aaSet = params.getAASet();
-
- int startSpecIndex = params.getStartSpecIndex();
- int endSpecIndex = params.getEndSpecIndex();
-
- boolean useTDA = params.useTDA();
-
- int minCharge = params.getMinCharge();
- int maxCharge = params.getMaxCharge();
-
- int numThreads = params.getNumThreads();
- boolean doNotUseEdgeScore = params.doNotUseEdgeScore();
- boolean allowDenseCentroidedPeaks = params.getAllowDenseCentroidedPeaks();
-
- int minNumPeaksPerSpectrum = params.getMinNumPeaksPerSpectrum();
- if (minNumPeaksPerSpectrum == -1) // not specified
- {
- if (instType == InstrumentType.TOF)
- minNumPeaksPerSpectrum = Constants.MIN_NUM_PEAKS_PER_SPECTRUM_TOF;
- else
- minNumPeaksPerSpectrum = Constants.MIN_NUM_PEAKS_PER_SPECTRUM;
- }
-
- String decoyProteinPrefix = params.getDecoyProteinPrefix();
-
- System.out.println("Loading database files...");
-
- File dbIndexDir = params.getDBIndexDir();
- if (dbIndexDir != null) {
-
- File newDBFile = new File(Paths.get(dbIndexDir.getPath(), databaseFile.getName()).toString());
- if (!useTDA) {
- if (!newDBFile.exists()) {
- System.out.println("Creating " + newDBFile.getPath() + ".");
- ReverseDB.copyDB(databaseFile.getPath(), newDBFile.getPath());
- }
- }
- databaseFile = newDBFile;
- }
-
- if (useTDA) {
- String dbFileName = databaseFile.getName();
- String concatDBFileName = dbFileName.substring(0, dbFileName.lastIndexOf('.')) + DECOY_DB_EXTENSION;
-
- String concatDBFilePath = Paths.get(databaseFile.getAbsoluteFile().getParent(), concatDBFileName).toString();
- File concatTargetDecoyDBFile = new File(concatDBFilePath);
-
- if (!concatTargetDecoyDBFile.exists()) {
- System.out.println("Creating " + concatTargetDecoyDBFile.getPath() + ".");
- if (ReverseDB.reverseDB(databaseFile.getPath(), concatTargetDecoyDBFile.getPath(), true, decoyProteinPrefix) == false) {
- return "Cannot create a decoy database file!";
- }
- }
- databaseFile = concatTargetDecoyDBFile;
- }
-
- DBScanner.setAminoAcidProbabilities(databaseFile.getPath(), aaSet);
- aaSet.registerEnzyme(enzyme);
-
- CompactFastaSequence fastaSequence = new CompactFastaSequence(databaseFile.getPath());
- fastaSequence.setDecoyProteinPrefix(decoyProteinPrefix);
-
- if (useTDA) {
- float ratioUniqueProteins = fastaSequence.getRatioUniqueProteins();
- if (ratioUniqueProteins < 0.5f) {
- fastaSequence.printTooManyDuplicateSequencesMessage(databaseFile.getName(), "MS-GF+");
- System.exit(-1);
- }
-
- float fractionDecoyProteins = fastaSequence.getFractionDecoyProteins();
- if (fractionDecoyProteins < 0.4f || fractionDecoyProteins > 0.6f) {
- MSGFLogger.error("Error while reading: " + databaseFile.getName() + " (fraction of decoy proteins: " + fractionDecoyProteins + ")");
- MSGFLogger.error("Delete " + databaseFile.getName() + " and run MS-GF+ again.");
- MSGFLogger.error("Decoy protein names should start with " + fastaSequence.getDecoyProteinPrefix());
- System.exit(-1);
- }
- }
-
- CompactSuffixArray sa = new CompactSuffixArray(fastaSequence, params.getMaxPeptideLength());
- System.out.print("Loading database finished ");
- System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - startTime) / 1000);
-
- System.out.println("Reading spectra...");
-
- File specFile = params.getDBSearchIOList().get(ioIndex).getSpecFile();
-
- // Show a message of the form "Opening mzML file QC_Mam_19_01_PNNL_10_06Jan21_Arwen_WBEH-20-12-01.mzML"
- System.out.printf("Opening %s %s\n", specFormat.getPSIName(), specFile.getName());
-
- SpectraAccessor specAcc = new SpectraAccessor(specFile, specFormat);
- int minMSLevel = params.getMinMSLevel();
- int maxMSLevel = params.getMaxMSLevel();
- specAcc.setMSLevelRange(minMSLevel, maxMSLevel);
-
- if (specAcc.getSpecMap() == null || specAcc.getSpecItr() == null)
- return "Error while parsing spectrum file: " + specFile.getPath();
-
- ArrayList<SpecKey> specKeyList = SpecKey.getSpecKeyList(specAcc,
- startSpecIndex, endSpecIndex, minCharge, maxCharge, activationMethod, minNumPeaksPerSpectrum, allowDenseCentroidedPeaks,
- minMSLevel, maxMSLevel);
-
- int specSize = specKeyList.size();
- if (specSize == 0)
- return specFile.getPath() + " does not have any valid spectra";
-
- System.out.print("Reading spectra finished ");
- System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - startTime) / 1000);
-
- if (numThreads <= 0)
- numThreads = 1;
-
- // Minimum spectra/task(or thread) floor for efficiency; going smaller slows down processing.
- // Configurable via -minSpectraPerThread for users on many-core hosts with small inputs (see #52).
- int spectraPerTaskMinimum = params.getMinSpectraPerThread();
- int maxThreads = Math.max(1, Math.round((float) specSize / spectraPerTaskMinimum));
- if (maxThreads < numThreads) {
- if (maxThreads == 1) {
- System.out.println("Note: under " + spectraPerTaskMinimum + " spectra; using 1 thread instead of " + numThreads);
- } else {
- System.out.println("Note: " + spectraPerTaskMinimum + " spectra per thread minimum; using " + maxThreads + " threads instead of " + numThreads);
- }
-
- numThreads = maxThreads;
- }
-
- System.out.println("Using " + numThreads + (numThreads == 1 ? " thread." : " threads."));
-
- // Print out parameters
- System.out.println("Search Parameters:");
- System.out.println(params.toString());
-
- SpecDataType specDataType = new SpecDataType(activationMethod, instType, enzyme, protocol);
-
- // Achievement B — two-pass precursor mass calibration (P2-cal).
- // Runs a sampled pre-pass over the current file's SpecKeys to learn
- // a per-file ppm shift, then stores it on DBSearchIOFiles so every
- // task-local ScoredSpectraMap picks it up. OFF mode is a strict
- // no-op: we skip the pre-pass entirely and never call the setter,
- // so DBSearchIOFiles.precursorMassShiftPpm stays at its 0.0 default
- // and ScoredSpectraMap.applyShift() takes its exact-zero fast path.
- DBSearchIOFiles currentIoFiles = params.getDBSearchIOList().get(ioIndex);
- if (params.getPrecursorCalMode() != SearchParams.PrecursorCalMode.OFF) {
- long calStart = System.currentTimeMillis();
- MassCalibrator calibrator = new MassCalibrator(
- specAcc,
- sa,
- aaSet,
- params,
- specKeyList,
- leftPrecursorMassTolerance,
- rightPrecursorMassTolerance,
- minIsotopeError,
- maxIsotopeError,
- specDataType);
- double shiftPpm = calibrator.learnPrecursorShiftPpm(ioIndex);
- boolean applyLearnedShift = shiftPpm != 0.0
- || params.getPrecursorCalMode() == SearchParams.PrecursorCalMode.ON;
- if (applyLearnedShift) {
- currentIoFiles.setPrecursorMassShiftPpm(shiftPpm);
- System.out.printf("Precursor mass shift learned: %.3f ppm (elapsed: %.2f sec)%n",
- shiftPpm, (System.currentTimeMillis() - calStart) / 1000.0);
- } else {
- System.out.printf("Precursor mass calibration skipped (insufficient confident PSMs; elapsed: %.2f sec)%n",
- (System.currentTimeMillis() - calStart) / 1000.0);
- }
- }
- double precursorMassShiftPpm = currentIoFiles.getPrecursorMassShiftPpm();
-
- List resultList = Collections.synchronizedList(new ArrayList());
-
- int toIndexGlobal = specSize;
- while (toIndexGlobal < specSize) {
- SpecKey lastSpecKey = specKeyList.get(toIndexGlobal - 1);
- SpecKey nextSpecKey = specKeyList.get(toIndexGlobal);
-
- if (lastSpecKey.getSpecIndex() == nextSpecKey.getSpecIndex())
- toIndexGlobal++;
- else
- break;
- }
-
- System.out.println("Spectrum 0-" + (toIndexGlobal - 1) + " (total: " + specSize + ")");
-
- // Thread pool
- ThreadPoolExecutorWithExceptions executor = ThreadPoolExecutorWithExceptions.newFixedThreadPool(numThreads);
- executor.setTaskName("Search");
-
- int numTasks = Math.min(numThreads * 3, Math.round((float) specSize / spectraPerTaskMinimum));
- if (numThreads <= 1) {
- numTasks = 1;
- }
-
- if (params.getNumTasks() != 0) {
- numTasks = params.getNumTasks();
- if (numTasks < 0) {
- numTasks = numThreads * (numTasks * -1);
- }
- if (numTasks < numThreads) {
- System.out.println("Changing specified tasks from " + numTasks + " to " + numThreads + " to provide the minimum of one task per thread.");
- numTasks = numThreads;
- }
- }
- if (numTasks > 1) {
- System.out.println("Splitting work into " + numTasks + " tasks.");
- } else {
- System.out.println("Searching using a single task.");
- }
-
- // Partition specKeyList
- int size = toIndexGlobal;
- int residue = size % numTasks;
-
- int[] startIndex = new int[numTasks];
- int[] endIndex = new int[numTasks];
-
- int subListSize = size / numTasks;
- for (int i = 0; i < numTasks; i++) {
- startIndex[i] = i > 0 ? endIndex[i - 1] : 0;
- endIndex[i] = startIndex[i] + subListSize + (i < residue ? 1 : 0);
-
- subListSize = size / numTasks;
- while (endIndex[i] < specKeyList.size()) {
- SpecKey lastSpecKey = specKeyList.get(endIndex[i] - 1);
- SpecKey nextSpecKey = specKeyList.get(endIndex[i]);
-
- if (lastSpecKey.getSpecIndex() == nextSpecKey.getSpecIndex()) {
- ++endIndex[i];
- --subListSize;
- } else
- break;
- }
- }
-
- try {
- for (int i = 0; i < numTasks; i++) {
- final int taskStartIndex = startIndex[i];
- final int taskEndIndex = endIndex[i];
- final boolean storeRankScorer = params.outputAdditionalFeatures();
- final int taskNum = i + 1;
-
- // Defer ScoredSpectraMap construction to the worker thread so all
- // tasks' spectrum heaps aren't allocated up front when queued.
- ConcurrentMSGFPlus.RunMSGFPlus msgfplusExecutor = new ConcurrentMSGFPlus.RunMSGFPlus(
- () -> {
- ScoredSpectraMap specScanner = new ScoredSpectraMap(
- specAcc,
- Collections.synchronizedList(specKeyList.subList(taskStartIndex, taskEndIndex)),
- leftPrecursorMassTolerance,
- rightPrecursorMassTolerance,
- minIsotopeError,
- maxIsotopeError,
- specDataType,
- storeRankScorer,
- false,
- precursorMassShiftPpm
- );
- if (doNotUseEdgeScore)
- specScanner.turnOffEdgeScoring();
- return specScanner;
- },
- sa,
- params,
- resultList,
- taskNum
- );
-
- if (DISABLE_THREADING) {
- msgfplusExecutor.run();
- } else {
- executor.execute(msgfplusExecutor);
- }
-
- }
- // Output initial progress report.
- executor.outputProgressReport();
-
- executor.shutdown();
-
- try {
- executor.awaitTerminationWithExceptions(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
- } catch (InterruptedException e) {
- if (!executor.HasThrownData()) {
- e.printStackTrace();
- Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, e.getMessage(), e);
- }
- }
-
- // Output completed progress report.
- executor.outputProgressReport();
-
- } catch (OutOfMemoryError ex) {
- ex.printStackTrace();
- Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
- executor.shutdownNow();
- int taskMult = numTasks / numThreads;
- return "Task terminated; results incomplete. Please run again with a greater amount of memory, using \"-Xmx4G\", for example.\n" +
- "\tYou can also use less memory by increasing the number of tasks used for the search, at the cost of more time.\n" +
- "\tTry doubling the number used for this search with \"-tasks -" + (taskMult * 2) + "\" or \"-tasks " + (numTasks * 2) + "\".";
- } catch (Exception ex) {
- ex.printStackTrace();
- Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
- executor.shutdownNow();
- return "Task terminated; results incomplete. Please run again.";
- } catch (Throwable ex) {
- ex.printStackTrace();
- Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
- executor.shutdownNow();
- return "Task terminated; results incomplete. Please run again.";
- }
-
- long qValueStartTime = System.currentTimeMillis();
-
- if (params.useTDA()) {
- // Compute Q-values
- System.out.println("Computing q-values...");
- ComputeFDR.addQValues(resultList, sa, false, decoyProteinPrefix);
- System.out.print("Computing q-values finished ");
- System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - qValueStartTime) / 1000);
- }
-
- // Sort by spectral E-values then write to disk
-
- long saveResultsStartTime = System.currentTimeMillis();
-
- System.out.println("Writing results...");
- Collections.sort(resultList);
-
- if (params.writeTsv()) {
- DirectTSVWriter tsvWriter = new DirectTSVWriter(params, aaSet, sa, specAcc, ioIndex);
- try {
- tsvWriter.writeResults(resultList, outputFile);
- } catch (IOException e) {
- return "Error writing TSV output: " + e.getMessage();
- }
- System.out.println("TSV file: " + outputFile.getPath());
- }
-
- if (params.writePin()) {
- DirectPinWriter pinWriter = new DirectPinWriter(params, aaSet, sa, specAcc, ioIndex);
- try {
- pinWriter.writeResults(resultList, outputFile);
- } catch (IOException e) {
- return "Error writing pin output: " + e.getMessage();
- }
- System.out.println("PIN file: " + outputFile.getPath());
- }
-
- System.out.print("Writing results finished ");
- System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - saveResultsStartTime) / 1000);
- return null;
- }
-}
+package edu.ucsd.msjava.cli;
+
+import edu.ucsd.msjava.fdr.ComputeFDR;
+import edu.ucsd.msjava.misc.MSGFLogger;
+import edu.ucsd.msjava.misc.RunManifestWriter;
+import edu.ucsd.msjava.misc.ThreadPoolExecutorWithExceptions;
+import edu.ucsd.msjava.msdbsearch.*;
+import edu.ucsd.msjava.msgf.Tolerance;
+import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
+import edu.ucsd.msjava.msutil.*;
+import edu.ucsd.msjava.output.DirectPinWriter;
+import edu.ucsd.msjava.output.DirectTSVWriter;
+import edu.ucsd.msjava.mzml.StaxMzMLParser;
+import edu.ucsd.msjava.sequences.Constants;
+import picocli.CommandLine;
+import picocli.CommandLine.ParameterException;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.file.Paths;
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.List;
+import java.util.concurrent.ForkJoinPool;
+import java.util.concurrent.Future;
+import java.util.concurrent.TimeUnit;
+import java.util.logging.Level;
+import java.util.logging.Logger;
+
+
+public class MSGFPlus {
+ public static final String VERSION = "Release (v2026.03.25)";
+ public static final String RELEASE_DATE = "25 March 2026";
+
+ public static final String DECOY_DB_EXTENSION = ".revCat.fasta";
+ public static final String DEFAULT_DECOY_PROTEIN_PREFIX = "XXX";
+
+ // Set this to true when debugging
+ private static final boolean DISABLE_THREADING = false;
+
+ /** Default numTasks-per-thread multiplier when {@code -tasks} is not
+ * passed. Users can override at the CLI via {@code -tasks -N}. */
+ private static final int DEFAULT_TASKS_PER_THREAD = 3;
+ private static final String USE_FORK_JOIN_PROPERTY = "msgfplus.useForkJoin";
+
+ // Snapshot of the original CLI argv, captured in main() so that
+ // RunManifestWriter can record it alongside the output file without
+ // threading argv through runMSGFPlus's many call sites.
+ private static volatile String[] argvSnapshot = new String[0];
+
+ public static void main(String argv[]) {
+ long startTime = System.currentTimeMillis();
+ argvSnapshot = argv == null ? new String[0] : argv.clone();
+
+ MSGFPlusOptions opts = new MSGFPlusOptions();
+ CommandLine cl = MSGFPlusOptions.commandLine(opts);
+
+ if (argv.length == 0) {
+ printToolInfo();
+ cl.usage(System.out);
+ return;
+ }
+
+ StaxMzMLParser.turnOffLogs();
+
+ try {
+ cl.parseArgs(argv);
+ } catch (ParameterException e) {
+ MSGFLogger.error(e.getMessage());
+ System.out.println();
+ cl.usage(System.out);
+ System.exit(-1);
+ }
+
+ if (cl.isUsageHelpRequested()) {
+ cl.usage(System.out);
+ return;
+ }
+ if (cl.isVersionHelpRequested()) {
+ System.out.println(VERSION);
+ return;
+ }
+
+ // Propagate verbose flag to the shared logger before any downstream code logs.
+ MSGFLogger.setVerbose(opts.effectiveVerbose() == 1);
+
+ printToolInfo();
+ printJVMInfo();
+
+ String errorMessage = null;
+ try {
+ errorMessage = runMSGFPlus(opts);
+ } catch (Exception e) {
+ e.printStackTrace();
+ System.exit(-1);
+ }
+
+ if (errorMessage != null) {
+ MSGFLogger.error(errorMessage);
+ System.out.println();
+ System.exit(-1);
+ } else
+ MSGFLogger.info("MS-GF+ complete (total elapsed time: %.2f sec)", (System.currentTimeMillis() - startTime) / (float) 1000);
+ }
+
+ private static void printToolInfo() {
+ System.out.println("MS-GF+ " + VERSION + " (" + RELEASE_DATE + ")");
+ }
+
+ private static void printJVMInfo() {
+ System.out.println("Java " + System.getProperty("java.version") + " (" + System.getProperty("java.vendor") + ")");
+ System.out.println(System.getProperty("os.name") + " (" + System.getProperty("os.arch") + ", version " + System.getProperty("os.version") + ")");
+ }
+
+ public static String runMSGFPlus(MSGFPlusOptions opts) {
+ SearchParams params = new SearchParams();
+ String errorMessage = params.parse(opts);
+
+ if (errorMessage != null) {
+ return errorMessage;
+ }
+
+ List<DBSearchIOFiles> ioList = params.getDBSearchIOList();
+ boolean multiFiles = false;
+ if (ioList.size() >= 2) {
+ MSGFLogger.info("Processing " + ioList.size() + " spectra");
+ for (DBSearchIOFiles ioFiles : ioList) {
+ MSGFLogger.debug("\t" + ioFiles.getSpecFile().getName());
+ }
+ multiFiles = true;
+ }
+
+ int ioIndex = -1;
+ for (DBSearchIOFiles ioFiles : ioList) {
+ ++ioIndex;
+ File specFile = ioFiles.getSpecFile();
+ SpecFileFormat specFormat = ioFiles.getSpecFileFormat();
+ File outputFile = ioFiles.getOutputFile();
+
+ if (multiFiles) {
+ if (!outputFile.exists()) {
+ MSGFLogger.info("\nProcessing " + specFile.getPath());
+ MSGFLogger.debug("Writing results to " + outputFile.getPath());
+ String errMsg = runMSGFPlus(ioIndex, specFormat, outputFile, params);
+ if (errMsg != null) {
+ return errMsg;
+ }
+ RunManifestWriter.write(ioFiles, params, VERSION, argvSnapshot);
+ } else {
+ MSGFLogger.info("\nIgnoring " + specFile.getPath());
+ MSGFLogger.debug("Output file " + outputFile.getPath() + " exists.");
+ }
+ } else {
+ String errMsg = runMSGFPlus(ioIndex, specFormat, outputFile, params);
+ if (errMsg != null) {
+ return errMsg;
+ }
+ RunManifestWriter.write(ioFiles, params, VERSION, argvSnapshot);
+ }
+ }
+
+ return null;
+ }
+
+ private static String runMSGFPlus(int ioIndex, SpecFileFormat specFormat, File outputFile, SearchParams params) {
+ long startTime = System.currentTimeMillis();
+
+ // Verify that the output directory exists and can be written to
+ File outputDirectory = outputFile.getParentFile();
+ if (outputDirectory != null) {
+ if (!outputDirectory.exists()) {
+ System.out.println("Creating directory " + outputDirectory.getPath());
+ boolean success = outputDirectory.mkdirs();
+ if (!success) {
+ return "Unable to create the missing directory: " + outputDirectory.getPath();
+ }
+ } else if (!outputDirectory.isDirectory()) {
+ return "Invalid output file path (file path instead of directory path?): " + outputDirectory.getPath();
+ }
+
+ // An easy way to test for write access is outputDirectory.canWrite()
+ // However, on Windows this is not always accurate
+ // Thus, create a temporary file then delete it
+ try {
+ File testFile = File.createTempFile("MSGFPlus", ".tmp", outputDirectory);
+ testFile.delete();
+ } catch (java.io.IOException e) {
+ return "Cannot create files in the output directory: " + e.getMessage();
+ } catch (SecurityException e) {
+ return "Cannot create files in the output directory; permission denied for: " + outputDirectory.getPath();
+ }
+ }
+
+ // DB file
+ File databaseFile = params.getDatabaseFile();
+
+ if (databaseFile == null) {
+ return "Database file is not defined; use -d at the command line or DatabaseFile in a config file";
+ }
+
+ if (!databaseFile.exists()) {
+ return "Database file not found: " + databaseFile.getPath();
+ }
+
+ // Precursor mass tolerance
+ Tolerance leftPrecursorMassTolerance = params.getLeftPrecursorMassTolerance();
+ Tolerance rightPrecursorMassTolerance = params.getRightPrecursorMassTolerance();
+
+ int minIsotopeError = params.getMinIsotopeError(); // inclusive
+ int maxIsotopeError = params.getMaxIsotopeError(); // inclusive
+
+ Enzyme enzyme = params.getEnzyme();
+
+ ActivationMethod activationMethod = params.getActivationMethod();
+ InstrumentType instType = params.getInstType();
+ Protocol protocol = params.getProtocol();
+
+ AminoAcidSet aaSet = params.getAASet();
+
+ int startSpecIndex = params.getStartSpecIndex();
+ int endSpecIndex = params.getEndSpecIndex();
+
+ boolean useTDA = params.useTDA();
+
+ int minCharge = params.getMinCharge();
+ int maxCharge = params.getMaxCharge();
+
+ int numThreads = params.getNumThreads();
+ boolean doNotUseEdgeScore = params.doNotUseEdgeScore();
+ boolean allowDenseCentroidedPeaks = params.getAllowDenseCentroidedPeaks();
+
+ int minNumPeaksPerSpectrum = params.getMinNumPeaksPerSpectrum();
+ if (minNumPeaksPerSpectrum == -1) // not specified
+ {
+ if (instType == InstrumentType.TOF)
+ minNumPeaksPerSpectrum = Constants.MIN_NUM_PEAKS_PER_SPECTRUM_TOF;
+ else
+ minNumPeaksPerSpectrum = Constants.MIN_NUM_PEAKS_PER_SPECTRUM;
+ }
+
+ String decoyProteinPrefix = params.getDecoyProteinPrefix();
+
+ System.out.println("Loading database files...");
+
+ File dbIndexDir = params.getDBIndexDir();
+ if (dbIndexDir != null) {
+
+ File newDBFile = new File(Paths.get(dbIndexDir.getPath(), databaseFile.getName()).toString());
+ if (!useTDA) {
+ if (!newDBFile.exists()) {
+ System.out.println("Creating " + newDBFile.getPath() + ".");
+ ReverseDB.copyDB(databaseFile.getPath(), newDBFile.getPath());
+ }
+ }
+ databaseFile = newDBFile;
+ }
+
+ if (useTDA) {
+ String dbFileName = databaseFile.getName();
+ String concatDBFileName = dbFileName.substring(0, dbFileName.lastIndexOf('.')) + DECOY_DB_EXTENSION;
+
+ String concatDBFilePath = Paths.get(databaseFile.getAbsoluteFile().getParent(), concatDBFileName).toString();
+ File concatTargetDecoyDBFile = new File(concatDBFilePath);
+
+ if (!concatTargetDecoyDBFile.exists()) {
+ System.out.println("Creating " + concatTargetDecoyDBFile.getPath() + ".");
+ if (!ReverseDB.reverseDB(databaseFile.getPath(), concatTargetDecoyDBFile.getPath(), true, decoyProteinPrefix)) {
+ return "Cannot create a decoy database file!";
+ }
+ }
+ databaseFile = concatTargetDecoyDBFile;
+ }
+
+ DBScanner.setAminoAcidProbabilities(databaseFile.getPath(), aaSet);
+ aaSet.registerEnzyme(enzyme);
+
+ CompactFastaSequence fastaSequence = new CompactFastaSequence(databaseFile.getPath());
+ fastaSequence.setDecoyProteinPrefix(decoyProteinPrefix);
+
+ if (useTDA) {
+ float ratioUniqueProteins = fastaSequence.getRatioUniqueProteins();
+ if (ratioUniqueProteins < 0.5f) {
+ fastaSequence.printTooManyDuplicateSequencesMessage(databaseFile.getName(), "MS-GF+");
+ System.exit(-1);
+ }
+
+ float fractionDecoyProteins = fastaSequence.getFractionDecoyProteins();
+ if (fractionDecoyProteins < 0.4f || fractionDecoyProteins > 0.6f) {
+ MSGFLogger.error("Error while reading: " + databaseFile.getName() + " (fraction of decoy proteins: " + fractionDecoyProteins + ")");
+ MSGFLogger.error("Delete " + databaseFile.getName() + " and run MS-GF+ again.");
+ MSGFLogger.error("Decoy protein names should start with " + fastaSequence.getDecoyProteinPrefix());
+ System.exit(-1);
+ }
+ }
+
+ CompactSuffixArray sa = new CompactSuffixArray(fastaSequence, params.getMaxPeptideLength());
+ System.out.print("Loading database finished ");
+ System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - startTime) / 1000);
+
+ System.out.println("Reading spectra...");
+
+ File specFile = params.getDBSearchIOList().get(ioIndex).getSpecFile();
+
+ // Show a message of the form "Opening mzML file QC_Mam_19_01_PNNL_10_06Jan21_Arwen_WBEH-20-12-01.mzML"
+ System.out.printf("Opening %s %s\n", specFormat.getPSIName(), specFile.getName());
+
+ SpectraAccessor specAcc = new SpectraAccessor(specFile, specFormat);
+ int minMSLevel = params.getMinMSLevel();
+ int maxMSLevel = params.getMaxMSLevel();
+ specAcc.setMSLevelRange(minMSLevel, maxMSLevel);
+
+ if (specAcc.getSpecMap() == null || specAcc.getSpecItr() == null)
+ return "Error while parsing spectrum file: " + specFile.getPath();
+
+ ArrayList<SpecKey> specKeyList = SpecKey.getSpecKeyList(specAcc,
+ startSpecIndex, endSpecIndex, minCharge, maxCharge, activationMethod, minNumPeaksPerSpectrum, allowDenseCentroidedPeaks,
+ minMSLevel, maxMSLevel);
+
+ int specSize = specKeyList.size();
+ if (specSize == 0)
+ return specFile.getPath() + " does not have any valid spectra";
+
+ System.out.print("Reading spectra finished ");
+ System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - startTime) / 1000);
+
+ if (numThreads <= 0)
+ numThreads = 1;
+
+ // Minimum number of spectra per task (or thread), kept for efficiency; smaller batches slow processing down.
+ // Configurable via -minSpectraPerThread for users on many-core hosts with small inputs (see #52).
+ int spectraPerTaskMinimum = params.getMinSpectraPerThread();
+ int maxThreads = Math.max(1, Math.round((float) specSize / spectraPerTaskMinimum));
+ if (maxThreads < numThreads) {
+ if (maxThreads == 1) {
+ System.out.println("Note: under " + spectraPerTaskMinimum + " spectra; using 1 thread instead of " + numThreads);
+ } else {
+ System.out.println("Note: " + spectraPerTaskMinimum + " spectra per thread minimum; using " + maxThreads + " threads instead of " + numThreads);
+ }
+
+ numThreads = maxThreads;
+ }
+
+ System.out.println("Using " + numThreads + (numThreads == 1 ? " thread." : " threads."));
+
+ // Print out parameters
+ System.out.println("Search Parameters:");
+ System.out.println(params.toString());
+
+ SpecDataType specDataType = new SpecDataType(activationMethod, instType, enzyme, protocol);
+
+ // Achievement B — two-pass precursor mass calibration (P2-cal).
+ // Runs a sampled pre-pass over the current file's SpecKeys to learn
+ // a per-file ppm shift, then stores it on DBSearchIOFiles so every
+ // task-local ScoredSpectraMap picks it up. OFF mode is a strict
+ // no-op: we skip the pre-pass entirely and never call the setter,
+ // so DBSearchIOFiles.precursorMassShiftPpm stays at its 0.0 default
+ // and ScoredSpectraMap.applyShift() takes its exact-zero fast path.
+ DBSearchIOFiles currentIoFiles = params.getDBSearchIOList().get(ioIndex);
+ if (params.getPrecursorCalMode() != SearchParams.PrecursorCalMode.OFF) {
+ long calStart = System.currentTimeMillis();
+ MassCalibrator calibrator = new MassCalibrator(
+ specAcc,
+ sa,
+ aaSet,
+ params,
+ specKeyList,
+ leftPrecursorMassTolerance,
+ rightPrecursorMassTolerance,
+ minIsotopeError,
+ maxIsotopeError,
+ specDataType);
+ double shiftPpm = calibrator.learnPrecursorShiftPpm(ioIndex);
+ boolean applyLearnedShift = shiftPpm != 0.0
+ || params.getPrecursorCalMode() == SearchParams.PrecursorCalMode.ON;
+ if (applyLearnedShift) {
+ currentIoFiles.setPrecursorMassShiftPpm(shiftPpm);
+ System.out.printf("Precursor mass shift learned: %.3f ppm (elapsed: %.2f sec)%n",
+ shiftPpm, (System.currentTimeMillis() - calStart) / 1000.0);
+ } else {
+ System.out.printf("Precursor mass calibration skipped (insufficient confident PSMs; elapsed: %.2f sec)%n",
+ (System.currentTimeMillis() - calStart) / 1000.0);
+ }
+ }
+ double precursorMassShiftPpm = currentIoFiles.getPrecursorMassShiftPpm();
+
+ List<MSGFPlusMatch> resultList;
+
+ int toIndexGlobal = specSize;
+ while (toIndexGlobal < specSize) {
+ SpecKey lastSpecKey = specKeyList.get(toIndexGlobal - 1);
+ SpecKey nextSpecKey = specKeyList.get(toIndexGlobal);
+
+ if (lastSpecKey.getSpecIndex() == nextSpecKey.getSpecIndex())
+ toIndexGlobal++;
+ else
+ break;
+ }
+
+ System.out.println("Spectrum 0-" + (toIndexGlobal - 1) + " (total: " + specSize + ")");
+
+ boolean useForkJoin = Boolean.getBoolean(USE_FORK_JOIN_PROPERTY);
+
+ ThreadPoolExecutorWithExceptions executor =
+ useForkJoin ? null : ThreadPoolExecutorWithExceptions.newFixedThreadPool(numThreads);
+ if (executor != null) executor.setTaskName("Search");
+ ForkJoinPool fjp = useForkJoin ? new ForkJoinPool(numThreads) : null;
+ List<Future<?>> fjpFutures = useForkJoin ? new ArrayList<>() : null;
+
+ int numTasks = Math.min(numThreads * DEFAULT_TASKS_PER_THREAD, Math.round((float) specSize / spectraPerTaskMinimum));
+ if (numThreads <= 1) {
+ numTasks = 1;
+ }
+
+ if (params.getNumTasks() != 0) {
+ numTasks = params.getNumTasks();
+ if (numTasks < 0) {
+ numTasks = numThreads * (numTasks * -1);
+ }
+ if (numTasks < numThreads) {
+ System.out.println("Changing specified tasks from " + numTasks + " to " + numThreads + " to provide the minimum of one task per thread.");
+ numTasks = numThreads;
+ }
+ }
+ if (numTasks > 1) {
+ System.out.println("Splitting work into " + numTasks + " tasks.");
+ } else {
+ System.out.println("Searching using a single task.");
+ }
+
+ // Partition specKeyList
+ int size = toIndexGlobal;
+ int residue = size % numTasks;
+
+ int[] startIndex = new int[numTasks];
+ int[] endIndex = new int[numTasks];
+
+ int subListSize = size / numTasks;
+ for (int i = 0; i < numTasks; i++) {
+ startIndex[i] = i > 0 ? endIndex[i - 1] : 0;
+ endIndex[i] = startIndex[i] + subListSize + (i < residue ? 1 : 0);
+
+ subListSize = size / numTasks;
+ while (endIndex[i] < specKeyList.size()) {
+ SpecKey lastSpecKey = specKeyList.get(endIndex[i] - 1);
+ SpecKey nextSpecKey = specKeyList.get(endIndex[i]);
+
+ if (lastSpecKey.getSpecIndex() == nextSpecKey.getSpecIndex()) {
+ ++endIndex[i];
+ --subListSize;
+ } else
+ break;
+ }
+ }
+
+ List<ConcurrentMSGFPlus.RunMSGFPlus> submittedTasks = new ArrayList<>(numTasks);
+
+ try {
+ for (int i = 0; i < numTasks; i++) {
+ final int taskStartIndex = startIndex[i];
+ final int taskEndIndex = endIndex[i];
+ final boolean storeRankScorer = params.outputAdditionalFeatures();
+ final int taskNum = i + 1;
+
+ // Defer ScoredSpectraMap construction to the worker so the
+ // per-task spectrum heap isn't queued up front.
+ ConcurrentMSGFPlus.RunMSGFPlus msgfplusExecutor = new ConcurrentMSGFPlus.RunMSGFPlus(
+ () -> {
+ ScoredSpectraMap specScanner = new ScoredSpectraMap(
+ specAcc,
+ specKeyList.subList(taskStartIndex, taskEndIndex),
+ leftPrecursorMassTolerance,
+ rightPrecursorMassTolerance,
+ minIsotopeError,
+ maxIsotopeError,
+ specDataType,
+ storeRankScorer,
+ false,
+ precursorMassShiftPpm
+ );
+ if (doNotUseEdgeScore)
+ specScanner.turnOffEdgeScoring();
+ return specScanner;
+ },
+ sa,
+ params,
+ taskNum
+ );
+
+ submittedTasks.add(msgfplusExecutor);
+
+ if (DISABLE_THREADING) {
+ msgfplusExecutor.run();
+ } else if (useForkJoin) {
+ fjpFutures.add(fjp.submit(msgfplusExecutor));
+ } else {
+ executor.execute(msgfplusExecutor);
+ }
+
+ }
+
+ if (useForkJoin) {
+ fjp.shutdown();
+ try {
+ fjp.awaitTermination(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
+ } catch (InterruptedException e) {
+ Thread.currentThread().interrupt();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, e.getMessage(), e);
+ }
+ for (Future<?> f : fjpFutures) {
+ try { f.get(); }
+ catch (java.util.concurrent.ExecutionException ex) {
+ Throwable cause = ex.getCause();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, cause.getMessage(), cause);
+ fjp.shutdownNow();
+ return "Search failed: " + cause.getMessage();
+ }
+ catch (InterruptedException ex) { Thread.currentThread().interrupt(); }
+ }
+ } else {
+ executor.outputProgressReport();
+ executor.shutdown();
+ try {
+ executor.awaitTerminationWithExceptions(Long.MAX_VALUE, TimeUnit.NANOSECONDS);
+ } catch (InterruptedException e) {
+ if (!executor.HasThrownData()) {
+ e.printStackTrace();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, e.getMessage(), e);
+ }
+ }
+ executor.outputProgressReport();
+ }
+
+ // awaitTermination above establishes happens-before on every
+ // task's writes (JLS §17.4.5), so the per-task ArrayLists can
+ // be drained single-threaded with no synchronization.
+ int totalResults = 0;
+ for (ConcurrentMSGFPlus.RunMSGFPlus t : submittedTasks) {
+ totalResults += t.getResultCount();
+ }
+ resultList = new ArrayList<>(totalResults);
+ for (ConcurrentMSGFPlus.RunMSGFPlus t : submittedTasks) {
+ t.drainResultsTo(resultList);
+ }
+
+ if (numTasks > 1) {
+ printTaskWallSummary(submittedTasks);
+ }
+ submittedTasks.clear();
+
+ } catch (OutOfMemoryError ex) {
+ ex.printStackTrace();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
+ shutdownPoolNow(executor, fjp);
+ int taskMult = numTasks / numThreads;
+ return "Task terminated; results incomplete. Please run again with a greater amount of memory, using \"-Xmx4G\", for example.\n" +
+ "\tYou can also use less memory by increasing the number of tasks used for the search, at the cost of more time.\n" +
+ "\tTry doubling the number used for this search with \"-tasks -" + (taskMult * 2) + "\" or \"-tasks " + (numTasks * 2) + "\".";
+ } catch (Exception ex) {
+ ex.printStackTrace();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
+ shutdownPoolNow(executor, fjp);
+ return "Task terminated; results incomplete. Please run again.";
+ } catch (Throwable ex) {
+ ex.printStackTrace();
+ Logger.getLogger(MSGFPlus.class.getName()).log(Level.SEVERE, null, ex);
+ shutdownPoolNow(executor, fjp);
+ return "Task terminated; results incomplete. Please run again.";
+ }
+
+ long qValueStartTime = System.currentTimeMillis();
+
+ if (params.useTDA()) {
+ // Compute Q-values
+ System.out.println("Computing q-values...");
+ ComputeFDR.addQValues(resultList, sa, false, decoyProteinPrefix);
+ System.out.print("Computing q-values finished ");
+ System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - qValueStartTime) / 1000);
+ }
+
+ // Sort by spectral E-values then write to disk
+
+ long saveResultsStartTime = System.currentTimeMillis();
+
+ System.out.println("Writing results...");
+ Collections.sort(resultList);
+
+ if (params.writeTsv()) {
+ DirectTSVWriter tsvWriter = new DirectTSVWriter(params, aaSet, sa, specAcc, ioIndex);
+ try {
+ tsvWriter.writeResults(resultList, outputFile);
+ } catch (IOException e) {
+ return "Error writing TSV output: " + e.getMessage();
+ }
+ System.out.println("TSV file: " + outputFile.getPath());
+ }
+
+ if (!params.writeTsv()) {
+ DirectPinWriter pinWriter = new DirectPinWriter(params, aaSet, sa, specAcc, ioIndex);
+ try {
+ pinWriter.writeResults(resultList, outputFile);
+ } catch (IOException e) {
+ return "Error writing pin output: " + e.getMessage();
+ }
+ System.out.println("PIN file: " + outputFile.getPath());
+ }
+
+ System.out.print("Writing results finished ");
+ System.out.format("(elapsed time: %.2f sec)\n", (float) (System.currentTimeMillis() - saveResultsStartTime) / 1000);
+ return null;
+ }
+
+ private static void shutdownPoolNow(ThreadPoolExecutorWithExceptions executor, ForkJoinPool fjp) {
+ if (executor != null) executor.shutdownNow();
+ else if (fjp != null) fjp.shutdownNow();
+ }
+
+ /**
+ * One-line wall-time summary across completed tasks. tail_gap (max -
+ * median) is the load-balance signal; high values point at uneven
+ * SpecKey distribution and motivate raising the {@code -tasks -N} multiplier.
+ */
+ private static void printTaskWallSummary(List<ConcurrentMSGFPlus.RunMSGFPlus> tasks) {
+ List<Long> walls = new ArrayList<>(tasks.size());
+ for (ConcurrentMSGFPlus.RunMSGFPlus t : tasks) {
+ ConcurrentMSGFPlus.TaskWallStats s = t.getWallStats();
+ if (s != null) walls.add(s.totalMs());
+ }
+ if (walls.isEmpty()) return;
+ Collections.sort(walls);
+ long min = walls.get(0);
+ long max = walls.get(walls.size() - 1);
+ long median = walls.get(walls.size() / 2);
+ long p95 = walls.get(Math.min(walls.size() - 1, (int) Math.ceil(walls.size() * 0.95) - 1));
+ long sum = 0L;
+ for (long w : walls) sum += w;
+ System.out.format(
+ "Task wall summary (n=%d): min=%.1fs median=%.1fs p95=%.1fs max=%.1fs total=%.1fs tail_gap=%.1fs (%.0f%% of median)%n",
+ walls.size(), min / 1000.0, median / 1000.0, p95 / 1000.0, max / 1000.0,
+ sum / 1000.0, (max - median) / 1000.0,
+ median > 0 ? 100.0 * (max - median) / median : 0.0);
+ }
+}
diff --git a/src/main/java/edu/ucsd/msjava/cli/MSGFPlusOptions.java b/src/main/java/edu/ucsd/msjava/cli/MSGFPlusOptions.java
new file mode 100644
index 00000000..e02fe1d6
--- /dev/null
+++ b/src/main/java/edu/ucsd/msjava/cli/MSGFPlusOptions.java
@@ -0,0 +1,512 @@
+package edu.ucsd.msjava.cli;
+
+import edu.ucsd.msjava.msdbsearch.SearchParams.PrecursorCalMode;
+import edu.ucsd.msjava.msutil.ActivationMethod;
+import edu.ucsd.msjava.msutil.Enzyme;
+import edu.ucsd.msjava.msutil.InstrumentType;
+import edu.ucsd.msjava.msutil.Protocol;
+import picocli.CommandLine;
+import picocli.CommandLine.Command;
+import picocli.CommandLine.Option;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.List;
+
+/**
+ * Typed command-line options for MS-GF+. Picocli reads {@code argv} into
+ * the {@code @Option}-annotated fields below; {@link #applyConfigFile}
+ * fills in any field the CLI did not set from a {@code -conf} file
+ * (CLI takes precedence). {@link #validate} enforces required-input
+ * and numeric/enum range invariants. Each {@code effectiveXxx()} accessor
+ * returns the user-supplied value or the legacy default.
+ *
+ * Flag inventory: see {@code .claude/plans/parameter-modernization-flag-inventory.md}.
+ */
+@Command(
+ name = "MS-GF+",
+ mixinStandardHelpOptions = true,
+ sortOptions = false,
+ description = "MS-GF+: peptide identification by database search of mass spectra.")
+public final class MSGFPlusOptions {
+
+ /** Build a {@link CommandLine} configured for MS-GF+: enums match
+ * case-insensitively (so {@code -outputFormat pin} and {@code -outputFormat PIN}
+ * both work) and the parser uses the standard MS-GF+ usage layout. */
+ public static CommandLine commandLine(MSGFPlusOptions opts) {
+ return new CommandLine(opts).setCaseInsensitiveEnumValuesAllowed(true);
+ }
+
+ // ---------- input (required at runtime, but may be provided via -conf) ----------
+
+ @Option(names = "-s", paramLabel = "SpectrumFile",
+ description = "Input spectrum file (*.mzML, *.mgf) or directory of spectra. "
+ + "Required, unless provided via -conf as SpectrumFile=...")
+ public File spectrumFile;
+
+ @Option(names = "-d", paramLabel = "DatabaseFile",
+ description = "Database file (*.fasta, *.fa, *.faa). "
+ + "Required, unless provided via -conf as DatabaseFile=...")
+ public File databaseFile;
+
+ // ---------- optional config + output ----------
+
+ @Option(names = "-conf", paramLabel = "ConfigFile",
+ description = "Configuration file path; CLI flags override config file values")
+ public File configFile;
+
+ @Option(names = "-o", paramLabel = "OutputFile",
+ description = "Output file (*.pin or *.tsv); Default: .pin")
+ public File outputFile;
+
+ @Option(names = "-decoy", paramLabel = "Prefix",
+ description = "Decoy protein prefix; Default: XXX")
+ public String decoyPrefix;
+
+ // ---------- precursor mass tolerance ----------
+
+ @Option(names = "-t", paramLabel = "Tolerance",
+ converter = PrecursorTolerance.Converter.class,
+ description = "Precursor mass tolerance, e.g. 20ppm or 0.5Da or 0.5Da,2.5Da; Default: 20ppm. " +
+ "Asymmetric form sets left tolerance (ObsMass < TheoMass) and right tolerance (ObsMass > TheoMass).")
+ public PrecursorTolerance precursorTolerance;
+
+ @Option(names = "-u", paramLabel = "Units", hidden = true,
+ description = "Tolerance units (legacy): 0=Da, 1=ppm, 2=as written in -t (Default: 2)")
+ public Integer precursorToleranceUnits;
+
+ @Option(names = "-ti", paramLabel = "Range",
+ converter = IntRange.Converter.class,
+ description = "Isotope-error range, e.g. -1,2 (both inclusive); Default: 0,1")
+ public IntRange isotopeErrorRange;
+
+ // ---------- threading / parallelism ----------
+
+ @Option(names = "-thread", paramLabel = "N",
+ description = "Number of worker threads; Default: number of available cores")
+ public Integer numThreads;
+
+ @Option(names = "-tasks", paramLabel = "N",
+ description = "Number of tasks: 0=auto, >0=fixed count, <0=|N| tasks per thread (|N|*threads); Default: 0")
+ public Integer numTasks;
+
+ @Option(names = "-minSpectraPerThread", paramLabel = "N",
+ description = "Minimum spectra per thread/task; Default: 250")
+ public Integer minSpectraPerThread;
+
+ @Option(names = "-verbose", paramLabel = "N",
+ description = "Verbosity: 0=total progress only (Default), 1=per-thread")
+ public Integer verbose;
+
+ // ---------- target/decoy + scoring shape ----------
+
+ @Option(names = "-tda", paramLabel = "N",
+ description = "Target-decoy strategy: 0=off (Default), 1=concatenated decoy search")
+ public Integer tdaStrategy;
+
+ @Option(names = "-m", paramLabel = "ID",
+ description = "Fragmentation method ID: 0=as written in the spectrum, or CID if unspecified (Default), 1=CID, 2=ETD, 3=HCD, 4=UVPD")
+ public Integer fragMethodId;
+
+ @Option(names = "-inst", paramLabel = "ID",
+ description = "Instrument type ID; default depends on registry")
+ public Integer instrumentTypeId;
+
+ @Option(names = "-e", paramLabel = "ID",
+ description = "Enzyme ID; default depends on registry")
+ public Integer enzymeId;
+
+ @Option(names = "-protocol", paramLabel = "ID",
+ description = "Protocol ID; default depends on registry")
+ public Integer protocolId;
+
+ @Option(names = "-ntt", paramLabel = "N",
+ description = "Number of tolerable termini (0..2); Default: 2 (fully tryptic)")
+ public Integer numTolerableTermini;
+
+ // ---------- modifications ----------
+
+ @Option(names = "-mod", paramLabel = "ModFile",
+ description = "Modification file (also accepts StaticMod=, DynamicMod=, CustomAA= entries via -conf)")
+ public File modificationFile;
+
+ // ---------- peptide / charge bounds ----------
+
+ @Option(names = "-minLength", paramLabel = "N",
+ description = "Minimum peptide length; Default: 6")
+ public Integer minPeptideLength;
+
+ @Option(names = "-maxLength", paramLabel = "N",
+ description = "Maximum peptide length; Default: 40")
+ public Integer maxPeptideLength;
+
+ @Option(names = "-minCharge", paramLabel = "N",
+ description = "Minimum precursor charge; Default: 2")
+ public Integer minCharge;
+
+ @Option(names = "-maxCharge", paramLabel = "N",
+ description = "Maximum precursor charge; Default: 3")
+ public Integer maxCharge;
+
+ @Option(names = "-n", paramLabel = "N",
+ description = "Number of matches reported per spectrum; Default: 1")
+ public Integer numMatchesPerSpec;
+
+ // ---------- output / features / calibration ----------
+
+ @Option(names = "-addFeatures", paramLabel = "N",
+ description = "Include extra features for Percolator: 0=basic (Default), 1=+features")
+ public Integer addFeatures;
+
+ @Option(names = "-outputFormat", paramLabel = "Format",
+ description = "Output format: pin (Default) or tsv")
+ public OutputFormat outputFormat;
+
+ @Option(names = "-precursorCal", paramLabel = "Mode",
+ description = "Precursor calibration mode: auto (Default), on, off")
+ public PrecursorCalMode precursorCalMode;
+
+ @Option(names = "-ccm", paramLabel = "Mass",
+ description = "Charge carrier mass; Default: 1.00727649 (proton)")
+ public Double chargeCarrierMass;
+
+ @Option(names = "-maxMissedCleavages", paramLabel = "N",
+ description = "Max missed cleavages per peptide; -1 = unlimited (Default)")
+ public Integer maxMissedCleavages;
+
+ @Option(names = "-numMods", paramLabel = "N",
+ description = "Max dynamic mods per peptide; Default: 3")
+ public Integer maxNumMods;
+
+ @Option(names = "-allowDenseCentroidedPeaks", paramLabel = "N",
+ description = "Allow centroid scans with dense peaks: 0=skip (Default), 1=allow")
+ public Integer allowDenseCentroidedPeaks;
+
+ @Option(names = "-msLevel", paramLabel = "Range",
+ converter = IntRange.Converter.class,
+ description = "MS level or range, e.g. 2 or 2,3; Default: 2,2")
+ public IntRange msLevel;
+
+ // ---------- hidden flags ----------
+
+ @Option(names = "-dd", paramLabel = "Dir", hidden = true,
+ description = "Database index directory")
+ public File dbIndexDir;
+
+ @Option(names = "-index", paramLabel = "Range", hidden = true,
+ converter = IntRange.Converter.class,
+ description = "Spectrum index range, e.g. 1,1000 (both inclusive)")
+ public IntRange specIndexRange;
+
+ @Option(names = "-edgeScore", paramLabel = "N", hidden = true,
+ description = "Edge scoring: 0=use (Default), 1=skip")
+ public Integer edgeScore;
+
+ @Option(names = "-minNumPeaks", paramLabel = "N", hidden = true,
+ description = "Minimum number of peaks per spectrum")
+ public Integer minNumPeaks;
+
+ @Option(names = "-iso", paramLabel = "N", hidden = true,
+ description = "Number of isoforms to consider per peptide")
+ public Integer numIsoforms;
+
+ @Option(names = "-ignoreMetCleavage", paramLabel = "N", hidden = true,
+ description = "Ignore N-terminal Met cleavage: 0=consider (Default), 1=ignore")
+ public Integer ignoreMetCleavage;
+
+ @Option(names = "-minDeNovoScore", paramLabel = "N", hidden = true,
+ description = "Minimum de novo score")
+ public Integer minDeNovoScore;
+
+ // ---------- config-file-only entries (populated by applyConfigFile) ----------
+
+ /** {@code DynamicMod=...} entries from the config file (or {@code -mod} file). */
+ public final List<String> dynamicMods = new ArrayList<>();
+ /** {@code StaticMod=...} entries from the config file (or {@code -mod} file). */
+ public final List<String> staticMods = new ArrayList<>();
+ /** {@code CustomAA=...} entries from the config file (or {@code -mod} file). */
+ public final List<String> customAAs = new ArrayList<>();
+
+ /** Set when {@link #applyConfigFile(File)} encounters {@code MaxNumModsPerPeptide=}
+ * via the legacy alias path; allows the config-file value to feed the
+ * {@link #effectiveMaxNumMods()} default. */
+ private Integer configMaxNumMods;
+
+ // ---------- effective-value resolvers (CLI value, else config-file value, else default) ----------
+
+ public int effectiveMinPeptideLength() { return minPeptideLength != null ? minPeptideLength : 6; }
+ public int effectiveMaxPeptideLength() { return maxPeptideLength != null ? maxPeptideLength : 40; }
+ public int effectiveMinCharge() { return minCharge != null ? minCharge : 2; }
+ public int effectiveMaxCharge() { return maxCharge != null ? maxCharge : 3; }
+ public int effectiveMinSpectraPerThread() { return minSpectraPerThread != null ? minSpectraPerThread : 250; }
+ public int effectiveVerbose() { return verbose != null ? verbose : 0; }
+ public int effectiveTdaStrategy() { return tdaStrategy != null ? tdaStrategy : 0; }
+ public int effectiveMaxNumMods() { return maxNumMods != null ? maxNumMods : (configMaxNumMods != null ? configMaxNumMods : 3); }
+ public OutputFormat effectiveOutputFormat() { return outputFormat != null ? outputFormat : OutputFormat.PIN; }
+
+ /** Resolves {@code -m} index to {@link ActivationMethod}. MSGFPlus exposes
+ * 0=ASWRITTEN, 1=CID, 2=ETD, 3=HCD, 4=UVPD. The registry also defines
+ * FUSION (merge-mode synthetic method) and PQD, but neither is exposed
+ * as a user-selectable index by MSGFPlus -- FUSION was hidden by the
+ * legacy {@code addFragMethodParam(..., doNotAddMergeMode=true)}, which
+ * shifted UVPD from registry slot 5 down to user-facing index 4. */
+ public ActivationMethod effectiveActivationMethod() {
+ int idx = fragMethodId != null ? fragMethodId : 0;
+ switch (idx) {
+ case 0: return ActivationMethod.ASWRITTEN;
+ case 1: return ActivationMethod.CID;
+ case 2: return ActivationMethod.ETD;
+ case 3: return ActivationMethod.HCD;
+ case 4: return ActivationMethod.UVPD;
+ default: throw new IllegalArgumentException("invalid -m index: " + idx);
+ }
+ }
+
+ public InstrumentType effectiveInstrumentType() {
+ InstrumentType[] all = InstrumentType.getAllRegisteredInstrumentTypes();
+ int idx = instrumentTypeId != null ? instrumentTypeId : 0;
+ if (idx < 0 || idx >= all.length) throw new IllegalArgumentException("invalid -inst index: " + idx);
+ return all[idx];
+ }
+
+ public Enzyme effectiveEnzyme() {
+ Enzyme[] all = Enzyme.getAllRegisteredEnzymes();
+ // TRYPSIN is registered at index 1 (UnspecificCleavage at 0). See Enzyme static init.
+ int idx = enzymeId != null ? enzymeId : 1;
+ if (idx < 0 || idx >= all.length) throw new IllegalArgumentException("invalid -e index: " + idx);
+ return all[idx];
+ }
+
+ public Protocol effectiveProtocol() {
+ Protocol[] all = Protocol.getAllRegisteredProtocols();
+ int idx = protocolId != null ? protocolId : 0;
+ if (idx < 0 || idx >= all.length) throw new IllegalArgumentException("invalid -protocol index: " + idx);
+ return all[idx];
+ }
+
+ // ---------- config-file overlay ----------
+
+ /**
+ * Read {@code -conf} config file and populate any fields the CLI did not
+ * already set. Recognizes legacy aliases (IsotopeError → IsotopeErrorRange,
+ * etc.) and collects repeated {@code DynamicMod=}, {@code StaticMod=},
+ * {@code CustomAA=} entries.
+ *
+ * @return null on success, error string otherwise.
+ */
+ public String applyConfigFile(File file) {
+ unrecognizedConfigEntries = 0;
+ try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
+ String line;
+ int lineNum = 0;
+ while ((line = reader.readLine()) != null) {
+ lineNum++;
+ String trimmed = stripComment(line);
+ if (trimmed.isEmpty()) continue;
+ int eq = trimmed.indexOf('=');
+ if (eq <= 0) continue;
+ String rawKey = trimmed.substring(0, eq).trim();
+ String value = trimmed.substring(eq + 1).trim();
+ String key = canonicalConfigKey(rawKey);
+ String err = applyConfigEntry(key, value, file.getName());
+ if (err != null) {
+ return "Error parsing line " + lineNum + " of " + file.getName() + ": " + err;
+ }
+ }
+ } catch (IOException e) {
+ return "Error reading config file " + file.getPath() + ": " + e.getMessage();
+ }
+ if (unrecognizedConfigEntries > 0) {
+ System.out.println("Valid parameters are described in the example parameter file at " +
+ "https://github.com/MSGFPlus/msgfplus/blob/master/docs/examples/MSGFPlus_Params.txt");
+ }
+ return null;
+ }
+
+ /** Counter incremented inside {@link #applyConfigEntry} whenever an unknown
+ * config-file key is seen; surfaced via the end-of-file URL hint and
+ * reset at the start of each {@link #applyConfigFile} call. */
+ private int unrecognizedConfigEntries;
+
+ private String applyConfigEntry(String key, String value, String fileName) {
+ // Config-file matching is case-insensitive. canonicalConfigKey()
+ // already returns lowercase canonical names, so the switch labels
+ // are lowercase too. Repeated mod entries are matched first since
+ // they accumulate rather than overwrite.
+ switch (key) {
+ case "dynamicmod": if (!value.equalsIgnoreCase("none")) dynamicMods.add(value); return null;
+ case "staticmod": if (!value.equalsIgnoreCase("none")) staticMods.add(value); return null;
+ case "customaa": if (!value.equalsIgnoreCase("none")) customAAs.add(value); return null;
+ default: break;
+ }
+ // Single-valued entries: only fill in if CLI did not set the field.
+ try {
+ switch (key) {
+ case "spectrumfile": if (spectrumFile == null) spectrumFile = new File(value); return null;
+ case "databasefile": if (databaseFile == null) databaseFile = new File(value); return null;
+ case "outputfile": if (outputFile == null) outputFile = new File(value); return null;
+ case "modificationfilename":
+ case "modificationfile": if (modificationFile == null) modificationFile = new File(value); return null;
+ case "dbindexdir": if (dbIndexDir == null) dbIndexDir = new File(value); return null;
+ case "decoyprefix": if (decoyPrefix == null) decoyPrefix = value; return null;
+ case "precursormasstolerance": if (precursorTolerance == null) precursorTolerance = PrecursorTolerance.parse(value); return null;
+ case "precursormasstoleranceunits":if (precursorToleranceUnits == null) precursorToleranceUnits = Integer.parseInt(value); return null;
+ case "isotopeerrorrange": if (isotopeErrorRange == null) isotopeErrorRange = IntRange.parse(value); return null;
+ case "fragmentationmethodid": if (fragMethodId == null) fragMethodId = Integer.parseInt(value); return null;
+ case "instrumentid": if (instrumentTypeId == null) instrumentTypeId = Integer.parseInt(value); return null;
+ case "enzymeid": if (enzymeId == null) enzymeId = Integer.parseInt(value); return null;
+ case "protocolid": if (protocolId == null) protocolId = Integer.parseInt(value); return null;
+ case "ntt": if (numTolerableTermini == null) numTolerableTermini = Integer.parseInt(value); return null;
+ case "minpeplength": if (minPeptideLength == null) minPeptideLength = Integer.parseInt(value); return null;
+ case "maxpeplength": if (maxPeptideLength == null) maxPeptideLength = Integer.parseInt(value); return null;
+ case "mincharge": if (minCharge == null) minCharge = Integer.parseInt(value); return null;
+ case "maxcharge": if (maxCharge == null) maxCharge = Integer.parseInt(value); return null;
+ case "nummatchesperspec": if (numMatchesPerSpec == null) numMatchesPerSpec = Integer.parseInt(value); return null;
+ case "numthreads": if (numThreads == null && !value.equalsIgnoreCase("all"))
+ numThreads = Integer.parseInt(value); return null;
+ case "numtasks": if (numTasks == null) numTasks = Integer.parseInt(value); return null;
+ case "minspectraperthread": if (minSpectraPerThread == null) minSpectraPerThread = Integer.parseInt(value); return null;
+ case "verbose": if (verbose == null) verbose = Integer.parseInt(value); return null;
+ case "tda": if (tdaStrategy == null) tdaStrategy = Integer.parseInt(value); return null;
+ case "addfeatures": if (addFeatures == null) addFeatures = Integer.parseInt(value); return null;
+ case "outputformat": if (outputFormat == null) outputFormat = OutputFormat.valueOf(value.trim().toUpperCase(java.util.Locale.ROOT)); return null;
+ case "precursorcal": if (precursorCalMode == null) precursorCalMode = PrecursorCalMode.valueOf(value.trim().toUpperCase(java.util.Locale.ROOT)); return null;
+ case "chargecarriermass": if (chargeCarrierMass == null) chargeCarrierMass = Double.parseDouble(value); return null;
+ case "maxmissedcleavages": if (maxMissedCleavages == null) maxMissedCleavages = Integer.parseInt(value); return null;
+ case "nummods": if (maxNumMods == null) configMaxNumMods = Integer.parseInt(value); return null;
+ case "allowdensecentroidedpeaks": if (allowDenseCentroidedPeaks == null) allowDenseCentroidedPeaks = Integer.parseInt(value); return null;
+ case "mslevel": if (msLevel == null) msLevel = IntRange.parse(value); return null;
+ case "specindex": if (specIndexRange == null) specIndexRange = IntRange.parse(value); return null;
+ case "edgescore": if (edgeScore == null) edgeScore = Integer.parseInt(value); return null;
+ case "minnumpeaksperspectrum": if (minNumPeaks == null) minNumPeaks = Integer.parseInt(value); return null;
+ case "numisoforms": if (numIsoforms == null) numIsoforms = Integer.parseInt(value); return null;
+ case "ignoremetcleavage": if (ignoreMetCleavage == null) ignoreMetCleavage = Integer.parseInt(value); return null;
+ case "mindenovoscore": if (minDeNovoScore == null) minDeNovoScore = Integer.parseInt(value); return null;
+ default:
+ if (!key.startsWith("enzymedef")) {
+ System.out.println("Warning, unrecognized parameter '" + key + "=" + value + "' in config file " + fileName);
+ unrecognizedConfigEntries++;
+ }
+ return null;
+ }
+ } catch (IllegalArgumentException e) {
+ return "invalid value for '" + key + "': " + value + " (" + e.getMessage() + ")";
+ }
+ }
+
+ public static String stripComment(String line) {
+ int hash = line.indexOf('#');
+ return (hash >= 0 ? line.substring(0, hash) : line).trim();
+ }
+
+ /** Normalize legacy / alternate config-file keys to canonical form.
+ * Returns lowercase so {@link #applyConfigEntry} can match
+ * case-insensitively (the legacy {@code ParamManager.parseConfigParamFile}
+ * matched names with {@code equalsIgnoreCase}). Mirrors the alias
+ * rewrites previously in {@code ParamNameEnum.getParamNameFromLine}. */
+ private static String canonicalConfigKey(String key) {
+ String norm = key.toLowerCase(java.util.Locale.ROOT);
+ switch (norm) {
+ case "isotopeerror": return "isotopeerrorrange";
+ case "targetdecoyanalysis": return "tda";
+ case "fragmentationmethod": return "fragmentationmethodid";
+ case "instrument": return "instrumentid";
+ case "enzyme": return "enzymeid";
+ case "protocol": return "protocolid";
+ case "numtolerabletermini": return "ntt";
+ case "minnumpeaks": return "minnumpeaksperspectrum";
+ case "maxnummods": return "nummods";
+ case "maxnummodsperpeptide": return "nummods";
+ case "minlength": return "minpeplength";
+ case "minpeptidelength": return "minpeplength";
+ case "maxlength": return "maxpeplength";
+ case "maxpeptidelength": return "maxpeplength";
+ case "pmtolerance": return "precursormasstolerance";
+ case "parentmasstolerance": return "precursormasstolerance";
+ default: return norm;
+ }
+ }
+
+ /** Validates required-input invariants and the numeric/enum range
+ * constraints the legacy {@code IntParameter.minValue}/{@code maxValue}
+ * and {@code EnumParameter} machinery used to enforce. Returns
+ * {@code null} on success or a user-facing error string otherwise.
+ *
+ * Required: {@code -s} and {@code -d} (either via CLI or {@code -conf}).
+ * Numeric flags must satisfy their original lower bounds; enum-shaped
+ * flags must fall in their defined index range. */
+ public String validate() {
+ if (spectrumFile == null) return "Spectrum file is not defined; use -s at the command line or SpectrumFile in a config file";
+ if (databaseFile == null) return "Database file is not defined; use -d at the command line or DatabaseFile in a config file";
+ if (modificationFile != null && !modificationFile.exists()) {
+ return "Modification file not found: " + modificationFile.getPath();
+ }
+
+ String err;
+ if ((err = checkMin("-thread", numThreads, 1)) != null) return err;
+ if ((err = checkMin("-tasks", numTasks, -10)) != null) return err;
+ if ((err = checkMin("-minSpectraPerThread", minSpectraPerThread, 1)) != null) return err;
+ if ((err = checkMin("-minLength", minPeptideLength, 1)) != null) return err;
+ if ((err = checkMin("-maxLength", maxPeptideLength, 1)) != null) return err;
+ if ((err = checkMin("-minCharge", minCharge, 1)) != null) return err;
+ if ((err = checkMin("-maxCharge", maxCharge, 1)) != null) return err;
+ if ((err = checkMin("-n", numMatchesPerSpec, 1)) != null) return err;
+ if ((err = checkMin("-maxMissedCleavages", maxMissedCleavages, -1)) != null) return err;
+ if ((err = checkMin("-numMods", maxNumMods, 0)) != null) return err;
+ if ((err = checkMin("-minNumPeaks", minNumPeaks, 0)) != null) return err;
+ if ((err = checkMin("-iso", numIsoforms, 0)) != null) return err;
+ if ((err = checkMin("-minDeNovoScore", minDeNovoScore, Integer.MIN_VALUE)) != null) return err;
+
+ if ((err = checkRange("-ntt", numTolerableTermini, 0, 2)) != null) return err;
+ if ((err = checkRange("-tda", tdaStrategy, 0, 1)) != null) return err;
+ if ((err = checkRange("-verbose", verbose, 0, 1)) != null) return err;
+ if ((err = checkRange("-addFeatures", addFeatures, 0, 1)) != null) return err;
+ if ((err = checkRange("-allowDenseCentroidedPeaks", allowDenseCentroidedPeaks, 0, 1)) != null) return err;
+ if ((err = checkRange("-edgeScore", edgeScore, 0, 1)) != null) return err;
+ if ((err = checkRange("-ignoreMetCleavage", ignoreMetCleavage, 0, 1)) != null) return err;
+ if ((err = checkRange("-u", precursorToleranceUnits, 0, 2)) != null) return err;
+
+ if (chargeCarrierMass != null && chargeCarrierMass <= 0.1) {
+ return "Invalid value for parameter -ccm: " + chargeCarrierMass + " (must be > 0.1)";
+ }
+
+ if (fragMethodId != null && (fragMethodId < 0 || fragMethodId > 4)) {
+ return "Invalid value for parameter -m: " + fragMethodId + " (valid: 0..4)";
+ }
+ int instMax = InstrumentType.getAllRegisteredInstrumentTypes().length - 1;
+ if (instrumentTypeId != null && (instrumentTypeId < 0 || instrumentTypeId > instMax)) {
+ return "Invalid value for parameter -inst: " + instrumentTypeId + " (valid: 0.." + instMax + ")";
+ }
+ int enzMax = Enzyme.getAllRegisteredEnzymes().length - 1;
+ if (enzymeId != null && (enzymeId < 0 || enzymeId > enzMax)) {
+ return "Invalid value for parameter -e: " + enzymeId + " (valid: 0.." + enzMax + ")";
+ }
+ int protMax = Protocol.getAllRegisteredProtocols().length - 1;
+ if (protocolId != null && (protocolId < 0 || protocolId > protMax)) {
+ return "Invalid value for parameter -protocol: " + protocolId + " (valid: 0.." + protMax + ")";
+ }
+ return null;
+ }
+
+ private static String checkMin(String flag, Integer value, int min) {
+ if (value == null) return null;
+ if (value < min) return "Invalid value for parameter " + flag + ": " + value + " (must be >= " + min + ")";
+ return null;
+ }
+
+ private static String checkRange(String flag, Integer value, int min, int max) {
+ if (value == null) return null;
+ if (value < min || value > max) return "Invalid value for parameter " + flag + ": " + value + " (valid: " + min + ".." + max + ")";
+ return null;
+ }
+
+ /** Mutator used by {@code AminoAcidSet} when the parsed mod metadata
+ * changes the effective max-num-mods (the AA set is authoritative once
+ * loaded). Mirrors the legacy {@code ParamManager.setMaxNumMods}. */
+ public void setMaxNumModsFromMetadata(int n) {
+ this.maxNumMods = n;
+ }
+}
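Review note: the config-file path above does three things in sequence. It strips `#` comments, normalizes legacy keys to a lowercase canonical name, and fills a field only when the CLI left it unset. The sketch below is a minimal, self-contained illustration of that flow, not the code in this patch. The class name `ConfigKeySketch`, the helper `fillIfUnset`, and the three-entry alias table are hypothetical stand-ins.

```java
import java.util.Locale;
import java.util.Map;

public class ConfigKeySketch {
    // Hypothetical subset of the alias table; the patch keeps the full list in canonicalConfigKey().
    private static final Map<String, String> ALIASES = Map.of(
            "isotopeerror", "isotopeerrorrange",
            "targetdecoyanalysis", "tda",
            "pmtolerance", "precursormasstolerance");

    /** Drops everything from the first '#' onward, then trims. */
    public static String stripComment(String line) {
        int hash = line.indexOf('#');
        return (hash >= 0 ? line.substring(0, hash) : line).trim();
    }

    /** Lowercases the key (locale-independent) and maps known aliases to the canonical name. */
    public static String canonicalKey(String key) {
        String norm = key.toLowerCase(Locale.ROOT);
        return ALIASES.getOrDefault(norm, norm);
    }

    /** The CLI value wins; the config value only fills a gap. */
    public static Integer fillIfUnset(Integer cliValue, String configValue) {
        return cliValue != null ? cliValue : Integer.valueOf(Integer.parseInt(configValue));
    }

    public static void main(String[] args) {
        String line = "TargetDecoyAnalysis=1   # enable TDA";
        String[] kv = stripComment(line).split("=", 2);
        System.out.println(canonicalKey(kv[0].trim()) + " -> " + kv[1].trim());
        System.out.println("CLI set:   " + fillIfUnset(0, kv[1].trim()));
        System.out.println("CLI unset: " + fillIfUnset(null, kv[1].trim()));
    }
}
```

Using `Locale.ROOT` for the lowercase step keeps key matching stable on systems whose default locale has unusual case rules (for example Turkish dotless i), which is presumably why the patch does the same.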
diff --git a/src/main/java/edu/ucsd/msjava/cli/OutputFormat.java b/src/main/java/edu/ucsd/msjava/cli/OutputFormat.java
new file mode 100644
index 00000000..2e570882
--- /dev/null
+++ b/src/main/java/edu/ucsd/msjava/cli/OutputFormat.java
@@ -0,0 +1,17 @@
+package edu.ucsd.msjava.cli;
+
+/**
+ * Search output format selected by {@code -outputFormat}. Picocli matches
+ * incoming values case-insensitively when the command line is configured
+ * with {@code CommandLine.setCaseInsensitiveEnumValuesAllowed(true)}.
+ *
+ * <p>Numeric forms ({@code 0} / {@code 1}) accepted by older releases are
+ * intentionally not supported. Users on legacy invocations should switch
+ * to the named values.
+ */
+public enum OutputFormat {
+ /** Percolator {@code .pin} (default). */
+ PIN,
+ /** Tab-separated values, for direct inspection or downstream tools. */
+ TSV
+}
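Review note: on the config-file side, the patch parses this enum with `valueOf(value.trim().toUpperCase(Locale.ROOT))`, so named values match in any case while the legacy numeric forms fail. The sketch below illustrates that behavior under stated assumptions: `Format` is a stand-in for `OutputFormat`, and `OutputFormatSketch.parse` is a hypothetical helper that returns a message instead of throwing.

```java
import java.util.Locale;

public class OutputFormatSketch {
    /** Stand-in for the OutputFormat enum added in this patch. */
    enum Format { PIN, TSV }

    /** Returns the matched constant name, or an error message for unknown input. */
    public static String parse(String raw) {
        try {
            return Format.valueOf(raw.trim().toUpperCase(Locale.ROOT)).name();
        } catch (IllegalArgumentException e) {
            // Enum.valueOf throws IllegalArgumentException for names that are not constants,
            // which includes the legacy numeric forms "0" and "1".
            return "invalid value: " + raw;
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(" tsv "));
        System.out.println(parse("Pin"));
        System.out.println(parse("1"));
    }
}
```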
diff --git a/src/main/java/edu/ucsd/msjava/cli/PrecursorTolerance.java b/src/main/java/edu/ucsd/msjava/cli/PrecursorTolerance.java
new file mode 100644
index 00000000..b214ef01
--- /dev/null
+++ b/src/main/java/edu/ucsd/msjava/cli/PrecursorTolerance.java
@@ -0,0 +1,58 @@
+package edu.ucsd.msjava.cli;
+
+import edu.ucsd.msjava.msgf.Tolerance;
+import picocli.CommandLine.ITypeConverter;
+import picocli.CommandLine.TypeConversionException;
+
+/**
+ * Typed precursor mass tolerance: a left and a right
+ * {@link Tolerance}. Supports symmetric form ({@code "20ppm"}) and
+ * asymmetric form ({@code "0.5Da,2.5Da"}). Both sides must use the
+ * same unit and be non-negative.
+ */
+public record PrecursorTolerance(Tolerance left, Tolerance right) {
+
+ public PrecursorTolerance {
+ if (left == null || right == null) {
+ throw new IllegalArgumentException("left and right tolerances must be non-null");
+ }
+ if (left.isTolerancePPM() != right.isTolerancePPM()) {
+ throw new IllegalArgumentException("left and right tolerance units must be the same");
+ }
+ if (left.getValue() < 0 || right.getValue() < 0) {
+ throw new IllegalArgumentException("parent mass tolerance must not be negative");
+ }
+ }
+
+ public static PrecursorTolerance parse(String value) {
+ String[] tok = value.split(",");
+ Tolerance l, r;
+ if (tok.length == 1) {
+ l = r = Tolerance.parseToleranceStr(tok[0]);
+ } else if (tok.length == 2) {
+ l = Tolerance.parseToleranceStr(tok[0]);
+ r = Tolerance.parseToleranceStr(tok[1]);
+ } else {
+ throw new IllegalArgumentException("invalid tolerance value: " + value);
+ }
+ if (l == null || r == null) {
+ throw new IllegalArgumentException("invalid tolerance value: " + value);
+ }
+ return new PrecursorTolerance(l, r);
+ }
+
+ @Override public String toString() {
+ return left.equals(right) ? left.toString() : left + "," + right;
+ }
+
+ /** picocli {@link ITypeConverter} that wraps {@link #parse(String)}. */
+ public static final class Converter implements ITypeConverter<PrecursorTolerance> {
+ @Override public PrecursorTolerance convert(String value) {
+ try {
+ return parse(value);
+ } catch (IllegalArgumentException e) {
+ throw new TypeConversionException(e.getMessage());
+ }
+ }
+ }
+}
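Review note: the record above accepts a symmetric form (`20ppm`) and an asymmetric form (`0.5Da,2.5Da`), and rejects mixed units or negative values. The sketch below reproduces those rules in isolation. It is an illustration only: `Tol` is a hypothetical, much simplified stand-in for `edu.ucsd.msjava.msgf.Tolerance`, whose real parser may accept other spellings, and `Window` and `accepts` are likewise invented names.

```java
import java.util.Locale;

public class ToleranceSketch {
    /** Hypothetical stand-in for Tolerance: a value plus a ppm/Da flag. */
    record Tol(double value, boolean ppm) {
        static Tol parse(String s) {
            String t = s.trim().toLowerCase(Locale.ROOT);
            if (t.endsWith("ppm")) return new Tol(Double.parseDouble(t.substring(0, t.length() - 3)), true);
            if (t.endsWith("da")) return new Tol(Double.parseDouble(t.substring(0, t.length() - 2)), false);
            throw new IllegalArgumentException("missing unit: " + s);
        }
    }

    /** Left/right window with the same invariants as the record in the patch. */
    record Window(Tol left, Tol right) {
        Window {
            if (left.ppm() != right.ppm()) throw new IllegalArgumentException("units must match");
            if (left.value() < 0 || right.value() < 0) throw new IllegalArgumentException("must not be negative");
        }

        static Window parse(String value) {
            String[] tok = value.split(",");
            if (tok.length == 1) {
                Tol t = Tol.parse(tok[0]);
                return new Window(t, t);           // symmetric: same tolerance on both sides
            }
            if (tok.length == 2) {
                return new Window(Tol.parse(tok[0]), Tol.parse(tok[1]));
            }
            throw new IllegalArgumentException("invalid tolerance value: " + value);
        }
    }

    /** True when the string parses and passes the invariants. */
    public static boolean accepts(String value) {
        try {
            Window.parse(value);
            return true;
        } catch (IllegalArgumentException e) {     // also covers NumberFormatException
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"20ppm", "0.5Da,2.5Da", "0.5Da,10ppm", "-5ppm", "1Da,2Da,3Da"}) {
            System.out.println(s + " -> " + accepts(s));
        }
    }
}
```

Putting the checks in the compact constructor, as the patch does, means no caller can build an invalid window, whichever parsing path produced the two sides.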
diff --git a/src/main/java/edu/ucsd/msjava/fdr/ComputeFDR.java b/src/main/java/edu/ucsd/msjava/fdr/ComputeFDR.java
index 15004902..72a5257f 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/ComputeFDR.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/ComputeFDR.java
@@ -165,8 +165,6 @@ else if (!decoyFile.isFile())
if (targetFile == null)
printUsageAndExit("Target is missing!");
-// if(specFileCol < 0)
-// printUsageAndExit("specFileCol is missing or invalid!");
if (scoreCol < 0)
printUsageAndExit("scoreCol is missing or invalid!");
if (pepCol < 0)
diff --git a/src/main/java/edu/ucsd/msjava/fdr/ComputeQValue.java b/src/main/java/edu/ucsd/msjava/fdr/ComputeQValue.java
index 161cf80a..d136b894 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/ComputeQValue.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/ComputeQValue.java
@@ -1,7 +1,7 @@
package edu.ucsd.msjava.fdr;
-import edu.ucsd.msjava.parser.BufferedLineReader;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.mgf.BufferedLineReader;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.File;
import java.util.ArrayList;
diff --git a/src/main/java/edu/ucsd/msjava/fdr/MSGFPlusPSMSet.java b/src/main/java/edu/ucsd/msjava/fdr/MSGFPlusPSMSet.java
index b9b6434e..31b5469d 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/MSGFPlusPSMSet.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/MSGFPlusPSMSet.java
@@ -3,7 +3,7 @@
import edu.ucsd.msjava.msdbsearch.CompactSuffixArray;
import edu.ucsd.msjava.msdbsearch.DatabaseMatch;
import edu.ucsd.msjava.msdbsearch.MSGFPlusMatch;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.util.ArrayList;
import java.util.HashMap;
diff --git a/src/main/java/edu/ucsd/msjava/fdr/PSMSet.java b/src/main/java/edu/ucsd/msjava/fdr/PSMSet.java
index fd8f4397..15a553f2 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/PSMSet.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/PSMSet.java
@@ -58,7 +58,5 @@ public ArrayList getPepScores() {
return pepScores;
}
- //
-// public abstract void writeResults(TargetDecoyAnalysis tda, PrintStream out, float fdrThreshold, float pepFDRThreshold, float scoreThreshold);
public abstract void read();
}
diff --git a/src/main/java/edu/ucsd/msjava/fdr/Pair.java b/src/main/java/edu/ucsd/msjava/fdr/Pair.java
index b1e14bb1..fd179bd7 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/Pair.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/Pair.java
@@ -2,40 +2,18 @@
import java.util.Comparator;
-/**
- * This class represents a pair of two objects.
- *
- * @param <A> the first object
- * @param <B> the second object
- * @author sangtaekim
- */
+/** Generic ordered pair. */
public class Pair<A, B> {
- /**
- * The first.
- */
private A first;
-
- /**
- * The second.
- */
private B second;
- /**
- * Instantiates a new pair.
- *
- * @param first the first
- * @param second the second
- */
public Pair(A first, B second) {
super();
this.first = first;
this.second = second;
}
- /* (non-Javadoc)
- * @see java.lang.Object#hashCode()
- */
public int hashCode() {
int hashFirst = first != null ? first.hashCode() : 0;
int hashSecond = second != null ? second.hashCode() : 0;
@@ -43,9 +21,6 @@ public int hashCode() {
return (hashFirst + hashSecond) * hashSecond + hashFirst;
}
- /* (non-Javadoc)
- * @see java.lang.Object#equals(java.lang.Object)
- */
public boolean equals(Object other) {
if (other instanceof Pair<?, ?>) {
Pair<?, ?> otherPair = (Pair<?, ?>) other;
@@ -61,45 +36,22 @@ public boolean equals(Object other) {
return false;
}
- /* (non-Javadoc)
- * @see java.lang.Object#toString()
- */
public String toString() {
return "(" + first + ", " + second + ")";
}
- /**
- * Gets the first.
- *
- * @return the first
- */
public A getFirst() {
return first;
}
- /**
- * Sets the first.
- *
- * @param first the new first
- */
public void setFirst(A first) {
this.first = first;
}
- /**
- * Gets the second.
- *
- * @return the second
- */
public B getSecond() {
return second;
}
- /**
- * Sets the second.
- *
- * @param second the new second
- */
public void setSecond(B second) {
this.second = second;
}
@@ -115,13 +67,6 @@ public PairComparator(boolean useSecondForComprison) {
this.useSecondForComprison = useSecondForComprison;
}
- /**
- * Determines the order of Pair objects. If useSecondForComparison is set, use B for comparison, otherwise A is used.
- *
- * @param p1 the first element.
- * @param p2 the second element.
- * @return 1 if p1 > p2, -1 if p2 > p1 and 0 otherwise.
- */
public int compare(Pair p1, Pair p2) {
if (!useSecondForComprison)
return p1.getFirst().compareTo(p2.getFirst());
diff --git a/src/main/java/edu/ucsd/msjava/fdr/ScoredString.java b/src/main/java/edu/ucsd/msjava/fdr/ScoredString.java
index dca5ce14..06bc6636 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/ScoredString.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/ScoredString.java
@@ -9,27 +9,12 @@
***************************************************************************/
package edu.ucsd.msjava.fdr;
-/**
- * The Class ScoredString.
- */
public class ScoredString extends Pair<String, Float> implements Comparable<Pair<String, Float>> {
- /**
- * Instantiates a new scored string.
- *
- * @param peptide the peptide
- * @param score the score
- */
public ScoredString(String peptide, Float score) {
super(peptide, score);
}
- /**
- * Instantiates a new scored string, using an integer score.
- *
- * @param score
- * @param peptide
- */
public ScoredString(String peptide, int score) {
super(peptide, (float) score);
}
@@ -42,20 +27,10 @@ public int compareTo(Pair o) {
return getFirst().compareTo(o.getFirst());
}
- /**
- * Gets the str.
- *
- * @return the str
- */
public String getStr() {
return super.getFirst();
}
- /**
- * Gets the score.
- *
- * @return the score
- */
public float getScore() {
return super.getSecond();
}
diff --git a/src/main/java/edu/ucsd/msjava/fdr/TSVPSMSet.java b/src/main/java/edu/ucsd/msjava/fdr/TSVPSMSet.java
index 5a0a4eaf..f0454945 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/TSVPSMSet.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/TSVPSMSet.java
@@ -1,6 +1,6 @@
package edu.ucsd.msjava.fdr;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.*;
import java.util.ArrayList;
@@ -224,29 +224,8 @@ public static String getPeptideFromAnnotation(String annotation) {
else
pep = annotation;
- // if there are flanking amino acids (e.g. R.ACDEFK.G), remove them
-// int firstDotIndex = annotation.indexOf('.');
-// int lastDotIndex = annotation.lastIndexOf('.');
-// if(firstDotIndex < lastDotIndex)
-// pep = annotation.substring(firstDotIndex+1, lastDotIndex);
-// else
-// pep = annotation;
pep = pep.toUpperCase();
return pep;
}
- public static void main(String argv[]) throws Exception {
- File file = new File("/home/sangtaekim/Research/ToolDistribution/Test/inspect.out");
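- ArrayList<Pair<Integer, ArrayList<String>>> reqStrList = new ArrayList<Pair<Integer, ArrayList<String>>>();
- ArrayList<String> charges = new ArrayList<String>();
- charges.add("1");
- charges.add("3");
- ArrayList<String> peps = new ArrayList<String>();
- peps.add("EE");
- reqStrList.add(new Pair<Integer, ArrayList<String>>(2, peps));
- reqStrList.add(new Pair<Integer, ArrayList<String>>(4, charges));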
- TSVPSMSet psmSet = new TSVPSMSet(file, "\t", true, 14, true, 0, 1, 2, reqStrList);
- psmSet.read();
- psmSet.printPeptideScoreTable();
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/fdr/TargetDecoyAnalysis.java b/src/main/java/edu/ucsd/msjava/fdr/TargetDecoyAnalysis.java
index ec8abc11..87142d59 100644
--- a/src/main/java/edu/ucsd/msjava/fdr/TargetDecoyAnalysis.java
+++ b/src/main/java/edu/ucsd/msjava/fdr/TargetDecoyAnalysis.java
@@ -28,47 +28,6 @@ public TargetDecoyAnalysis(PSMSet target, PSMSet decoy, float pit) {
pepLevelFDRMap = getFDRMap(target.getPepScores(), decoy.getPepScores(), isGreaterBetter, pit);
}
-// public TargetDecoyPSMSet(
-// File concatenatedFile,
-// String delimiter,
-// boolean hasHeader,
-// int scoreCol,
-// boolean isGreaterBetter,
-// int specFileCol,
-// int specIndexCol,
-// int pepCol,
-// ArrayList<Pair<Integer, ArrayList<String>>> reqStrList,
-// int dbCol, String decoyPrefix)
-// {
-// target = new TSVPSMSet(concatenatedFile, delimiter, hasHeader, scoreCol, isGreaterBetter, specFileCol, specIndexCol, pepCol, reqStrList).decoy(dbCol, decoyPrefix, true).read();
-// decoy = new TSVPSMSet(concatenatedFile, delimiter, hasHeader, scoreCol, isGreaterBetter, specFileCol, specIndexCol, pepCol, reqStrList).decoy(dbCol, decoyPrefix, false).read();
-// this.isGreaterBetter = isGreaterBetter;
-// isConcatenated = true;
-// psmLevelFDRMap = getFDRMap(target.getPSMScores(), decoy.getPSMScores(), isGreaterBetter, isConcatenated, 1);
-// pepLevelFDRMap = getFDRMap(target.getPepScores(), decoy.getPepScores(), isGreaterBetter, isConcatenated, 1);
-// }
-//
-// public TargetDecoyPSMSet(
-// File targetFile,
-// File decoyFile,
-// String delimiter,
-// boolean hasHeader,
-// int scoreCol,
-// boolean isGreaterBetter,
-// int specFileCol,
-// int specIndexCol,
-// int pepCol,
-// ArrayList<Pair<Integer, ArrayList<String>>> reqStrListPSMSet,
-// float pit
-// )
-// {
-// target = new TSVPSMSet(targetFile, delimiter, hasHeader, scoreCol, isGreaterBetter, specFileCol, specIndexCol, pepCol, reqStrListPSMSet).read();
-// decoy = new TSVPSMSet(decoyFile, delimiter, hasHeader, scoreCol, isGreaterBetter, specFileCol, specIndexCol, pepCol, reqStrListPSMSet).read();
-// isConcatenated = false;
-// psmLevelFDRMap = getFDRMap(target.getPSMScores(), decoy.getPSMScores(), isGreaterBetter, isConcatenated, pit);
-// pepLevelFDRMap = getFDRMap(target.getPepScores(), decoy.getPepScores(), isGreaterBetter, isConcatenated, pit);
-// }
-
public PSMSet getTargetPSMSet() {
return target;
}
@@ -147,7 +106,6 @@ public float getThresholdScore(float fdrThreshold, boolean isPeptideLevel) {
threshold = Float.MIN_VALUE;
for (Entry entry : map.entrySet()) {
-// System.out.println(entry.getKey()+"\t"+entry.getValue());
if (entry.getValue() > fdrThreshold)
break;
else
@@ -218,7 +176,6 @@ public static TreeMap getFDRMap(ArrayList target, ArrayList
fdrMap.put(decoyScore, fdr);
if (fdr >= 1)
break;
-// System.out.println("1: " + decoyScore + ":" + fdr);
}
}
@@ -247,16 +204,6 @@ public static TreeMap getFDRMap(ArrayList target, ArrayList
finalFDRMap.put(entry.getKey(), fdr);
}
-// if(isGreaterBetter)
-// {
-// finalFDRMap.put(Float.POSITIVE_INFINITY, 0f);
-// finalFDRMap.put(Float.NEGATIVE_INFINITY, 1f);
-// }
-// else
-// {
-// finalFDRMap.put(Float.POSITIVE_INFINITY, 1f);
-// finalFDRMap.put(Float.NEGATIVE_INFINITY, 0f);
-// }
return finalFDRMap;
}
diff --git a/src/main/java/edu/ucsd/msjava/parser/BufferedLineReader.java b/src/main/java/edu/ucsd/msjava/mgf/BufferedLineReader.java
similarity index 54%
rename from src/main/java/edu/ucsd/msjava/parser/BufferedLineReader.java
rename to src/main/java/edu/ucsd/msjava/mgf/BufferedLineReader.java
index 4d5d3214..e7135ecc 100644
--- a/src/main/java/edu/ucsd/msjava/parser/BufferedLineReader.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/BufferedLineReader.java
@@ -1,16 +1,18 @@
-package edu.ucsd.msjava.parser;
+package edu.ucsd.msjava.mgf;
import java.io.*;
-import net.pempek.unicode.UnicodeBOMInputStream;
/**
- * Buffered line reader class
- * Uses UnicodeBOMInputStream to properly detect files that start with a byte order mark
+ * Buffered line reader. Wraps the file in {@link UnicodeBOMInputStream}
+ * and consumes the BOM via {@code skipBOM()} so the first line returned by
+ * {@link #readLine()} never contains the BOM glyph -- this matters for
+ * config / mod / FASTA files saved by Windows editors that prepend a UTF-8
+ * BOM.
*/
public class BufferedLineReader extends BufferedReader implements LineReader {
public BufferedLineReader(String fileName) throws IOException {
- super(new InputStreamReader(new UnicodeBOMInputStream(new FileInputStream(fileName))));
+ super(new InputStreamReader(new UnicodeBOMInputStream(new FileInputStream(fileName)).skipBOM()));
}
@Override
diff --git a/src/main/java/edu/ucsd/msjava/parser/BufferedRandomAccessLineReader.java b/src/main/java/edu/ucsd/msjava/mgf/BufferedRandomAccessLineReader.java
similarity index 74%
rename from src/main/java/edu/ucsd/msjava/parser/BufferedRandomAccessLineReader.java
rename to src/main/java/edu/ucsd/msjava/mgf/BufferedRandomAccessLineReader.java
index 8b0d3cc2..cb60076f 100644
--- a/src/main/java/edu/ucsd/msjava/parser/BufferedRandomAccessLineReader.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/BufferedRandomAccessLineReader.java
@@ -1,7 +1,5 @@
-package edu.ucsd.msjava.parser;
+package edu.ucsd.msjava.mgf;
-import net.pempek.unicode.UnicodeBOMInputStream;
-import org.apache.commons.lang3.tuple.Pair;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
@@ -79,39 +77,39 @@ private static boolean bytesMatchBOM(byte[] buf, UnicodeBOMInputStream.BOM bomTy
* @return
*/
public static String stripBOM(String str) {
- Pair<String, Integer> result = stripBOMAndGetLength(str);
- return result.getKey();
+ return stripBOMAndGetLength(str).text();
}
+ /** Result of a BOM-strip: the updated string plus the BOM byte length. */
+ public record BomStripResult(String text, int bomLength) {}
+
/**
- * Check for a byte order mark at the start of str
- * If found, remove it
- * @param str
- * @return Key/value pair where the key is the updated string and the value is the byte order mark length
+ * Check for a byte order mark at the start of {@code str}; if found,
+ * remove it. Returns the updated string and the BOM byte length.
*/
- public static Pair<String, Integer> stripBOMAndGetLength(String str) {
+ public static BomStripResult stripBOMAndGetLength(String str) {
// Check for byte order marks
byte[] buf = str.getBytes();
int copyOffset = 0;
if (buf.length >= 4) {
- if (bytesMatchBOM(buf, net.pempek.unicode.UnicodeBOMInputStream.BOM.UTF_32_LE)) {
+ if (bytesMatchBOM(buf, UnicodeBOMInputStream.BOM.UTF_32_LE)) {
copyOffset = 4;
- } else if (bytesMatchBOM(buf, net.pempek.unicode.UnicodeBOMInputStream.BOM.UTF_32_BE)) {
+ } else if (bytesMatchBOM(buf, UnicodeBOMInputStream.BOM.UTF_32_BE)) {
copyOffset = 4;
}
}
if (copyOffset == 0 && buf.length >= 3) {
- if (bytesMatchBOM(buf, net.pempek.unicode.UnicodeBOMInputStream.BOM.UTF_8)) {
+ if (bytesMatchBOM(buf, UnicodeBOMInputStream.BOM.UTF_8)) {
copyOffset = 3;
}
}
if (copyOffset == 0 && buf.length >= 2) {
- if (bytesMatchBOM(buf, net.pempek.unicode.UnicodeBOMInputStream.BOM.UTF_16_LE)) {
+ if (bytesMatchBOM(buf, UnicodeBOMInputStream.BOM.UTF_16_LE)) {
copyOffset = 2;
- } else if (bytesMatchBOM(buf, net.pempek.unicode.UnicodeBOMInputStream.BOM.UTF_16_BE)) {
+ } else if (bytesMatchBOM(buf, UnicodeBOMInputStream.BOM.UTF_16_BE)) {
copyOffset = 2;
}
}
@@ -120,7 +118,7 @@ public static Pair stripBOMAndGetLength(String str) {
str = new String(java.util.Arrays.copyOfRange(buf, copyOffset, buf.length));
}
- return Pair.of(str, copyOffset);
+ return new BomStripResult(str, copyOffset);
}
private int fillBuffer() {
@@ -154,11 +152,11 @@ public String readLine() {
if (startOfFile) {
// Check for a byte order mark
- Pair<String, Integer> result = stripBOMAndGetLength(str);
+ BomStripResult result = stripBOMAndGetLength(str);
- bomLength = result.getValue();
+ bomLength = result.bomLength();
if (bomLength > 0) {
- str = result.getKey();
+ str = result.text();
}
}
@@ -244,23 +242,4 @@ public void close() throws IOException {
in.close();
}
- public static void main(String argv[]) throws Exception {
- long time = System.currentTimeMillis();
- String fileName = "/home/sangtaekim/Research/Data/ABRF/2011/UniProt.Yeast.NFISnr.contamsS48.fasta";
- BufferedRandomAccessLineReader in = new BufferedRandomAccessLineReader(fileName, 1 << 16);
-// BufferedReader in = new BufferedReader(new FileReader(fileName));
-// RandomAccessFile in = new RandomAccessFile(fileName, "r");
- String s;
- int lineNum = 0;
- long pos = 0;
- while ((s = in.readLine()) != null) {
- lineNum++;
- if (lineNum == 48232)
- System.out.println(lineNum + " " + s + " " + (pos = in.getPosition()));
- }
- in.seek(pos);
- System.out.println(in.readLine());
- System.out.println("Time: " + (System.currentTimeMillis() - time));
- }
-
}
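Review note: both line readers in this patch exist to keep a leading byte order mark out of the first line returned to callers. The sketch below shows the UTF-8 case only, working on raw bytes. It is an illustration under stated assumptions, not the patch's implementation: `BomSketch` and its two methods are hypothetical, and the real code also handles the UTF-16 and UTF-32 marks through `UnicodeBOMInputStream.BOM`.

```java
import java.nio.charset.StandardCharsets;

public class BomSketch {
    /** Returns 3 when the buffer starts with the UTF-8 BOM (EF BB BF), otherwise 0. */
    public static int utf8BomLength(byte[] buf) {
        boolean hasBom = buf.length >= 3
                && (buf[0] & 0xFF) == 0xEF
                && (buf[1] & 0xFF) == 0xBB
                && (buf[2] & 0xFF) == 0xBF;
        return hasBom ? 3 : 0;
    }

    /** Decodes the buffer as UTF-8, skipping a leading BOM when one is present. */
    public static String decodeSkippingBom(byte[] buf) {
        int offset = utf8BomLength(buf);
        return new String(buf, offset, buf.length - offset, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'N', 'T', 'T', '=', '2'};
        System.out.println(decodeSkippingBom(withBom));
        System.out.println(decodeSkippingBom("NTT=2".getBytes(StandardCharsets.UTF_8)));
    }
}
```

Checking the raw bytes with an explicit charset avoids depending on the platform default encoding, which a `String.getBytes()` round trip would do.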
diff --git a/src/main/java/edu/ucsd/msjava/parser/LineReader.java b/src/main/java/edu/ucsd/msjava/mgf/LineReader.java
similarity index 63%
rename from src/main/java/edu/ucsd/msjava/parser/LineReader.java
rename to src/main/java/edu/ucsd/msjava/mgf/LineReader.java
index c0f31e74..f0217a4a 100644
--- a/src/main/java/edu/ucsd/msjava/parser/LineReader.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/LineReader.java
@@ -1,4 +1,4 @@
-package edu.ucsd.msjava.parser;
+package edu.ucsd.msjava.mgf;
public interface LineReader {
String readLine();
diff --git a/src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java b/src/main/java/edu/ucsd/msjava/mgf/MgfSpectrumParser.java
similarity index 79%
rename from src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java
rename to src/main/java/edu/ucsd/msjava/mgf/MgfSpectrumParser.java
index e805a781..093e63ee 100644
--- a/src/main/java/edu/ucsd/msjava/parser/MgfSpectrumParser.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/MgfSpectrumParser.java
@@ -1,4 +1,4 @@
-package edu.ucsd.msjava.parser;
+package edu.ucsd.msjava.mgf;
import edu.ucsd.msjava.msutil.*;
@@ -11,11 +11,6 @@
import static edu.ucsd.msjava.misc.TextParsingUtils.isInteger;
-/**
- * This class enables to parse spectrum file with mgf format.
- *
- * @author sangtaekim
- */
public class MgfSpectrumParser implements SpectrumParser {
private static final Pattern TITLE_SCAN_KEY_VALUE_PATTERN =
Pattern.compile("(?i)(?:^|[\\s;])(?:scan|scans|spectrum)=(\\d+)(?:\\b|$)");
@@ -26,27 +21,13 @@ public class MgfSpectrumParser implements SpectrumParser {
private long scanMissingWarningCount;
- /**
- * Number of scans where we could not determine the scan number
- * This method is required by interface SpectrumParser
- * @return
- */
public long getScanMissingWarningCount()
{
return scanMissingWarningCount;
}
- /**
- * Amino acid set to be used to parse "SEQ="
- */
private AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- /**
- * Specify amino acid set to be used to parse "SEQ=" field.
- *
- * @param aaSet amino acid set.
- * @return this object.
- */
public MgfSpectrumParser aaSet(AminoAcidSet aaSet) {
this.aaSet = aaSet;
linesRead = 0;
@@ -55,14 +36,6 @@ public MgfSpectrumParser aaSet(AminoAcidSet aaSet) {
return this;
}
- /**
- * Implementation of readSpectrum method. Implicitly lineReader points to the start of a spectrum.
- * Reads mgf file line by line until the spectrum ends, generate a Spectrum object and returns it.
- * If it cannot read a spectrum, it returns null.
- *
- * @param lineReader a LineReader object points to the start of a spectrum
- * @return a spectrum object. null if no spectrum can be generated.
- */
public Spectrum readSpectrum(LineReader lineReader) {
Spectrum spec = null;
String title = null;
@@ -72,8 +45,6 @@ public Spectrum readSpectrum(LineReader lineReader) {
int precursorCharge = 0;
ActivationMethod activation = null;
float elutionTimeSeconds = 0;
-// Float toleranceVal = null;
-// Tolerance.Unit toleranceUnit = null;
String buf;
boolean parse = false; // parse only after the BEGIN IONS
@@ -112,7 +83,6 @@ public Spectrum readSpectrum(LineReader lineReader) {
} else if (buf.startsWith("TITLE")) {
title = buf.substring(buf.indexOf('=') + 1);
spec.setTitle(title);
-// spec.setID(title);
} else if (buf.startsWith("CHARGE")) {
// Charge state, e.g. CHARGE=2+
// Extract the text after the equals sign
@@ -206,23 +176,6 @@ public Spectrum readSpectrum(LineReader lineReader) {
else
elutionTimeSeconds = Float.valueOf(token[0]);
}
-// else if(buf.startsWith("TOL="))
-// {
-// String tolStr = buf.substring(buf.indexOf("=")+1);
-// float toleranceValue = Float.parseFloat(tolStr);
-// if(toleranceValue > 0)
-// {
-// toleranceVal = toleranceValue;
-// }
-// }
-// else if(buf.startsWith("TOLU="))
-// {
-// String tolUnitStr = buf.substring(buf.indexOf("=")+1);
-// if(tolUnitStr.equalsIgnoreCase("ppm"))
-// toleranceUnit = Tolerance.Unit.PPM;
-// else if(tolUnitStr.equalsIgnoreCase("Da"))
-// toleranceUnit = Tolerance.Unit.Da;
-// }
else if (buf.startsWith("END IONS")) {
assert (spec != null);
if (spec.getScanNum() < 0 && title != null) {
@@ -260,11 +213,6 @@ else if (buf.startsWith("END IONS")) {
spec.setRt(elutionTimeSeconds);
spec.setRtIsSeconds(true);
}
-// if(toleranceVal != null && toleranceUnit != null)
-// {
-// Tolerance precursorTolerance = new Tolerance(toleranceVal, toleranceUnit);
-// spec.setPrecursorTolerance(precursorTolerance);
-// }
if (!sorted)
Collections.sort(spec);
@@ -275,13 +223,6 @@ else if (buf.startsWith("END IONS")) {
return null;
}
- /**
- * Extract start and end scan from the title if it is of the form:
- * DatasetName.ScanStart.ScanEnd.Charge
- *
- * @param spec Spectrum
- * @param title Title line
- */
private void extractScanRangeFromTitle(Spectrum spec, String title) {
// Split on periods
String[] token = title.split("\\.");
@@ -323,13 +264,6 @@ private void extractScanRangeFromTitle(Spectrum spec, String title) {
}
}
- /**
- * Implementation of getSpecIndexMap object. Reads the entire spectrum file and
- * generates a map from a spectrum index to the file position of the spectrum.
- *
- * @param lineReader a LineReader object that points to the start of a file.
- * @return A map from spectrum indexes to the spectrum meta information.
- */
public Map getSpecMetaInfoMap(BufferedRandomAccessLineReader lineReader) {
Hashtable specIndexMap = new Hashtable();
String buf;
@@ -409,40 +343,4 @@ private boolean extractScanNumFromTitleKeyValue(Spectrum spec, String title) {
return true;
}
- // test code
- public static void main(String argv[]) throws Exception {
- long time = System.currentTimeMillis();
- String mgfFile = "/Users/sangtaekim/Research/Data/PNNL/IPYS_TD_Scere010_Orbitrap_001a.mgf";
-// String mgfFile = "/Users/sangtaekim/Research/Data/AgilentQTOF/notAnnotatedAgilentQTOF.mgf";
-
- /*
- // SpectraIterator test
- MgfSpectrumParser parser = new MgfSpectrumParser();
- SpectraIterator itr = new SpectraIterator(mgfFile, parser);
- int size = 0;
- while(itr.hasNext())
- {
- Spectrum spec = itr.next();
- size++;
- System.out.println(spec.getScanNum()+" "+spec.getPrecursorPeak());
- }
- System.out.println("Size: " + size);
- */
- // SpectraMap test
-
- /* SpectraMap test
- SpectraMap map = new SpectraMap(mgfFile, new MgfSpectrumParser());
- Spectrum spec = map.getSpectrumByScanNum(1585);
- System.out.println(spec.getScanNum() + " " + spec.getPrecursorPeak());
- */
-
-// SpectraContainer container = new SpectraContainer(mgfFile, new MgfSpectrumParser());
-// for(Spectrum spec : container)
-// System.out.println(spec.getScanNum() + " " + spec.getPrecursorPeak());
- ArrayList specContainer = new ArrayList();
- SpectraIterator iterator = new SpectraIterator(mgfFile, new MgfSpectrumParser());
- while (iterator.hasNext())
- specContainer.add(iterator.next());
- System.out.println("Time: " + (System.currentTimeMillis() - time));
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/parser/SpectrumParser.java b/src/main/java/edu/ucsd/msjava/mgf/SpectrumParser.java
similarity index 94%
rename from src/main/java/edu/ucsd/msjava/parser/SpectrumParser.java
rename to src/main/java/edu/ucsd/msjava/mgf/SpectrumParser.java
index f659b055..86856b18 100644
--- a/src/main/java/edu/ucsd/msjava/parser/SpectrumParser.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/SpectrumParser.java
@@ -1,4 +1,4 @@
-package edu.ucsd.msjava.parser;
+package edu.ucsd.msjava.mgf;
import edu.ucsd.msjava.msutil.Spectrum;
import edu.ucsd.msjava.msutil.SpectrumMetaInfo;
diff --git a/src/main/java/net/pempak/unicode/UnicodeBOMInputStream.java b/src/main/java/edu/ucsd/msjava/mgf/UnicodeBOMInputStream.java
similarity index 96%
rename from src/main/java/net/pempak/unicode/UnicodeBOMInputStream.java
rename to src/main/java/edu/ucsd/msjava/mgf/UnicodeBOMInputStream.java
index 6103cd5a..67a70b53 100644
--- a/src/main/java/net/pempak/unicode/UnicodeBOMInputStream.java
+++ b/src/main/java/edu/ucsd/msjava/mgf/UnicodeBOMInputStream.java
@@ -1,295 +1,295 @@
-// (‑●‑●)> released under the WTFPL v2 license, by Gregory Pakosz (@gpakosz)
-
-package net.pempek.unicode;
-
-import java.io.IOException;
-import java.io.InputStream;
-import java.io.PushbackInputStream;
-
-/**
- * The UnicodeBOMInputStream class wraps any
- * InputStream and detects the presence of any Unicode BOM
- * (Byte Order Mark) at its beginning, as defined by
- * RFC 3629 - UTF-8, a
- * transformation format of ISO 10646
- *
- * The
- * Unicode FAQ
- * defines 5 types of BOMs:
- * 00 00 FE FF = UTF-32, big-endian
- * FF FE 00 00 = UTF-32, little-endian
- * FE FF = UTF-16, big-endian
- * FF FE = UTF-16, little-endian
- * EF BB BF = UTF-8
- *
- *
- * Use the {@link #getBOM()} method to know whether a BOM has been detected
- * or not.
- *
- * Use the {@link #skipBOM()} method to remove the detected BOM from the
- * wrapped InputStream object.
- *
- * @author Gregory Pakosz
- * @version 1.0
- */
-public class UnicodeBOMInputStream extends InputStream
-{
- /**
- * Type safe enumeration class that describes the different types of Unicode
- * BOMs.
- */
- public static final class BOM
- {
- /**
- * NONE.
- */
- public static final BOM NONE = new BOM(new byte[]{}, "NONE");
-
- /**
- * UTF-8 BOM (EF BB BF).
- */
- public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
- (byte)0xBB,
- (byte)0xBF},
- "UTF-8");
-
- /**
- * UTF-16, little-endian (FF FE).
- */
- public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
- (byte)0xFE},
- "UTF-16 little-endian");
-
- /**
- * UTF-16, big-endian (FE FF).
- */
- public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
- (byte)0xFF},
- "UTF-16 big-endian");
-
- /**
- * UTF-32, little-endian (FF FE 00 00).
- */
- public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
- (byte)0xFE,
- (byte)0x00,
- (byte)0x00},
- "UTF-32 little-endian");
-
- /**
- * UTF-32, big-endian (00 00 FE FF).
- */
- public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
- (byte)0x00,
- (byte)0xFE,
- (byte)0xFF},
- "UTF-32 big-endian");
-
- /**
- * Returns a String representation of this BOM
- * value.
- */
- public final String toString()
- {
- return description;
- }
-
- /**
- * Returns the bytes corresponding to this BOM value.
- */
- public final byte[] getBytes()
- {
- final int length = bytes.length;
- final byte[] result = new byte[length];
-
- // make a defensive copy
- System.arraycopy(bytes, 0, result, 0, length);
-
- return result;
- }
-
- private BOM(final byte bom[], final String description)
- {
- assert(bom != null) : "invalid BOM: null is not allowed";
- assert(description != null) : "invalid description: null is not allowed";
- assert(description.length() != 0) : "invalid description: empty string is not allowed";
-
- this.bytes = bom;
- this.description = description;
- }
-
- final byte bytes[];
- private final String description;
-
- } // BOM
-
- /**
- * Constructs a new UnicodeBOMInputStream that wraps the
- * specified InputStream.
- *
- * @param inputStream an InputStream.
- *
- * @throws NullPointerException when inputStream is
- * null.
- * @throws IOException on reading from the specified InputStream
- * when trying to detect the Unicode BOM.
- */
- public UnicodeBOMInputStream(final InputStream inputStream) throws NullPointerException,
- IOException
- {
- if (inputStream == null)
- throw new NullPointerException("invalid input stream: null is not allowed");
-
- in = new PushbackInputStream(inputStream, 4);
-
- final byte bom[] = new byte[4];
- final int read = in.read(bom);
-
- switch(read)
- {
- case 4:
- if ((bom[0] == (byte)0xFF) &&
- (bom[1] == (byte)0xFE) &&
- (bom[2] == (byte)0x00) &&
- (bom[3] == (byte)0x00))
- {
- this.bom = BOM.UTF_32_LE;
- break;
- }
- else
- if ((bom[0] == (byte)0x00) &&
- (bom[1] == (byte)0x00) &&
- (bom[2] == (byte)0xFE) &&
- (bom[3] == (byte)0xFF))
- {
- this.bom = BOM.UTF_32_BE;
- break;
- }
-
- case 3:
- if ((bom[0] == (byte)0xEF) &&
- (bom[1] == (byte)0xBB) &&
- (bom[2] == (byte)0xBF))
- {
- this.bom = BOM.UTF_8;
- break;
- }
-
- case 2:
- if ((bom[0] == (byte)0xFF) &&
- (bom[1] == (byte)0xFE))
- {
- this.bom = BOM.UTF_16_LE;
- break;
- }
- else
- if ((bom[0] == (byte)0xFE) &&
- (bom[1] == (byte)0xFF))
- {
- this.bom = BOM.UTF_16_BE;
- break;
- }
-
- default:
- this.bom = BOM.NONE;
- break;
- }
-
- if (read > 0)
- in.unread(bom, 0, read);
- }
-
- /**
- * Returns the BOM that was detected in the wrapped
- * InputStream object.
- *
- * @return a BOM value.
- */
- public final BOM getBOM()
- {
- // BOM type is immutable.
- return bom;
- }
-
- /**
- * Skips the BOM that was found in the wrapped
- * InputStream object.
- *
- * @return this UnicodeBOMInputStream.
- *
- * @throws IOException when trying to skip the BOM from the wrapped
- * InputStream object.
- */
- public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
- {
- if (!skipped)
- {
- in.skip(bom.bytes.length);
- skipped = true;
- }
- return this;
- }
-
- @Override
- public int read() throws IOException
- {
- return in.read();
- }
-
- @Override
- public int read(final byte b[]) throws IOException,
- NullPointerException
- {
- return in.read(b, 0, b.length);
- }
-
- @Override
- public int read(final byte b[],
- final int off,
- final int len) throws IOException,
- NullPointerException
- {
- return in.read(b, off, len);
- }
-
- @Override
- public long skip(final long n) throws IOException
- {
- return in.skip(n);
- }
-
- @Override
- public int available() throws IOException
- {
- return in.available();
- }
-
- @Override
- public void close() throws IOException
- {
- in.close();
- }
-
- @Override
- public synchronized void mark(final int readlimit)
- {
- in.mark(readlimit);
- }
-
- @Override
- public synchronized void reset() throws IOException
- {
- in.reset();
- }
-
- @Override
- public boolean markSupported()
- {
- return in.markSupported();
- }
-
- private final PushbackInputStream in;
- private final BOM bom;
- private boolean skipped = false;
-
-} // UnicodeBOMInputStream
+// (‑●‑●)> released under the WTFPL v2 license, by Gregory Pakosz (@gpakosz)
+
+package edu.ucsd.msjava.mgf;
+
+import java.io.IOException;
+import java.io.InputStream;
+import java.io.PushbackInputStream;
+
+/**
+ * The UnicodeBOMInputStream class wraps any
+ * InputStream and detects the presence of any Unicode BOM
+ * (Byte Order Mark) at its beginning, as defined by
+ * RFC 3629 - UTF-8, a
+ * transformation format of ISO 10646
+ *
+ * The
+ * Unicode FAQ
+ * defines 5 types of BOMs:
+ * 00 00 FE FF = UTF-32, big-endian
+ * FF FE 00 00 = UTF-32, little-endian
+ * FE FF = UTF-16, big-endian
+ * FF FE = UTF-16, little-endian
+ * EF BB BF = UTF-8
+ *
+ *
+ * Use the {@link #getBOM()} method to know whether a BOM has been detected
+ * or not.
+ *
+ * Use the {@link #skipBOM()} method to remove the detected BOM from the
+ * wrapped InputStream object.
+ *
+ * @author Gregory Pakosz
+ * @version 1.0
+ */
+public class UnicodeBOMInputStream extends InputStream
+{
+ /**
+ * Type safe enumeration class that describes the different types of Unicode
+ * BOMs.
+ */
+ public static final class BOM
+ {
+ /**
+ * NONE.
+ */
+ public static final BOM NONE = new BOM(new byte[]{}, "NONE");
+
+ /**
+ * UTF-8 BOM (EF BB BF).
+ */
+ public static final BOM UTF_8 = new BOM(new byte[]{(byte)0xEF,
+ (byte)0xBB,
+ (byte)0xBF},
+ "UTF-8");
+
+ /**
+ * UTF-16, little-endian (FF FE).
+ */
+ public static final BOM UTF_16_LE = new BOM(new byte[]{ (byte)0xFF,
+ (byte)0xFE},
+ "UTF-16 little-endian");
+
+ /**
+ * UTF-16, big-endian (FE FF).
+ */
+ public static final BOM UTF_16_BE = new BOM(new byte[]{ (byte)0xFE,
+ (byte)0xFF},
+ "UTF-16 big-endian");
+
+ /**
+ * UTF-32, little-endian (FF FE 00 00).
+ */
+ public static final BOM UTF_32_LE = new BOM(new byte[]{ (byte)0xFF,
+ (byte)0xFE,
+ (byte)0x00,
+ (byte)0x00},
+ "UTF-32 little-endian");
+
+ /**
+ * UTF-32, big-endian (00 00 FE FF).
+ */
+ public static final BOM UTF_32_BE = new BOM(new byte[]{ (byte)0x00,
+ (byte)0x00,
+ (byte)0xFE,
+ (byte)0xFF},
+ "UTF-32 big-endian");
+
+ /**
+ * Returns a String representation of this BOM
+ * value.
+ */
+ public final String toString()
+ {
+ return description;
+ }
+
+ /**
+ * Returns the bytes corresponding to this BOM value.
+ */
+ public final byte[] getBytes()
+ {
+ final int length = bytes.length;
+ final byte[] result = new byte[length];
+
+ // make a defensive copy
+ System.arraycopy(bytes, 0, result, 0, length);
+
+ return result;
+ }
+
+ private BOM(final byte bom[], final String description)
+ {
+ assert(bom != null) : "invalid BOM: null is not allowed";
+ assert(description != null) : "invalid description: null is not allowed";
+ assert(description.length() != 0) : "invalid description: empty string is not allowed";
+
+ this.bytes = bom;
+ this.description = description;
+ }
+
+ final byte bytes[];
+ private final String description;
+
+ } // BOM
+
+ /**
+ * Constructs a new UnicodeBOMInputStream that wraps the
+ * specified InputStream.
+ *
+ * @param inputStream an InputStream.
+ *
+ * @throws NullPointerException when inputStream is
+ * null.
+ * @throws IOException on reading from the specified InputStream
+ * when trying to detect the Unicode BOM.
+ */
+ public UnicodeBOMInputStream(final InputStream inputStream) throws NullPointerException,
+ IOException
+ {
+ if (inputStream == null)
+ throw new NullPointerException("invalid input stream: null is not allowed");
+
+ in = new PushbackInputStream(inputStream, 4);
+
+ final byte bom[] = new byte[4];
+ final int read = in.read(bom);
+
+ switch(read)
+ {
+ case 4:
+ if ((bom[0] == (byte)0xFF) &&
+ (bom[1] == (byte)0xFE) &&
+ (bom[2] == (byte)0x00) &&
+ (bom[3] == (byte)0x00))
+ {
+ this.bom = BOM.UTF_32_LE;
+ break;
+ }
+ else
+ if ((bom[0] == (byte)0x00) &&
+ (bom[1] == (byte)0x00) &&
+ (bom[2] == (byte)0xFE) &&
+ (bom[3] == (byte)0xFF))
+ {
+ this.bom = BOM.UTF_32_BE;
+ break;
+ }
+
+ case 3:
+ if ((bom[0] == (byte)0xEF) &&
+ (bom[1] == (byte)0xBB) &&
+ (bom[2] == (byte)0xBF))
+ {
+ this.bom = BOM.UTF_8;
+ break;
+ }
+
+ case 2:
+ if ((bom[0] == (byte)0xFF) &&
+ (bom[1] == (byte)0xFE))
+ {
+ this.bom = BOM.UTF_16_LE;
+ break;
+ }
+ else
+ if ((bom[0] == (byte)0xFE) &&
+ (bom[1] == (byte)0xFF))
+ {
+ this.bom = BOM.UTF_16_BE;
+ break;
+ }
+
+ default:
+ this.bom = BOM.NONE;
+ break;
+ }
+
+ if (read > 0)
+ in.unread(bom, 0, read);
+ }
+
+ /**
+ * Returns the BOM that was detected in the wrapped
+ * InputStream object.
+ *
+ * @return a BOM value.
+ */
+ public final BOM getBOM()
+ {
+ // BOM type is immutable.
+ return bom;
+ }
+
+ /**
+ * Skips the BOM that was found in the wrapped
+ * InputStream object.
+ *
+ * @return this UnicodeBOMInputStream.
+ *
+ * @throws IOException when trying to skip the BOM from the wrapped
+ * InputStream object.
+ */
+ public final synchronized UnicodeBOMInputStream skipBOM() throws IOException
+ {
+ if (!skipped)
+ {
+ in.skip(bom.bytes.length);
+ skipped = true;
+ }
+ return this;
+ }
+
+ @Override
+ public int read() throws IOException
+ {
+ return in.read();
+ }
+
+ @Override
+ public int read(final byte b[]) throws IOException,
+ NullPointerException
+ {
+ return in.read(b, 0, b.length);
+ }
+
+ @Override
+ public int read(final byte b[],
+ final int off,
+ final int len) throws IOException,
+ NullPointerException
+ {
+ return in.read(b, off, len);
+ }
+
+ @Override
+ public long skip(final long n) throws IOException
+ {
+ return in.skip(n);
+ }
+
+ @Override
+ public int available() throws IOException
+ {
+ return in.available();
+ }
+
+ @Override
+ public void close() throws IOException
+ {
+ in.close();
+ }
+
+ @Override
+ public synchronized void mark(final int readlimit)
+ {
+ in.mark(readlimit);
+ }
+
+ @Override
+ public synchronized void reset() throws IOException
+ {
+ in.reset();
+ }
+
+ @Override
+ public boolean markSupported()
+ {
+ return in.markSupported();
+ }
+
+ private final PushbackInputStream in;
+ private final BOM bom;
+ private boolean skipped = false;
+
+} // UnicodeBOMInputStream
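Review note: the relocated class detects a BOM by reading up to four bytes through a `PushbackInputStream` and pushing them back. The sketch below shows the same peek-and-unread idea, reduced to UTF-8 only. It is not the class's own code: `BomSkipSketch`, `skipUtf8Bom`, and `firstByteAfterBom` are hypothetical names introduced for illustration, and `readNBytes` assumes Java 9 or later.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;

// Hypothetical illustration class; not part of the MS-GF+ sources.
final class BomSkipSketch {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Returns a stream positioned after a leading UTF-8 BOM, if one is present. */
    static InputStream skipUtf8Bom(InputStream raw) throws IOException {
        // The pushback buffer must be able to hold every byte we peek at.
        PushbackInputStream in = new PushbackInputStream(raw, UTF8_BOM.length);
        byte[] head = new byte[UTF8_BOM.length];
        // readNBytes keeps reading until the buffer is full or EOF is reached.
        int read = in.readNBytes(head, 0, head.length);
        boolean hasBom = read == UTF8_BOM.length
                && head[0] == UTF8_BOM[0]
                && head[1] == UTF8_BOM[1]
                && head[2] == UTF8_BOM[2];
        if (!hasBom && read > 0) {
            // Not a BOM: give the peeked bytes back so the caller sees them.
            in.unread(head, 0, read);
        }
        return in;
    }

    /** Convenience wrapper: first byte the caller would see, or -1 at EOF. */
    static int firstByteAfterBom(byte[] data) {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(data))) {
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'B'};
        System.out.println((char) firstByteAfterBom(withBom));
    }
}
```

Unlike the class in the diff, this sketch drops the BOM eagerly instead of exposing separate `getBOM()` and `skipBOM()` steps, which keeps the example short but gives the caller no way to inspect what was detected.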
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/BuildSA.java b/src/main/java/edu/ucsd/msjava/msdbsearch/BuildSA.java
index d9dd615f..6e5c5195 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/BuildSA.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/BuildSA.java
@@ -1,7 +1,6 @@
package edu.ucsd.msjava.msdbsearch;
-import edu.ucsd.msjava.ui.MSGFPlus;
-import org.apache.commons.io.FilenameUtils;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.BufferedWriter;
import java.io.File;
@@ -156,7 +155,9 @@ public static void buildSAFiles(File databaseFile, File outputDir, int mode, Str
if (databaseFile.getName().toLowerCase().endsWith(".revCat.fasta".toLowerCase())) {
System.err.println("Delete " + databaseFile.getName() + " and run MS-GF+ (or BuildSA) again.");
} else {
- String baseName = FilenameUtils.removeExtension(databaseFile.getName());
+ String fileName = databaseFile.getName();
+ int dot = fileName.lastIndexOf('.');
+ String baseName = dot >= 0 ? fileName.substring(0, dot) : fileName;
System.err.println("Delete files starting with " + baseName +
" (but keep " + databaseFile.getName() + ") and run MS-GF+ (or BuildSA) again.");
}
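Review note: the hunk above replaces `FilenameUtils.removeExtension` with an inline `lastIndexOf('.')` check. The sketch below isolates that logic so its edge cases are easy to see. `BaseNameSketch` and its `removeExtension` method are hypothetical names used only for this illustration.

```java
// Hypothetical illustration class; not part of the MS-GF+ sources.
final class BaseNameSketch {
    /** Strips the last extension from a bare file name (no directory part). */
    static String removeExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot at all: the name has no extension, so return it unchanged.
        return dot >= 0 ? fileName.substring(0, dot) : fileName;
    }

    public static void main(String[] args) {
        System.out.println(removeExtension("proteins.fasta"));
        System.out.println(removeExtension("proteins.revCat.fasta"));
        System.out.println(removeExtension("README"));
    }
}
```

Only the last extension is removed, and a name that starts with a dot and has no other dot reduces to an empty string. The inline version is safe here because it is applied to `databaseFile.getName()`, which carries no directory separators. Applied to a full path, a dot inside a directory name would need separate handling.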
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGrid.java b/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGrid.java
index ddf3af10..8e1c0670 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGrid.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGrid.java
@@ -64,11 +64,6 @@ public class CandidatePeptideGrid {
private int length;
private int[] size;
-// public CandidatePeptideGrid(AminoAcidSet aaSet, int maxPeptideLength)
-// {
-// this(aaSet, maxPeptideLength, Constants.NUM_VARIANTS_PER_PEPTIDE);
-// }
-
public CandidatePeptideGrid(AminoAcidSet aaSet, Enzyme enzyme, int maxPeptideLength, int maxNumVariantsPerPeptide, int maxMissedCleavages) {
this.numMaxMods = aaSet.getMaxNumberOfVariableModificationsPerPeptide();
this.maxPeptideLength = maxPeptideLength;
@@ -160,11 +155,6 @@ public int getNumMods(int index) {
return numMods[index][length];
}
-// public boolean addResidue(char residue)
-// {
-// return addResidue(length+1, residue);
-// }
-
/**
* Add a residue to the candidate peptide grid
* @param length
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGridConsideringMetCleavage.java b/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGridConsideringMetCleavage.java
index bfd38472..d2e27fcb 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGridConsideringMetCleavage.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/CandidatePeptideGridConsideringMetCleavage.java
@@ -8,11 +8,6 @@ public class CandidatePeptideGridConsideringMetCleavage extends CandidatePeptide
private final CandidatePeptideGrid candidatePepGridMetCleaved; // For peptides with Met cleaved
boolean isProteinNTermWithHeadingMet = false;
-// public CandidatePeptideGridConsideringMetCleavage(AminoAcidSet aaSet, int maxPeptideLength)
-// {
-// this(aaSet, maxPeptideLength, Constants.NUM_VARIANTS_PER_PEPTIDE);
-// }
-
public CandidatePeptideGridConsideringMetCleavage(AminoAcidSet aaSet, Enzyme enzyme, int maxPeptideLength, int maxNumVariantsPerPeptide, int maxNumMissedCleavages) {
super(aaSet, enzyme, maxPeptideLength, maxNumVariantsPerPeptide, maxNumMissedCleavages);
candidatePepGridMetCleaved = new CandidatePeptideGrid(aaSet, enzyme, maxPeptideLength, maxNumVariantsPerPeptide, maxNumMissedCleavages);
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java b/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java
index 886644b5..0e2200b3 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/CompactFastaSequence.java
@@ -2,7 +2,7 @@
import edu.ucsd.msjava.sequences.Constants;
import edu.ucsd.msjava.sequences.Sequence;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.*;
import java.text.SimpleDateFormat;
@@ -256,8 +256,6 @@ public long getSize() {
public byte getByteAt(long position) {
// forget boundary check for faster access
-// if(position >= this.size) return Constants.TERMINATOR;
-// return this.sequence.get((int)position);
return this.sequence[(int) position];
}
@@ -651,4 +649,4 @@ public void printTooManyDuplicateSequencesMessage(String fileName, String toolNa
"You can consolidate the duplicates using the 'Validate Fasta File' tool in the Protein Digestion Simulator, " +
"available at https://github.com/PNNL-Comp-Mass-Spec/Protein-Digestion-Simulator/releases");
}
-}
\ No newline at end of file
+}
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/CompactSuffixArray.java b/src/main/java/edu/ucsd/msjava/msdbsearch/CompactSuffixArray.java
index 033b443b..2f8083ef 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/CompactSuffixArray.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/CompactSuffixArray.java
@@ -190,9 +190,6 @@ private boolean isCompactSuffixArrayValid(long lastModified) {
}
}
- //System.out.println("LastModified times in the existing csarr and cnlcp files " +
- // "match the LastModified time of the sequence file (" + lastModified + ")");
-
return true;
}
@@ -433,14 +430,14 @@ private static void sortAndWriteBuckets(CompactFastaSequence sequence,
int prevRangeLastBucketFirst = -1;
for (RangeMetadata md : rangeMetadatas) {
- if (md.numEntries == 0) continue;
+ if (md.numEntries() == 0) continue;
mergeRangeIntoOutput(sequence, md, prevRangeLastBucketFirst, indexOut, nlcpOut);
- prevRangeLastBucketFirst = md.lastBucketFirstSuffix;
+ prevRangeLastBucketFirst = md.lastBucketFirstSuffix();
}
} finally {
for (RangeMetadata md : rangeMetadatas) {
- deleteQuietly(md.tempIndicesFile);
- deleteQuietly(md.tempLcpsFile);
+ deleteQuietly(md.tempIndicesFile());
+ deleteQuietly(md.tempLcpsFile());
}
// Sweep debris from workers that died before returning a RangeMetadata.
File[] orphans = parentDir.listFiles((dir, name) -> name.startsWith(tempBasename));
@@ -467,8 +464,8 @@ private static void mergeRangeIntoOutput(CompactFastaSequence sequence,
int prevRangeLastBucketFirst,
DataOutputStream indexOut,
DataOutputStream nlcpOut) throws IOException {
- try (DataInputStream idxIn = new DataInputStream(new BufferedInputStream(new FileInputStream(md.tempIndicesFile)));
- DataInputStream lcpIn = new DataInputStream(new BufferedInputStream(new FileInputStream(md.tempLcpsFile)))) {
+ try (DataInputStream idxIn = new DataInputStream(new BufferedInputStream(new FileInputStream(md.tempIndicesFile())));
+ DataInputStream lcpIn = new DataInputStream(new BufferedInputStream(new FileInputStream(md.tempLcpsFile())))) {
int firstIndex = idxIn.readInt();
byte firstLcp = lcpIn.readByte();
if (prevRangeLastBucketFirst >= 0) {
@@ -477,7 +474,7 @@ private static void mergeRangeIntoOutput(CompactFastaSequence sequence,
indexOut.writeInt(firstIndex);
nlcpOut.writeByte(firstLcp);
- for (int i = 1; i < md.numEntries; i++) {
+ for (int i = 1; i < md.numEntries(); i++) {
indexOut.writeInt(idxIn.readInt());
nlcpOut.writeByte(lcpIn.readByte());
}
@@ -648,19 +645,7 @@ private static void writeBucketsDirect(CompactFastaSequence sequence,
/** Per-worker sort+LCP output handle. Indices/LCPs live on disk; this carries
* the small metadata the merge step needs. Empty ranges return {@code null}
* file paths. */
- static final class RangeMetadata {
- final File tempIndicesFile;
- final File tempLcpsFile;
- final int numEntries;
- final int lastBucketFirstSuffix;
-
- RangeMetadata(File tempIndicesFile, File tempLcpsFile, int numEntries, int lastBucketFirstSuffix) {
- this.tempIndicesFile = tempIndicesFile;
- this.tempLcpsFile = tempLcpsFile;
- this.numEntries = numEntries;
- this.lastBucketFirstSuffix = lastBucketFirstSuffix;
- }
- }
+ record RangeMetadata(File tempIndicesFile, File tempLcpsFile, int numEntries, int lastBucketFirstSuffix) {}
/** Growable {@code int[]} bucket of suffix indices. Shared between the
* bucketing phase (sequential {@link #add}) and the per-range worker
@@ -730,17 +715,7 @@ private static byte computeLcpByte(CompactFastaSequence sequence, int idxA, int
@Override
public String toString() {
- String retVal = "Size of the suffix array: " + this.size + "\n";
-// int rank = 0;
-// while(indices.hasRemaining()) {
-// int index = indices.get();
-// int lcp = this.neighboringLcps.get(rank);
-// retVal += rank + "\t" + index + "\t" + lcp + "\t" + sequence.toString(factory.makeSuffix(index).getSequence()) + "\n";
-// rank++;
-// }
-// indices.rewind(); // reset marks after iteration
-// neighboringLcps.rewind();
- return retVal;
+ return "Size of the suffix array: " + this.size + "\n";
}
public void measureNominalMassError(AminoAcidSet aaSet) throws Exception {
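Review note: this file's diff turns `RangeMetadata` from a hand-written final class into a record, which is why every call site changes from field access (`md.numEntries`) to accessor calls (`md.numEntries()`). The sketch below reproduces the record declaration from the diff inside a hypothetical wrapper class, `RangeMetadataSketch`, with a made-up `totalEntries` helper. It assumes Java 16 or later, where records are a standard feature.

```java
import java.io.File;
import java.util.List;

// Hypothetical illustration class; only the record header mirrors the diff.
final class RangeMetadataSketch {
    record RangeMetadata(File tempIndicesFile, File tempLcpsFile,
                         int numEntries, int lastBucketFirstSuffix) {}

    /** Sums entries across ranges, skipping empty ranges as the merge loop does. */
    static int totalEntries(List<RangeMetadata> ranges) {
        int total = 0;
        for (RangeMetadata md : ranges) {
            if (md.numEntries() == 0) continue; // accessor call, not field access
            total += md.numEntries();
        }
        return total;
    }

    public static void main(String[] args) {
        RangeMetadata md = new RangeMetadata(new File("r0.idx"), new File("r0.lcp"), 3, 7);
        // Records generate toString(), equals() and hashCode() from the components.
        System.out.println(md);
        System.out.println(totalEntries(List.of(md)));
    }
}
```

One behavioral difference is worth keeping in mind when reviewing: the old class inherited identity-based `equals` and `hashCode` from `Object`, while a record compares component by component. That only matters if instances are ever compared or used as map keys.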
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/ConcurrentMSGFPlus.java b/src/main/java/edu/ucsd/msjava/msdbsearch/ConcurrentMSGFPlus.java
index 4afcd8a5..1a82f7d1 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/ConcurrentMSGFPlus.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/ConcurrentMSGFPlus.java
@@ -3,22 +3,49 @@
import edu.ucsd.msjava.misc.ProgressData;
import edu.ucsd.msjava.misc.ProgressReporter;
+import java.io.OutputStream;
import java.io.PrintStream;
+import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;
-import org.apache.commons.io.output.NullOutputStream;
-
public class ConcurrentMSGFPlus {
+ private static final PrintStream NULL_PRINT_STREAM = new PrintStream(OutputStream.nullOutputStream());
+
+ /** Per-task wall-clock stats in milliseconds. {@link RunMSGFPlus#getWallStats()}
+ * returns {@code null} if the task did not complete (for example, was interrupted). */
+ public record TaskWallStats(int taskNum, long preprocessMs, long dbSearchMs,
+ long computeEvalueMs, long totalMs) {}
+
public static class RunMSGFPlus implements Runnable, ProgressReporter {
private final Supplier specScannerSupplier;
private final CompactSuffixArray sa;
SearchParams params;
- List resultList;
+ private final List resultList;
private final int taskNum;
private ProgressData progress;
private ScoredSpectraMap specScanner;
private DBScanner scanner;
+ // Written once at end of run(); read by the main thread only after
+ // executor.awaitTermination, which establishes happens-before.
+ private TaskWallStats wallStats;
+
+ public List getResults() {
+ return resultList;
+ }
+
+ public int getResultCount() {
+ return resultList.size();
+ }
+
+ public void drainResultsTo(List destination) {
+ destination.addAll(resultList);
+ resultList.clear();
+ }
+
+ public TaskWallStats getWallStats() {
+ return wallStats;
+ }
@Override
public void setProgressData(ProgressData data) {
@@ -34,19 +61,20 @@ public RunMSGFPlus(
Supplier specScannerSupplier,
CompactSuffixArray sa,
SearchParams params,
- List resultList,
int taskNum
) {
+ this.resultList = new ArrayList<>();
this.specScannerSupplier = specScannerSupplier;
this.sa = sa;
this.params = params;
- this.resultList = resultList;
this.taskNum = taskNum;
progress = null;
}
@Override
public void run() {
+ long taskStartNs = System.nanoTime();
+ long preprocessMs = 0, dbSearchMs = 0, computeEvalueMs = 0;
if (progress == null) {
progress = new ProgressData();
}
@@ -72,7 +100,7 @@ public void run() {
if (params.getVerbose()) {
output = System.out;
} else {
- output = new PrintStream(new NullOutputStream());
+ output = NULL_PRINT_STREAM;
}
progress.stepRange(5.0);
@@ -98,8 +126,9 @@ public void run() {
if (Thread.currentThread().isInterrupted()) {
return;
}
+ preprocessMs = System.currentTimeMillis() - startTimePreprocess;
output.print(threadName + ": Preprocessing spectra finished ");
- output.format("(elapsed time: %.2f sec)\n", (float) ((System.currentTimeMillis() - startTimePreprocess) / 1000));
+ output.format("(elapsed time: %.2f sec)\n", preprocessMs / 1000.0f);
specScanner.getProgressObj().setParentProgressObj(null);
progress.report(5.0);
@@ -124,8 +153,9 @@ public void run() {
if (Thread.currentThread().isInterrupted()) {
return;
}
+ dbSearchMs = System.currentTimeMillis() - startTimeDbSearch;
output.print(threadName + ": Database search finished ");
- output.format("(elapsed time: %.2f sec)\n", (float) ((System.currentTimeMillis() - startTimeDbSearch) / 1000));
+ output.format("(elapsed time: %.2f sec)\n", dbSearchMs / 1000.0f);
progress.stepRange(95.0);
@@ -138,8 +168,9 @@ public void run() {
if (Thread.currentThread().isInterrupted()) {
return;
}
+ computeEvalueMs = System.currentTimeMillis() - startTimeComputeEvalue;
output.print(threadName + ": Computing spectral E-values finished ");
- output.format("(elapsed time: %.2f sec)\n", (float) ((System.currentTimeMillis() - startTimeComputeEvalue) / 1000));
+ output.format("(elapsed time: %.2f sec)\n", computeEvalueMs / 1000.0f);
scanner.getProgressObj().setParentProgressObj(null);
progress.stepRange(100);
@@ -160,7 +191,10 @@ public void run() {
scanner.addResultsToList(resultList);
progress.report(100.0);
-// gen.addSpectrumIdentificationResults(scanner.getSpecIndexDBMatchMap());
+ long totalMs = (System.nanoTime() - taskStartNs) / 1_000_000L;
+ wallStats = new TaskWallStats(taskNum, preprocessMs, dbSearchMs, computeEvalueMs, totalMs);
+ scanner = null;
+ specScanner = null;
output.println(threadName + ": Task " + taskNum + " completed.");
}
}
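Review note: in this file's diff each `RunMSGFPlus` now owns a private result list, and the caller is expected to collect results only after the executor has terminated. The sketch below shows that ownership pattern in isolation. `PerTaskResultsSketch`, `Task`, and `runAll` are hypothetical stand-ins, the string results are placeholders for the real match objects, and the two-thread pool and one-minute timeout are arbitrary choices for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class; not part of the MS-GF+ sources.
final class PerTaskResultsSketch {
    /** Stand-in for RunMSGFPlus: results are written only by the worker thread. */
    static final class Task implements Runnable {
        private final List<String> results = new ArrayList<>();
        private final int taskNum;

        Task(int taskNum) { this.taskNum = taskNum; }

        @Override
        public void run() {
            results.add("match-from-task-" + taskNum);
        }

        /** Moves results to the caller; call only after the task has finished. */
        void drainResultsTo(List<String> destination) {
            destination.addAll(results);
            results.clear();
        }
    }

    static List<String> runAll(int numTasks) {
        List<Task> tasks = new ArrayList<>();
        ExecutorService executor = Executors.newFixedThreadPool(2);
        for (int i = 0; i < numTasks; i++) {
            Task t = new Task(i);
            tasks.add(t);
            executor.execute(t);
        }
        executor.shutdown();
        try {
            // Wait for all workers before touching their unsynchronized lists.
            if (!executor.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        List<String> merged = new ArrayList<>();
        for (Task t : tasks) {
            t.drainResultsTo(merged); // merged in task order, single-threaded
        }
        return merged;
    }

    public static void main(String[] args) {
        System.out.println(runAll(3));
    }
}
```

Because no list is shared between workers, none of them needs a synchronized wrapper. The trade-off is that correctness depends on the caller never reading a task's results before termination, which is the same assumption the new `wallStats` comment in the diff states.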
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/DBScanner.java b/src/main/java/edu/ucsd/msjava/msdbsearch/DBScanner.java
index ace232a8..4f94d64e 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/DBScanner.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/DBScanner.java
@@ -6,7 +6,7 @@
import edu.ucsd.msjava.msscorer.SimpleDBSearchScorer;
import edu.ucsd.msjava.msutil.*;
import edu.ucsd.msjava.msutil.Modification.Location;
-import edu.ucsd.msjava.parser.BufferedLineReader;
+import edu.ucsd.msjava.mgf.BufferedLineReader;
import edu.ucsd.msjava.sequences.Constants;
import java.io.*;
@@ -89,8 +89,12 @@ public DBScanner(
intAAMass[aa.getResidue()] = aa.getNominalMass();
}
- specKeyDBMatchMap = Collections.synchronizedMap(new HashMap>());
- specIndexDBMatchMap = Collections.synchronizedMap(new HashMap>());
+ // DBScanner is owned by exactly one RunMSGFPlus / ConcurrentMSGFDB task.
+ // No internal fork-out (verified: no ExecutorService / Thread creation in
+ // dbSearch). Plain HashMap is enough; the synchronized wrappers were
+ // defensive against a sharing pattern that does not occur in production.
+ specKeyDBMatchMap = new HashMap<>();
+ specIndexDBMatchMap = new HashMap<>();
progress = null;
output = System.out;
@@ -116,7 +120,7 @@ public DBScanner setThreadName(String threadName) {
return this;
}
- public synchronized void addDBMatches(Map> map) {
+ public void addDBMatches(Map> map) {
if (map == null)
return;
Iterator>> itr = map.entrySet().iterator();
@@ -247,10 +251,6 @@ class MatchList extends ArrayList {
if (bufferIndex == 0)
lcp = 0;
- // For debugging
-// System.out.println(index+": " +sequence.getSubsequence(index, sequence.getSize()));
-// if(index == 4)
-// System.out.println("Debug");
// skip redundant peptides
if (Thread.currentThread().isInterrupted()) {
@@ -411,19 +411,12 @@ else if (lcp == 0) // preceding aa is changed
if (peptideLengthIndex < minPeptideLength)
continue;
-// System.out.println(sequence.getSubsequence(index+1, index+i+1));
-// if(sequence.getSubsequence(index+1, index+i+1).equalsIgnoreCase("KYPCRYCEK"))
-// {
-// System.out.println("DebugSequence: " + sequence.getSubsequence(index, index+i+1));
-// }
-
int cTermCleavageScore = 0;
if (enzyme != null) {
char cTermNeighboringResidue = sequence.getCharAt(index + peptideLengthIndex + 1);
isProteinCTerm = (cTermNeighboringResidue == Constants.TERMINATOR_CHAR);
if (enzyme.isCTerm()) {
-// if(isProteinCTerm || enzyme.isCleavable(residue)) // || cTermNeighboringResidue == Constants.INVALID_CHAR)
- if (enzyme.isCleavable(residue)) // || cTermNeighboringResidue == Constants.INVALID_CHAR) // changed by Sangtae to avoid SpecProb=0
+ if (enzyme.isCleavable(residue)) // changed by Sangtae to avoid SpecProb=0
cTermCleavageScore = peptideCleavageCredit;
else {
cTermCleavageScore = peptideCleavagePenalty;
@@ -481,13 +474,6 @@ else if (lcp == 0) // preceding aa is changed
double leftThr = (double) (theoPeptideMass - tolDaLeft);
double rightThr = (double) (theoPeptideMass + tolDaRight);
-// float tolDaLeft = specScanner.getLeftPrecursorMassTolerance().getToleranceAsDa(peptideMass);
-// float tolDaRight = specScanner.getRightPrecursorMassTolerance().getToleranceAsDa(peptideMass);
-// int maxPeptideMassIndex, minPeptideMassIndex;
-//
-// maxPeptideMassIndex = maxNominalPeptideMass + Math.round(tolDaLeft-0.4999f);
-// minPeptideMassIndex = minNominalPeptideMass - Math.round(tolDaRight-0.4999f);
-
if (leftThr < 1 || rightThr < 1) {
// Either or both of the thresholds is less than 1 (and probably negative)
// This can happen when a dynamic mod with a large negative mass is defined and is applied to a small peptide
@@ -501,11 +487,7 @@ else if (lcp == 0) // preceding aa is changed
Collection matchedSpecKeyList = specScanner.getPepMassSpecKeyMap().subMap(leftThr, rightThr).values();
if (matchedSpecKeyList.size() > 0) {
- //////
-// System.out.println("\tMatch: " + sequence.getCharAt(index)+"."+sequence.getSubsequence(index+1, index+i+1)+"."+sequence.getCharAt(index+i+1));
- ///////
boolean isNTermMetCleaved = candidatePepGrid.isNTermMetCleaved(j);
-// int pepLength = i;
int pepLength;
if (!isNTermMetCleaved)
pepLength = peptideLengthIndex;
@@ -668,7 +650,7 @@ public void computeSpecEValue(boolean storeScoreDist, int fromIndex, int toIndex
}
}
- public synchronized void generateSpecIndexDBMatchMap() {
+ public void generateSpecIndexDBMatchMap() {
Iterator>> itr = specKeyDBMatchMap.entrySet().iterator();
int numPeptidesPerSpec = this.numPeptidesPerSpec;
@@ -682,15 +664,6 @@ public synchronized void generateSpecIndexDBMatchMap() {
Map pepSeqMap = new HashMap();
for (DatabaseMatch m : matchQueue) {
String pepSeq = m.getPepSeq();
-// int index = m.getIndex();
-// char pre = sa.getSequence().getCharAt(index);
-// char post;
-// if(m.isNTermMetCleaved())
-// post = sa.getSequence().getCharAt(index+m.getLength());
-// else
-// post = sa.getSequence().getCharAt(index+m.getLength()-1);
-// String key = pre+pepSeq+post;
-
String key = pepSeq + m.getScore();
DatabaseMatch existingMatch = pepSeqMap.get(key);
if (existingMatch == null)
@@ -728,7 +701,7 @@ public synchronized void generateSpecIndexDBMatchMap() {
}
}
- public synchronized void addResultsToList(List resultList) {
+ public void addResultsToList(List resultList) {
Iterator>> itr = specIndexDBMatchMap.entrySet().iterator();
while (itr.hasNext()) {
Entry> entry = itr.next();
@@ -761,7 +734,7 @@ public void addAdditionalFeatures() {
}
// for MS-GFDB
- public synchronized void addDBSearchResults(List gen, String specFileName, boolean replicateMergedResults) {
+ public void addDBSearchResults(List gen, String specFileName, boolean replicateMergedResults) {
Map> specIndexDBMatchMap = new HashMap>();
Iterator>> itr = specKeyDBMatchMap.entrySet().iterator();
@@ -826,7 +799,6 @@ public synchronized void addDBSearchResults(List
}
float expMass = scorer.getPrecursorPeak().getMass();
-// float theoMass = pep.getParentMass();
float peptideMass = match.getPeptideMass();
float pmError = Float.MAX_VALUE;
float theoMass = peptideMass + (float) Composition.H2O;
@@ -837,7 +809,6 @@ public synchronized void addDBSearchResults(List
pmError = error;
}
}
-// if(pmError > )
if (specScanner.getRightPrecursorMassTolerance().isTolerancePPM())
pmError = pmError / theoMass * 1e6f;
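Review note on the DBScanner hunks above: the dropped `synchronized` wrappers and modifiers rest on the claim that each scanner's maps are touched by one task only. A minimal sketch of that confinement pattern follows, assuming a hypothetical `Task` class and `runAll` helper (neither exists in MS-GF+). Each task fills its own plain `HashMap`, and the caller merges results only after `Future.get()`, which provides the happens-before edge that makes the unsynchronized maps safe to read.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ConfinedMapSketch {

    // Hypothetical stand-in for a per-task scanner: the map is created and
    // mutated only by the thread that runs call().
    static final class Task implements Callable<Map<Integer, Integer>> {
        private final int from;
        private final int to;
        private final Map<Integer, Integer> matches = new HashMap<>();

        Task(int from, int to) {
            this.from = from;
            this.to = to;
        }

        @Override
        public Map<Integer, Integer> call() {
            for (int i = from; i < to; i++) {
                matches.put(i, i * i); // no locking: the map is never shared
            }
            return matches;
        }
    }

    // Runs numTasks independent tasks and merges their maps on the caller thread.
    static Map<Integer, Integer> runAll(int numTasks, int perTask) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numTasks);
        try {
            List<Future<Map<Integer, Integer>>> futures = new ArrayList<>();
            for (int t = 0; t < numTasks; t++) {
                futures.add(pool.submit(new Task(t * perTask, (t + 1) * perTask)));
            }
            Map<Integer, Integer> merged = new HashMap<>();
            for (Future<Map<Integer, Integer>> f : futures) {
                // get() happens-after the task's writes, so reading its map is safe
                merged.putAll(f.get());
            }
            return merged;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(4, 1000).size()); // prints 4000
    }
}
```

If a later change ever hands one scanner to two threads at once, this reasoning no longer holds and the wrappers (or a concurrent map) would need to come back.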
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/LibraryScanner.java b/src/main/java/edu/ucsd/msjava/msdbsearch/LibraryScanner.java
index 5f6821a7..5f7fb7a8 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/LibraryScanner.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/LibraryScanner.java
@@ -4,7 +4,7 @@
import edu.ucsd.msjava.msscorer.SimpleDBSearchScorer;
import edu.ucsd.msjava.msutil.*;
import edu.ucsd.msjava.msutil.Modification.Location;
-import edu.ucsd.msjava.parser.BufferedLineReader;
+import edu.ucsd.msjava.mgf.BufferedLineReader;
import java.io.FileNotFoundException;
import java.io.IOException;
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/MassErrorStat.java b/src/main/java/edu/ucsd/msjava/msdbsearch/MassErrorStat.java
index f6e6b06a..bdeba08e 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/MassErrorStat.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/MassErrorStat.java
@@ -10,27 +10,19 @@ public class MassErrorStat {
private List> errorList; // (error, intensity)
// for all peaks (absolute)
-// private float sum;
private float mean;
- // private float median;
private float sd;
// for top 7 peaks (absolute)
-// private float sum7;
private float mean7;
- // private float median7;
private float sd7;
// for all peaks (absolute)
-// private float rSum;
private float rMean;
- // private float rMedian;
private float rSd;
// for top 7 peaks (absolute)
-// private float rSum7;
private float rMean7;
- // private float rMedian7;
private float rSd7;
public MassErrorStat() {
@@ -80,10 +72,6 @@ public int size() {
return errorList.size();
}
-// public float getSum() {
-// return sum;
-// }
-
public float getMean() {
return mean;
}
@@ -92,10 +80,6 @@ public float getRMean() {
return rMean;
}
-// public float getMedian() {
-// return median;
-// }
-
public float getSd() {
return sd;
}
@@ -104,14 +88,6 @@ public float getRSd() {
return rSd;
}
-// public float getSum7() {
-// return sum7;
-// }
-
-// public float getRSum7() {
-// return rSum7;
-// }
-
public float getMean7() {
return mean7;
}
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/PSMFeatureFinder.java b/src/main/java/edu/ucsd/msjava/msdbsearch/PSMFeatureFinder.java
index a21d4d1b..69fa6e4d 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/PSMFeatureFinder.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/PSMFeatureFinder.java
@@ -15,7 +15,6 @@
public class PSMFeatureFinder {
private final Spectrum spec; // MS/MS spectrum
- // private final Spectrum precursorSpec;
private final Peptide peptide;
private final NewScoredSpectrum scoredSpec;
@@ -43,15 +42,11 @@ public class PSMFeatureFinder {
private int longestB = 0;
private int longestY = 0;
- // private Float ms1IonCurrent;
-// private Float isolationWindowEfficiency;
private Tolerance mme;
public PSMFeatureFinder(Spectrum spec, Spectrum precursorSpec, Peptide peptide, NewRankScorer scorer) {
this.spec = spec;
this.peptide = peptide;
-// this.precursorSpec = precursorSpec;
-
scoredSpec = scorer.getScoredSpectrum(spec);
if (scorer.getSpecDataType().getInstrumentType().isHighResolution())
mme = new Tolerance(20f, true); // for high-precision MS/MS, set tolerance as 20ppm
@@ -208,12 +203,10 @@ public Float getMS2IonCurrent() {
}
public Float getMS1IonCurrent() {
-// return ms1IonCurrent;
return null;
}
public Float getIsolationWindowEfficiency() {
-// return isolationWindowEfficiency;
return null;
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/PeptideEnumerator.java b/src/main/java/edu/ucsd/msjava/msdbsearch/PeptideEnumerator.java
index 072bae19..36fa5188 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/PeptideEnumerator.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/PeptideEnumerator.java
@@ -4,7 +4,7 @@
import edu.ucsd.msjava.msutil.Composition;
import edu.ucsd.msjava.msutil.Enzyme;
import edu.ucsd.msjava.sequences.Constants;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.*;
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/ReverseDB.java b/src/main/java/edu/ucsd/msjava/msdbsearch/ReverseDB.java
index 83ea09ee..d8cdd20b 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/ReverseDB.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/ReverseDB.java
@@ -1,6 +1,6 @@
package edu.ucsd.msjava.msdbsearch;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.cli.MSGFPlus;
import java.io.*;
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/ScoredSpectraMap.java b/src/main/java/edu/ucsd/msjava/msdbsearch/ScoredSpectraMap.java
index 66c06d63..8dea0dfa 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/ScoredSpectraMap.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/ScoredSpectraMap.java
@@ -33,8 +33,6 @@ public class ScoredSpectraMap {
private Map specKeyRankScorerMap;
-// private Map specKeyToleranceMap;
-
private boolean turnOffEdgeScoring = false;
private ProgressData progress;
@@ -60,16 +58,16 @@ public ScoredSpectraMap(
this.specDataType = specDataType;
this.precursorMassShiftPpm = precursorMassShiftPpm;
- pepMassSpecKeyMap = Collections.synchronizedSortedMap((new TreeMap()));
- specKeyScorerMap = Collections.synchronizedMap(new HashMap>());
- specIndexChargeToSpecKeyMap = Collections.synchronizedMap(new HashMap, SpecKey>());
-
-// // To support spectrum-specific tolerance
-// if(supportSpectrumSpecificErrorTolerance)
-// specKeyToleranceMap = Collections.synchronizedMap(new HashMap());
+ // Each ScoredSpectraMap is owned by exactly one RunMSGFPlus task (or the
+ // MassCalibrator pre-pass, also single-threaded). The synchronized wrappers
+ // these maps used to carry were defensive against a sharing pattern that
+ // does not occur in production code paths. Plain Map/SortedMap is enough.
+ pepMassSpecKeyMap = new TreeMap<>();
+ specKeyScorerMap = new HashMap<>();
+ specIndexChargeToSpecKeyMap = new HashMap<>();
if (storeRankScorer)
- specKeyRankScorerMap = Collections.synchronizedMap(new HashMap());
+ specKeyRankScorerMap = new HashMap<>();
progress = null;
}
@@ -158,7 +156,6 @@ public Tolerance getRightPrecursorMassTolerance() {
return rightPrecursorMassTolerance;
}
- // public int getNumAllowedC13() { return numAllowedC13; }
public int getMaxIsotopeError() {
return maxIsotopeError;
}
@@ -182,14 +179,6 @@ public NewRankScorer getRankScorer(SpecKey specKey) {
return this.specKeyRankScorerMap.get(specKey);
}
-// public Tolerance getSpectrumSpecificPrecursorTolerance(SpecKey specKey)
-// {
-// if(specKeyToleranceMap == null)
-// return null;
-// else
-// return specKeyToleranceMap.get(specKey);
-// }
-
public ScoredSpectraMap makePepMassSpecKeyMap() {
for (SpecKey specKey : specKeyList) {
int specIndex = specKey.getSpecIndex();
@@ -207,9 +196,6 @@ public ScoredSpectraMap makePepMassSpecKeyMap() {
}
specIndexChargeToSpecKeyMap.put(new Pair(specIndex, specKey.getCharge()), specKey);
-// if(specKeyToleranceMap != null && spec.getPrecursorTolerance() != null)
-// specKeyToleranceMap.put(specKey, spec.getPrecursorTolerance());
-
} else {
// Skip since precursor m/z is zero
}
@@ -269,7 +255,6 @@ private void preProcessIndividualSpectra(int fromIndex, int toIndex) {
int charge = specKey.getCharge();
spec.setCharge(charge);
- // System.out.println("GetScoredSpectrum for " + specKey.toString());
NewScoredSpectrum scoredSpec = scorer.getScoredSpectrum(spec);
float peptideMass = spec.getPrecursorMass() - (float) Composition.H2O;
@@ -366,7 +351,6 @@ private void preProcessFusedSpectra(int fromIndex, int toIndex) {
float tolDaLeft = leftPrecursorMassTolerance.getToleranceAsDa(peptideMass);
int maxNominalPeptideMass = NominalMass.toNominalMass(peptideMass) + Math.round(tolDaLeft - 0.4999f) + 1;
if (supportEdgeScore)
-// specKeyScorerMap.put(specKey, new DBScanScorerSum(scoredSpecList, maxNominalPeptideMass));
specKeyScorerMap.put(specKey, new FastScorer(scoredSpec, maxNominalPeptideMass));
else
specKeyScorerMap.put(specKey, new FastScorer(scoredSpec, maxNominalPeptideMass));
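Review note on `pepMassSpecKeyMap` becoming a plain `TreeMap`: DBScanner queries it with `subMap(leftThr, rightThr)`, where the thresholds are the theoretical peptide mass minus the left tolerance and plus the right tolerance. A minimal sketch of that range lookup follows, with hypothetical names (`MassWindowSketch`, `inWindow`) and string values standing in for `SpecKey`. Note that `SortedMap.subMap` is half-open: the lower bound is inclusive and the upper bound is exclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class MassWindowSketch {

    // Returns the values whose mass key lies in [theoMass - tolLeft, theoMass + tolRight).
    static List<String> inWindow(SortedMap<Double, String> byMass,
                                 double theoMass, double tolLeft, double tolRight) {
        double leftThr = theoMass - tolLeft;
        double rightThr = theoMass + tolRight;
        // subMap returns a live view; copy it so callers get a stable list
        return new ArrayList<>(byMass.subMap(leftThr, rightThr).values());
    }

    public static void main(String[] args) {
        SortedMap<Double, String> byMass = new TreeMap<>();
        byMass.put(999.48, "scan-1");   // below the window
        byMass.put(1000.02, "scan-2");  // inside the window
        byMass.put(1000.51, "scan-3");  // above the window
        byMass.put(1002.0, "scan-4");   // above the window

        System.out.println(inWindow(byMass, 1000.0, 0.5, 0.5)); // prints [scan-2]
    }
}
```

The view returned by `subMap` is backed by the underlying map, which is one more reason the single-owner assumption matters once the synchronized wrapper is gone.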
diff --git a/src/main/java/edu/ucsd/msjava/msdbsearch/SearchParams.java b/src/main/java/edu/ucsd/msjava/msdbsearch/SearchParams.java
index 210f968f..58794855 100644
--- a/src/main/java/edu/ucsd/msjava/msdbsearch/SearchParams.java
+++ b/src/main/java/edu/ucsd/msjava/msdbsearch/SearchParams.java
@@ -1,707 +1,518 @@
-package edu.ucsd.msjava.msdbsearch;
-
-import edu.ucsd.msjava.msgf.Tolerance;
-import edu.ucsd.msjava.msutil.*;
-import edu.ucsd.msjava.params.*;
-import edu.ucsd.msjava.parser.BufferedLineReader;
-
-import java.io.File;
-import java.io.IOException;
-import java.util.ArrayList;
-import java.util.Hashtable;
-import java.util.List;
-
-import static edu.ucsd.msjava.msutil.Composition.POTASSIUM_CHARGE_CARRIER_MASS;
-import static edu.ucsd.msjava.msutil.Composition.PROTON;
-import static edu.ucsd.msjava.msutil.Composition.SODIUM_CHARGE_CARRIER_MASS;
-
-public class SearchParams {
-
- /**
- * Two-pass precursor mass calibration (P2-cal) mode.
- *
- *
- * - {@link #AUTO} (default) — run the pre-pass, apply the learned shift
- * only if at least 200 high-confidence PSMs are collected; otherwise
- * fall through with a 0 ppm shift.
- * - {@link #ON} — run the pre-pass and always apply the learned shift,
- * even when fewer than 200 confident PSMs are collected.
- * - {@link #OFF} — skip calibration entirely. The code path MUST be
- * bit-identical to a baseline build without the flag.
- *
- */
- public enum PrecursorCalMode {
- AUTO,
- ON,
- OFF;
-
- /**
- * Case-insensitive string to enum conversion. Unknown values fall
- * back to {@link #AUTO} so that downstream code never crashes if a
- * typo slips past CLI parsing.
- */
- public static PrecursorCalMode fromString(String s) {
- if (s == null) return AUTO;
- String normalized = s.trim().toLowerCase();
- switch (normalized) {
- case "on":
- return ON;
- case "off":
- return OFF;
- case "auto":
- case "":
- return AUTO;
- default:
- return AUTO;
- }
- }
- }
-
- private List dbSearchIOList;
- private File databaseFile;
- private String decoyProteinPrefix;
- private Tolerance leftPrecursorMassTolerance;
- private Tolerance rightPrecursorMassTolerance;
- private int minIsotopeError;
- private int maxIsotopeError;
- private Enzyme enzyme;
- private int numTolerableTermini;
- private ActivationMethod activationMethod;
- private InstrumentType instType;
- private Protocol protocol;
- private AminoAcidSet aaSet;
- private int numMatchesPerSpec;
- private int startSpecIndex;
- private int endSpecIndex;
- private boolean useTDA;
- private boolean ignoreMetCleavage;
- private int minPeptideLength;
- private int maxPeptideLength;
- private int maxNumVariantsPerPeptide;
- private int minCharge;
- private int maxCharge;
- private int numThreads;
- private int numTasks;
- private int minSpectraPerThread;
- private boolean verbose;
- private boolean doNotUseEdgeScore;
- private File dbIndexDir;
- private boolean outputAdditionalFeatures;
- private int minNumPeaksPerSpectrum;
- private int minDeNovoScore;
- private double chargeCarrierMass;
- private int maxMissedCleavages;
- private int maxNumMods;
- private boolean allowDenseCentroidedPeaks;
- private int minMSLevel;
- private int maxMSLevel;
- private int outputFormat; // 0=pin (default), 1=tsv — mzid output removed
- private PrecursorCalMode precursorCalMode = PrecursorCalMode.AUTO;
-
- public SearchParams() {
- }
-
- /**
- * Returns the configured precursor mass calibration mode; defaults
- * to {@link PrecursorCalMode#AUTO}.
- */
- public PrecursorCalMode getPrecursorCalMode() {
- return precursorCalMode;
- }
-
- // Used by MS-GF+
- public List getDBSearchIOList() {
- return dbSearchIOList;
- }
-
- // Used by MS-GF+
- public File getDatabaseFile() {
- return databaseFile;
- }
-
- // Used by MS-GF+
- public String getDecoyProteinPrefix() {
- return decoyProteinPrefix;
- }
-
- // Used by MS-GF+
- public Tolerance getLeftPrecursorMassTolerance() {
- return leftPrecursorMassTolerance;
- }
-
- // Used by MS-GF+
- public Tolerance getRightPrecursorMassTolerance() {
- return rightPrecursorMassTolerance;
- }
-
- // Used by MS-GF+
- public int getMinIsotopeError() {
- return minIsotopeError;
- }
-
- // Used by MS-GF+
- public int getMaxIsotopeError() {
- return maxIsotopeError;
- }
-
- // Used by MS-GF+
- public Enzyme getEnzyme() {
- return enzyme;
- }
-
- public int getNumTolerableTermini() {
- return numTolerableTermini;
- }
-
- // Used by MS-GF+
- public ActivationMethod getActivationMethod() {
- return activationMethod;
- }
-
- // Used by MS-GF+
- public InstrumentType getInstType() {
- return instType;
- }
-
- // Used by MS-GF+
- public Protocol getProtocol() {
- return protocol;
- }
-
- // Used by MS-GF+
- public AminoAcidSet getAASet() {
- return aaSet;
- }
-
- // Used by MS-GF+
- public int getNumMatchesPerSpec() {
- return numMatchesPerSpec;
- }
-
- // Used by MS-GF+
- public int getStartSpecIndex() {
- return startSpecIndex;
- }
-
- // Used by MS-GF+
- public int getEndSpecIndex() {
- return endSpecIndex;
- }
-
- // Used by MS-GF+
- public boolean useTDA() {
- return useTDA;
- }
-
- // Used by MS-GF+
- public boolean ignoreMetCleavage() {
- return ignoreMetCleavage;
- }
-
- // Used by MS-GF+
- public int getMinPeptideLength() {
- return minPeptideLength;
- }
-
- // Used by MS-GF+
- public int getMaxPeptideLength() {
- return maxPeptideLength;
- }
-
- // Used by MS-GF+
- public int getMaxNumVariantsPerPeptide() {
- return maxNumVariantsPerPeptide;
- }
-
- // Used by MS-GF+
- public int getMinCharge() {
- return minCharge;
- }
-
- // Used by MS-GF+
- public int getMaxCharge() {
- return maxCharge;
- }
-
- // Used by MS-GF+
- public int getNumThreads() {
- return numThreads;
- }
-
- public int getNumTasks() {
- return numTasks;
- }
-
- public int getMinSpectraPerThread() {
- return minSpectraPerThread;
- }
-
- public boolean getVerbose() {
- return verbose;
- }
-
- // Used by MS-GF+
- public boolean doNotUseEdgeScore() {
- return doNotUseEdgeScore;
- }
-
- // Used by MS-GF+
- public File getDBIndexDir() {
- return dbIndexDir;
- }
-
- public boolean outputAdditionalFeatures() {
- return outputAdditionalFeatures;
- }
-
- // Used by MS-GF+
- public int getMinNumPeaksPerSpectrum() {
- return minNumPeaksPerSpectrum;
- }
-
- // Used by MS-GF+
- public int getMinDeNovoScore() {
- return minDeNovoScore;
- }
-
- public double getChargeCarrierMass() {
- return chargeCarrierMass;
- }
-
- // Used by MS-GF+
- public int getMaxMissedCleavages() {
- return maxMissedCleavages;
- }
-
- // Used by MS-GF+
- public boolean getAllowDenseCentroidedPeaks() {
- return allowDenseCentroidedPeaks;
- }
-
- // Used by MS-GF+
- public int getMinMSLevel() {
- return minMSLevel;
- }
-
- // Used by MS-GF+
- public int getMaxMSLevel() {
- return maxMSLevel;
- }
-
- /** 0=pin (default), 1=tsv. */
- public int getOutputFormat() {
- return outputFormat;
- }
-
- public boolean writeTsv() {
- return outputFormat == 1;
- }
-
- public boolean writePin() {
- return outputFormat == 0;
- }
-
- /**
- * Look for # in dataLine
- * If present, remove that character and any comment after it
- *
- * @param dataLine
- * @return dataLine without the comment
- */
- public static String getConfigLineWithoutComment(String dataLine) {
- String[] tokenArray = dataLine.split("#");
- if (tokenArray.length == 0)
- return "";
-
- return tokenArray[0].trim();
- }
-
- // Used by MS-GF+
- public String parse(ParamManager paramManager) {
- AminoAcidSet configAASet = null;
- FileParameter configFileParam = paramManager.getConfigFileParam();
-
- if (configFileParam != null && configFileParam.getFile() != null) {
- configAASet = parseConfigParamFile(paramManager);
- }
-
- // Charge carrier mass
- chargeCarrierMass = paramManager.getChargeCarrierMass();
- Composition.setChargeCarrierMass(chargeCarrierMass);
-
- // Spectrum file
- // Read outputFormat up-front so the default-output-file extension
- // logic below (inside both the single-file and directory branches)
- // sees the user-supplied value, not the field's zero initializer.
- outputFormat = paramManager.getOutputFormat();
-
- FileParameter specParam = paramManager.getSpecFileParam();
- File specPath = specParam.getFile();
-
- if (specPath == null)
- {
- return "Spectrum file is not defined; use -s at the command line or SpectrumFile in a config file";
- }
-
- if (!specPath.exists()) {
- return "Spectrum file not found: " + specPath.getPath();
- }
-
- dbSearchIOList = new ArrayList<>();
-
- if (!specPath.isDirectory()) {
- // Spectrum format
- SpecFileFormat specFormat = (SpecFileFormat) specParam.getFileFormat();
- // Output file
- File outputFile = paramManager.getOutputFileParam().getFile();
- if (outputFile == null) {
- String defaultExt = outputFormat == 1 ? ".tsv" : ".pin";
- String outputFilePath = specPath.getPath().substring(0, specPath.getPath().lastIndexOf('.')) + defaultExt;
- outputFile = new File(outputFilePath);
- }
-
- dbSearchIOList = new ArrayList<>();
- dbSearchIOList.add(new DBSearchIOFiles(specPath, specFormat, outputFile));
- } else // spectrum directory
- {
- dbSearchIOList = new ArrayList<>();
- String defaultExt = outputFormat == 1 ? ".tsv" : ".pin";
- for (File f : specPath.listFiles()) {
- SpecFileFormat specFormat = SpecFileFormat.getSpecFileFormat(f.getName());
- if (specParam.isSupported(specFormat)) {
- String outputFileName = f.getName().substring(0, f.getName().lastIndexOf('.')) + defaultExt;
- File outputFile = new File(outputFileName);
-// if (outputFile.exists())
-// return outputFile.getPath() + " already exists!";
- dbSearchIOList.add(new DBSearchIOFiles(f, specFormat, outputFile));
- }
- }
- }
-
- // FASTA file
- databaseFile = paramManager.getDBFileParam().getFile();
-
- decoyProteinPrefix = paramManager.getDecoyProteinPrefix();
-
- // Precursor mass tolerance
- ToleranceParameter tol = paramManager.getPrecursorMassToleranceParam();
- leftPrecursorMassTolerance = tol.getLeftTolerance();
- rightPrecursorMassTolerance = tol.getRightTolerance();
-
- int toleranceUnit = paramManager.getToleranceUnit();
- if (toleranceUnit != 2) {
- boolean isTolerancePPM;
- isTolerancePPM = toleranceUnit != 0;
- leftPrecursorMassTolerance = new Tolerance(leftPrecursorMassTolerance.getValue(), isTolerancePPM);
- rightPrecursorMassTolerance = new Tolerance(rightPrecursorMassTolerance.getValue(), isTolerancePPM);
- }
-
- IntRangeParameter isotopeParam = paramManager.getIsotopeRangeParameter();
- this.minIsotopeError = isotopeParam.getMin();
- this.maxIsotopeError = isotopeParam.getMax();
-
- if (rightPrecursorMassTolerance.getToleranceAsDa(1000, 2) >= 0.5f ||
- leftPrecursorMassTolerance.getToleranceAsDa(1000, 2) >= 0.5f) {
- minIsotopeError = maxIsotopeError = 0;
- }
-
- enzyme = paramManager.getEnzyme();
- numTolerableTermini = paramManager.getNumTolerableTermini();
- activationMethod = paramManager.getActivationMethod();
- instType = paramManager.getInstType();
- if (activationMethod == ActivationMethod.HCD && instType != InstrumentType.HIGH_RESOLUTION_LTQ && instType != InstrumentType.QEXACTIVE)
- instType = InstrumentType.QEXACTIVE; // by default use Q-Exactive model for HCD
-
- protocol = paramManager.getProtocol();
-
- aaSet = null;
- File modFile = paramManager.getModFileParam().getFile();
- if (modFile == null && configAASet == null)
- aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- else {
- if (modFile != null) {
- String modFileName = modFile.getName();
- String ext = modFileName.substring(modFileName.lastIndexOf('.') + 1);
- if (ext.equalsIgnoreCase("xml"))
- aaSet = AminoAcidSet.getAminoAcidSetFromXMLFile(modFile.getPath());
- else
- aaSet = AminoAcidSet.getAminoAcidSetFromModFile(modFile.getPath(), paramManager);
- } else {
- aaSet = configAASet;
- }
-
- if (protocol == Protocol.AUTOMATIC) {
- if (aaSet.containsITRAQ()) {
- if (aaSet.containsPhosphorylation())
- protocol = Protocol.ITRAQPHOSPHO;
- else
- protocol = Protocol.ITRAQ;
- } else if (aaSet.containsTMT()) {
- protocol = Protocol.TMT;
- } else {
- if (aaSet.containsPhosphorylation())
- protocol = Protocol.PHOSPHORYLATION;
- else
- protocol = Protocol.STANDARD;
- }
- }
- }
-
- numMatchesPerSpec = paramManager.getNumMatchesPerSpectrum();
-
- IntRangeParameter specIndexParam = paramManager.getSpecIndexParameter();
- startSpecIndex = specIndexParam.getMin();
- endSpecIndex = specIndexParam.getMax();
-
- useTDA = paramManager.getTDA() == 1;
- ignoreMetCleavage = paramManager.getIgnoreMetCleavage() == 1;
- outputAdditionalFeatures = paramManager.getOutputAdditionalFeatures() == 1;
-
- minPeptideLength = paramManager.getMinPeptideLength();
- maxPeptideLength = paramManager.getMaxPeptideLength();
-
- // Number of isoforms to consider per peptide, Default: 128
- maxNumVariantsPerPeptide = paramManager.getMaxNumVariantsPerPeptide();
-
- if (minPeptideLength > maxPeptideLength) {
- return "MinPepLength must not be larger than MaxPepLength";
- }
-
- minCharge = paramManager.getMinCharge();
- maxCharge = paramManager.getMaxCharge();
- if (minCharge > maxCharge) {
- return "MinCharge must not be larger than MaxCharge";
- }
-
- numThreads = paramManager.getNumThreads();
- numTasks = paramManager.getNumTasks();
- minSpectraPerThread = paramManager.getMinSpectraPerThread();
- verbose = paramManager.getVerboseFlag() == 1;
- doNotUseEdgeScore = paramManager.getEdgeScoreFlag() == 1;
-
- dbIndexDir = paramManager.getDatabaseIndexDir();
-
- minNumPeaksPerSpectrum = paramManager.getMinNumPeaksPerSpectrum();
-
- minDeNovoScore = paramManager.getMinDeNovoScore();
-
- /* Make sure max missed cleavages is a valid value and that it is not
- * being mixed with an unspecific or no-cleave enzyme
- */
- maxMissedCleavages = paramManager.getMaxMissedCleavages();
- if (maxMissedCleavages > -1 && enzyme.getName().equals("UnspecificCleavage")) {
- return "Cannot specify a MaxMissedCleavages when using unspecific cleavage enzyme";
- } else if (maxMissedCleavages > -1 && enzyme.getName().equals("NoCleavage")) {
- return "Cannot specify a MaxMissedCleavages when using no cleavage enzyme";
- }
-
- allowDenseCentroidedPeaks = paramManager.getAllowDenseCentroidedPeaks() == 1;
- // outputFormat was read earlier in parse() so the default-filename-
- // extension logic in the spec-path branches sees the user's value.
- precursorCalMode = PrecursorCalMode.fromString(paramManager.getPrecursorCalRawValue());
-
- IntRangeParameter msLevelParam = paramManager.getMSLevelParameter();
- minMSLevel = msLevelParam.getMin();
- maxMSLevel = msLevelParam.getMax();
-
- maxNumMods = paramManager.getMaxNumModsPerPeptide();
- int maxNumModsCompare = aaSet.getMaxNumberOfVariableModificationsPerPeptide();
-
- if (maxNumMods != maxNumModsCompare) {
- System.err.println("Error, code bug: " +
- "MaxNumModsPerPeptide tracked by the ParamManager does not match the value tracked by the AminoAcidSet: " +
- maxNumMods + " vs. " + maxNumModsCompare);
- System.exit(-1);
- }
-
- // Make sure all unique modifications have unique identifiers...
- Modification.setModIdentifiers();
-
- return null;
- }
-
- // Used by MS-GF+
- private AminoAcidSet parseConfigParamFile(ParamManager paramManager) {
-
- BufferedLineReader reader = null;
-
- File paramFile = paramManager.getConfigFileParam().getFile();
-
- try {
- reader = new BufferedLineReader(paramFile.getPath());
- } catch (IOException e) {
- System.err.println("Error opening parameter file " + paramFile.getPath());
- e.printStackTrace();
- System.exit(-1);
- }
-
- String dataLine;
- int lineNum = 0;
-
- // Keys in this table are line numbers
- // Values are the text from the config file, after the equals sign, defining a custom amino acid
- Hashtable customAAByLine = new Hashtable<>();
-
- // Keys in this table are line numbers
- // Values are the text from the config file, after the equals sign, defining a static or dynamic mod
- Hashtable modsByLine = new Hashtable<>();
-
- // Parse the settings
-
- int invalidParameterCount = 0;
-
- assert reader != null;
- while ((dataLine = reader.readLine()) != null) {
- lineNum++;
-
- String lineSetting = getConfigLineWithoutComment(dataLine);
- if (lineSetting.length() == 0) {
- continue;
- }
-
- String paramName = ParamManager.ParamNameEnum.getParamNameFromLine(lineSetting);
- if (paramName.isEmpty()) {
- continue;
- }
-
- if (ParamManager.ParamNameEnum.DYNAMIC_MODIFICATION.isThisParam(paramName) ||
- ParamManager.ParamNameEnum.STATIC_MODIFICATION.isThisParam(paramName) ||
- ParamManager.ParamNameEnum.CUSTOM_AA.isThisParam(paramName)) {
-
- String value = lineSetting.split("=")[1].trim();
- if (!value.equalsIgnoreCase("none")) {
- // Store the text after the equals sign
- if (ParamManager.ParamNameEnum.CUSTOM_AA.isThisParam(paramName))
- customAAByLine.put(lineNum, value);
- else
- modsByLine.put(lineNum, value);
- }
- continue;
- }
-
- boolean validParameter = false;
- for (ParamManager.ParamNameEnum param : ParamManager.ParamNameEnum.values()) {
- if (param.isThisParam(paramName)) {
- Parameter commandLineParam = paramManager.getParameter(param.getKey());
- if (commandLineParam != null) {
- validParameter = true;
- if (!commandLineParam.isValueAssigned()) {
- String value = lineSetting.split("=")[1].trim();
- String parseError = commandLineParam.parse(value);
- if (parseError == null || parseError.isEmpty()) {
- commandLineParam.setValueAssigned();
- continue;
- }
-
- if (commandLineParam.getKey().equals(ParamManager.ParamNameEnum.NUM_THREADS.getKey()) &&
- value.equalsIgnoreCase("all")) {
- // Config file has: NumThreads=All
- // This is acceptable
- // Note that numThreads should have already been initialized to the number of cores on this system
- // (see method addNumThreadsParam in ParamManager)
- continue;
- }
-
- System.err.println("Error parsing '" + lineSetting + "' in config file " +
- paramFile.getAbsolutePath() + ": " + parseError);
- System.exit(-1);
- }
- }
- }
- }
-
- if (!validParameter) {
- if (lineSetting.toLowerCase().startsWith("enzymedef")) {
- // DMS uses EnzymeDef to keep track of customize enzyme definitions
- // See, for example, https://github.com/PNNL-Comp-Mass-Spec/DMS-Analysis-Manager/blob/875533dfe95ed2c8252dc72b334cfd8ed651fa1c/Plugins/AM_MSGFDB_PlugIn/clsMSGFPlusUtils.cs#L2456
- // Thus, silently ignore this
- } else {
- System.out.println("Warning, unrecognized parameter '" + lineSetting + "' in config file " + paramFile.getName());
- invalidParameterCount++;
- }
- }
-
- }
-
- if (invalidParameterCount > 0) {
- System.out.println("Valid parameters are described in the example parameter file at " +
- "https://github.com/MSGFPlus/msgfplus/blob/master/docs/examples/MSGFPlus_Params.txt");
- }
-
- return AminoAcidSet.getAminoAcidSetFromList(paramFile.getName(), customAAByLine, modsByLine, paramManager);
- }
-
- @Override
- public String toString() {
- StringBuffer buf = new StringBuffer();
-
-// buf.append("Spectrum File(s):\n");
-// for(DBSearchIOFiles ioFile : this.dbSearchIOList)
-// {
-// buf.append("\t"+ioFile.getSpecFile().getAbsolutePath()+"\n");
-// }
-// buf.append("Database File: " + this.databaseFile.getAbsolutePath() + "\n");
-
- buf.append("\tPrecursorMassTolerance: ");
- if (leftPrecursorMassTolerance.equals(rightPrecursorMassTolerance)) {
- buf.append(leftPrecursorMassTolerance);
- } else {
- buf.append("[" + leftPrecursorMassTolerance + "," + rightPrecursorMassTolerance + "]");
- }
- buf.append("\n");
-
- buf.append("\tIsotopeError: " + this.minIsotopeError + "," + this.maxIsotopeError + "\n");
- buf.append("\tTargetDecoyAnalysis: " + this.useTDA + "\n");
- buf.append("\tFragmentationMethod: " + this.activationMethod + "\n");
- buf.append("\tInstrument: " + (instType == null ? "null" : this.instType.getNameAndDescription()) + "\n");
- buf.append("\tEnzyme: " + (enzyme == null ? "null" : this.enzyme.getName()) + "\n");
-
- String customEnzymeFile = Enzyme.getCustomEnzymeFilePath();
- if (customEnzymeFile != null && !customEnzymeFile.isEmpty()) {
- buf.append("\tEnzyme file: " + customEnzymeFile + "\n");
- }
-
- ArrayList customEnzymeMessages = Enzyme.getCustomEnzymeMessages();
- for (String message : customEnzymeMessages) {
- buf.append("\tEnzyme info: " + message + "\n");
- }
-
- buf.append("\tProtocol: " + (protocol == null ? "null" : this.protocol.getName()) + "\n");
- buf.append("\tNumTolerableTermini: " + this.numTolerableTermini + "\n");
- buf.append("\tIgnoreMetCleavage: " + this.ignoreMetCleavage + "\n");
- buf.append("\tMinPepLength: " + this.minPeptideLength + "\n");
- buf.append("\tMaxPepLength: " + this.maxPeptideLength + "\n");
- buf.append("\tMinCharge: " + this.minCharge + "\n");
- buf.append("\tMaxCharge: " + this.maxCharge + "\n");
- buf.append("\tNumMatchesPerSpec: " + this.numMatchesPerSpec + "\n");
- buf.append("\tMaxMissedCleavages: " + this.maxMissedCleavages + "\n");
- buf.append("\tMaxNumModsPerPeptide: " + this.maxNumMods + "\n");
- buf.append("\tChargeCarrierMass: " + this.chargeCarrierMass);
-
- if (Math.abs(this.chargeCarrierMass - PROTON) < 0.005) {
- buf.append(" (proton)\n");
- } else if (Math.abs(this.chargeCarrierMass - POTASSIUM_CHARGE_CARRIER_MASS) < 0.005) {
- buf.append(" (potassium)\n");
- } else if (Math.abs(this.chargeCarrierMass - SODIUM_CHARGE_CARRIER_MASS) < 0.005) {
- buf.append(" (sodium)\n");
- } else {
- buf.append(" (custom)\n");
- }
-
- buf.append("\tMSLevel: " + this.minMSLevel + "," + this.maxMSLevel + "\n");
- buf.append("\tMinNumPeaksPerSpectrum: " + this.minNumPeaksPerSpectrum + "\n");
- buf.append("\tNumIsoforms: " + this.maxNumVariantsPerPeptide + "\n");
-
- ArrayList<String> modificationsInUse = aaSet.getModificationsInUse();
-
- if (modificationsInUse.size() == 0) {
- buf.append("No static or dynamic post translational modifications are defined.\n");
- } else {
- buf.append("Post translational modifications in use:\n");
- for (String modInfo : modificationsInUse)
- buf.append("\t" + modInfo + "\n");
- }
-
- return buf.toString();
- }
-}
+package edu.ucsd.msjava.msdbsearch;
+
+import edu.ucsd.msjava.cli.IntRange;
+import edu.ucsd.msjava.cli.MSGFPlusOptions;
+import edu.ucsd.msjava.cli.OutputFormat;
+import edu.ucsd.msjava.cli.PrecursorTolerance;
+import edu.ucsd.msjava.msgf.Tolerance;
+import edu.ucsd.msjava.msutil.*;
+
+import java.io.File;
+import java.util.ArrayList;
+import java.util.List;
+
+import static edu.ucsd.msjava.msutil.Composition.POTASSIUM_CHARGE_CARRIER_MASS;
+import static edu.ucsd.msjava.msutil.Composition.PROTON;
+import static edu.ucsd.msjava.msutil.Composition.SODIUM_CHARGE_CARRIER_MASS;
+
+public class SearchParams {
+
+ /**
+ * Two-pass precursor mass calibration (P2-cal) mode.
+ *
+ * <ul>
+ * <li>{@link #AUTO} (default): run the pre-pass and apply the learned shift
+ * only if at least 200 high-confidence PSMs are collected; otherwise
+ * fall through with a 0 ppm shift.</li>
+ * <li>{@link #ON}: run the pre-pass and always apply the learned shift,
+ * even when fewer than 200 confident PSMs are collected.</li>
+ * <li>{@link #OFF}: skip calibration entirely. The code path MUST be
+ * bit-identical to a baseline build without the flag.</li>
+ * </ul>
+ */
+ public enum PrecursorCalMode {
+ AUTO,
+ ON,
+ OFF
+ }
+
+ private List<DBSearchIOFiles> dbSearchIOList;
+ private File databaseFile;
+ private String decoyProteinPrefix;
+ private Tolerance leftPrecursorMassTolerance;
+ private Tolerance rightPrecursorMassTolerance;
+ private int minIsotopeError;
+ private int maxIsotopeError;
+ private Enzyme enzyme;
+ private int numTolerableTermini;
+ private ActivationMethod activationMethod;
+ private InstrumentType instType;
+ private Protocol protocol;
+ private AminoAcidSet aaSet;
+ private int numMatchesPerSpec;
+ private int startSpecIndex;
+ private int endSpecIndex;
+ private boolean useTDA;
+ private boolean ignoreMetCleavage;
+ private int minPeptideLength;
+ private int maxPeptideLength;
+ private int maxNumVariantsPerPeptide;
+ private int minCharge;
+ private int maxCharge;
+ private int numThreads;
+ private int numTasks;
+ private int minSpectraPerThread;
+ private boolean verbose;
+ private boolean doNotUseEdgeScore;
+ private File dbIndexDir;
+ private boolean outputAdditionalFeatures;
+ private int minNumPeaksPerSpectrum;
+ private int minDeNovoScore;
+ private double chargeCarrierMass;
+ private int maxMissedCleavages;
+ private int maxNumMods;
+ private boolean allowDenseCentroidedPeaks;
+ private int minMSLevel;
+ private int maxMSLevel;
+ private OutputFormat outputFormat;
+ private PrecursorCalMode precursorCalMode = PrecursorCalMode.AUTO;
+
+ public SearchParams() {
+ }
+
+ /**
+ * Returns the configured precursor mass calibration mode; defaults
+ * to {@link PrecursorCalMode#AUTO}.
+ */
+ public PrecursorCalMode getPrecursorCalMode() {
+ return precursorCalMode;
+ }
+
+ public List<DBSearchIOFiles> getDBSearchIOList() {
+ return dbSearchIOList;
+ }
+
+ public File getDatabaseFile() {
+ return databaseFile;
+ }
+
+ public String getDecoyProteinPrefix() {
+ return decoyProteinPrefix;
+ }
+
+ public Tolerance getLeftPrecursorMassTolerance() {
+ return leftPrecursorMassTolerance;
+ }
+
+ public Tolerance getRightPrecursorMassTolerance() {
+ return rightPrecursorMassTolerance;
+ }
+
+ public int getMinIsotopeError() {
+ return minIsotopeError;
+ }
+
+ public int getMaxIsotopeError() {
+ return maxIsotopeError;
+ }
+
+ public Enzyme getEnzyme() {
+ return enzyme;
+ }
+
+ public int getNumTolerableTermini() {
+ return numTolerableTermini;
+ }
+
+ public ActivationMethod getActivationMethod() {
+ return activationMethod;
+ }
+
+ public InstrumentType getInstType() {
+ return instType;
+ }
+
+ public Protocol getProtocol() {
+ return protocol;
+ }
+
+ public AminoAcidSet getAASet() {
+ return aaSet;
+ }
+
+ public int getNumMatchesPerSpec() {
+ return numMatchesPerSpec;
+ }
+
+ public int getStartSpecIndex() {
+ return startSpecIndex;
+ }
+
+ public int getEndSpecIndex() {
+ return endSpecIndex;
+ }
+
+ public boolean useTDA() {
+ return useTDA;
+ }
+
+ public boolean ignoreMetCleavage() {
+ return ignoreMetCleavage;
+ }
+
+ public int getMinPeptideLength() {
+ return minPeptideLength;
+ }
+
+ public int getMaxPeptideLength() {
+ return maxPeptideLength;
+ }
+
+ public int getMaxNumVariantsPerPeptide() {
+ return maxNumVariantsPerPeptide;
+ }
+
+ public int getMinCharge() {
+ return minCharge;
+ }
+
+ public int getMaxCharge() {
+ return maxCharge;
+ }
+
+ public int getNumThreads() {
+ return numThreads;
+ }
+
+ public int getNumTasks() {
+ return numTasks;
+ }
+
+ public int getMinSpectraPerThread() {
+ return minSpectraPerThread;
+ }
+
+ public boolean getVerbose() {
+ return verbose;
+ }
+
+ public boolean doNotUseEdgeScore() {
+ return doNotUseEdgeScore;
+ }
+
+ public File getDBIndexDir() {
+ return dbIndexDir;
+ }
+
+ public boolean outputAdditionalFeatures() {
+ return outputAdditionalFeatures;
+ }
+
+ public int getMinNumPeaksPerSpectrum() {
+ return minNumPeaksPerSpectrum;
+ }
+
+ public int getMinDeNovoScore() {
+ return minDeNovoScore;
+ }
+
+ public double getChargeCarrierMass() {
+ return chargeCarrierMass;
+ }
+
+ public int getMaxMissedCleavages() {
+ return maxMissedCleavages;
+ }
+
+ public boolean getAllowDenseCentroidedPeaks() {
+ return allowDenseCentroidedPeaks;
+ }
+
+ public int getMinMSLevel() {
+ return minMSLevel;
+ }
+
+ public int getMaxMSLevel() {
+ return maxMSLevel;
+ }
+
+ public boolean writeTsv() {
+ return outputFormat == OutputFormat.TSV;
+ }
+
+ /**
+ * Strips a trailing comment from a config-file line: if {@code dataLine}
+ * contains {@code #}, that character and everything after it are removed.
+ *
+ * @param dataLine raw line read from the config file
+ * @return dataLine without the comment
+ */
+ public static String getConfigLineWithoutComment(String dataLine) {
+ return MSGFPlusOptions.stripComment(dataLine);
+ }
+
+ /**
+ * Build a SearchParams from the typed CLI/config-file model. Reads {@code -conf}
+ * (when set) via {@link MSGFPlusOptions#applyConfigFile(File)} so any unset CLI
+ * fields are filled from the config file before the rest of the build runs.
+ *
+ * @return null on success; user-facing error string otherwise.
+ */
+ public String parse(MSGFPlusOptions opts) {
+ // Apply config-file overlay first: fills in any opts.* fields the CLI did
+ // not set, plus collects DynamicMod/StaticMod/CustomAA into opts.*Mods lists.
+ if (opts.configFile != null) {
+ String err = opts.applyConfigFile(opts.configFile);
+ if (err != null) return err;
+ }
+
+ // Required-input + numeric/enum range check now that CLI +
+ // config-file have both run. Catches things like -m 99 with a
+ // user-facing error instead of the IllegalArgumentException
+ // the resolver would otherwise raise during search setup.
+ String requiredErr = opts.validate();
+ if (requiredErr != null) return requiredErr;
+
+ chargeCarrierMass = opts.chargeCarrierMass != null ? opts.chargeCarrierMass : 1.00727649;
+ Composition.setChargeCarrierMass(chargeCarrierMass);
+
+ // Read outputFormat up-front so the default-output-file extension logic
+ // below sees the user-supplied value, not the field's zero initializer.
+ outputFormat = opts.effectiveOutputFormat();
+
+ File specPath = opts.spectrumFile;
+ if (!specPath.exists()) {
+ return "Spectrum file not found: " + specPath.getPath();
+ }
+
+ dbSearchIOList = new ArrayList<>();
+ String defaultExt = outputFormat == OutputFormat.TSV ? ".tsv" : ".pin";
+
+ if (!specPath.isDirectory()) {
+ SpecFileFormat specFormat = SpecFileFormat.getSpecFileFormat(specPath.getName());
+ if (!isSupportedSpectrumFormat(specFormat)) {
+ return "Spectrum file extension does not match a supported format (*.mzML, *.mgf): " + specPath.getName();
+ }
+ File outputFile = opts.outputFile;
+ if (outputFile == null) {
+ String outputFilePath = specPath.getPath().substring(0, specPath.getPath().lastIndexOf('.')) + defaultExt;
+ outputFile = new File(outputFilePath);
+ }
+ dbSearchIOList.add(new DBSearchIOFiles(specPath, specFormat, outputFile));
+ } else {
+ for (File f : specPath.listFiles()) {
+ SpecFileFormat specFormat = SpecFileFormat.getSpecFileFormat(f.getName());
+ if (isSupportedSpectrumFormat(specFormat)) {
+ String outputFileName = f.getName().substring(0, f.getName().lastIndexOf('.')) + defaultExt;
+ File outputFile = new File(outputFileName);
+ dbSearchIOList.add(new DBSearchIOFiles(f, specFormat, outputFile));
+ }
+ }
+ }
+
+ databaseFile = opts.databaseFile;
+ decoyProteinPrefix = opts.decoyPrefix != null ? opts.decoyPrefix : "XXX";
+
+ PrecursorTolerance tol = opts.precursorTolerance != null ? opts.precursorTolerance : PrecursorTolerance.parse("20ppm");
+ leftPrecursorMassTolerance = tol.left();
+ rightPrecursorMassTolerance = tol.right();
+
+ int toleranceUnit = opts.precursorToleranceUnits != null ? opts.precursorToleranceUnits : 2;
+ if (toleranceUnit != 2) {
+ boolean isTolerancePPM = toleranceUnit != 0;
+ leftPrecursorMassTolerance = new Tolerance(leftPrecursorMassTolerance.getValue(), isTolerancePPM);
+ rightPrecursorMassTolerance = new Tolerance(rightPrecursorMassTolerance.getValue(), isTolerancePPM);
+ }
+
+ IntRange isotope = opts.isotopeErrorRange != null ? opts.isotopeErrorRange : new IntRange(0, 1);
+ this.minIsotopeError = isotope.min();
+ this.maxIsotopeError = isotope.max();
+
+ if (rightPrecursorMassTolerance.getToleranceAsDa(1000, 2) >= 0.5f ||
+ leftPrecursorMassTolerance.getToleranceAsDa(1000, 2) >= 0.5f) {
+ minIsotopeError = maxIsotopeError = 0;
+ }
+
+ enzyme = opts.effectiveEnzyme();
+ numTolerableTermini = opts.numTolerableTermini != null ? opts.numTolerableTermini : 2;
+ activationMethod = opts.effectiveActivationMethod();
+ instType = opts.effectiveInstrumentType();
+ if (activationMethod == ActivationMethod.HCD
+ && instType != InstrumentType.HIGH_RESOLUTION_LTQ
+ && instType != InstrumentType.QEXACTIVE) {
+ instType = InstrumentType.QEXACTIVE; // default to Q-Exactive for HCD
+ }
+ protocol = opts.effectiveProtocol();
+
+ aaSet = null;
+ File modFile = opts.modificationFile;
+ boolean hasConfigMods = !opts.dynamicMods.isEmpty()
+ || !opts.staticMods.isEmpty()
+ || !opts.customAAs.isEmpty();
+
+ if (modFile == null && !hasConfigMods) {
+ aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
+ } else {
+ if (modFile != null) {
+ String modFileName = modFile.getName();
+ String ext = modFileName.substring(modFileName.lastIndexOf('.') + 1);
+ if (ext.equalsIgnoreCase("xml")) {
+ aaSet = AminoAcidSet.getAminoAcidSetFromXMLFile(modFile.getPath());
+ } else {
+ aaSet = AminoAcidSet.getAminoAcidSetFromModFile(modFile.getPath(), opts);
+ }
+ } else {
+ List<String> mods = new ArrayList<>(opts.staticMods.size() + opts.dynamicMods.size());
+ mods.addAll(opts.staticMods);
+ mods.addAll(opts.dynamicMods);
+ aaSet = AminoAcidSet.getAminoAcidSetFromModEntries(
+ opts.configFile != null ? opts.configFile.getName() : "config",
+ opts.customAAs, mods, opts);
+ }
+
+ if (protocol == Protocol.AUTOMATIC) {
+ if (aaSet.containsITRAQ()) {
+ protocol = aaSet.containsPhosphorylation() ? Protocol.ITRAQPHOSPHO : Protocol.ITRAQ;
+ } else if (aaSet.containsTMT()) {
+ protocol = Protocol.TMT;
+ } else {
+ protocol = aaSet.containsPhosphorylation() ? Protocol.PHOSPHORYLATION : Protocol.STANDARD;
+ }
+ }
+ }
+
+ numMatchesPerSpec = opts.numMatchesPerSpec != null ? opts.numMatchesPerSpec : 1;
+
+ IntRange specIdx = opts.specIndexRange != null ? opts.specIndexRange : new IntRange(1, Integer.MAX_VALUE - 1);
+ startSpecIndex = specIdx.min();
+ endSpecIndex = specIdx.max();
+
+ useTDA = opts.effectiveTdaStrategy() == 1;
+ ignoreMetCleavage = (opts.ignoreMetCleavage != null ? opts.ignoreMetCleavage : 0) == 1;
+ outputAdditionalFeatures = (opts.addFeatures != null ? opts.addFeatures : 0) == 1;
+
+ minPeptideLength = opts.effectiveMinPeptideLength();
+ maxPeptideLength = opts.effectiveMaxPeptideLength();
+ maxNumVariantsPerPeptide = opts.numIsoforms != null ? opts.numIsoforms : edu.ucsd.msjava.sequences.Constants.NUM_VARIANTS_PER_PEPTIDE;
+
+ if (minPeptideLength > maxPeptideLength) {
+ return "MinPepLength must not be larger than MaxPepLength";
+ }
+
+ minCharge = opts.effectiveMinCharge();
+ maxCharge = opts.effectiveMaxCharge();
+ if (minCharge > maxCharge) {
+ return "MinCharge must not be larger than MaxCharge";
+ }
+
+ numThreads = opts.numThreads != null ? opts.numThreads : Runtime.getRuntime().availableProcessors();
+ numTasks = opts.numTasks != null ? opts.numTasks : 0;
+ minSpectraPerThread = opts.effectiveMinSpectraPerThread();
+ verbose = opts.effectiveVerbose() == 1;
+ doNotUseEdgeScore = (opts.edgeScore != null ? opts.edgeScore : 0) == 1;
+
+ dbIndexDir = opts.dbIndexDir;
+ minNumPeaksPerSpectrum = opts.minNumPeaks != null ? opts.minNumPeaks : edu.ucsd.msjava.sequences.Constants.MIN_NUM_PEAKS_PER_SPECTRUM;
+ minDeNovoScore = opts.minDeNovoScore != null ? opts.minDeNovoScore : edu.ucsd.msjava.sequences.Constants.MIN_DE_NOVO_SCORE;
+
+ maxMissedCleavages = opts.maxMissedCleavages != null ? opts.maxMissedCleavages : -1;
+ if (maxMissedCleavages > -1 && enzyme.getName().equals("UnspecificCleavage")) {
+ return "Cannot specify a MaxMissedCleavages when using unspecific cleavage enzyme";
+ } else if (maxMissedCleavages > -1 && enzyme.getName().equals("NoCleavage")) {
+ return "Cannot specify a MaxMissedCleavages when using no cleavage enzyme";
+ }
+
+ allowDenseCentroidedPeaks = (opts.allowDenseCentroidedPeaks != null ? opts.allowDenseCentroidedPeaks : 0) == 1;
+ precursorCalMode = opts.precursorCalMode != null ? opts.precursorCalMode : PrecursorCalMode.AUTO;
+
+ IntRange ms = opts.msLevel != null ? opts.msLevel : new IntRange(2, 2);
+ minMSLevel = ms.min();
+ maxMSLevel = ms.max();
+
+ maxNumMods = opts.effectiveMaxNumMods();
+ int maxNumModsCompare = aaSet.getMaxNumberOfVariableModificationsPerPeptide();
+ if (maxNumMods != maxNumModsCompare) {
+ System.err.println("Error, code bug: MaxNumModsPerPeptide tracked by MSGFPlusOptions ("
+ + maxNumMods + ") does not match value tracked by AminoAcidSet ("
+ + maxNumModsCompare + ")");
+ System.exit(-1);
+ }
+
+ Modification.setModIdentifiers();
+ return null;
+ }
+
+ /** Spectrum-format whitelist: only mzML and MGF are supported. */
+ private static boolean isSupportedSpectrumFormat(SpecFileFormat fmt) {
+ return fmt == SpecFileFormat.MZML
+ || fmt == SpecFileFormat.MGF;
+ }
+
+
+ @Override
+ public String toString() {
+ StringBuilder buf = new StringBuilder();
+
+ buf.append("\tPrecursorMassTolerance: ");
+ if (leftPrecursorMassTolerance.equals(rightPrecursorMassTolerance)) {
+ buf.append(leftPrecursorMassTolerance);
+ } else {
+ buf.append("[" + leftPrecursorMassTolerance + "," + rightPrecursorMassTolerance + "]");
+ }
+ buf.append("\n");
+
+ buf.append("\tIsotopeError: " + this.minIsotopeError + "," + this.maxIsotopeError + "\n");
+ buf.append("\tTargetDecoyAnalysis: " + this.useTDA + "\n");
+ buf.append("\tFragmentationMethod: " + this.activationMethod + "\n");
+ buf.append("\tInstrument: " + (instType == null ? "null" : this.instType.getNameAndDescription()) + "\n");
+ buf.append("\tEnzyme: " + (enzyme == null ? "null" : this.enzyme.getName()) + "\n");
+
+ String customEnzymeFile = Enzyme.getCustomEnzymeFilePath();
+ if (customEnzymeFile != null && !customEnzymeFile.isEmpty()) {
+ buf.append("\tEnzyme file: " + customEnzymeFile + "\n");
+ }
+
+ ArrayList<String> customEnzymeMessages = Enzyme.getCustomEnzymeMessages();
+ for (String message : customEnzymeMessages) {
+ buf.append("\tEnzyme info: " + message + "\n");
+ }
+
+ buf.append("\tProtocol: " + (protocol == null ? "null" : this.protocol.getName()) + "\n");
+ buf.append("\tNumTolerableTermini: " + this.numTolerableTermini + "\n");
+ buf.append("\tIgnoreMetCleavage: " + this.ignoreMetCleavage + "\n");
+ buf.append("\tMinPepLength: " + this.minPeptideLength + "\n");
+ buf.append("\tMaxPepLength: " + this.maxPeptideLength + "\n");
+ buf.append("\tMinCharge: " + this.minCharge + "\n");
+ buf.append("\tMaxCharge: " + this.maxCharge + "\n");
+ buf.append("\tNumMatchesPerSpec: " + this.numMatchesPerSpec + "\n");
+ buf.append("\tMaxMissedCleavages: " + this.maxMissedCleavages + "\n");
+ buf.append("\tMaxNumModsPerPeptide: " + this.maxNumMods + "\n");
+ buf.append("\tChargeCarrierMass: " + this.chargeCarrierMass);
+
+ if (Math.abs(this.chargeCarrierMass - PROTON) < 0.005) {
+ buf.append(" (proton)\n");
+ } else if (Math.abs(this.chargeCarrierMass - POTASSIUM_CHARGE_CARRIER_MASS) < 0.005) {
+ buf.append(" (potassium)\n");
+ } else if (Math.abs(this.chargeCarrierMass - SODIUM_CHARGE_CARRIER_MASS) < 0.005) {
+ buf.append(" (sodium)\n");
+ } else {
+ buf.append(" (custom)\n");
+ }
+
+ buf.append("\tMSLevel: " + this.minMSLevel + "," + this.maxMSLevel + "\n");
+ buf.append("\tMinNumPeaksPerSpectrum: " + this.minNumPeaksPerSpectrum + "\n");
+ buf.append("\tNumIsoforms: " + this.maxNumVariantsPerPeptide + "\n");
+
+ ArrayList<String> modificationsInUse = aaSet.getModificationsInUse();
+
+ if (modificationsInUse.size() == 0) {
+ buf.append("No static or dynamic post translational modifications are defined.\n");
+ } else {
+ buf.append("Post translational modifications in use:\n");
+ for (String modInfo : modificationsInUse)
+ buf.append("\t" + modInfo + "\n");
+ }
+
+ return buf.toString();
+ }
+}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/AAFrequencyCounter.java b/src/main/java/edu/ucsd/msjava/msgf/AAFrequencyCounter.java
index bb04dc82..50b2bb29 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/AAFrequencyCounter.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/AAFrequencyCounter.java
@@ -109,48 +109,4 @@ public int getOccurrence(String str) {
return occ;
}
- public static void main(String argv[]) {
- System.out.println(getRandomFrequency("AAA"));
-// generate(3);
- /*
- AAFrequencyCounter counter = new AAFrequencyCounter();
- counter.readFromFreqFile("/home/sangtaekim/Research/Data/AAFrequency/SProt_2mer.txt");
- counter.frequencyTable.printSorted();
- */
- }
-
- public static void generate(int nMer) {
- AAFrequencyCounter counter = new AAFrequencyCounter();
- counter.setNMer(nMer);
-// counter.readFromFasta("/home/sangtaekim/Research/Data/SProt/uniprot_sprot.fasta");
- counter.readFromFasta("/home/sangtaekim/Research/Data/EColiDB/Ecol_protein_formatted.fasta");
-
- System.out.println("n\t" + nMer);
- System.out.println("size\t" + counter.sizeNMer);
- String allAA = "GASPVTCLINDQKEMHFRYW";
-
- if (nMer == 1) {
- for (int i = 0; i < allAA.length(); i++) {
- char c = allAA.charAt(i);
- System.out.println(c + "\t" + counter.getOccurrence(String.valueOf(c)));
- }
-
- } else if (nMer == 2) {
- for (int i = 0; i < allAA.length(); i++) {
- for (int j = 0; j < allAA.length(); j++) {
- String s = "" + allAA.charAt(i) + allAA.charAt(j);
- System.out.println(s + "\t" + counter.getOccurrence(s));
- }
- }
- } else if (nMer == 3) {
- for (int i = 0; i < allAA.length(); i++) {
- for (int j = 0; j < allAA.length(); j++) {
- for (int k = 0; k < allAA.length(); k++) {
- String s = "" + allAA.charAt(i) + allAA.charAt(j) + allAA.charAt(k);
- System.out.println(s + "\t" + counter.getOccurrence(s));
- }
- }
- }
- }
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/BacktrackTable.java b/src/main/java/edu/ucsd/msjava/msgf/BacktrackTable.java
index 2015ad6c..a8db230a 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/BacktrackTable.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/BacktrackTable.java
@@ -36,18 +36,6 @@ public void getReconstructions(T curNode, int score, String prefix, ArrayList edge : graph.getEdges(curNode)) {
int edgeIndex = edge.getEdgeIndex();
-// String residue;
-// if(edgeIndex >= 0)
-// residue = String.valueOf(graph.getAASet().getAminoAcid(edgeIndex).getResidue());
-// else
-// {
-// if(edgeIndex == -2)
-// residue="K.";
-// else if(edgeIndex == -3)
-// residue = "G.";
-// else
-// residue = "";
-// }
if (pointer.isSet(score, edgeIndex))
getReconstructions(edge.getPrevNode(), score - (edge.getEdgeScore() + pointer.getNodeScore()), prefix + graph.getAASet().getAminoAcid(edgeIndex).getResidueStr(), reconstructions, sa);
}
@@ -64,12 +52,6 @@ public String getOneReconstruction(T curNode, int score, String prefix) {
{
return prefix;
}
-// for(T prevNode : graph.getPreviousNodes(curNode))
-// {
-// int edgeIndex = graph.getEdgeIndex(curNode, prevNode);
-// if(pointer.isSet(score, edgeIndex))
-// getOneReconstruction(prevNode, score-pointer.getCurScore(), prefix+aaSet.getAminoAcid(edgeIndex).getResidue());
-// }
for (DeNovoGraph.Edge edge : graph.getEdges(curNode)) {
int edgeIndex = edge.getEdgeIndex();
if (pointer.isSet(score, edgeIndex))
diff --git a/src/main/java/edu/ucsd/msjava/msgf/DeNovoGraph.java b/src/main/java/edu/ucsd/msjava/msgf/DeNovoGraph.java
index 9028eac0..da66ecb0 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/DeNovoGraph.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/DeNovoGraph.java
@@ -37,7 +37,6 @@ public ArrayList getIntermediateNodeList() {
public abstract int getNodeScore(T node);
- // public abstract int getEdgeScore(T curNode, T prevNode);
public abstract ArrayList> getEdges(T curNode);
public abstract T getComplementNode(T node);
diff --git a/src/main/java/edu/ucsd/msjava/msgf/FlexAminoAcidGraph.java b/src/main/java/edu/ucsd/msjava/msgf/FlexAminoAcidGraph.java
index 18f86291..eb98d5ed 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/FlexAminoAcidGraph.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/FlexAminoAcidGraph.java
@@ -276,11 +276,6 @@ private void setBackwardEdgesFromSink() {
ArrayList aaList = aaSet.getAAList(location);
-// if(enzymaticCleavageOnly && direction != enzyme.isNTerm())
-// aaList = aaSet.getEnzymeAAList();
-// else
-// aaList = aaSet.getAAList(location);
-
int peptideNominalMass = pmNode.getNominalMass();
ArrayList> edges = new ArrayList>();
for (AminoAcid aa : aaList) {
@@ -321,17 +316,9 @@ private void makeForwardEdges(NominalMass curNode, ArrayList aaList,
aa.getProbability(),
aaSet.getIndex(aa),
aa.getMass());
-// if(curNode.getNominalMass() == 57 && nextNode.getNominalMass() == 114)
-// System.out.println("Debug");
int errorScore = scoredSpec.getEdgeScore(nextNode, curNode, aa.getMass());
-// if(aa.isModified())
-// errorScore += MODIFIED_EDGE_PENALTY;
if (errorScore < -100 || errorScore > 100) {
System.err.println("Warning, invalid ErrorScore: " + errorScore);
-
- // Could abort the search
- // System.exit(-1);
-
// Instead, use a score of -4
errorScore = -4;
}
@@ -347,19 +334,4 @@ private void makeForwardEdges(NominalMass curNode, ArrayList aaList,
}
}
-// private void computeEdgeScores()
-// {
-// for(NominalMass curNode : intermediateNodes)
-// {
-// ArrayList> edges = edgeMap.get(curNode);
-// for(DeNovoGraph.Edge edge : edges)
-// {
-// NominalMass prevNode = edge.getPrevNode();
-// int errorScore = scoredSpec.getEdgeScore(curNode, prevNode, edge.getEdgeMass());
-// assert(errorScore == edge.getErrorScore());
-// edge.setErrorScore(errorScore);
-// }
-// edgeMap.put(curNode, edges);
-// }
-// }
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/GeneratingFunction.java b/src/main/java/edu/ucsd/msjava/msgf/GeneratingFunction.java
index 367fdf04..0d0774d9 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/GeneratingFunction.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/GeneratingFunction.java
@@ -19,7 +19,6 @@ public class GeneratingFunction implements GF {
private boolean calcProb = true;
private Enzyme enzyme = Enzyme.TRYPSIN;
- // private int numScoreBinsPerNode = 1000;
private int gfTableCapacity;
private ScoreDist distribution = null;
@@ -27,9 +26,6 @@ public class GeneratingFunction implements GF {
private class GFTable extends LinkedHashMap {
- /**
- *
- */
private static final long serialVersionUID = 1L;
private final int capacity;
@@ -49,14 +45,11 @@ protected boolean removeEldestEntry(Map.Entry eldest) {
private boolean isGFComputed = false;
-// private HashMap srmScore = null;
-
public GeneratingFunction(DeNovoGraph graph) {
this.graph = graph;
this.gfTableCapacity = 1 + graph.intermediateNodes.size() + graph.sinkNodes.size();
}
- // Builder
public GeneratingFunction doNotBacktrack() {
this.backtrack = false;
return this;
@@ -77,7 +70,6 @@ public GeneratingFunction enzyme(Enzyme enzyme) {
return this;
}
- // public GeneratingFunction numScoreBinsPerNode(int numBins) { this.numScoreBinsPerNode = numBins; return this; }
public GeneratingFunction gfTableCapacity(int gfTableCapacity) {
this.gfTableCapacity = gfTableCapacity;
return this;
@@ -99,7 +91,6 @@ public Enzyme getEnzyme() {
return enzyme;
}
- // public int getNumScoreBinsPerNode() { return numScoreBinsPerNode; }
public boolean isGFComputed() {
return this.isGFComputed;
}
@@ -417,8 +408,6 @@ private void setCurNode(T curNode, ScoreDistFactory scoreDistFactory) {
System.err.println("Warning, MinScore is abnormally low; "
+ "MinScore: " + curMinScore + ", MaxScore: " + curMaxScore + ", "
+ "CurNode: " + curNode.getNominalMass() + ", CurNodeScore: " + curNodeScore);
- // Could abort processing
- // System.exit(-1);
// Instead, skip this node
return;
}
@@ -427,8 +416,6 @@ private void setCurNode(T curNode, ScoreDistFactory scoreDistFactory) {
System.err.println("Warning, MaxScore is abnormally high; "
+ "MinScore: " + curMinScore + ", MaxScore: " + curMaxScore + ", "
+ "CurNode: " + curNode.getNominalMass() + ", CurNodeScore: " + curNodeScore);
- // Could abort processing
- // System.exit(-1);
// Instead, skip this node
return;
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/Histogram.java b/src/main/java/edu/ucsd/msjava/msgf/Histogram.java
index 3b2c9046..09d65785 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/Histogram.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/Histogram.java
@@ -68,8 +68,6 @@ public void printSortedRatio() {
Collections.sort(keyList);
for (T key : keyList) {
System.out.println(key + "\t" + this.get(key) + "\t" + this.get(key) / (float) totalCount);
-// System.out.print(key+"\t"+this.get(key)+"\t");
-// System.out.format("%.3f\n", this.get(key)/(float)totalCount);
}
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/MSGFDBResultGenerator.java b/src/main/java/edu/ucsd/msjava/msgf/MSGFDBResultGenerator.java
index 18854ec0..992b3ecd 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/MSGFDBResultGenerator.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/MSGFDBResultGenerator.java
@@ -127,17 +127,6 @@ public double getEDD(double specProbThreshold) {
// returns cumulative probability <= specProbThreshold
public double getSpectralProbability(double specProbThreshold) {
-// int index = Arrays.binarySearch(cumScoreDist, specProbThreshold);
-// if(index >= 0)
-// return cumScoreDist[index];
-// else
-// {
-// index = -index-1;
-// if(index > 0)
-// return cumScoreDist[index-1];
-// else
-// return 0;
-// }
while (curIndex < cumScoreDist.length - 1 && cumScoreDist[curIndex + 1] <= specProbThreshold)
++curIndex;
diff --git a/src/main/java/edu/ucsd/msjava/msgf/MSGFResult.java b/src/main/java/edu/ucsd/msjava/msgf/MSGFResult.java
deleted file mode 100644
index a4ac6aa3..00000000
--- a/src/main/java/edu/ucsd/msjava/msgf/MSGFResult.java
+++ /dev/null
@@ -1,33 +0,0 @@
-package edu.ucsd.msjava.msgf;
-
-import edu.ucsd.msjava.msutil.Peptide;
-import edu.ucsd.msjava.msutil.Spectrum;
-
-public class MSGFResult {
- public MSGFResult(Spectrum spec, Peptide annotation, GeneratingFunction> gf) {
- this.spec = spec;
- this.annotation = annotation;
- this.gf = gf;
- }
-
- public Spectrum getSpec() {
- return spec;
- }
-
- public Peptide getAnnotation() {
- return annotation;
- }
-
- public GeneratingFunction> getGf() {
- return gf;
- }
-
- public ProfileGF> getProfGF() {
- return profGF;
- }
-
- private Spectrum spec;
- private Peptide annotation;
- private GeneratingFunction> gf;
- private ProfileGF> profGF;
-}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/MassListComparator.java b/src/main/java/edu/ucsd/msjava/msgf/MassListComparator.java
index 529fb4e9..f5e558bb 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/MassListComparator.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/MassListComparator.java
@@ -50,36 +50,9 @@ else if (i2 == massList2.size() - 1)
}
- public static class MatchedPair {
- T m1, m2;
-
- public MatchedPair(T m1, T m2) {
- this.m1 = m1;
- this.m2 = m2;
- }
-
- public T getMass1() {
- return m1;
- }
-
- public T getMass2() {
- return m2;
- }
+ public record MatchedPair(T m1, T m2) {
+ public T getMass1() { return m1; }
+ public T getMass2() { return m2; }
}
- public static void main(String argv[]) {
- float[] data1 = {40, 40.1f, 40.2f, 50};
- float[] data2 = {39.7f, 40.05f, 40.6f};
- ArrayList list1 = new ArrayList();
- ArrayList list2 = new ArrayList();
-
- for (float f : data1)
- list1.add(new Mass(f));
- for (float f : data2)
- list2.add(new Mass(f));
-
- MassListComparator comparator = new MassListComparator(list1, list2);
- for (MatchedPair pair : comparator.getMatchedList(new Tolerance(0.5f)))
- System.out.println(pair.m1.getMass() + "\t" + pair.m2.getMass());
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/NominalMassFactory.java b/src/main/java/edu/ucsd/msjava/msgf/NominalMassFactory.java
index 073dd70c..adb395f3 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/NominalMassFactory.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/NominalMassFactory.java
@@ -111,7 +111,6 @@ public boolean contains(NominalMass node) {
return factory[index] != null;
}
- // // static methods
private static NominalMassFactory defaultNominalMassFactory = new NominalMassFactory(50);
public static NominalMass getInstanceFor(float mass) {
diff --git a/src/main/java/edu/ucsd/msjava/msgf/ProfileGF.java b/src/main/java/edu/ucsd/msjava/msgf/ProfileGF.java
index b26bc2e0..5e4c0a23 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/ProfileGF.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/ProfileGF.java
@@ -98,8 +98,6 @@ public ProfileGF computeProfile(float specProb) {
int thresholdScore = gf.getThresholdScore(specProb) + 1;
if (thresholdScore >= gf.getMaxScore())
thresholdScore = gf.getMaxScore() - 1;
-// else if(thresholdScore < gf.getMaxScore()-gf.getNumScoreBinsPerNode())
-// thresholdScore = gf.getMaxScore()-gf.getNumScoreBinsPerNode();
return computeProfile(thresholdScore);
}
@@ -182,18 +180,6 @@ private void setBackwardNodes(T curNode, HashMap bwdTable) {
if (prevBwdDist != null)
prevBwdDist.addNumber(score - curNodeScore, numberRecs);
}
-// for(int aaIndex : pointer.getBacktrackAAIndexList(score))
-// {
-// if((bits & (1 << aaIndex)) == 0){
-// bits |= (1 << aaIndex);
-// T prevNode = gf.getGraph().getPreviousNode(curNode, gf.getGraph().getAASet().getAminoAcid(aaIndex));
-// prevBwdDists[aaIndex] = bwdTable.get(prevNode);
-// }
-// T prevNode = gf.getGraph().getPreviousNode(curNode, gf.getAASet().getAminoAcid(aaIndex));
-// ScoreDist prevBwdDist = prevBwdDists[aaIndex];
-// if(prevBwdDist != null)
-// prevBwdDist.addNumber(score-curNodeScore, numberRecs);
-// }
}
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/ProfilePeak.java b/src/main/java/edu/ucsd/msjava/msgf/ProfilePeak.java
index 0048ef3e..bf4e4a76 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/ProfilePeak.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/ProfilePeak.java
@@ -2,32 +2,13 @@
import edu.ucsd.msjava.msutil.Matter;
-public class ProfilePeak<T extends Matter> implements Comparable<ProfilePeak<T>> {
- T node;
- float probability;
+public record ProfilePeak<T extends Matter>(T node, float probability) implements Comparable<ProfilePeak<T>> {
- public ProfilePeak(T node, float probability) {
- this.node = node;
- this.probability = probability;
- }
-
- public T getNode() {
- return node;
- }
-
- public void setNode(T node) {
- this.node = node;
- }
-
- public float getProbability() {
- return probability;
- }
-
- public void setProbability(float probability) {
- this.probability = probability;
- }
+ public T getNode() { return node; }
+ public float getProbability() { return probability; }
+ @Override
public int compareTo(ProfilePeak<T> p) {
return node.compareTo(p.node);
}
-}
\ No newline at end of file
+}
diff --git a/src/main/java/edu/ucsd/msjava/msgf/ScoreDist.java b/src/main/java/edu/ucsd/msjava/msgf/ScoreDist.java
index 46fbd97a..4effc750 100644
--- a/src/main/java/edu/ucsd/msjava/msgf/ScoreDist.java
+++ b/src/main/java/edu/ucsd/msjava/msgf/ScoreDist.java
@@ -50,7 +50,6 @@ public double getSpectralProbability(int score) {
double specProb = 0;
int minIndex = (score >= minScore) ? score - minScore : 0;
for (int t = minIndex; t < probDistribution.length; t++) {
-// System.out.println("***********\t"+(t+minScore)+"\t"+probDistribution[t]);
specProb += probDistribution[t];
}
if (specProb > 1.)
@@ -112,21 +111,4 @@ public ScoreBound getPercentileRange(float percentile) {
return null;
}
-// // added by kyowon. Get a new ScoreDist instance. it has the same value as the original one from newMinScore to max score of the original ScoreDist
-// static public ScoreDist getTruncatedScoreDist(ScoreDist original, int newMinScore){
-// ScoreDistFactory factory = new ScoreDistFactory(original.isNumSet(), original.isProbSet());
-// ScoreDist newDist = factory.getInstance(Math.max(newMinScore, original.getMinScore()), original.getMaxScore());
-//
-// for(int score = newDist.getMinScore(); score specIterator;
- protected final NewAdditiveScorer scorer;
-
- // Optional parameters set by builders.
-
- protected float specProb = 1e-9f;
-
- protected boolean trypticOnly = true;
-
- // Tolerance
- protected Tolerance pmTolerance = new Tolerance(30, true);
- protected Tolerance fragTolerance = new Tolerance(30, true);
-
- protected float minParentMass = 400;
- protected float maxParentMass = 2000;
- protected int msgfScoreThreshold = 0;
-
- // Amino acid set, default: standard + Carbamidomethyl C
- protected AminoAcidSet aaSet;
-
- // output
- protected PrintStream out;
-
- /**
- * A constructor specifies spectral file name and database file name. Database must be "fasta" format.
- *
- * @param specIterator spectra iterator.
- * @param scorer a scorer object.
- */
- protected ToolLauncher(Iterator specIterator, NewAdditiveScorer scorer) {
- this.specIterator = specIterator;
- this.scorer = scorer;
- this.out = System.out;
- this.aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- }
-
- /**
- * A builder method to set spectral probability.
- *
- * @param specProb spectral probability
- * @return this object.
- */
- public ToolLauncher specProb(float specProb) {
- this.specProb = specProb;
- return this;
- }
-
- /**
- * If this method is called, non-tryptic peptides are generated.
- * Otherwise, only peptides ends with 'K' or 'R' are generated.
- *
- * @return this object.
- */
- public ToolLauncher allowNonTryptic() {
- this.trypticOnly = false;
- return this;
- }
-
-
- /**
- * Set parent mass tolerance.
- *
- * @param tolerance tolerance.
- * @return this object.
- */
- public ToolLauncher pmTolerance(Tolerance pmTolerance) {
- this.pmTolerance = pmTolerance;
- return this;
- }
-
- /**
- * Set fragment mass tolerance.
- *
- * @param tolerance tolerance.
- * @return this object.
- */
- public ToolLauncher fragTolerance(Tolerance fragTolerance) {
- this.fragTolerance = fragTolerance;
- return this;
- }
-
- /**
- * Set minimum parent mass.
- *
- * @param minParentMass minimum parent mass.
- * @return this object.
- */
- public ToolLauncher minParentMass(float minParentMass) {
- this.minParentMass = minParentMass;
- return this;
- }
-
- /**
- * Set maximum parent mass.
- *
- * @param maxParentMass maximum parent mass.
- * @return this object.
- */
- public ToolLauncher maxParentMass(float maxParentMass) {
- this.maxParentMass = maxParentMass;
- return this;
- }
-
- /**
- * Set max MSGF score threshold. Ignore all spectra whose best de novo scores are below thresholdScore.
- *
- * @param thresholdScore max MS-GF score threshold.
- * @return this object.
- */
- public ToolLauncher msgfScoreThreshold(int thresholdScore) {
- this.msgfScoreThreshold = thresholdScore;
- return this;
- }
-
- /**
- * Set the amino acid set.
- *
- * @param aaSet amino acid set.
- * @return this object.
- */
- public ToolLauncher aminoAcidSet(AminoAcidSet aaSet) {
- this.aaSet = aaSet;
- return this;
- }
-
- /**
- * Set the output.
- *
- * @param outputFileName output file name.
- * @return this object.
- */
- public ToolLauncher outputFileName(String outputFileName) {
- try {
- out = new PrintStream(new BufferedOutputStream(new FileOutputStream(outputFileName)));
- } catch (FileNotFoundException e) {
- e.printStackTrace();
- }
- return this;
- }
-}
diff --git a/src/main/java/edu/ucsd/msjava/mslibsearch/ProcessedSpectrum.java b/src/main/java/edu/ucsd/msjava/mslibsearch/ProcessedSpectrum.java
deleted file mode 100644
index adf9d1aa..00000000
--- a/src/main/java/edu/ucsd/msjava/mslibsearch/ProcessedSpectrum.java
+++ /dev/null
@@ -1,33 +0,0 @@
-package edu.ucsd.msjava.mslibsearch;
-
-import edu.ucsd.msjava.msutil.Spectrum;
-
-public class ProcessedSpectrum {
- private final Spectrum expSpec;
- private final Spectrum libSpec;
-
- public ProcessedSpectrum(Spectrum expSpec, Spectrum libSpec) {
- this.expSpec = expSpec;
- this.libSpec = libSpec;
- }
-
- public Spectrum getSpectrum() {
-// boolean[] expPeak = new boolean[NominalMass.toNominalMass(expSpec.getPrecursorMass())];
-// for(Peak p : libSpec)
-// {
-// int nominalMass = NominalMass.toNominalMass(p.getMz());
-// if(nominalMass >= 0 && nominalMass < expPeak.length)
-// expPeak[nominalMass] = true;
-// }
-//
-// Spectrum spec = expSpec.getCloneWithoutPeakList();
-// for(Peak p : expSpec)
-// {
-// int nominalMass = NominalMass.toNominalMass(p.getMz());
-// if(nominalMass >= 0 && nominalMass < expPeak.length && expPeak[NominalMass.toNominalMass(p.getMz())])
-// spec.add(p);
-// }
-// return spec;
- return expSpec;
- }
-}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/DBScanScorer.java b/src/main/java/edu/ucsd/msjava/msscorer/DBScanScorer.java
index 3760f9df..1390e882 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/DBScanScorer.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/DBScanScorer.java
@@ -55,9 +55,6 @@ public int getEdgeScore(NominalMass curNode, NominalMass prevNode, float theoMas
}
private int getEdgeScoreInt(int curNominalMass, int prevNominalMass, float theoMass) {
- // Debug
-// if(curNominalMass == 114 && prevNominalMass == 57)
-// System.out.println("Debug");
if (curNominalMass >= nodeMass.length || prevNominalMass >= nodeMass.length || curNominalMass < 0 || prevNominalMass < 0)
return 0;
int ionExistenceIndex = 0;
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/FastScorer.java b/src/main/java/edu/ucsd/msjava/msscorer/FastScorer.java
index da9380f3..6cc969e4 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/FastScorer.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/FastScorer.java
@@ -62,10 +62,6 @@ public int getScore(double[] prefixMassArr, int[] nominalPrefixMassArr, int from
for (int i = fromIndex; i < toIndex - 1; i++) {
int prefixMass = nominalPrefixMassArr[i];
int suffixMass = peptideMass - prefixMass;
-// if(prefixMass >= prefixScore.length || suffixMass >= suffixScore.length)
-// {
-// System.out.println("Debug");
-// }
int curScore;
try {
curScore = Math.round(prefixScore[prefixMass] + suffixScore[suffixMass]);
@@ -80,9 +76,6 @@ public int getScore(double[] prefixMassArr, int[] nominalPrefixMassArr, int from
}
public int getNodeScore(NominalMass prefixMass, NominalMass suffixMass) {
-// if(prefixMass.getNominalMass() >= prefixScore.length ||
-// suffixMass.getNominalMass() >= suffixScore.length)
-// System.out.println("Debug");
int preNormMass = prefixMass.getNominalMass();
int sufNormMass = suffixMass.getNominalMass();
if (preNormMass >= prefixScore.length || sufNormMass >= suffixScore.length || preNormMass < 0 || sufNormMass < 0)
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/IonProbability.java b/src/main/java/edu/ucsd/msjava/msscorer/IonProbability.java
index 0ab02834..5f442d34 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/IonProbability.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/IonProbability.java
@@ -100,8 +100,6 @@ public float[] getIonProb() {
}
if (spec.getPeakByMass(mz, tol) != null) {
numObservedPeaks[index]++;
-// if(ion.getName().equals("y2-H3PO4"))
-// System.out.println("Debug");
} else
numMissingPeaks[index]++;
}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/NewRankScorer.java b/src/main/java/edu/ucsd/msjava/msscorer/NewRankScorer.java
index 99dd0378..290c70d6 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/NewRankScorer.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/NewRankScorer.java
@@ -197,14 +197,7 @@ protected void readFromFile(File paramFile, boolean verbose) {
private void readFromInputStream(InputStream is, boolean verbose) {
DataInputStream in = new DataInputStream(is);
- // Read the date
try {
-// int year = in.readInt(); // version information
-// int month = in.readInt();
-// int date = in.readInt();
-// if(verbose)
-// System.out.println("CreationDate: " + year + "/" + (month+1) + "/" + date);
-
int version = in.readInt();
if (verbose)
System.out.println("Version: " + version);
@@ -245,11 +238,6 @@ private void readFromInputStream(InputStream is, boolean verbose) {
for (byte i = 0; i < lenProtocol; i++)
bufProtocol.append(in.readChar());
protocol = Protocol.get(bufProtocol.toString());
-// if(protocol == null)
-// {
-// System.out.println(bufProtocol.toString());
-// System.exit(-1);
-// }
} else
protocol = Protocol.AUTOMATIC;
@@ -420,8 +408,6 @@ private void readFromInputStream(InputStream is, boolean verbose) {
for (int i = 0; i < ionExTable.length; i++) {
ionExTable[i] = in.readFloat();
if (ionExTable[i] == 0) {
-// System.out.println("IonExTable: " + partition.getCharge() + " " + partition.getSegNum()
-// + " " + partition.getParentMass() + " " + ionExTable[i]);
ionExTable[i] = 0.001f;
}
assert (ionExTable[i] > 0);
@@ -776,9 +762,6 @@ public void writeParameters(File outputFile) {
// Rank distributions
out.writeInt(maxRank);
for (Partition partition : partitionSet) {
-// if(partition.getParentMass() > 4100 && partition.getCharge() == 5 && partition.getSegNum() == 1)
-// System.out.println("Debug");
-
HashMap rankDistTable = getRankDistTable(partition);
if (rankDistTable == null)
continue;
@@ -798,10 +781,6 @@ public void writeParameters(File outputFile) {
}
// Error distribution
-// protected int errorScalingFactor = 0; // if 0, don't user errors, 10 for low accuracy, 100 for high accuracy
-// protected Hashtable ionErrDistTable = null;
-// protected Hashtable noiseErrDistTable = null;
-// protected Hashtable ionExistenceTable = null;
out.writeInt(errorScalingFactor);
if (errorScalingFactor > 0) {
for (Partition partition : partitionSet) {
@@ -946,11 +925,4 @@ public void writeParametersPlainText(File outputFile) {
out.close();
}
- public static void main(String argv[]) throws Exception {
- readWriteTest();
-// paramTest();
- }
-
- public static void readWriteTest() throws Exception {
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/NewScoredSpectrum.java b/src/main/java/edu/ucsd/msjava/msscorer/NewScoredSpectrum.java
index 195122c3..56c1a653 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/NewScoredSpectrum.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/NewScoredSpectrum.java
@@ -28,7 +28,6 @@ public NewScoredSpectrum(Spectrum spec, NewRankScorer scorer) {
this.mme = scorer.mme;
this.precursor = spec.getPrecursorPeak().clone();
this.activationMethodArr = new ActivationMethod[1];
-// activationMethodArr[0] = scorer.getActivationMethod();
if (spec.getActivationMethod() != null)
activationMethodArr[0] = spec.getActivationMethod();
else
@@ -93,16 +92,6 @@ public int getEdgeScore(T curNode, T prevNode, float theoMass) {
float edgeScore = scorer.getIonExistenceScore(partition, ionExistenceIndex, probPeak);
if (ionExistenceIndex == 3)
edgeScore += scorer.getErrorScore(partition, curNodeMass - prevNodeMass - theoMass);
-
-// // debug
-// if(edgeScore < -1000 || edgeScore > 1000)
-// {
-// System.out.println("Error! EdgeScore = " + edgeScore);
-// System.out.println("Spectrum ScanNum: " + spec.getScanNum());
-// System.out.println("Partition: " + partition.getCharge() + " " + partition.getSegNum() + " " + partition.getParentMass());
-// System.out.println("IonExistence: " + scorer.getIonExistenceScore(partition, ionExistenceIndex, probPeak));
-// System.out.println("Error: " + scorer.getErrorScore(partition, curNodeMass-prevNodeMass-theoMass));
-// }
return Math.round(edgeScore);
}
@@ -126,12 +115,7 @@ public boolean getMainIonDirection() {
return mainIon.isPrefixIon();
}
- /**
- * returns the corrected mass of the node based on the peak observed in the spectrum
- *
- * @param node
- * @return corrected mass of the node if peak exists, null -1
- */
+ /** Returns the corrected m/z from the observed peak, or -1 if no peak was found. */
public float getNodeMass(T node) {
if (node.getNominalMass() == 0)
return 0;
@@ -215,9 +199,6 @@ public float getExplainedIonCurrent(float residueMass, boolean isPrefix, Toleran
public Pair<Float, Float> getMassErrorWithIntensity(float residueMass, boolean isPrefix, Tolerance fragmentTolerance) {
Float error = null;
float maxIntensity = 0;
-// IonType bestIon = null;
-// Peak bestPeak = null;
-// float bestTheoMass = 0;
for (int segIndex = 0; segIndex < scorer.getNumSegments(); segIndex++) {
for (IonType ion : ionTypes[segIndex]) {
@@ -246,18 +227,10 @@ public Pair getMassErrorWithIntensity(float residueMass, boolean i
if (p != null) // peak exists
{
float err = (p.getMz() - theoMass) / theoMass * 1e6f;
-// float err = p.getMz() - theoMass;
-// if(err < 0)
-// err = -err;
float intensity = p.getIntensity();
- // Debug
-// System.out.println(residueMass + " " + ion.getName() + " " + err + " " + intensity);
if (intensity > maxIntensity) {
error = err;
maxIntensity = intensity;
-// bestIon = ion;
-// bestPeak = p;
-// bestTheoMass = theoMass;
}
}
}
@@ -265,9 +238,6 @@ public Pair getMassErrorWithIntensity(float residueMass, boolean i
if (error == null)
return null;
else {
-// // Debug
-// System.out.println("*\t" + residueMass + "\t" + bestIon.getName() + "\t" + error + "\t" + bestPeak.getRank()
-// + "\t" + bestPeak.getMz() + "\t" + bestPeak.getIntensity() + "\t" + bestTheoMass);
return new Pair<Float, Float>(error, maxIntensity);
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/NewScorerFactory.java b/src/main/java/edu/ucsd/msjava/msscorer/NewScorerFactory.java
index 38ce7ba4..094fc60c 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/NewScorerFactory.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/NewScorerFactory.java
@@ -171,28 +171,4 @@ else if (!method.isElectronBased() && enzyme.isNTerm())
return scorer;
}
- public static void main(String argv[]) {
- for (ActivationMethod method : ActivationMethod.getAllRegisteredActivationMethods()) {
- if (method == ActivationMethod.FUSION || method == ActivationMethod.ASWRITTEN)
- continue;
- for (InstrumentType inst : InstrumentType.getAllRegisteredInstrumentTypes()) {
- for (Enzyme enzyme : Enzyme.getAllRegisteredEnzymes()) {
- for (Protocol protocol : Protocol.getAllRegisteredProtocols()) {
-// if(method == ActivationMethod.HCD && inst == InstrumentType.QEXACTIVE && enzyme == Enzyme.UnspecificCleavage && protocol == Protocol.NOPROTOCOL)
-// {
-// System.out.println("Debug");
-// }
- NewRankScorer scorer = NewScorerFactory.get(method, inst, enzyme, protocol);
- System.out.print(method.getName() + "_" + inst.getName() + "_" + enzyme.getName() + "_" + protocol.getName() + " -> ");
- if (scorer != null) {
- System.out.println(scorer.getSpecDataType());
- } else {
- System.err.println("Null!");
- System.exit(-1);
- }
- }
- }
- }
- }
- }
}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/PrecursorOffsetFrequency.java b/src/main/java/edu/ucsd/msjava/msscorer/PrecursorOffsetFrequency.java
index 51633a14..7d55c708 100644
--- a/src/main/java/edu/ucsd/msjava/msscorer/PrecursorOffsetFrequency.java
+++ b/src/main/java/edu/ucsd/msjava/msscorer/PrecursorOffsetFrequency.java
@@ -84,8 +84,6 @@ else if (offList.size() == 0)
float tolDa = granularity / 2 * (offList.size() - clusterStartIndex);
clusteredOFF.add(new PrecursorOffsetFrequency(reducedCharge, offset, clusterFreq).tolerance(new Tolerance(tolDa)));
-// for(PrecursorOffsetFrequency off : clusteredOFF)
-// System.out.println(off.getReducedCharge()+"\t"+off.getOffset()+"\t"+off.getFrequency()+"\t"+off.getTolerance().toString());
return clusteredOFF;
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java
deleted file mode 100644
index 1c570c37..00000000
--- a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGenerator.java
+++ /dev/null
@@ -1,733 +0,0 @@
-package edu.ucsd.msjava.msscorer;
-
-import edu.ucsd.msjava.msgf.Histogram;
-import edu.ucsd.msjava.msgf.NominalMass;
-import edu.ucsd.msjava.msgf.Tolerance;
-import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
-import edu.ucsd.msjava.msutil.*;
-import edu.ucsd.msjava.parser.MgfSpectrumParser;
-
-import java.io.File;
-import java.util.*;
-
-/**
- * This only supports low accuracy fragment ions.
- *
- * @author sangtaekim
- */
-public class ScoringParameterGenerator extends NewRankScorer {
- private static final float MIN_OFFSET_MASS = -120; // for ion types
- private static final float MAX_OFFSET_MASS = 38;
- private static final float MIN_PRECURSOR_OFFSET = -300; // for precursors
- private static final float MAX_PRECURSOR_OFFSET = 30;
- private static final int MIN_NUM_SPECTRA_PER_PARTITION = 400; // 400
- private static final int MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF = 150;
-
- private static final float MIN_PRECURSOR_OFFSET_PROBABILITY = 0.15f; // 0.15
- private static final float MIN_ION_OFFSET_PROBABILITY = 0.15f; // 0.15, for ion types
- private static final int MAX_RANK = 150;
- private static final int NUM_SEGMENTS_PER_SPECTRUM = 2; // 2
-
-
- private static final int[] smoothingRanks = {3, 5, 10, 20, 50, Integer.MAX_VALUE}; //Ranks around which smoothing occurs
- private static final int[] smoothingWindowSize = {0, 1, 2, 3, 4, 5}; //Smoothing windows for each smoothing rank
-
- private static final int NUM_NOISE_IONS = 10;
- protected static final int MAX_CHARGE = 20;
-
- public static void main(String argv[]) {
- File specFile = null;
- File outputFile = null;
- boolean isText = false;
- AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- int numSpecsPerPeptide = 1;
- int errorScalingFactor = 10;
-
- // Fragmentation method
- ActivationMethod activationMethod = null;
- InstrumentType instType = null;
- Enzyme enzyme = null;
-
- for (int i = 0; i < argv.length; i += 2) {
- if (!argv[i].startsWith("-") || i + 1 >= argv.length)
- printUsageAndExit("Invalid parameter!");
- if (argv[i].equalsIgnoreCase("-i")) {
- specFile = new File(argv[i + 1]);
- if (!specFile.exists()) {
- printUsageAndExit(argv[i + 1] + " doesn't exist.");
- }
- int posDot = specFile.getName().lastIndexOf('.');
- if (posDot >= 0) {
- String extension = specFile.getName().substring(posDot);
- if (!extension.equalsIgnoreCase(".mgf"))
- printUsageAndExit("Invalid spectrum format: " + argv[i + 1]);
- } else
- printUsageAndExit("Invalid spectrum format: " + argv[i + 1]);
- } else if (argv[i].equalsIgnoreCase("-o")) {
- outputFile = new File(argv[i + 1]);
- } else if (argv[i].equalsIgnoreCase("-t")) {
- outputFile = new File(argv[i + 1]);
- isText = true;
- } else if (argv[i].equalsIgnoreCase("-fixMod")) {
- // 0: No mod, 1: Carbamidomethyl C, 2: Carboxymethyl C
- if (argv[i + 1].equalsIgnoreCase("0"))
- aaSet = AminoAcidSet.getStandardAminoAcidSet();
- else if (argv[i + 1].equalsIgnoreCase("1"))
- aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- else if (argv[i + 1].equalsIgnoreCase("2"))
- aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarboxymethylatedCys();
- else
- printUsageAndExit("Invalid -fixMod parameter: " + argv[i + 1]);
- } else if (argv[i].equalsIgnoreCase("-pep")) {
- numSpecsPerPeptide = Integer.parseInt(argv[i + 1]);
- } else if (argv[i].equalsIgnoreCase("-err")) {
- errorScalingFactor = Integer.parseInt(argv[i + 1]);
- } else if (argv[i].equalsIgnoreCase("-m")) // Fragmentation method
- {
- // (0: written in the spectrum, 1: CID , 2: ETD, 3: HCD, 4: UVPD)
- if (argv[i + 1].equalsIgnoreCase("1")) {
- activationMethod = ActivationMethod.CID;
- } else if (argv[i + 1].equalsIgnoreCase("2")) {
- activationMethod = ActivationMethod.ETD;
- } else if (argv[i + 1].equalsIgnoreCase("3")) {
- activationMethod = ActivationMethod.HCD;
- } else if (argv[i + 1].equalsIgnoreCase("4")) {
- activationMethod = ActivationMethod.UVPD;
- } else {
- printUsageAndExit("Invalid activation method: " + argv[i + 1]);
- }
- } else if (argv[i].equalsIgnoreCase("-inst")) // Instrument type
- {
- if (argv[i + 1].equalsIgnoreCase("0")) {
- instType = InstrumentType.LOW_RESOLUTION_LTQ;
- } else if (argv[i + 1].equalsIgnoreCase("1")) {
- instType = InstrumentType.TOF;
- } else if (argv[i + 1].equalsIgnoreCase("2")) {
- instType = InstrumentType.HIGH_RESOLUTION_LTQ;
- } else {
- printUsageAndExit("Invalid instrument type: " + argv[i + 1]);
- }
- } else if (argv[i].equalsIgnoreCase("-e")) // Enzyme
- {
- // 0: No enzyme, 1: Trypsin, 2: Chymotrypsin, 3: LysC, 4: LysN, 5: GluC, 6: ArgC, 7: AspN
- if (argv[i + 1].equalsIgnoreCase("0"))
- enzyme = null;
- else if (argv[i + 1].equalsIgnoreCase("1"))
- enzyme = Enzyme.TRYPSIN;
- else if (argv[i + 1].equalsIgnoreCase("2"))
- enzyme = Enzyme.CHYMOTRYPSIN;
- else if (argv[i + 1].equalsIgnoreCase("3"))
- enzyme = Enzyme.LysC;
- else if (argv[i + 1].equalsIgnoreCase("4"))
- enzyme = Enzyme.LysN;
- else if (argv[i + 1].equalsIgnoreCase("5"))
- enzyme = Enzyme.GluC;
- else if (argv[i + 1].equalsIgnoreCase("6"))
- enzyme = Enzyme.ArgC;
- else if (argv[i + 1].equalsIgnoreCase("7"))
- enzyme = Enzyme.AspN;
- else
- printUsageAndExit("Invalid enzyme: " + argv[i + 1]);
- } else
- printUsageAndExit("Invalid parameters!");
- }
- if (specFile == null)
- printUsageAndExit("missing annotatedMgfFileName!");
- if (outputFile == null)
- printUsageAndExit("missing outputFileName!");
- if (activationMethod == null)
- printUsageAndExit("missing activationMethod!");
- if (instType == null)
- printUsageAndExit("missing instrumentType!");
-
- generateParameters(specFile, activationMethod, instType, enzyme, Protocol.AUTOMATIC, numSpecsPerPeptide, errorScalingFactor, outputFile, aaSet, isText, false);
- }
-
- public static void printUsageAndExit(String message) {
- System.err.println(message);
- System.out.println("usage: java -Xmx2000M -cp MSGF.jar msscorer.ScoringParameterGenerator\n" +
- "\t-i annotatedMgfFileName (*.mgf)\n" +
- "\t-o outputFileName (e.g. CID_Tryp.param)\n" +
- "\t-m FragmentationMethodID (1: CID, 2: ETD, 3: HCD, 4: UVPD)\n" +
- "\t-inst InstrumentID (0: Low-res LCQ/LTQ, 1: TOF , 2: High-res LTQ)\n" +
- "\t-e EnzymeID (0: No enzyme, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N)\n" +
- "\t[-fixMod 0/1/2] (0: NoCysteineProtection, 1: CarbamidomethyC (default), 2: CarboxymethylC)\n" +
- "\t[-pep numPeptidesPerSpec] (default: 1)\n" +
- "\t[-err errorScalingFactor] (default: 10)"
- );
- System.exit(0);
- }
-
- public static void generateParameters(
- File specFile,
- ActivationMethod activationMethod,
- InstrumentType instType,
- Enzyme enzyme,
- Protocol protocol,
- int numSpecsPerPeptide,
- int errorScalingFactor,
- File outputFile,
- AminoAcidSet aaSet,
- boolean isText,
- boolean verbose) {
- SpectraContainer container = new SpectraContainer(specFile.getPath(), new MgfSpectrumParser().aaSet(aaSet));
-
- // multiple spectra with the same peptide -> one spec per peptide
- HashMap<String, ArrayList<Spectrum>> pepSpecMap = new HashMap<String, ArrayList<Spectrum>>();
- SpectraContainer specContOnePerPep = new SpectraContainer();
- for (Spectrum spec : container) {
- String pep = spec.getAnnotationStr() + ":" + spec.getCharge();
- if (pep != null && pep.length() > 0) {
- ArrayList<Spectrum> specList = pepSpecMap.get(pep);
- if (specList == null) {
- specList = new ArrayList<Spectrum>();
- pepSpecMap.put(pep, specList);
- }
- if (specList.size() < numSpecsPerPeptide)
- specList.add(spec);
- }
- }
- for (ArrayList<Spectrum> specList : pepSpecMap.values())
- for (Spectrum spec : specList)
- specContOnePerPep.add(spec);
-
- SpecDataType dataType = new SpecDataType(activationMethod, instType, enzyme, protocol);
- ScoringParameterGenerator gen = new ScoringParameterGenerator(specContOnePerPep, dataType);
-
- // set up the tolerance
- gen.tolerance(new Tolerance(1 / Constants.INTEGER_MASS_SCALER / 2));
-
- // Step 1: partition spectra
- gen.partition(NUM_SEGMENTS_PER_SPECTRUM);
- if (verbose)
- System.out.println("Partition: " + gen.partitionSet.size());
-
- // Step 2: compute offset frequency functions of precursor peaks and their neutral losses
- gen.precursorOFF(MIN_PRECURSOR_OFFSET_PROBABILITY);
- if (verbose)
- System.out.println("PrecursorOFF Done.");
-
- // Step 3: filter out "significant" precursor offsets
- gen.filterPrecursorPeaks();
- if (verbose)
- System.out.println("Filtering Done.");
-
- // Step 4: compute offset frequency fnction of fragment peaks and determine ion types to be considered for scoring
- gen.selectIonTypes(MIN_ION_OFFSET_PROBABILITY);
- if (verbose)
- System.out.println("Ion types selected.");
-
- // Step 5: compute rank distributions
- gen.generateRankDist(MAX_RANK);
- if (verbose)
- System.out.println("Rank distribution computed.");
-
- // Step 6 (optional): generate error distribution, currently not in use
-
- // Step 7: smoothing parameters
- gen.smoothing();
- if (verbose)
- System.out.println("Smoothing complete.");
-
- // output
- if (!isText)
- gen.writeParameters(outputFile);
- else
- gen.writeParametersPlainText(outputFile);
-
- if (verbose)
- System.out.println("Writing Done.");
- }
-
- // Required
- private SpectraContainer specContainer;
-
- public ScoringParameterGenerator(SpectraContainer specContainer, SpecDataType dataType) {
- this.specContainer = specContainer;
- super.dataType = dataType;
- }
-
- public void partition(int numSegments) {
- super.numSegments = numSegments;
- chargeHist = new Histogram<Integer>();
- partitionSet = new TreeSet<Partition>();
-
- HashMap<Integer, ArrayList<Float>> parentMassMap = new HashMap<Integer, ArrayList<Float>>();
- for (Spectrum spec : specContainer) {
- int charge = spec.getCharge();
- if (charge <= 0)
- continue;
- chargeHist.add(charge);
- if (spec.getAnnotation() != null) {
- ArrayList<Float> precursorList = parentMassMap.get(charge);
- if (precursorList == null) {
- precursorList = new ArrayList<Float>();
- parentMassMap.put(charge, precursorList);
- }
- precursorList.add(spec.getPrecursorMass());
- }
- }
-
- for (int c = chargeHist.minKey(); c <= chargeHist.maxKey(); c++) {
- ArrayList<Float> parentMassList = parentMassMap.get(c);
- if (parentMassList == null)
- continue;
-
- int numSpec = parentMassList.size();
- if (numSpec < Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f)) // to few spectra
- continue;
-
- Collections.sort(parentMassList);
- int bestSetSize = 0;
- int smallestRemainder = MIN_NUM_SPECTRA_PER_PARTITION;
- for (int i = Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f); i <= Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 1.1f); i++) {
- int remainder = numSpec % i;
- if (i - remainder < remainder)
- remainder = i - remainder;
- if (remainder < smallestRemainder || (remainder == smallestRemainder && Math.abs(MIN_NUM_SPECTRA_PER_PARTITION - i) < Math.abs(MIN_NUM_SPECTRA_PER_PARTITION - bestSetSize))) {
- bestSetSize = i;
- smallestRemainder = remainder;
- }
- }
- int num = 0;
- for (int i = 0; i == 0 || i < Math.round(numSpec / (float) bestSetSize); i++) {
- if (num != 0) {
- for (int seg = 0; seg < numSegments; seg++)
- partitionSet.add(new Partition(c, parentMassList.get(num), seg));
- } else {
- for (int seg = 0; seg < numSegments; seg++)
- partitionSet.add(new Partition(c, 0f, seg));
- }
- num += bestSetSize;
- }
- }
- }
-
- private void precursorOFF(float minProbThreshold) {
- if (chargeHist == null) {
- assert (false) : "partition() must have been called before";
- return;
- }
- precursorOFFMap = new TreeMap<Integer, ArrayList<PrecursorOffsetFrequency>>();
- numPrecurOFF = 0;
-
- for (int charge = chargeHist.minKey(); charge <= chargeHist.maxKey(); charge++) {
- if (chargeHist.get(charge) < MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF)
- continue;
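- ArrayList<PrecursorOffsetFrequency> precursorOffsetList = new ArrayList<PrecursorOffsetFrequency>();
- int numSpecs = 0;
- HashMap<Integer, Histogram<Integer>> histList = new HashMap<Integer, Histogram<Integer>>();
- for (int c = charge; c >= 2; c--)
- histList.put(c, new Histogram<Integer>());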
-
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
- numSpecs++;
- spec = filter.apply(spec);
- float precursorNeutralMass = spec.getPrecursorMass();
- for (int c = charge; c >= 2; c--) {
- float precursorMz = (precursorNeutralMass + c * (float) Composition.ChargeCarrierMass()) / c;
- ArrayList<Peak> peakList = spec.getPeakListByMassRange(
- precursorMz + MIN_PRECURSOR_OFFSET / (float) c - mme.getToleranceAsDa(precursorMz + MIN_PRECURSOR_OFFSET / (float) c) / 2,
- precursorMz + MAX_PRECURSOR_OFFSET / (float) c + mme.getToleranceAsDa(precursorMz + MAX_PRECURSOR_OFFSET / (float) c) / 2);
-
- int prevMassIndexDiff = Integer.MIN_VALUE;
- for (Peak p : peakList) {
- float peakMass = p.getMz();
- int massIndexDiff = NominalMass.toNominalMass(peakMass - precursorMz);
- if (massIndexDiff > prevMassIndexDiff) {
- histList.get(c).add(massIndexDiff);
- prevMassIndexDiff = massIndexDiff;
- }
- }
- }
- }
-
- for (int c = charge; c >= 2; c--) {
- ArrayList<Integer> keyList = new ArrayList<Integer>(histList.get(c).keySet());
- Collections.sort(keyList);
- for (Integer key : keyList) {
- float prob = (histList.get(c).get(key)) / (float) numSpecs;
- if (prob > minProbThreshold) {
- precursorOffsetList.add(new PrecursorOffsetFrequency((charge - c), NominalMass.getMassFromNominalMass(key), prob));
- }
- }
- }
- precursorOFFMap.put(charge, precursorOffsetList);
- numPrecurOFF += precursorOffsetList.size();
- }
- }
-
- private void filterPrecursorPeaks() {
- if (this.precursorOFFMap == null)
- return;
- for (Spectrum spec : specContainer) {
- for (PrecursorOffsetFrequency off : this.getPrecursorOFF(spec.getCharge()))
- spec.filterPrecursorPeaks(mme, off.getReducedCharge(), off.getOffset());
- }
- }
-
- private Pair<Float, Float> getPrecursorMassRange(Partition partition) {
- float minParentMass = partition.getParentMass();
- float maxParentMass = Float.MAX_VALUE;
- Partition higherPartition = partitionSet.higher(partition);
- if (higherPartition != null) {
- if (higherPartition.getCharge() == partition.getCharge() && higherPartition.getSegNum() == partition.getSegNum()) {
- maxParentMass = higherPartition.getParentMass();
- }
- }
- return new Pair<Float, Float>(minParentMass, maxParentMass);
- }
-
- private void selectIonTypes(float minProbThreshold) {
- if (partitionSet == null) {
- assert (false) : "partition() must have been called before!";
- return;
- }
-
- fragOFFTable = new HashMap<Partition, ArrayList<FragmentOffsetFrequency>>();
- insignificantFragOFFTable = new HashMap<Partition, ArrayList<FragmentOffsetFrequency>>();
-
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- // parent mass range check
- Pair<Float, Float> parentMassRange = getPrecursorMassRange(partition);
- int seg = partition.getSegNum();
-
- int numSpec = 0;
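- HashMap<Integer, Histogram<Integer>> prefixIonFreq = new HashMap<Integer, Histogram<Integer>>();
- HashMap<Integer, Histogram<Integer>> suffixIonFreq = new HashMap<Integer, Histogram<Integer>>();
- for (int c = 1; c <= charge; c++) {
- prefixIonFreq.put(c, new Histogram<Integer>());
- suffixIonFreq.put(c, new Histogram<Integer>());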
- }
-
- int numCleavages = 0;
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
-
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- Peptide annotation = spec.getAnnotation();
- numCleavages += annotation.size() - 1;
- numSpec++;
- spec = filter.apply(spec);
-
- for (int c = 1; c <= charge; c++) {
- for (int direction = 0; direction < 2; direction++) {
- double accurateMass = 0;
- HashMap<Integer, Histogram<Integer>> ionFreq = null;
- for (int i = 0; i < annotation.size() - 1; i++) {
- if (direction == 0) {
- accurateMass += annotation.get(i).getAccurateMass();
- ionFreq = prefixIonFreq;
- } else if (direction == 1) {
- accurateMass += annotation.get(annotation.size() - 1 - i).getAccurateMass();
- ionFreq = suffixIonFreq;
- }
- float mass = (float) (accurateMass / c);
- ArrayList<Peak> peakList = spec.getPeakListByMassRange(
- mass + MIN_OFFSET_MASS / (float) c - mme.getToleranceAsDa(mass),
- mass + MAX_OFFSET_MASS / (float) c + mme.getToleranceAsDa(mass));
- int prevIntOffset = Integer.MIN_VALUE;
- for (Peak p : peakList) {
- float peakMz = p.getMz();
- int segNum = getSegmentNum(peakMz, curParentMass);
- if (segNum != seg)
- continue;
- float offset = peakMz - mass;
- int intOffset = NominalMass.toNominalMass(offset);
- if (intOffset > prevIntOffset) {
- ionFreq.get(c).add(intOffset);
- prevIntOffset = intOffset;
- }
- }
- }
- }
- }
- }
-
- float maxProb = 0;
- int maxCharge = 0;
- int maxDirection = 0;
- float maxOffset = 0;
-
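- ArrayList<FragmentOffsetFrequency> fragmentOffsetFrequencyList = new ArrayList<FragmentOffsetFrequency>();
- ArrayList<FragmentOffsetFrequency> insignificantFragmentOffsetFrequencyList = new ArrayList<FragmentOffsetFrequency>();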
- for (int c = 1; c <= charge; c++) {
- for (int direction = 0; direction < 2; direction++) {
- ArrayList<Integer> keyList;
- if (direction == 0)
- keyList = new ArrayList<Integer>(prefixIonFreq.get(c).keySet());
- else
- keyList = new ArrayList<Integer>(suffixIonFreq.get(c).keySet());
-
- Collections.sort(keyList);
- for (Integer key : keyList) {
- float offset = NominalMass.getMassFromNominalMass(key);
- int freq;
- if (direction == 0)
- freq = prefixIonFreq.get(c).get(key);
- else
- freq = suffixIonFreq.get(c).get(key);
- float prob = freq / (float) numCleavages * numSegments;
- if (prob > maxProb) {
- maxProb = prob;
- maxCharge = c;
- maxDirection = direction;
- maxOffset = offset;
- }
- if (prob > minProbThreshold) {
- if (direction == 0)
- fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(c, offset), prob));
- else
- fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(c, offset), prob));
- } else {
- if (direction == 0)
- insignificantFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(c, offset), prob));
- else
- insignificantFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(c, offset), prob));
- }
- }
- }
- }
-
- if (fragmentOffsetFrequencyList.size() == 0) {
- if (maxDirection == 0)
- fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.PrefixIon(maxCharge, maxOffset), maxProb));
- else
- fragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(new IonType.SuffixIon(maxCharge, maxOffset), maxProb));
- }
-
- Collections.sort(insignificantFragmentOffsetFrequencyList);
- ArrayList<FragmentOffsetFrequency> noiseOffsetFrequencyList = new ArrayList<FragmentOffsetFrequency>(NUM_NOISE_IONS);
-
- int numNoise = 0;
- for (FragmentOffsetFrequency off : insignificantFragmentOffsetFrequencyList) {
- if (off.getIonType().getCharge() == 1)
- noiseOffsetFrequencyList.add(off);
- if (++numNoise >= NUM_NOISE_IONS)
- break;
- }
- Collections.sort(fragmentOffsetFrequencyList, Collections.reverseOrder());
- fragOFFTable.put(partition, fragmentOffsetFrequencyList);
- insignificantFragOFFTable.put(partition, noiseOffsetFrequencyList);
- }
- }
-
- private void generateRankDist(int maxRank) {
- if (partitionSet == null) {
- assert (false) : "partition() must have been called!";
- return;
- }
-
- rankDistTable = new HashMap<Partition, HashMap<IonType, Float[]>>();
- this.maxRank = maxRank;
-
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- IonType[] ionTypes = getIonTypes(partition);
- if (ionTypes == null || ionTypes.length == 0)
- continue;
- Pair<Float, Float> parentMassRange = getPrecursorMassRange(partition);
- int seg = partition.getSegNum();
-
- int numSpec = 0;
- HashMap<IonType, Histogram<Integer>> rankDist = new HashMap<IonType, Histogram<Integer>>();
- HashMap<IonType, Float> rankDistMaxRank = new HashMap<IonType, Float>();
- HashMap<IonType, Float> rankDistUnexplained = new HashMap<IonType, Float>();
-
- for (IonType ion : ionTypes) {
- rankDist.put(ion, new Histogram<Integer>());
- rankDistMaxRank.put(ion, 0f);
- rankDistUnexplained.put(ion, 0f);
- }
- rankDist.put(IonType.NOISE, new Histogram<Integer>());
-
- float[] noiseDist = new float[maxRank + 2];
- int numMaxRankPeaks = 0;
- int totalCleavageSites = 0;
-
- for (Spectrum spec : specContainer) {
- int numExplainedPeaks = 0;
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- Peptide annotation = spec.getAnnotation();
- spec.setRanksOfPeaks();
- numSpec++;
- numMaxRankPeaks += spec.size() - maxRank + 1;
- totalCleavageSites += annotation.size() - 1;
- int prmMassIndex = 0;
- int srmMassIndex = 0;
-
- HashSet<Peak> explainedPeakSet = new HashSet<Peak>();
- HashMap<IonType, Integer> numExplainedMaxRankPeaks = new HashMap<IonType, Integer>();
- for (IonType ion : ionTypes) {
- numExplainedMaxRankPeaks.put(ion, 0);
- }
-
- int numSignalBinsAtThisSegment = 0;
- for (int i = 0; i < annotation.size() - 1; i++) {
- prmMassIndex += NominalMass.toNominalMass(annotation.get(i).getMass());
- srmMassIndex += NominalMass.toNominalMass(annotation.get(annotation.size() - 1 - i).getMass());
-
- float prm = NominalMass.getMassFromNominalMass(prmMassIndex);
- float srm = NominalMass.getMassFromNominalMass(srmMassIndex);
- for (IonType ion : ionTypes) {
- float theoMass;
- if (ion instanceof IonType.PrefixIon)
- theoMass = ion.getMz(prm);
- else
- theoMass = ion.getMz(srm);
-
- int segNum = super.getSegmentNum(theoMass, curParentMass);
- if (segNum == seg) {
- numSignalBinsAtThisSegment++;
- Peak p = spec.getPeakByMass(theoMass, mme);
- if (p != null) {
- numExplainedPeaks++;
- int rank = p.getRank();
- if (rank >= maxRank) {
- rank = maxRank;
- numExplainedMaxRankPeaks.put(ion, numExplainedMaxRankPeaks.get(ion) + 1);
- }
- explainedPeakSet.add(p);
- rankDist.get(ion).add(rank);
- } else {
- rankDist.get(ion).add(maxRank + 1); // maxRank+1: missing ion
- }
- }
- }
- }
-
- ArrayList<Peak> unexplainedPeaksAtThisSegment = new ArrayList<Peak>();
- int numPeaksAtThisSegment = 0;
- int numMaxRankPeaksAtThisSegment = 0;
- for (Peak p : spec) {
- if (super.getSegmentNum(p.getMz(), curParentMass) == seg) {
- numPeaksAtThisSegment++;
- if (p.getRank() >= maxRank)
- numMaxRankPeaksAtThisSegment++;
- if (!explainedPeakSet.contains(p))
- unexplainedPeaksAtThisSegment.add(p);
- }
- }
-
- float midMassThisSegment = (1f / numSegments * seg + 1f / numSegments / 2) * annotation.getParentMass();
- float numBinsAtThisSegment = annotation.getParentMass() / numSegments / mme.getToleranceAsDa(midMassThisSegment) / 2;
-
- for (Peak p : unexplainedPeaksAtThisSegment) {
- int rank = p.getRank();
-// float noiseFreq = (float)(annotation.size()-1)/(annotation.getParentMass()/(mme.getToleranceAsDa(midMassThisSegment)*2));
- float noiseFreq = (annotation.size() - 1) / numSegments / numBinsAtThisSegment;
- if (rank >= maxRank)
- noiseDist[maxRank] += noiseFreq / numMaxRankPeaksAtThisSegment;
- else
- noiseDist[rank] += noiseFreq;
- }
-
- for (IonType ion : ionTypes) {
- if (numMaxRankPeaksAtThisSegment > 0) {
- Float prevSumFreq = rankDistMaxRank.get(ion);
- float curFreq = numExplainedMaxRankPeaks.get(ion) / (float) numMaxRankPeaksAtThisSegment;
- rankDistMaxRank.put(ion, prevSumFreq + curFreq);
- }
- }
-
- noiseDist[maxRank + 1] += (numBinsAtThisSegment - numPeaksAtThisSegment) * (annotation.size() - 1) / numSegments / numBinsAtThisSegment;
- }
-
- HashMap<IonType, Float[]> freqDist = new HashMap<IonType, Float[]>();
- for (IonType ion : ionTypes) {
- Float[] dist = new Float[maxRank + 1];
- Histogram<Integer> hist = rankDist.get(ion);
- for (int i = 1; i <= maxRank - 1; i++) {
- Integer num = hist.get(i);
- dist[i - 1] = (num / (float) numSpec);
- }
- dist[maxRank - 1] = rankDistMaxRank.get(ion) / numSpec;
- dist[maxRank] = hist.get(maxRank + 1) / (float) numSpec;
- freqDist.put(ion, dist);
- }
-
- // noise
- Float[] dist = new Float[maxRank + 1];
- for (int i = 1; i <= maxRank + 1; i++)
- dist[i - 1] = noiseDist[i] / numSpec;
- freqDist.put(IonType.NOISE, dist);
-
- rankDistTable.put(partition, freqDist);
- }
- }
-
- protected void smoothing() {
- smoothingRankDistTable();
- }
-
- protected void smoothingRankDistTable() {
- if (rankDistTable == null)
- return;
- assert (smoothingRanks.length == smoothingWindowSize.length);
- for (Partition partition : rankDistTable.keySet()) {
- HashMap<IonType, Float[]> table = this.rankDistTable.get(partition);
- for (IonType ion : table.keySet()) {
- Float[] freq = table.get(ion);
- Float[] smoothedFreq = new Float[freq.length];
- int smoothingIndex = 0;
- for (int i = 0; i < freq.length - 2; i++) // last 2 columns: maxRank, unexplained
- {
- if (smoothingIndex < smoothingRanks.length - 1 &&
- i == smoothingRanks[smoothingIndex])
- smoothingIndex++;
- int windowSize = smoothingWindowSize[smoothingIndex];
- float sumFrequencies = 0;
- int numIndicesSummed = 0;
- for (int d = -windowSize; d <= windowSize; d++) {
- int index = i + d;
- if (index < 0 || index > freq.length - 3)
- continue;
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- while (sumFrequencies == 0 && windowSize < freq.length - 4) {
- windowSize++;
- int index = i - windowSize;
- if (index >= 0) {
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- index = i + windowSize;
- if (index <= freq.length - 3) {
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- }
- if (sumFrequencies != 0)
- smoothedFreq[i] = sumFrequencies / numIndicesSummed;
- else
- assert (false);
- }
- for (int i = 0; i < freq.length - 2; i++)
- freq[i] = smoothedFreq[i];
- if (freq[freq.length - 1] == 0)
- freq[freq.length - 1] = Float.MIN_VALUE;
- if (freq[freq.length - 2] == 0)
- freq[freq.length - 2] = freq[freq.length - 3];
- }
- }
- }
-}
\ No newline at end of file
diff --git a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java b/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java
deleted file mode 100644
index d2891aed..00000000
--- a/src/main/java/edu/ucsd/msjava/msscorer/ScoringParameterGeneratorWithErrors.java
+++ /dev/null
@@ -1,880 +0,0 @@
-package edu.ucsd.msjava.msscorer;
-
-import edu.ucsd.msjava.msgf.Histogram;
-import edu.ucsd.msjava.msgf.IntHistogram;
-import edu.ucsd.msjava.msgf.NominalMass;
-import edu.ucsd.msjava.msgf.Tolerance;
-import edu.ucsd.msjava.msscorer.NewScorerFactory.SpecDataType;
-import edu.ucsd.msjava.msutil.*;
-import edu.ucsd.msjava.msutil.IonType.PrefixIon;
-import edu.ucsd.msjava.parser.MgfSpectrumParser;
-
-import java.io.File;
-import java.util.*;
-
-/**
- * This only supports low accuracy fragment ions.
- *
- * @author sangtaekim
- */
-public class ScoringParameterGeneratorWithErrors extends NewRankScorer {
- private static final float MIN_PRECURSOR_OFFSET = -300; // for precursors
- private static final float MAX_PRECURSOR_OFFSET = 30;
- private static final int MIN_NUM_SPECTRA_PER_PARTITION = 400; // 400
- private static final int MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF = 150;
- private static final int MAX_NUM_PARTITIONS_PER_CHARGE = 30; // 30
-
- private static final float MIN_PRECURSOR_OFFSET_PROBABILITY = 0.15f; // 0.15
- private static final float MIN_ION_OFFSET_PROBABILITY = 0.15f; // 0.15, for ion types
- private static final float MIN_MAIN_ION_OFFSET_PROBABILITY = 0.01f; // ions with probabilities below this number will be ignored
-
- private static final int MAX_RANK = 150;
- private static final int NUM_SEGMENTS_PER_SPECTRUM = 2; // 2
-
- private static final int[] smoothingRanks = {3, 5, 10, 20, 50, Integer.MAX_VALUE}; //Ranks around which smoothing occurs
- private static final int[] smoothingWindowSize = {0, 1, 2, 3, 4, 5}; //Smoothing windows for each smoothing rank
-
- private static final float DECONVOLUTION_MASS_TOLERANCE = 0.02f;
- protected static final int MAX_CHARGE = 20;
-
- public static void generateParameters(
- File specFile,
- SpecDataType dataType,
- AminoAcidSet aaSet,
- File outputDir,
- boolean isText,
- boolean verbose,
- boolean singlePartition
- ) {
- SpectraContainer container = new SpectraContainer(specFile.getPath(), new MgfSpectrumParser().aaSet(aaSet));
- generateParameters(container, dataType, aaSet, outputDir, isText, verbose, singlePartition);
- }
-
- public static void generateParameters(
- SpectraContainer container,
- SpecDataType dataType,
- AminoAcidSet aaSet,
- File outputDir,
- boolean isText,
- boolean verbose) {
- generateParameters(container, dataType, aaSet, outputDir, isText, verbose, false);
- }
-
- public static void generateParameters(
- SpectraContainer container,
- SpecDataType dataType,
- AminoAcidSet aaSet,
- File outputDir,
- boolean isText,
- boolean verbose,
- boolean singlePartition) {
- if (verbose)
- System.out.println("Number of annotated PSMs: " + container.size());
-
- String paramFileName = dataType.toString() + ".param";
-
- File outputFile;
- if (outputDir != null)
- outputFile = new File(outputDir, paramFileName);
- else
- outputFile = new File(paramFileName);
-
- if (verbose)
- System.out.println("Output file name: " + outputFile.getAbsolutePath());
- int errorScalingFactor = 0;
- boolean applyDeconvolution = false;
-
- if (dataType.getInstrumentType() == InstrumentType.HIGH_RESOLUTION_LTQ
- || dataType.getInstrumentType() == InstrumentType.TOF
- || dataType.getInstrumentType().isHighResolution()) {
- errorScalingFactor = 100;
- applyDeconvolution = true;
- if (verbose)
- System.out.println("High-precision MS/MS data: " +
- "errorScalingFactor(" + errorScalingFactor + ") " +
- "chargeDeconvolution(" + applyDeconvolution + ")");
- }
-
- boolean considerPhosLoss = false;
- if (dataType.getProtocol().getName().equals("Phosphorylation")) {
- considerPhosLoss = true;
- if (verbose)
- System.out.println("Consider H3PO4 loss.");
- }
-
- boolean consideriTRAQLoss = false;
- if (dataType.getProtocol().getName().equals("iTRAQ")) {
- consideriTRAQLoss = true;
- if (verbose)
- System.out.println("Consider iTRAQ loss.");
- }
-
- boolean considerTMTLoss = false;
- if (dataType.getProtocol().getName().equals("TMT")) {
- considerTMTLoss = true;
- if (verbose)
- System.out.println("Consider TMT loss.");
- }
-
- if (dataType.getProtocol().getName().equals("iTRAQPhospho")) {
- considerPhosLoss = true;
- consideriTRAQLoss = true;
- if (verbose)
- System.out.println("Consider iTRAQ and H3PO4 loss.");
- }
-
- HashSet<String> pepSet = new HashSet<String>();
- for (Spectrum spec : container)
- pepSet.add(spec.getAnnotationStr());
-
- if (verbose)
- System.out.println("Number of unique peptides: " + pepSet.size());
- int numSpecsPerPeptide;
- if (pepSet.size() < 2000) {
- numSpecsPerPeptide = 3;
- } else {
- numSpecsPerPeptide = 1;
- }
- if (verbose)
- System.out.println("Consider " + numSpecsPerPeptide + " per spectrum.");
-
- // multiple spectra with the same peptide -> one spec per peptide
- HashMap<String, ArrayList<Spectrum>> pepSpecMap = new HashMap<String, ArrayList<Spectrum>>();
- for (Spectrum spec : container) {
- if (spec.getAnnotationStr() == null)
- continue;
- String pep = spec.getAnnotationStr() + ":" + spec.getCharge();
- if (pep != null && pep.length() > 0) {
- ArrayList<Spectrum> specList = pepSpecMap.get(pep);
- if (specList == null) {
- specList = new ArrayList<Spectrum>();
- pepSpecMap.put(pep, specList);
- }
- if (specList.size() < numSpecsPerPeptide)
- specList.add(spec);
- }
- }
-
- SpectraContainer specContOnePerPep = new SpectraContainer();
- for (ArrayList<Spectrum> specList : pepSpecMap.values()) {
- for (Spectrum spec : specList) {
- specContOnePerPep.add(spec);
- }
- }
-
- ScoringParameterGeneratorWithErrors gen = new ScoringParameterGeneratorWithErrors(
- specContOnePerPep,
- dataType,
- considerPhosLoss,
- consideriTRAQLoss,
- considerTMTLoss,
- applyDeconvolution);
-
- // set up the tolerance
- gen.tolerance(new Tolerance(0.5f));
-
- // Step 1: partition spectra
- if (singlePartition)
- gen.partition(2, true);
- else
- gen.partition(NUM_SEGMENTS_PER_SPECTRUM, false);
- if (verbose)
- System.out.println("Partition: " + gen.partitionSet.size());
-
- // Step 2: compute offset frequency functions of precursor peaks and their neutral losses
- gen.precursorOFF(MIN_PRECURSOR_OFFSET_PROBABILITY);
- if (verbose)
- System.out.println("PrecursorOFF Done.");
-
- // Step 3: filter out "significant" precursor offsets
- gen.filterPrecursorPeaks();
- if (verbose)
- System.out.println("Filtering Done.");
-
- if (applyDeconvolution) {
- gen.deconvoluteSpectra();
- if (verbose)
- System.out.println("Deconvolution Done.");
- }
-
- // Step 4: compute offset frequency function of fragment peaks and determine ion types to be considered for scoring
- gen.selectIonTypes();
- if (verbose)
- System.out.println("Ion types selected.");
-
- // Step 5: compute rank distributions
- gen.generateRankDist(MAX_RANK);
- if (verbose)
- System.out.println("Rank distribution computed.");
-
- // Step 6 (optional): generate error distribution
- gen.generateErrorDist(errorScalingFactor);
- if (verbose)
- System.out.println("Error disbribution computed");
-
- // Step 7: smoothing parameters
- gen.smoothing();
- if (verbose)
- System.out.println("Smoothing complete.");
-
- // output
-
- gen.writeParameters(outputFile);
- gen.writeParametersPlainText(new File(outputFile.getPath()+".txt"));
- //if (!isText)
- // gen.writeParameters(outputFile);
- //else
- // gen.writeParametersPlainText(outputFile);
-
- if (verbose)
- System.out.println("Writing Done.");
- }
-
- // Required
- private SpectraContainer specContainer;
- private final boolean considerPhosLoss;
- private final boolean consideriTRAQLoss;
- private final boolean considerTMTLoss;
-
- public ScoringParameterGeneratorWithErrors(SpectraContainer specContainer, SpecDataType dataType, boolean considerPhosLoss, boolean consideriTRAQLoss, boolean considerTMTLoss, boolean applyDeconvolution) {
- this.specContainer = specContainer;
- this.considerPhosLoss = considerPhosLoss;
- this.consideriTRAQLoss = consideriTRAQLoss;
- this.considerTMTLoss = considerTMTLoss;
- super.dataType = dataType;
- super.applyDeconvolution = applyDeconvolution;
- super.deconvolutionErrorTolerance = DECONVOLUTION_MASS_TOLERANCE;
- }
-
- public void partition(int numSegments, boolean singlePartition) {
- super.numSegments = numSegments;
- chargeHist = new Histogram<Integer>();
- partitionSet = new TreeSet<Partition>();
-
-
- HashMap<Integer, ArrayList<Float>> parentMassMap = new HashMap<Integer, ArrayList<Float>>();
- for (Spectrum spec : specContainer) {
- int charge = spec.getCharge();
- if (charge <= 0)
- continue;
- chargeHist.add(charge);
- if (spec.getAnnotation() != null) {
- ArrayList<Float> precursorList = parentMassMap.get(charge);
- if (precursorList == null) {
- precursorList = new ArrayList<Float>();
- parentMassMap.put(charge, precursorList);
- }
- precursorList.add(spec.getPrecursorMass());
- }
- }
-
- for (int c = chargeHist.minKey(); c <= chargeHist.maxKey(); c++) {
-
- ArrayList<Float> parentMassList = parentMassMap.get(c);
- if (parentMassList == null)
- continue;
-
- int numSpec = parentMassList.size();
- if (numSpec < Math.round(MIN_NUM_SPECTRA_PER_PARTITION * 0.9f)) // to few spectra
- continue;
-
- int partitionSize = Math.max(numSpec / MAX_NUM_PARTITIONS_PER_CHARGE, MIN_NUM_SPECTRA_PER_PARTITION);
-
- Collections.sort(parentMassList);
- int bestSetSize = 0;
-
- if (singlePartition)
- bestSetSize = numSpec;
- else {
- int smallestRemainder = partitionSize;
- for (int i = Math.round(partitionSize * 0.9f); i <= Math.round(partitionSize * 1.1f); i++) {
- int remainder = numSpec % i;
- if (i - remainder < remainder)
- remainder = i - remainder;
- if (remainder < smallestRemainder || (remainder == smallestRemainder && Math.abs(partitionSize - i) < Math.abs(partitionSize - bestSetSize))) {
- bestSetSize = i;
- smallestRemainder = remainder;
- }
- }
- }
- int num = 0;
- for (int i = 0; i == 0 || i < Math.round(numSpec / (float) bestSetSize); i++) {
- if (num != 0) {
- for (int seg = 0; seg < numSegments; seg++)
- partitionSet.add(new Partition(c, parentMassList.get(num), seg));
- } else {
- for (int seg = 0; seg < numSegments; seg++)
- partitionSet.add(new Partition(c, 0f, seg));
- }
- num += bestSetSize;
- }
- }
- }
-
- private void precursorOFF(float minProbThreshold) {
- if (chargeHist == null) {
- assert (false) : "partition() must have been called before";
- return;
- }
- precursorOFFMap = new TreeMap<Integer, ArrayList<PrecursorOffsetFrequency>>();
- numPrecurOFF = 0;
-
- for (int charge = chargeHist.minKey(); charge <= chargeHist.maxKey(); charge++) {
- if (chargeHist.get(charge) < MIN_NUM_SPECTRA_FOR_PRECURSOR_OFF)
- continue;
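- ArrayList<PrecursorOffsetFrequency> precursorOffsetList = new ArrayList<PrecursorOffsetFrequency>();
- int numSpecs = 0;
- HashMap<Integer, Histogram<Integer>> histList = new HashMap<Integer, Histogram<Integer>>();
- for (int c = charge; c >= 2; c--)
- histList.put(c, new Histogram<Integer>());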
-
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
- numSpecs++;
- spec = filter.apply(spec);
- float precursorNeutralMass = spec.getPrecursorMass();
- for (int c = charge; c >= 2; c--) {
- float precursorMz = (precursorNeutralMass + c * (float) Composition.ChargeCarrierMass()) / c;
- ArrayList<Peak> peakList = spec.getPeakListByMassRange(
- precursorMz + MIN_PRECURSOR_OFFSET / (float) c - mme.getToleranceAsDa(precursorMz + MIN_PRECURSOR_OFFSET / (float) c) / 2,
- precursorMz + MAX_PRECURSOR_OFFSET / (float) c + mme.getToleranceAsDa(precursorMz + MAX_PRECURSOR_OFFSET / (float) c) / 2);
-
- int prevMassIndexDiff = Integer.MIN_VALUE;
- for (Peak p : peakList) {
- float peakMass = p.getMz();
- int massIndexDiff = NominalMass.toNominalMass(peakMass - precursorMz);
- if (massIndexDiff > prevMassIndexDiff) {
- histList.get(c).add(massIndexDiff);
- prevMassIndexDiff = massIndexDiff;
- }
- }
- }
- }
-
- for (int c = charge; c >= 2; c--) {
- ArrayList<Integer> keyList = new ArrayList<Integer>(histList.get(c).keySet());
- Collections.sort(keyList);
- for (Integer key : keyList) {
- float prob = (histList.get(c).get(key)) / (float) numSpecs;
- if (prob > minProbThreshold) {
- precursorOffsetList.add(new PrecursorOffsetFrequency((charge - c), NominalMass.getMassFromNominalMass(key), prob));
- }
- }
- }
- precursorOFFMap.put(charge, precursorOffsetList);
- numPrecurOFF += precursorOffsetList.size();
- }
- }
-
- private void filterPrecursorPeaks() {
- if (this.precursorOFFMap == null)
- return;
- for (Spectrum spec : specContainer) {
- for (PrecursorOffsetFrequency off : this.getPrecursorOFF(spec.getCharge()))
- spec.filterPrecursorPeaks(mme, off.getReducedCharge(), off.getOffset());
- }
- }
-
- private void deconvoluteSpectra() {
- SpectraContainer newSpecContainer = new SpectraContainer();
- for (Spectrum spec : specContainer) {
- newSpecContainer.add(spec.getDeconvolutedSpectrum(DECONVOLUTION_MASS_TOLERANCE));
- }
- specContainer = newSpecContainer;
- }
-
- private Pair<Float, Float> getPrecursorMassRange(Partition partition) {
- float minParentMass = partition.getParentMass();
- float maxParentMass = Float.MAX_VALUE;
- Partition higherPartition = partitionSet.higher(partition);
- if (higherPartition != null) {
- if (higherPartition.getCharge() == partition.getCharge() && higherPartition.getSegNum() == partition.getSegNum()) {
- maxParentMass = higherPartition.getParentMass();
- }
- }
- return new Pair<Float, Float>(minParentMass, maxParentMass);
- }
-
- private void selectIonTypes() {
- if (partitionSet == null) {
- assert (false) : "partition() must have been called before!";
- return;
- }
-
- fragOFFTable = new HashMap<Partition, ArrayList<FragmentOffsetFrequency>>();
-
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- // parent mass range check
- Pair<Float, Float> parentMassRange = getPrecursorMassRange(partition);
-
- SpectraContainer curPartContainer = new SpectraContainer();
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
-
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- curPartContainer.add(spec);
- }
-
- ArrayList<FragmentOffsetFrequency> signalFragmentOffsetFrequencyList = new ArrayList<FragmentOffsetFrequency>();
-
- int seg = partition.getSegNum();
- IonType[] allIonTypes = IonType.getAllKnownIonTypes(Math.min(charge, 3), true, considerPhosLoss, consideriTRAQLoss, considerTMTLoss).toArray(new IonType[0]);
-
- IonProbability probGen = new IonProbability(
- curPartContainer.iterator(),
- allIonTypes,
- mme)
- .filter(filter)
- .segment(seg, numSegments);
-
-// if(partition.getCharge() == 2 && partition.getSegNum() == 1 && partition.getParentMass() >= 1008 && partition.getParentMass() < 1009)
-// {
-// System.out.println("Debug");
-// }
- float[] ionProb = probGen.getIonProb();
-
- float signalThreshold = MIN_ION_OFFSET_PROBABILITY;
- for (int i = 0; i < allIonTypes.length; i++) {
- if (ionProb[i] >= signalThreshold)
- signalFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(allIonTypes[i], ionProb[i]));
- }
-
- if (signalFragmentOffsetFrequencyList.size() == 0) {
- int maxIndex = -1;
- float maxIonProb = Float.MIN_VALUE;
- for (int i = 0; i < allIonTypes.length; i++) {
- if (ionProb[i] > MIN_MAIN_ION_OFFSET_PROBABILITY && ionProb[i] > maxIonProb) {
- maxIndex = i;
- maxIonProb = ionProb[i];
- }
- }
- if (maxIndex >= 0)
- signalFragmentOffsetFrequencyList.add(new FragmentOffsetFrequency(allIonTypes[maxIndex], maxIonProb));
- }
-
- Collections.sort(signalFragmentOffsetFrequencyList, Collections.reverseOrder());
- fragOFFTable.put(partition, signalFragmentOffsetFrequencyList);
- }
- super.determineIonTypes();
- }
-
- private void generateRankDist(int maxRank) {
- if (partitionSet == null) {
- assert (false) : "partition() must have been called!";
- return;
- }
-
- rankDistTable = new HashMap<Partition, HashMap<IonType, Float[]>>();
- this.maxRank = maxRank;
-
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- IonType[] ionTypes = getIonTypes(partition);
- if (ionTypes == null || ionTypes.length == 0)
- continue;
-
- Pair<Float, Float> parentMassRange = getPrecursorMassRange(partition);
- int seg = partition.getSegNum();
-
- int numSpec = 0;
- HashMap<IonType, Histogram<Integer>> rankDist = new HashMap<IonType, Histogram<Integer>>();
- HashMap<IonType, Float> rankDistMaxRank = new HashMap<IonType, Float>();
- HashMap<IonType, Float> rankDistUnexplained = new HashMap<IonType, Float>();
-
- for (IonType ion : ionTypes) {
- rankDist.put(ion, new Histogram<Integer>());
- rankDistMaxRank.put(ion, 0f);
- rankDistUnexplained.put(ion, 0f);
- }
- rankDist.put(IonType.NOISE, new Histogram<Integer>());
-
- float[] noiseDist = new float[maxRank + 2];
- int numMaxRankPeaks = 0;
- int totalCleavageSites = 0;
-
- for (Spectrum spec : specContainer) {
- int numExplainedPeaks = 0;
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- Peptide annotation = spec.getAnnotation();
- spec.setRanksOfPeaks();
- numSpec++;
- numMaxRankPeaks += spec.size() - maxRank + 1;
- totalCleavageSites += annotation.size() - 1;
- int prmMassIndex = 0;
- int srmMassIndex = 0;
-
- HashSet<Peak> explainedPeakSet = new HashSet<Peak>();
- HashMap<IonType, Integer> numExplainedMaxRankPeaks = new HashMap<IonType, Integer>();
- for (IonType ion : ionTypes) {
- numExplainedMaxRankPeaks.put(ion, 0);
- }
-
- int numSignalBinsAtThisSegment = 0;
- for (int i = 0; i < annotation.size() - 1; i++) {
- prmMassIndex += NominalMass.toNominalMass(annotation.get(i).getMass());
- srmMassIndex += NominalMass.toNominalMass(annotation.get(annotation.size() - 1 - i).getMass());
-
- float prm = NominalMass.getMassFromNominalMass(prmMassIndex);
- float srm = NominalMass.getMassFromNominalMass(srmMassIndex);
- for (IonType ion : ionTypes) {
- float theoMass;
- if (ion instanceof IonType.PrefixIon)
- theoMass = ion.getMz(prm);
- else
- theoMass = ion.getMz(srm);
-
-// if(ion.getName().equals("z-H-TMT"))
-// {
-// System.out.println("Debug");
-// }
-
- int segNum = super.getSegmentNum(theoMass, curParentMass);
- if (segNum == seg) {
- numSignalBinsAtThisSegment++;
- Peak p = spec.getPeakByMass(theoMass, mme);
- if (p != null) {
- numExplainedPeaks++;
- int rank = p.getRank();
- if (rank >= maxRank) {
- rank = maxRank;
- numExplainedMaxRankPeaks.put(ion, numExplainedMaxRankPeaks.get(ion) + 1);
- }
- explainedPeakSet.add(p);
- rankDist.get(ion).add(rank);
- } else {
- rankDist.get(ion).add(maxRank + 1); // maxRank+1: missing ion
- }
- }
- }
- }
-
- ArrayList<Peak> unexplainedPeaksAtThisSegment = new ArrayList<Peak>();
- int numPeaksAtThisSegment = 0;
- int numMaxRankPeaksAtThisSegment = 0;
- for (Peak p : spec) {
- if (super.getSegmentNum(p.getMz(), curParentMass) == seg) {
- numPeaksAtThisSegment++;
- if (p.getRank() >= maxRank)
- numMaxRankPeaksAtThisSegment++;
- if (!explainedPeakSet.contains(p))
- unexplainedPeaksAtThisSegment.add(p);
- }
- }
-
- float midMassThisSegment = (1f / numSegments * seg + 1f / numSegments / 2) * annotation.getParentMass();
- float numBinsAtThisSegment = annotation.getParentMass() / numSegments / mme.getToleranceAsDa(midMassThisSegment) / 2;
-
- for (Peak p : unexplainedPeaksAtThisSegment) {
- int rank = p.getRank();
- // float noiseFreq = (float)(annotation.size()-1)/(annotation.getParentMass()/(mme.getToleranceAsDa(midMassThisSegment)*2));
- float noiseFreq = (annotation.size() - 1) / numSegments / numBinsAtThisSegment;
- if (rank >= maxRank)
- noiseDist[maxRank] += noiseFreq / numMaxRankPeaksAtThisSegment;
- else
- noiseDist[rank] += noiseFreq;
- }
-
- for (IonType ion : ionTypes) {
- if (numMaxRankPeaksAtThisSegment > 0) {
- Float prevSumFreq = rankDistMaxRank.get(ion);
- float curFreq = numExplainedMaxRankPeaks.get(ion) / (float) numMaxRankPeaksAtThisSegment;
- rankDistMaxRank.put(ion, prevSumFreq + curFreq);
- }
- }
-
- noiseDist[maxRank + 1] += (numBinsAtThisSegment - numPeaksAtThisSegment) * (annotation.size() - 1) / numSegments / numBinsAtThisSegment;
- }
-
- HashMap<IonType, Float[]> freqDist = new HashMap<IonType, Float[]>();
- for (IonType ion : ionTypes) {
- Float[] dist = new Float[maxRank + 1];
- Histogram<Integer> hist = rankDist.get(ion);
- for (int i = 1; i <= maxRank - 1; i++) {
- Integer num = hist.get(i);
- dist[i - 1] = (num / (float) numSpec);
- }
- dist[maxRank - 1] = rankDistMaxRank.get(ion) / numSpec;
- dist[maxRank] = hist.get(maxRank + 1) / (float) numSpec;
- freqDist.put(ion, dist);
- }
-
- // noise
- Float[] dist = new Float[maxRank + 1];
- for (int i = 1; i <= maxRank + 1; i++)
- dist[i - 1] = noiseDist[i] / numSpec;
- freqDist.put(IonType.NOISE, dist);
-
- rankDistTable.put(partition, freqDist);
- }
- }
-
- private void generateErrorDist(int errorScalingFactor) {
- this.errorScalingFactor = errorScalingFactor;
- if (errorScalingFactor > 0) {
- generateIonErrorDist();
- generateNoiseErrorDist();
- }
- }
-
- private void generateIonErrorDist() {
- ionErrDistTable = new HashMap<Partition, Float[]>();
- ionExistenceTable = new HashMap<Partition, Float[]>();
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- Pair parentMassRange = getPrecursorMassRange(partition);
- int seg = partition.getSegNum();
- if (seg != super.getNumSegments() - 1)
- continue;
- IonType mainIon = this.getMainIonType(partition);
- IntHistogram errHist = new IntHistogram();
- int[] edgeCount = new int[4];
- int numSpecs = 0;
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
-
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- numSpecs++;
- Peptide peptide;
-
- peptide = spec.getAnnotation();
-
- int intResidueMass = 0;
- float[] obsMass = new float[peptide.size() + 1];
-
- obsMass[0] = mainIon.getOffset();
- for (int i = 0; i < peptide.size() - 1; i++) {
- if (mainIon instanceof PrefixIon)
- intResidueMass += peptide.get(i).getNominalMass();
- else
- intResidueMass += peptide.get(peptide.size() - 1 - i).getNominalMass();
-
- float theoMass = mainIon.getMz(NominalMass.getMassFromNominalMass(intResidueMass));
- Peak p = spec.getPeakByMass(theoMass, mme);
- if (p != null)
- obsMass[i + 1] = p.getMz();
- else
- obsMass[i + 1] = -1;
- }
-
- obsMass[peptide.size()] = mainIon.getMz(peptide.getMass());
- for (int i = 1; i <= peptide.size(); i++) {
- if (obsMass[i] >= 0) {
- if (obsMass[i - 1] >= 0) // yy
- {
- AminoAcid aa;
- if (mainIon instanceof PrefixIon)
- aa = peptide.get(i - 1);
- else
- aa = peptide.get(peptide.size() - i);
-
- float expMass = obsMass[i] - obsMass[i - 1];
- float theoMass = aa.getMass() / mainIon.getCharge();
- float diff = expMass - theoMass;
- int diffIndex = Math.round(diff * errorScalingFactor);
- if (diffIndex > errorScalingFactor)
- diffIndex = errorScalingFactor;
- else if (diffIndex < -errorScalingFactor)
- diffIndex = -errorScalingFactor;
- errHist.add(diffIndex);
- edgeCount[3]++;
- } else // ny
- edgeCount[1]++;
- } else {
- if (obsMass[i - 1] >= 0) // yn
- edgeCount[2]++;
- else // nn
- edgeCount[0]++;
- }
- }
- }
-
- Float[] ionErrHist = new Float[2 * errorScalingFactor + 1];
- // smoothing
- float[] smoothedHist = errHist.getSmoothedHist(errorScalingFactor);
- for (int i = -errorScalingFactor; i <= errorScalingFactor; i++)
- ionErrHist[i + errorScalingFactor] = smoothedHist[i + errorScalingFactor] / (float) errHist.totalCount();
-
- Float[] ionExistence = new Float[edgeCount.length];
- int sumEdgeCount = 0;
- for (int i = 0; i < edgeCount.length; i++)
- sumEdgeCount += edgeCount[i];
- for (int i = 0; i < edgeCount.length; i++)
- ionExistence[i] = edgeCount[i] / (float) sumEdgeCount;
-
- for (int i = 0; i < this.numSegments; i++) {
- Partition part = new Partition(partition.getCharge(), partition.getParentMass(), i);
- if (partitionSet.contains(part)) {
- ionErrDistTable.put(part, ionErrHist);
- ionExistenceTable.put(part, ionExistence);
- }
- }
- // if(partition.getCharge() == 2 && partition.getParentMass() > 1000 && partition.getParentMass() < 1110)
- // {
- // System.out.println("Partition\t"+partition.getCharge()+"\t"+partition.getParentMass());
- // System.out.println("ErrorHist:");
- // for(int i=0; i
- }
- }
-
- private void generateNoiseErrorDist() {
- noiseErrDistTable = new HashMap();
- AminoAcidSet aaSet = AminoAcidSet.getStandardAminoAcidSetWithFixedCarbamidomethylatedCys();
- AminoAcid aaK = aaSet.getAminoAcid('K');
- AminoAcid aaQ = aaSet.getAminoAcid('Q');
- int heaviestAANominalMass = aaSet.getMaxNominalMass();
- float[] nominalMass = new float[heaviestAANominalMass + 1];
- for (AminoAcid aa : aaSet)
- nominalMass[aa.getNominalMass()] = aa.getMass();
-
- for (Partition partition : partitionSet) {
- int charge = partition.getCharge();
- Pair parentMassRange = getPrecursorMassRange(partition);
- int seg = partition.getSegNum();
- if (seg != super.getNumSegments() - 1)
- continue;
-
- IntHistogram errHist = new IntHistogram();
- int numSpecs = 0;
- for (Spectrum spec : specContainer) {
- if (spec.getAnnotation() == null)
- continue;
- if (spec.getCharge() != charge)
- continue;
-
- float curParentMass = spec.getPrecursorMass();
- if (curParentMass < parentMassRange.getFirst() || curParentMass >= parentMassRange.getSecond())
- continue;
-
- Spectrum noiseSpec = (Spectrum) spec.clone();
-
- numSpecs++;
-
- for (int i = 0; i < noiseSpec.size() - 1; i++) {
- Peak p1 = noiseSpec.get(i);
- float p1Mass = p1.getMz();
- int nominalP1 = NominalMass.toNominalMass(p1.getMz());
- for (int j = i + 1; j < noiseSpec.size(); j++) {
- Peak p2 = noiseSpec.get(j);
- float p2Mass = p2.getMz();
- int nominalP2 = NominalMass.toNominalMass(p2.getMz());
- int nominalDiff = nominalP2 - nominalP1;
- if (nominalDiff > heaviestAANominalMass)
- break;
- if (nominalMass[nominalDiff] == 0)
- continue;
-
- float diff = p2Mass - p1Mass;
- float aaMass = nominalMass[nominalDiff];
- if (nominalDiff == 128) // K or Q
- {
- if (Math.abs(diff - aaQ.getMass()) > Math.abs(diff - aaK.getMass()))
- aaMass = aaK.getMass();
- else
- aaMass = aaQ.getMass();
- }
- float err = diff - aaMass;
- errHist.add(Math.round(err * errorScalingFactor));
- }
- }
- }
- Float[] noiseErrHist = new Float[2 * errorScalingFactor + 1];
- // smoothing
- float[] smoothedHist = errHist.getSmoothedHist(errorScalingFactor);
- for (int i = -errorScalingFactor; i <= errorScalingFactor; i++)
- noiseErrHist[i + errorScalingFactor] = smoothedHist[i + errorScalingFactor] / (float) errHist.totalCount();
-
- for (int i = 0; i < this.numSegments; i++) {
- Partition part = new Partition(partition.getCharge(), partition.getParentMass(), i);
- if (partitionSet.contains(part)) {
- noiseErrDistTable.put(part, noiseErrHist);
- }
- }
- }
- }
-
- protected void smoothing() {
- smoothingRankDistTable();
- }
-
- protected void smoothingRankDistTable() {
- if (rankDistTable == null)
- return;
- assert (smoothingRanks.length == smoothingWindowSize.length);
- for (Partition partition : rankDistTable.keySet()) {
- HashMap table = this.rankDistTable.get(partition);
- for (IonType ion : table.keySet()) {
- Float[] freq = table.get(ion);
- Float[] smoothedFreq = new Float[freq.length];
- int smoothingIndex = 0;
- for (int i = 0; i < freq.length - 2; i++) // last 2 columns: maxRank, unexplained
- {
- if (smoothingIndex < smoothingRanks.length - 1 &&
- i == smoothingRanks[smoothingIndex])
- smoothingIndex++;
- int windowSize = smoothingWindowSize[smoothingIndex];
- float sumFrequencies = 0;
- int numIndicesSummed = 0;
- for (int d = -windowSize; d <= windowSize; d++) {
- int index = i + d;
- if (index < 0 || index > freq.length - 3)
- continue;
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- while (sumFrequencies == 0 && windowSize < freq.length - 4) {
- windowSize++;
- int index = i - windowSize;
- if (index >= 0) {
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- index = i + windowSize;
- if (index <= freq.length - 3) {
- sumFrequencies += freq[index];
- numIndicesSummed++;
- }
- }
- if (sumFrequencies != 0)
- smoothedFreq[i] = sumFrequencies / numIndicesSummed;
- else
- assert (false);
- }
- for (int i = 0; i < freq.length - 2; i++)
- freq[i] = smoothedFreq[i];
- if (freq[freq.length - 1] == 0)
- freq[freq.length - 1] = Float.MIN_VALUE;
- if (freq[freq.length - 2] == 0)
- freq[freq.length - 2] = freq[freq.length - 3];
- }
- }
- }
-}
-
diff --git a/src/main/java/edu/ucsd/msjava/msutil/ActivationMethod.java b/src/main/java/edu/ucsd/msjava/msutil/ActivationMethod.java
index a691dfe9..4639b667 100644
--- a/src/main/java/edu/ucsd/msjava/msutil/ActivationMethod.java
+++ b/src/main/java/edu/ucsd/msjava/msutil/ActivationMethod.java
@@ -1,7 +1,5 @@
package edu.ucsd.msjava.msutil;
-import edu.ucsd.msjava.params.ParamObject;
-import edu.ucsd.msjava.params.UserParam;
import java.io.File;
import java.nio.file.Paths;
@@ -115,12 +113,10 @@ private static void add(ActivationMethod actMethod) {
registeredActMethods.add(actMethod);
}
- // add to the HashMap only
private static void addAlias(String name, ActivationMethod actMethod) {
table.put(name, actMethod);
}
- // add to the list only
private static void addToList(ActivationMethod actMethod) {
registeredActMethods.add(actMethod);
}
@@ -150,7 +146,6 @@ private static void addToList(ActivationMethod actMethod) {
// Parse activation methods defined by a user
File actMethodFile = Paths.get("params", "activationMethods.txt").toFile();
if (actMethodFile.exists()) {
-// System.out.println("Loading " + actMethodFile.getAbsolutePath());
ArrayList paramLines = UserParam.parseFromFile(actMethodFile.getPath(), 2);
for (String paramLine : paramLines) {
String[] token = paramLine.split(",");
diff --git a/src/main/java/edu/ucsd/msjava/msutil/AminoAcid.java b/src/main/java/edu/ucsd/msjava/msutil/AminoAcid.java
index 61081dbc..688ce7b4 100644
--- a/src/main/java/edu/ucsd/msjava/msutil/AminoAcid.java
+++ b/src/main/java/edu/ucsd/msjava/msutil/AminoAcid.java
@@ -20,13 +20,6 @@ public class AminoAcid extends Matter {
private float probability = 0.05f;
private Composition composition;
- /**
- * Constructor.
- *
- * @param residue single letter identifier.
- * @param name full name of the amino acid.
- * @param composition CHNOS composition object.
- */
protected AminoAcid(char residue, String name, Composition composition) {
this.mass = composition.getAccurateMass();
this.nominalMass = composition.getNominalMass();
@@ -35,13 +28,6 @@ protected AminoAcid(char residue, String name, Composition composition) {
this.composition = composition;
}
- /**
- * Constructor. Generates a custom amino acid.
- *
- * @param residue Single letter identifier.
- * @param name Full name of the amino acid.
- * @param mass Mass
- */
protected AminoAcid(char residue, String name, double mass) {
this.mass = mass;
this.nominalMass = Math.round(Constants.INTEGER_MASS_SCALER * (float) mass);
@@ -49,109 +35,54 @@ protected AminoAcid(char residue, String name, double mass) {
this.name = name;
}
- /**
- * Builder. Set probability and returns this object.
- *
- * @return this object.
- */
public AminoAcid setProbability(float probability) {
this.probability = probability;
return this;
}
- /**
- * Standard string representation of this object. Output the single letter
- * representation.
- *
- * @return the single letter code for this amino acid.
- */
public String toString() {
return String.valueOf(residue) + ": " + String.format("%.2f", mass);
}
- /**
- * Quick way to tell whether this object is modified.
- *
- * @return false if this is not modified.
- */
+ /** Returns false; overridden by {@code ModifiedAminoAcid}. */
public boolean isModified() {
return false;
}
- /**
- * Quick way to tell the number of variable modifications applied to this amino acid.
- *
- * @return the number of variable modifications applied to this amino acid.
- */
+ /** Returns 0; overridden by {@code ModifiedAminoAcid}. */
public int getNumVariableMods() {
return 0;
}
- /**
- * Tell whether this object is associated with a terminal-specific modification
- *
- * @return false if this is not associated with terminal-specific modification
- */
+ /** Returns false; overridden by {@code ModifiedAminoAcid}. */
public boolean hasTerminalVariableMod() {
return false;
}
- /**
- * Tell whether this object is associated with a residue-specific modification
- *
- * @return false if this is not associated with residue-specific modification
- */
+ /** Returns false; overridden by {@code ModifiedAminoAcid}. */
public boolean hasResidueSpecificVariableMod() {
return false;
}
- // accessor methods
-
- /**
- * Gets the mass of this amino acid. This is the mono isotopic mass.
- *
- * @return the mass of this amino acid
- */
@Override
public float getMass() {
return (float) mass;
}
- /**
- * Gets the mass of this amino acid as double precision. This is the mono isotopic mass.
- *
- * @return the mass of this amino acid (double precision)
- */
@Override
public double getAccurateMass() {
return mass;
}
- /**
- * Gets the nominal mass of this object.
- *
- * @return nominal mass of this object.
- */
@Override
public int getNominalMass() {
return nominalMass;
}
- /**
- * Gets the probability of this amino acid. Currently set as 1/20, uniformly.
- *
- * @return the probability of this amino acid.
- */
public float getProbability() {
return probability;
}
- // // prohibited
- // @Override
- // public void add(AminoAcid other) {
- // assert(false);
- // }
-
@Override
public boolean equals(Object obj) {
if (!(obj instanceof AminoAcid))
@@ -160,52 +91,27 @@ public boolean equals(Object obj) {
return this == aa;
}
- /**
- * Gets the representation of the residue as string.
- *
- * @return the string representing this amino acid.
- */
public String getResidueStr() {
return String.valueOf(residue);
}
- /**
- * Gets the single letter amino acid representation.
- *
- * @return the single letter amino acid character.
- */
public char getResidue() {
return residue;
}
- /**
- * Gets the single letter amino acid representation of the unmodified version of this amino acid.
- *
- * @return the single letter amino acid character.
- */
+ /** Returns the unmodified residue letter; overridden by {@code ModifiedAminoAcid}. */
public char getUnmodResidue() {
return residue;
}
- /**
- * Gets the full string.
- *
- * @return the full name/description of the amino acid.
- */
public String getName() {
return name;
}
- /**
- * Gets the composition object for this amino acid.
- *
- * @return the composition object for this amino acid.
- */
public Composition getComposition() {
return composition;
}
- // static members
public static AminoAcid getStandardAminoAcid(char residue) {
return residueMap.get(residue);
}
@@ -214,12 +120,6 @@ public static AminoAcid[] getStandardAminoAcids() {
return standardAATable;
}
- /**
- * Returns a modified version of this amino acid (fixed modification).
- *
- * @param mod a modification.
- * @return a modified amino acid object.
- */
public AminoAcid getAAWithFixedModification(Modification mod) {
String name = mod.getName() + " " + this.getName();
AminoAcid modAA;
@@ -230,13 +130,6 @@ public AminoAcid getAAWithFixedModification(Modification mod) {
return modAA;
}
- /**
- * Get an amino acid with a customized mass
- * @param residue
- * @param name
- * @param mass
- * @return
- */
public static AminoAcid getCustomAminoAcid(char residue, String name, double mass) {
AminoAcid standardAA = AminoAcid.getStandardAminoAcid(residue);
if (standardAA != null && Math.abs(mass - standardAA.getMass()) < 0.001f)
@@ -262,20 +155,8 @@ public int hashCode() {
return (int) residue;
}
-// @Override
-// public boolean equals(Object obj)
-// {
-// if(!(obj instanceof AminoAcid))
-// return false;
-// else
-// {
-// AminoAcid otherAA = (AminoAcid)obj;
-// return this.getResidue() == otherAA.getResidue();
-// }
-// }
-
private static Hashtable residueMap;
- // Static table containing Predefined Amino Acids, sorted by increasing mass
+ // Standard amino acids sorted by increasing nominal mass
private static final AminoAcid[] standardAATable =
{
// C H N O S
@@ -286,7 +167,6 @@ public int hashCode() {
new AminoAcid('V', "Valine", new Composition(5, 9, 1, 1, 0)), // 99.0684
new AminoAcid('T', "Threonine", new Composition(4, 7, 1, 2, 0)), // 101.0477
new AminoAcid('C', "Cystine", new Composition(3, 5, 1, 1, 1)), // 103.0092
- // new AminoAcid('O', "Hydroxyproline", new Composition(5, 7, 1, 2, 0)), // 113.0477; note that O could be Hydroxyproline, Ornithine, or Pyrrolysine
new AminoAcid('L', "Leucine", new Composition(6, 11, 1, 1, 0)), // 113.0841
new AminoAcid('I', "Isoleucine", new Composition(6, 11, 1, 1, 0)), // 113.0841
new AminoAcid('N', "Asparagine", new Composition(4, 6, 2, 2, 0)), // 114.0429
@@ -303,65 +183,17 @@ public int hashCode() {
new AminoAcid('W', "Tryptophan", new Composition(11, 10, 2, 1, 0)), // 186.0793
};
-// public static final AminoAcid N_TERN = new AminoAcid('[', "N-terminus", new Composition(0,0,0,0,0));
-// public static final AminoAcid C_TERM = new AminoAcid(']', "C-terminus", new Composition(0,0,0,0,0));
-// public static final AminoAcid PROTEIN_N_TERN = new AminoAcid('{', "Protein N-terminus", new Composition(0,0,0,0,0));
-// public static final AminoAcid PROTEIN_C_TERM = new AminoAcid('}', "Protein C-terminus", new Composition(0,0,0,0,0));
-// public static final AminoAcid ANY = new AminoAcid('*', "C-terminus", new Composition(0,0,0,0,0));
-
static {
residueMap = new Hashtable();
for (AminoAcid aa : standardAATable)
residueMap.put(aa.getResidue(), aa);
}
- /*
- public static Color getColor(AminoAcid aa)
- {
- int index = aa.getIndex();
- switch(index)
- {
- case 0: return new Color(200,200,200);
- case 1: return new Color(140,255,140);
- case 2: return new Color(255,112,66);
- case 3: return new Color(82,82,82);
- case 4: return new Color(255,140,255);
- case 5: return new Color(184,76,0);
- case 6: return new Color(69,94,69);
- case 7: return new Color(0,76,0);
- case 8: return new Color(255,124,112);
- case 9: return new Color(160,0,66);
- case 10: return new Color(102,0,0);
- case 11: return new Color(71,71,184);
- case 12: return new Color(102,0,100);
- case 13: return new Color(184,160,66);
- case 14: return new Color(112,112,255);
- case 15: return new Color(83,76,82);
- case 16: return new Color(100,100,224);
- case 17: return Color.orange;
- case 18: return new Color(140,112,76);
- case 19: return new Color(79,70,0);
- default: return null;
- }
- }
- */
-
- /**
- * Get the amino acid of the given integer mass.
- *
- * @param mass the integer mass
- * @return the list of amino acids with the mass.
- */
+
public static ArrayList getAminoAcids(int mass) {
if (mass2aa.containsKey(mass)) return mass2aa.get(mass);
return new ArrayList();
}
- /**
- * Checks whether the character is an standard amino acid
- *
- * @param c the character input
- * @return true if it is part of the standard amino acid set, false otherwise
- */
public static boolean isStdAminoAcid(char c) {
return residueMap.containsKey(c);
}
diff --git a/src/main/java/edu/ucsd/msjava/msutil/AminoAcidSet.java b/src/main/java/edu/ucsd/msjava/msutil/AminoAcidSet.java
index 08f2dbf0..cb443c0c 100644
--- a/src/main/java/edu/ucsd/msjava/msutil/AminoAcidSet.java
+++ b/src/main/java/edu/ucsd/msjava/msutil/AminoAcidSet.java
@@ -1,15 +1,12 @@
package edu.ucsd.msjava.msutil;
+import edu.ucsd.msjava.cli.MSGFPlusOptions;
import edu.ucsd.msjava.msdbsearch.SearchParams;
import edu.ucsd.msjava.msutil.Modification.Location;
-import edu.ucsd.msjava.params.ParamManager;
-import edu.ucsd.msjava.parser.BufferedLineReader;
-import edu.ucsd.msjava.ui.MSGFPlus;
+import edu.ucsd.msjava.mgf.BufferedLineReader;
import java.io.File;
import java.io.IOException;
-import java.nio.file.Path;
-import java.nio.file.Paths;
import java.text.DecimalFormat;
import java.util.*;
@@ -19,23 +16,13 @@
* @author sangtaekim
*/
public class AminoAcidSet implements Iterable {
- /**
- *
- */
private static final AminoAcid[] EMPTY_AA_ARRAY = new AminoAcid[0];
private HashMap> aaListMap;
- /**
- * Mapping from Location Enum name to places where the location applies
- */
private static HashMap locMap;
- /**
- * This tracks any default mods that the user has defined
- * Keys are mod names and values are the mod mass that the user defined for this modification
- * This list is used to warn users of non-standard mod masses for default mods
- */
+ // maps mod name -> user-supplied mass; used to warn on non-standard masses for built-in mods
private static Hashtable defaultModUsage = new Hashtable<>();
static {
@@ -47,7 +34,6 @@ public class AminoAcidSet implements Iterable {
locMap.put(Location.Protein_C_Term, new Location[]{Location.Protein_C_Term});
}
- // for fast indexing
private HashMap residueMap; // residue -> aa (residue must be unique)
private HashMap aa2index; // aa -> index
private HashMap> standardResidueAAArrayMap; // std residue -> array of amino acids
@@ -66,8 +52,6 @@ public class AminoAcidSet implements Iterable {
private HashSet modResidueSet = new HashSet<>(); // set of symbols used for residues
private char nextResidue;
- // for enzyme
-// private ArrayList enzymeAAList;
private int neighboringAACleavageCredit = 0;
private int neighboringAACleavagePenalty = 0;
private int peptideCleavageCredit = 0;
@@ -76,9 +60,6 @@ public class AminoAcidSet implements Iterable {
AminoAcid lightestAA, heaviestAA;
- /**
- * This tracks user-friendly descriptions of the modifications in use
- */
private ArrayList modificationsInUse = new ArrayList<>();
private AminoAcidSet() // prevents instantiation
@@ -91,11 +72,6 @@ private AminoAcidSet() // prevents instantiation
nextResidue = 128;
}
- /**
- * Returns the list of amino acids specific to the position.
- *
- * @return list of intermediate amino acids.
- */
public ArrayList getAAList(Location location) {
return aaListMap.get(location);
}
@@ -120,39 +96,18 @@ public ArrayList getModificationsInUse() {
return modificationsInUse;
}
- /**
- * Returns the iterator of anywhere amino acids
- */
public Iterator iterator() {
return aaListMap.get(Location.Anywhere).iterator();
}
- /**
- * Returns the size of amino acid depending on the location.
- *
- * @param location amino acid location
- * @return
- */
public int size(Location location) {
return aaListMap.get(location).size();
}
- /**
- * Returns the size of anywhere amino acids
- *
- * @return the size of anywhere amino acids
- */
public int size() {
return aaListMap.get(Location.Anywhere).size();
}
- /**
- * Retrieve an array of amino acids given the specific standard residue.
- *
- * @param location amino acid location
- * @param standardAAResidue the standard residue to look up
- * @return the array of amino acids or an empty array otherwise
- */
public AminoAcid[] getAminoAcids(Location location, char standardAAResidue) {
AminoAcid[] matches = standardResidueAAArrayMap.get(location).get(standardAAResidue);
if (matches != null)
@@ -161,44 +116,20 @@ public AminoAcid[] getAminoAcids(Location location, char standardAAResidue) {
return EMPTY_AA_ARRAY;
}
- /**
- * Retrieve an array of amino acids given the specific nominal mass.
- *
- * @param location amino acid location
- * @param nominalMass nominal mass to look up
- * @return the array of amino acids or an empty list otherwise
- */
public AminoAcid[] getAminoAcids(Location location, int nominalMass) {
AminoAcid[] matches = nominalMass2aa.get(location).get(nominalMass);
if (matches != null) return matches;
return EMPTY_AA_ARRAY;
}
- /**
- * Retrieve an array of amino acids given the specific nominal mass.
- *
- * @param nominalMass the mass to look up
- * @return the array of amino acids or an empty list otherwise
- */
public AminoAcid[] getAminoAcids(int nominalMass) {
return getAminoAcids(Location.Anywhere, nominalMass);
}
- /**
- * Checks whether a residue belongs to this amino acid set
- *
- * @param residue a residue
- * @return true if residue belongs to the amino acid set
- */
public boolean contains(char residue) {
return residueMap.containsKey(residue);
}
- /**
- * Returns a list of all residues without mods
- *
- * @return
- */
public ArrayList getResidueListWithoutMods() {
ArrayList residues = new ArrayList<>();
for (Map.Entry aa : residueMap.entrySet()) {
@@ -210,22 +141,10 @@ public ArrayList getResidueListWithoutMods() {
return residues;
}
- /**
- * Returns a list of all residues, including modified residues
- *
- * @return
- */
public ArrayList getResidueList() {
return new ArrayList<>(residueMap.keySet());
}
- /**
- * Get the amino acid mass of the residue.
- *
- * @param residue the amino acid mass. Use uppercase for standard aa (convention).
- * this method is case sensitive.
- * @return the amino acid object. null if no aa corresponding to the residue
- */
public AminoAcid getAminoAcid(Location location, char residue) {
AminoAcid[] aaArr = getAminoAcids(location, residue);
for (AminoAcid aa : aaArr)
@@ -234,60 +153,26 @@ public AminoAcid getAminoAcid(Location location, char residue) {
return null;
}
- /**
- * Get the amino acid mass of the residue.
- *
- * @param residue the amino acid mass. Use uppercase for standard aa (convention).
- * this method is case sensitive.
- * @return the amino acid object. null if no aa corresponding to the residue
- */
public AminoAcid getAminoAcid(char residue) {
return residueMap.get(residue);
}
- /**
- * Set the number of allowable variable modifications per peptide
- *
- * @param maxNumberOfVariableModificationsPerPeptide the number of allowable variable modifications per peptide
- */
public void setMaxNumberOfVariableModificationsPerPeptide(int maxNumberOfVariableModificationsPerPeptide) {
this.maxNumberOfVariableModificationsPerPeptide = maxNumberOfVariableModificationsPerPeptide;
}
- /**
- * Get the number of allowable variable modifications per peptide
- *
- * @return the number of allowable variable modifications per peptide
- */
public int getMaxNumberOfVariableModificationsPerPeptide() {
return this.maxNumberOfVariableModificationsPerPeptide;
}
- /**
- * Get all amino acids for all locations.
- *
- * @return an array of all amino acids.
- */
public AminoAcid[] getAllAminoAcidArr() {
return this.allAminoAcidArr;
}
- /**
- * Get the amino acid corresponding to the index
- *
- * @param index amino acid index
- * @return amino acid object
- */
public AminoAcid getAminoAcid(int index) {
return allAminoAcidArr[index];
}
- /**
- * Get the index of the aa
- *
- * @param aa amino acid
- * @return the index of aa. null if aa does not belong to this amino acid set
- */
public int getIndex(AminoAcid aa) {
Integer index = aa2index.get(aa);
if (index == null)
@@ -295,12 +180,6 @@ public int getIndex(AminoAcid aa) {
return index;
}
- /**
- * Get the peptide corresponding to the string sequence.
- *
- * @param sequence sequence of the peptide.
- * @return peptide object of the sequence
- */
public Peptide getPeptide(String sequence) {
boolean isModified = false;
ArrayList aaArray = new ArrayList<>();
@@ -766,13 +645,10 @@ private AminoAcidSet finalizeSet() {
private static AminoAcidSet standardAASetWithCarbamidomethylatedCysWithTerm = null;
/**
- * Load modification definitions from a text file and associate with amino acids
- *
- * @param modFilePath Path to the mods.txt file
- * @param paramManager Parameter manager
- * @return
+ * Load modification definitions from a text file and associate with amino acids.
+ * Updates {@code opts.maxNumMods} if the mod metadata declares a different value.
*/
- public static AminoAcidSet getAminoAcidSetFromModFile(String modFilePath, ParamManager paramManager) {
+ public static AminoAcidSet getAminoAcidSetFromModFile(String modFilePath, MSGFPlusOptions opts) {
BufferedLineReader reader = null;
File modFile = new File(modFilePath);
@@ -789,8 +665,7 @@ public static AminoAcidSet getAminoAcidSetFromModFile(String modFilePath, ParamM
String dataLine;
String sourceFileName = modFile.getName();
int lineNum = 0;
- int maxNumMods = paramManager.getMaxNumModsPerPeptide();
- ModificationMetadata modMetadata = new ModificationMetadata(maxNumMods);
+ ModificationMetadata modMetadata = new ModificationMetadata(opts.effectiveMaxNumMods());
while ((dataLine = reader.readLine()) != null) {
lineNum++;
@@ -800,7 +675,7 @@ public static AminoAcidSet getAminoAcidSetFromModFile(String modFilePath, ParamM
}
}
- AminoAcidSet aaSet = getAminoAcidSetAndUpdateParams(mods, customAA, modMetadata, paramManager);
+ AminoAcidSet aaSet = buildAndSyncMaxNumMods(mods, customAA, modMetadata, opts);
try {
reader.close();
@@ -811,68 +686,53 @@ public static AminoAcidSet getAminoAcidSetFromModFile(String modFilePath, ParamM
}
/**
- * Associate modification definitions read from a MSGF+ parameter file with amino acids
- *
- * @param modConfigFilePath
- * @param customAAByLine Hashtable where keys are the line number in the MSGF+ parameter file and values are the text from the given line
- * @param modsByLine Hashtable where keys are the line number in the MSGF+ parameter file and values are the text from the given line
- * @param paramManager Parameter manager
- * @return AminoAcidSet
+ * Build an {@link AminoAcidSet} from {@code CustomAA=}, {@code StaticMod=},
+ * and {@code DynamicMod=} entries collected from a config file. Replaces
+ * the legacy {@code getAminoAcidSetFromList(String, Hashtable, Hashtable, ParamManager)}
+ * that took line-number-keyed hashtables; the {@link MSGFPlusOptions}-based
+ * config-file overlay collects entries as ordered Lists.
*/
- public static AminoAcidSet getAminoAcidSetFromList(
- String modConfigFilePath,
- Hashtable customAAByLine,
- Hashtable modsByLine,
- ParamManager paramManager) {
+ public static AminoAcidSet getAminoAcidSetFromModEntries(
+ String configName,
+ List customAAEntries,
+ List modEntries,
+ MSGFPlusOptions opts) {
ArrayList mods = new ArrayList<>();
ArrayList customAA = new ArrayList<>();
- int maxNumMods = paramManager.getMaxNumModsPerPeptide();
- ModificationMetadata modMetadata = new ModificationMetadata(maxNumMods);
+ ModificationMetadata modMetadata = new ModificationMetadata(opts.effectiveMaxNumMods());
- // First parse any custom amino acid definitions
- customAAByLine.forEach((lineNum, dataLine) -> {
- boolean success = parseConfigEntry(modConfigFilePath, lineNum, dataLine, mods, customAA, modMetadata);
- if (!success) {
+ for (int i = 0; i < customAAEntries.size(); i++) {
+ // parseConfigEntry expects bare comma-separated mod definitions, not
+ // a "Key=value" line. MSGFPlusOptions.applyConfigEntry already strips
+ // the "CustomAA=" prefix when populating opts.customAAs.
+ if (!parseConfigEntry(configName, i + 1, customAAEntries.get(i), mods, customAA, modMetadata)) {
System.exit(-1);
}
- });
-
- // Now parse the static and dynamic mods
- modsByLine.forEach((lineNum, dataLine) -> {
- boolean success = parseConfigEntry(modConfigFilePath, lineNum, dataLine, mods, customAA, modMetadata);
- if (!success) {
+ }
+ for (int i = 0; i < modEntries.size(); i++) {
+ if (!parseConfigEntry(configName, i + 1, modEntries.get(i), mods, customAA, modMetadata)) {
System.exit(-1);
}
- });
-
- AminoAcidSet aaSet = getAminoAcidSetAndUpdateParams(mods, customAA, modMetadata, paramManager);
+ }
- return aaSet;
+ return buildAndSyncMaxNumMods(mods, customAA, modMetadata, opts);
}
- /**
- * @param mods Modification definitions
- * @param customAA Custom amino acids
- * @param modMetadata Modification metadata, which may have an updated maxNumModsPerPeptide value read from a mods.txt file
- * @param paramManager Parameter manager
- * @return AminoAcidSet
- */
- private static AminoAcidSet getAminoAcidSetAndUpdateParams(
+ /** Builds the {@link AminoAcidSet} and propagates the metadata's
+ * {@code maxNumModsPerPeptide} to {@code opts.maxNumMods}. */
+ private static AminoAcidSet buildAndSyncMaxNumMods(
ArrayList mods,
ArrayList customAA,
ModificationMetadata modMetadata,
- ParamManager paramManager) {
+ MSGFPlusOptions opts) {
AminoAcidSet aaSet = AminoAcidSet.getAminoAcidSet(mods, customAA);
-
int maxNumMods = modMetadata.getMaxNumModsPerPeptide();
- if (maxNumMods != paramManager.getMaxNumModsPerPeptide()) {
- paramManager.setMaxNumMods(maxNumMods);
+ if (maxNumMods != opts.effectiveMaxNumMods()) {
+ opts.setMaxNumModsFromMetadata(maxNumMods);
}
-
aaSet.setMaxNumberOfVariableModificationsPerPeptide(maxNumMods);
-
return aaSet;
}
@@ -884,7 +744,7 @@ private static boolean parseConfigEntry(
ArrayList customAA,
ModificationMetadata modMetadata) {
- String modSetting = SearchParams.getConfigLineWithoutComment(dataLine);
+ String modSetting = MSGFPlusOptions.stripComment(dataLine);
if (modSetting.length() == 0) {
return true;
}
@@ -1731,13 +1591,6 @@ private void updateAAListMapWithFixedModAA(
aaListMap.put(loc, new ArrayList<>(newAAList));
}
- public static void main(String argv[]) {
- ParamManager paramManager = new ParamManager("MS-GF+ AminoAcidSet", MSGFPlus.VERSION, MSGFPlus.RELEASE_DATE, "n/a");
- Path modFilePath = Paths.get(System.getProperty("user.home") + "Research", "Data", "Debug", "mods.txt");
- AminoAcidSet aaSet = AminoAcidSet.getAminoAcidSetFromModFile(modFilePath.toString(), paramManager);
- aaSet.printAASet();
- }
-
private static class ModificationMetadata {
public ModificationMetadata(int maxNumModsPerPeptide) {
this.maxNumModsPerPeptide = maxNumModsPerPeptide;
diff --git a/src/main/java/edu/ucsd/msjava/msutil/Annotation.java b/src/main/java/edu/ucsd/msjava/msutil/Annotation.java
index 2a5874cb..2e60e689 100644
--- a/src/main/java/edu/ucsd/msjava/msutil/Annotation.java
+++ b/src/main/java/edu/ucsd/msjava/msutil/Annotation.java
@@ -1,68 +1,28 @@
package edu.ucsd.msjava.msutil;
-public class Annotation {
- private AminoAcid prevAA;
- private Peptide peptide;
- private AminoAcid nextAA;
-
- public Annotation(AminoAcid prevAA, Peptide peptide, AminoAcid nextAA) {
- this.prevAA = prevAA;
- this.peptide = peptide;
- this.nextAA = nextAA;
- }
+public record Annotation(AminoAcid prevAA, Peptide peptide, AminoAcid nextAA) {
public Annotation(String annotationStr, AminoAcidSet aaSet) {
- String pepStr = annotationStr.substring(annotationStr.indexOf('.') + 1, annotationStr.lastIndexOf('.'));
- char prevAAResidue = annotationStr.charAt(0);
- char nextAAResidue = annotationStr.charAt(annotationStr.length() - 1);
-
- prevAA = aaSet.getAminoAcid(prevAAResidue);
- peptide = aaSet.getPeptide(pepStr);
- nextAA = aaSet.getAminoAcid(nextAAResidue);
- }
-
- public boolean isProteinNTerm() {
- return prevAA == null;
- }
-
- public boolean isProteinCTerm() {
- return nextAA == null;
- }
-
- public AminoAcid getPrevAA() {
- return prevAA;
- }
-
- public void setPrevAA(AminoAcid prevAA) {
- this.prevAA = prevAA;
- }
-
- public Peptide getPeptide() {
- return peptide;
- }
-
- public void setPeptide(Peptide peptide) {
- this.peptide = peptide;
- }
-
- public AminoAcid getNextAA() {
- return nextAA;
- }
-
- public void setNextAA(AminoAcid nextAA) {
- this.nextAA = nextAA;
- }
-
- @Override
- public String toString() {
- if (peptide == null)
- return null;
- StringBuffer output = new StringBuffer();
- if (prevAA != null)
- output.append(prevAA.getResidueStr());
- output.append("." + peptide.toString() + ".");
- if (nextAA != null)
- output.append(nextAA.getResidueStr());
+ this(
+ aaSet.getAminoAcid(annotationStr.charAt(0)),
+ aaSet.getPeptide(annotationStr.substring(annotationStr.indexOf('.') + 1, annotationStr.lastIndexOf('.'))),
+ aaSet.getAminoAcid(annotationStr.charAt(annotationStr.length() - 1))
+ );
+ }
+
+ public boolean isProteinNTerm() { return prevAA == null; }
+ public boolean isProteinCTerm() { return nextAA == null; }
+
+ public AminoAcid getPrevAA() { return prevAA; }
+ public Peptide getPeptide() { return peptide; }
+ public AminoAcid getNextAA() { return nextAA; }
+
+ @Override public String toString() {
+ if (peptide == null) return null;
+ StringBuilder output = new StringBuilder();
+ if (prevAA != null) output.append(prevAA.getResidueStr());
+ output.append('.').append(peptide).append('.');
+ if (nextAA != null) output.append(nextAA.getResidueStr());
return output.toString();
}
}
diff --git a/src/main/java/edu/ucsd/msjava/msutil/Atom.java b/src/main/java/edu/ucsd/msjava/msutil/Atom.java
index c5213ca0..c5b149c2 100644
--- a/src/main/java/edu/ucsd/msjava/msutil/Atom.java
+++ b/src/main/java/edu/ucsd/msjava/msutil/Atom.java
@@ -2,46 +2,16 @@
import java.util.HashMap;
-public class Atom {
- public Atom(String code, double mass, int nominalMass, String name) {
- this.code = code;
- this.mass = mass;
- this.nominalMass = nominalMass;
- this.name = name;
- }
-
- public String getCode() {
- return code;
- }
-
- public String getName() {
- return name;
- }
-
- public double getMass() {
- return mass;
- }
+public record Atom(String code, double mass, int nominalMass, String name) {
- public int getNominalMass() {
- return nominalMass;
- }
-
- public static Atom[] getAtomarr() {
- return atomArr;
- }
-
- public static HashMap getAtomMap() {
- return atomMap;
- }
-
- public static Atom get(String code) {
- return atomMap.get(code);
- }
+ public String getCode() { return code; }
+ public String getName() { return name; }
+ public double getMass() { return mass; }
+ public int getNominalMass() { return nominalMass; }
- private final String code;
- private final String name;
- private final double mass;
- private final int nominalMass;
+ public static Atom[] getAtomarr() { return atomArr; }
+ public static HashMap