Fix annotationt issue #16

Merged
tonywu1999 merged 7 commits into devel from Fix-Annotationt-Issue on May 7, 2026

Conversation

@Rudhik1904 (Contributor) commented May 4, 2026

Motivation and Context

  1. Annotation never got attached
    bigDIANNtoMSstatsFormat accepted annotation but never called the merge step → output had no Condition/BioReplicate, breaking downstream MSstats.

Fix: Wire MSstatsAddAnnotationBig in after preprocessing. Harden the helper to drop overlapping non-key columns before joining (uses dplyr::tbl_vars() so it works on Arrow/lazy/data frames). R/converters.R
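
A minimal sketch of that overlap-drop pattern (the function signature is an assumption; only dplyr::tbl_vars(), the Run key, and the join step come from this PR):

library(dplyr)

add_annotation_big <- function(input, annotation, key = "Run") {
  # tbl_vars() returns column names for data frames, lazy queries, and Arrow data
  overlap <- setdiff(intersect(dplyr::tbl_vars(input), colnames(annotation)), key)
  if (length(overlap) > 0) {
    # drop stale copies of annotation columns so the join can't duplicate them
    input <- dplyr::select(input, -dplyr::all_of(overlap))
  }
  dplyr::inner_join(input, annotation, by = key)
}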

  2. Arrow backend crashed on real data
    na.rm <- TRUE (assignment) silently ran mean() with NAs → corrupted feature ranking.
    arrow::write_csv_arrow rejects lazy arrow_dplyr_query → write step failed on real reports.
    All-NA columns get inferred as null type → downstream ops crash.
    Fix: Typo → na.rm = TRUE. Switch writes to arrow::write_dataset(..., format = "csv") (accepts lazy). Detect null-typed columns and cast/drop them lazily. R/backends.R, R/converters.R

  3. DIANN 2.0 parquet unsupported
    DIANN 2.0 emits parquet by default. The cleaner used read_delim_chunked (text only) and sniffed delimiter from the filename → garbage on parquet.

Fix: New parquet branch in reduceBigDIANN that streams via arrow::open_dataset() + Scanner$create(batch_size = 1e6) + RecordBatchReader, reusing the same per-chunk callback as the TSV path. The delimiter sniffer now reads the first line instead of the filename. R/clean_DIANN.R
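
A sketch of that streaming shape, assuming a per-chunk callback like the one described (names other than the arrow and readr calls are illustrative):

library(arrow)

stream_diann <- function(path, chunk_callback, batch_size = 1e6) {
  if (grepl("\\.parquet$", path, ignore.case = TRUE)) {
    ds      <- arrow::open_dataset(path, format = "parquet")
    scanner <- arrow::Scanner$create(ds, batch_size = batch_size)
    reader  <- scanner$ToRecordBatchReader()
    # stream record batches so the full report never sits in memory
    while (!is.null(batch <- reader$read_next_batch())) {
      chunk_callback(as.data.frame(batch))
    }
  } else {
    # sniff the delimiter from the first line, not the file name
    first <- readLines(path, n = 1)
    delim <- if (grepl("\t", first)) "\t" else if (grepl(";", first)) ";" else ","
    readr::read_delim_chunked(
      path,
      callback = readr::SideEffectChunkCallback$new(function(x, pos) chunk_callback(x)),
      delim = delim, chunk_size = batch_size
    )
  }
}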

Shiny caveat: pass quantificationColumn = "auto" on parquet — DIANN 2.0 has Fr.N.Quantity columns instead of FragmentQuantCorrected.

Smaller fixes

  • Positional-arg bug in bigSpectronauttoMSstatsFormat (aggregate_psms was landing in filter_unique_peptides's slot) — fixed by switching to named args.
  • Sparklyr branch passed input where input_file was expected.
  • Test cleanup: file.remove → unlink(recursive = TRUE) (arrow writes directories).
  • Removed three committed topN_*.csv artifacts.

Changes

Three connected fixes to the DIANN out-of-memory pipeline:

  • Annotation actually gets merged (the eponymous fix).
  • Arrow backend no longer crashes on all-NA columns or eager writes (sketched below).
  • DIANN 2.0 parquet input works.
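
For the second fix, a sketch of the all-NA-column guard plus the lazy write (the paths and the cast-to-character choice are illustrative assumptions):

library(arrow)
library(dplyr)

ds <- arrow::open_dataset("reduce_output_example", format = "csv")

# columns whose type was inferred as null (every value NA) crash later operations
types     <- vapply(ds$schema$fields, function(f) f$type$ToString(), character(1))
null_cols <- ds$schema$names[types == "null"]

# cast lazily; nothing is materialized until the write below
cleaned <- ds |>
  dplyr::mutate(dplyr::across(dplyr::all_of(null_cols), as.character))

# write_dataset() accepts a lazy arrow_dplyr_query; write_csv_arrow() does not
arrow::write_dataset(cleaned, "cleaned_example", format = "csv")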

Test coverage

  • Deleted tests/testthat/test-clean_DIANN.R. It mocked MSstatsMakeAnnotation inside cleanDIANNChunk — but annotation now happens after preprocessing, not during chunking, so the test was asserting behavior that no longer exists.
  • Added bigDIANNtoMSstatsFormat works with annotation in tests/testthat/test-diann_converter.R — exercises the new merge path end-to-end (sketched below), asserts Condition/BioReplicate land in the output and map correctly per Run.
  • Two real-fixture tests (DIANN 1.x TSV and DIANN 2.0 parquet, both pulled from MSstatsConvert/inst/tinytest) are present in test-converters.R but commented out — they reference paths under /Users/rudhikshah/... so they'd fail on CI or any other contributor's machine.
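
A sketch of that added test's shape (fixture setup is omitted; input_file and output_file are hypothetical, and only the converter name and the asserted columns come from this PR):

test_that("bigDIANNtoMSstatsFormat works with annotation", {
  annotation <- data.frame(
    Run          = c("run1.raw", "run2.raw"),
    Condition    = c("Control", "Treated"),
    BioReplicate = c(1, 2)
  )
  msstats_data <- bigDIANNtoMSstatsFormat(
    input_file       = input_file,    # hypothetical DIANN report fixture
    output_file_name = output_file,
    annotation       = annotation
  )
  result <- as.data.frame(msstats_data)
  expect_true(all(c("Condition", "BioReplicate") %in% colnames(result)))
  # every Run should map to exactly the Condition supplied in the annotation
  expect_equal(nrow(unique(result[, c("Run", "Condition")])), nrow(annotation))

  # arrow writes directories, so cleanup needs unlink(recursive = TRUE)
  unlink(paste0("cleaned_", output_file), recursive = TRUE, force = TRUE)
})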

Checklist Before Requesting a Review

  • [x] I have read the MSstats contributing guidelines
  • [x] My changes generate no new warnings
  • [x] Any dependent changes have been merged and published in downstream modules

Context and Solution

This PR fixes missing annotation merging for bigDIANN workflows and multiple Arrow backend issues that broke preprocessing and downstream MSstats steps. It ensures run-level annotation (Condition, BioReplicate) is actually merged, makes Arrow-backed processing robust to lazy queries and null-typed (all-NA) columns, and adds explicit parquet support for DIANN 2.0 by streaming record batches. The annotation merge was moved to after preprocessing and MSstatsAddAnnotationBig was hardened to drop overlapping non-key columns before joining.

Detailed Changes

  • R/backends.R

    • Post-process Arrow CSV schema to detect columns inferred as null and cast or drop them lazily before computation.
    • Explicitly cast PrecursorMz to numeric when present.
    • Fix typo: use na.rm = TRUE when computing mean to ignore NAs.
    • Replace arrow::write_csv_arrow() with arrow::write_dataset(..., format = "csv") for both topN and final outputs to accept lazy arrow_dplyr_query.
  • R/clean_DIANN.R

    • Add DIANN 2.0 parquet support: use arrow::open_dataset(), Scanner$create(batch_size = 1e6) and RecordBatchReader to stream batches and reuse the existing per-chunk cleaner callback.
    • Change delimiter detection for text inputs to sniff the first line for a tab, comma, or semicolon.
    • Define a single diann_chunk callback used for both parquet and delimited-text paths; dispatch reader based on file extension.
  • R/converters.R

    • Fix Sparklyr backend invocation: use correct named input_file argument.
    • Call MSstatsPreprocessBig with named arguments in bigFragPipetoMSstatsFormat and bigSpectronauttoMSstatsFormat (positional-arg bug fixed).
    • bigDIANNtoMSstatsFormat:
      • Open reduced Arrow CSV dataset and detect/drop/cast null-typed (all-NA) columns via lazy dplyr operations before preprocessing.
      • Write a cleaned Arrow dataset directory for preprocessing.
      • Move annotation merging to after preprocessing: call MSstatsAddAnnotationBig post-preprocessing.
    • MSstatsAddAnnotationBig:
      • Compute overlapping columns between input and annotation (excluding Run), drop overlapping non-key columns from input, then inner_join on Run to avoid duplicated columns.
  • R/utils.R

    • Add internal helper .prefixedPath(prefix, path) to construct intermediate output paths by prefixing only the basename while preserving the directory path; see the sketch after this list.
  • Tests & artifacts

    • Remove committed topN_*.csv artifacts.
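
A minimal sketch of the .prefixedPath helper named above (the real internal may differ):

.prefixedPath <- function(prefix, path) {
  # prefix only the file name; keep the directory component intact
  file.path(dirname(path), paste0(prefix, basename(path)))
}

.prefixedPath("reduce_output_", "subdir/out.csv")  # "subdir/reduce_output_out.csv"
.prefixedPath("cleaned_", "/tmp/out.csv")          # "/tmp/cleaned_out.csv"

This avoids the path corruption flagged in review, where a plain paste0("reduce_output_", output_file_name) turns /tmp/out.csv into reduce_output_/tmp/out.csv.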

Unit Tests Added or Modified

  • Deleted test:

    • tests/testthat/test-clean_DIANN.R — removed test asserting annotation was passed in cleanDIANNChunk (annotation is now applied post-preprocessing).
  • Added test:

    • tests/testthat/test-diann_converter.R — end-to-end test for bigDIANNtoMSstatsFormat that supplies run-level annotation, asserts Condition and BioReplicate columns are present and correctly mapped by Run, and removes created cleaned_ output directory.
  • Modified tests:

    • tests/testthat/test-converters.R — updated cleanup to use unlink(..., recursive = TRUE, force = TRUE) to remove temporary CSVs and Arrow output directories; handle Arrow-created directories like reduce_output_<output_file>.
  • Note: Two real-fixture DIANN tests (DIANN 1.x TSV and DIANN 2.0 parquet) remain present but commented out due to hardcoded local paths.

Coding Guidelines Violations

  • None detected. Changes use named arguments for clarity, handle lazy dplyr/Arrow workflows appropriately, and add a focused helper for path prefixing.

…ot handled in backend properly. Also, there were other errors in the backend:

  • na.rm <- TRUE should be na.rm = TRUE
  • arrow::write_csv_arrow doesn't accept lazy Arrow queries

Fixing those things and also updating MSstatsAddAnnotationBig to use it in bigDIANNtoMSstatsFormat.
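
As a base-R illustration of why the assignment form misbehaves (toy vector, not the package's code):

x <- c(1, 2, NA)
mean(x, na.rm = TRUE)   # 1.5: NA removed as intended
mean(x, na.rm <- TRUE)  # NA: the assignment evaluates to TRUE and is matched
                        # positionally to `trim`, so na.rm keeps its default FALSE
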
Rudhik1904 requested a review from tonywu1999 on May 4, 2026 06:38
coderabbitai (Bot) commented May 4, 2026

Warning: Rate limit exceeded

@Rudhik1904 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 56 minutes and 48 seconds before requesting another review.

ℹ️ Review info

Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 57f5a22b-5505-4265-94e7-8e3c9b1e8c08

📥 Commits

Reviewing files that changed from the base of the PR and between 7e83290 and 98caf27.

📒 Files selected for processing (1)
  • tests/testthat/test-diann_converter.R
🚥 Pre-merge checks: ✅ 4 passed, ❌ 1 inconclusive

❌ Failed checks (1 inconclusive)

  • Title check (❓ Inconclusive): The title contains a typo ('annotationt' instead of 'annotation') and is vague; it does not clearly summarize the multi-faceted changes (annotation merging, Arrow backend fixes, DIANN 2.0 parquet support). Resolution: correct the typo and use a more specific title, e.g., 'Fix annotation merging in bigDIANNtoMSstatsFormat and Arrow backend issues'.

✅ Passed checks (4 passed)

  • Description check: The description includes comprehensive motivation, context, and detailed bullet points covering all three main fixes and testing changes, meeting the template requirements.
  • Docstring Coverage: No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
  • Linked Issues check: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Check skipped because no linked issues were found for this pull request.


coderabbitai (Bot) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
tests/testthat/test-diann_converter.R (1)

170-174: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use unlink() for Arrow outputs in both cleanup blocks.

output_file is now a CSV dataset directory, so file.remove() leaves it behind. The first test also skips cleanup for topN_..., which this code path now writes. Residual directories can leak across reruns and make the suite flaky.

Suggested fix
   # Cleanup
   file.remove(input_file)
-  if (file.exists(output_file)) file.remove(output_file)
-  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
+  unlink(output_file, recursive = TRUE, force = TRUE)
+  unlink(paste0("reduce_output_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("topN_", output_file), recursive = TRUE, force = TRUE)
@@
   # Cleanup
   file.remove(input_file)
-  if (file.exists(output_file)) file.remove(output_file)
-  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
-  if (dir.exists(paste0("cleaned_", output_file))) unlink(paste0("cleaned_", output_file), recursive = TRUE)
+  unlink(output_file, recursive = TRUE, force = TRUE)
+  unlink(paste0("reduce_output_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("cleaned_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("topN_", output_file), recursive = TRUE, force = TRUE)

Also applies to: 208-212

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/testthat/test-diann_converter.R` around lines 170 - 174, The cleanup
uses file.remove() on output_file (now a CSV dataset directory) and its reduced
variant so directories are left behind; replace those file.remove(...) calls
with unlink(..., recursive = TRUE) to remove directories created by the tests
(keep file.remove for input_file if it is a single file), and add the same
unlink(recursive=TRUE) handling for the topN_* outputs that the first test
writes; apply the same changes in the second cleanup block referenced (lines
~208-212) so all dataset directories are removed between runs.
R/converters.R (1)

149-164: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Construct intermediate converter paths with dirname()/basename().

These helpers still prefix the full output_file_name string, so nested or absolute output paths get corrupted. For example, subdir/out.csv turns into reduce_output_subdir/out.csv, and /tmp/out.csv turns into reduce_output_/tmp/out.csv. The same problem exists for cleaned_....

Suggested fix
-  reduceBigSpectronaut(input_file, paste0("reduce_output_", output_file_name),
+  reduced_file <- file.path(
+    dirname(output_file_name),
+    paste0("reduce_output_", basename(output_file_name))
+  )
+  reduceBigSpectronaut(input_file, reduced_file,
                        intensity, filter_by_excluded, filter_by_identified,
                        filter_by_qvalue, qvalue_cutoff,
                        calculateAnomalyScores, anomalyModelFeatures)
   msstats_data <- MSstatsPreprocessBig(
-    input_file = paste0("reduce_output_", output_file_name),
+    input_file = reduced_file,
     output_file_name = output_file_name,
     backend = backend,
@@
-  reduced_file <- paste0("reduce_output_", output_file_name)
+  reduced_file <- file.path(
+    dirname(output_file_name),
+    paste0("reduce_output_", basename(output_file_name))
+  )
@@
-    cleaned_file <- paste0("cleaned_", output_file_name)
+    cleaned_file <- file.path(
+      dirname(output_file_name),
+      paste0("cleaned_", basename(output_file_name))
+    )

Based on learnings: construct intermediate paths with file.path(dirname(output_file_name), paste0("prefix_", basename(output_file_name))) so directory-valued outputs resolve correctly.

Also applies to: 208-225

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/converters.R` around lines 149 - 164, The intermediate output filenames are
built by naively prefixing output_file_name (causing wrong paths for nested or
absolute names); update the calls that create intermediate paths (the
reduceBigSpectronaut input and MSstatsPreprocessBig input_file args as well as
any "cleaned_..." constructions) to use file.path(dirname(output_file_name),
paste0("reduce_output_", basename(output_file_name))) (and similarly for
"cleaned_") so directory components are preserved; locate usages around
reduceBigSpectronaut and MSstatsPreprocessBig and replace the plain
paste0(prefix, output_file_name) with the dirname/basename-based file.path
construction.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@R/backends.R`:
- Around line 127-131: The path passed to arrow::write_dataset is built
incorrectly using paste0("topN_", output_file_name), which mangles paths; change
it to construct the new filename by combining dirname(output_file_name) and
paste0("topN_", basename(output_file_name)) (or handle root/empty dirname
appropriately) before calling arrow::write_dataset so the "topN_" prefix is
applied only to the file name and the directory is preserved; update the code
around the arrow::write_dataset call that currently uses output_file_name and
ensure the computed path variable is used instead.

In `@R/converters.R`:
- Around line 229-245: MSstatsPreprocessBig currently writes the processed table
to output_file_name before the annotation merge, so calling
MSstatsAddAnnotationBig only updates the in-memory msstats_data; to fix, after
MSstatsAddAnnotationBig(msstats_data, annotation) persist the updated
msstats_data back to the same output_file_name using the same write/save pathway
used by MSstatsPreprocessBig (respecting backend/connection/max_feature_count
behavior), ensuring callers that reopen output_file_name see the annotated
Condition/BioReplicate fields.


ℹ️ Review info

Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9f8d0507-056a-4b70-b621-bd327ee02dfc

📥 Commits

Reviewing files that changed from the base of the PR and between ce69a87 and 89a1171.

⛔ Files ignored due to path filters (3)
  • tests/testthat/topN_preprocess_output.csv is excluded by !**/*.csv
  • tests/testthat/topN_spectro_output.csv is excluded by !**/*.csv
  • tests/testthat/topN_test_diann_output.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
  • R/backends.R
  • R/clean_DIANN.R
  • R/converters.R
  • tests/testthat/test-clean_DIANN.R
  • tests/testthat/test-converters.R
  • tests/testthat/test-diann_converter.R
💤 Files with no reviewable changes (1)
  • tests/testthat/test-clean_DIANN.R

Comment on lines +242 to +245
  # Merge annotation with the preprocessed data
  if (!is.null(annotation)) {
    msstats_data <- MSstatsAddAnnotationBig(msstats_data, annotation)
  }

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preserve the documented “annotation file” input path.

bigDIANNtoMSstatsFormat() still documents annotation as a file or data frame, but MSstatsAddAnnotationBig() now assumes a tabular object and immediately calls colnames()/inner_join() on it. Passing a file path will now fail instead of being normalized first.

Suggested fix
-  if (!is.null(annotation)) {
+  if (is.character(annotation) && length(annotation) == 1) {
+    annotation <- utils::read.csv(annotation, stringsAsFactors = FALSE)
+  }
+  if (!is.null(annotation)) {
     msstats_data <- MSstatsAddAnnotationBig(msstats_data, annotation)
-    }
+  }

Also applies to: 277-292

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@R/converters.R` around lines 242 - 245, The code assumes annotation is
already a table when calling MSstatsAddAnnotationBig(msstats_data, annotation);
update bigDIANNtoMSstatsFormat() to preserve the documented "annotation file"
behavior by detecting if annotation is a file path or a connection (e.g.,
is.character(annotation) && length(annotation)==1) and, if so, read it into a
tabular object (using a safe tabular reader like read.delim/read.table or
data.table::fread) and normalize column names before passing to
MSstatsAddAnnotationBig; also apply the same file/path-to-table normalization
for the other annotation handling block (the second occurrence referenced around
the later lines) so both places accept either a file path or a data.frame.

@mstaniak (Contributor) commented May 4, 2026

In a sense, this was good, because annotation is redundant for most processing steps and only becomes relevant at the testing step.

tonywu1999 merged commit a43b90b into devel on May 7, 2026 (2 checks passed)
tonywu1999 deleted the Fix-Annotationt-Issue branch on May 7, 2026 16:12