Fix annotation issue #16
Conversation
…ot handled in backend properly. There were also other errors in the backend, e.g. `na.rm <- TRUE` should be `na.rm = TRUE`, and `arrow::write_csv_arrow` doesn't accept lazy Arrow queries. This fixes those things and also updates `MSstatsAddAnnotationBig` so `bigDIANNtoMSstatsFormat` can use it.
Actionable comments posted: 3
Caution: some comments are outside the diff and can't be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
tests/testthat/test-diann_converter.R (1)
170-174: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use `unlink()` for Arrow outputs in both cleanup blocks.

`output_file` is now a CSV dataset directory, so `file.remove()` leaves it behind. The first test also skips cleanup for `topN_...`, which this code path now writes. Residual directories can leak across reruns and make the suite flaky.

Suggested fix
```diff
   # Cleanup
   file.remove(input_file)
-  if (file.exists(output_file)) file.remove(output_file)
-  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
+  unlink(output_file, recursive = TRUE, force = TRUE)
+  unlink(paste0("reduce_output_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("topN_", output_file), recursive = TRUE, force = TRUE)
@@
   # Cleanup
   file.remove(input_file)
-  if (file.exists(output_file)) file.remove(output_file)
-  if (file.exists(paste0("reduce_output_", output_file))) file.remove(paste0("reduce_output_", output_file))
-  if (dir.exists(paste0("cleaned_", output_file))) unlink(paste0("cleaned_", output_file), recursive = TRUE)
+  unlink(output_file, recursive = TRUE, force = TRUE)
+  unlink(paste0("reduce_output_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("cleaned_", output_file), recursive = TRUE, force = TRUE)
+  unlink(paste0("topN_", output_file), recursive = TRUE, force = TRUE)
```

Also applies to: 208-212
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/testthat/test-diann_converter.R` around lines 170-174: the cleanup uses `file.remove()` on `output_file` (now a CSV dataset directory) and its reduced variant, so directories are left behind. Replace those `file.remove(...)` calls with `unlink(..., recursive = TRUE)` to remove the directories created by the tests (keep `file.remove` for `input_file` if it is a single file), and add the same `unlink(recursive = TRUE)` handling for the `topN_*` outputs that the first test writes. Apply the same changes in the second cleanup block (lines ~208-212) so all dataset directories are removed between runs.

R/converters.R (1)
149-164: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Construct intermediate converter paths with `dirname()`/`basename()`.

These helpers still prefix the full `output_file_name` string, so nested or absolute output paths get corrupted. For example, `subdir/out.csv` turns into `reduce_output_subdir/out.csv`, and `/tmp/out.csv` turns into `reduce_output_/tmp/out.csv`. The same problem exists for `cleaned_...`.

Suggested fix
```diff
-    reduceBigSpectronaut(input_file, paste0("reduce_output_", output_file_name),
+    reduced_file <- file.path(
+      dirname(output_file_name),
+      paste0("reduce_output_", basename(output_file_name))
+    )
+    reduceBigSpectronaut(input_file, reduced_file,
                          intensity, filter_by_excluded, filter_by_identified,
                          filter_by_qvalue, qvalue_cutoff,
                          calculateAnomalyScores, anomalyModelFeatures)
     msstats_data <- MSstatsPreprocessBig(
-        input_file = paste0("reduce_output_", output_file_name),
+        input_file = reduced_file,
         output_file_name = output_file_name,
         backend = backend,
@@
-    reduced_file <- paste0("reduce_output_", output_file_name)
+    reduced_file <- file.path(
+      dirname(output_file_name),
+      paste0("reduce_output_", basename(output_file_name))
+    )
@@
-    cleaned_file <- paste0("cleaned_", output_file_name)
+    cleaned_file <- file.path(
+      dirname(output_file_name),
+      paste0("cleaned_", basename(output_file_name))
+    )
```
Based on learnings: construct intermediate paths with `file.path(dirname(output_file_name), paste0("prefix_", basename(output_file_name)))` so directory-valued outputs resolve correctly.

Also applies to: 208-225
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@R/converters.R` around lines 149 - 164, The intermediate output filenames are built by naively prefixing output_file_name (causing wrong paths for nested or absolute names); update the calls that create intermediate paths (the reduceBigSpectronaut input and MSstatsPreprocessBig input_file args as well as any "cleaned_..." constructions) to use file.path(dirname(output_file_name), paste0("reduce_output_", basename(output_file_name))) (and similarly for "cleaned_") so directory components are preserved; locate usages around reduceBigSpectronaut and MSstatsPreprocessBig and replace the plain paste0(prefix, output_file_name) with the dirname/basename-based file.path construction.
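To make the failure mode concrete, a small illustrative R snippet (not taken from the PR itself):

```r
# Naive prefixing corrupts the directory component:
paste0("reduce_output_", "subdir/out.csv")
#> [1] "reduce_output_subdir/out.csv"   # wrong directory

# Prefixing only the file name preserves it:
file.path(dirname("subdir/out.csv"),
          paste0("reduce_output_", basename("subdir/out.csv")))
#> [1] "subdir/reduce_output_out.csv"   # prefix lands on the file name only
```

For a bare file name like `out.csv`, `dirname()` returns `"."`, so the result is `./reduce_output_out.csv`, which still resolves to the working directory as before.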
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@R/backends.R`:
- Around line 127-131: The path passed to arrow::write_dataset is built
incorrectly using paste0("topN_", output_file_name), which mangles paths; change
it to construct the new filename by combining dirname(output_file_name) and
paste0("topN_", basename(output_file_name)) (or handle root/empty dirname
appropriately) before calling arrow::write_dataset so the "topN_" prefix is
applied only to the file name and the directory is preserved; update the code
around the arrow::write_dataset call that currently uses output_file_name and
ensure the computed path variable is used instead.
In `@R/converters.R`:
- Around line 242-245: The code assumes annotation is already a table when
calling MSstatsAddAnnotationBig(msstats_data, annotation); update
bigDIANNtoMSstatsFormat() to preserve the documented "annotation file" behavior
by detecting if annotation is a file path or a connection (e.g.,
is.character(annotation) && length(annotation)==1) and, if so, read it into a
tabular object (using a safe tabular reader like read.delim/read.table or
data.table::fread) and normalize column names before passing to
MSstatsAddAnnotationBig; also apply the same file/path-to-table normalization
for the other annotation handling block (the second occurrence referenced around
the later lines) so both places accept either a file path or a data.frame.
- Around line 229-245: MSstatsPreprocessBig currently writes the processed table
to output_file_name before the annotation merge, so calling
MSstatsAddAnnotationBig only updates the in-memory msstats_data; to fix, after
MSstatsAddAnnotationBig(msstats_data, annotation) persist the updated
msstats_data back to the same output_file_name using the same write/save pathway
used by MSstatsPreprocessBig (respecting backend/connection/max_feature_count
behavior), ensuring callers that reopen output_file_name see the annotated
Condition/BioReplicate fields.
---
Outside diff comments:
In `@R/converters.R`:
- Around line 149-164: The intermediate output filenames are built by naively
prefixing output_file_name (causing wrong paths for nested or absolute names);
update the calls that create intermediate paths (the reduceBigSpectronaut input
and MSstatsPreprocessBig input_file args as well as any "cleaned_..."
constructions) to use file.path(dirname(output_file_name),
paste0("reduce_output_", basename(output_file_name))) (and similarly for
"cleaned_") so directory components are preserved; locate usages around
reduceBigSpectronaut and MSstatsPreprocessBig and replace the plain
paste0(prefix, output_file_name) with the dirname/basename-based file.path
construction.
In `@tests/testthat/test-diann_converter.R`:
- Around line 170-174: The cleanup uses file.remove() on output_file (now a CSV
dataset directory) and its reduced variant so directories are left behind;
replace those file.remove(...) calls with unlink(..., recursive = TRUE) to
remove directories created by the tests (keep file.remove for input_file if it
is a single file), and add the same unlink(recursive=TRUE) handling for the
topN_* outputs that the first test writes; apply the same changes in the second
cleanup block referenced (lines ~208-212) so all dataset directories are removed
between runs.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9f8d0507-056a-4b70-b621-bd327ee02dfc
⛔ Files ignored due to path filters (3)
- `tests/testthat/topN_preprocess_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_spectro_output.csv` is excluded by `!**/*.csv`
- `tests/testthat/topN_test_diann_output.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (6)
- R/backends.R
- R/clean_DIANN.R
- R/converters.R
- tests/testthat/test-clean_DIANN.R
- tests/testthat/test-converters.R
- tests/testthat/test-diann_converter.R
💤 Files with no reviewable changes (1)
- tests/testthat/test-clean_DIANN.R
```r
# Merge annotation with the preprocessed data
if (!is.null(annotation)) {
    msstats_data <- MSstatsAddAnnotationBig(msstats_data, annotation)
}
```
Preserve the documented "annotation file" input path.

`bigDIANNtoMSstatsFormat()` still documents `annotation` as a file or data frame, but `MSstatsAddAnnotationBig()` now assumes a tabular object and immediately calls `colnames()`/`inner_join()` on it. Passing a file path will now fail instead of being normalized first.
Suggested fix
```diff
-    if (!is.null(annotation)) {
+    if (is.character(annotation) && length(annotation) == 1) {
+        annotation <- utils::read.csv(annotation, stringsAsFactors = FALSE)
+    }
+    if (!is.null(annotation)) {
         msstats_data <- MSstatsAddAnnotationBig(msstats_data, annotation)
-    }
+    }
```

Also applies to: 277-292
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@R/converters.R` around lines 242 - 245, The code assumes annotation is
already a table when calling MSstatsAddAnnotationBig(msstats_data, annotation);
update bigDIANNtoMSstatsFormat() to preserve the documented "annotation file"
behavior by detecting if annotation is a file path or a connection (e.g.,
is.character(annotation) && length(annotation)==1) and, if so, read it into a
tabular object (using a safe tabular reader like read.delim/read.table or
data.table::fread) and normalize column names before passing to
MSstatsAddAnnotationBig; also apply the same file/path-to-table normalization
for the other annotation handling block (the second occurrence referenced around
the later lines) so both places accept either a file path or a data.frame.
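With that normalization in place, both documented call styles would work. An illustrative sketch — the signature is abbreviated and the file name is a placeholder:

```r
# annotation given as a file path: read into a data frame first
bigDIANNtoMSstatsFormat(input_file, output_file_name,
                        annotation = "annotation.csv")

# annotation given as an already-tabular object: passed through unchanged
bigDIANNtoMSstatsFormat(input_file, output_file_name,
                        annotation = annotation_df)
```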
In a sense, this was good, because annotation is redundant for most processing steps and only becomes relevant at the testing step.
Motivation and Context
bigDIANNtoMSstatsFormat accepted annotation but never called the merge step → output had no Condition/BioReplicate, breaking downstream MSstats.
Fix: wire MSstatsAddAnnotationBig in after preprocessing. Harden the helper to drop overlapping non-key columns before joining (uses dplyr::tbl_vars() so it works on Arrow/lazy queries and data frames). R/converters.R
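A minimal sketch of that hardening, assuming Run is the join key (illustrative helper, not the package's exact code):

```r
library(dplyr)

add_annotation <- function(msstats_data, annotation, keys = "Run") {
  # Non-key columns present on both sides would get suffixed by the
  # join; drop them from the data side first so annotation wins.
  overlap <- setdiff(
    intersect(dplyr::tbl_vars(msstats_data), colnames(annotation)),
    keys
  )
  if (length(overlap) > 0) {
    msstats_data <- dplyr::select(msstats_data, -dplyr::all_of(overlap))
  }
  # tbl_vars() is generic over data frames and lazy/Arrow queries,
  # so this helper does not force the data into memory.
  dplyr::inner_join(msstats_data, annotation, by = keys)
}
```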
Arrow backend crashed on real data
na.rm <- TRUE (assignment) silently ran mean() with NAs → corrupted feature ranking (demonstrated in the sketch after this list).
arrow::write_csv_arrow rejects lazy arrow_dplyr_query → write step failed on real reports.
All-NA columns get inferred as null type → downstream ops crash.
Fix: typo → na.rm = TRUE. Switch writes to arrow::write_dataset(..., format = "csv"), which accepts lazy queries. Detect null-typed columns and cast/drop them lazily. R/backends.R, R/converters.R
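The first two fixes are easy to demonstrate in isolation. Illustrative sketches, not the package's code; the input paths and the Intensity column are placeholders:

```r
library(arrow)
library(dplyr)

# 1) `na.rm <- TRUE` inside a call is an assignment whose value is
#    passed positionally, so the na.rm parameter keeps its default.
f <- function(x, trim = 0, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  mean(x)
}
f(c(1, 2, NA), na.rm = TRUE)   # 1.5: named argument binds na.rm
f(c(1, 2, NA), na.rm <- TRUE)  # NA:  TRUE lands in `trim`; na.rm stays FALSE

# 2) write_csv_arrow() rejects a lazy arrow_dplyr_query, but
#    write_dataset(..., format = "csv") evaluates it while streaming.
ds   <- open_dataset("report_dir", format = "csv")  # placeholder input
lazy <- ds %>% filter(!is.na(Intensity))            # still unevaluated
write_dataset(lazy, "out_dir", format = "csv")      # writes a CSV dataset dir
```

Note the written output is a dataset directory, which is why the test cleanup had to move from file.remove() to unlink(recursive = TRUE).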
DIANN 2.0 parquet unsupported
DIANN 2.0 emits parquet by default. The cleaner used read_delim_chunked (text only) and sniffed the delimiter from the filename → garbage on parquet.
Fix: new parquet branch in reduceBigDIANN, streaming via arrow::open_dataset() + Scanner$create(batch_size = 1e6) + RecordBatchReader, with the same per-chunk callback as the TSV path. The delimiter sniffer now reads the first line (a streaming sketch follows the caveat below). R/clean_DIANN.R
Shiny caveat: pass quantificationColumn = "auto" on parquet — DIANN 2.0 has Fr.N.Quantity columns instead of FragmentQuantCorrected.
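A minimal sketch of that streaming pattern (file name and chunk callback are assumed; the real implementation lives in reduceBigDIANN):

```r
library(arrow)

ds      <- open_dataset("report.parquet", format = "parquet")
scanner <- Scanner$create(ds, batch_size = 1e6)
reader  <- scanner$ToRecordBatchReader()

process_chunk <- function(chunk) {
  # stand-in for the per-chunk cleaning callback shared with the TSV path
  message(nrow(chunk), " rows in this batch")
}

# read_next_batch() returns NULL once the dataset is exhausted,
# so only one batch is materialized in memory at a time
while (!is.null(batch <- reader$read_next_batch())) {
  process_chunk(as.data.frame(batch))
}
```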
Smaller fixes
Positional-arg bug in bigSpectronauttoMSstatsFormat (aggregate_psms was landing in filter_unique_peptides's slot) — fixed by switching to named args (see the sketch after this list).
Sparklyr branch passed input where input_file was expected.
Test cleanup: file.remove → unlink(recursive = TRUE) (arrow writes directories).
Removed three committed topN_*.csv artifacts.
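The positional-arg bug reduces to a classic R pitfall. An illustrative signature, not the real one:

```r
f <- function(filter_unique_peptides = FALSE, aggregate_psms = FALSE) {
  list(filter_unique_peptides = filter_unique_peptides,
       aggregate_psms = aggregate_psms)
}
f(TRUE)                   # intended for aggregate_psms, fills the first slot
f(aggregate_psms = TRUE)  # named argument reaches the right parameter
```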
Changes
Three connected fixes to the DIANN out-of-memory pipeline:
Annotation actually gets merged (the eponymous fix).
Arrow backend doesn't crash on all-NA columns or eager writes.
DIANN 2.0 parquet input works.
Test coverage
Deleted tests/testthat/test-clean_DIANN.R. It mocked MSstatsMakeAnnotation inside cleanDIANNChunk — but annotation now happens after preprocessing, not during chunking, so the test was asserting behavior that no longer exists.
Added bigDIANNtoMSstatsFormat works with annotation in tests/testthat/test-diann_converter.R — exercises the new merge path end-to-end, asserts Condition/BioReplicate land in the output and map correctly per Run (a hypothetical sketch of the test's shape follows this list).
Two real-fixture tests (DIANN 1.x TSV and DIANN 2.0 parquet, both pulled from MSstatsConvert/inst/tinytest) are present in test-converters.R but commented out — they reference paths under /Users/rudhikshah/... so they'd fail on CI or any other contributor's machine.
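A hypothetical sketch of what that test's shape could look like (names, fixtures, and expectations assumed, not copied from the PR; input_file and output_file are placeholders):

```r
test_that("bigDIANNtoMSstatsFormat works with annotation", {
  annotation <- data.frame(
    Run          = c("run1.raw", "run2.raw"),
    Condition    = c("Control", "Treated"),
    BioReplicate = c(1, 2)
  )
  output <- bigDIANNtoMSstatsFormat(input_file, output_file,
                                    annotation = annotation)
  # annotation columns must land in the output...
  expect_true(all(c("Condition", "BioReplicate") %in% colnames(output)))
  # ...and carry the annotated values through the merge
  expect_setequal(unique(output$Condition), c("Control", "Treated"))
})
```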
Checklist Before Requesting a Review
Context and Solution
This PR fixes missing annotation merging for bigDIANN workflows and multiple Arrow backend issues that broke preprocessing and downstream MSstats steps. It ensures run-level annotation (Condition, BioReplicate) is actually merged, makes Arrow-backed processing robust to lazy queries and null-typed (all-NA) columns, and adds explicit parquet support for DIANN 2.0 by streaming record batches. The annotation merge was moved to after preprocessing and MSstatsAddAnnotationBig was hardened to drop overlapping non-key columns before joining.
Detailed Changes
R/backends.R
R/clean_DIANN.R
R/converters.R
R/utils.R
Tests & artifacts
Unit Tests Added or Modified
Deleted test: tests/testthat/test-clean_DIANN.R (its mocked chunk-time annotation behavior no longer exists).
Added test: "bigDIANNtoMSstatsFormat works with annotation" in tests/testthat/test-diann_converter.R.
Modified tests: tests/testthat/test-converters.R and tests/testthat/test-diann_converter.R (cleanup switched from file.remove to unlink(recursive = TRUE)).
Note: Two real-fixture DIANN tests (DIANN 1.x TSV and DIANN 2.0 parquet) remain present but commented out due to hardcoded local paths.
Coding Guidelines Violations