In certain cases, standard aligners place reads incorrectly around variants because of limitations in their scoring schemes.
This can lead to soft-clipping of reads or the identification of false-positive variants,
where reads are aligned to shorter or suboptimal variant representations instead of the true underlying sequence.
VerRea aims to address these issues by re-evaluating alignments and improving the accuracy of variant detection.
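
To illustrate why a scoring scheme can favor soft-clipping over the true variant, here is a minimal sketch that compares the two choices for a read whose last bases lie beyond a long deletion. The affine-gap values in `kExample` and the 52 bp deletion length are hypothetical numbers chosen for the arithmetic; they are not taken from VerRea or from any particular aligner.

```cpp
// Hypothetical affine-gap scoring parameters, chosen only for illustration.
struct Scoring {
    int match;         // score per matching base
    int gap_open;      // penalty for opening a gap (positive number)
    int gap_extend;    // penalty per gap base (positive number)
    int clip_penalty;  // penalty for soft-clipping a read end (positive number)
};

// Score contribution of aligning `tail_len` matching bases after a deletion
// of `del_len` reference bases.
constexpr int score_with_deletion(const Scoring& s, int tail_len, int del_len) {
    return tail_len * s.match - (s.gap_open + del_len * s.gap_extend);
}

// Score contribution of soft-clipping the tail instead of aligning it.
constexpr int score_with_clip(const Scoring& s) {
    return -s.clip_penalty;
}

// True when this scoring scheme prefers clipping over reporting the deletion.
constexpr bool prefers_clip(const Scoring& s, int tail_len, int del_len) {
    return score_with_clip(s) > score_with_deletion(s, tail_len, del_len);
}

// Assumed example values: match 1, gap open 6, gap extend 1, clip penalty 5.
constexpr Scoring kExample{1, 6, 1, 5};

// 20 bases after a 52 bp deletion: 20 - (6 + 52) = -38, worse than -5, so the tail is clipped.
static_assert(prefers_clip(kExample, 20, 52), "short tail is clipped");
// 60 bases after the deletion: 60 - 58 = 2, better than -5, so the deletion is kept.
static_assert(!prefers_clip(kExample, 60, 52), "long tail keeps the deletion");
```

Under these assumed values, a read only supports the deletion when enough of it extends past the breakpoint, which is the kind of case a realignment step can revisit.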
- Project Goal
- Requirements and Dependencies
- Installation and Run
- Tests and Limitations
- License Information
- Acknowledgments
- Contact Information
The goal of this project is to improve variant detection accuracy by correcting misaligned reads produced by standard aligners.
VerRea performs a targeted realignment step after the initial alignment. It analyzes selected genomic regions and
identifies reads that were likely misaligned, for example reads that were soft-clipped or aligned with a suboptimal score.
These reads are then realigned against alternative reference sequences to better represent the underlying variants.
This approach aims to:
- reduce false-positive variant calls
- recover missed or incorrectly represented variants
- improve alignment quality in challenging regions
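
As a rough sketch of how soft-clipped reads could be flagged as realignment candidates, the code below counts the soft-clipped bases in a SAM CIGAR string (the `S` operation) and compares them to the read length. This is not VerRea's implementation: the names `ClipSummary`, `summarize_clipping` and `is_realignment_candidate`, and the `min_clip_fraction` threshold, are hypothetical and used only for illustration.

```cpp
#include <string_view>

// Summary of the soft-clipped portion of one alignment.
struct ClipSummary {
    int leading = 0;    // soft-clipped bases at the start of the read
    int trailing = 0;   // soft-clipped bases at the end of the read
    int query_len = 0;  // read length implied by the CIGAR (M, I, S, =, X)
    bool valid = true;  // false if the CIGAR string could not be parsed
};

// Walk a CIGAR string such as "100M50S" and collect soft-clip lengths.
constexpr ClipSummary summarize_clipping(std::string_view cigar) {
    ClipSummary out{};
    int len = 0;
    bool have_digits = false;
    bool seen_aligned = false;  // true once a non-clip operation was seen
    for (char c : cigar) {
        if (c >= '0' && c <= '9') {
            len = len * 10 + (c - '0');
            have_digits = true;
            continue;
        }
        if (!have_digits) {  // operation letter without a length
            out.valid = false;
            return out;
        }
        switch (c) {
            case 'S':
                if (seen_aligned) out.trailing += len; else out.leading += len;
                out.query_len += len;
                break;
            case 'M': case 'I': case '=': case 'X':
                out.query_len += len;
                seen_aligned = true;
                break;
            case 'D': case 'N':
                seen_aligned = true;  // consumes the reference only
                break;
            case 'H': case 'P':
                break;  // consumes neither the read sequence nor the reference
            default:
                out.valid = false;
                return out;
        }
        len = 0;
        have_digits = false;
    }
    if (have_digits) out.valid = false;  // dangling length without an operation
    return out;
}

// Hypothetical rule: flag reads whose clipped fraction reaches the threshold.
constexpr bool is_realignment_candidate(const ClipSummary& s, double min_clip_fraction) {
    if (!s.valid || s.query_len == 0) return false;
    const double clipped = s.leading + s.trailing;
    return clipped / s.query_len >= min_clip_fraction;
}
```

For example, `summarize_clipping("100M50S")` reports 50 trailing clipped bases out of 150, so the read would be flagged with an assumed threshold of 0.1, while a fully matched `"150M"` read would not.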
- OS: macOS or Linux
- C++17 or later
- HTSlib
- spdlog (automatically fetched before compile)
make dev
./build/debug/app \
"--ref" "path_to_hg38" \
"--in" "tests/inputs/bams/100_reads_only_carl_Seraseq-STD-10-ng_so_rmdp.bam" \
"--kmer" "41" \
"--sga" "8,-4,-15,-1" \
"--ca" "5,2" \
"--log" "debug" \
"--mmr" "0.05" \
"--out" "tests/temp/out_seraseq.rea.bam" \
"--targets" "tests/inputs/beds/carl.bed"

make all
./build/release/app \
"--ref" "path_to_hg38" \
"--in" "tests/inputs/bams/100_reads_only_carl_Seraseq-STD-10-ng_so_rmdp.bam" \
"--kmer" "41" \
"--sga" "8,-4,-15,-1" \
"--ca" "5,2" \
"--log" "info" \
"--mmr" "0.05" \
"--out" "tests/temp/out_seraseq.rea.bam" \
"--targets" "tests/inputs/beds/carl.bed"

After compiling, you can run the tests:
./build/release/basis_tests "--ref" "path_to_hg38"

All test input files (.bam) come from plasmids, reference standards, or artificially simulated data.
The input BAM files contain only short reads (150 bp), so all tests were performed on short-read sequencing data.
The tool currently operates on targeted regions specified via a BED file,
as the primary use case focuses on selected genomic loci. Whole-genome performance has not yet been evaluated,
and the current implementation is single-threaded.
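
Restricting work to BED targets can be sketched as follows. BED coordinates are 0-based with an exclusive end, so overlap checks use half-open intervals. This is an illustrative sketch, not VerRea's code: `BedInterval`, `parse_bed_line`, `overlaps` and `read_hits_target` are hypothetical names, and only the first three BED columns are read.

```cpp
#include <cstdint>
#include <optional>
#include <sstream>
#include <string>

// One target region from a BED file: 0-based start, exclusive end.
struct BedInterval {
    std::string chrom;
    std::int64_t start = 0;
    std::int64_t end = 0;
};

// Parse the first three columns of a BED line.
// Returns nullopt for comments, track/browser lines, and malformed lines.
std::optional<BedInterval> parse_bed_line(const std::string& line) {
    if (line.empty() || line[0] == '#' ||
        line.rfind("track", 0) == 0 || line.rfind("browser", 0) == 0) {
        return std::nullopt;
    }
    std::istringstream fields(line);
    BedInterval iv;
    if (!(fields >> iv.chrom >> iv.start >> iv.end)) return std::nullopt;
    if (iv.start < 0 || iv.end < iv.start) return std::nullopt;
    return iv;
}

// Half-open interval overlap: [a_start, a_end) against [b_start, b_end).
constexpr bool overlaps(std::int64_t a_start, std::int64_t a_end,
                        std::int64_t b_start, std::int64_t b_end) {
    return a_start < b_end && b_start < a_end;
}

// True if an alignment on `chrom` covering [start, end) touches the target.
bool read_hits_target(const BedInterval& target, const std::string& chrom,
                      std::int64_t start, std::int64_t end) {
    return chrom == target.chrom &&
           overlaps(start, end, target.start, target.end);
}
```

With a toy target parsed from the line `chr1<TAB>200<TAB>300`, a read covering [100, 250) on `chr1` hits the target, while a read covering [100, 200) does not, because position 200 is the exclusive end of the read.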
This project is released under the MIT License.
This project was developed in response to the need for better detection of problematic variants during
the development of an amplicon kit for myeloproliferative neoplasms (MPN),
specifically targeting the CALR (calreticulin) locus.
The laboratory development of the kit was done by Veronika Chladova (BioVendor R&D).
She was also consulted on the interpretation and evaluation of variant allele frequency (VAF) within the MPN kit.
For questions and bug reports:
Matej Forgac — forgac.matej@gmail.com
For sequencing-related questions (laboratory part):
Veronika Chladova — chladova@biovendor.com
