fix(lib): derive GT phase from allele strings vs ref column #115

Merged
Jaureguy760 merged 1 commit into mcvickerlab:dev from Jaureguy760:fix-gt-allele-parsing
Jun 2, 2026
@Jaureguy760

Collaborator

Summary

The GT parser previously recognized only the VCF-standard numeric GT values "0|1" and "1|0". Input data with allele-string GT values (e.g., "A|C", "G|T") was silently treated as unphased, so the per_donor phased analysis produced results identical to the unphased analysis.

Changes

Modified rust/src/lib.rs to support two GT formats:

  1. Numeric (VCF standard): "0|1" or "1|0"
  2. Allele strings: "X|Y" - derives phase by comparing to the ref column
    • If first allele == ref → gt = 0 (ref on hap1)
    • If second allele == ref → gt = 1 (ref on hap2)
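The two rules above can be sketched as a small standalone function. This is an illustration only, not the code merged in `rust/src/lib.rs`: the name `parse_phased_gt`, its signature, and the `Option<u8>` return type are hypothetical. The mapping of "0|1" to 0 and "1|0" to 1 is inferred from the allele rule (ref on hap1 gives 0), and returning `None` for homozygous, unphased, or non-matching genotypes is an assumption the PR does not spell out.

```rust
/// Hypothetical helper (not the actual lib.rs implementation).
/// Returns Some(0) when the ref allele is on haplotype 1,
/// Some(1) when it is on haplotype 2, and None when no phase
/// can be derived from the GT string.
pub fn parse_phased_gt(gt: &str, ref_allele: &str) -> Option<u8> {
    // Only '|' marks a phased genotype; "A/C" and similar yield None.
    let (first, second) = gt.trim().split_once('|')?;
    match (first, second) {
        // Format 1: numeric VCF-style heterozygous calls (0 = ref).
        ("0", "1") => Some(0),
        ("1", "0") => Some(1),
        // Assumption: identical alleles carry no phase information.
        (a, b) if a == b => None,
        // Format 2: allele strings, compared against the ref column.
        (a, _) if a == ref_allele => Some(0),
        (_, b) if b == ref_allele => Some(1),
        // Assumption: neither allele matches ref, so treat as unphased.
        _ => None,
    }
}
```

Under these assumptions, `parse_phased_gt("A|C", "A")` and `parse_phased_gt("0|1", "A")` both resolve to ref on hap1, while `parse_phased_gt("A/C", "A")` falls through to `None` and would be handled as unphased.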

Verification

| Version | per_donor phased λ_GC | per_donor unphased λ_GC | Diff |
|---------|-----------------------|-------------------------|------|
| OLD | 1.501 | 1.501 | 0.000 (BUG) |
| FIXED | 1.106 | 1.501 | 0.395 (FIX WORKS) |

  • OLD: phased == unphased (λ_GC diff = 0.000) ❌
  • FIXED: phased ≠ unphased (λ_GC diff = 0.395) ✅
  • Phased λ_GC (1.106) closer to 1.0 (well-calibrated) than unphased (1.501)

Test plan

  • Quick test on 1000 rows: 35 regions with different p-values (vs 0 for OLD)
  • Full parity grid on 1.26M rows, 75k regions
  • λ_GC patterns statistically sensible (phased better calibrated)
  • Existing unit tests pass (cargo test)

🤖 Generated with Claude Code

The GT parser previously only recognized VCF-standard "0|1" and "1|0"
formats. Input data with allele-string GT values (e.g., "A|C", "G|T")
was silently treated as unphased, causing per_donor phased analysis
to produce identical results to unphased.

Now supports two GT formats:
1. Numeric: "0|1" or "1|0" (VCF standard)
2. Allele: "X|Y" where X,Y are allele strings - derives phase by
   comparing to the ref column

Fix verified:
- OLD: per_donor phased == unphased (λ_GC diff = 0.000)
- FIXED: per_donor phased ≠ unphased (λ_GC diff = 0.395)
- Phased λ_GC (1.106) closer to 1.0 than unphased (1.501)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Jaureguy760 Jaureguy760 merged commit 4a20f80 into mcvickerlab:dev Jun 2, 2026
8 checks passed
@Jaureguy760 Jaureguy760 deleted the fix-gt-allele-parsing branch June 2, 2026 09:28