simGL

simGL simulates genotype likelihoods (GLs) from haplotypic genotype matrices, given per-sample coverage and sequencing error rates. It is designed for msprime and tskit pipelines but accepts any NumPy haplotype matrix.
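
Because any NumPy haplotype matrix is accepted, the input does not have to come from a simulator. The following is a minimal sketch of a hand-built matrix. It assumes the (n_sites, n_haplotypes) layout noted in the quick example, and it assumes that consecutive columns belong to the same individual. The variable names are illustrative, and no simGL function is called.

```python
import numpy as np

ploidy = 2
rng = np.random.default_rng(0)

# 5 sites x 8 haplotypes; 0 = reference allele, 1 = alternative allele
gm = rng.integers(0, 2, size=(5, 8))

n_sites, n_haplotypes = gm.shape
n_individuals = n_haplotypes // ploidy

# Assumption: columns 0-1 are individual 0, columns 2-3 are individual 1, ...
# Summing each group gives the alternative-allele count per individual.
alt_counts = gm.reshape(n_sites, n_individuals, ploidy).sum(axis=2)
print(alt_counts.shape)  # (5, 4)
```

A matrix like gm, with one column per haplotype, is the kind of array the quick example passes to simGL in place of the msprime output.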

Installation

pip install simGL

Or via conda:

conda install -c conda-forge simGL

Or from source:

git clone https://github.com/RacimoLab/simGL.git
cd simGL
pip install -e .

Quick example

import msprime
import numpy as np
import simGL

# 1. Simulate a tree sequence and extract the biallelic genotype matrix
ts = msprime.sim_ancestry(
    samples=10, ploidy=2, sequence_length=100_000,
    recombination_rate=1e-8, population_size=10_000, random_seed=1,
)
ts = msprime.sim_mutations(ts, rate=1e-4, random_seed=1)

gm_full  = ts.genotype_matrix()
biallelic = gm_full.max(axis=1) == 1
gm       = gm_full[biallelic]           # shape (n_sites, n_haplotypes)

# 2. Get reference and alternative alleles
variants = list(ts.variants())
ref = np.array([v.alleles[0] for v in variants])[biallelic]
alt = np.array([v.alleles[1] for v in variants])[biallelic]

# 3. Simulate allele read counts
arc = simGL.sim_allelereadcounts(
    gm, mean_depth=10., std_depth=2., e=0.01,
    ploidy=2, seed=42, ref=ref, alt=alt,
)
# arc shape: (n_sites, n_individuals, 4)  —  A, C, G, T read counts

# 4. Compute genotype likelihoods
GL = simGL.allelereadcounts_to_GL(arc, e=0.01, ploidy=2)
# GL shape: (n_sites, n_individuals, 10)  —  all diploid ACGT genotypes

# 5. Subset to biallelic genotypes and write a VCF
Ra     = simGL.ref_alt_to_index(ref, alt)
GL_sub = simGL.subset_GL(GL, Ra, ploidy=2)

# tskit positions are 0-based; VCF positions are 1-based, hence the + 1
pos   = np.array([int(v.site.position) for v in variants])[biallelic] + 1
names = [f"ind{i}" for i in range(ts.num_individuals)]
simGL.GL_to_vcf(GL_sub, arc, ref, alt, pos, names, "output.vcf")
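
The trailing dimension of 10 in GL follows from counting the unordered diploid genotypes over the four bases A, C, G and T. The sketch below enumerates them and evaluates a log-likelihood for one set of read counts under a simple symmetric error model, in which a read shows the sampled allele with probability 1 - e and each other base with probability e / 3. The function names, the genotype ordering and the error model are assumptions made for illustration. simGL's internal ordering and exact formula may differ, so this is an aid to intuition and not a description of the library.

```python
import math
from itertools import combinations_with_replacement

ALLELES = "ACGT"

def enumerate_genotypes(ploidy=2):
    """All unordered genotypes over A, C, G, T (illustrative ordering)."""
    return list(combinations_with_replacement(range(len(ALLELES)), ploidy))

def log_likelihood(counts, genotype, e):
    """Log P(read counts | genotype) under a symmetric error model.

    counts   -- reads supporting A, C, G, T at one site for one individual
    genotype -- tuple of allele indices, e.g. (0, 2) for A/G
    e        -- per-base sequencing error rate (must be > 0 here)
    """
    total = 0.0
    for base, n_reads in enumerate(counts):
        # Each read samples one of the genotype's alleles uniformly, then
        # is read correctly (1 - e) or as one of three other bases (e / 3).
        p = sum((1 - e) if allele == base else e / 3 for allele in genotype)
        total += n_reads * math.log(p / len(genotype))
    return total

genotypes = enumerate_genotypes(ploidy=2)
counts = [6, 0, 5, 0]  # hypothetical reads: 6 A, 5 G
scores = {g: log_likelihood(counts, g, e=0.01) for g in genotypes}
best = max(scores, key=scores.get)
print(len(genotypes), "".join(ALLELES[i] for i in best))
```

With these hypothetical counts the script prints 10 AG: there are ten candidate genotypes, and the heterozygote A/G explains a roughly even mix of A and G reads far better than either homozygote.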

Documentation

Full documentation — installation, user guide, API reference, and theory — is available at https://simgl.readthedocs.io.

Citation

If you use simGL in your work, please cite:

Coll Macià, M. & Gower, G. (2026). simGL: Simulate genotype likelihoods from haplotypic genotype matrices (v0.2.0). Zenodo. https://doi.org/10.5281/zenodo.19637642

Please also cite the methodological papers listed in the Citation page of the documentation.

License

MIT
