with_seqs("variants") (RaggedVariants) does not clip variants to the region window #202

Description

@d-laub

When reading variants via Dataset.with_seqs("variants").with_len("ragged"), the returned RaggedVariants for a [region, sample] cell is NOT clipped to the queried region window.

In python/genvarloader/_dataset/_haps.py, the ragged __call__ path calls:

# _haps.py:471
ragv = self._get_variants(idx=idx, regions=None, shifts=None)

with regions=None, so no region filter is applied (there is also a # TODO: maybe filter variants for region, shifts? at _haps.py:642). By contrast, the haplotype/annotated path (get_haps_and_shifts, _haps.py:522) passes regions=req.regions, keep=req.keep, so haplotype sequence output is properly windowed, which is why haplotype reconstruction is correct.

Consequences:

  • RaggedVariants output can include variants outside the region window: for example, indels that overlap a region boundary or, for the PGEN backend (which stores a coarser per-cell variant set), variants from elsewhere on the contig.
  • Any consumer counting/inspecting per-region variants via with_seqs("variants") gets an unclipped set.
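
To make the expected behavior concrete, here is a minimal sketch of what clipping to the region window could look like. It is not the genvarloader implementation: `overlaps_window`, `starts`, and `ref_lens` are hypothetical names, and it assumes 0-based, half-open coordinates where a variant occupies `[start, start + ref_len)` on the reference. Whether boundary-overlapping indels should be kept or dropped is a policy choice; this sketch keeps them.

```python
import numpy as np


def overlaps_window(starts, ref_lens, region_start, region_end):
    """Boolean mask of variants whose reference span overlaps the window.

    Hypothetical helper, assuming 0-based half-open coordinates:
    a variant covers [start, start + ref_len) and the region covers
    [region_start, region_end).
    """
    starts = np.asarray(starts, dtype=np.int64)
    ends = starts + np.asarray(ref_lens, dtype=np.int64)
    # Standard interval-overlap test for half-open intervals.
    return (starts < region_end) & (ends > region_start)


# Toy data: one variant upstream, one deletion straddling the left
# boundary, one inside the window, one exactly at the right boundary,
# and one far away on the contig.
starts = np.array([90, 98, 105, 150, 5000])
ref_lens = np.array([1, 5, 1, 1, 1])

mask = overlaps_window(starts, ref_lens, region_start=100, region_end=150)
kept = starts[mask]  # variants at 98 and 105 remain
```

With a filter of this shape applied per [region, sample] cell, only the straddling deletion and the in-window variant would survive.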

Found via property-based testing (Phase 2). Track 1b was reframed to AF validation because a per-region variant-count oracle is not meaningful against the current unclipped output.
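
For reference, a per-region variant-count oracle of the kind mentioned above might look like the following brute-force sketch. It only becomes meaningful once the output is clipped. The function name, the tuple layout, and the 0-based half-open coordinate convention are all assumptions for illustration, not part of the existing test suite.

```python
def expected_variant_count(variants, region):
    """Brute-force count of variants overlapping a region.

    Hypothetical oracle. `variants` is an iterable of
    (contig, start, ref_len) tuples and `region` is
    (contig, start, end), all 0-based and half-open.
    """
    contig, r_start, r_end = region
    return sum(
        1
        for v_contig, v_start, v_len in variants
        if v_contig == contig and v_start < r_end and v_start + v_len > r_start
    )


variants = [
    ("chr1", 98, 5),    # deletion straddling the left boundary
    ("chr1", 105, 1),   # inside the window
    ("chr1", 5000, 1),  # elsewhere on the same contig
    ("chr2", 110, 1),   # different contig
]
region = ("chr1", 100, 150)
count = expected_variant_count(variants, region)
```

A property-based test could then compare this count against the number of variants returned for the same cell.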
