diff --git a/CLAUDE.md b/CLAUDE.md
index 42ca5a1b..879d0796 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -238,6 +238,24 @@ In scope:
 
 When a change ships, update the relevant section of the skill and re-check the "Common gotchas" and "Where to look next" pointer table. The skill is published to https://www.skills.sh/ as `mcvickerlab/GenVarLoader` (installable via `npx skills add mcvickerlab/GenVarLoader`); keep it accurate against `main`.
 
+## Docs audit before feature/breaking-change PRs
+
+Before opening any PR that adds a user-facing feature or makes a breaking change, audit and update the user-facing docs so they stay consistent with the code:
+
+- `README.md` (features, installation, requirements)
+- `docs/source/*.md` — especially `api.md`, `faq.md`, `write.md`, `dataset.md`, `format.md`, `index.md`
+- `skills/genvarloader/SKILL.md` (see "Maintaining the `genvarloader` skill" above)
+
+Check for: now-false claims (deleted backends, removed deps, changed defaults, renamed/removed symbols), new user-facing config or environment variables that need documenting, and changed installation/preprocessing (bcftools/plink2) requirements.
+
+**`api.md` must stay in sync with `__all__`.** Every symbol exported in `python/genvarloader/__init__.py`'s `__all__` needs an autodoc entry in `docs/source/api.md`; adding a public symbol without one silently drops it from the rendered API reference. Quick check:
+
+```bash
+python -c "import re,genvarloader as g; api=open('docs/source/api.md').read(); print('MISSING:', [n for n in g.__all__ if n not in api] or 'none')"
+```
+
+The auto-generated `docs/source/changelog.md` (built from commit messages via `changelog.md.j2`) does **not** count as documentation — never treat a changelog entry as a substitute for prose docs. This gate complements the skill-maintenance rule above: public-API changes must update the skill, and any user-facing change must also keep the prose docs true.
+
 ## Rust migration roadmap
 
 Any task that mentions "rust" (adding or porting Rust code, touching `src/`, or migrating numba/Python hot paths) **must** read `docs/roadmaps/rust-migration.md` before starting and update it as part of the work — tick completed tasks, record measurement results under the relevant checkpoint, and set the phase status marker (⬜/🚧/✅) + PR link. The roadmap is the source of truth for migration sequencing and the byte-identical parity contract.
diff --git a/README.md b/README.md
index 1843c067..7c4513e7 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ Documentation is available [here](https://genvarloader.readthedocs.io/). See our
 pip install genvarloader
 ```
 
-A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). `tbb` and/or `pyomp` are optional dependencies but highly recommended as they can improve throughput for parallelized numba code.
+A PyTorch dependency is **not** included since it may require [special instructions](https://pytorch.org/get-started/locally/). GenVarLoader parallelizes its data-loading hot paths in Rust (rayon) out of the box, with no extra dependencies required; you can tune the worker count with the `GVL_NUM_THREADS` environment variable (see the [FAQ](https://genvarloader.readthedocs.io/en/latest/faq.html)).
 
 ## Contributing
 
diff --git a/docs/source/api.md b/docs/source/api.md
index 382f0f88..ae3fb0e9 100644
--- a/docs/source/api.md
+++ b/docs/source/api.md
@@ -7,6 +7,8 @@
 
 .. autofunction:: write
 
+.. autofunction:: update
+
 .. autofunction:: get_splice_bed
 
 .. autofunction:: read_bedlike
@@ -22,6 +24,44 @@
     :exclude-members: __new__
 ```
 
+## Insertion fill
+
+Strategies controlling how re-aligned track values are filled across inserted bases (indels). Pass an instance to [`gvl.Dataset.with_insertion_fill()`](#genvarloader.Dataset.with_insertion_fill). `InsertionFill` is the abstract base; instantiate one of the concrete strategies.
+
+```{eval-rst}
+.. currentmodule:: genvarloader
+
+.. autoclass:: InsertionFill
+    :members:
+
+.. autoclass:: Constant
+    :members:
+
+.. autoclass:: FlankSample
+    :members:
+
+.. autoclass:: Interpolate
+    :members:
+
+.. autoclass:: Repeat5p
+    :members:
+
+.. autoclass:: Repeat5pNormalized
+    :members:
+```
+
+## Dataset maintenance
+
+Utilities for upgrading on-disk datasets written by older GVL versions.
+
+```{eval-rst}
+.. currentmodule:: genvarloader
+
+.. autofunction:: migrate
+
+.. autofunction:: migrate_svar_link
+```
+
 ## Reading
 
 ### Personalized data
@@ -35,6 +75,12 @@
 
 .. autofunction:: get_dummy_dataset
 
+.. autoclass:: DummyVariant
+    :members:
+
+.. autoclass:: VarWindowOpt
+    :members:
+
 .. autoclass:: RaggedDataset
     :exclude-members: __new__, __init__
 
@@ -102,4 +148,44 @@ Classes that GVL Datasets may return.
 .. autoclass:: RaggedIntervals
     :members:
     :exclude-members: __init__
+```
+
+### Flat containers
+
+Returned in place of the ragged containers when a Dataset uses [`with_output_format("flat")`](#genvarloader.Dataset.with_output_format). Each carries flat `data`/`offsets` buffers and a `to_ragged()` escape hatch back to the ragged form.
+
+```{eval-rst}
+.. currentmodule:: genvarloader
+
+.. autoclass:: FlatRagged
+    :members:
+    :exclude-members: __init__
+
+.. autoclass:: FlatAnnotatedHaps
+    :members:
+    :exclude-members: __init__
+
+.. autoclass:: FlatIntervals
+    :members:
+    :exclude-members: __init__
+
+.. autoclass:: FlatVariants
+    :members:
+    :exclude-members: __init__
+
+.. autoclass:: FlatAlleles
+    :members:
+    :exclude-members: __init__
+
+.. autoclass:: FlatVariantWindows
+    :members:
+    :exclude-members: __init__
+```
+
+### PyTorch interop
+
+```{eval-rst}
+.. currentmodule:: genvarloader
+
+.. autofunction:: to_nested_tensor
 ```
\ No newline at end of file
diff --git a/docs/source/faq.md b/docs/source/faq.md
index 05409bf9..bb3b8a32 100644
--- a/docs/source/faq.md
+++ b/docs/source/faq.md
@@ -24,7 +24,7 @@ ragged = gvl.Ragged.from_offsets(data, shape, offsets)
 # ]
 ```
 
-Ragged arrays are subclasses of [Awkward Arrays](https://github.com/scikit-hep/awkward), so anything you can do with Awkward Arrays you can do with Ragged arrays. Within GVL, we use numba JIT'd functions to compute on the ragged objects' buffers directly since it's relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array).
+Ragged arrays are backed by [`seqpro`](https://github.com/ML4GLand/SeqPro)'s `Ragged` type (a Rust-backed `_core.Ragged`). GVL computes on the `data` and `offsets` buffers directly in Rust, which is relatively straightforward (i.e. iterating over the rows of `data` via the `offsets` array). (Earlier releases subclassed [Awkward Arrays](https://github.com/scikit-hep/awkward); GVL no longer depends on `awkward`.)
 
 .. note::
 
@@ -63,6 +63,14 @@ bcftools view -Hp $vcf | wc -l
 plink2 --pgen-info $prefix
 ```
 
+## How do I control how many threads GVL uses?
+
+GVL's read path (haplotype reconstruction and track re-alignment) is parallelized in Rust with [rayon](https://github.com/rayon-rs/rayon). By default it uses one worker per available CPU, detected from the Linux cgroup cpuset (`sched_getaffinity`) so it respects container limits, and falling back to `os.cpu_count()` elsewhere. Three environment variables tune this:
+
+- **`GVL_NUM_THREADS`** — set the worker count explicitly (e.g. `GVL_NUM_THREADS=4`). Overrides cgroup detection. Resolved once, on first use, so set it before your first GVL call.
+- **`GVL_FORCE_PARALLEL`** — set to a truthy value (`1`, `true`, `yes`, `on`) to force the multithreaded paths even on small inputs. By default GVL runs small inputs serially because thread overhead would dominate; this bypasses that size gate. Mainly useful for benchmarking.
+- **`RAYON_NUM_THREADS`** — GVL **overwrites** this with its own resolved count so an inherited value (e.g. baked into a base image) can't defeat the cgroup-aware cap. To size the pool yourself, use `GVL_NUM_THREADS` instead.
+
 ## How can I get personalized protein/spliced RNA sequences?
 
 This is not yet supported but on GVL's roadmap for the near future. Keep an eye out in future releases!
diff --git a/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md b/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md
new file mode 100644
index 00000000..0597e0fb
--- /dev/null
+++ b/docs/superpowers/specs/2026-06-30-docs-consistency-audit-design.md
@@ -0,0 +1,50 @@
+# Docs consistency pass + CLAUDE.md docs-audit gate
+
+**Date:** 2026-06-30
+**Branch:** `docs/consistency-audit`
+
+## Problem
+
+Recent gvl work (Phase 5: numba read-path backend deleted → Rust-only; awkward →
+`_core.Ragged` migration; new rayon threading knobs) left user-facing docs stale,
+and there is no process gate ensuring docs stay consistent with future
+feature/breaking-change PRs.
+
+Key facts established during the audit:
+- gvl's **own** code is numba-free (`pixi.toml` comment; `tests/parity/test_import_no_numba.py`).
+  Numba survives only as a conda pin because **seqpro** transitively imports it. The
+  residual `_numba`-suffixed names in gvl route only to Rust or numpy.
+- The read path is parallelized in Rust with rayon, tuned via env vars in
+  `python/genvarloader/_threads.py`: `GVL_NUM_THREADS`, `GVL_FORCE_PARALLEL`, and a
+  `RAYON_NUM_THREADS` override (issue #263). None were documented user-side.
+
+## Scope (focused fix — not a full line-by-line sweep)
+
+### Part A — docs fixes
+1. `docs/source/faq.md` — rewrite the "Ragged objects" answer's stale
+   "subclass of Awkward Arrays / numba JIT'd functions" paragraph to reflect the
+   `seqpro.rag.Ragged` (`_core.Ragged`, Rust) backend; note awkward is no longer a dep.
+2. `docs/source/faq.md` — new entry "How do I control how many threads GVL uses?"
+   documenting the three env vars, sourced from `_threads.py`.
+3. `README.md` — replace the `tbb`/`pyomp`-for-numba install note with a note that
+   parallelism is built-in (Rust/rayon), tunable via `GVL_NUM_THREADS`.
+4. `skills/genvarloader/SKILL.md` — `_core.Ragged` "Rust+numba backend" → "Rust backend"
+   (seqpro-core's rag layer is numba-free).
+5. Targeted leftover sweep of README + `docs/source/*.md` + SKILL.md for other
+   `numba`/`awkward`/`GVL_BACKEND`/`tbb`/`pyomp` references — none remaining (the
+   surviving `awkward` mentions in SKILL.md describe "zero-awkward" as a feature).
+
+The auto-generated `docs/source/changelog.md` is left untouched (built from commit
+messages via `changelog.md.j2`).
+
+### Part B — CLAUDE.md gate
+Add a "Docs audit before feature/breaking-change PRs" section that requires auditing
+README + `docs/source/*.md` + SKILL.md before such PRs, lists what to check
+(now-false claims, new config/env vars, changed preprocessing), and states the
+auto-generated changelog does not count as documentation. Complements the existing
+skill-maintenance rule.
+
+## Verification
+- Markdown edits are prose-only in existing files with no new MyST directives.
+- Full `pixi run -e docs doc` build not run in-worktree (docs env not provisioned there);
+  low build-break risk given no directive changes.
diff --git a/pixi.toml b/pixi.toml
index 3e54e402..e05e09ed 100644
--- a/pixi.toml
+++ b/pixi.toml
@@ -163,7 +163,6 @@ cargo-test = { cmd = "cargo test --release" }
 memray-write = { cmd = "memray run -fo tests/benchmarks/profiling/write.memray.bin tests/benchmarks/profiling/profile_write.py --op write" }
 
 [feature.docs.tasks]
-install-e = "uv pip install -e /cellar/users/dlaub/projects/ML4GLand/SeqPro -e /cellar/users/dlaub/projects/genoray -e ."
 i-kernel = "ipython kernel install --user --name 'gvl-docs' --display-name 'GVL Docs'"
 i-kernel-gpu = "ipython kernel install --user --name 'gvl-docs-gpu' --display-name 'GVL Docs GPU'"
 doc = "cd docs && make clean && make html"
diff --git a/skills/genvarloader/SKILL.md b/skills/genvarloader/SKILL.md
index b04835a8..ea0da1d5 100644
--- a/skills/genvarloader/SKILL.md
+++ b/skills/genvarloader/SKILL.md
@@ -341,7 +341,7 @@ Footprint is computed exactly via `Dataset._output_bytes_per_instance(...)` (use
 
 - `gvl.Reference.from_path(fasta, contigs=None)` — wrap a FASTA (path to a `.fa`/`.fa.bgz`, or a `.gvlfa` cache dir). Builds/reuses a sibling `.gvlfa` cache directory (self-describing, fingerprint-validated; legacy `.fa.gvl` caches auto-migrate). The cache is built atomically (temp + `os.replace`) under a best-effort lock, so concurrent builders sharing one reference are safe; the cache **auto-rebuilds** from its source when stale or missing.
 - `gvl.read_bedlike(path)` / `gvl.with_length(bed, L)` — BED helpers (re-exported from `seqpro`).
-- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust+numba backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). `RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`.
+- `gvl.Ragged`, `gvl.RaggedAnnotatedHaps`, `gvl.RaggedVariants`, `gvl.RaggedIntervals` — ragged return containers. All are backed by `seqpro.rag.Ragged` (`_core.Ragged` Rust backend); **not** `awkward`. `RaggedVariants` is a **subclass** of `seqpro.rag.Ragged` (`class RaggedVariants(seqpro.rag.Ragged)`), so `isinstance(rv, Ragged) is True`. Structural methods — indexing, `reshape`, `squeeze`, `to_packed` — are inherited from the base and **preserve the `RaggedVariants` type** (positional/structural operations return `RaggedVariants`). A **string key** (`rv["start"]`) returns a bare `Ragged` field, not a `RaggedVariants`. `reshape` takes the new shape either as unpacked ints — e.g. `rv.reshape(1, 2, None)` — or as a single tuple `rv.reshape((1, 2, None))`; the base `Ragged` signature accepts both. `squeeze(axis=None)` is a real axis-squeeze (base semantics) — it squeezes any size-1 axis, **not** a fixed "drop axis 0". An int index collapses the leading axis (numpy-consistent); slice/array indexing preserves it. Named properties (`.alt`, `.ref`, `.start`, `.ilen`, `.end`) are the primary access point; extra fields (e.g. `AF`, custom FORMAT fields) are also accessible via `rv["field"]` or `rv.field` (via `__getattr__`). `RaggedVariants` itself does not define `__eq__` (wrapper-level `==` is Python object-identity, not element-wise); to compare contents, compare individual fields — e.g. `rv["alt"] == other_alt` or `rv.start == other_start` — which use `seqpro.rag.Ragged`'s ufunc-based (element-wise) comparison. Domain methods retained on `RaggedVariants`: `.rc_()`, `.pad()`, `.to_nested_tensor_batch()`; derived read-only properties: `.ilen`, `.end`; fields: `.alt`, `.ref`, `.start`, `.dosage`.
 - `gvl.FlatRagged` — flat analog of `Ragged`: `.data` (flat numpy array), `.offsets` (int64), `.shape`. Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded(pad_value)`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`.
 - `gvl.FlatIntervals` — flat-buffer interval container returned by `with_tracks(kind="intervals")` + `with_output_format("flat")`. Fields `.starts`/`.ends`/`.values` are `FlatRagged`; `.to_ragged()` → `RaggedIntervals`; `.reshape(...)`, `.squeeze(...)`, `.shape`. Source: `python/genvarloader/_ragged.py`.
 - `gvl.FlatAnnotatedHaps` — flat analog of `RaggedAnnotatedHaps`: fields `.haps`, `.var_idxs`, `.ref_coords` (each a `FlatRagged`). Methods: `.to_ragged()`, `.to_fixed(length)`, `.to_padded()`, `.reshape(shape)`, `.squeeze(axis)`. Source: `python/genvarloader/_flat.py`.
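
The flat containers above, and the FAQ change in this diff, both describe ragged data as a flat `data` buffer indexed by an `offsets` array. As a rough illustration of that layout only, the sketch below splits a flat buffer into rows in plain Python. It does not use GVL's or seqpro's API and is not the Rust implementation; the helper name `iter_rows` and the sample buffers are hypothetical, and it assumes the common convention that `offsets` has one more entry than there are rows, with row `i` spanning `data[offsets[i]:offsets[i + 1]]`.

```python
from typing import List, Sequence


def iter_rows(data: Sequence[int], offsets: Sequence[int]) -> List[List[int]]:
    """Split a flat buffer into variable-length rows.

    Assumed convention (not taken from GVL's source): row i is
    data[offsets[i]:offsets[i + 1]], so offsets has n_rows + 1 entries.
    """
    rows = []
    # Pair each offset with its successor to get (start, end) per row.
    for start, end in zip(offsets[:-1], offsets[1:]):
        rows.append(list(data[start:end]))
    return rows


# Hypothetical buffers: three rows of lengths 2, 0, and 3.
data = [10, 11, 20, 21, 22]
offsets = [0, 2, 2, 5]

rows = iter_rows(data, offsets)
print(rows)  # [[10, 11], [], [20, 21, 22]]
```

An empty row shows up as two equal consecutive offsets, which is why the row count is `len(offsets) - 1` rather than something derived from `data`.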