
build(deps): bump chia from 0.42.0 to 0.42.1 in /keygen-rs #6

Open
dependabot[bot] wants to merge 208 commits into main from dependabot/cargo/keygen-rs/chia-0.42.1

Conversation

dependabot[bot] (Contributor) commented on behalf of GitHub on May 4, 2026

Bumps chia from 0.42.0 to 0.42.1.

Release notes

Sourced from chia's releases.

0.42.1

What's Changed

Full Changelog: Chia-Network/chia_rs@0.42.0...0.42.1

Commits
  • 746c688 Merge pull request #1426 from Chia-Network/bump-0.42.1
  • 03e02c0 bump version to 0.42.1
  • df18c3e Merge pull request #1425 from Chia-Network/bump-clvmrs
  • a23dd08 harmonize versions of thiserror, p256, k256, rand, rand_chacha with clvm_rs. ...
  • 695643c bump clvmr to 0.17.7
  • a256e9d Merge pull request #1422 from Chia-Network/check-time-lock
  • 1c51179 python test cases
  • 787796b extend tests for check_time_lock()
  • 1a00644 simplify check_time_lock() tests
  • 0c1c6a6 [CHIA-3854] Alternative: pure storage cost model (12000/0) for generator iden...
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Abraham Sewill and others added 30 commits April 19, 2026 19:59
Introduces run_gpu_pipeline_streaming() — a per-phase alloc/free variant
of the plot pipeline that lets xchplot2 run on GPUs too small to hold
the ~15 GB GpuBufferPool (8 GB cards like GTX 1070). Verified bit-exact
against the pool path at k=18 and k=28.

Phase 2-3: orchestration + tile+merge.

  * New GpuPipeline.cu body: allocate d_xs, run launch_construct_xs,
    free scratch, alloc d_t1_meta/d_t1_mi, run T1 match, free d_xs,
    etc. Each phase's buffers are sized exactly for that phase and
    released before the next alloc.
  * T1 and T2 sort phases tile the input and merge the sorted runs via
    a stable 2-way merge-path kernel (merge_pairs_stable_2way). Ties
    go to the left half, matching the global stable ordering (tile 0
    indices are strictly less than tile 1's).
  * XCHPLOT2_STREAMING=1 forces the streaming path through the
    one-shot run_gpu_pipeline(cfg) overload — useful for testing and
    for users who want the smaller peak even when the pool fits.

Phase 4: VRAM tracking + cap enforcement (sketched below).

  * StreamingStats struct + s_malloc/s_free route every cudaMalloc
    in the streaming path through a tracker. POS2GPU_MAX_VRAM_MB
    enforces a soft cap (throws before the allocation exceeds it);
    POS2GPU_STREAMING_STATS=1 prints a per-alloc trace and a final
    peak-VRAM summary. Pinned host allocations are excluded from the
    cap since they don't consume device VRAM.
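
A minimal sketch of the tracker, assuming the CUDA-era streaming path
(StreamingStats, s_malloc and POS2GPU_MAX_VRAM_MB come from this
commit; the field layout and error text are illustrative):

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>

    struct StreamingStats {
        size_t live = 0, peak = 0;                // bytes live now / high-water mark
        size_t cap = SIZE_MAX;                    // from POS2GPU_MAX_VRAM_MB
        std::unordered_map<void*, size_t> sizes;  // ptr -> allocation size
    };

    static void* s_malloc(StreamingStats& st, size_t bytes, const char* tag) {
        if (st.live + bytes > st.cap)   // soft cap: throw *before* allocating
            throw std::runtime_error(std::string("VRAM cap exceeded at ") + tag);
        void* p = nullptr;
        if (cudaMalloc(&p, bytes) != cudaSuccess)
            throw std::runtime_error(std::string("cudaMalloc failed: ") + tag);
        st.sizes[p] = bytes;
        st.live += bytes;
        st.peak = std::max(st.peak, st.live);     // feeds the peak-VRAM summary
        return p;
    }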

Phase 5: automatic dispatch with typed exception (sketched below).

  * New InsufficientVramError in GpuBufferPool.hpp, thrown by the pool
    ctor specifically from its cudaMemGetInfo pre-check (other CUDA
    failures still throw plain std::runtime_error).
  * run_gpu_pipeline(cfg) and BatchPlotter::run_batch catch
    InsufficientVramError and route to the streaming pipeline. No
    user-facing flag. Prior approach string-matched .what() — brittle;
    typed exception is compile-time-safe.
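
A sketch of that dispatch inside the one-shot overload (the pool-path
entry point's name and the PlotConfig / PlotResult types are
hypothetical; the exception and streaming entry point are the ones
named above):

    PlotResult run_gpu_pipeline(const PlotConfig& cfg) {   // one-shot overload
        try {
            GpuBufferPool pool(cfg);              // ctor pre-checks cudaMemGetInfo
            return run_pool_pipeline(cfg, pool);  // hypothetical pool-path entry
        } catch (const GpuBufferPool::InsufficientVramError&) {
            // Typed catch: other CUDA failures (plain std::runtime_error)
            // still propagate instead of silently degrading to streaming.
            return run_gpu_pipeline_streaming(cfg);
        }
    }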

Phase 6: memory reductions to land under 8 GB at k=28.

  * launch_t1_match / launch_t2_match now emit SoA streams — meta
    (uint64), mi (uint32), xbits (uint32 for T2) — instead of packed
    T1PairingGpu / T2PairingGpu arrays. Same total bytes, but the mi
    column can be fed directly to CUB as the sort key input and freed
    as soon as CUB consumes it (skips a copy-only extract kernel and
    reclaims ~1 GB at k=28). Pool path carves the three SoA arrays
    out of its existing d_pair_a slot; streaming allocates them as
    three separate cudaMallocs.
  * Streaming T2 sort splits the previously-fused merge_permute_t2
    into three passes: merge_pairs_stable_2way → gather_u64 meta →
    gather_u32 xbits. Frees source column between passes so each
    gather's peak only holds one source + one output. Drops post-CUB
    T2 peak from 9,360 MB to 7,280 MB.
  * Streaming T2 sort uses N=4 tiling + tree-of-2-way-merges
    (tile 0+1 → AB, tile 2+3 → CD, AB+CD → final). Halves per-tile
    CUB scratch (~1,044 MB → ~522 MB); AB/CD intermediates fit in
    the headroom gained. Without this, the binding CUB-scratch peak
    was 8,324 MB — 130 MB over the 8 GB target.
  * Alloc reorder throughout: sort outputs (d_t{1,2}_meta_sorted,
    d_t2_xbits_sorted) are allocated only after CUB has freed its
    scratch + vals_in buffers, keeping ~3 GB from going live all at
    once.

Batch-mode streaming.

  * BatchPlotter's streaming-fallback branch maintains two cap-sized
    pinned D2H buffers (double-buffered like the pool path: plot N
    writes slot N%2 while consumer reads slot (N-1)%2) and threads
    them into a new overload:
      run_gpu_pipeline_streaming(cfg, pinned_dst, pinned_capacity)
    Returns a borrowing result (external_fragments_ptr into pinned_dst)
    so the consumer reads directly from pinned — no intermediate
    owning-vector copy.
  * streaming_alloc_pinned_uint64 / streaming_free_pinned_uint64 shims
    live in GpuPipeline.cu so BatchPlotter.cpp (plain .cpp without
    cuda_runtime.h on its include path) can own pinned buffers.
  * XCHPLOT2_STREAMING=1 also bypasses pool construction in
    BatchPlotter; matches the one-shot dispatch and makes the batch
    streaming path testable on high-VRAM hardware.
  * Amortises the ~600 ms cudaMallocHost(2 GB) cost away: k=28 batch
    streaming is 3.65 s/plot vs 3.05 s/plot pool; the remaining 0.60 s
    delta is per-phase device alloc/free (streaming's whole point).

Parity.

  * t1_parity, t2_parity rebuild the AoS form locally after the SoA
    match kernels emit, preserving the existing CPU-vs-GPU set-equality
    check. Both still ALL OK across all seeds.
  * Pool vs streaming bit-exact at k=18 (6 plot_id × strength cases)
    and k=28 (plot_id=0xab*32).

Measured k=28 streaming peak trajectory on a 4090:

  | Stage                                 | Peak VRAM |
  |---------------------------------------|----------:|
  | Before Phase 6                        | 12,484 MB |
  | Fuse + reorder                        | 10,400 MB |
  | T2 match SoA                          |  9,360 MB |
  | T2 sort 3-pass                        |  8,324 MB |
  | T1 match SoA                          |  8,324 MB |
  | N=4 T2 tile + tree merge (final)      |  7,802 MB |

k=28 pool batch steady-state (5 plots on 4090, full free VRAM):
~2.09 s GPU per plot, 2.28 s wall/plot. Consistent with the pre-Phase-6
baseline — the SoA rewiring was structural, not perf-regressing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a streaming-path row to the perf table (~3.7 s/plot vs pool's
~2.06 s at k=28 on a 4090 — the delta is per-phase alloc/free that
the streaming path pays in exchange for a ~7.8 GB peak that fits on
an 8 GB card), expands the VRAM section to describe the two code
paths and the auto-dispatch at pool construction, and notes the
XCHPLOT2_STREAMING=1 override for forcing streaming on a high-VRAM
card. Architecture block cross-references the new streaming variant
in GpuPipeline.

No user-visible API change — callers use the same `xchplot2 plot` /
`test` / `batch` commands and get the right path based on available
VRAM, with `GpuBufferPool::InsufficientVramError` as the dispatch
signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small, stacking perf wins in the three match-kernel wrappers
(T1, T2, T3).

1. compute_bucket_offsets is no longer <<<1, 1>>> (kernel sketch after this list).

   The old kernel ran on a single thread that walked num_buckets
   binary searches serially. Latency is fine at strength=2 (16
   buckets at k=28) but scales linearly with (1 << strength) —
   painful at higher strengths. The new kernel dispatches one thread
   per bucket; thread num_buckets writes the sentinel
   offsets[num_buckets] = total. Launched with
   blocks = (num_buckets + 1 + 255) / 256.

   Correctness preserved: each thread does the same lower_bound
   lookup on its assigned bucket id as the old loop, just without
   the monotone "start at previous pos" hint (the starting 'pos' in
   the old version was purely a speedup; results are identical).

2. l_count_max is no longer computed on the host.

   The old path D2H'd the bucket-offsets array, cudaStreamSynchronize'd,
   and computed max over num_sections on CPU to size blocks_x for
   match_all_buckets. Three host fences per plot.

   Replaced with max_pairs_per_section(k, section_bits) from the new
   shared header src/host/PoolSizing.hpp. This is the same formula
   GpuBufferPool uses to size the persistent pool — a safe upper
   bound on per-section L-count. Excess threads launched past the
   real L-count early-exit on the existing `l >= l_end` guard at the
   top of match_all_buckets, so the over-launch is free on the GPU.

   The shared-header move also replaces the duplicated
   max_pairs_per_section formula in GpuBufferPool.cu's anon namespace
   and GpuPipeline.cu's max_pairs_per_section_streaming helper.

Measured on RTX 4090 (21 GB free), k=28 batch of 5 plots:

  Before: producer 2.09 s/plot,  batch wall 2.28 s/plot.
  After:  producer 1.96 s/plot,  batch wall 2.15 s/plot.

That's ~6 % wall reduction per plot, bigger than the ~150 µs × 3 that
the raw host-fence count would suggest. cudaStreamSynchronize drains
CUB's internal async state as well as the one kernel, so removing it
unblocks more than just the offsets kernel.
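
A CUDA sketch of the item-1 kernel (parameter names and the key layout,
bucket id in the bits above `shift`, are assumptions):

    __global__ void compute_bucket_offsets_parallel(const uint32_t* sorted_keys,
                                                    uint32_t total,
                                                    uint32_t num_buckets,
                                                    uint32_t shift,
                                                    uint32_t* offsets) {
        uint32_t b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b > num_buckets) return;
        if (b == num_buckets) { offsets[num_buckets] = total; return; }  // sentinel
        uint32_t lo = 0, hi = total;  // lower_bound: first key with bucket id >= b
        while (lo < hi) {
            uint32_t mid = (lo + hi) / 2;
            if ((sorted_keys[mid] >> shift) < b) lo = mid + 1; else hi = mid;
        }
        offsets[b] = lo;
    }
    // Launched as in the text: blocks = (num_buckets + 1 + 255) / 256,
    // 256 threads per block; one thread per bucket plus one for the sentinel.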

Parity verified:

  * t1_parity, t2_parity: ALL OK against the CPU reference (set
    equality).
  * Pool vs streaming bit-exact at k=18 (2 plot-ids × 2 strengths) and
    k=28 (plot_id=0xab*32).

Prerequisite for subsequent PRs (per-phase streams + async D2H via
cudaEvent) that depend on the absence of the host fence to let phases
and plots actually overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous wording ("build.rs auto-detects... falling back to sm_89.
Override with $CUDA_ARCHITECTURES") didn't say what the override
actually does or when you'd reach for it. Now it spells out:

 * autodetect is via `nvidia-smi --query-gpu=compute_cap` — builds for
   only that architecture so the binary is small and the build is fast;
 * fallback to sm_89 fires when nvidia-smi isn't in PATH or doesn't see
   a GPU (containers, headless CI builders without the driver);
 * override with CUDA_ARCHITECTURES when building for a different GPU
   than the one compiling, or when you want a fat binary covering
   multiple architectures (e.g. "89;120" for Ada + Blackwell).

Added a short table of common compute_cap values (61..120) so users
don't have to look them up separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reader landing on the README first wants to know whether their
hardware will run it at all — not where it sits on a 4090 perf curve.
Swap the two so the top-of-README info is "will this work for me?"
and benchmarks live at the bottom as a forward-looking reference.

Hardware compatibility now lists, up front:

 * GPU compute cap floor (sm_61; Pascal / GTX 10-series and up).
 * VRAM floor (8 GB, auto-streaming) and steady-state preference
   (16 GB+, pool path) with a cross-reference to the existing VRAM
   section.
 * PCIe width impact (Gen4 x4 → +240 ms/plot), with the live-check
   incantation that used to live in the Performance preamble.
 * Host RAM (~16 GB; batch pins ~4 GB).
 * Toolkit / runtime notes (CUDA 12+ to build, 12.8+ needed at
   runtime for Blackwell sm_120).
 * OS (Linux tested; Windows/macOS not).

Performance section kept intact and moved just above License. Also
refreshed the pool-path batch-wall row to 2.15 s/plot — the value
from the most recent 5-plot benchmark after the compute_bucket_offsets
+ l_count_max cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the pool carried 2 rotating pinned D2H slots and the batch
producer/consumer channel held depth 1. That matched the measured
case of producer wall > consumer wall (GPU ~2 s/plot, consumer
FSE+fwrite ~1 s/plot on NVMe) — consumer always caught up before
producer overwrote its slot.

For deployments where the consumer is the long pole, depth 1 leaves
the GPU idle while the consumer catches up. Concretely: a batch on
SATA SSD (~500 MB/s) pushes FSE+write to ~4.4 s/plot, flipping the
ratio.

Parameterise on GpuBufferPool::kNumPinnedBuffers (static constexpr
= 3). Pool ctor/dtor loop-allocate/free. Pool-overload of
run_gpu_pipeline's pinned_index check widened to the new upper
bound. BatchPlotter's streaming-fallback pinned array likewise grows
to 3 via the existing streaming_alloc_pinned_uint64 shim.

Channel becomes a bounded queue instead of std::optional (sketched after the list):
 * capacity = kNumPinnedBuffers - 1 (= 2 currently).
 * push waits on cv_not_full; pop on cv_not_empty.
 * Invariant: the producer's slot-(i%N) reuse is safe because the
   channel holds at most (N-1) items, so the consumer must have
   popped plot (i - N) before the producer enqueues plot i.
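
A minimal sketch of that channel (the cv names follow the text above;
the element type and everything else are illustrative):

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    template <typename T>
    class BoundedChannel {
        std::deque<T> q_;
        std::mutex m_;
        std::condition_variable cv_not_full, cv_not_empty;
        const size_t cap_;   // = kNumPinnedBuffers - 1
    public:
        explicit BoundedChannel(size_t cap) : cap_(cap) {}
        void push(T v) {
            std::unique_lock<std::mutex> lk(m_);
            cv_not_full.wait(lk, [&] { return q_.size() < cap_; });
            q_.push_back(std::move(v));
            cv_not_empty.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_not_empty.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop_front();
            cv_not_full.notify_one();
            return v;
        }
    };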

Host pinned cost at k=28: 4 GB → 6 GB. Device VRAM unchanged. On the
4090+NVMe reference the measured batch wall stays at 2.15 s/plot
(producer-bound, depth doesn't help), confirming the change is
latent capacity rather than a perf regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the CUDA-only kernels with portable SYCL implementations
compiled by AdaptiveCpp. Each kernel TU now lives as a .cpp consumed by
acpp; CUDA TUs (.cu) only ship when XCHPLOT2_BUILD_CUDA=ON.

Shared infrastructure:
  - PortableAttrs.hpp — POS2_DEVICE_INLINE / POS2_HOST_DEVICE macros
    that compile correctly under nvcc and acpp.
  - AesTables.inl — AES T-tables shared between the CUDA and SYCL paths.
  - SyclBackend.hpp — per-process sycl::queue (gpu_selector) plus a
    device-side AES table buffer initialised on first use.

Per-kernel SYCL ports (.cpp consumed by acpp):
  - T1OffsetsSycl, T2OffsetsSycl, T3OffsetsSycl
  - PipelineKernelsSycl, XsKernelsSycl
  - Renamed pipeline TUs (T1/T2/T3Kernel.cu, XsKernel.cu, GpuPipeline.cu,
    GpuBufferPool.cu) to .cpp; outer wrappers now take sycl::queue&.

Sort wrapper:
  - Sort.cuh declares launch_sort_pairs_u32_u32 / launch_sort_keys_u64
    over sycl::queue&.
  - SortCuda.cu (XCHPLOT2_BUILD_CUDA=ON) wires CUB radix sort, bridging
    the queue↔CUDA-stream boundary by draining q with q.wait(), running
    CUB on the default stream, then cudaStreamSynchronize.
  - SortSycl.cpp ships as a stub that throws on call; the hand-rolled
    SYCL radix sort lands in the next commit.
  - AesStub.cpp provides a no-op initialize_aes_tables for non-CUDA
    builds.

CMake:
  - XCHPLOT2_BUILD_CUDA option (default ON) selects between SortCuda.cu /
    SortSycl.cpp and AesGpu.cu+AesGpuBitsliced.cu / AesStub.cpp.
  - enable_language(CUDA) and find_package(CUDAToolkit) are gated on the
    option; CUDA include paths are probed and exposed to acpp TUs that
    transitively pull cuda_fp16.h via AdaptiveCpp's half.hpp.
  - add_sycl_to_target wraps the SYCL TU set; pos2_gpu links the union.

Updated parity tests (.cu) take sycl::queue&. New SYCL-side parity
tools (sycl_bucket_offsets_parity, sycl_g_x_parity) validate the ported
kernels against the CUDA reference.

Build matrices verified end-to-end:
  XCHPLOT2_BUILD_CUDA=ON  → NVIDIA fast path with CUB
  XCHPLOT2_BUILD_CUDA=OFF → SYCL-everywhere via AdaptiveCpp (sort still
                            stubbed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the SortSycl.cpp stub with a hand-rolled stable LSD radix sort
that runs on every AdaptiveCpp backend (CUDA, HIP, Level Zero, OpenCL).

Pipeline (per 4-bit pass; RADIX=16; TILE_SIZE=1024):
  Phase 1 — per-tile parallel count. Each WG (256 threads × 4 items)
    reduces its tile into a 16-bucket WG-local histogram via local
    atomics, then writes those 16 counts (no atomics) into bucket-major
    tile_hist[d * num_tiles + t].
  Phase 2 — single multi-WG exclusive scan over the entire bucket-major
    tile_hist via AdaptiveCpp's scanning::scan (decoupled-lookback).
    Because the layout is bucket-major, one 1-D scan yields tile_offsets
    directly — each entry is the global start of tile t's bucket-d range
    in the output. Stable by construction: tile t < t' always lands
    earlier within bucket d.
  Phase 3 — cooperative per-tile scatter. Items load contiguously per
    thread into local memory; for each digit d the WG runs one
    exclusive_scan_over_group on per-thread match counts to assign ranks
    in input order (stable), then every thread scatters its matching
    items to local_bases[d] + rank. All 256 threads stay active, no
    sequential bottleneck.
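
A sketch of the Phase 3 rank trick for one digit, simplified to one
item per work-item (the real kernel handles 4 items/thread and all 16
digits per pass; names are illustrative):

    #include <sycl/sycl.hpp>
    #include <cstdint>

    void scatter_one_digit(sycl::nd_item<1> it, uint32_t d, uint32_t shift,
                           uint32_t key, bool valid, uint32_t base,
                           uint32_t* out) {
        // 1 if this work-item's key has digit d in the current 4-bit window.
        uint32_t mine = (valid && ((key >> shift) & 0xF) == d) ? 1u : 0u;
        // Exclusive scan over the work-group assigns ranks in input order,
        // which is what makes the scatter stable.
        uint32_t rank = sycl::exclusive_scan_over_group(
            it.get_group(), mine, sycl::plus<uint32_t>());
        if (mine)
            out[base + rank] = key;   // base = this tile's start for bucket d
    }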

Sort.cuh no longer pulls cuda_fp16 / cuda_runtime — those moved into
SortCuda.cu (the only nvcc TU that needs them), keeping the public
header backend-portable.

Adds tools/parity/sycl_sort_parity that exercises both wrappers
against a std::sort reference at counts {16, 16K, 256K, 1M} × seeds
{1, 7, 31}; built unconditionally so it validates whichever Sort
backend is wired in (CUB on the NVIDIA build, hand-rolled radix on
non-CUDA). All 24 cases pass on both backends.

Throughput on RTX 4090 (warm, N=1M):
  pairs:  CUB 1.27 ms,  SYCL radix 0.92 ms
  keys:   CUB 1.70 ms,  SYCL radix 1.28 ms
The SYCL radix beats CUB-via-bridge at this scale because there's no
per-call SYCL→CUDA→SYCL fence; CUB's tuning is expected to take the
lead at N >> 1M.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer pool aliases d_pair_b as the Xs construction scratch (the
"alias d_pair_b for that, so no separate allocation" trick), so
pair_bytes must be sized to fit either the largest pairing struct or
the full Xs scratch. The previous calculation only accounted for the
pairing structs (max 16 B/elem × cap = ~18 × total_xs at k=22), but
the Xs scratch is 4 × total_xs uint32s plus the sort temp — and the
sort temp alone is ~8 × total_xs (CUB's input/output API mode, and
similarly ~8 × total_xs for the SYCL radix's ping-pong buffers).
That puts the actual Xs need at ~24 × total_xs, exceeding pair_bytes
on every k I tried (20, 22, 24, 26, 28).

The constructor's runtime assertion was firing immediately on every
plot attempt at HEAD, on both the CUB and SYCL backends — the alias
was unsafe and we threw before allocating anything. End-to-end
plotting was therefore broken at HEAD prior to this fix.

Compute xs_temp_bytes first, then fold it into the pair_bytes max.
The runtime assertion is dropped because the size now provably fits
by construction.
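
Illustrative arithmetic for the fix (the 16 B/elem, 4 × u32 and
~8 B/elem figures are the ones quoted above; the helper name is
hypothetical):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // pair_bytes must cover whichever is larger: the biggest pairing
    // struct or the full Xs scratch (4 u32 columns + ~8 B/elem of sort
    // temp, i.e. ~24 B per x).
    size_t pair_bytes_for(size_t total_xs, size_t pair_cap) {
        const size_t pairing_bytes = pair_cap * 16;                  // 16 B/elem max
        const size_t xs_temp_bytes = total_xs * 4 * sizeof(uint32_t) // Xs columns
                                   + total_xs * 8;                   // sort temp
        return std::max(pairing_bytes, xs_temp_bytes);               // fits by construction
    }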

VRAM impact: at k=28, pair_bytes grows from ~4.83 GB (18 × total_xs)
to ~6.4 GB (24 × total_xs), so two pair buffers cost an extra ~3.2
GB. Still comfortable on a 24 GB card.

Verified end-to-end on RTX 4090, k=28 (warm timings, mean of 3):
  CUB:  7.25 s/plot   (XCHPLOT2_BUILD_CUDA=ON)
  SYCL: 10.24 s/plot  (XCHPLOT2_BUILD_CUDA=OFF)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe the build host once at configure time and pick a sensible
AdaptiveCpp target list:

  - NVIDIA detected (nvidia-smi works) → ACPP_TARGETS=generic.
    Counter-intuitively, AdaptiveCpp's LLVM SSCP "generic" path is a
    few percent faster than cuda:sm_<arch> on our kernels at k=28
    (warm wall: 7.25 s vs 7.78 s on RTX 4090 with the CUB build);
    SSCP's runtime specialization beats CUDA-AOT for this workload.
  - AMD detected (rocminfo Name: gfxXXXX) → ACPP_TARGETS=hip:gfxXXXX.
    SSCP's HIP path is less mature, so AOT-compiling for the actual
    gfx target is the safer pick on AMD.
  - Otherwise → ACPP_TARGETS=generic (works everywhere; JITs on
    first use).

User-overridable via -DACPP_TARGETS=... (CMake) or $ACPP_TARGETS
(cargo install). The CMake-side detection runs in execute_process
with ERROR_QUIET so missing tools just fall through cleanly. The
build.rs side reuses the existing detect_cuda_arch() result and
adds detect_amd_gfx() for the rocminfo path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SYCL/AdaptiveCpp port is ~1.5× slower on NVIDIA at k=28 than the
original CUDA-only implementation. Users who only ever target NVIDIA
should know they have the option of the legacy CUDA-only branch
without giving up performance for portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add steady-state batch numbers for the three current paths on RTX 4090
at k=28: cuda-only (2.15 s/plot), main+CUB (2.41 s/plot), main+SYCL
(3.79 s/plot). Note that main+CUB is +12% over cuda-only and main+SYCL
is +57% over CUB — the gap is host-side AdaptiveCpp scheduling
overhead, not kernel perf (per-kernel nsys is within ~7% across the
two paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Build section was a one-liner about CUDA + C++20 + CMake + Rust;
it didn't mention AdaptiveCpp at all even though slice 9 made
AdaptiveCpp a hard build dependency. Restructure into:

  - Required toolchain (AdaptiveCpp, CUDA Toolkit headers + optional
    nvcc, C++20 compiler, CMake, Rust). Note that CUDA Toolkit headers
    are required on every build path because AdaptiveCpp's half.hpp
    pulls cuda_fp16.h.
  - Auto-fetched at configure time (pos2-chip via FetchContent, FSE
    vendored under pos2-chip).
  - Optional GPU runtimes for non-NVIDIA targets (ROCm probed by the
    ACPP_TARGETS autodetect; oneAPI requires manual override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iveCpp

Three layered install paths so users can pick the friction they want:

  1. Containerfile (podman-first, also docker). Build args select the
     base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD,
     intel/oneapi for Intel (experimental). All variants build
     AdaptiveCpp 25.10 from source inside the image and ship a slim
     runtime stage. ~15-30 min first build, layer-cached after.

  2. scripts/install-deps.sh — distro-aware native bootstrap covering
     Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via
     nvidia-smi/rocminfo and installs the right toolchain (full CUDA
     for NVIDIA, CUDA *headers* + ROCm for AMD), then builds
     AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets
     CMake fetch it.

  3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET)
     followed by FetchContent_Declare at v25.10.0 with
     FetchContent_MakeAvailable when the local lookup fails. Opt-in
     option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The
     add_sycl_to_target macro is verified after the fetch — if
     AdaptiveCpp doesn't expose it as a subproject we error with a
     pointer to the manual install.

build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container
builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.

README's Build section restructured into three clearly-labeled paths
with the full dependency table moved into path #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validated the Containerfile by running a full podman build and a k=22
plot inside the container with GPU passthrough via CDI. Output is MD5-
identical to the host build (42dedec6...). Five fixes uncovered along
the way:

  1. Add lld-18 to the apt install list — AdaptiveCpp's CMake hard-
     errors when ld.lld is missing from PATH. Also pass
     -DACPP_LLD_PATH=/usr/lib/llvm-18/bin/ld.lld explicitly.

  2. Move ACPP_TARGETS autodetect *before* find_package(AdaptiveCpp)
     in CMakeLists. AdaptiveCpp's package config reads the value at
     find time, and an empty -DACPP_TARGETS= (default Containerfile
     build-arg) makes acpp error out with "Unknown backend: ".

  3. build.rs treats `Ok("")` from env::var("ACPP_TARGETS") the same
     as Err — Containerfile build-args propagate as empty env vars
     when the user doesn't override.

  4. Link against AdaptiveCpp's runtime libs (acpp-rt + acpp-common)
     in build.rs. The static archives produced by CMake reference
     hipsycl::rt::* symbols that live there. ACPP_PREFIX env var
     (default /opt/adaptivecpp) controls the search path; an rpath
     entry is also added so the binary finds them at runtime.

  5. Use the CUDA *devel* image as BASE_RUNTIME (not the slim
     runtime) and install the full llvm-18 package in the runtime
     stage — AdaptiveCpp's SSCP path shells out to `opt-18` and
     `ptxas` at runtime, both of which are missing from the slim
     CUDA runtime + libllvm18 combination ("LLVMToPtx: opt
     invocation failed with exit code -1").

Plus a .dockerignore that drops build-*/, target/, third_party/, and
.git/ from the build context (was 946 MB, now ~50 MB).

Containerfile header comments still document the AMD ROCm and Intel
oneAPI build-arg combinations, but those remain untested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer-pool sizing fix (commit d70eefb) raised pool_bytes to
include the aliased Xs scratch, which pushed pool_total at k=28 from
~12 GB to ~15.2 GB device + the 0.5 GB margin. The previous "16 GB+
cards use the pool" framing is now stale — RTX 4080 (16 GB) sits below
the threshold after driver overhead and transparently falls back to
streaming. Update the hardware-compat blurb and the VRAM section to
reflect the new threshold and example cards (4090 / 5090 / A6000 /
H100). Auto-fallback still hides the change from users.

Steady-state per-plot reference also corrected from ~2.1s to ~2.4s
(matches the post-port batch numbers in the Performance table).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make explicit at the top of the README that plots are accurate
(per-phase parity vs pos2-chip + bit-identical between backends +
deterministic), but the project is still under active development on
performance, cross-vendor support, and tooling. Point first-time users
who just want a stable path at the cuda-only branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: cargo install on a different machine fails to link with:

    rust-lld: error: unable to find library -lacpp-rt
    rust-lld: error: unable to find library -lacpp-common

build.rs's hardcoded prefix list was incomplete (missed Ubuntu's
/usr/lib/x86_64-linux-gnu, Arch's /usr/lib, and the FetchContent
build tree under OUT_DIR/cmake-build/_deps/adaptivecpp-build/).

CMakeLists now writes the actual AdaptiveCpp lib directory to
$cmake_build/acpp-prefix.txt at configure time:

  - For installed AdaptiveCpp, derive from AdaptiveCpp_DIR
    (<prefix>/lib/cmake/AdaptiveCpp → <prefix>/lib).
  - For FetchContent builds, evaluate $<TARGET_FILE_DIR:acpp-rt>
    at file(GENERATE) time so the path resolves to the in-tree
    build artifact location.

build.rs reads acpp-prefix.txt first, falls back to ACPP_PREFIX /
AdaptiveCpp_ROOT env vars, then probes a wider list of standard
locations (/opt/adaptivecpp/lib, /usr/local/lib,
/usr/lib/x86_64-linux-gnu, /usr/lib).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported a low-VRAM machine (8 GB free at k=28) ballooning to
~130 GB host RAM during a failed batch run. The streaming pipeline
errored with "sycl::malloc_device(d_xs_temp): null" but kept
accumulating allocations across the failure path.

Two leak-resistance fixes:

  1. GpuBufferPool ctor wraps its allocation sequence in try/catch
     and frees any partial allocations before rethrowing. Without
     this, a mid-sequence OOM (e.g. d_pair_b after d_pair_a/d_storage
     succeeded) leaks ~10 GB device + ~7 GB pinned host per failed
     ctor — pathological under any retry loop.

  2. GpuPipeline streaming's StreamingStats now has a destructor
     that frees every allocation still tracked in its sizes map.
     If the streaming function throws partway (Xs phase OOM after
     d_xs already succeeded, T1 match OOM after T1 buffers
     allocated, etc.), the dtor runs on unwind and releases what's
     live. Removes the GPU leak that previously cascaded into the
     batch loop's pinned-host accounting.
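
A sketch of fix 2's unwind behaviour, using the SYCL-era tracker shape
(the member layout is assumed; sycl::free is the real USM API):

    #include <sycl/sycl.hpp>
    #include <unordered_map>

    struct StreamingStats {
        sycl::queue* q = nullptr;
        std::unordered_map<void*, size_t> sizes;  // live streaming allocations
        ~StreamingStats() {
            // Runs on normal exit *and* on exception unwind: whatever a
            // mid-phase OOM left allocated is released instead of leaking.
            for (auto& kv : sizes) sycl::free(kv.first, *q);
        }
    };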

Plus a clearer s_malloc error message when sycl::malloc_device
returns null — includes phase, requested size, live total, and a
hint to try a smaller k or larger card. Replaces the cryptic
"sycl::malloc_device(d_xs_temp): null" with actionable info.

These don't yet make 8 GB cards fit at k=28 on the SYCL build —
that needs Xs tiling and/or SortSycl scratch reduction (next
slice). They just stop leaking when the size mismatch hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUB-style DoubleBuffer pattern: launch_sort_pairs_u32_u32 and
launch_sort_keys_u64 now treat keys_in/keys_out (and vals_in/vals_out)
as a ping-pong pair across radix passes instead of allocating their own
keys_alt/vals_alt scratch (which was 8 × N bytes — 2 GB at k=28!).
The result always lands in keys_out; if the pass count is even (so the
ping-pong finishes in keys_in), the wrapper does one final memcpy from
keys_in.

API change: keys_in/vals_in are now non-const (caller treats them as
scratch on input). The CUB backend ignores the non-constness; the SYCL
backend uses both buffers as the ping-pong directly. Updated all call
sites (GpuBufferPool, GpuPipeline T1/T2/T3 sort sizing queries).

Memory wins at k=28 on the SYCL build:
  pair_bytes:    6.0 GB → 4.36 GB
  xs_temp:       6.18 GB → 4.33 GB
  sort_scratch:  2.4 GB  → 0.03 GB
  pool total:    19 GB   → 13 GB
  streaming Xs:  8.2 GB  → 6.3 GB   ← fits 8 GB cards now!

Verified:
  - All 24 sycl_sort_parity tests pass on the new sort.
  - k=22 plot output is byte-identical between CUB and SYCL builds
    (same MD5 42dedec6...).

The one extra memcpy on even pass counts (versus old code's initial
memcpy on entry) is a wash; total bytes copied per sort is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rofile

CUB's input/output SortPairs API allocates ~2 GB of internal temp keys/
vals at N=2^28 — that's what kept the streaming Xs scratch at ~6 GB on
the CUB build, OOM-ing 8 GB cards just like the (now-fixed) SYCL build
did. Switch to cub::DoubleBuffer mode: caller's keys_in/keys_out and
vals_in/vals_out act as the radix ping-pong, CUB's own scratch shrinks
to ~MB of histograms.

Side effect of DoubleBuffer mode: CUB picks which buffer the result
lands in (db.Current()), which may be either keys_in or keys_out
depending on the radix pass count. Mirror SortSycl's behaviour with a
final cudaMemcpyAsync from db.Current() to keys_out when needed,
preserving the public API contract (result always in keys_out).
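
A sketch of the DoubleBuffer-mode call for the u64-keys wrapper
(cub::DoubleBuffer and DeviceRadixSort::SortKeys are real CUB APIs;
the wrapper signature is an assumption):

    #include <cub/cub.cuh>

    void sort_keys_u64_db(uint64_t* keys_in, uint64_t* keys_out, size_t n,
                          void* d_temp, size_t temp_bytes, cudaStream_t stream) {
        cub::DoubleBuffer<uint64_t> db(keys_in, keys_out);  // caller buffers ping-pong
        cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, db,
                                       static_cast<int>(n), 0, 64, stream);
        // CUB decides where the result lands (db.Current()); preserve the
        // public contract (result always in keys_out) with one memcpy if needed.
        if (db.Current() != keys_out)
            cudaMemcpyAsync(keys_out, db.Current(), n * sizeof(uint64_t),
                            cudaMemcpyDeviceToDevice, stream);
    }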

Memory wins at k=28 on the CUB build:
  pair_bytes:    6.0 GB → 4.36 GB
  xs_temp:       6.0 GB → 4.33 GB
  pool total:    19 GB  → 13 GB
  streaming Xs:  8.0 GB → 6.3 GB   ← fits 8 GB cards now too

Verified: k=28 plot is byte-identical between CUB and SYCL builds
(MD5 814b4f2e...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container UX before required users to manually pass --build-arg for
BASE_DEVEL, BASE_RUNTIME, ACPP_TARGETS, XCHPLOT2_BUILD_CUDA,
INSTALL_CUDA_HEADERS — one chain per GPU vendor. compose.yaml wires
those up as three named services (cuda / rocm / intel) sharing the
same Containerfile, so users just pick:

    podman compose build cuda                       # NVIDIA, default
    ACPP_GFX=gfx1031 podman compose build rocm     # AMD, gfx target via env
    podman compose build intel                      # Intel, untested

Each service also handles GPU device passthrough (nvidia.com/gpu=all
on CUDA, /dev/kfd + /dev/dri + group_add: video on ROCm) and bind-
mounts ./plots → /out so output lands on the host.

Containerfile additions: build the parity tests (sycl_sort_parity,
sycl_bucket_offsets_parity, sycl_g_x_parity, plot_file_parity) via a
plain CMake step after the cargo install, and copy them to
/usr/local/bin in the runtime stage. Lets users run a quick first-
port validation on a new GPU before attempting a full plot:

    podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm

Image size grew from 2.54 GB → 7.78 GB because the runtime stage now
uses the CUDA *devel* image (needed by SSCP for runtime PTX assembly,
already required for SortCuda's nvcc TUs in the CUDA build) and
ships LLVM 18 binaries. Worth it for self-containment.

README's "Container" section rewritten to lead with compose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container builds run without GPU access, so compose.yaml has to
hardcode defaults (sm_89 for cuda, gfx1100 for rocm). The new wrapper
runs on the host (where nvidia-smi/rocminfo work), detects vendor +
arch, and exports CUDA_ARCH or ACPP_GFX before invoking compose.

  ./scripts/build-container.sh             # auto-detect
  ./scripts/build-container.sh --gpu amd   # force AMD path
  ./scripts/build-container.sh --engine docker

Drops the AMD UX from "set ACPP_GFX=gfx1031 then podman compose build
rocm" to a single command. README updated to lead with the script and
keep the manual compose invocation as an override path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Significant new functionality since 0.1.0 / the cuda-only era:

  - SYCL/AdaptiveCpp port (slices 1-18); cross-vendor architecture
    (AMD via HIP, Intel via Level Zero) with CUB preserved as opt-in
    fast path on NVIDIA.
  - Hand-rolled stable parallel SYCL radix sort.
  - GpuBufferPool sizing fix + free-on-throw RAII.
  - Both sort backends switched to DoubleBuffer ping-pong, dropping
    Xs scratch from ~6 GB to ~4.3 GB at k=28 — 8 GB cards now plot
    successfully via the streaming pipeline.
  - Containerfile + compose.yaml + scripts/install-deps.sh +
    scripts/build-container.sh: three layered install paths.
  - Auto-detect ACPP_TARGETS, CUDA arch, and (in the container
    wrapper) GPU vendor.
  - README, performance numbers, branch / WIP docs.

CLI surface unchanged; user-visible API stable. No breaking changes
for anyone who only consumed `xchplot2 plot/test/batch`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the AMD container build failing with:
    fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc':
    Unknown attribute kind (102) (Producer: 'LLVM22.0.0git'
    Reader: 'LLVM 18.1.3')

ROCm ships its own LLVM (currently dev-tip / LLVM 22). The HIP device
bitcode (ocml.bc, ockl.bc, …) is produced with that LLVM. AdaptiveCpp
was being built against Ubuntu's llvm-18, so when its HIP backend
linked our SYCL kernels against ROCm's bitcode, LLVM 18's reader
choked on LLVM 22's attribute encoding.

Fix: parametrize the LLVM toolchain via two new build args:

  - LLVM_ROOT      = base prefix containing bin/clang etc.
  - LLVM_CMAKE_DIR = directory of LLVMConfig.cmake
                     (Ubuntu and ROCm lay these out differently —
                     Ubuntu: $LLVM_ROOT/cmake,
                     ROCm:   $LLVM_ROOT/lib/cmake/llvm)

Defaults preserve Ubuntu's llvm-18 layout (NVIDIA/Intel paths
unchanged); compose.yaml's rocm service overrides both to point at
/opt/rocm/llvm so AdaptiveCpp + HIP backend match the bitcode
producer.

Also corrected a typo in the prior version: $LLVM_ROOT/bin/ contains
unsuffixed binaries (clang, clang++, ld.lld) — the -18 suffix only
exists on the Ubuntu /usr/bin/ symlinks, not in the versioned llvm-18
dir itself.

Verified: NVIDIA container still builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported build-container.sh on a fresh AMD machine printing
"No GPU detected" because rocminfo wasn't installed — and
install-deps.sh's AMD package list (rocm-hip-sdk + rocm-libs) doesn't
pull rocminfo transitively.

  - install-deps.sh: add rocminfo to all three distro AMD package
    lists (Arch, Ubuntu/Debian, Fedora). It's the discovery tool
    build-container.sh probes; tiny package, harmless to always
    install on the AMD path.
  - build-container.sh: when neither nvidia-smi nor rocminfo is
    found, print a multi-line hint pointing the user at either
    installing the right discovery tool, running install-deps.sh,
    or forcing the vendor explicitly with --gpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the script printing "No GPU detected" even though
rocminfo was installed and `command -v rocminfo && rocminfo | grep -q gfx`
returned MATCH when run inline.

The bug: the script enables `set -o pipefail`, which makes a pipeline
return the rightmost non-zero exit code. rocminfo (and some nvidia-smi
configurations) exit non-zero even when their output contains usable
GPU info. So `rocminfo 2>/dev/null | grep -q gfx` returned 0 from grep
but the pipeline returned 1 from rocminfo, causing the elif branch to
evaluate to false.

Restructure: capture each tool's stdout into a variable first (with
`|| true` to swallow the non-zero exit), then test the captured string
with [[ pattern ]]. No pipeline, no pipefail interaction.

Verified: script now correctly detects NVIDIA on this host
(vendor=nvidia service=cuda CUDA_ARCH=89). Should now work for AMD
hosts where rocminfo is installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipt)

Second pipefail trap, same shape as the first one. The old gfx-detection
line:

    gfx=$(rocminfo 2>/dev/null | awk '/.../ {print; exit}')

awk's `exit` after the first match closes its stdin, which delivers
SIGPIPE to rocminfo (still writing). With pipefail the pipeline returns
141 (128 + 13); set -e then exits the script silently.

That's why the user reported "no output" — the script was dying on
SIGPIPE right after writing the rocm_out variable, before reaching any
echo. The bash -x trace confirmed: execution reached `gfx=gfx1031`,
exit 141, no further output.

Fix: reuse the rocm_out string captured during vendor detection (or
capture it now if --gpu amd was forced) and parse with bash's built-in
[[ =~ ]] regex — no pipes, no SIGPIPE risk.

Verified locally: NVIDIA detection still works (vendor=nvidia
service=cuda CUDA_ARCH=89).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three files that were committed under docs/
(gpu-portability-sketch.md, perf-opportunities.md,
streaming-pipeline-design.md) are working notes from the SYCL port
slices, not shipped documentation. One of them even self-identifies as
"not shipped with the repo" in its first paragraph.

Add docs/ to .gitignore and remove the existing files from the index.
User-facing documentation belongs in README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Abraham Sewill and others added 25 commits April 29, 2026 00:04
typos' default dictionary tokenises mis-foo as `mis` + `foo` and
flags `mis` as a misspelling of `miss`/`mist`. Both occurrences in
validate_t1_count's broadened diagnostic from the previous commit
trip this. Reword to `miscompile`/`miscompiling` — same compiler
meaning, single token, dictionary-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streaming-pipeline plain and sliced T1 paths now print, when the env
var is set, the first 16 d_xs (match_info, x) entries before the
matcher launches and the resulting t1_count after. This discriminates
"upstream Xs phase silently produced wrong data" from "matcher kernel
fails at scale" on the W5700 / gfx1010 generic-JIT case where plot -k 28
returns 0 T1 entries while small-N parity passes.

Gated on env var; default-off so production paths see no change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When XCHPLOT2_BUILD_CUDA=OFF, autodetect ROCm via hip/hip_runtime.h. If
present, define XCHPLOT2_SKIP_CUDA_RUNTIME and XCHPLOT2_SKIP_CUDA_FP16
so CudaHalfShim.hpp falls back to its opaque stubs instead of pulling
in CUDA's <cuda_runtime.h>. Without the skip, dual-toolchain hosts
(CUDA Toolkit + ROCm both installed, e.g. the W5700 reporter's box) hit
typedef redefinition errors on char1 / int2 / etc. between CUDA's
<vector_types.h> and ROCm's <amd_hip_vector_types.h>.

Single-toolchain hosts (CUDA-only or AMD-only without CUDA Toolkit) are
unaffected: the find_path is only triggered on CUDA-off builds, and the
defines only land when ROCm is present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The W5700 / k=28 plot showed [0..16] of every Xs intermediate uniformly
0xBE (HIP poison fill), suggesting either (a) launch_xs_gen no-op'd
entirely on amdgcn at this scale, or (b) the kernel only failed to
write the first few pages while bulk-writing further offsets. Sampling
at head (idx=0), middle (idx=total/2), and tail (idx=total-16)
discriminates the two — uniform 0xBE across all three positions
confirms no-op; varied data at mid/tail confirms partial-write.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
W5700 / k=28: even after dropping Xs gen/pack tile to 1 k workgroups
(matching the parity-validated k=18 dispatch), post-gen H/M/T sees
0xCDCDCDCD (our pre-launch sentinel) — the kernel completes but
writes nothing. q.memset works (sentinel is visible), so queue
runtime primitives are fine; only kernel writes go missing. Smells
like AdaptiveCpp's HIP JIT producing empty stubs for our cooperative-
LDS + AesHashKeys kernels.

Two new env-gated checks before launch_xs_gen:
  - Trivial parallel_for (256 work-items, no LDS, no captured struct,
    no AES) writing 0xDEADBEEF to keys_a[0..16]. PASS / FAIL is a
    yes/no on whether the SYCL submission path can dispatch *any*
    kernel that actually writes on this device.
  - Read d_aes_tables[0..16] from host — should match the standard
    AES T0[0] = 0xC66363A5. If we see 0xBE or 0xCD instead, the
    T-table USM buffer was never populated and the kernels are
    reading garbage.

After this round we know whether the problem is below our level
(trivial kernel also fails) or above (trivial passes, our complex
kernels fail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous SKIP_CUDA_FP16 path left __half / __half2 undefined entirely.
On most hosts that's harmless (AdaptiveCpp's libkernel never names
them on the HIP/SSCP path the build picks), but on the reporter's W5700 /
gfx1010 / gfx1013-spoof + ROCm + AdaptiveCpp combo, the missing types
caused the JIT to silently emit no-op kernel stubs — every kernel
dispatch completed cleanly with zero device-side writes (sentinel
fills survived intact through trivial parallel_for and the AES
kernels alike).

Three-tier resolution in CudaHalfShim.hpp:
  1. CUDA Toolkit available + not skipped → <cuda_fp16.h>
  2. ROCm available → <hip/hip_fp16.h> (provides __half via HIP)
  3. Neither → minimal struct stubs (generic SSCP / Intel / containers)

Tier 2 is the one that activates when XCHPLOT2_BUILD_CUDA=OFF + ROCm
present (the configuration the prior CMake change targets), so AMD
builds now have __half from HIP rather than relying on AdaptiveCpp's
internal fallback.
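
A preprocessor sketch of the three tiers (the guard macros other than
XCHPLOT2_SKIP_CUDA_FP16 are assumptions standing in for whatever
CudaHalfShim.hpp actually tests):

    #if defined(XCHPLOT2_HAVE_CUDA_FP16) && !defined(XCHPLOT2_SKIP_CUDA_FP16)
      #include <cuda_fp16.h>              // tier 1: CUDA Toolkit headers
    #elif defined(XCHPLOT2_HAVE_ROCM)
      #include <hip/hip_fp16.h>           // tier 2: __half via HIP (AMD builds)
    #else
      struct __half  { unsigned short bits; };  // tier 3: minimal stubs for
      struct __half2 { __half x, y; };          // generic SSCP / Intel / containers
    #endif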

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cargo's build.rs sets -Wl,-rpath for AdaptiveCpp's lib dir and
${rocm_root}/lib via rustc-link-arg, so the production xchplot2
binary loads HIP fine. CMakeLists.txt had no rpath setup, so
binaries built via plain `cmake -B build && cmake --build build
--target sycl_t1_parity` had an empty RUNPATH and threw
"hipsycl::sycl::exception: No matching device" at queue
construction because librt-backend-hip.so could not dlopen
libamdhip64.so.

Append _xchplot2_acpp_lib_dir and the ROCm install root's lib
subdir to CMAKE_BUILD_RPATH / CMAKE_INSTALL_RPATH globally,
right after both paths have been computed. The FetchContent
case (where _xchplot2_acpp_lib_dir is a generator expression)
is filtered out — CMake's BUILD_WITH_INSTALL_RPATH=OFF default
already covers in-tree targets there.

Verified locally:
  readelf -d sycl_t1_parity → RUNPATH includes /opt/adaptivecpp/lib
                              and /opt/rocm/lib
  unset LD_LIBRARY_PATH; ./sycl_t1_parity --k 22 → ALL OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds inline validate_kernel_dispatch(q) that runs once on first
sycl_backend::queue() call per worker thread:
  - sycl::malloc_device 16 u32, throw clearly if it returns null
  - q.memset to 0xCD sentinel
  - q.parallel_for(16) writing kPattern + idx
  - q.memcpy back, verify the writes landed
  - throw std::runtime_error with a structured diagnostic message if not
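
A sketch of that sequence (kPattern's value and the diagnostic wording
are illustrative; the SYCL calls are the standard USM API):

    #include <sycl/sycl.hpp>
    #include <cstdint>
    #include <stdexcept>

    inline void validate_kernel_dispatch(sycl::queue& q) {
        constexpr uint32_t kPattern = 0x5E1F7E57u;   // value assumed
        uint32_t* d = sycl::malloc_device<uint32_t>(16, q);
        if (!d) throw std::runtime_error("selftest: malloc_device returned null");
        q.memset(d, 0xCD, 16 * sizeof(uint32_t)).wait();   // sentinel fill
        q.parallel_for(sycl::range<1>(16),
                       [=](sycl::id<1> i) { d[i] = kPattern + (uint32_t)i[0]; })
         .wait();
        uint32_t h[16];
        q.memcpy(h, d, sizeof(h)).wait();
        sycl::free(d, q);
        for (uint32_t i = 0; i < 16; ++i)
            if (h[i] != kPattern + i)
                throw std::runtime_error(
                    "selftest: kernel writes did not land (no-op stub?); "
                    "try ACPP_DEBUG_LEVEL=2 or ACPP_TARGETS=generic");
    }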

The throw fires at the first GPU work request — well before any
plot-specific allocation, kernel compile, or pipeline state is set up,
turning a multi-round "T1 match produced 0 entries" investigation into
a single one-line failure that points at AdaptiveCpp's HIP/CUDA backend
producing a no-op kernel stub.

Common causes the diagnostic message points to:
  - ACPP_DEBUG_LEVEL=2 to see the JIT compile log
  - rocminfo / nvidia-smi vs the AOT target (build.rs cargo:warning)
  - ACPP_TARGETS=generic to fall back from the spoof to SSCP JIT

Bypass with POS2GPU_SKIP_SELFTEST=1 once the device is known good
(useful for short-lived processes that re-validate every invocation).

Verified locally on RTX 4090 (gfx-spoof N/A, PTX backend):
  - sycl_t1_parity --k 22 → ALL OK (self-test passes silently)
  - POS2GPU_SKIP_SELFTEST=1 sycl_t1_parity --k 22 → ALL OK (bypass works)

Reported by the W5700 reporter — Radeon Pro W5700 / RDNA1 / gfx1010 / gfx1013-spoof
+ AdaptiveCpp. Production kernel writes silently no-op'd, surfacing only
as 'T1 match produced 0 entries' deep in the streaming pipeline. With
this self-test, the same configuration would have thrown immediately
with a pointer to the diagnosis path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The detect_amd_gfx() spoof rewrites gfx1010/1011/1012 → gfx1013 as a
community workaround for AdaptiveCpp not advertising those ISAs as
direct HIP AOT targets. Empirically the spoof has worked on some W5700
setups but silently produces no-op kernels on others (kernel writes
return cleanly with the output buffer untouched, surfacing as "T1
match produced 0 entries" deep in the streaming pipeline).

Add an opt-out env var so users on broken-spoof setups can try
AOT-targeting the actual ISA instead, without writing a full
ACPP_TARGETS string. Improve the cargo:warning to document both opt-out
paths (XCHPLOT2_NO_GFX_SPOOF=1 for native, ACPP_TARGETS=generic for SSCP
JIT) so users hitting the spoof can self-help without re-deriving the
escape hatches from the source.

No promise that the native target compiles — if AdaptiveCpp doesn't
accept gfx1010 as a HIP target on the user's toolchain version, the
build fails loudly. That's still strictly better than silently
producing broken kernels at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file, no pos2_gpu / pos2_gpu_host link — just sycl/sycl.hpp +
16-element parallel_for that writes a known pattern, copies back,
prints pass/fail per slot, exits 0 on all-OK.

Use it as the first diagnostic step when sycl_t1_parity or production
CLI silently produces no output. If hellosycl FAILs, the SYCL runtime
itself can't dispatch kernels on the detected device — no
xchplot2-level fix can recover, and the message points at the usual
suspects (rpath, JIT no-op stubs, ACPP_TARGETS picking an unsupported
ISA). If hellosycl PASSes, the runtime is healthy and the bug is
specific to our kernel patterns / pipeline.
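
A minimal sketch of what such a tool looks like (the real hellosycl.cpp
may differ; the pattern value is illustrative):

    #include <sycl/sycl.hpp>
    #include <cstdint>
    #include <cstdio>

    int main() {
        sycl::queue q{sycl::gpu_selector_v};
        uint32_t* d = sycl::malloc_device<uint32_t>(16, q);
        q.parallel_for(sycl::range<1>(16),
                       [=](sycl::id<1> i) { d[i] = 0xBEEF0000u + (uint32_t)i[0]; })
         .wait();
        uint32_t h[16] = {};
        q.memcpy(h, d, sizeof(h)).wait();
        sycl::free(d, q);
        int fails = 0;
        for (uint32_t i = 0; i < 16; ++i) {
            bool ok = (h[i] == 0xBEEF0000u + i);
            std::printf("slot %2u: %s\n", i, ok ? "PASS" : "FAIL");
            fails += !ok;
        }
        return fails ? 1 : 0;
    }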

Built via:
  cmake --build build --target hellosycl
  ./build/tools/sanity/hellosycl

Or standalone:
  ACPP_TARGETS=hip:gfx1013 acpp -O2 hellosycl.cpp -o hellosycl
  LD_LIBRARY_PATH=/opt/rocm/lib ./hellosycl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier rpath fix put /opt/rocm/lib in the binary's RUNPATH but
that only governs the binary's own dependency resolution. AdaptiveCpp
dlopens librt-backend-hip.so at runtime, and *that* lib then dlopens
libamdhip64 — glibc does not consult the calling binary's RUNPATH for
those transitive backend deps. Result: ROCm silently fails to load,
AdaptiveCpp falls through to its OpenMP host device, and tools like
hellosycl / sycl_t1_parity report "ALL OK" while having executed
entirely on CPU.

Mirror build.rs:631 (cargo:rustc-link-lib=amdhip64) — make
libamdhip64 a direct dependency of every CMake-built executable when
ROCm is detected. The library is then loaded at process startup via
RUNPATH, so the subsequent dlopen from librt-backend-hip.so succeeds
trivially against the already-loaded handle. Verified locally:

  ldd build/tools/sanity/hellosycl
  → libamdhip64.so.7 => /opt/rocm/lib/libamdhip64.so.7
  → libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1

NVIDIA-only hosts (no /opt/rocm/lib/libamdhip64.so) skip the link
entirely via the EXISTS guard, so we don't regress builds without
ROCm installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gfx1013 AOT spoof for gfx1010/1011/1012 was a community workaround
that "should" run on close-ISA RDNA1 silicon. Empirically it has been
observed to silently produce no-op kernels on at least one W5700 /
ROCm 6 / AdaptiveCpp 25.10 setup — the kernel completes without
writing anything, the failure surfaces only as "T1 match produced 0
entries" deep in the streaming pipeline.

Same host with ACPP_TARGETS=generic (SSCP JIT) reproducibly:
  - hellosycl: ALL OK on AMD Radeon Pro W5700
  - sycl_t1_parity --k 22: ALL OK (4194833 / 4194833)
  - sycl_t1_parity --k 24: ALL OK (16779604 / 16779604)

Default for RDNA1 (gfx1010/1011/1012) → ACPP_TARGETS=generic. Two
opt-in escape hatches preserved:
  - XCHPLOT2_FORCE_GFX_SPOOF=1 → restore the legacy gfx1013 AOT path
    for users who've validated their stack on it.
  - XCHPLOT2_NO_GFX_SPOOF=1    → AOT-target the actual ISA natively
    (build will fail if AdaptiveCpp doesn't advertise it).

Non-RDNA1 AMD targets (RDNA2+) are unchanged — rocminfo's gfx string
is passed through unmodified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rgets

Previous gating of libamdhip64 link on `acpp_targets.starts_with("hip:")`
broke the new RDNA1 default. After d939ee8 flipped RDNA1 to
ACPP_TARGETS=generic, AMD hosts no longer hit the hip:* branch — so
libamdhip64 stopped being linked into the binary. AdaptiveCpp's
runtime dlopen of librt-backend-hip.so then failed to find
libamdhip64.so.6 (RUNPATH isn't consulted for transitive backend deps
on glibc), HIP backend didn't initialise, and the binary threw
"No matching device" at first queue construction.

Drop the hip:* gate. Link libamdhip64 whenever ROCm is reachable
(/opt/rocm/lib/libamdhip64.so exists or ROCM_PATH points at one).
NVIDIA-only hosts skip the link via the EXISTS guard. Mirrors the
CMakeLists.txt fix from commit 60b7528 (`link_libraries(libamdhip64.so)`)
for the cargo build path.

Reported by the W5700 reporter — W5700 binary built after the RDNA1 default
flip threw "No matching device" before any plot work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RDNA1 default flip in d939ee8 made detect_amd_gfx() return None
for gfx1010/1011/1012 (so the caller picks ACPP_TARGETS=generic). But
the same function was being used in the XCHPLOT2_BUILD_CUDA selector
to decide "is there an AMD GPU?". With detect_amd_gfx() now
returning None for RDNA1:

  if usable_nvidia_arch().is_some() { ON }       // false on the W5700 reporter's box
  else if detect_amd_gfx().is_some() { OFF }     // false! (RDNA1 → None)
  else if detect_intel_gpu() { OFF }              // false
  else if detect_nvcc() { ON, "CI fallback" }    // → ON

→ XCHPLOT2_BUILD_CUDA flipped to ON on the reporter's W5700 + CUDA-Toolkit-headers
host. SortCuda.cu compiled, linked, and ran its CUB calls against AMD
silicon, throwing "CUB memcpy keys_out: invalid argument" mid-pipeline
(after launch_xs_gen had correctly populated keys_a/vals_a — visible
in the POS2GPU_T1_DEBUG=1 output).

Add amd_gpu_present() that just probes rocminfo for any gfx GPU,
independent of which ACPP_TARGETS string we'd pick for it. Use it in
the BUILD_CUDA selector so the AMD branch fires for RDNA1 too.

ACPP_TARGETS detection unchanged — still uses detect_amd_gfx() for
"which gfx target", and that function's None for RDNA1 still steers
the caller into the generic-SSCP fallback.

Reported by the W5700 reporter — W5700, ROCm 6, AdaptiveCpp 25.10, CUDA Toolkit
headers present (for CudaHalfShim) but no real CUDA capability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rinsic claim

The previous floor of sm_61 was set on a misreading of AdaptiveCpp's
half.hpp: it does call __hadd / __hsub / __hmul / __hdiv / __hlt /
__hgt without __CUDA_ARCH__ guards, but cuda_fp16.hpp implements those
intrinsics with NV_IF_ELSE_TARGET(NV_PROVIDES_SM_53, native_PTX,
fp32_emulation_fallback). So pre-sm_53 cards get a software fp32
fallback baked into the headers themselves — code compiles and runs,
just slower. The floor was over-conservative.

Real constraints:

  - sm_50: minimum that CUDA 12.x can codegen for. CUDA 11.x was
    last to support Kepler (sm_30-37); not in scope for this floor.
  - CUDA 13.x dropped sm_50-72 entirely; the existing CMakeLists
    preflight catches that pairing with FATAL_ERROR + fix block.

Add a second arm in usable_nvidia_arch() that detects the toolkit
mismatch (sm < 75 + nvcc >= 13) and routes the user to CUDA 12.9 or
the container path that auto-pins it. The arm fires BEFORE we'd
attempt to build, sparing the user a cryptic mid-build error.

Net: any Maxwell+ NVIDIA card works as primary GPU as long as the
user pairs it with the right CUDA toolkit. Maintainable without
patching upstream AdaptiveCpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fault

Hardware compatibility table updated for two changes that landed
recently in build.rs:

  - NVIDIA floor lowered sm_61 → sm_50 (commit a6985cf): pre-sm_53
    cards now compile + run via cuda_fp16.h's fp32 emulation, no
    AdaptiveCpp patch needed. Note added that build.rs also routes
    around the CUDA 13 + sm < 75 toolkit mismatch.
  - RDNA1 default flipped from gfx1013 AOT spoof to generic SSCP JIT
    (commit d939ee8). The spoof was observed to silently produce
    no-op kernels on at least one W5700; generic SSCP is now the
    default, with XCHPLOT2_FORCE_GFX_SPOOF / XCHPLOT2_NO_GFX_SPOOF
    as opt-in escape hatches.

Plus a CUDA-Toolkit-vs-arch matrix making the sm_50-72 / 12.9
constraint, the sm_75-90 / either-toolkit happy path, and the
sm_120 / 12.8+ constraint explicit instead of folded into a single
"12+ required" line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T0[a] is packed-LE (2S[a], S[a], S[a], 3S[a]). For S[0]=0x63 that's
bytes [C6 63 63 A5]; read as a little-endian u32 = 0xa56363c6 — which
is what the dump prints. The parenthetical inverted the byte order;
0xC66363A5 is the big-endian read of the same bytes (the form most
AES references show, hence the slip).

New text shows the algebraic construction plus the actual expected LE
value, so the operator can verify both "is the table populated" and
"is it the right table" at a glance under POS2GPU_T1_DEBUG=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…failure modes

Three distinct symptoms all trace back to the same root cause (AMD
host that also has CUDA Toolkit headers → build.rs picked
XCHPLOT2_BUILD_CUDA=ON before amd_gpu_present() landed in fe726fe):

  - "0 usable GPU device(s)" with --devices N
  - "CUB memcpy keys_out: invalid argument" mid-pipeline
  - "T1 match produced 0 entries" on RDNA1 (separate root cause —
    gfx1013 spoof producing no-op stubs — but same family of
    invisible-failure symptom that benefits from being on a search-
    indexable troubleshooting page)

Section is verbatim-symptom-first so users can grep their stderr
and land on the fix without having to read the prose around it. Also
mentions ACPP_VISIBILITY_MASK=hip;omp for the cosmetic CUDA-backend
loader warning that AdaptiveCpp emits when built with CUDA support
on a host without libcudart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…catch

Builds past 4394c66 surface the BUILD_CUDA-vs-non-CUDA-device mismatch
at queue construction with a clear "selftest landed on a non-CUDA
device" exception, not the bare CUB error 30 seconds in. Worth saying
explicitly so users grepping the README know which symptom to expect
on a recent build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors detect_intel_gpu()'s sysfs PCI vendor-ID approach (0x1002 for
AMD) so amd_gpu_present() works even when rocminfo isn't on $PATH at
build time. Reproduced on a Radeon Pro W5700 host where the reporter
has rocminfo installed (it works at runtime via AdaptiveCpp's HIP
backend) but the cargo install shell didn't have /opt/rocm/bin on
PATH. Autodetect missed AMD, fell through to the "nvcc present → CI
fallback" arm, BUILD_CUDA flipped ON, the streaming pipeline tried to
dispatch a CUB sort against the W5700, and the new selftest at 4394c66
caught it loudly.

The sysfs path needs no user-space tools — only readable
/sys/class/drm/card*/device/vendor, which is true on every Linux host
with the amdgpu / radeon kernel module loaded. Robust against:

  - rocminfo not on PATH (this case)
  - rocminfo on PATH but failing because /dev/kfd isn't accessible to
    the build user (cargo install via systemd / chroot / different uid)
  - ROCm not installed yet but the kernel module is loaded (e.g. on a
    fresh distro install where the user is mid-setup)

Doesn't replace rocminfo — that's still the primary signal because it
tells us the gfx target string we'd compile for. Sysfs only answers
"is there an AMD GPU at all", which is exactly what amd_gpu_present()
needs.
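
The probe is a few lines; a C++ sketch of the same mechanics (the real
code is Rust in build.rs, and the function name here is hypothetical):

  #include <filesystem>
  #include <fstream>
  #include <string>
  #include <system_error>

  bool amd_gpu_present_sysfs() {
      namespace fs = std::filesystem;
      std::error_code ec;
      for (const auto& node : fs::directory_iterator("/sys/class/drm", ec)) {
          std::ifstream in(node.path() / "device" / "vendor");
          std::string vendor;
          if (in >> vendor && vendor == "0x1002")  // AMD's PCI vendor ID
              return true;
      }
      return false;  // no drm nodes readable, or none with an AMD vendor
  }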

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… AMD/Intel concurrently

Previously SortCuda.cu/SortSyclCub.cpp and SortSycl.cpp were mutually
exclusive at build time: BUILD_CUDA=ON gave CUB-only, BUILD_CUDA=OFF
gave SYCL-only. A hybrid host (NVIDIA + AMD on the same box) had to
pick one, hiding the other from --devices N (and hiding it from
--devices all entirely).

Reorganize the sort entry points into:

  - launch_sort_*_cub       (SortSyclCub.cpp, BUILD_CUDA=ON only)
  - launch_sort_*_sycl      (SortSycl.cpp, always built)
  - launch_sort_*           (SortDispatch.cpp, always built; picks
                              by q.get_device().get_backend() at
                              runtime — sycl::backend::cuda → _cub,
                              else → _sycl)

CMake now always compiles SortSycl.cpp + SortDispatch.cpp; SortSyclCub.cpp
is added on top when BUILD_CUDA=ON. The CUB branch in the dispatcher
is gated by XCHPLOT2_HAVE_CUB so AMD-only / Intel-only / CPU builds
compile it out — the dispatcher reduces to a single tail call into
SortSycl on those builds.
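
The dispatcher itself reduces to a few lines; a sketch using the names
above (signatures simplified to a single pairs variant; bodies live in
SortSyclCub.cpp / SortSycl.cpp and are only declared here):

  #include <sycl/sycl.hpp>
  #include <cstddef>
  #include <cstdint>

  void launch_sort_pairs_cub(sycl::queue&, uint32_t*, uint32_t*, std::size_t);
  void launch_sort_pairs_sycl(sycl::queue&, uint32_t*, uint32_t*, std::size_t);

  void launch_sort_pairs(sycl::queue& q, uint32_t* keys, uint32_t* vals,
                         std::size_t n) {
  #ifdef XCHPLOT2_HAVE_CUB
      if (q.get_device().get_backend() == sycl::backend::cuda)
          return launch_sort_pairs_cub(q, keys, vals, n);  // NVIDIA queue
  #endif
      launch_sort_pairs_sycl(q, keys, vals, n);  // the single tail call
  }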

End-to-end on the dev box (NVIDIA RTX 4090 + AdaptiveCpp 25.10 SSCP
generic JIT, BUILD_CUDA=ON): sycl_sort_parity all-PASS at every count
(16 / 16k / 262k / 1M) for both pairs and keys, perf within noise of
the pre-refactor CUB-only path. AdaptiveCpp's SSCP backend reports
sycl::backend::cuda for NVIDIA devices, so the dispatcher routes to
CUB as expected.

Sets up the next two cleanups: usable_gpu_devices() can stop filtering
non-CUDA backends (the binary handles them now) and the BUILD_CUDA-vs-
device-mismatch selftest catch becomes redundant. Done in follow-up
commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prints id, name, backend, VRAM, compute-unit count, and the sort path
the runtime dispatcher will pick for a worker on each device (CUB on
cuda-backend queues when this build links CUB, SortSycl otherwise).
The printed `[N]` index is the same value `--devices N` in `plot` /
`batch` accepts.

Example output on a single-NVIDIA dev box:

  Visible GPU devices (1):
    [0] NVIDIA GeForce RTX 4090   backend=cuda  vram=24076 MB  CUs=128  sort:CUB

  Use `--devices N` (id) in `plot` / `batch` to pick a specific
  device, or `--devices all` for one worker per device.

Implementation split across two TUs to keep the SYCL include out of
cli.cpp:

  - SyclDeviceList.hpp: plain-types declaration (struct GpuDeviceInfo,
    list_gpu_devices()). Includable from any TU.
  - SyclDeviceList.cpp: queries via SyclBackend.hpp; compiled by acpp
    via add_sycl_to_target.

Direct inclusion of SyclBackend.hpp into cli.cpp triggered a
-Werror=narrowing in AdaptiveCpp's libkernel/host/builtins.hpp under
g++; the split keeps cli.cpp SYCL-free.
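
A plausible shape for the header (the struct and free function are what
the commit states; field names are guesses reconstructed from the
printed columns):

  #pragma once
  #include <cstdint>
  #include <string>
  #include <vector>

  struct GpuDeviceInfo {
      int         id;            // the [N] that --devices N accepts
      std::string name;
      std::string backend;       // "cuda" / "hip" / "level_zero" / "?"
      uint64_t    vram_mb;
      uint32_t    compute_units;
      bool        sort_via_cub;  // dispatcher routes this device to CUB
  };

  // Defined in SyclDeviceList.cpp (compiled by acpp); no SYCL types leak.
  std::vector<GpuDeviceInfo> list_gpu_devices();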

The opencl backend case in the switch was dropped — AdaptiveCpp's
hipsycl::rt::backend_id enum doesn't expose it. cuda / hip /
level_zero cover real-world deployments; everything else falls into
the "?" default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Us-only

Token semantics:

  all       → every visible GPU + the CPU worker (was: GPUs only)
  gpu       → every visible GPU                  (new — was implicit in `all`)
  cpu       → CPU worker only                    (unchanged)
  0,2,3     → explicit GPU ids                   (unchanged)

Reads more naturally: "all" should mean everything, and "gpu" gives
the old all-GPUs-no-CPU behavior. Existing scripts using `--devices
all` gain a CPU worker (1-2 orders of magnitude slower than a GPU, so
it usually finishes last but doesn't block the GPU workers).
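
Token dispatch is trivial; an illustrative parse (the enum and function
names are hypothetical, not the actual CLI code):

  #include <string>

  enum class DeviceSel { All, Gpu, Cpu, Ids };

  DeviceSel parse_devices_token(const std::string& tok) {
      if (tok == "all") return DeviceSel::All;  // every GPU + the CPU worker
      if (tok == "gpu") return DeviceSel::Gpu;  // every GPU (the old "all")
      if (tok == "cpu") return DeviceSel::Cpu;  // CPU worker only
      return DeviceSel::Ids;                    // "0,2,3": explicit GPU ids
  }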

print_usage, devices subcommand hint, and README examples all
updated to reflect the new naming. Tested on dev box (NVIDIA + CPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f idling

Multi-device used to pre-partition entries round-robin: with 10 plots
and a GPU + CPU host, the GPU got plots [0,2,4,6,8] and the CPU got
[1,3,5,7,9]. The GPU finished its share in ~50s and then sat idle for
~25 minutes while the CPU plodded through its half. End-to-end batch
wall time was bounded by the CPU.

Convert run_batch_slice's inner loop to optionally pull plot indices
from a shared atomic counter instead of iterating its own vector.
Multi-device passes a single shared `next_idx` to every worker; whichever
worker finishes its current plot first grabs the next one. So the GPU
keeps pulling work for as long as plots remain, and the CPU only
handles whatever it can finish in the same wall time.

Per-worker pinned-buffer slot rotation is decoupled from the global
plot index — peer workers each own their own GpuBufferPool, so the
slot must come from a per-worker `local_count`, not the (now-shared)
plot index.

Single-device path unchanged (shared_idx defaults to nullptr → original
sequential iteration). Verbose messages drop the misleading "%zu/%zu"
denominator — with dynamic dispatch the worker doesn't know the
batch's total or its own share.
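
The pull loop, as a sketch (run_batch_slice's real signature is larger;
plot_one() is a stand-in for the per-plot body, and the shared_idx /
local_count semantics are as described above):

  #include <atomic>
  #include <cstddef>
  #include <vector>

  void plot_one(std::size_t plot_idx, std::size_t pinned_slot);  // stand-in

  void run_batch_slice(const std::vector<std::size_t>& my_plots,
                       std::atomic<std::size_t>* shared_idx,
                       std::size_t total_plots) {
      std::size_t local_count = 0;  // per-worker pinned-buffer slot rotation
      if (!shared_idx) {            // single-device: original sequential walk
          for (std::size_t idx : my_plots)
              plot_one(idx, local_count++);
          return;
      }
      for (;;) {                    // multi-device: first-free worker wins
          const std::size_t idx =
              shared_idx->fetch_add(1, std::memory_order_relaxed);
          if (idx >= total_plots)
              break;                // batch exhausted
          plot_one(idx, local_count++);  // slot from local_count, not idx
      }
  }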

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps [chia](https://github.com/Chia-Network/chia_rs) from 0.42.0 to 0.42.1.
- [Release notes](https://github.com/Chia-Network/chia_rs/releases)
- [Commits](Chia-Network/chia_rs@0.42.0...0.42.1)

---
updated-dependencies:
- dependency-name: chia
  dependency-version: 0.42.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot added the dependencies and rust labels on May 4, 2026