
build(deps): bump chia from 0.42.0 to 0.42.1 in /keygen-rs #6

Open
dependabot[bot] wants to merge 208 commits into main from dependabot/cargo/keygen-rs/chia-0.42.1

Conversation

dependabot[bot] (Contributor) commented on behalf of GitHub on May 4, 2026

Bumps chia from 0.42.0 to 0.42.1.

Release notes

Sourced from chia's releases.

0.42.1

What's Changed

Full Changelog: Chia-Network/chia_rs@0.42.0...0.42.1

Commits
  • 746c688 Merge pull request #1426 from Chia-Network/bump-0.42.1
  • 03e02c0 bump version to 0.42.1
  • df18c3e Merge pull request #1425 from Chia-Network/bump-clvmrs
  • a23dd08 harmonize versions of thiserror, p256, k256, rand, rand_chacha with clvm_rs. ...
  • 695643c bump clvmr to 0.17.7
  • a256e9d Merge pull request #1422 from Chia-Network/check-time-lock
  • 1c51179 python test cases
  • 787796b extend tests for check_time_lock()
  • 1a00644 simplify check_time_lock() tests
  • 0c1c6a6 [CHIA-3854] Alternative: pure storage cost model (12000/0) for generator iden...
  • Additional commits viewable in compare view

Dependabot compatibility score

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

  • @dependabot rebase will rebase this PR
  • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
  • @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
  • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
  • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Abraham Sewill and others added 30 commits April 19, 2026 19:59
Introduces run_gpu_pipeline_streaming() — a per-phase alloc/free variant
of the plot pipeline that lets xchplot2 run on GPUs too small to hold
the ~15 GB GpuBufferPool (8 GB cards like GTX 1070). Verified bit-exact
against the pool path at k=18 and k=28.

Phase 2-3: orchestration + tile+merge.

  * New GpuPipeline.cu body: allocate d_xs, run launch_construct_xs,
    free scratch, alloc d_t1_meta/d_t1_mi, run T1 match, free d_xs,
    etc. Each phase's buffers are sized exactly for that phase and
    released before the next alloc.
  * T1 and T2 sort phases tile the input and merge the sorted runs via
    a stable 2-way merge-path kernel (merge_pairs_stable_2way). Ties
    go to the left half, matching the global stable ordering (tile 0
    indices are strictly less than tile 1's).
  * XCHPLOT2_STREAMING=1 forces the streaming path through the
    one-shot run_gpu_pipeline(cfg) overload — useful for testing and
    for users who want the smaller peak even when the pool fits.

Phase 4: VRAM tracking + cap enforcement (sketched below).

  * StreamingStats struct + s_malloc/s_free route every cudaMalloc
    in the streaming path through a tracker. POS2GPU_MAX_VRAM_MB
    enforces a soft cap (throws before the allocation exceeds it);
    POS2GPU_STREAMING_STATS=1 prints a per-alloc trace and a final
    peak-VRAM summary. Pinned host allocations are excluded from the
    cap since they don't consume device VRAM.
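
A minimal sketch of the tracker, assuming the CUDA-era streaming path
(StreamingStats, s_malloc and POS2GPU_MAX_VRAM_MB come from this
commit; the field layout and error text are illustrative):

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>

    struct StreamingStats {
        size_t live = 0, peak = 0;                // bytes live now / high-water mark
        size_t cap = SIZE_MAX;                    // from POS2GPU_MAX_VRAM_MB
        std::unordered_map<void*, size_t> sizes;  // ptr -> allocation size
    };

    static void* s_malloc(StreamingStats& st, size_t bytes, const char* tag) {
        if (st.live + bytes > st.cap)   // soft cap: throw *before* allocating
            throw std::runtime_error(std::string("VRAM cap exceeded at ") + tag);
        void* p = nullptr;
        if (cudaMalloc(&p, bytes) != cudaSuccess)
            throw std::runtime_error(std::string("cudaMalloc failed: ") + tag);
        st.sizes[p] = bytes;
        st.live += bytes;
        st.peak = std::max(st.peak, st.live);     // feeds the peak-VRAM summary
        return p;
    }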

Phase 5: automatic dispatch with typed exception (sketched below).

  * New InsufficientVramError in GpuBufferPool.hpp, thrown by the pool
    ctor specifically from its cudaMemGetInfo pre-check (other CUDA
    failures still throw plain std::runtime_error).
  * run_gpu_pipeline(cfg) and BatchPlotter::run_batch catch
    InsufficientVramError and route to the streaming pipeline. No
    user-facing flag. Prior approach string-matched .what() — brittle;
    typed exception is compile-time-safe.
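
A sketch of that dispatch inside the one-shot overload (the pool-path
entry point's name and the PlotConfig / PlotResult types are
hypothetical; the exception and streaming entry point are the ones
named above):

    PlotResult run_gpu_pipeline(const PlotConfig& cfg) {   // one-shot overload
        try {
            GpuBufferPool pool(cfg);              // ctor pre-checks cudaMemGetInfo
            return run_pool_pipeline(cfg, pool);  // hypothetical pool-path entry
        } catch (const GpuBufferPool::InsufficientVramError&) {
            // Typed catch: other CUDA failures (plain std::runtime_error)
            // still propagate instead of silently degrading to streaming.
            return run_gpu_pipeline_streaming(cfg);
        }
    }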

Phase 6: memory reductions to land under 8 GB at k=28.

  * launch_t1_match / launch_t2_match now emit SoA streams — meta
    (uint64), mi (uint32), xbits (uint32 for T2) — instead of packed
    T1PairingGpu / T2PairingGpu arrays. Same total bytes, but the mi
    column can be fed directly to CUB as the sort key input and freed
    as soon as CUB consumes it (skips a copy-only extract kernel and
    reclaims ~1 GB at k=28). Pool path carves the three SoA arrays
    out of its existing d_pair_a slot; streaming allocates them as
    three separate cudaMallocs.
  * Streaming T2 sort splits the previously-fused merge_permute_t2
    into three passes: merge_pairs_stable_2way → gather_u64 meta →
    gather_u32 xbits. Frees source column between passes so each
    gather's peak only holds one source + one output. Drops post-CUB
    T2 peak from 9,360 MB to 7,280 MB.
  * Streaming T2 sort uses N=4 tiling + tree-of-2-way-merges
    (tile 0+1 → AB, tile 2+3 → CD, AB+CD → final). Halves per-tile
    CUB scratch (~1,044 MB → ~522 MB); AB/CD intermediates fit in
    the headroom gained. Without this, the binding CUB-scratch peak
    was 8,324 MB — 130 MB over the 8 GB target.
  * Alloc reorder throughout: sort outputs (d_t{1,2}_meta_sorted,
    d_t2_xbits_sorted) are allocated only after CUB has freed its
    scratch + vals_in buffers, keeping ~3 GB from going live all at
    once.

Batch-mode streaming.

  * BatchPlotter's streaming-fallback branch maintains two cap-sized
    pinned D2H buffers (double-buffered like the pool path: plot N
    writes slot N%2 while consumer reads slot (N-1)%2) and threads
    them into a new overload:
      run_gpu_pipeline_streaming(cfg, pinned_dst, pinned_capacity)
    Returns a borrowing result (external_fragments_ptr into pinned_dst)
    so the consumer reads directly from pinned — no intermediate
    owning-vector copy.
  * streaming_alloc_pinned_uint64 / streaming_free_pinned_uint64 shims
    live in GpuPipeline.cu so BatchPlotter.cpp (plain .cpp without
    cuda_runtime.h on its include path) can own pinned buffers.
  * XCHPLOT2_STREAMING=1 also bypasses pool construction in
    BatchPlotter; matches the one-shot dispatch and makes the batch
    streaming path testable on high-VRAM hardware.
  * Amortises the ~600 ms cudaMallocHost(2 GB) cost away: k=28 batch
    streaming is 3.65 s/plot vs 3.05 s/plot pool; the remaining 0.60 s
    delta is per-phase device alloc/free (streaming's whole point).

Parity.

  * t1_parity, t2_parity rebuild the AoS form locally after the SoA
    match kernels emit, preserving the existing CPU-vs-GPU set-equality
    check. Both still ALL OK across all seeds.
  * Pool vs streaming bit-exact at k=18 (6 plot_id × strength cases)
    and k=28 (plot_id=0xab*32).

Measured k=28 streaming peak trajectory on a 4090:

  | Stage                                 | Peak VRAM |
  |---------------------------------------|----------:|
  | Before Phase 6                        | 12,484 MB |
  | Fuse + reorder                        | 10,400 MB |
  | T2 match SoA                          |  9,360 MB |
  | T2 sort 3-pass                        |  8,324 MB |
  | T1 match SoA                          |  8,324 MB |
  | N=4 T2 tile + tree merge (final)      |  7,802 MB |

k=28 pool batch steady-state (5 plots on 4090, full free VRAM):
~2.09 s GPU per plot, 2.28 s wall/plot. Consistent with the pre-Phase-6
baseline — the SoA rewiring was structural, not perf-regressing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a streaming-path row to the perf table (~3.7 s/plot vs pool's
~2.06 s at k=28 on a 4090 — the delta is per-phase alloc/free that
the streaming path pays in exchange for a ~7.8 GB peak that fits on
an 8 GB card), expands the VRAM section to describe the two code
paths and the auto-dispatch at pool construction, and notes the
XCHPLOT2_STREAMING=1 override for forcing streaming on a high-VRAM
card. Architecture block cross-references the new streaming variant
in GpuPipeline.

No user-visible API change — callers use the same `xchplot2 plot` /
`test` / `batch` commands and get the right path based on available
VRAM, with `GpuBufferPool::InsufficientVramError` as the dispatch
signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small, stacking perf wins in the three match-kernel wrappers
(T1, T2, T3).

1. compute_bucket_offsets is no longer <<<1, 1>>> (kernel sketch after this list).

   The old kernel ran on a single thread that walked num_buckets
   binary searches serially. Latency is fine at strength=2 (16
   buckets at k=28) but scales linearly with (1 << strength) —
   painful at higher strengths. The new kernel dispatches one thread
   per bucket; thread num_buckets writes the sentinel
   offsets[num_buckets] = total. Launched with
   blocks = (num_buckets + 1 + 255) / 256.

   Correctness preserved: each thread does the same lower_bound
   lookup on its assigned bucket id as the old loop, just without
   the monotone "start at previous pos" hint (the starting 'pos' in
   the old version was purely a speedup; results are identical).

2. l_count_max is no longer computed on the host.

   The old path D2H'd the bucket-offsets array, cudaStreamSynchronize'd,
   and computed max over num_sections on CPU to size blocks_x for
   match_all_buckets. Three host fences per plot.

   Replaced with max_pairs_per_section(k, section_bits) from the new
   shared header src/host/PoolSizing.hpp. This is the same formula
   GpuBufferPool uses to size the persistent pool — a safe upper
   bound on per-section L-count. Excess threads launched past the
   real L-count early-exit on the existing `l >= l_end` guard at the
   top of match_all_buckets, so the over-launch is free on the GPU.

   The shared-header move also replaces the duplicated
   max_pairs_per_section formula in GpuBufferPool.cu's anon namespace
   and GpuPipeline.cu's max_pairs_per_section_streaming helper.

Measured on RTX 4090 (21 GB free), k=28 batch of 5 plots:

  Before: producer 2.09 s/plot,  batch wall 2.28 s/plot.
  After:  producer 1.96 s/plot,  batch wall 2.15 s/plot.

That's ~6 % wall reduction per plot, bigger than the ~150 µs × 3 that
the raw host-fence count would suggest. cudaStreamSynchronize drains
CUB's internal async state as well as the one kernel, so removing it
unblocks more than just the offsets kernel.
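
A CUDA sketch of the item-1 kernel (parameter names and the key layout,
bucket id in the bits above `shift`, are assumptions):

    __global__ void compute_bucket_offsets_parallel(const uint32_t* sorted_keys,
                                                    uint32_t total,
                                                    uint32_t num_buckets,
                                                    uint32_t shift,
                                                    uint32_t* offsets) {
        uint32_t b = blockIdx.x * blockDim.x + threadIdx.x;
        if (b > num_buckets) return;
        if (b == num_buckets) { offsets[num_buckets] = total; return; }  // sentinel
        uint32_t lo = 0, hi = total;  // lower_bound: first key with bucket id >= b
        while (lo < hi) {
            uint32_t mid = (lo + hi) / 2;
            if ((sorted_keys[mid] >> shift) < b) lo = mid + 1; else hi = mid;
        }
        offsets[b] = lo;
    }
    // Launched as in the text: blocks = (num_buckets + 1 + 255) / 256,
    // 256 threads per block; one thread per bucket plus one for the sentinel.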

Parity verified:

  * t1_parity, t2_parity: ALL OK against the CPU reference (set
    equality).
  * Pool vs streaming bit-exact at k=18 (2 plot-ids × 2 strengths) and
    k=28 (plot_id=0xab*32).

Prerequisite for subsequent PRs (per-phase streams + async D2H via
cudaEvent) that depend on the absence of the host fence to let phases
and plots actually overlap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous wording ("build.rs auto-detects... falling back to sm_89.
Override with $CUDA_ARCHITECTURES") didn't say what the override
actually does or when you'd reach for it. Now it spells out:

 * autodetect is via `nvidia-smi --query-gpu=compute_cap` — builds for
   only that architecture so the binary is small and the build is fast;
 * fallback to sm_89 fires when nvidia-smi isn't in PATH or doesn't see
   a GPU (containers, headless CI builders without the driver);
 * override with CUDA_ARCHITECTURES when building for a different GPU
   than the one compiling, or when you want a fat binary covering
   multiple architectures (e.g. "89;120" for Ada + Blackwell).

Added a short table of common compute_cap values (61..120) so users
don't have to look them up separately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reader landing on the README first wants to know whether their
hardware will run it at all — not where it sits on a 4090 perf curve.
Swap the two so the top-of-README info is "will this work for me?"
and benchmarks live at the bottom as a forward-looking reference.

Hardware compatibility now lists, up front:

 * GPU compute cap floor (sm_61; Pascal / GTX 10-series and up).
 * VRAM floor (8 GB, auto-streaming) and steady-state preference
   (16 GB+, pool path) with a cross-reference to the existing VRAM
   section.
 * PCIe width impact (Gen4 x4 → +240 ms/plot), with the live-check
   incantation that used to live in the Performance preamble.
 * Host RAM (~16 GB; batch pins ~4 GB).
 * Toolkit / runtime notes (CUDA 12+ to build, 12.8+ needed at
   runtime for Blackwell sm_120).
 * OS (Linux tested; Windows/macOS not).

Performance section kept intact and moved just above License. Also
refreshed the pool-path batch-wall row to 2.15 s/plot — the value
from the most recent 5-plot benchmark after the compute_bucket_offsets
+ l_count_max cleanup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the pool carried 2 rotating pinned D2H slots and the batch
producer/consumer channel held depth 1. That matched the measured
case of producer wall > consumer wall (GPU ~2 s/plot, consumer
FSE+fwrite ~1 s/plot on NVMe) — consumer always caught up before
producer overwrote its slot.

For deployments where the consumer is the long pole, depth 1 leaves
the GPU idle while the consumer catches up. Concretely: a batch on
SATA SSD (~500 MB/s) pushes FSE+write to ~4.4 s/plot, flipping the
ratio.

Parameterise on GpuBufferPool::kNumPinnedBuffers (static constexpr
= 3). Pool ctor/dtor loop-allocate/free. Pool-overload of
run_gpu_pipeline's pinned_index check widened to the new upper
bound. BatchPlotter's streaming-fallback pinned array likewise grows
to 3 via the existing streaming_alloc_pinned_uint64 shim.

Channel becomes a bounded queue instead of std::optional (sketched after the list):
 * capacity = kNumPinnedBuffers - 1 (= 2 currently).
 * push waits on cv_not_full; pop on cv_not_empty.
 * Invariant: the producer's slot-(i%N) reuse is safe because the
   channel holds at most (N-1) items, so the consumer must have
   popped plot (i - N) before the producer enqueues plot i.
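
A minimal sketch of that channel (the cv names follow the text above;
the element type and everything else are illustrative):

    #include <condition_variable>
    #include <cstddef>
    #include <deque>
    #include <mutex>

    template <typename T>
    class BoundedChannel {
        std::deque<T> q_;
        std::mutex m_;
        std::condition_variable cv_not_full, cv_not_empty;
        const size_t cap_;   // = kNumPinnedBuffers - 1
    public:
        explicit BoundedChannel(size_t cap) : cap_(cap) {}
        void push(T v) {
            std::unique_lock<std::mutex> lk(m_);
            cv_not_full.wait(lk, [&] { return q_.size() < cap_; });
            q_.push_back(std::move(v));
            cv_not_empty.notify_one();
        }
        T pop() {
            std::unique_lock<std::mutex> lk(m_);
            cv_not_empty.wait(lk, [&] { return !q_.empty(); });
            T v = std::move(q_.front());
            q_.pop_front();
            cv_not_full.notify_one();
            return v;
        }
    };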

Host pinned cost at k=28: 4 GB → 6 GB. Device VRAM unchanged. On the
4090+NVMe reference the measured batch wall stays at 2.15 s/plot
(producer-bound, depth doesn't help), confirming the change is
latent capacity rather than a perf regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the CUDA-only kernels with portable SYCL implementations
compiled by AdaptiveCpp. Each kernel TU now lives as a .cpp consumed by
acpp; CUDA TUs (.cu) only ship when XCHPLOT2_BUILD_CUDA=ON.

Shared infrastructure:
  - PortableAttrs.hpp — POS2_DEVICE_INLINE / POS2_HOST_DEVICE macros
    that compile correctly under nvcc and acpp.
  - AesTables.inl — AES T-tables shared between the CUDA and SYCL paths.
  - SyclBackend.hpp — per-process sycl::queue (gpu_selector) plus a
    device-side AES table buffer initialised on first use.

Per-kernel SYCL ports (.cpp consumed by acpp):
  - T1OffsetsSycl, T2OffsetsSycl, T3OffsetsSycl
  - PipelineKernelsSycl, XsKernelsSycl
  - Renamed pipeline TUs (T1/T2/T3Kernel.cu, XsKernel.cu, GpuPipeline.cu,
    GpuBufferPool.cu) to .cpp; outer wrappers now take sycl::queue&.

Sort wrapper:
  - Sort.cuh declares launch_sort_pairs_u32_u32 / launch_sort_keys_u64
    over sycl::queue&.
  - SortCuda.cu (XCHPLOT2_BUILD_CUDA=ON) wires CUB radix sort, bridging
    the queue↔CUDA-stream boundary by draining q with q.wait(), running
    CUB on the default stream, then cudaStreamSynchronize.
  - SortSycl.cpp ships as a stub that throws on call; the hand-rolled
    SYCL radix sort lands in the next commit.
  - AesStub.cpp provides a no-op initialize_aes_tables for non-CUDA
    builds.

CMake:
  - XCHPLOT2_BUILD_CUDA option (default ON) selects between SortCuda.cu /
    SortSycl.cpp and AesGpu.cu+AesGpuBitsliced.cu / AesStub.cpp.
  - enable_language(CUDA) and find_package(CUDAToolkit) are gated on the
    option; CUDA include paths are probed and exposed to acpp TUs that
    transitively pull cuda_fp16.h via AdaptiveCpp's half.hpp.
  - add_sycl_to_target wraps the SYCL TU set; pos2_gpu links the union.

Updated parity tests (.cu) take sycl::queue&. New SYCL-side parity
tools (sycl_bucket_offsets_parity, sycl_g_x_parity) validate the ported
kernels against the CUDA reference.

Build matrices verified end-to-end:
  XCHPLOT2_BUILD_CUDA=ON  → NVIDIA fast path with CUB
  XCHPLOT2_BUILD_CUDA=OFF → SYCL-everywhere via AdaptiveCpp (sort still
                            stubbed)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the SortSycl.cpp stub with a hand-rolled stable LSD radix sort
that runs on every AdaptiveCpp backend (CUDA, HIP, Level Zero, OpenCL).

Pipeline (per 4-bit pass; RADIX=16; TILE_SIZE=1024):
  Phase 1 — per-tile parallel count. Each WG (256 threads × 4 items)
    reduces its tile into a 16-bucket WG-local histogram via local
    atomics, then writes those 16 counts (no atomics) into bucket-major
    tile_hist[d * num_tiles + t].
  Phase 2 — single multi-WG exclusive scan over the entire bucket-major
    tile_hist via AdaptiveCpp's scanning::scan (decoupled-lookback).
    Because the layout is bucket-major, one 1-D scan yields tile_offsets
    directly — each entry is the global start of tile t's bucket-d range
    in the output. Stable by construction: tile t < t' always lands
    earlier within bucket d.
  Phase 3 — cooperative per-tile scatter. Items load contiguously per
    thread into local memory; for each digit d the WG runs one
    exclusive_scan_over_group on per-thread match counts to assign ranks
    in input order (stable), then every thread scatters its matching
    items to local_bases[d] + rank. All 256 threads stay active, no
    sequential bottleneck.
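
A sketch of the Phase 3 rank trick for one digit, simplified to one
item per work-item (the real kernel handles 4 items/thread and all 16
digits per pass; names are illustrative):

    #include <sycl/sycl.hpp>
    #include <cstdint>

    void scatter_one_digit(sycl::nd_item<1> it, uint32_t d, uint32_t shift,
                           uint32_t key, bool valid, uint32_t base,
                           uint32_t* out) {
        // 1 if this work-item's key has digit d in the current 4-bit window.
        uint32_t mine = (valid && ((key >> shift) & 0xF) == d) ? 1u : 0u;
        // Exclusive scan over the work-group assigns ranks in input order,
        // which is what makes the scatter stable.
        uint32_t rank = sycl::exclusive_scan_over_group(
            it.get_group(), mine, sycl::plus<uint32_t>());
        if (mine)
            out[base + rank] = key;   // base = this tile's start for bucket d
    }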

Sort.cuh no longer pulls cuda_fp16 / cuda_runtime — those moved into
SortCuda.cu (the only nvcc TU that needs them), keeping the public
header backend-portable.

Adds tools/parity/sycl_sort_parity that exercises both wrappers
against a std::sort reference at counts {16, 16K, 256K, 1M} × seeds
{1, 7, 31}; built unconditionally so it validates whichever Sort
backend is wired in (CUB on the NVIDIA build, hand-rolled radix on
non-CUDA). All 24 cases pass on both backends.

Throughput on RTX 4090 (warm, N=1M):
  pairs:  CUB 1.27 ms,  SYCL radix 0.92 ms
  keys:   CUB 1.70 ms,  SYCL radix 1.28 ms
The SYCL radix beats CUB-via-bridge at this scale because there's no
per-call SYCL→CUDA→SYCL fence; CUB's tuning is expected to take the
lead at N >> 1M.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer pool aliases d_pair_b as the Xs construction scratch (the
"alias d_pair_b for that, so no separate allocation" trick), so
pair_bytes must be sized to fit either the largest pairing struct or
the full Xs scratch. The previous calculation only accounted for the
pairing structs (max 16 B/elem × cap = ~18 × total_xs at k=22), but
the Xs scratch is 4 × total_xs uint32s plus the sort temp — and the
sort temp alone is ~8 × total_xs (CUB's input/output API mode, and
similarly ~8 × total_xs for the SYCL radix's ping-pong buffers).
That puts the actual Xs need at ~24 × total_xs, exceeding pair_bytes
on every k I tried (20, 22, 24, 26, 28).

The constructor's runtime assertion was firing immediately on every
plot attempt at HEAD, on both the CUB and SYCL backends — the alias
was unsafe and we threw before allocating anything. End-to-end
plotting was therefore broken at HEAD prior to this fix.

Compute xs_temp_bytes first, then fold it into the pair_bytes max.
The runtime assertion is dropped because the size now provably fits
by construction.
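
Illustrative arithmetic for the fix (the 16 B/elem, 4 × u32 and
~8 B/elem figures are the ones quoted above; the helper name is
hypothetical):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>

    // pair_bytes must cover whichever is larger: the biggest pairing
    // struct or the full Xs scratch (4 u32 columns + ~8 B/elem of sort
    // temp, i.e. ~24 B per x).
    size_t pair_bytes_for(size_t total_xs, size_t pair_cap) {
        const size_t pairing_bytes = pair_cap * 16;                  // 16 B/elem max
        const size_t xs_temp_bytes = total_xs * 4 * sizeof(uint32_t) // Xs columns
                                   + total_xs * 8;                   // sort temp
        return std::max(pairing_bytes, xs_temp_bytes);               // fits by construction
    }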

VRAM impact: at k=28, pair_bytes grows from ~4.83 GB (18 × total_xs)
to ~6.4 GB (24 × total_xs), so two pair buffers cost an extra ~3.2
GB. Still comfortable on a 24 GB card.

Verified end-to-end on RTX 4090, k=28 (warm timings, mean of 3):
  CUB:  7.25 s/plot   (XCHPLOT2_BUILD_CUDA=ON)
  SYCL: 10.24 s/plot  (XCHPLOT2_BUILD_CUDA=OFF)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe the build host once at configure time and pick a sensible
AdaptiveCpp target list:

  - NVIDIA detected (nvidia-smi works) → ACPP_TARGETS=generic.
    Counter-intuitively, AdaptiveCpp's LLVM SSCP "generic" path is a
    few percent faster than cuda:sm_<arch> on our kernels at k=28
    (warm wall: 7.25 s vs 7.78 s on RTX 4090 with the CUB build);
    SSCP's runtime specialization beats CUDA-AOT for this workload.
  - AMD detected (rocminfo Name: gfxXXXX) → ACPP_TARGETS=hip:gfxXXXX.
    SSCP's HIP path is less mature, so AOT-compiling for the actual
    gfx target is the safer pick on AMD.
  - Otherwise → ACPP_TARGETS=generic (works everywhere; JITs on
    first use).

User-overridable via -DACPP_TARGETS=... (CMake) or $ACPP_TARGETS
(cargo install). The CMake-side detection runs in execute_process
with ERROR_QUIET so missing tools just fall through cleanly. The
build.rs side reuses the existing detect_cuda_arch() result and
adds detect_amd_gfx() for the rocminfo path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SYCL/AdaptiveCpp port is ~1.5× slower on NVIDIA at k=28 than the
original CUDA-only implementation. Users who only ever target NVIDIA
should know they have the option of the legacy CUDA-only branch
without giving up performance for portability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add steady-state batch numbers for the three current paths on RTX 4090
at k=28: cuda-only (2.15 s/plot), main+CUB (2.41 s/plot), main+SYCL
(3.79 s/plot). Note that main+CUB is +12% over cuda-only and main+SYCL
is +57% over CUB — the gap is host-side AdaptiveCpp scheduling
overhead, not kernel perf (per-kernel nsys is within ~7% across the
two paths).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Build section was a one-liner about CUDA + C++20 + CMake + Rust;
it didn't mention AdaptiveCpp at all even though slice 9 made
AdaptiveCpp a hard build dependency. Restructure into:

  - Required toolchain (AdaptiveCpp, CUDA Toolkit headers + optional
    nvcc, C++20 compiler, CMake, Rust). Note that CUDA Toolkit headers
    are required on every build path because AdaptiveCpp's half.hpp
    pulls cuda_fp16.h.
  - Auto-fetched at configure time (pos2-chip via FetchContent, FSE
    vendored under pos2-chip).
  - Optional GPU runtimes for non-NVIDIA targets (ROCm probed by the
    ACPP_TARGETS autodetect; oneAPI requires manual override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iveCpp

Three layered install paths so users can pick the friction they want:

  1. Containerfile (podman-first, also docker). Build args select the
     base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD,
     intel/oneapi for Intel (experimental). All variants build
     AdaptiveCpp 25.10 from source inside the image and ship a slim
     runtime stage. ~15-30 min first build, layer-cached after.

  2. scripts/install-deps.sh — distro-aware native bootstrap covering
     Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via
     nvidia-smi/rocminfo and installs the right toolchain (full CUDA
     for NVIDIA, CUDA *headers* + ROCm for AMD), then builds
     AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets
     CMake fetch it.

  3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET)
     followed by FetchContent_Declare at v25.10.0 with
     FetchContent_MakeAvailable when the local lookup fails. Opt-in
     option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The
     add_sycl_to_target macro is verified after the fetch — if
     AdaptiveCpp doesn't expose it as a subproject we error with a
     pointer to the manual install.

build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container
builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.

README's Build section restructured into three clearly-labeled paths
with the full dependency table moved into path #3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validated the Containerfile by running a full podman build and a k=22
plot inside the container with GPU passthrough via CDI. Output is MD5-
identical to the host build (42dedec6...). Five fixes uncovered along
the way:

  1. Add lld-18 to the apt install list — AdaptiveCpp's CMake hard-
     errors when ld.lld is missing from PATH. Also pass
     -DACPP_LLD_PATH=/usr/lib/llvm-18/bin/ld.lld explicitly.

  2. Move ACPP_TARGETS autodetect *before* find_package(AdaptiveCpp)
     in CMakeLists. AdaptiveCpp's package config reads the value at
     find time, and an empty -DACPP_TARGETS= (default Containerfile
     build-arg) makes acpp error out with "Unknown backend: ".

  3. build.rs treats `Ok("")` from env::var("ACPP_TARGETS") the same
     as Err — Containerfile build-args propagate as empty env vars
     when the user doesn't override.

  4. Link against AdaptiveCpp's runtime libs (acpp-rt + acpp-common)
     in build.rs. The static archives produced by CMake reference
     hipsycl::rt::* symbols that live there. ACPP_PREFIX env var
     (default /opt/adaptivecpp) controls the search path; an rpath
     entry is also added so the binary finds them at runtime.

  5. Use the CUDA *devel* image as BASE_RUNTIME (not the slim
     runtime) and install the full llvm-18 package in the runtime
     stage — AdaptiveCpp's SSCP path shells out to `opt-18` and
     `ptxas` at runtime, both of which are missing from the slim
     CUDA runtime + libllvm18 combination ("LLVMToPtx: opt
     invocation failed with exit code -1").

Plus a .dockerignore that drops build-*/, target/, third_party/, and
.git/ from the build context (was 946 MB, now ~50 MB).

Containerfile header comments still document the AMD ROCm and Intel
oneAPI build-arg combinations, but those remain untested.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer-pool sizing fix (commit d70eefb) raised pool_bytes to
include the aliased Xs scratch, which pushed pool_total at k=28 from
~12 GB to ~15.2 GB device + the 0.5 GB margin. The previous "16 GB+
cards use the pool" framing is now stale — RTX 4080 (16 GB) sits below
the threshold after driver overhead and transparently falls back to
streaming. Update the hardware-compat blurb and the VRAM section to
reflect the new threshold and example cards (4090 / 5090 / A6000 /
H100). Auto-fallback still hides the change from users.

Steady-state per-plot reference also corrected from ~2.1s to ~2.4s
(matches the post-port batch numbers in the Performance table).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make explicit at the top of the README that plots are accurate
(per-phase parity vs pos2-chip + bit-identical between backends +
deterministic), but the project is still under active development on
performance, cross-vendor support, and tooling. Point first-time users
who just want a stable path at the cuda-only branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: cargo install on a different machine fails to link with:

    rust-lld: error: unable to find library -lacpp-rt
    rust-lld: error: unable to find library -lacpp-common

build.rs's hardcoded prefix list was incomplete (missed Ubuntu's
/usr/lib/x86_64-linux-gnu, Arch's /usr/lib, and the FetchContent
build tree under OUT_DIR/cmake-build/_deps/adaptivecpp-build/).

CMakeLists now writes the actual AdaptiveCpp lib directory to
$cmake_build/acpp-prefix.txt at configure time:

  - For installed AdaptiveCpp, derive from AdaptiveCpp_DIR
    (<prefix>/lib/cmake/AdaptiveCpp → <prefix>/lib).
  - For FetchContent builds, evaluate $<TARGET_FILE_DIR:acpp-rt>
    at file(GENERATE) time so the path resolves to the in-tree
    build artifact location.

build.rs reads acpp-prefix.txt first, falls back to ACPP_PREFIX /
AdaptiveCpp_ROOT env vars, then probes a wider list of standard
locations (/opt/adaptivecpp/lib, /usr/local/lib,
/usr/lib/x86_64-linux-gnu, /usr/lib).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported a low-VRAM machine (8 GB free at k=28) ballooning to
~130 GB host RAM during a failed batch run. The streaming pipeline
errored with "sycl::malloc_device(d_xs_temp): null" but kept
accumulating allocations across the failure path.

Two leak-resistance fixes:

  1. GpuBufferPool ctor wraps its allocation sequence in try/catch
     and frees any partial allocations before rethrowing. Without
     this, a mid-sequence OOM (e.g. d_pair_b after d_pair_a/d_storage
     succeeded) leaks ~10 GB device + ~7 GB pinned host per failed
     ctor — pathological under any retry loop.

  2. GpuPipeline streaming's StreamingStats now has a destructor
     that frees every allocation still tracked in its sizes map.
     If the streaming function throws partway (Xs phase OOM after
     d_xs already succeeded, T1 match OOM after T1 buffers
     allocated, etc.), the dtor runs on unwind and releases what's
     live. Removes the GPU leak that previously cascaded into the
     batch loop's pinned-host accounting.
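
A sketch of fix 2's unwind behaviour, using the SYCL-era tracker shape
(the member layout is assumed; sycl::free is the real USM API):

    #include <sycl/sycl.hpp>
    #include <unordered_map>

    struct StreamingStats {
        sycl::queue* q = nullptr;
        std::unordered_map<void*, size_t> sizes;  // live streaming allocations
        ~StreamingStats() {
            // Runs on normal exit *and* on exception unwind: whatever a
            // mid-phase OOM left allocated is released instead of leaking.
            for (auto& kv : sizes) sycl::free(kv.first, *q);
        }
    };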

Plus a clearer s_malloc error message when sycl::malloc_device
returns null — includes phase, requested size, live total, and a
hint to try a smaller k or larger card. Replaces the cryptic
"sycl::malloc_device(d_xs_temp): null" with actionable info.

These don't yet make 8 GB cards fit at k=28 on the SYCL build —
that needs Xs tiling and/or SortSycl scratch reduction (next
slice). They just stop leaking when the size mismatch hits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUB-style DoubleBuffer pattern: launch_sort_pairs_u32_u32 and
launch_sort_keys_u64 now treat keys_in/keys_out (and vals_in/vals_out)
as a ping-pong pair across radix passes instead of allocating their own
keys_alt/vals_alt scratch (which was 8 × N bytes — 2 GB at k=28!).
The result always lands in keys_out; if the pass count is even (so the
ping-pong finishes in keys_in), the wrapper does one final memcpy from
keys_in.

API change: keys_in/vals_in are now non-const (caller treats them as
scratch on input). The CUB backend ignores the non-constness; the SYCL
backend uses both buffers as the ping-pong directly. Updated all call
sites (GpuBufferPool, GpuPipeline T1/T2/T3 sort sizing queries).

Memory wins at k=28 on the SYCL build:
  pair_bytes:    6.0 GB → 4.36 GB
  xs_temp:       6.18 GB → 4.33 GB
  sort_scratch:  2.4 GB  → 0.03 GB
  pool total:    19 GB   → 13 GB
  streaming Xs:  8.2 GB  → 6.3 GB   ← fits 8 GB cards now!

Verified:
  - All 24 sycl_sort_parity tests pass on the new sort.
  - k=22 plot output is byte-identical between CUB and SYCL builds
    (same MD5 42dedec6...).

The one extra memcpy on even pass counts (versus old code's initial
memcpy on entry) is a wash; total bytes copied per sort is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rofile

CUB's input/output SortPairs API allocates ~2 GB of internal temp keys/
vals at N=2^28 — that's what kept the streaming Xs scratch at ~6 GB on
the CUB build, OOM-ing 8 GB cards just like the (now-fixed) SYCL build
did. Switch to cub::DoubleBuffer mode: caller's keys_in/keys_out and
vals_in/vals_out act as the radix ping-pong, CUB's own scratch shrinks
to ~MB of histograms.

Side effect of DoubleBuffer mode: CUB picks which buffer the result
lands in (db.Current()), which may be either keys_in or keys_out
depending on the radix pass count. Mirror SortSycl's behaviour with a
final cudaMemcpyAsync from db.Current() to keys_out when needed,
preserving the public API contract (result always in keys_out).
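
A sketch of the DoubleBuffer-mode call for the u64-keys wrapper
(cub::DoubleBuffer and DeviceRadixSort::SortKeys are real CUB APIs;
the wrapper signature is an assumption):

    #include <cub/cub.cuh>

    void sort_keys_u64_db(uint64_t* keys_in, uint64_t* keys_out, size_t n,
                          void* d_temp, size_t temp_bytes, cudaStream_t stream) {
        cub::DoubleBuffer<uint64_t> db(keys_in, keys_out);  // caller buffers ping-pong
        cub::DeviceRadixSort::SortKeys(d_temp, temp_bytes, db,
                                       static_cast<int>(n), 0, 64, stream);
        // CUB decides where the result lands (db.Current()); preserve the
        // public contract (result always in keys_out) with one memcpy if needed.
        if (db.Current() != keys_out)
            cudaMemcpyAsync(keys_out, db.Current(), n * sizeof(uint64_t),
                            cudaMemcpyDeviceToDevice, stream);
    }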

Memory wins at k=28 on the CUB build:
  pair_bytes:    6.0 GB → 4.36 GB
  xs_temp:       6.0 GB → 4.33 GB
  pool total:    19 GB  → 13 GB
  streaming Xs:  8.0 GB → 6.3 GB   ← fits 8 GB cards now too

Verified: k=28 plot is byte-identical between CUB and SYCL builds
(MD5 814b4f2e...).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container UX before required users to manually pass --build-arg for
BASE_DEVEL, BASE_RUNTIME, ACPP_TARGETS, XCHPLOT2_BUILD_CUDA,
INSTALL_CUDA_HEADERS — one chain per GPU vendor. compose.yaml wires
those up as three named services (cuda / rocm / intel) sharing the
same Containerfile, so users just pick:

    podman compose build cuda                       # NVIDIA, default
    ACPP_GFX=gfx1031 podman compose build rocm     # AMD, gfx target via env
    podman compose build intel                      # Intel, untested

Each service also handles GPU device passthrough (nvidia.com/gpu=all
on CUDA, /dev/kfd + /dev/dri + group_add: video on ROCm) and bind-
mounts ./plots → /out so output lands on the host.

Containerfile additions: build the parity tests (sycl_sort_parity,
sycl_bucket_offsets_parity, sycl_g_x_parity, plot_file_parity) via a
plain CMake step after the cargo install, and copy them to
/usr/local/bin in the runtime stage. Lets users run a quick first-
port validation on a new GPU before attempting a full plot:

    podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm

Image size grew from 2.54 GB → 7.78 GB because the runtime stage now
uses the CUDA *devel* image (needed by SSCP for runtime PTX assembly,
already required for SortCuda's nvcc TUs in the CUDA build) and
ships LLVM 18 binaries. Worth it for self-containment.

README's "Container" section rewritten to lead with compose.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container builds run without GPU access, so compose.yaml has to
hardcode defaults (sm_89 for cuda, gfx1100 for rocm). The new wrapper
runs on the host (where nvidia-smi/rocminfo work), detects vendor +
arch, and exports CUDA_ARCH or ACPP_GFX before invoking compose.

  ./scripts/build-container.sh             # auto-detect
  ./scripts/build-container.sh --gpu amd   # force AMD path
  ./scripts/build-container.sh --engine docker

Drops the AMD UX from "set ACPP_GFX=gfx1031 then podman compose build
rocm" to a single command. README updated to lead with the script and
keep the manual compose invocation as an override path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Significant new functionality since 0.1.0 / the cuda-only era:

  - SYCL/AdaptiveCpp port (slices 1-18); cross-vendor architecture
    (AMD via HIP, Intel via Level Zero) with CUB preserved as opt-in
    fast path on NVIDIA.
  - Hand-rolled stable parallel SYCL radix sort.
  - GpuBufferPool sizing fix + free-on-throw RAII.
  - Both sort backends switched to DoubleBuffer ping-pong, dropping
    Xs scratch from ~6 GB to ~4.3 GB at k=28 — 8 GB cards now plot
    successfully via the streaming pipeline.
  - Containerfile + compose.yaml + scripts/install-deps.sh +
    scripts/build-container.sh: three layered install paths.
  - Auto-detect ACPP_TARGETS, CUDA arch, and (in the container
    wrapper) GPU vendor.
  - README, performance numbers, branch / WIP docs.

CLI surface unchanged; user-visible API stable. No breaking changes
for anyone who only consumed `xchplot2 plot/test/batch`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the AMD container build failing with:
    fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc':
    Unknown attribute kind (102) (Producer: 'LLVM22.0.0git'
    Reader: 'LLVM 18.1.3')

ROCm ships its own LLVM (currently dev-tip / LLVM 22). The HIP device
bitcode (ocml.bc, ockl.bc, …) is produced with that LLVM. AdaptiveCpp
was being built against Ubuntu's llvm-18, so when its HIP backend
linked our SYCL kernels against ROCm's bitcode, LLVM 18's reader
choked on LLVM 22's attribute encoding.

Fix: parametrize the LLVM toolchain via two new build args:

  - LLVM_ROOT      = base prefix containing bin/clang etc.
  - LLVM_CMAKE_DIR = directory of LLVMConfig.cmake
                     (Ubuntu and ROCm lay these out differently —
                     Ubuntu: $LLVM_ROOT/cmake,
                     ROCm:   $LLVM_ROOT/lib/cmake/llvm)

Defaults preserve Ubuntu's llvm-18 layout (NVIDIA/Intel paths
unchanged); compose.yaml's rocm service overrides both to point at
/opt/rocm/llvm so AdaptiveCpp + HIP backend match the bitcode
producer.

Also corrected a typo in the prior version: $LLVM_ROOT/bin/ contains
unsuffixed binaries (clang, clang++, ld.lld) — the -18 suffix only
exists on the Ubuntu /usr/bin/ symlinks, not in the versioned llvm-18
dir itself.

Verified: NVIDIA container still builds clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported build-container.sh on a fresh AMD machine printing
"No GPU detected" because rocminfo wasn't installed — and
install-deps.sh's AMD package list (rocm-hip-sdk + rocm-libs) doesn't
pull rocminfo transitively.

  - install-deps.sh: add rocminfo to all three distro AMD package
    lists (Arch, Ubuntu/Debian, Fedora). It's the discovery tool
    build-container.sh probes; tiny package, harmless to always
    install on the AMD path.
  - build-container.sh: when neither nvidia-smi nor rocminfo is
    found, print a multi-line hint pointing the user at either
    installing the right discovery tool, running install-deps.sh,
    or forcing the vendor explicitly with --gpu.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the script printing "No GPU detected" even though
rocminfo was installed and `command -v rocminfo && rocminfo | grep -q gfx`
returned MATCH when run inline.

The bug: the script enables `set -o pipefail`, which makes a pipeline
return the rightmost non-zero exit code. rocminfo (and some nvidia-smi
configurations) exit non-zero even when their output contains usable
GPU info. So `rocminfo 2>/dev/null | grep -q gfx` returned 0 from grep
but the pipeline returned 1 from rocminfo, causing the elif branch to
evaluate to false.

Restructure: capture each tool's stdout into a variable first (with
`|| true` to swallow the non-zero exit), then test the captured string
with [[ pattern ]]. No pipeline, no pipefail interaction.

Verified: script now correctly detects NVIDIA on this host
(vendor=nvidia service=cuda CUDA_ARCH=89). Should now work for AMD
hosts where rocminfo is installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ipt)

Second pipefail trap, same shape as the first one. The old gfx-detection
line:

    gfx=$(rocminfo 2>/dev/null | awk '/.../ {print; exit}')

awk's `exit` after the first match closes its stdin, which delivers
SIGPIPE to rocminfo (still writing). With pipefail the pipeline returns
141 (128 + 13); set -e then exits the script silently.

That's why the user reported "no output" — the script was dying on
SIGPIPE right after writing the rocm_out variable, before reaching any
echo. The bash -x trace confirmed: execution reached `gfx=gfx1031`,
exit 141, no further output.

Fix: reuse the rocm_out string captured during vendor detection (or
capture it now if --gpu amd was forced) and parse with bash's built-in
[[ =~ ]] regex — no pipes, no SIGPIPE risk.

Verified locally: NVIDIA detection still works (vendor=nvidia
service=cuda CUDA_ARCH=89).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The three files that were committed under docs/
(gpu-portability-sketch.md, perf-opportunities.md,
streaming-pipeline-design.md) are working notes from the SYCL port
slices, not shipped documentation. One of them even self-identifies as
"not shipped with the repo" in its first paragraph.

Add docs/ to .gitignore and remove the existing files from the index.
User-facing documentation belongs in README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Abraham Sewill and others added 25 commits April 29, 2026 00:04
typos' default dictionary tokenises mis-foo as `mis` + `foo` and
flags `mis` as a misspelling of `miss`/`mist`. Both occurrences in
validate_t1_count's broadened diagnostic from the previous commit
trip this. Reword to `miscompile`/`miscompiling` — same compiler
meaning, single token, dictionary-clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streaming-pipeline plain and sliced T1 paths now print, when the env
var is set, the first 16 d_xs (match_info, x) entries before the
matcher launches and the resulting t1_count after. This discriminates
"upstream Xs phase silently produced wrong data" from "matcher kernel
fails at scale" on the W5700 / gfx1010 generic-JIT case where plot -k 28
returns 0 T1 entries while small-N parity passes.

Gated on env var; default-off so production paths see no change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When XCHPLOT2_BUILD_CUDA=OFF, autodetect ROCm via hip/hip_runtime.h. If
present, define XCHPLOT2_SKIP_CUDA_RUNTIME and XCHPLOT2_SKIP_CUDA_FP16
so CudaHalfShim.hpp falls back to its opaque stubs instead of pulling
in CUDA's <cuda_runtime.h>. Without the skip, dual-toolchain hosts
(CUDA Toolkit + ROCm both installed, e.g. the W5700 reporter's box) hit
typedef redefinition errors on char1 / int2 / etc. between CUDA's
<vector_types.h> and ROCm's <amd_hip_vector_types.h>.

Single-toolchain hosts (CUDA-only or AMD-only without CUDA Toolkit) are
unaffected: the find_path is only triggered on CUDA-off builds, and the
defines only land when ROCm is present.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The W5700 / k=28 plot showed [0..16] of every Xs intermediate uniformly
0xBE (HIP poison fill), suggesting either (a) launch_xs_gen no-op'd
entirely on amdgcn at this scale, or (b) the kernel only failed to
write the first few pages while bulk-writing further offsets. Sampling
at head (idx=0), middle (idx=total/2), and tail (idx=total-16)
discriminates the two — uniform 0xBE across all three positions
confirms no-op; varied data at mid/tail confirms partial-write.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
W5700 / k=28: even after dropping Xs gen/pack tile to 1 k workgroups
(matching the parity-validated k=18 dispatch), post-gen H/M/T sees
0xCDCDCDCD (our pre-launch sentinel) — the kernel completes but
writes nothing. q.memset works (sentinel is visible), so queue
runtime primitives are fine; only kernel writes go missing. Smells
like AdaptiveCpp's HIP JIT producing empty stubs for our cooperative-
LDS + AesHashKeys kernels.

Two new env-gated checks before launch_xs_gen:
  - Trivial parallel_for (256 work-items, no LDS, no captured struct,
    no AES) writing 0xDEADBEEF to keys_a[0..16]. PASS / FAIL is a
    yes/no on whether the SYCL submission path can dispatch *any*
    kernel that actually writes on this device.
  - Read d_aes_tables[0..16] from host — should match the standard
    AES T0[0] = 0xC66363A5. If we see 0xBE or 0xCD instead, the
    T-table USM buffer was never populated and the kernels are
    reading garbage.

After this round we know whether the problem is below our level
(trivial kernel also fails) or above (trivial passes, our complex
kernels fail).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous SKIP_CUDA_FP16 path left __half / __half2 undefined entirely.
On most hosts that's harmless (AdaptiveCpp's libkernel never names
them on the HIP/SSCP path the build picks), but on the reporter's W5700 /
gfx1010 / gfx1013-spoof + ROCm + AdaptiveCpp combo, the missing types
caused the JIT to silently emit no-op kernel stubs — every kernel
dispatch completed cleanly with zero device-side writes (sentinel
fills survived intact through trivial parallel_for and the AES
kernels alike).

Three-tier resolution in CudaHalfShim.hpp:
  1. CUDA Toolkit available + not skipped → <cuda_fp16.h>
  2. ROCm available → <hip/hip_fp16.h> (provides __half via HIP)
  3. Neither → minimal struct stubs (generic SSCP / Intel / containers)

Tier 2 is the one that activates when XCHPLOT2_BUILD_CUDA=OFF + ROCm
present (the configuration the prior CMake change targets), so AMD
builds now have __half from HIP rather than relying on AdaptiveCpp's
internal fallback.
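
A preprocessor sketch of the three tiers (the guard macros other than
XCHPLOT2_SKIP_CUDA_FP16 are assumptions standing in for whatever
CudaHalfShim.hpp actually tests):

    #if defined(XCHPLOT2_HAVE_CUDA_FP16) && !defined(XCHPLOT2_SKIP_CUDA_FP16)
      #include <cuda_fp16.h>              // tier 1: CUDA Toolkit headers
    #elif defined(XCHPLOT2_HAVE_ROCM)
      #include <hip/hip_fp16.h>           // tier 2: __half via HIP (AMD builds)
    #else
      struct __half  { unsigned short bits; };  // tier 3: minimal stubs for
      struct __half2 { __half x, y; };          // generic SSCP / Intel / containers
    #endif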

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cargo's build.rs sets -Wl,-rpath for AdaptiveCpp's lib dir and
${rocm_root}/lib via rustc-link-arg, so the production xchplot2
binary loads HIP fine. CMakeLists.txt had no rpath setup, so
binaries built via plain `cmake -B build && cmake --build build
--target sycl_t1_parity` had an empty RUNPATH and threw
"hipsycl::sycl::exception: No matching device" at queue
construction because librt-backend-hip.so could not dlopen
libamdhip64.so.

Append _xchplot2_acpp_lib_dir and the ROCm install root's lib
subdir to CMAKE_BUILD_RPATH / CMAKE_INSTALL_RPATH globally,
right after both paths have been computed. The FetchContent
case (where _xchplot2_acpp_lib_dir is a generator expression)
is filtered out — CMake's BUILD_WITH_INSTALL_RPATH=OFF default
already covers in-tree targets there.

Verified locally:
  readelf -d sycl_t1_parity → RUNPATH includes /opt/adaptivecpp/lib
                              and /opt/rocm/lib
  unset LD_LIBRARY_PATH; ./sycl_t1_parity --k 22 → ALL OK

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds inline validate_kernel_dispatch(q) that runs once on first
sycl_backend::queue() call per worker thread:
  - sycl::malloc_device 16 u32, throw clearly if it returns null
  - q.memset to 0xCD sentinel
  - q.parallel_for(16) writing kPattern + idx
  - q.memcpy back, verify the writes landed
  - throw std::runtime_error with a structured diagnostic message if not
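
A sketch of that sequence (kPattern's value and the diagnostic wording
are illustrative; the SYCL calls are the standard USM API):

    #include <sycl/sycl.hpp>
    #include <cstdint>
    #include <stdexcept>

    inline void validate_kernel_dispatch(sycl::queue& q) {
        constexpr uint32_t kPattern = 0x5E1F7E57u;   // value assumed
        uint32_t* d = sycl::malloc_device<uint32_t>(16, q);
        if (!d) throw std::runtime_error("selftest: malloc_device returned null");
        q.memset(d, 0xCD, 16 * sizeof(uint32_t)).wait();   // sentinel fill
        q.parallel_for(sycl::range<1>(16),
                       [=](sycl::id<1> i) { d[i] = kPattern + (uint32_t)i[0]; })
         .wait();
        uint32_t h[16];
        q.memcpy(h, d, sizeof(h)).wait();
        sycl::free(d, q);
        for (uint32_t i = 0; i < 16; ++i)
            if (h[i] != kPattern + i)
                throw std::runtime_error(
                    "selftest: kernel writes did not land (no-op stub?); "
                    "try ACPP_DEBUG_LEVEL=2 or ACPP_TARGETS=generic");
    }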

The throw fires at the first GPU work request — well before any
plot-specific allocation, kernel compile, or pipeline state is set up,
turning a multi-round "T1 match produced 0 entries" investigation into
a single one-line failure that points at AdaptiveCpp's HIP/CUDA backend
producing a no-op kernel stub.

Common causes the diagnostic message points to:
  - ACPP_DEBUG_LEVEL=2 to see the JIT compile log
  - rocminfo / nvidia-smi vs the AOT target (build.rs cargo:warning)
  - ACPP_TARGETS=generic to fall back from the spoof to SSCP JIT

Bypass with POS2GPU_SKIP_SELFTEST=1 once the device is known good
(useful for short-lived processes that re-validate every invocation).

Verified locally on RTX 4090 (gfx-spoof N/A, PTX backend):
  - sycl_t1_parity --k 22 → ALL OK (self-test passes silently)
  - POS2GPU_SKIP_SELFTEST=1 sycl_t1_parity --k 22 → ALL OK (bypass works)

Reported by the W5700 reporter — Radeon Pro W5700 / RDNA1 / gfx1010 / gfx1013-spoof
+ AdaptiveCpp. Production kernel writes silently no-op'd, surfacing only
as 'T1 match produced 0 entries' deep in the streaming pipeline. With
this self-test, the same configuration would have thrown immediately
with a pointer to the diagnosis path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The detect_amd_gfx() spoof rewrites gfx1010/1011/1012 → gfx1013 as a
community workaround for AdaptiveCpp not advertising those ISAs as
direct HIP AOT targets. Empirically the spoof has worked on some W5700
setups but silently produces no-op kernels on others (kernel writes
return cleanly with the output buffer untouched, surfacing as "T1
match produced 0 entries" deep in the streaming pipeline).

Add an opt-out env var so users on broken-spoof setups can try
AOT-targeting the actual ISA instead, without writing a full
ACPP_TARGETS string. Improve the cargo:warning to document both opt-out
paths (XCHPLOT2_NO_GFX_SPOOF=1 for native, ACPP_TARGETS=generic for SSCP
JIT) so users hitting the spoof can self-help without re-deriving the
escape hatches from the source.

No promise that the native target compiles — if AdaptiveCpp doesn't
accept gfx1010 as a HIP target on the user's toolchain version, the
build fails loudly. That's still strictly better than silently
producing broken kernels at runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file, no pos2_gpu / pos2_gpu_host link — just sycl/sycl.hpp +
16-element parallel_for that writes a known pattern, copies back,
prints pass/fail per slot, exits 0 on all-OK.

Use it as the first diagnostic step when sycl_t1_parity or production
CLI silently produces no output. If hellosycl FAILs, the SYCL runtime
itself can't dispatch kernels on the detected device — no
xchplot2-level fix can recover, and the message points at the usual
suspects (rpath, JIT no-op stubs, ACPP_TARGETS picking an unsupported
ISA). If hellosycl PASSes, the runtime is healthy and the bug is
specific to our kernel patterns / pipeline.
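
A minimal sketch of what such a tool looks like (the real hellosycl.cpp
may differ; the pattern value is illustrative):

    #include <sycl/sycl.hpp>
    #include <cstdint>
    #include <cstdio>

    int main() {
        sycl::queue q{sycl::gpu_selector_v};
        uint32_t* d = sycl::malloc_device<uint32_t>(16, q);
        q.parallel_for(sycl::range<1>(16),
                       [=](sycl::id<1> i) { d[i] = 0xBEEF0000u + (uint32_t)i[0]; })
         .wait();
        uint32_t h[16] = {};
        q.memcpy(h, d, sizeof(h)).wait();
        sycl::free(d, q);
        int fails = 0;
        for (uint32_t i = 0; i < 16; ++i) {
            bool ok = (h[i] == 0xBEEF0000u + i);
            std::printf("slot %2u: %s\n", i, ok ? "PASS" : "FAIL");
            fails += !ok;
        }
        return fails ? 1 : 0;
    }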

Built via:
  cmake --build build --target hellosycl
  ./build/tools/sanity/hellosycl

Or standalone:
  ACPP_TARGETS=hip:gfx1013 acpp -O2 hellosycl.cpp -o hellosycl
  LD_LIBRARY_PATH=/opt/rocm/lib ./hellosycl

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier rpath fix put /opt/rocm/lib in the binary's RUNPATH but
that only governs the binary's own dependency resolution. AdaptiveCpp
dlopens librt-backend-hip.so at runtime, and *that* lib then dlopens
libamdhip64 — glibc does not consult the calling binary's RUNPATH for
those transitive backend deps. Result: ROCm silently fails to load,
AdaptiveCpp falls through to its OpenMP host device, and tools like
hellosycl / sycl_t1_parity report "ALL OK" while having executed
entirely on CPU.

Mirror build.rs:631 (cargo:rustc-link-lib=amdhip64) — make
libamdhip64 a direct dependency of every CMake-built executable when
ROCm is detected. The library is then loaded at process startup via
RUNPATH, so the subsequent dlopen from librt-backend-hip.so succeeds
trivially against the already-loaded handle. Verified locally:

  ldd build/tools/sanity/hellosycl
  → libamdhip64.so.7 => /opt/rocm/lib/libamdhip64.so.7
  → libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1

NVIDIA-only hosts (no /opt/rocm/lib/libamdhip64.so) skip the link
entirely via the EXISTS guard, so we don't regress builds without
ROCm installed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gfx1013 AOT spoof for gfx1010/1011/1012 was a community workaround
that "should" run on close-ISA RDNA1 silicon. Empirically it has been
observed to silently produce no-op kernels on at least one W5700 /
ROCm 6 / AdaptiveCpp 25.10 setup — the kernel completes without
writing anything, the failure surfaces only as "T1 match produced 0
entries" deep in the streaming pipeline.

Same host with ACPP_TARGETS=generic (SSCP JIT) reproducibly:
  - hellosycl: ALL OK on AMD Radeon Pro W5700
  - sycl_t1_parity --k 22: ALL OK (4194833 / 4194833)
  - sycl_t1_parity --k 24: ALL OK (16779604 / 16779604)

Default for RDNA1 (gfx1010/1011/1012) → ACPP_TARGETS=generic. Two
opt-in escape hatches preserved:
  - XCHPLOT2_FORCE_GFX_SPOOF=1 → restore the legacy gfx1013 AOT path
    for users who've validated their stack on it.
  - XCHPLOT2_NO_GFX_SPOOF=1    → AOT-target the actual ISA natively
    (build will fail if AdaptiveCpp doesn't advertise it).

Non-RDNA1 AMD targets (RDNA2+) are unchanged — rocminfo's gfx string
is passed through unmodified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rgets

Previous gating of libamdhip64 link on `acpp_targets.starts_with("hip:")`
broke the new RDNA1 default. After d939ee8 flipped RDNA1 to
ACPP_TARGETS=generic, AMD hosts no longer hit the hip:* branch — so
libamdhip64 stopped being linked into the binary. AdaptiveCpp's
runtime dlopen of librt-backend-hip.so then failed to find
libamdhip64.so.6 (RUNPATH isn't consulted for transitive backend deps
on glibc), HIP backend didn't initialise, and the binary threw
"No matching device" at first queue construction.

Drop the hip:* gate. Link libamdhip64 whenever ROCm is reachable
(/opt/rocm/lib/libamdhip64.so exists or ROCM_PATH points at one).
NVIDIA-only hosts skip the link via the EXISTS guard. Mirrors the
CMakeLists.txt fix from commit 60b7528 (`link_libraries(libamdhip64.so)`)
for the cargo build path.

Reported by the W5700 reporter — W5700 binary built after the RDNA1 default
flip threw "No matching device" before any plot work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RDNA1 default flip in d939ee8 made detect_amd_gfx() return None
for gfx1010/1011/1012 (so the caller picks ACPP_TARGETS=generic). But
the same function was being used in the XCHPLOT2_BUILD_CUDA selector
to decide "is there an AMD GPU?". With detect_amd_gfx() now
returning None for RDNA1:

  if usable_nvidia_arch().is_some() { ON }       // false on the W5700 reporter's box
  else if detect_amd_gfx().is_some() { OFF }     // false! (RDNA1 → None)
  else if detect_intel_gpu() { OFF }              // false
  else if detect_nvcc() { ON, "CI fallback" }    // → ON

→ XCHPLOT2_BUILD_CUDA flipped to ON on the reporter's W5700 + CUDA-Toolkit-headers
host. SortCuda.cu compiled, linked, and ran its CUB calls against AMD
silicon, throwing "CUB memcpy keys_out: invalid argument" mid-pipeline
(after launch_xs_gen had correctly populated keys_a/vals_a — visible
in the POS2GPU_T1_DEBUG=1 output).

Add amd_gpu_present() that just probes rocminfo for any gfx GPU,
independent of which ACPP_TARGETS string we'd pick for it. Use it in
the BUILD_CUDA selector so the AMD branch fires for RDNA1 too.

ACPP_TARGETS detection unchanged — still uses detect_amd_gfx() for
"which gfx target", and that function's None for RDNA1 still steers
the caller into the generic-SSCP fallback.

Reported by the W5700 reporter — W5700, ROCm 6, AdaptiveCpp 25.10, CUDA Toolkit
headers present (for CudaHalfShim) but no real CUDA capability.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rinsic claim

The previous floor of sm_61 was set on a misreading of AdaptiveCpp's
half.hpp: it does call __hadd / __hsub / __hmul / __hdiv / __hlt /
__hgt without __CUDA_ARCH__ guards, but cuda_fp16.hpp implements those
intrinsics with NV_IF_ELSE_TARGET(NV_PROVIDES_SM_53, native_PTX,
fp32_emulation_fallback). So pre-sm_53 cards get a software fp32
fallback baked into the headers themselves — code compiles and runs,
just slower. The floor was over-conservative.

Real constraints:

  - sm_50: minimum that CUDA 12.x can codegen for. CUDA 11.x was
    last to support Kepler (sm_30-37); not in scope for this floor.
  - CUDA 13.x dropped sm_50-72 entirely; the existing CMakeLists
    preflight catches that pairing with FATAL_ERROR + fix block.

Add a second arm in usable_nvidia_arch() that detects the toolkit
mismatch (sm < 75 + nvcc >= 13) and routes the user to CUDA 12.9 or
the container path that auto-pins it. The arm fires BEFORE we'd
attempt to build, sparing the user a cryptic mid-build error.

Net: any Maxwell+ NVIDIA card works as primary GPU as long as the
user pairs it with the right CUDA toolkit. Maintainable without
patching upstream AdaptiveCpp.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fault

Hardware compatibility table updated for two changes that landed
recently in build.rs:

  - NVIDIA floor lowered sm_61 → sm_50 (commit a6985cf): pre-sm_53
    cards now compile + run via cuda_fp16.h's fp32 emulation, no
    AdaptiveCpp patch needed. Note added that build.rs also routes
    around the CUDA 13 + sm < 75 toolkit mismatch.
  - RDNA1 default flipped from gfx1013 AOT spoof to generic SSCP JIT
    (commit d939ee8). The spoof was observed to silently produce
    no-op kernels on at least one W5700; generic SSCP is now the
    default, with XCHPLOT2_FORCE_GFX_SPOOF / XCHPLOT2_NO_GFX_SPOOF
    as opt-in escape hatches.

Plus a CUDA-Toolkit-vs-arch matrix making the sm_50-72 / 12.9
constraint, the sm_75-90 / either-toolkit happy path, and the
sm_120 / 12.8+ constraint explicit instead of folded into a single
"12+ required" line.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T0[a] is packed-LE (2S[a], S[a], S[a], 3S[a]). For S[0]=0x63 that's
bytes [C6 63 63 A5]; read as a little-endian u32 = 0xa56363c6 — which
is what the dump prints. The parenthetical inverted the byte order;
0xC66363A5 is the big-endian read of the same bytes (the form most
AES references show, hence the slip).

New text shows the algebraic construction plus the actual expected LE
value, so the operator can verify both "is the table populated" and
"is it the right table" at a glance under POS2GPU_T1_DEBUG=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…failure modes

Three distinct symptoms all trace back to the same root cause (AMD
host that also has CUDA Toolkit headers → build.rs picked
XCHPLOT2_BUILD_CUDA=ON before amd_gpu_present() landed in fe726fe):

  - "0 usable GPU device(s)" with --devices N
  - "CUB memcpy keys_out: invalid argument" mid-pipeline
  - "T1 match produced 0 entries" on RDNA1 (separate root cause —
    gfx1013 spoof producing no-op stubs — but same family of
    invisible-failure symptom that benefits from being on a search-
    indexable troubleshooting page)

Section is verbatim-symptom-first so users can grep their stderr
and land on the fix without having to read the prose around it. Also
mentions ACPP_VISIBILITY_MASK=hip;omp for the cosmetic CUDA-backend
loader warning that AdaptiveCpp emits when built with CUDA support
on a host without libcudart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…catch

Builds past 4394c66 surface the BUILD_CUDA-vs-non-CUDA-device mismatch
at queue construction with a clear "selftest landed on a non-CUDA
device" exception, not the bare CUB error 30 seconds in. Worth saying
explicitly so users grepping the README know which symptom to expect
on a recent build.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors detect_intel_gpu()'s sysfs PCI vendor-ID approach (0x1002 for
AMD) so amd_gpu_present() works even when rocminfo isn't on $PATH at
build time. Reproduced on a Radeon Pro W5700 host where the reporter
has rocminfo installed (it works at runtime via AdaptiveCpp's HIP
backend) but the cargo install shell didn't have /opt/rocm/bin on
PATH. Autodetect missed AMD, fell through to the "nvcc present → CI
fallback" arm, BUILD_CUDA flipped ON, the streaming pipeline tried to
dispatch a CUB sort against the W5700, and the new selftest at 4394c66
caught it loudly.

The sysfs path needs no user-space tools — only readable
/sys/class/drm/card*/device/vendor, which is true on every Linux host
with the amdgpu / radeon kernel module loaded. Robust against:

  - rocminfo not on PATH (this case)
  - rocminfo on PATH but failing because /dev/kfd isn't accessible to
    the build user (cargo install via systemd / chroot / different uid)
  - ROCm not installed yet but the kernel module is loaded (e.g. on a
    fresh distro install where the user is mid-setup)

Doesn't replace rocminfo — that's still the primary signal because it
tells us the gfx target string we'd compile for. Sysfs only answers
"is there an AMD GPU at all", which is exactly what amd_gpu_present()
needs.
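
The probe is a few lines; a C++ sketch of the same mechanics (the real
code is Rust in build.rs, and the function name here is hypothetical):

  #include <filesystem>
  #include <fstream>
  #include <string>
  #include <system_error>

  bool amd_gpu_present_sysfs() {
      namespace fs = std::filesystem;
      std::error_code ec;
      for (const auto& node : fs::directory_iterator("/sys/class/drm", ec)) {
          std::ifstream in(node.path() / "device" / "vendor");
          std::string vendor;
          if (in >> vendor && vendor == "0x1002")  // AMD's PCI vendor ID
              return true;
      }
      return false;  // no drm nodes readable, or none with an AMD vendor
  }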

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… AMD/Intel concurrently

Previously SortCuda.cu/SortSyclCub.cpp and SortSycl.cpp were mutually
exclusive at build time: BUILD_CUDA=ON gave CUB-only, BUILD_CUDA=OFF
gave SYCL-only. A hybrid host (NVIDIA + AMD on the same box) had to
pick one, hiding the other from --devices N (and hiding it from
--devices all entirely).

Reorganize the sort entry points into:

  - launch_sort_*_cub       (SortSyclCub.cpp, BUILD_CUDA=ON only)
  - launch_sort_*_sycl      (SortSycl.cpp, always built)
  - launch_sort_*           (SortDispatch.cpp, always built; picks
                              by q.get_device().get_backend() at
                              runtime — sycl::backend::cuda → _cub,
                              else → _sycl)

CMake now always compiles SortSycl.cpp + SortDispatch.cpp; SortSyclCub.cpp
is added on top when BUILD_CUDA=ON. The CUB branch in the dispatcher
is gated by XCHPLOT2_HAVE_CUB so AMD-only / Intel-only / CPU builds
compile it out — the dispatcher reduces to a single tail call into
SortSycl on those builds.
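
The dispatcher itself reduces to a few lines; a sketch using the names
above (signatures simplified to a single pairs variant; bodies live in
SortSyclCub.cpp / SortSycl.cpp and are only declared here):

  #include <sycl/sycl.hpp>
  #include <cstddef>
  #include <cstdint>

  void launch_sort_pairs_cub(sycl::queue&, uint32_t*, uint32_t*, std::size_t);
  void launch_sort_pairs_sycl(sycl::queue&, uint32_t*, uint32_t*, std::size_t);

  void launch_sort_pairs(sycl::queue& q, uint32_t* keys, uint32_t* vals,
                         std::size_t n) {
  #ifdef XCHPLOT2_HAVE_CUB
      if (q.get_device().get_backend() == sycl::backend::cuda)
          return launch_sort_pairs_cub(q, keys, vals, n);  // NVIDIA queue
  #endif
      launch_sort_pairs_sycl(q, keys, vals, n);  // the single tail call
  }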

End-to-end on the dev box (NVIDIA RTX 4090 + AdaptiveCpp 25.10 SSCP
generic JIT, BUILD_CUDA=ON): sycl_sort_parity all-PASS at every count
(16 / 16k / 262k / 1M) for both pairs and keys, perf within noise of
the pre-refactor CUB-only path. AdaptiveCpp's SSCP backend reports
sycl::backend::cuda for NVIDIA devices, so the dispatcher routes to
CUB as expected.

Sets up the next two cleanups: usable_gpu_devices() can stop filtering
non-CUDA backends (the binary handles them now) and the BUILD_CUDA-vs-
device-mismatch selftest catch becomes redundant. Done in follow-up
commits.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prints id, name, backend, VRAM, compute-unit count, and the sort path
the runtime dispatcher will pick for a worker on each device (CUB on
cuda-backend queues when this build links CUB, SortSycl otherwise).
The printed `[N]` index is the same value `--devices N` in `plot` /
`batch` accepts.

Example output on a single-NVIDIA dev box:

  Visible GPU devices (1):
    [0] NVIDIA GeForce RTX 4090   backend=cuda  vram=24076 MB  CUs=128  sort:CUB

  Use `--devices N` (id) in `plot` / `batch` to pick a specific
  device, or `--devices all` for one worker per device.

Implementation split across two TUs to keep the SYCL include out of
cli.cpp:

  - SyclDeviceList.hpp: plain-types declaration (struct GpuDeviceInfo,
    list_gpu_devices()). Includable from any TU.
  - SyclDeviceList.cpp: queries via SyclBackend.hpp; compiled by acpp
    via add_sycl_to_target.

Direct inclusion of SyclBackend.hpp into cli.cpp triggered a
-Werror=narrowing in AdaptiveCpp's libkernel/host/builtins.hpp under
g++; the split keeps cli.cpp SYCL-free.
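
A plausible shape for the header (the struct and free function are what
the commit states; field names are guesses reconstructed from the
printed columns):

  #pragma once
  #include <cstdint>
  #include <string>
  #include <vector>

  struct GpuDeviceInfo {
      int         id;            // the [N] that --devices N accepts
      std::string name;
      std::string backend;       // "cuda" / "hip" / "level_zero" / "?"
      uint64_t    vram_mb;
      uint32_t    compute_units;
      bool        sort_via_cub;  // dispatcher routes this device to CUB
  };

  // Defined in SyclDeviceList.cpp (compiled by acpp); no SYCL types leak.
  std::vector<GpuDeviceInfo> list_gpu_devices();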

The opencl backend case in the switch was dropped — AdaptiveCpp's
hipsycl::rt::backend_id enum doesn't expose it. cuda / hip /
level_zero cover real-world deployments; everything else falls into
the "?" default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Us-only

Token semantics:

  all       → every visible GPU + the CPU worker (was: GPUs only)
  gpu       → every visible GPU                  (new — was implicit in `all`)
  cpu       → CPU worker only                    (unchanged)
  0,2,3     → explicit GPU ids                   (unchanged)

Reads more naturally: "all" should mean everything, and "gpu" gives
the old all-GPUs-no-CPU behavior. Existing scripts using `--devices
all` gain a CPU worker (1-2 orders of magnitude slower than a GPU, so
it usually finishes last but doesn't block the GPU workers).
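
Token dispatch is trivial; an illustrative parse (the enum and function
names are hypothetical, not the actual CLI code):

  #include <string>

  enum class DeviceSel { All, Gpu, Cpu, Ids };

  DeviceSel parse_devices_token(const std::string& tok) {
      if (tok == "all") return DeviceSel::All;  // every GPU + the CPU worker
      if (tok == "gpu") return DeviceSel::Gpu;  // every GPU (the old "all")
      if (tok == "cpu") return DeviceSel::Cpu;  // CPU worker only
      return DeviceSel::Ids;                    // "0,2,3": explicit GPU ids
  }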

print_usage, devices subcommand hint, and README examples all
updated to reflect the new naming. Tested on dev box (NVIDIA + CPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f idling

Multi-device used to pre-partition entries round-robin: with 10 plots
and a GPU + CPU host, the GPU got plots [0,2,4,6,8] and the CPU got
[1,3,5,7,9]. The GPU finished its share in ~50s and then sat idle for
~25 minutes while the CPU plodded through its half. End-to-end batch
wall time was bounded by the CPU.

Convert run_batch_slice's inner loop to optionally pull plot indices
from a shared atomic counter instead of iterating its own vector.
Multi-device passes a single shared `next_idx` to every worker; whichever
worker finishes its current plot first grabs the next one. So the GPU
keeps pulling work for as long as plots remain, and the CPU only
handles whatever it can finish in the same wall time.

Per-worker pinned-buffer slot rotation is decoupled from the global
plot index — peer workers each own their own GpuBufferPool, so the
slot must come from a per-worker `local_count`, not the (now-shared)
plot index.

Single-device path unchanged (shared_idx defaults to nullptr → original
sequential iteration). Verbose messages drop the misleading "%zu/%zu"
denominator — with dynamic dispatch the worker doesn't know the
batch's total or its own share.
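
The pull loop, as a sketch (run_batch_slice's real signature is larger;
plot_one() is a stand-in for the per-plot body, and the shared_idx /
local_count semantics are as described above):

  #include <atomic>
  #include <cstddef>
  #include <vector>

  void plot_one(std::size_t plot_idx, std::size_t pinned_slot);  // stand-in

  void run_batch_slice(const std::vector<std::size_t>& my_plots,
                       std::atomic<std::size_t>* shared_idx,
                       std::size_t total_plots) {
      std::size_t local_count = 0;  // per-worker pinned-buffer slot rotation
      if (!shared_idx) {            // single-device: original sequential walk
          for (std::size_t idx : my_plots)
              plot_one(idx, local_count++);
          return;
      }
      for (;;) {                    // multi-device: first-free worker wins
          const std::size_t idx =
              shared_idx->fetch_add(1, std::memory_order_relaxed);
          if (idx >= total_plots)
              break;                // batch exhausted
          plot_one(idx, local_count++);  // slot from local_count, not idx
      }
  }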

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps [chia](https://github.com/Chia-Network/chia_rs) from 0.42.0 to 0.42.1.
- [Release notes](https://github.com/Chia-Network/chia_rs/releases)
- [Commits](Chia-Network/chia_rs@0.42.0...0.42.1)

---
updated-dependencies:
- dependency-name: chia
  dependency-version: 0.42.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
@dependabot added the dependencies and rust labels on May 4, 2026