build(deps): bump chia from 0.42.0 to 0.42.1 in /keygen-rs#6
dependabot[bot] wants to merge 208 commits into main from
Conversation
Introduces run_gpu_pipeline_streaming() — a per-phase alloc/free variant
of the plot pipeline that lets xchplot2 run on GPUs too small to hold
the ~15 GB GpuBufferPool (8 GB cards like GTX 1070). Verified bit-exact
against the pool path at k=18 and k=28.
Phase 2-3: orchestration + tile+merge.
* New GpuPipeline.cu body: allocate d_xs, run launch_construct_xs,
free scratch, alloc d_t1_meta/d_t1_mi, run T1 match, free d_xs,
etc. Each phase's buffers are sized exactly for that phase and
released before the next alloc.
* T1 and T2 sort phases tile the input and merge the sorted runs via
a stable 2-way merge-path kernel (merge_pairs_stable_2way). Ties
go to the left half, matching the global stable ordering (tile 0
indices are strictly less than tile 1's).
* XCHPLOT2_STREAMING=1 forces the streaming path through the
one-shot run_gpu_pipeline(cfg) overload — useful for testing and
for users who want the smaller peak even when the pool fits.
Phase 4: VRAM tracking + cap enforcement.
* StreamingStats struct + s_malloc/s_free route every cudaMalloc
in the streaming path through a tracker. POS2GPU_MAX_VRAM_MB
enforces a soft cap (throws before the allocation exceeds it);
POS2GPU_STREAMING_STATS=1 prints a per-alloc trace and a final
peak-VRAM summary. Pinned host allocations are excluded from the
cap since they don't consume device VRAM.
Phase 5: automatic dispatch with typed exception.
* New InsufficientVramError in GpuBufferPool.hpp, thrown by the pool
ctor specifically from its cudaMemGetInfo pre-check (other CUDA
failures still throw plain std::runtime_error).
* run_gpu_pipeline(cfg) and BatchPlotter::run_batch catch
InsufficientVramError and route to the streaming pipeline. No
user-facing flag. Prior approach string-matched .what() — brittle;
typed exception is compile-time-safe.
Phase 6: memory reductions to land under 8 GB at k=28.
* launch_t1_match / launch_t2_match now emit SoA streams — meta
(uint64), mi (uint32), xbits (uint32 for T2) — instead of packed
T1PairingGpu / T2PairingGpu arrays. Same total bytes, but the mi
column can be fed directly to CUB as the sort key input and freed
as soon as CUB consumes it (skips a copy-only extract kernel and
reclaims ~1 GB at k=28). Pool path carves the three SoA arrays
out of its existing d_pair_a slot; streaming allocates them as
three separate cudaMallocs.
* Streaming T2 sort splits the previously-fused merge_permute_t2
into three passes: merge_pairs_stable_2way → gather_u64 meta →
gather_u32 xbits. Frees source column between passes so each
gather's peak only holds one source + one output. Drops post-CUB
T2 peak from 9,360 MB to 7,280 MB.
* Streaming T2 sort uses N=4 tiling + tree-of-2-way-merges
(tile 0+1 → AB, tile 2+3 → CD, AB+CD → final). Halves per-tile
CUB scratch (~1,044 MB → ~522 MB); AB/CD intermediates fit in
the headroom gained. Without this, the binding CUB-scratch peak
was 8,324 MB — 130 MB over the 8 GB target.
* Alloc reorder throughout: sort outputs (d_t{1,2}_meta_sorted,
d_t2_xbits_sorted) are allocated only after CUB has freed its
scratch + vals_in buffers, keeping ~3 GB from going live all at
once.
Batch-mode streaming.
* BatchPlotter's streaming-fallback branch maintains two cap-sized
pinned D2H buffers (double-buffered like the pool path: plot N
writes slot N%2 while consumer reads slot (N-1)%2) and threads
them into a new overload:
run_gpu_pipeline_streaming(cfg, pinned_dst, pinned_capacity)
Returns a borrowing result (external_fragments_ptr into pinned_dst)
so the consumer reads directly from pinned — no intermediate
owning-vector copy.
* streaming_alloc_pinned_uint64 / streaming_free_pinned_uint64 shims
live in GpuPipeline.cu so BatchPlotter.cpp (plain .cpp without
cuda_runtime.h on its include path) can own pinned buffers.
* XCHPLOT2_STREAMING=1 also bypasses pool construction in
BatchPlotter; matches the one-shot dispatch and makes the batch
streaming path testable on high-VRAM hardware.
* Amortises the ~600 ms cudaMallocHost(2 GB) cost away: k=28 batch
streaming is 3.65 s/plot vs 3.05 s/plot pool; the remaining 0.60 s
delta is per-phase device alloc/free (streaming's whole point).
Parity.
* t1_parity, t2_parity rebuild the AoS form locally after the SoA
match kernels emit, preserving the existing CPU-vs-GPU set-equality
check. Both still ALL OK across all seeds.
* Pool vs streaming bit-exact at k=18 (6 plot_id × strength cases)
and k=28 (plot_id=0xab*32).
Measured k=28 streaming peak trajectory on a 4090:
| Stage | Peak VRAM |
|---------------------------------------|----------:|
| Before Phase 6 | 12,484 MB |
| Fuse + reorder | 10,400 MB |
| T2 match SoA | 9,360 MB |
| T2 sort 3-pass | 8,324 MB |
| T1 match SoA | 8,324 MB |
| N=4 T2 tile + tree merge (final) | 7,802 MB |
k=28 pool batch steady-state (5 plots on 4090, full free VRAM):
~2.09 s GPU per plot, 2.28 s wall/plot. Consistent with the pre-Phase-6
baseline — the SoA rewiring was structural, not perf-regressing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a streaming-path row to the perf table (~3.7 s/plot vs pool's
~2.06 s at k=28 on a 4090 — the delta is per-phase alloc/free that the
streaming path pays in exchange for a ~7.8 GB peak that fits on an
8 GB card), expands the VRAM section to describe the two code paths
and the auto-dispatch at pool construction, and notes the
XCHPLOT2_STREAMING=1 override for forcing streaming on a high-VRAM
card. Architecture block cross-references the new streaming variant
in GpuPipeline. No user-visible API change — callers use the same
`xchplot2 plot` / `test` / `batch` commands and get the right path
based on available VRAM, with `GpuBufferPool::InsufficientVramError`
as the dispatch signal.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two small, stacking perf wins in the three match-kernel wrappers
(T1, T2, T3).
1. compute_bucket_offsets is no longer <<<1, 1>>>.
The old kernel ran on a single thread that walked num_buckets
binary searches serially. Latency is fine at strength=2 (16
buckets at k=28) but scales linearly with (1 << strength) —
painful at higher strengths. The new kernel dispatches one thread
per bucket; thread num_buckets writes the sentinel
offsets[num_buckets] = total. Launched with
blocks = (num_buckets + 1 + 255) / 256.
Correctness preserved: each thread does the same lower_bound
lookup on its assigned bucket id as the old loop, just without
the monotone "start at previous pos" hint (the starting 'pos' in
the old version was purely a speedup; results are identical).
2. l_count_max is no longer computed on the host.
The old path D2H'd the bucket-offsets array, cudaStreamSynchronize'd,
and computed max over num_sections on CPU to size blocks_x for
match_all_buckets. Three host fences per plot.
Replaced with max_pairs_per_section(k, section_bits) from the new
shared header src/host/PoolSizing.hpp. This is the same formula
GpuBufferPool uses to size the persistent pool — a safe upper
bound on per-section L-count. Excess threads launched past the
real L-count early-exit on the existing `l >= l_end` guard at the
top of match_all_buckets, so the over-launch is free on the GPU.
The shared-header move also replaces the duplicated
max_pairs_per_section formula in GpuBufferPool.cu's anon namespace
and GpuPipeline.cu's max_pairs_per_section_streaming helper.
Measured on RTX 4090 (21 GB free), k=28 batch of 5 plots:
Before: producer 2.09 s/plot, batch wall 2.28 s/plot.
After: producer 1.96 s/plot, batch wall 2.15 s/plot.
That's ~6 % wall reduction per plot, bigger than the ~150 µs × 3 that
the raw host-fence count would suggest. cudaStreamSynchronize drains
CUB's internal async state as well as the one kernel, so removing it
unblocks more than just the offsets kernel.
Parity verified:
* t1_parity, t2_parity: ALL OK against the CPU reference (set
equality).
* Pool vs streaming bit-exact at k=18 (2 plot-ids × 2 strengths) and
k=28 (plot_id=0xab*32).
Prerequisite for subsequent PRs (per-phase streams + async D2H via
cudaEvent) that depend on the absence of the host fence to let phases
and plots actually overlap.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous wording ("build.rs auto-detects... falling back to sm_89.
Override with \$CUDA_ARCHITECTURES") didn't say what the override
actually does or when you'd reach for it. Now it spells out:
* autodetect is via `nvidia-smi --query-gpu=compute_cap` — builds for
only that architecture so the binary is small and the build is fast;
* fallback to sm_89 fires when nvidia-smi isn't in PATH or doesn't see
a GPU (containers, headless CI builders without the driver);
* override with CUDA_ARCHITECTURES when building for a different GPU
than the one compiling, or when you want a fat binary covering
multiple architectures (e.g. "89;120" for Ada + Blackwell).
Added a short table of common compute_cap values (61..120) so users
don't have to look them up separately.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A reader landing on the README first wants to know whether their
hardware will run it at all — not where it sits on a 4090 perf curve.
Swap the two so the top-of-README info is "will this work for me?"
and benchmarks live at the bottom as a forward-looking reference.
Hardware compatibility now lists, up front:
* GPU compute cap floor (sm_61; Pascal / GTX 10-series and up).
* VRAM floor (8 GB, auto-streaming) and steady-state preference
  (16 GB+, pool path) with a cross-reference to the existing VRAM
  section.
* PCIe width impact (Gen4 x4 → +240 ms/plot), with the live-check
  incantation that used to live in the Performance preamble.
* Host RAM (~16 GB; batch pins ~4 GB).
* Toolkit / runtime notes (CUDA 12+ to build, 12.8+ needed at runtime
  for Blackwell sm_120).
* OS (Linux tested; Windows/macOS not).
Performance section kept intact and moved just above License. Also
refreshed the pool-path batch-wall row to 2.15 s/plot — the value
from the most recent 5-plot benchmark after the compute_bucket_offsets
+ l_count_max cleanup.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the pool carried 2 rotating pinned D2H slots and the batch
producer/consumer channel held depth 1. That matched the measured case
of producer wall > consumer wall (GPU ~2 s/plot, consumer FSE+fwrite
~1 s/plot on NVMe) — the consumer always caught up before the producer
overwrote its slot. For deployments where the consumer is the long
pole, depth 1 leaves the GPU idle while the consumer catches up.
Concretely: a batch on SATA SSD (~500 MB/s) pushes FSE+write to
~4.4 s/plot, flipping the ratio.
Parameterise on GpuBufferPool::kNumPinnedBuffers (static constexpr
= 3). Pool ctor/dtor loop-allocate/free. Pool-overload of
run_gpu_pipeline's pinned_index check widened to the new upper bound.
BatchPlotter's streaming-fallback pinned array likewise grows to 3
via the existing streaming_alloc_pinned_uint64 shim.
Channel becomes a bounded queue instead of std::optional:
* capacity = kNumPinnedBuffers - 1 (= 2 currently).
* push waits on cv_not_full; pop on cv_not_empty.
* Invariant: the producer's slot-(i%N) reuse is safe because the
  channel holds at most (N-1) items, so the consumer must have popped
  plot (i - N) before the producer enqueues plot i.
Host pinned cost at k=28: 4 GB → 6 GB. Device VRAM unchanged. On the
4090+NVMe reference the measured batch wall stays at 2.15 s/plot
(producer-bound, depth doesn't help), confirming the change is latent
capacity rather than a perf regression.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the CUDA-only kernels with portable SYCL implementations
compiled by AdaptiveCpp. Each kernel TU now lives as a .cpp consumed by
acpp; CUDA TUs (.cu) only ship when XCHPLOT2_BUILD_CUDA=ON.
Shared infrastructure:
- PortableAttrs.hpp — POS2_DEVICE_INLINE / POS2_HOST_DEVICE macros
that compile correctly under nvcc and acpp.
- AesTables.inl — AES T-tables shared between the CUDA and SYCL paths.
- SyclBackend.hpp — per-process sycl::queue (gpu_selector) plus a
device-side AES table buffer initialised on first use.
Per-kernel SYCL ports (.cpp consumed by acpp):
- T1OffsetsSycl, T2OffsetsSycl, T3OffsetsSycl
- PipelineKernelsSycl, XsKernelsSycl
- Renamed pipeline TUs (T1/T2/T3Kernel.cu, XsKernel.cu, GpuPipeline.cu,
GpuBufferPool.cu) to .cpp; outer wrappers now take sycl::queue&.
Sort wrapper:
- Sort.cuh declares launch_sort_pairs_u32_u32 / launch_sort_keys_u64
over sycl::queue&.
- SortCuda.cu (XCHPLOT2_BUILD_CUDA=ON) wires CUB radix sort, bridging
the queue↔CUDA-stream boundary by draining q with q.wait(), running
CUB on the default stream, then cudaStreamSynchronize.
- SortSycl.cpp ships as a stub that throws on call; the hand-rolled
SYCL radix sort lands in the next commit.
- AesStub.cpp provides a no-op initialize_aes_tables for non-CUDA
builds.
CMake:
- XCHPLOT2_BUILD_CUDA option (default ON) selects between SortCuda.cu /
SortSycl.cpp and AesGpu.cu+AesGpuBitsliced.cu / AesStub.cpp.
- enable_language(CUDA) and find_package(CUDAToolkit) are gated on the
option; CUDA include paths are probed and exposed to acpp TUs that
transitively pull cuda_fp16.h via AdaptiveCpp's half.hpp.
- add_sycl_to_target wraps the SYCL TU set; pos2_gpu links the union.
Updated parity tests (.cu) take sycl::queue&. New SYCL-side parity
tools (sycl_bucket_offsets_parity, sycl_g_x_parity) validate the ported
kernels against the CUDA reference.
Build matrices verified end-to-end:
XCHPLOT2_BUILD_CUDA=ON → NVIDIA fast path with CUB
XCHPLOT2_BUILD_CUDA=OFF → SYCL-everywhere via AdaptiveCpp (sort still
stubbed)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the SortSycl.cpp stub with a hand-rolled stable LSD radix sort
that runs on every AdaptiveCpp backend (CUDA, HIP, Level Zero, OpenCL).
Pipeline (per 4-bit pass; RADIX=16; TILE_SIZE=1024):
Phase 1 — per-tile parallel count. Each WG (256 threads × 4 items)
reduces its tile into a 16-bucket WG-local histogram via local
atomics, then writes those 16 counts (no atomics) into bucket-major
tile_hist[d * num_tiles + t].
Phase 2 — single multi-WG exclusive scan over the entire bucket-major
tile_hist via AdaptiveCpp's scanning::scan (decoupled-lookback).
Because the layout is bucket-major, one 1-D scan yields tile_offsets
directly — each entry is the global start of tile t's bucket-d range
in the output. Stable by construction: tile t < t' always lands
earlier within bucket d.
Phase 3 — cooperative per-tile scatter. Items load contiguously per
thread into local memory; for each digit d the WG runs one
exclusive_scan_over_group on per-thread match counts to assign ranks
in input order (stable), then every thread scatters its matching
items to local_bases[d] + rank. All 256 threads stay active, no
sequential bottleneck.
Sort.cuh no longer pulls cuda_fp16 / cuda_runtime — those moved into
SortCuda.cu (the only nvcc TU that needs them), keeping the public
header backend-portable.
Adds tools/parity/sycl_sort_parity that exercises both wrappers
against a std::sort reference at counts {16, 16K, 256K, 1M} × seeds
{1, 7, 31}; built unconditionally so it validates whichever Sort
backend is wired in (CUB on the NVIDIA build, hand-rolled radix on
non-CUDA). All 24 cases pass on both backends.
Throughput on RTX 4090 (warm, N=1M):
pairs: CUB 1.27 ms, SYCL radix 0.92 ms
keys: CUB 1.70 ms, SYCL radix 1.28 ms
The SYCL radix beats CUB-via-bridge at this scale because there's no
per-call SYCL→CUDA→SYCL fence; CUB's tuning is expected to take the
lead at N >> 1M.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer pool aliases d_pair_b as the Xs construction scratch (the
"alias d_pair_b for that, so no separate allocation" trick), so
pair_bytes must be sized to fit either the largest pairing struct or
the full Xs scratch. The previous calculation only accounted for the
pairing structs (max 16 B/elem × cap = ~18 × total_xs at k=22), but
the Xs scratch is 4 × total_xs uint32s plus the sort temp — and the
sort temp alone is ~8 × total_xs (CUB's input/output API mode, and
similarly ~8 × total_xs for the SYCL radix's ping-pong buffers). That
puts the actual Xs need at ~24 × total_xs, exceeding pair_bytes on
every k I tried (20, 22, 24, 26, 28).
The constructor's runtime assertion was firing immediately on every
plot attempt at HEAD, on both the CUB and SYCL backends — the alias
was unsafe and we threw before allocating anything. End-to-end
plotting was therefore broken at HEAD prior to this fix.
Compute xs_temp_bytes first, then fold it into the pair_bytes max.
The runtime assertion is dropped because the size now provably fits
by construction.
VRAM impact: at k=28, pair_bytes grows from ~4.83 GB (18 × total_xs)
to ~6.4 GB (24 × total_xs), so two pair buffers cost an extra
~3.2 GB. Still comfortable on a 24 GB card.
Verified end-to-end on RTX 4090, k=28 (warm timings, mean of 3):
CUB: 7.25 s/plot (XCHPLOT2_BUILD_CUDA=ON)
SYCL: 10.24 s/plot (XCHPLOT2_BUILD_CUDA=OFF)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Probe the build host once at configure time and pick a sensible
AdaptiveCpp target list:
- NVIDIA detected (nvidia-smi works) → ACPP_TARGETS=generic.
Counter-intuitively, AdaptiveCpp's LLVM SSCP "generic" path is a
few percent faster than cuda:sm_<arch> on our kernels at k=28
(warm wall: 7.25 s vs 7.78 s on RTX 4090 with the CUB build);
SSCP's runtime specialization beats CUDA-AOT for this workload.
- AMD detected (rocminfo Name: gfxXXXX) → ACPP_TARGETS=hip:gfxXXXX.
SSCP's HIP path is less mature, so AOT-compiling for the actual
gfx target is the safer pick on AMD.
- Otherwise → ACPP_TARGETS=generic (works everywhere; JITs on
first use).
User-overridable via -DACPP_TARGETS=... (CMake) or $ACPP_TARGETS
(cargo install). The CMake-side detection runs in execute_process
with ERROR_QUIET so missing tools just fall through cleanly. The
build.rs side reuses the existing detect_cuda_arch() result and
adds detect_amd_gfx() for the rocminfo path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The SYCL/AdaptiveCpp port is ~1.5× slower on NVIDIA at k=28 than the
original CUDA-only implementation. Users who only ever target NVIDIA
should know they have the option of the legacy CUDA-only branch
without giving up performance for portability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add steady-state batch numbers for the three current paths on RTX
4090 at k=28: cuda-only (2.15 s/plot), main+CUB (2.41 s/plot),
main+SYCL (3.79 s/plot). Note that main+CUB is +12% over cuda-only
and main+SYCL is +57% over CUB — the gap is host-side AdaptiveCpp
scheduling overhead, not kernel perf (per-kernel nsys is within ~7%
across the two paths).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Build section was a one-liner about CUDA + C++20 + CMake + Rust;
it didn't mention AdaptiveCpp at all even though slice 9 made
AdaptiveCpp a hard build dependency. Restructure into:
- Required toolchain (AdaptiveCpp, CUDA Toolkit headers + optional
nvcc, C++20 compiler, CMake, Rust). Note that CUDA Toolkit headers
are required on every build path because AdaptiveCpp's half.hpp
pulls cuda_fp16.h.
- Auto-fetched at configure time (pos2-chip via FetchContent, FSE
vendored under pos2-chip).
- Optional GPU runtimes for non-NVIDIA targets (ROCm probed by the
ACPP_TARGETS autodetect; oneAPI requires manual override).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…iveCpp
Three layered install paths so users can pick the friction they want:
1. Containerfile (podman-first, also docker). Build args select the
base image: nvidia/cuda for CUB+SYCL, rocm/dev-ubuntu for AMD,
intel/oneapi for Intel (experimental). All variants build
AdaptiveCpp 25.10 from source inside the image and ship a slim
runtime stage. ~15-30 min first build, layer-cached after.
2. scripts/install-deps.sh — distro-aware native bootstrap covering
Arch, Ubuntu/Debian, and Fedora families. Detects GPU vendor via
nvidia-smi/rocminfo and installs the right toolchain (full CUDA
for NVIDIA, CUDA *headers* + ROCm for AMD), then builds
AdaptiveCpp into /opt/adaptivecpp. --no-acpp opts out and lets
CMake fetch it.
3. CMake FetchContent fallback. find_package(AdaptiveCpp QUIET)
followed by FetchContent_Declare at v25.10.0 with
FetchContent_MakeAvailable when the local lookup fails. Opt-in
option XCHPLOT2_FETCH_ADAPTIVECPP=ON (default ON). The
add_sycl_to_target macro is verified after the fetch — if
AdaptiveCpp doesn't expose it as a subproject we error with a
pointer to the manual install.
build.rs also now reads $XCHPLOT2_BUILD_CUDA so the AMD/Intel container
builds can flip XCHPLOT2_BUILD_CUDA=OFF without touching CMake invocation.
README's Build section restructured into three clearly-labeled paths
with the full dependency table moved into path #3.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Validated the Containerfile by running a full podman build and a k=22
plot inside the container with GPU passthrough via CDI. Output is
MD5-identical to the host build (42dedec6...). Five fixes uncovered
along the way:
1. Add lld-18 to the apt install list — AdaptiveCpp's CMake
hard-errors when ld.lld is missing from PATH. Also pass
-DACPP_LLD_PATH=/usr/lib/llvm-18/bin/ld.lld explicitly.
2. Move ACPP_TARGETS autodetect *before* find_package(AdaptiveCpp)
in CMakeLists. AdaptiveCpp's package config reads the value at
find time, and an empty -DACPP_TARGETS= (default Containerfile
build-arg) makes acpp error out with "Unknown backend: ".
3. build.rs treats `Ok("")` from env::var("ACPP_TARGETS") the same
as Err — Containerfile build-args propagate as empty env vars
when the user doesn't override.
4. Link against AdaptiveCpp's runtime libs (acpp-rt + acpp-common)
in build.rs. The static archives produced by CMake reference
hipsycl::rt::* symbols that live there. ACPP_PREFIX env var
(default /opt/adaptivecpp) controls the search path; an rpath
entry is also added so the binary finds them at runtime.
5. Use the CUDA *devel* image as BASE_RUNTIME (not the slim
runtime) and install the full llvm-18 package in the runtime
stage — AdaptiveCpp's SSCP path shells out to `opt-18` and
`ptxas` at runtime, both of which are missing from the slim
CUDA runtime + libllvm18 combination ("LLVMToPtx: opt
invocation failed with exit code -1").
Plus a .dockerignore that drops build-*/, target/, third_party/, and
.git/ from the build context (was 946 MB, now ~50 MB).
Containerfile header comments still document the AMD ROCm and Intel
oneAPI build-arg combinations, but those remain untested.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The buffer-pool sizing fix (commit d70eefb) raised pool_bytes to
include the aliased Xs scratch, which pushed pool_total at k=28 from
~12 GB to ~15.2 GB device + the 0.5 GB margin. The previous "16 GB+
cards use the pool" framing is now stale — RTX 4080 (16 GB) sits
below the threshold after driver overhead and transparently falls
back to streaming. Update the hardware-compat blurb and the VRAM
section to reflect the new threshold and example cards (4090 / 5090 /
A6000 / H100). Auto-fallback still hides the change from users.
Steady-state per-plot reference also corrected from ~2.1 s to ~2.4 s
(matches the post-port batch numbers in the Performance table).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make explicit at the top of the README that plots are accurate
(per-phase parity vs pos2-chip + bit-identical between backends +
deterministic), but the project is still under active development on
performance, cross-vendor support, and tooling. Point first-time
users who just want a stable path at the cuda-only branch.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User report: cargo install on a different machine fails to link with:
rust-lld: error: unable to find library -lacpp-rt
rust-lld: error: unable to find library -lacpp-common
build.rs's hardcoded prefix list was incomplete (missed Ubuntu's
/usr/lib/x86_64-linux-gnu, Arch's /usr/lib, and the FetchContent
build tree under OUT_DIR/cmake-build/_deps/adaptivecpp-build/).
CMakeLists now writes the actual AdaptiveCpp lib directory to
$cmake_build/acpp-prefix.txt at configure time:
- For installed AdaptiveCpp, derive from AdaptiveCpp_DIR
(<prefix>/lib/cmake/AdaptiveCpp → <prefix>/lib).
- For FetchContent builds, evaluate $<TARGET_FILE_DIR:acpp-rt>
at file(GENERATE) time so the path resolves to the in-tree
build artifact location.
build.rs reads acpp-prefix.txt first, falls back to ACPP_PREFIX /
AdaptiveCpp_ROOT env vars, then probes a wider list of standard
locations (/opt/adaptivecpp/lib, /usr/local/lib,
/usr/lib/x86_64-linux-gnu, /usr/lib).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported a low-VRAM machine (8 GB free at k=28) ballooning to
~130 GB host RAM during a failed batch run. The streaming pipeline
errored with "sycl::malloc_device(d_xs_temp): null" but kept
accumulating allocations across the failure path.
Two leak-resistance fixes:
1. GpuBufferPool ctor wraps its allocation sequence in try/catch
and frees any partial allocations before rethrowing. Without
this, a mid-sequence OOM (e.g. d_pair_b after d_pair_a/d_storage
succeeded) leaks ~10 GB device + ~7 GB pinned host per failed
ctor — pathological under any retry loop.
2. GpuPipeline streaming's StreamingStats now has a destructor
that frees every allocation still tracked in its sizes map.
If the streaming function throws partway (Xs phase OOM after
d_xs already succeeded, T1 match OOM after T1 buffers
allocated, etc.), the dtor runs on unwind and releases what's
live. Removes the GPU leak that previously cascaded into the
batch loop's pinned-host accounting.
Plus a clearer s_malloc error message when sycl::malloc_device
returns null — includes phase, requested size, live total, and a
hint to try a smaller k or larger card. Replaces the cryptic
"sycl::malloc_device(d_xs_temp): null" with actionable info.
These don't yet make 8 GB cards fit at k=28 on the SYCL build —
that needs Xs tiling and/or SortSycl scratch reduction (next
slice). They just stop leaking when the size mismatch hits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUB-style DoubleBuffer pattern: launch_sort_pairs_u32_u32 and
launch_sort_keys_u64 now treat keys_in/keys_out (and vals_in/vals_out)
as a ping-pong pair across radix passes instead of allocating their own
keys_alt/vals_alt scratch (which was 8 × N bytes — 2 GB at k=28!).
The result always lands in keys_out; if the last pass leaves the data
in keys_in (an even pass count), the wrapper does one final memcpy
from keys_in.
API change: keys_in/vals_in are now non-const (caller treats them as
scratch on input). The CUB backend ignores the non-constness; the SYCL
backend uses both buffers as the ping-pong directly. Updated all call
sites (GpuBufferPool, GpuPipeline T1/T2/T3 sort sizing queries).
Memory wins at k=28 on the SYCL build:
pair_bytes: 6.0 GB → 4.36 GB
xs_temp: 6.18 GB → 4.33 GB
sort_scratch: 2.4 GB → 0.03 GB
pool total: 19 GB → 13 GB
streaming Xs: 8.2 GB → 6.3 GB ← fits 8 GB cards now!
Verified:
- All 24 sycl_sort_parity tests pass on the new sort.
- k=22 plot output is byte-identical between CUB and SYCL builds
(same MD5 42dedec6...).
The extra memcpy on even-pass counts (versus the old code's initial
memcpy on entry) is a wash; total bytes copied per sort are unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rofile
CUB's input/output SortPairs API allocates ~2 GB of internal temp
keys/vals at N=2^28 — that's what kept the streaming Xs scratch at
~6 GB on the CUB build, OOM-ing 8 GB cards just like the (now-fixed)
SYCL build did. Switch to cub::DoubleBuffer mode: the caller's
keys_in/keys_out and vals_in/vals_out act as the radix ping-pong, and
CUB's own scratch shrinks to ~MB of histograms.
Side effect of DoubleBuffer mode: CUB picks which buffer the result
lands in (db.Current()), which may be either keys_in or keys_out
depending on the radix pass count. Mirror SortSycl's behaviour with a
final cudaMemcpyAsync from db.Current() to keys_out when needed,
preserving the public API contract (result always in keys_out).
Memory wins at k=28 on the CUB build:
pair_bytes: 6.0 GB → 4.36 GB
xs_temp: 6.0 GB → 4.33 GB
pool total: 19 GB → 13 GB
streaming Xs: 8.0 GB → 6.3 GB ← fits 8 GB cards now too
Verified: k=28 plot is byte-identical between CUB and SYCL builds
(MD5 814b4f2e...).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container UX before required users to manually pass --build-arg for
BASE_DEVEL, BASE_RUNTIME, ACPP_TARGETS, XCHPLOT2_BUILD_CUDA,
INSTALL_CUDA_HEADERS — one chain per GPU vendor. compose.yaml wires
those up as three named services (cuda / rocm / intel) sharing the
same Containerfile, so users just pick:
podman compose build cuda # NVIDIA, default
ACPP_GFX=gfx1031 podman compose build rocm # AMD, gfx target via env
podman compose build intel # Intel, untested
Each service also handles GPU device passthrough (nvidia.com/gpu=all
on CUDA, /dev/kfd + /dev/dri + group_add: video on ROCm) and
bind-mounts ./plots → /out so output lands on the host.
Containerfile additions: build the parity tests (sycl_sort_parity,
sycl_bucket_offsets_parity, sycl_g_x_parity, plot_file_parity) via a
plain CMake step after the cargo install, and copy them to
/usr/local/bin in the runtime stage. Lets users run a quick
first-port validation on a new GPU before attempting a full plot:
podman compose run --rm --entrypoint /usr/local/bin/sycl_sort_parity rocm
Image size grew from 2.54 GB → 7.78 GB because the runtime stage now
uses the CUDA *devel* image (needed by SSCP for runtime PTX assembly,
already required for SortCuda's nvcc TUs in the CUDA build) and
ships LLVM 18 binaries. Worth it for self-containment.
README's "Container" section rewritten to lead with compose.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Container builds run without GPU access, so compose.yaml has to
hardcode defaults (sm_89 for cuda, gfx1100 for rocm). The new wrapper
runs on the host (where nvidia-smi/rocminfo work), detects vendor +
arch, and exports CUDA_ARCH or ACPP_GFX before invoking compose.
./scripts/build-container.sh                  # auto-detect
./scripts/build-container.sh --gpu amd        # force AMD path
./scripts/build-container.sh --engine docker
Drops the AMD UX from "set ACPP_GFX=gfx1031 then podman compose build
rocm" to a single command. README updated to lead with the script and
keep the manual compose invocation as an override path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Significant new functionality since 0.1.0 / the cuda-only era:
- SYCL/AdaptiveCpp port (slices 1-18); cross-vendor architecture
(AMD via HIP, Intel via Level Zero) with CUB preserved as opt-in
fast path on NVIDIA.
- Hand-rolled stable parallel SYCL radix sort.
- GpuBufferPool sizing fix + free-on-throw RAII.
- Both sort backends switched to DoubleBuffer ping-pong, dropping
Xs scratch from ~6 GB to ~4.3 GB at k=28 — 8 GB cards now plot
successfully via the streaming pipeline.
- Containerfile + compose.yaml + scripts/install-deps.sh +
scripts/build-container.sh: three layered install paths.
- Auto-detect ACPP_TARGETS, CUDA arch, and (in the container
wrapper) GPU vendor.
- README, performance numbers, branch / WIP docs.
CLI surface unchanged; user-visible API stable. No breaking changes
for anyone who only consumed `xchplot2 plot/test/batch`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the AMD container build failing with:
fatal error: cannot open file '/opt/rocm/amdgcn/bitcode/ocml.bc':
Unknown attribute kind (102) (Producer: 'LLVM22.0.0git'
Reader: 'LLVM 18.1.3')
ROCm ships its own LLVM (currently dev-tip / LLVM 22). The HIP device
bitcode (ocml.bc, ockl.bc, …) is produced with that LLVM. AdaptiveCpp
was being built against Ubuntu's llvm-18, so when its HIP backend
linked our SYCL kernels against ROCm's bitcode, LLVM 18's reader
choked on LLVM 22's attribute encoding.
Fix: parametrize the LLVM toolchain via two new build args:
- LLVM_ROOT = base prefix containing bin/clang etc.
- LLVM_CMAKE_DIR = directory of LLVMConfig.cmake
(Ubuntu and ROCm lay these out differently —
Ubuntu: $LLVM_ROOT/cmake,
ROCm: $LLVM_ROOT/lib/cmake/llvm)
Defaults preserve Ubuntu's llvm-18 layout (NVIDIA/Intel paths
unchanged); compose.yaml's rocm service overrides both to point at
/opt/rocm/llvm so AdaptiveCpp + HIP backend match the bitcode
producer.
Also corrected a typo in the prior version: $LLVM_ROOT/bin/ contains
unsuffixed binaries (clang, clang++, ld.lld) — the -18 suffix only
exists on the Ubuntu /usr/bin/ symlinks, not in the versioned llvm-18
dir itself.
Verified: NVIDIA container still builds clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported build-container.sh on a fresh AMD machine printing
"No GPU detected" because rocminfo wasn't installed — and
install-deps.sh's AMD package list (rocm-hip-sdk + rocm-libs) doesn't
pull rocminfo transitively.
- install-deps.sh: add rocminfo to all three distro AMD package
lists (Arch, Ubuntu/Debian, Fedora). It's the discovery tool
build-container.sh probes; tiny package, harmless to always
install on the AMD path.
- build-container.sh: when neither nvidia-smi nor rocminfo is
found, print a multi-line hint pointing the user at either
installing the right discovery tool, running install-deps.sh,
or forcing the vendor explicitly with --gpu.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
User reported the script printing "No GPU detected" even though
rocminfo was installed and `command -v rocminfo && rocminfo | grep -q
gfx` returned MATCH when run inline.
The bug: the script enables `set -o pipefail`, which makes a pipeline
return the rightmost non-zero exit code. rocminfo (and some nvidia-smi
configurations) exit non-zero even when their output contains usable
GPU info. So `rocminfo 2>/dev/null | grep -q gfx` returned 0 from grep
but the pipeline returned 1 from rocminfo, causing the elif branch to
evaluate to false.
Restructure: capture each tool's stdout into a variable first (with
`|| true` to swallow the non-zero exit), then test the captured string
with [[ pattern ]]. No pipeline, no pipefail interaction.
Verified: script now correctly detects NVIDIA on this host
(vendor=nvidia service=cuda CUDA_ARCH=89). Should now work for AMD
hosts where rocminfo is installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
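The interaction is easy to reproduce in isolation. A minimal sketch — `fake_rocminfo` is a hypothetical stand-in for a tool that prints usable output but exits non-zero, as described above:

```shell
#!/usr/bin/env bash
# Demonstrates the pipefail trap and the capture-then-test fix.
# fake_rocminfo is a stand-in: prints a gfx line but exits 1,
# the way rocminfo does on some hosts.
set -euo pipefail

fake_rocminfo() { echo "Name: gfx1031"; return 1; }

# Broken pattern: grep exits 0, but pipefail reports rocminfo's 1,
# so the condition is false despite the match.
if fake_rocminfo 2>/dev/null | grep -q gfx; then
  echo "pipeline: detected"
else
  echo "pipeline: NOT detected"
fi

# Fixed pattern: capture first (|| true swallows the exit code),
# then test the captured string — no pipeline, no pipefail interaction.
rocm_out=$(fake_rocminfo 2>/dev/null || true)
if [[ "$rocm_out" == *gfx* ]]; then
  echo "capture: detected"
fi
```

Run under bash: the first branch prints "NOT detected" even though grep matched, which is exactly the elif failure the script hit; the capture pattern detects the GPU.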
…ipt)
Second pipefail trap, same shape as the first one. The old gfx-detection
line:
gfx=$(rocminfo 2>/dev/null | awk '/.../ {print; exit}')
awk's `exit` after the first match closes its stdin, which delivers
SIGPIPE to rocminfo (still writing). With pipefail the pipeline returns
141 (128 + 13); set -e then exits the script silently.
That's why the user reported "no output" — the script was dying on
SIGPIPE right after writing the rocm_out variable, before reaching any
echo. The bash -x trace confirmed: execution reached `gfx=gfx1031`,
exit 141, no further output.
Fix: reuse the rocm_out string captured during vendor detection (or
capture it now if --gpu amd was forced) and parse with bash's built-in
[[ =~ ]] regex — no pipes, no SIGPIPE risk.
Verified locally: NVIDIA detection still works (vendor=nvidia
service=cuda CUDA_ARCH=89).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
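The SIGPIPE mechanics can be reproduced self-contained — here `yes` stands in for rocminfo, since it is still writing when awk closes the pipe after the first match:

```shell
#!/usr/bin/env bash
# Sketch of the second pipefail trap: awk's early exit delivers
# SIGPIPE to the still-writing producer, and pipefail surfaces 141.
set -o pipefail    # -e deliberately off here so the status can be printed

gfx=$(yes gfx1031 | awk '/gfx/ {print; exit}')
status=$?
echo "match: $gfx  pipeline status: $status"   # 141 = 128 + SIGPIPE(13)

# Fix: regex-parse a string that was already captured — no pipe at all.
rocm_out="Agent 2 ... Name: gfx1031 ... Device Type: GPU"
if [[ "$rocm_out" =~ gfx[0-9]+ ]]; then
  echo "gfx=${BASH_REMATCH[0]}"
fi
```

With `set -e` also enabled (as in build-container.sh), the 141 status kills the script right after the assignment — matching the "no output" report.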
The three files that were committed under docs/ (gpu-portability-sketch.md, perf-opportunities.md, streaming-pipeline-design.md) are working notes from the SYCL port slices, not shipped documentation. One of them even self-identifies as "not shipped with the repo" in its first paragraph. Add docs/ to .gitignore and remove the existing files from the index. User-facing documentation belongs in README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
typos' default dictionary tokenises mis-foo as `mis` + `foo` and flags `mis` as a misspelling of `miss`/`mist`. Both occurrences in validate_t1_count's broadened diagnostic from the previous commit trip this. Reword to `miscompile`/`miscompiling` — same compiler meaning, single token, dictionary-clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streaming-pipeline plain and sliced T1 paths now print, when the env var is set, the first 16 d_xs (match_info, x) entries before the matcher launches and the resulting t1_count after. This discriminates "upstream Xs phase silently produced wrong data" from "matcher kernel fails at scale" on the W5700 / gfx1010 generic-JIT case where plot -k 28 returns 0 T1 entries while small-N parity passes. Gated on env var; default-off so production paths see no change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When XCHPLOT2_BUILD_CUDA=OFF, autodetect ROCm via hip/hip_runtime.h.
If present, define XCHPLOT2_SKIP_CUDA_RUNTIME and
XCHPLOT2_SKIP_CUDA_FP16 so CudaHalfShim.hpp falls back to its opaque
stubs instead of pulling in CUDA's <cuda_runtime.h>.
Without the skip, dual-toolchain hosts (CUDA Toolkit + ROCm both
installed, e.g. the reporter's W5700 box) hit typedef redefinition
errors on char1 / int2 / etc. between CUDA's <vector_types.h> and
ROCm's <amd_hip_vector_types.h>.
Single-toolchain hosts (CUDA-only or AMD-only without CUDA Toolkit)
are unaffected: the find_path is only triggered on CUDA-off builds,
and the defines only land when ROCm is present.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The W5700 / k=28 plot showed [0..16] of every Xs intermediate uniformly 0xBE (HIP poison fill), suggesting either (a) launch_xs_gen no-op'd entirely on amdgcn at this scale, or (b) the kernel only failed to write the first few pages while bulk-writing further offsets. Sampling at head (idx=0), middle (idx=total/2), and tail (idx=total-16) discriminates the two — uniform 0xBE across all three positions confirms no-op; varied data at mid/tail confirms partial-write. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
W5700 / k=28: even after dropping Xs gen/pack tile to 1 k workgroups
(matching the parity-validated k=18 dispatch), post-gen H/M/T sees
0xCDCDCDCD (our pre-launch sentinel) — the kernel completes but
writes nothing. q.memset works (sentinel is visible), so queue
runtime primitives are fine; only kernel writes go missing. Smells
like AdaptiveCpp's HIP JIT producing empty stubs for our cooperative-
LDS + AesHashKeys kernels.
Two new env-gated checks before launch_xs_gen:
- Trivial parallel_for (256 work-items, no LDS, no captured struct,
no AES) writing 0xDEADBEEF to keys_a[0..16]. PASS / FAIL is a
yes/no on whether the SYCL submission path can dispatch *any*
kernel that actually writes on this device.
- Read d_aes_tables[0..16] from host — should match the standard
AES T0[0] = 0xC66363A5. If we see 0xBE or 0xCD instead, the
T-table USM buffer was never populated and the kernels are
reading garbage.
After this round we know whether the problem is below our level
(trivial kernel also fails) or above (trivial passes, our complex
kernels fail).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous SKIP_CUDA_FP16 path left __half / __half2 undefined entirely.
On most hosts that's harmless (AdaptiveCpp's libkernel never names
them on the HIP/SSCP path the build picks), but on the reporter's
W5700 / gfx1010 / gfx1013-spoof + ROCm + AdaptiveCpp combo, the
missing types caused the JIT to silently emit no-op kernel stubs —
every kernel dispatch completed cleanly with zero device-side writes
(sentinel fills survived intact through trivial parallel_for and the
AES kernels alike).
Three-tier resolution in CudaHalfShim.hpp:
1. CUDA Toolkit available + not skipped → <cuda_fp16.h>
2. ROCm available → <hip/hip_fp16.h> (provides __half via HIP)
3. Neither → minimal struct stubs (generic SSCP / Intel / containers)
Tier 2 is the one that activates when XCHPLOT2_BUILD_CUDA=OFF + ROCm
present (the configuration the prior CMake change targets), so AMD
builds now have __half from HIP rather than relying on AdaptiveCpp's
internal fallback.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Cargo's build.rs sets -Wl,-rpath for AdaptiveCpp's lib dir and
${rocm_root}/lib via rustc-link-arg, so the production xchplot2
binary loads HIP fine. CMakeLists.txt had no rpath setup, so
binaries built via plain `cmake -B build && cmake --build build
--target sycl_t1_parity` had an empty RUNPATH and threw
"hipsycl::sycl::exception: No matching device" at queue
construction because librt-backend-hip.so could not dlopen
libamdhip64.so.
Append _xchplot2_acpp_lib_dir and the ROCm install root's lib
subdir to CMAKE_BUILD_RPATH / CMAKE_INSTALL_RPATH globally,
right after both paths have been computed. The FetchContent
case (where _xchplot2_acpp_lib_dir is a generator expression)
is filtered out — CMake's BUILD_WITH_INSTALL_RPATH=OFF default
already covers in-tree targets there.
Verified locally:
readelf -d sycl_t1_parity → RUNPATH includes /opt/adaptivecpp/lib
and /opt/rocm/lib
unset LD_LIBRARY_PATH; ./sycl_t1_parity --k 22 → ALL OK
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds inline validate_kernel_dispatch(q) that runs once on first
sycl_backend::queue() call per worker thread:
- sycl::malloc_device 16 u32, throw clearly if it returns null
- q.memset to 0xCD sentinel
- q.parallel_for(16) writing kPattern + idx
- q.memcpy back, verify the writes landed
- throw std::runtime_error with a structured diagnostic message if not
The throw fires at the first GPU work request — well before any
plot-specific allocation, kernel compile, or pipeline state is set up,
turning a multi-round "T1 match produced 0 entries" investigation into
a single one-line failure that points at AdaptiveCpp's HIP/CUDA
backend producing a no-op kernel stub.
Common causes the diagnostic message points to:
- ACPP_DEBUG_LEVEL=2 to see the JIT compile log
- rocminfo / nvidia-smi vs the AOT target (build.rs cargo:warning)
- ACPP_TARGETS=generic to fall back from the spoof to SSCP JIT
Bypass with POS2GPU_SKIP_SELFTEST=1 once the device is known good
(useful for short-lived processes that re-validate every invocation).
Verified locally on RTX 4090 (gfx-spoof N/A, PTX backend):
- sycl_t1_parity --k 22 → ALL OK (self-test passes silently)
- POS2GPU_SKIP_SELFTEST=1 sycl_t1_parity --k 22 → ALL OK (bypass works)
Reported by the W5700 reporter — Radeon Pro W5700 / RDNA1 / gfx1010 /
gfx1013-spoof + AdaptiveCpp. Production kernel writes silently
no-op'd, surfacing only as 'T1 match produced 0 entries' deep in the
streaming pipeline. With this self-test, the same configuration would
have thrown immediately with a pointer to the diagnosis path.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The detect_amd_gfx() spoof rewrites gfx1010/1011/1012 → gfx1013 as a
community workaround for AdaptiveCpp not advertising those ISAs as
direct HIP AOT targets. Empirically the spoof has worked on some W5700
setups but silently produces no-op kernels on others (kernel writes
return cleanly with the output buffer untouched, surfacing as "T1
match produced 0 entries" deep in the streaming pipeline).
Add an opt-out env var so users on broken-spoof setups can try
AOT-targeting the actual ISA instead, without writing a full
ACPP_TARGETS string. Improve the cargo:warning to document both
opt-out paths (XCHPLOT2_NO_GFX_SPOOF=1 for native,
ACPP_TARGETS=generic for SSCP JIT) so users hitting the spoof can
self-help without re-deriving the escape hatches from the source.
No promise that the native target compiles — if AdaptiveCpp doesn't
accept gfx1010 as a HIP target on the user's toolchain version, the
build fails loudly. That's still strictly better than silently
producing broken kernels at runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Single-file, no pos2_gpu / pos2_gpu_host link — just sycl/sycl.hpp +
a 16-element parallel_for that writes a known pattern, copies back,
prints pass/fail per slot, and exits 0 on all-OK.
Use it as the first diagnostic step when sycl_t1_parity or the
production CLI silently produces no output. If hellosycl FAILs, the
SYCL runtime itself can't dispatch kernels on the detected device — no
xchplot2-level fix can recover, and the message points at the usual
suspects (rpath, JIT no-op stubs, ACPP_TARGETS picking an unsupported
ISA). If hellosycl PASSes, the runtime is healthy and the bug is
specific to our kernel patterns / pipeline.
Built via:
cmake --build build --target hellosycl
./build/tools/sanity/hellosycl
Or standalone:
ACPP_TARGETS=hip:gfx1013 acpp -O2 hellosycl.cpp -o hellosycl
LD_LIBRARY_PATH=/opt/rocm/lib ./hellosycl
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The earlier rpath fix put /opt/rocm/lib in the binary's RUNPATH, but
that only governs the binary's own dependency resolution. AdaptiveCpp
dlopens librt-backend-hip.so at runtime, and *that* lib then dlopens
libamdhip64 — glibc does not consult the calling binary's RUNPATH for
those transitive backend deps. Result: ROCm silently fails to load,
AdaptiveCpp falls through to its OpenMP host device, and tools like
hellosycl / sycl_t1_parity report "ALL OK" while having executed
entirely on CPU.
Mirror build.rs:631 (cargo:rustc-link-lib=amdhip64) — make libamdhip64
a direct dependency of every CMake-built executable when ROCm is
detected. The library is then loaded at process startup via RUNPATH,
so the subsequent dlopen from librt-backend-hip.so succeeds trivially
against the already-loaded handle.
Verified locally:
ldd build/tools/sanity/hellosycl
→ libamdhip64.so.7 => /opt/rocm/lib/libamdhip64.so.7
→ libhsa-runtime64.so.1 => /opt/rocm/lib/libhsa-runtime64.so.1
NVIDIA-only hosts (no /opt/rocm/lib/libamdhip64.so) skip the link
entirely via the EXISTS guard, so we don't regress builds without ROCm
installed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The gfx1013 AOT spoof for gfx1010/1011/1012 was a community workaround
that "should" run on close-ISA RDNA1 silicon. Empirically it has been
observed to silently produce no-op kernels on at least one W5700 /
ROCm 6 / AdaptiveCpp 25.10 setup — the kernel completes without
writing anything, the failure surfaces only as "T1 match produced 0
entries" deep in the streaming pipeline.
Same host with ACPP_TARGETS=generic (SSCP JIT) reproducibly:
- hellosycl: ALL OK on AMD Radeon Pro W5700
- sycl_t1_parity --k 22: ALL OK (4194833 / 4194833)
- sycl_t1_parity --k 24: ALL OK (16779604 / 16779604)
Default for RDNA1 (gfx1010/1011/1012) → ACPP_TARGETS=generic. Two
opt-in escape hatches preserved:
- XCHPLOT2_FORCE_GFX_SPOOF=1 → restore the legacy gfx1013 AOT path
for users who've validated their stack on it.
- XCHPLOT2_NO_GFX_SPOOF=1 → AOT-target the actual ISA natively
(build will fail if AdaptiveCpp doesn't advertise it).
Non-RDNA1 AMD targets (RDNA2+) are unchanged — rocminfo's gfx string
is passed through unmodified.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rgets
Previous gating of libamdhip64 link on `acpp_targets.starts_with("hip:")`
broke the new RDNA1 default. After d939ee8 flipped RDNA1 to
ACPP_TARGETS=generic, AMD hosts no longer hit the hip:* branch — so
libamdhip64 stopped being linked into the binary. AdaptiveCpp's
runtime dlopen of librt-backend-hip.so then failed to find
libamdhip64.so.6 (RUNPATH isn't consulted for transitive backend deps
on glibc), HIP backend didn't initialise, and the binary threw
"No matching device" at first queue construction.
Drop the hip:* gate. Link libamdhip64 whenever ROCm is reachable
(/opt/rocm/lib/libamdhip64.so exists or ROCM_PATH points at one).
NVIDIA-only hosts skip the link via the EXISTS guard. Mirrors the
CMakeLists.txt fix from commit 60b7528 (`link_libraries(libamdhip64.so)`)
for the cargo build path.
Reported by the W5700 reporter — W5700 binary built after the RDNA1 default
flip threw "No matching device" before any plot work.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The RDNA1 default flip in d939ee8 made detect_amd_gfx() return None
for gfx1010/1011/1012 (so the caller picks ACPP_TARGETS=generic). But
the same function was being used in the XCHPLOT2_BUILD_CUDA selector
to decide "is there an AMD GPU?". With detect_amd_gfx() now
returning None for RDNA1:
if usable_nvidia_arch().is_some() { ON } // false on the W5700 reporter's box
else if detect_amd_gfx().is_some() { OFF } // false! (RDNA1 → None)
else if detect_intel_gpu() { OFF } // false
else if detect_nvcc() { ON, "CI fallback" } // → ON
→ XCHPLOT2_BUILD_CUDA flipped to ON on his W5700 + CUDA-Toolkit-headers
host. SortCuda.cu compiled, linked, and ran its CUB calls against AMD
silicon, throwing "CUB memcpy keys_out: invalid argument" mid-pipeline
(after launch_xs_gen had correctly populated keys_a/vals_a — visible
in the POS2GPU_T1_DEBUG=1 output).
Add amd_gpu_present() that just probes rocminfo for any gfx GPU,
independent of which ACPP_TARGETS string we'd pick for it. Use it in
the BUILD_CUDA selector so the AMD branch fires for RDNA1 too.
ACPP_TARGETS detection unchanged — still uses detect_amd_gfx() for
"which gfx target", and that function's None for RDNA1 still steers
the caller into the generic-SSCP fallback.
Reported by the W5700 reporter — W5700, ROCm 6, AdaptiveCpp 25.10, CUDA Toolkit
headers present (for CudaHalfShim) but no real CUDA capability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rinsic claim
The previous floor of sm_61 was set on a misreading of AdaptiveCpp's
half.hpp: it does call __hadd / __hsub / __hmul / __hdiv / __hlt /
__hgt without __CUDA_ARCH__ guards, but cuda_fp16.hpp implements those
intrinsics with NV_IF_ELSE_TARGET(NV_PROVIDES_SM_53, native_PTX,
fp32_emulation_fallback). So pre-sm_53 cards get a software fp32
fallback baked into the headers themselves — code compiles and runs,
just slower. The floor was over-conservative.
Real constraints:
- sm_50: minimum that CUDA 12.x can codegen for. CUDA 11.x was
last to support Kepler (sm_30-37); not in scope for this floor.
- CUDA 13.x dropped sm_50-72 entirely; the existing CMakeLists
preflight catches that pairing with FATAL_ERROR + fix block.
Add a second arm in usable_nvidia_arch() that detects the toolkit
mismatch (sm < 75 + nvcc >= 13) and routes the user to CUDA 12.9 or
the container path that auto-pins it. The arm fires BEFORE we'd
attempt to build, sparing the user a cryptic mid-build error.
Net: any Maxwell+ NVIDIA card works as primary GPU as long as the
user pairs it with the right CUDA toolkit. Maintainable without
patching upstream AdaptiveCpp.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fault
Hardware compatibility table updated for two changes that landed
recently in build.rs:
- NVIDIA floor lowered sm_61 → sm_50 (commit a6985cf): pre-sm_53
cards now compile + run via cuda_fp16.h's fp32 emulation, no
AdaptiveCpp patch needed. Note added that build.rs also routes
around the CUDA 13 + sm < 75 toolkit mismatch.
- RDNA1 default flipped from gfx1013 AOT spoof to generic SSCP JIT
(commit d939ee8). The spoof was observed to silently produce
no-op kernels on at least one W5700; generic SSCP is now the
default, with XCHPLOT2_FORCE_GFX_SPOOF / XCHPLOT2_NO_GFX_SPOOF
as opt-in escape hatches.
Plus a CUDA-Toolkit-vs-arch matrix making the sm_50-72 / 12.9
constraint, the sm_75-90 / either-toolkit happy path, and the
sm_120 / 12.8+ constraint explicit instead of folded into a single
"12+ required" line.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
T0[a] is packed-LE (2S[a], S[a], S[a], 3S[a]). For S[0]=0x63 that's
bytes [C6 63 63 A5]; read as a little-endian u32 = 0xa56363c6 — which
is what the dump prints. The parenthetical inverted the byte order;
0xC66363A5 is the big-endian read of the same bytes (the form most AES
references show, hence the slip).
New text shows the algebraic construction plus the actual expected LE
value, so the operator can verify both "is the table populated" and
"is it the right table" at a glance under POS2GPU_T1_DEBUG=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
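The byte algebra can be checked directly; a small sketch in bash arithmetic (the xtime reduction term is included for generality, even though S=0x63 has no high bit and needs no reduction):

```shell
#!/usr/bin/env bash
# Verify the T0[0] byte math: bytes (2*S, S, S, 3*S) for S[0]=0x63,
# then the little-endian vs big-endian u32 readings of those bytes.
s=$(( 0x63 ))
two_s=$(( ((s << 1) ^ ((s >> 7) * 0x1B)) & 0xFF ))  # GF(2^8) xtime
three_s=$(( two_s ^ s ))                             # 3*S = 2*S xor S
printf 'bytes: %02X %02X %02X %02X\n' "$two_s" "$s" "$s" "$three_s"
le=$(( (three_s << 24) | (s << 16) | (s << 8) | two_s ))
be=$(( (two_s << 24) | (s << 16) | (s << 8) | three_s ))
printf 'LE read: 0x%08x\n' "$le"    # the value the dump prints
printf 'BE read: 0x%08x\n' "$be"    # the textbook T0[0] form
```

This reproduces both values from the commit: the bytes are C6 63 63 A5, the LE read is 0xa56363c6, and the BE read is 0xc66363a5.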
…failure modes
Three distinct symptoms all trace back to the same root cause (AMD
host that also has CUDA Toolkit headers → build.rs picked
XCHPLOT2_BUILD_CUDA=ON before amd_gpu_present() landed in fe726fe):
- "0 usable GPU device(s)" with --devices N
- "CUB memcpy keys_out: invalid argument" mid-pipeline
- "T1 match produced 0 entries" on RDNA1 (separate root cause —
  gfx1013 spoof producing no-op stubs — but same family of
  invisible-failure symptom that benefits from being on a
  search-indexable troubleshooting page)
Section is verbatim-symptom-first so users can grep their stderr and
land on the fix without having to read the prose around it.
Also mentions ACPP_VISIBILITY_MASK=hip;omp for the cosmetic
CUDA-backend loader warning that AdaptiveCpp emits when built with
CUDA support on a host without libcudart.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…catch
Builds past 4394c66 surface the BUILD_CUDA-vs-non-CUDA-device mismatch
at queue construction with a clear "selftest landed on a non-CUDA
device" exception, not the bare CUB error 30 seconds in. Worth saying
explicitly so users grepping the README know which symptom to expect
on a recent build.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirrors detect_intel_gpu()'s sysfs PCI vendor-ID approach (0x1002 for
AMD) so amd_gpu_present() works even when rocminfo isn't on $PATH at
build time. Reproduces against a Radeon Pro W5700 host where the
reporter has rocminfo installed (works at runtime via AdaptiveCpp's
HIP backend) but the cargo install shell didn't have /opt/rocm/bin on
PATH — autodetect missed AMD, fell through to the "nvcc present → CI
fallback" arm, BUILD_CUDA flipped ON, the streaming pipeline tried to
dispatch CUB sort against the W5700 and the new selftest at 4394c66
caught it loudly.
The sysfs path needs no user-space tools — only readable
/sys/class/drm/card*/device/vendor, which is true on every Linux host
with the amdgpu / radeon kernel module loaded. Robust against:
- rocminfo not on PATH (this case)
- rocminfo on PATH but failing because /dev/kfd isn't accessible to
the build user (cargo install via systemd / chroot / different uid)
- ROCm not installed yet but the kernel module is loaded (e.g. on a
fresh distro install where the user is mid-setup)
Doesn't replace rocminfo — that's still the primary signal because it
tells us the gfx target string we'd compile for. Sysfs only answers
"is there an AMD GPU at all", which is exactly what amd_gpu_present()
needs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
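The sysfs probe is small enough to sketch in shell (build.rs does the Rust equivalent; the function name and the overridable root parameter here are illustrative, added only so the sketch can be exercised against a fake tree):

```shell
#!/usr/bin/env bash
# Illustrative sysfs AMD-GPU probe: scan drm devices for PCI vendor
# ID 0x1002 (AMD/ATI). No user-space tools needed — only readable
# /sys/class/drm/card*/device/vendor files.
amd_gpu_present_sysfs() {
  local root="${1:-/sys/class/drm}" f v
  for f in "$root"/card*/device/vendor; do
    [[ -r "$f" ]] || continue
    read -r v < "$f"
    [[ "$v" == "0x1002" ]] && return 0
  done
  return 1
}

# Exercise against a fake sysfs tree so the demo is host-independent.
tmp=$(mktemp -d)
mkdir -p "$tmp/card0/device" "$tmp/card1/device"
echo 0x10de > "$tmp/card0/device/vendor"   # NVIDIA
echo 0x1002 > "$tmp/card1/device/vendor"   # AMD
amd_gpu_present_sysfs "$tmp" && echo "AMD GPU present"
rm -rf "$tmp"
```

As the commit notes, this only answers "is there an AMD GPU at all"; rocminfo stays the primary signal because it also yields the gfx target string.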
… AMD/Intel concurrently
Previously SortCuda.cu/SortSyclCub.cpp and SortSycl.cpp were mutually
exclusive at build time: BUILD_CUDA=ON gave CUB-only, BUILD_CUDA=OFF
gave SYCL-only. A hybrid host (NVIDIA + AMD on the same box) had to
pick one, hiding the other from --devices N (and hiding it from
--devices all entirely).
Reorganize the sort entry points into:
- launch_sort_*_cub (SortSyclCub.cpp, BUILD_CUDA=ON only)
- launch_sort_*_sycl (SortSycl.cpp, always built)
- launch_sort_* (SortDispatch.cpp, always built; picks
by q.get_device().get_backend() at
runtime — sycl::backend::cuda → _cub,
else → _sycl)
CMake now always compiles SortSycl.cpp + SortDispatch.cpp; SortSyclCub.cpp
is added on top when BUILD_CUDA=ON. The CUB branch in the dispatcher
is gated by XCHPLOT2_HAVE_CUB so AMD-only / Intel-only / CPU builds
compile it out — the dispatcher reduces to a single tail call into
SortSycl on those builds.
End-to-end on the dev box (NVIDIA RTX 4090 + AdaptiveCpp 25.10 SSCP
generic JIT, BUILD_CUDA=ON): sycl_sort_parity all-PASS at every count
(16 / 16k / 262k / 1M) for both pairs and keys, perf within noise of
the pre-refactor CUB-only path. AdaptiveCpp's SSCP backend reports
sycl::backend::cuda for NVIDIA devices, so the dispatcher routes to
CUB as expected.
Sets up the next two cleanups: usable_gpu_devices() can stop filtering
non-CUDA backends (the binary handles them now) and the BUILD_CUDA-vs-
device-mismatch selftest catch becomes redundant. Done in follow-up
commits.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Prints id, name, backend, VRAM, compute-unit count, and which sort
path the runtime dispatcher will route a worker on each device to
(CUB on cuda-backend queues when this build links CUB, SortSycl
otherwise). The printed `[N]` index is the same value `--devices N`
in `plot` / `batch` accepts.
Example output on a single-NVIDIA dev box:
Visible GPU devices (1):
[0] NVIDIA GeForce RTX 4090 backend=cuda vram=24076 MB CUs=128 sort:CUB
Use `--devices N` (id) in `plot` / `batch` to pick a specific
device, or `--devices all` for one worker per device.
Implementation split across two TUs to keep the SYCL include out of
cli.cpp:
- SyclDeviceList.hpp: plain-types declaration (struct GpuDeviceInfo,
list_gpu_devices()). Includable from any TU.
- SyclDeviceList.cpp: queries via SyclBackend.hpp; compiled by acpp
via add_sycl_to_target.
Direct inclusion of SyclBackend.hpp into cli.cpp triggered a
-Werror=narrowing in AdaptiveCpp's libkernel/host/builtins.hpp under
g++; the split keeps cli.cpp SYCL-free.
The opencl backend case in the switch was dropped — AdaptiveCpp's
hipsycl::rt::backend_id enum doesn't expose it. cuda / hip /
level_zero cover real-world deployments; everything else falls into
the "?" default.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…Us-only
Token semantics:
all   → every visible GPU + the CPU worker (was: GPUs only)
gpu   → every visible GPU (new — was implicit in `all`)
cpu   → CPU worker only (unchanged)
0,2,3 → explicit GPU ids (unchanged)
Reads more naturally — "all" should mean everything; "gpu" gives the
old all-GPUs-no-CPU behavior. Existing scripts using `--devices all`
gain a CPU worker (1-2 orders of magnitude slower than a GPU, so it
usually finishes last but doesn't block the GPU workers).
print_usage, the devices subcommand hint, and README examples all
updated to reflect the new naming. Tested on dev box (NVIDIA + CPU).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…f idling
Multi-device used to pre-partition entries round-robin: with 10 plots
and a GPU + CPU host, the GPU got plots [0,2,4,6,8] and the CPU got
[1,3,5,7,9]. The GPU finished its share in ~50 s and then sat idle for
~25 minutes while the CPU plodded through its half. End-to-end batch
wall time was bounded by the CPU.
Convert run_batch_slice's inner loop to optionally pull plot indices
from a shared atomic counter instead of iterating its own vector.
Multi-device passes a single shared `next_idx` to every worker;
whichever worker finishes its current plot first grabs the next one.
So the GPU keeps pulling work for as long as plots remain, and the CPU
only handles whatever it can finish in the same wall time.
Per-worker pinned-buffer slot rotation is decoupled from the global
plot index — peer workers each own their own GpuBufferPool, so the
slot must come from a per-worker `local_count`, not the (now-shared)
plot index.
Single-device path unchanged (shared_idx defaults to nullptr →
original sequential iteration). Verbose messages drop the misleading
"%zu/%zu" denominator — with dynamic dispatch the worker doesn't know
the batch's total or its own share.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bumps [chia](https://github.com/Chia-Network/chia_rs) from 0.42.0 to 0.42.1.
- [Release notes](https://github.com/Chia-Network/chia_rs/releases)
- [Commits](Chia-Network/chia_rs@0.42.0...0.42.1)
---
updated-dependencies:
- dependency-name: chia
  dependency-version: 0.42.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com>
Bumps chia from 0.42.0 to 0.42.1.
Release notes
Sourced from chia's releases.
Commits
- 746c688 Merge pull request #1426 from Chia-Network/bump-0.42.1
- 03e02c0 bump version to 0.42.1
- df18c3e Merge pull request #1425 from Chia-Network/bump-clvmrs
- a23dd08 harmonize versions of thiserror, p256, k256, rand, rand_chacha with clvm_rs. ...
- 695643c bump clvmr to 0.17.7
- a256e9d Merge pull request #1422 from Chia-Network/check-time-lock
- 1c51179 python test cases
- 787796b extend tests for check_time_lock()
- 1a00644 simplify check_time_lock() tests
- 0c1c6a6 [CHIA-3854] Alternative: pure storage cost model (12000/0) for generator iden...
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting
@dependabot rebase.
Dependabot commands and options
You can trigger Dependabot actions by commenting on this PR:
- @dependabot rebase will rebase this PR
- @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
- @dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
- @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)