
[AMD GPU] Add Windows & Linux ROCm support and Linux MIGraphX support #1188

Open
Looong01 wants to merge 30 commits into lightvector:master from Looong01:AMD_GPU

Conversation

@Looong01

All tests passed; we can merge it into the main branch! @lightvector

Both the Windows and Linux binary releases have been published here: https://github.com/Looong01/KataGo-Multi-backends/releases

Background

This PR summarizes all commits by Looong01 on the AMD_GPU branch from 2025-07-28 to 2026-03-16 (23 commits total: 18 non-merge + 5 merge), focused on introducing and refining ROCm backend support in KataGo, plus the new MIGraphX backend added on the MIGraphX branch.

Key Changes — ROCm Backend

  • Added core ROCm backend implementation and utility files: rocmbackend.cpp, rocmhelpers.hip, rocmutils.*, rocmincludes.h, rocmerrorcheck.h.
  • Integrated USE_ROCM_BACKEND into startup and config flow (setup/benchmark/gtpconfig) for proper backend detection and config generation.
  • Expanded ROCm build logic in CMakeLists.txt: HIP compiler setup, architecture/FP16 detection, ROCm package lookup, and fallback linking.
  • Added Windows ROCm build support (HIP_PATH/ROCM_PATH, clang toolchain handling, Windows library search paths).
  • Iteratively improved rocmbackend.cpp with ConvLayer method updates, performance tuning, and bug fixes.
  • Introduced and then removed experimental rocmbackend_new.cpp after merging validated changes into the main backend path.
  • Updated docs and sample configs (README / Compiling / cpp/configs/*) with ROCm instructions and rocmDeviceToUse*, rocmUseFP16 examples.
  • Regularly merged upstream lightvector:master to reduce branch drift.

Critical Bug Fix: ConvLayer accumulate (residual skip connections)

  • Root cause: miopenConvolutionForwardImmediate does not support alpha/beta parameters (unlike cuDNN's cudnnConvolutionForward). The original code set beta = accumulate ? 1.0 : 0.0 but this value was never passed to the MIOpen API, causing all residual skip connections to be silently dropped — the neural network output was effectively garbage.
  • Fix: When accumulate=true, save the output buffer (trunk) to a pre-allocated accumBuf via hipMemcpyAsync (device-to-device), run the convolution (which overwrites the output buffer), then add the saved residual back using a new customCudaAddTensorsInplace GPU kernel (see the sketch after this list). All operations stay in VRAM with zero CPU-side data transfer.
  • New kernels added in rocmhelpers.hip / rocmhelpers.h:
    • customCudaAddTensorsInplace(float*, const float*, int) 
    • customCudaAddTensorsInplace(half*, const half*, int)
  • Performance optimization: accumBuf is pre-allocated once per ConvLayer at construction time (sized for maxBatchSize), avoiding per-inference hipMalloc/hipFree overhead.
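A minimal sketch of the fixed flow, assuming simplified buffer and descriptor names (the actual code in rocmbackend.cpp differs in detail):

```cpp
// Sketch only: all pointers are device memory, error checks omitted.
if(accumulate) {
  // Save the residual trunk currently in outputBuf before MIOpen
  // overwrites it -- a device-to-device copy, no CPU round trip.
  hipMemcpyAsync(accumBuf, outputBuf, (size_t)numElts * sizeof(float),
                 hipMemcpyDeviceToDevice, stream);
}
// Immediate-mode forward offers no alpha/beta blending, so the previous
// contents of outputBuf are destroyed here.
miopenConvolutionForwardImmediate(
  miopenHandle, filterDesc, filterBuf, inputDesc, inputBuf, convDesc,
  outputDesc, outputBuf, workspaceBuf, workspaceBytes, solutionId);
if(accumulate) {
  // outputBuf[i] += accumBuf[i], using the new kernel from rocmhelpers.hip.
  customCudaAddTensorsInplace(outputBuf, accumBuf, numElts);
}
```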

Secondary Fix: Algorithm enumeration buffer overflow

  • miopenConvolutionForwardGetSolutionCount reports the number of available solutions by overwriting its output parameter. The original code wrote results into a fixed stack array miopenConvSolution_t solutions[2*requestedAlgoCount], which the reported count could exceed, overflowing the buffer. Replaced with std::vector<miopenConvSolution_t> solutions(availableAlgoCount) for safe dynamic sizing, as sketched below.
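A sketch of the safe enumeration, with descriptor names simplified:

```cpp
#include <vector>

// MIOpen reports how many solutions exist for this convolution.
size_t availableAlgoCount = 0;
miopenConvolutionForwardGetSolutionCount(
  miopenHandle, filterDesc, inputDesc, convDesc, outputDesc,
  &availableAlgoCount);

// Heap-allocate from the reported count instead of using a fixed
// stack array that the count could exceed.
std::vector<miopenConvSolution_t> solutions(availableAlgoCount);
size_t returnedAlgoCount = 0;
miopenConvolutionForwardGetSolution(
  miopenHandle, filterDesc, inputDesc, convDesc, outputDesc,
  availableAlgoCount, &returnedAlgoCount, solutions.data());
```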

Windows ROCm Build — CMakeLists.txt Self-Configuration

Added full Windows ROCm build support directly into `CMakeLists.txt`.

Key Changes — MIGraphX Backend (New)

Added a complete MIGraphX graph-compiler backend (migraphxbackend.cpp, 1886 lines) as an alternative to the ROCm (MIOpen) backend. MIGraphX compiles the entire neural network into a single fused GPU program, leveraging AMD's graph-level optimizations (operator fusion, memory planning, kernel scheduling).
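As a rough illustration of the compile-once flow (the graph construction itself happens in MIGraphXGraphBuilder; buildKataGoGraph and the cache path below are placeholders, not actual PR code):

```cpp
#include <migraphx/migraphx.hpp>

// Build the whole network (trunk + heads) into one migraphx::program,
// then compile the entire graph for the GPU target in a single pass.
migraphx::program prog = buildKataGoGraph(modelDesc);  // hypothetical helper
prog.compile(migraphx::target("gpu"));

// Serialize the compiled program so later launches skip recompilation.
migraphx::save(prog, "/home/user/.katago/migraphxcache/model_batch8.mxr");
```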

Architecture

  • Graph-based inference: The full model (trunk → policy/value/ownership heads) is constructed as a migraphx::program at load time using MIGraphXGraphBuilder, compiled once, then cached as .mxr files under ~/.katago/migraphxcache/.
  • Multi-batch compilation: MIGraphX does not support dynamic batch dimensions. Instead, multiple fixed-batch programs are pre-compiled at sizes {4, 8, 16, 24, 32, 40, 64} (capped by maxBatchSize). At inference time, getBestBatchSize() selects the smallest compiled size ≥ the actual batch to minimize GPU waste (see the sketch after this list).
  • Cache system: Compiled programs are saved/loaded as .mxr files with naming format migraphx_{modelName}_{sha256}_{H}x{W}_batch{N}_fp{0|1}_nhwc{0|1}_{exact|max}.mxr. First launch compiles all batch sizes (slow); subsequent launches load from cache in seconds.
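A hypothetical sketch of the selection logic described above; the real getBestBatchSize() in migraphxbackend.cpp may differ in detail:

```cpp
#include <vector>

// compiledSizes is {4, 8, 16, 24, 32, 40, 64}, truncated at maxBatchSize.
int getBestBatchSize(int actualBatch, const std::vector<int>& compiledSizes) {
  // Pick the smallest pre-compiled batch size that still fits the
  // actual batch, so as little GPU work as possible is wasted.
  for(int b : compiledSizes)
    if(b >= actualBatch)
      return b;
  return compiledSizes.back();  // largest compiled size as a fallback
}
```

For example, an actual batch of 10 runs on the compiled batch-16 program, wasting 6 slots but avoiding a recompile.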

Neural Network Components Implemented

| Component | Status |
| --- | --- |
| Convolution (with padding/dilation/stride) | ✅ |
| BatchNorm (merged scale/bias via unsqueeze+multibroadcast) | ✅ |
| Residual Block | ✅ |
| Global Pooling Residual Block (dual-branch with gpoolToBiasMul) | ✅ |
| Nested Bottleneck Residual Block | ✅ |
| Global Pooling (3 features: mean, mean×scale, max/scale2) | ✅ |
| Policy Head (p1Conv + g1Conv→GPool→bias, p2Conv spatial, pass branch) | ✅ |
| Value Head (v1Conv→GPool→v2Mul→v3Mul) | ✅ |
| Score Value Head (shared v2→sv3Mul) | ✅ |
| Ownership Head (vOwnershipConv→flatten) | ✅ |
| Activations: ReLU, Mish, MishScale8 | ✅ |
| MatMul / FC layers (with optional bias) | ✅ |
| SGF Metadata Encoder | ❌ Disabled |

FP16 Support

  • Enabled by default (when useFP16Mode is Auto or True).
  • Input parameters remain float_type on host; a convert op inside the graph handles float→half on GPU.
  • All weights/computation use half_type; outputs are cast back to float via static_cast<float> in the visit() lambda (sketched below).
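An illustrative sketch of that host-side readback, following the MIGraphX C++ API's visit() pattern (variable names here are assumptions, not the actual migraphxbackend.cpp code):

```cpp
#include <algorithm>
#include <vector>

auto outputs = prog.eval(params);         // run the compiled program
std::vector<float> result(numOutputElts);
outputs[0].visit([&](auto view) {
  // view iterates half_type elements; cast each back to float on host.
  std::transform(view.begin(), view.end(), result.begin(),
                 [](auto x) { return static_cast<float>(x); });
});
```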

Build Integration

  • New USE_BACKEND=MIGRAPHX option in CMakeLists.txt (~60 lines of build logic).
  • Links against libmigraphx (and optionally libmigraphx_gpu) from ROCm.
  • Registered under the "mgx" backend prefix in setup.cpp, with NCHW format forced.
  • Backend identification in main.cpp: prints "Using MIGraphX backend".

Known Limitations

  • No dynamic batch: Fixed batch sizes require pre-compilation; small batches may waste GPU cycles (e.g., actual batch 3 uses compiled batch 4).
  • NCHW only: NHWC format is silently ignored.
  • SGF Metadata Encoder disabled: Potential weight shape mismatch, skipped for now.
  • No mask support in global pooling: Assumes full board at nnXLen×nnYLen.

Change Stats — ROCm

  • Commits: 23 (18 non-merge, 5 merge), plus post-PR bug fixes
  • Files touched: 21, plus 3 in the post-PR fixes (rocmbackend.cpp, rocmhelpers.hip, rocmhelpers.h)
  • Diffstat: +9372 / -4009, plus +59 / -28 in the post-PR fixes

Change Stats — MIGraphX

  • Commits: 3 (non-merge: 3)
  • Files touched: 4 (migraphxbackend.cpp, CMakeLists.txt, setup.cpp, main.cpp)
  • Diffstat: +1977 / -1 (1886 lines new backend + 91 lines build/setup integration)

Included Commits (Author: Looong01)

ROCm Backend (AMD_GPU branch, 2025-07-28 ~ 2026-03-16)

  • 1f2ae46e 2025-07-28 Add ROCm backend
  • b4555304 2025-07-28 Fix bugs
  • 8b30cb96 2025-07-31 Update
  • 570ced01 2025-08-01 Fix bugs
  • abb61240 2025-08-01 Fix bugs
  • bfb292e7 2025-08-01 All bug fixed
  • 4606424f 2025-08-01 Update
  • 1e8ea788 2025-08-02 test new method
  • c1a09cf3 2025-08-02 Update
  • 0957b88b 2025-08-02 Test finished
  • c70d841a 2025-08-02 Update docks
  • 1d05ca8d 2025-08-02 Update gitignore
  • 9d4662b7 2025-08-02 Update new method
  • d40bd509 2025-08-02 Optimize performance
  • 158d24df 2025-08-13 Update new Convlayer method
  • ec32eb19 2025-08-13 Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
  • 0bfe0a14 2025-10-04 Add new compile target
  • f5fbb336 2025-11-08 Merge branch 'lightvector:master' into master
  • 26d8c5bd 2025-11-08 Add ROCm for Windows support
  • 555d2f17 2025-12-01 Merge branch 'lightvector:master' into master
  • dbc7cfa4 2026-02-22 Merge branch 'lightvector:master' into master
  • ed396b72 2026-02-28 Fix bugs
  • ccec62c5 2026-03-16 Merge branch 'lightvector:master' into master
  • xxxxxxxx 2026-04-19 Fix critical ConvLayer accumulate bug & algorithm buffer overflow

MIGraphX Backend (MIGraphX branch, 2026-02-27 ~ 2026-04-19)

  • c511c338 2026-02-27 Add MIGraphX support
  • 00cb6881 2026-04-19 Fix bugs (MIGraphX: 5 structural bugs, GELU→MishScale8, NHWC→NCHW, dimension mismatches)
  • b1da0e06 2026-04-19 Optimize performance (FP16 default, multi-batch compilation, cache per batch size)

@Looong01
Author

Windows ROCm:

[screenshot]

Linux ROCm:

[screenshot]

Linux MIGraphX:

[screenshot]

@Looong01
Author

[screenshot]

@lightvector
Owner

Thanks, I'll look at this soon.

@Looong01
Author

> Thanks, I'll look at this soon.

Thanks!
