
[AMD GPU] Add Windows & Linux ROCm support and Linux MIGraphX support #1188

Open
Looong01 wants to merge 30 commits into lightvector:master from Looong01:AMD_GPU

Conversation

@Looong01

All tests passed; we can merge it into the main branch! @lightvector

Both the Windows and Linux binary releases have been published here: https://github.com/Looong01/KataGo-Multi-backends/releases

Background

This PR summarizes all commits by Looong01 on the AMD_GPU branch from 2025-07-28 to 2026-03-16 (23 commits total: 18 non-merge + 5 merge), focused on introducing and refining ROCm backend support in KataGo, plus the new MIGraphX backend added on the MIGraphX branch.

Key Changes — ROCm Backend

  • Added core ROCm backend implementation and utility files: rocmbackend.cpp, rocmhelpers.hip, rocmutils.*, rocmincludes.h, rocmerrorcheck.h.
  • Integrated USE_ROCM_BACKEND into startup and config flow (setup/benchmark/gtpconfig) for proper backend detection and config generation.
  • Expanded ROCm build logic in CMakeLists.txt: HIP compiler setup, architecture/FP16 detection, ROCm package lookup, and fallback linking.
  • Added Windows ROCm build support (HIP_PATH/ROCM_PATH, clang toolchain handling, Windows library search paths).
  • Iteratively improved rocmbackend.cpp with ConvLayer method updates, performance tuning, and bug fixes.
  • Introduced and then removed experimental rocmbackend_new.cpp after merging validated changes into the main backend path.
  • Updated docs and sample configs (README / Compiling / cpp/configs/*) with ROCm instructions and rocmDeviceToUse*, rocmUseFP16 examples.
  • Regularly merged upstream lightvector:master to reduce branch drift.

Critical Bug Fix: ConvLayer accumulate (residual skip connections)

  • Root cause: miopenConvolutionForwardImmediate does not support alpha/beta parameters (unlike cuDNN's cudnnConvolutionForward). The original code set beta = accumulate ? 1.0 : 0.0 but this value was never passed to the MIOpen API, causing all residual skip connections to be silently dropped — the neural network output was effectively garbage.
  • Fix: When accumulate=true, save the output buffer (trunk) to a pre-allocated accumBuf via hipMemcpyAsync (device-to-device), run the convolution (which overwrites the output buffer), then add the saved residual back using a new customCudaAddTensorsInplace GPU kernel (see the sketch after this list). All operations stay in VRAM with zero CPU-side data transfer.
  • New kernels added in rocmhelpers.hip / rocmhelpers.h:
    • customCudaAddTensorsInplace(float*, const float*, int) 
    • customCudaAddTensorsInplace(half*, const half*, int)
  • Performance optimization: accumBuf is pre-allocated once per ConvLayer at construction time (sized for maxBatchSize), avoiding per-inference hipMalloc/hipFree overhead.
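A minimal sketch of the fixed flow, assuming simplified buffer and descriptor names (the actual code in rocmbackend.cpp differs in detail):

```cpp
// Sketch only: all pointers are device memory, error checks omitted.
if(accumulate) {
  // Save the residual trunk currently in outputBuf before MIOpen
  // overwrites it -- a device-to-device copy, no CPU round trip.
  hipMemcpyAsync(accumBuf, outputBuf, (size_t)numElts * sizeof(float),
                 hipMemcpyDeviceToDevice, stream);
}
// Immediate-mode forward offers no alpha/beta blending, so the previous
// contents of outputBuf are destroyed here.
miopenConvolutionForwardImmediate(
  miopenHandle, filterDesc, filterBuf, inputDesc, inputBuf, convDesc,
  outputDesc, outputBuf, workspaceBuf, workspaceBytes, solutionId);
if(accumulate) {
  // outputBuf[i] += accumBuf[i], using the new kernel from rocmhelpers.hip.
  customCudaAddTensorsInplace(outputBuf, accumBuf, numElts);
}
```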

Secondary Fix: Algorithm enumeration buffer overflow

  • miopenConvolutionForwardGetSolutionCount reports the number of available solutions by overwriting its output parameter. The original code wrote results into a fixed stack array miopenConvSolution_t solutions[2*requestedAlgoCount], which the reported count could exceed, overflowing the buffer. Replaced with std::vector<miopenConvSolution_t> solutions(availableAlgoCount) for safe dynamic sizing, as sketched below.
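A sketch of the safe enumeration, with descriptor names simplified:

```cpp
#include <vector>

// MIOpen reports how many solutions exist for this convolution.
size_t availableAlgoCount = 0;
miopenConvolutionForwardGetSolutionCount(
  miopenHandle, filterDesc, inputDesc, convDesc, outputDesc,
  &availableAlgoCount);

// Heap-allocate from the reported count instead of using a fixed
// stack array that the count could exceed.
std::vector<miopenConvSolution_t> solutions(availableAlgoCount);
size_t returnedAlgoCount = 0;
miopenConvolutionForwardGetSolution(
  miopenHandle, filterDesc, inputDesc, convDesc, outputDesc,
  availableAlgoCount, &returnedAlgoCount, solutions.data());
```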

Windows ROCm Build — CMakeLists.txt Self-Configuration

Added full Windows ROCm build support directly into `CMakeLists.txt`.

Key Changes — MIGraphX Backend (New)

Added a complete MIGraphX graph-compiler backend (migraphxbackend.cpp, 1886 lines) as an alternative to the ROCm (MIOpen) backend. MIGraphX compiles the entire neural network into a single fused GPU program, leveraging AMD's graph-level optimizations (operator fusion, memory planning, kernel scheduling).
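As a rough illustration of the compile-once flow (the graph construction itself happens in MIGraphXGraphBuilder; buildKataGoGraph and the cache path below are placeholders, not actual PR code):

```cpp
#include <migraphx/migraphx.hpp>

// Build the whole network (trunk + heads) into one migraphx::program,
// then compile the entire graph for the GPU target in a single pass.
migraphx::program prog = buildKataGoGraph(modelDesc);  // hypothetical helper
prog.compile(migraphx::target("gpu"));

// Serialize the compiled program so later launches skip recompilation.
migraphx::save(prog, "/home/user/.katago/migraphxcache/model_batch8.mxr");
```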

Architecture

  • Graph-based inference: The full model (trunk → policy/value/ownership heads) is constructed as a migraphx::program at load time using MIGraphXGraphBuilder, compiled once, then cached as .mxr files under ~/.katago/migraphxcache/.
  • Multi-batch compilation: MIGraphX does not support dynamic batch dimensions. Instead, multiple fixed-batch programs are pre-compiled at sizes {4, 8, 16, 24, 32, 40, 64} (capped by maxBatchSize). At inference time, getBestBatchSize() selects the smallest compiled size ≥ the actual batch to minimize GPU waste (see the sketch after this list).
  • Cache system: Compiled programs are saved/loaded as .mxr files with naming format migraphx_{modelName}_{sha256}_{H}x{W}_batch{N}_fp{0|1}_nhwc{0|1}_{exact|max}.mxr. First launch compiles all batch sizes (slow); subsequent launches load from cache in seconds.
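A hypothetical sketch of the selection logic described above; the real getBestBatchSize() in migraphxbackend.cpp may differ in detail:

```cpp
#include <vector>

// compiledSizes is {4, 8, 16, 24, 32, 40, 64}, truncated at maxBatchSize.
int getBestBatchSize(int actualBatch, const std::vector<int>& compiledSizes) {
  // Pick the smallest pre-compiled batch size that still fits the
  // actual batch, so as little GPU work as possible is wasted.
  for(int b : compiledSizes)
    if(b >= actualBatch)
      return b;
  return compiledSizes.back();  // largest compiled size as a fallback
}
```

For example, an actual batch of 10 runs on the compiled batch-16 program, wasting 6 slots but avoiding a recompile.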

Neural Network Components Implemented

| Component | Status |
| --- | --- |
| Convolution (with padding/dilation/stride) | ✅ |
| BatchNorm (merged scale/bias via unsqueeze+multibroadcast) | ✅ |
| Residual Block | ✅ |
| Global Pooling Residual Block (dual-branch with gpoolToBiasMul) | ✅ |
| Nested Bottleneck Residual Block | ✅ |
| Global Pooling (3 features: mean, mean×scale, max/scale2) | ✅ |
| Policy Head (p1Conv + g1Conv→GPool→bias, p2Conv spatial, pass branch) | ✅ |
| Value Head (v1Conv→GPool→v2Mul→v3Mul) | ✅ |
| Score Value Head (shared v2→sv3Mul) | ✅ |
| Ownership Head (vOwnershipConv→flatten) | ✅ |
| Activations: ReLU, Mish, MishScale8 | ✅ |
| MatMul / FC layers (with optional bias) | ✅ |
| SGF Metadata Encoder | ❌ Disabled |

FP16 Support

  • Enabled by default (when useFP16Mode is Auto or True).
  • Input parameters remain float_type on host; a convert op inside the graph handles float→half on GPU.
  • All weights/computation use half_type; outputs are cast back to float via static_cast<float> in the visit() lambda (sketched below).
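An illustrative sketch of that host-side readback, following the MIGraphX C++ API's visit() pattern (variable names here are assumptions, not the actual migraphxbackend.cpp code):

```cpp
#include <algorithm>
#include <vector>

auto outputs = prog.eval(params);         // run the compiled program
std::vector<float> result(numOutputElts);
outputs[0].visit([&](auto view) {
  // view iterates half_type elements; cast each back to float on host.
  std::transform(view.begin(), view.end(), result.begin(),
                 [](auto x) { return static_cast<float>(x); });
});
```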

Build Integration

  • New USE_BACKEND=MIGRAPHX option in CMakeLists.txt (~60 lines of build logic).
  • Links against libmigraphx (and optionally libmigraphx_gpu) from ROCm.
  • Registered under the "mgx" backend prefix in setup.cpp, with NCHW format forced.
  • Backend identification in main.cpp: prints "Using MIGraphX backend".

Known Limitations

  • No dynamic batch: Fixed batch sizes require pre-compilation; small batches may waste GPU cycles (e.g., actual batch 3 uses compiled batch 4).
  • NCHW only: NHWC format is silently ignored.
  • SGF Metadata Encoder disabled: Potential weight shape mismatch, skipped for now.
  • No mask support in global pooling: Assumes full board at nnXLen×nnYLen.

Change Stats — ROCm

  • Commits: 23 (18 non-merge, 5 merge), plus post-PR bug fixes
  • Files touched: 21, plus 3 in the post-PR fixes (rocmbackend.cpp, rocmhelpers.hip, rocmhelpers.h)
  • Diffstat: +9372 / -4009, plus +59 / -28 in the post-PR fixes

Change Stats — MIGraphX

  • Commits: 3 (non-merge: 3)
  • Files touched: 4 (migraphxbackend.cpp, CMakeLists.txt, setup.cpp, main.cpp)
  • Diffstat: +1977 / -1 (1886 lines new backend + 91 lines build/setup integration)

Included Commits (Author: Looong01)

ROCm Backend (AMD_GPU branch, 2025-07-28 ~ 2026-03-16)

  • 1f2ae46e 2025-07-28 Add ROCm backend
  • b4555304 2025-07-28 Fix bugs
  • 8b30cb96 2025-07-31 Update
  • 570ced01 2025-08-01 Fix bugs
  • abb61240 2025-08-01 Fix bugs
  • bfb292e7 2025-08-01 All bug fixed
  • 4606424f 2025-08-01 Update
  • 1e8ea788 2025-08-02 test new method
  • c1a09cf3 2025-08-02 Update
  • 0957b88b 2025-08-02 Test finished
  • c70d841a 2025-08-02 Update docks
  • 1d05ca8d 2025-08-02 Update gitignore
  • 9d4662b7 2025-08-02 Update new method
  • d40bd509 2025-08-02 Optimize performance
  • 158d24df 2025-08-13 Update new Convlayer method
  • ec32eb19 2025-08-13 Merge branch 'master' of https://github.com/Looong01/KataGo-ROCm
  • 0bfe0a14 2025-10-04 Add new compile target
  • f5fbb336 2025-11-08 Merge branch 'lightvector:master' into master
  • 26d8c5bd 2025-11-08 Add ROCm for Windows support
  • 555d2f17 2025-12-01 Merge branch 'lightvector:master' into master
  • dbc7cfa4 2026-02-22 Merge branch 'lightvector:master' into master
  • ed396b72 2026-02-28 Fix bugs
  • ccec62c5 2026-03-16 Merge branch 'lightvector:master' into master
  • xxxxxxxx 2026-04-19 Fix critical ConvLayer accumulate bug & algorithm buffer overflow

MIGraphX Backend (MIGraphX branch, 2026-02-27 ~ 2026-04-19)

  • c511c338 2026-02-27 Add MIGraphX support
  • 00cb6881 2026-04-19 Fix bugs (MIGraphX: 5 structural bugs, GELU→MishScale8, NHWC→NCHW, dimension mismatches)
  • b1da0e06 2026-04-19 Optimize performance (FP16 default, multi-batch compilation, cache per batch size)

@Looong01
Author

Windows ROCm:

[screenshot]

Linux ROCm:

[screenshot]

Linux MIGraphX:

[screenshot]

@Looong01
Author

[screenshot]

@lightvector
Owner

Thanks, I'll look at this soon.

@Looong01
Author

> Thanks, I'll look at this soon.

Thanks!
