hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot by hwirys · Pull Request #342 · vortexgpgpu/vortex

hwirys · 2026-04-30T23:42:58Z

Why this matters: the U250 path on master cannot boot

On Vortex master, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon. ap_start is asserted, vx_busy goes high and stays high forever. The CTL register reads back 0x1 indefinitely, and the host eventually times out. This is the failure pattern reported in #262, #263, and #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, xrt::bo allocation, connectivity.sp plumbing) but did not address the BAR mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute.

This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.

Root cause

Vortex hard-codes some compile-time absolute byte addresses:

| Symbol | Value (XLEN=64) | Purpose |
| --- | --- | --- |
| `STARTUP_ADDR` | `0x1_8000_0000` | Where the kernel binary's text/data lives, fetched by Vortex on `ap_start` |
| `STACK_BASE_ADDR` | `0x1_FFFF_0000` | Per-thread stack top |

XRT, on the other hand, allocates each xrt::bo at a virtual address chosen by the platform. On U250, bank 0 lands at 0x40_0000_0000 (256 GiB into the device address space), far above both compile-time addresses. All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode, so vx_busy stays at 1 forever.

The pre-existing PLATFORM_MEMORY_OFFSET macro is conceptually the right escape hatch — it adds a constant offset to every outgoing AXI address — but on master it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.
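The mismatch can be checked with plain arithmetic. The Python sketch below is a model, not part of the patch: it takes the two compile-time addresses and the DDR[0] base from this description and assumes a 16 GiB bank window (matching the PLATFORM_MEMORY_ADDR_WIDTH=34 setting used for U250 in this PR). `in_ddr0` and `classify` are hypothetical helper names.

```python
GIB = 1 << 30

# Compile-time addresses quoted in this PR description (XLEN=64).
STARTUP_ADDR = 0x1_8000_0000
STACK_BASE_ADDR = 0x1_FFFF_0000

# XRT base of DDR[0] on the U250 shell named in this PR.
DDR0_BASE = 0x40_0000_0000
# Assumption: one bank spans 2**34 bytes (16 GiB), from ADDR_WIDTH=34.
DDR0_SIZE = 1 << 34


def in_ddr0(addr: int) -> bool:
    """Return True if addr falls inside the assumed DDR[0] window."""
    return DDR0_BASE <= addr < DDR0_BASE + DDR0_SIZE


def classify() -> dict:
    """Map each symbol to (raw address decodes, offset address decodes)."""
    symbols = {"STARTUP_ADDR": STARTUP_ADDR, "STACK_BASE_ADDR": STACK_BASE_ADDR}
    return {
        name: (in_ddr0(addr), in_ddr0(addr + DDR0_BASE))
        for name, addr in symbols.items()
    }


for name, (raw_ok, shifted_ok) in classify().items():
    print(f"{name}: raw in window={raw_ok}, with offset in window={shifted_ok}")
```

Under these assumptions, neither raw address lands in the bank window, while both do once the bank-0 base is added.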

Fix

Three small files:

`hw/rtl/afu/xrt/vortex_afu.vh` (+18 / −0)

Introduce per-bank macros. Each defaults to the legacy global PLATFORM_MEMORY_OFFSET so existing single-channel platforms are unchanged:

`ifndef PLATFORM_MEMORY_OFFSET_0
`define PLATFORM_MEMORY_OFFSET_0 `PLATFORM_MEMORY_OFFSET
`endif
// ... _1, _2, _3 likewise
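
The fallback behaviour of these macros can be modelled in a few lines of Python. This is an illustrative sketch only: `resolve_offsets` is a hypothetical helper, the dictionary stands in for `-D` defines, and treating an undefined global PLATFORM_MEMORY_OFFSET as 0 is an assumption of the model rather than something stated in this PR.

```python
def resolve_offsets(defines: dict, num_banks: int = 4) -> list:
    """Mimic the `ifndef fallback: use the per-bank define when present,
    otherwise fall back to the global PLATFORM_MEMORY_OFFSET."""
    # Assumption: an undefined global offset behaves like 0 in this model.
    global_offset = defines.get("PLATFORM_MEMORY_OFFSET", 0)
    return [
        defines.get(f"PLATFORM_MEMORY_OFFSET_{i}", global_offset)
        for i in range(num_banks)
    ]


# Legacy platform: only the global offset is defined, so every bank gets it.
legacy = resolve_offsets({"PLATFORM_MEMORY_OFFSET": 0x1000})

# U250 entry from this PR: only bank 0 is overridden.
u250 = resolve_offsets({"PLATFORM_MEMORY_OFFSET_0": 0x40_0000_0000})

print([hex(x) for x in legacy])
print([hex(x) for x in u250])
```

The legacy case shows why existing platforms see no change: with no per-bank define present, all four entries collapse to the same global value.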

`hw/rtl/afu/xrt/VX_afu_wrap.sv` (+14 / −2)

Build a 4-entry array from those macros and add the bank-i offset to each m_axi_mem_<i> outgoing AW/AR address (replaces the single global add):

wire [C_M_AXI_MEM_ADDR_WIDTH-1:0] platform_memory_offsets [4];
assign platform_memory_offsets[0] = `PLATFORM_MEMORY_OFFSET_0;
// ... _1, _2, _3
for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_addressing
    assign m_axi_mem_awaddr_a[i] = m_axi_mem_awaddr_u[i] + platform_memory_offsets[i];
    assign m_axi_mem_araddr_a[i] = m_axi_mem_araddr_u[i] + platform_memory_offsets[i];
end
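
As a rough software model of the generate loop above, the Python sketch below adds each bank's offset to that bank's address and truncates the sum to the bus width, as a fixed-width wire would. The default width of 64 bits is an assumption, since the actual value of C_M_AXI_MEM_ADDR_WIDTH is not given here, and `offset_addresses` is a hypothetical name.

```python
def offset_addresses(addrs, offsets, addr_width=64):
    """Add the bank-i offset to the bank-i address, wrapping at addr_width bits."""
    if len(addrs) > len(offsets):
        raise ValueError("need one offset per bank")
    mask = (1 << addr_width) - 1
    return [(addr + off) & mask for addr, off in zip(addrs, offsets)]


# Bank 0 only, as in the U250 platforms.mk entry of this PR.
offsets = [0x40_0000_0000, 0, 0, 0]

# STARTUP_ADDR as issued by Vortex on bank 0.
shifted = offset_addresses([0x1_8000_0000], offsets)
print(hex(shifted[0]))

# Same add on a hypothetical 34-bit bus: the offset's high bits are lost.
truncated = offset_addresses([0x1_8000_0000], offsets, addr_width=34)
print(hex(truncated[0]))
```

In this model, a bus narrower than the offset silently drops the offset's high bits, so the address bus has to be wide enough to hold the shifted address.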

`hw/syn/xilinx/xrt/platforms.mk` (+6 / −2)

Switch the U250 entry to single-channel with the offset hard-coded so the build is deployable out of the box:

else ifneq ($(findstring xilinx_u250,$(XSA)),)
  CONFIGS += -DPLATFORM_MEMORY_NUM_BANKS=1 -DPLATFORM_MEMORY_ADDR_WIDTH=34
  VPP_FLAGS += --connectivity.sp vortex_afu_1.m_axi_mem_0:DDR[0]
  CONFIGS += -DPLATFORM_MEMORY_OFFSET_0=40'h4000000000

(0x40_0000_0000 is the empirical XRT base for DDR[0] on the xilinx_u250_gen3x16_xdma_4_1_202210_1 shell. Multi-bank operation, i.e. the full 64 GB across 4 DDR4 channels, needs the AFU to learn each bo's actual XRT VA at runtime, since each xrt::bo is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and deployable out of the box.)
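
As a quick sanity check on the offset literal, the Python sketch below formats 256 GiB as a sized Verilog hex constant and confirms that it fits in 40 bits. `verilog_hex_literal` is a hypothetical helper written for this illustration, not part of the Vortex build.

```python
def verilog_hex_literal(value: int, width: int) -> str:
    """Format a non-negative integer as a sized Verilog hex literal."""
    if value < 0 or value.bit_length() > width:
        raise ValueError(f"{value:#x} does not fit in {width} bits")
    return f"{width}'h{value:X}"


# 256 GiB, the empirical DDR[0] base quoted in this PR.
DDR0_BASE = 256 * (1 << 30)

literal = verilog_hex_literal(DDR0_BASE, 40)
print(literal)
```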

Verification on real Alveo U250

Built with this patch (default DSP FPU, NUM_CORES=2 NUM_WARPS=4 XLEN=64, 200 MHz) on real U250 silicon — XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1. The PLP load (xdma_shell_4_1) is required once per cold boot as documented in the README.

| Test category | Without this PR | With this PR |
| --- | --- | --- |
| Any kernel start | hangs at `ap_start`, `vx_busy=1` forever | boots ✅ |
| `vecadd` (n = 16…16384) | n/a (hang) | all sizes PASS |
| `dogfood` Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …) | n/a | PASS |
| regression: `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` | n/a | PASS |
| OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` | n/a | PASS |
| single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` | n/a | PASS |

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of those there is no observed regression introduced by the patch.

(dogfood Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only — that's a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link time 2 h 30 m; WNS = +0.057 ns @ 200 MHz (positive, with no impl-strategy tuning); area added by this patch ≈ 4 × 36-bit adders, which is negligible.

Backward compatibility

PLATFORM_MEMORY_OFFSET_<i> defaults to PLATFORM_MEMORY_OFFSET, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all PLATFORM_MERGED_MEMORY_INTERFACE), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.
The U250 platforms.mk entry changes from NUM_BANKS=4 (no connectivity) to NUM_BANKS=1 (DDR[0] + offset). The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched

hw/rtl/afu/xrt/vortex_afu.vh — +18 / −0
hw/rtl/afu/xrt/VX_afu_wrap.sv — +14 / −2
hw/syn/xilinx/xrt/platforms.mk — +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.

When Vortex is built for an XRT-based Xilinx platform that allocates each memory bank as a separate xrt::bo, XRT picks a different virtual address per bank (e.g. on U250 bank 0 lands at 0x40_00000000), which is far above Vortex's compile-time absolute addresses STARTUP_ADDR = 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from Vortex therefore fail to decode at the slave, vx_busy stays high forever, and the kernel never starts.

This is the underlying cause of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278): Vortex 2.3 added the U250 build path but did not address this BAR mismatch, so the produced xclbin has never actually booted on real silicon.

Fix

- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i> (i = 0..3) macros, each defaulting to the legacy global PLATFORM_MEMORY_OFFSET so HBM platforms (U280/U55C/U50) and VCK5000 single-channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array from those macros and add the bank-i offset to each outgoing m_axi_mem_<i> AW/AR address.
- platforms.mk U250: switch to single-bank (NUM_BANKS=1, DDR[0]) and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build works end-to-end out of the box.

Multi-bank (full 64 GB) deployment needs a runtime mechanism to push each bo's actual XRT VA into the AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default DSP FPU. Without this patch the kernel hangs immediately at ap_start (vx_busy stuck high, CTL register reads back 0x1 indefinitely).

With the patch the kernel boots and:

- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge` (all pass)
- `dogfood` Test0..Test20 (pass)
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf` (pass)
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct` (pass)

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational addressing logic (3 extra adds in the AFU) so area and timing impact are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.

hwirys wants to merge 1 commit into vortexgpgpu:master from hwirys:fix/u250-platform-memory-offset
