Skip to content

hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot#342

Open
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys:fix/u250-platform-memory-offset
Open

hw: per-bank PLATFORM_MEMORY_OFFSET — fix U250 vx_busy hang at boot#342
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys:fix/u250-platform-memory-offset

Conversation

@hwirys
Copy link
Copy Markdown

@hwirys hwirys commented Apr 30, 2026

Why this matters: the U250 path on master cannot boot

On Vortex master, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon. ap_start is asserted, vx_busy goes high, and stays high forever. The CTL register reads back 0x1 indefinitely; the host eventually times out. This is the failure pattern reported in #262, #263, #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, xrt::bo allocation, connectivity.sp plumbing) but did not address the BAR-mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute.

This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.


Root cause

Vortex hard-codes some compile-time absolute byte addresses:

Symbol Value (XLEN=64) Where
STARTUP_ADDR 0x1_8000_0000 Where the kernel binary's text/data lives, fetched by Vortex on ap_start
STACK_BASE_ADDR 0x1_FFFF_0000 Per-thread stack top

XRT, on the other hand, allocates each xrt::bo at a virtual address chosen by the platform. On U250, bank 0 lands at 0x40_0000_0000 (256 GiB into the device address space). All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode → vx_busy = 1 forever.

The pre-existing PLATFORM_MEMORY_OFFSET macro is conceptually the right escape hatch — it adds a constant offset to every outgoing AXI address — but on master it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.


Fix

Three small files:

hw/rtl/afu/xrt/vortex_afu.vh (+18 / −0)

Introduce per-bank macros. Each defaults to the legacy global PLATFORM_MEMORY_OFFSET so existing single-channel platforms are unchanged:

`ifndef PLATFORM_MEMORY_OFFSET_0
`define PLATFORM_MEMORY_OFFSET_0 `PLATFORM_MEMORY_OFFSET
`endif
// ... _1, _2, _3 likewise

hw/rtl/afu/xrt/VX_afu_wrap.sv (+14 / −2)

Build a 4-entry array from those macros and add the bank-i offset to each m_axi_mem_<i> outgoing AW/AR address (replaces the single global add):

wire [C_M_AXI_MEM_ADDR_WIDTH-1:0] platform_memory_offsets [4];
assign platform_memory_offsets[0] = `PLATFORM_MEMORY_OFFSET_0;
// ... _1, _2, _3
for (genvar i = 0; i < C_M_AXI_MEM_NUM_BANKS; ++i) begin : g_addressing
    assign m_axi_mem_awaddr_a[i] = m_axi_mem_awaddr_u[i] + platform_memory_offsets[i];
    assign m_axi_mem_araddr_a[i] = m_axi_mem_araddr_u[i] + platform_memory_offsets[i];
end

hw/syn/xilinx/xrt/platforms.mk (+6 / −2)

Switch the U250 entry to single-channel with the offset hard-coded so the build is deployable out of the box:

else ifneq ($(findstring xilinx_u250,$(XSA)),)
  CONFIGS += -DPLATFORM_MEMORY_NUM_BANKS=1 -DPLATFORM_MEMORY_ADDR_WIDTH=34
  VPP_FLAGS += --connectivity.sp vortex_afu_1.m_axi_mem_0:DDR[0]
  CONFIGS += -DPLATFORM_MEMORY_OFFSET_0=40'h4000000000

(0x40_0000_0000 is the empirical XRT base for DDR[0] on the
xilinx_u250_gen3x16_xdma_4_1_202210_1 shell. Multi-bank — full 64 GB across 4 DDR4 channels — needs the AFU to learn each bo's actual XRT VA at runtime, since each xrt::bo is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and out-of-the-box deployable.)


Verification on real Alveo U250

Built with this patch (default DSP FPU, NUM_CORES=2 NUM_WARPS=4 XLEN=64, 200 MHz) on real U250 silicon — XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1. The PLP load (xdma_shell_4_1) is required once per cold boot as documented in the README.

Test category Without this PR With this PR
Any kernel start hangs at ap_start, vx_busy=1 forever boots ✅
vecadd (n = 16…16384) n/a (hang) all sizes PASS
dogfood Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …) n/a PASS
regression: sgemm, dotproduct, demo, dropout, conv3, io_addr, fence, diverge n/a PASS
OpenCL: saxpy, vecadd, sgemm, sgemm2, sgemm3, stencil, sfilter, spmv, psort, oclprintf n/a PASS
single-rank MPI: mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct n/a PASS

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of those there is no observed regression introduced by the patch.

(dogfood Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only — that's a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link 2 h 30 m, WNS = +0.057 ns @ 200 MHz (positive, no impl-strategy tuning), area added by this patch ≈ 4 × 36-bit adders. Negligible.


Backward compatibility

  • PLATFORM_MEMORY_OFFSET_<i> defaults to PLATFORM_MEMORY_OFFSET, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all PLATFORM_MERGED_MEMORY_INTERFACE), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.
  • The U250 platforms.mk entry changes from NUM_BANKS=4 (no connectivity) to NUM_BANKS=1 (DDR[0] + offset). The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched

  • hw/rtl/afu/xrt/vortex_afu.vh — +18 / −0
  • hw/rtl/afu/xrt/VX_afu_wrap.sv — +14 / −2
  • hw/syn/xilinx/xrt/platforms.mk — +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.

When Vortex is built for an XRT-based Xilinx platform that allocates
each memory bank as a separate xrt::bo, XRT picks a different virtual
address per bank (e.g. on U250 bank 0 lands at 0x40_00000000) — which
is far above Vortex's compile-time absolute addresses STARTUP_ADDR
= 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from
Vortex therefore fail to decode at the slave, vx_busy stays high
forever, and the kernel never starts. This is the underlying cause
of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278) — Vortex
2.3 added the U250 build path but did not address this BAR mismatch,
so the produced xclbin has never actually booted on real silicon.

Fix
- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i>
  (i = 0..3) macros, each defaulting to the legacy global
  PLATFORM_MEMORY_OFFSET so HBM platforms (U280/U55C/U50) and
  VCK5000 single-channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array
  from those macros and add the bank-i offset to each outgoing
  m_axi_mem_<i> AW/AR address.
- platforms.mk U250: switch to single-bank (NUM_BANKS=1, DDR[0])
  and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build
  works end-to-end out of the box. Multi-bank (full 64 GB) deployment
  needs a runtime mechanism to push each bo's actual XRT VA into the
  AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell
xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default
DSP FPU. Without this patch the kernel hangs immediately at
ap_start (vx_busy stuck high, CTL register reads back 0x1
indefinitely). With the patch the kernel boots and:

- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`,
  `dropout`, `conv3`, `io_addr`, `fence`, `diverge` — all pass
- `dogfood` Test0..Test20 — pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`,
  `sfilter`, `spmv`, `psort`, `oclprintf` — pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`,
  `mpi_put_dotproduct` — pass

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational
addressing logic (3 extra adds in the AFU) so area and timing impact
are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant