hw: per-bank PLATFORM_MEMORY_OFFSET - fix U250 vx_busy hang at boot #342
Open
hwirys wants to merge 1 commit into vortexgpgpu:master
When Vortex is built for an XRT-based Xilinx platform that allocates each memory bank as a separate xrt::bo, XRT picks a different virtual address per bank (e.g. on U250 bank 0 lands at 0x40_0000_0000), which is far above Vortex's compile-time absolute addresses STARTUP_ADDR = 0x180000000 and STACK_BASE_ADDR = 0x1FFFF0000. AXI requests from Vortex therefore fail to decode at the slave, vx_busy stays high forever, and the kernel never starts.

This is the underlying cause of the long-running U250 hang reports (vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278). Vortex 2.3 added the U250 build path but did not address this BAR mismatch, so the produced xclbin has never actually booted on real silicon.

Fix
- vortex_afu.vh: introduce per-bank PLATFORM_MEMORY_OFFSET_<i> (i = 0..3) macros, each defaulting to the legacy global PLATFORM_MEMORY_OFFSET so HBM platforms (U280/U55C/U50) and VCK5000 single-channel are byte-for-byte unchanged.
- VX_afu_wrap.sv: build a 4-entry platform_memory_offsets array from those macros and add the bank-i offset to each outgoing m_axi_mem_<i> AW/AR address.
- platforms.mk U250: switch to single-bank (NUM_BANKS=1, DDR[0]) and set PLATFORM_MEMORY_OFFSET_0=40'h4000000000 so the build works end-to-end out of the box. Multi-bank (full 64 GB) deployment needs a runtime mechanism to push each bo's actual XRT VA into the AFU and will follow as a separate PR.

Verified end-to-end on real Alveo U250 (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1) at 200 MHz with the default DSP FPU. Without this patch the kernel hangs immediately at ap_start (vx_busy stuck high, CTL register reads back 0x1 indefinitely). With the patch the kernel boots and:
- regression `vecadd` (n=16..16384), `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge`: all pass
- `dogfood` Test0..Test20: pass
- OpenCL: `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf`: pass
- single-rank MPI: `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct`: pass

Final WNS = +0.057 ns at 200 MHz; the patch is purely combinational addressing logic (3 extra adds in the AFU) so area and timing impact are negligible.

Refs: vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278.
Why this matters: the U250 path on master cannot boot
On Vortex `master`, building for Alveo U250 appears to succeed (Vitis produces an xclbin), but the kernel never starts on real silicon. `ap_start` is asserted, `vx_busy` goes high, and stays high forever. The `CTL` register reads back `0x1` indefinitely; the host eventually times out. This is the failure pattern reported in #262, #263, #278.

Vortex 2.3 added the U250 build infrastructure (platform recognition, `xrt::bo` allocation, `connectivity.sp` plumbing) but did not address the BAR mismatch between Vortex's compile-time absolute addresses and where XRT actually places memory. The xclbin is produced; it just doesn't execute.

This PR is the smallest patch that makes a U250 boot. It is a logical prerequisite for any other U250 work.
Root cause
Vortex hard-codes some compile-time absolute byte addresses:
- `STARTUP_ADDR`: `0x1_8000_0000` (`ap_start`)
- `STACK_BASE_ADDR`: `0x1_FFFF_0000`

XRT, on the other hand, allocates each `xrt::bo` at a virtual address chosen by the platform. On U250, bank 0 lands at `0x40_0000_0000` (256 GiB into the device address space). All AXI traffic from Vortex therefore goes to a region the platform's AXI fabric does not decode, so `vx_busy` stays at 1 forever.

The pre-existing `PLATFORM_MEMORY_OFFSET` macro is conceptually the right escape hatch, since it adds a constant offset to every outgoing AXI address. On master, however, it is a single global offset for all banks, while real Xilinx XRT places each bank at a different virtual address. So a single offset cannot cover U200/U250 four-channel deployments either.

Fix
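The address arithmetic behind the root cause and the per-bank offset can be illustrated with a small behavioural model. This is a sketch in Python, not the RTL from this PR: the function names, the assumed default of 0 for the global `PLATFORM_MEMORY_OFFSET`, and the 16 GiB window for bank 0 (64 GB split across 4 channels) are assumptions made for illustration only.

```python
# Behavioural model only; the real logic lives in SystemVerilog (VX_afu_wrap.sv).
# All function names here are hypothetical.

STARTUP_ADDR = 0x1_8000_0000       # value quoted in the PR description
STACK_BASE_ADDR = 0x1_FFFF_0000    # value quoted in the PR description
U250_BANK0_BASE = 0x40_0000_0000   # empirical XRT base for DDR[0], per the PR
BANK_SIZE = 16 << 30               # assumption: 64 GB / 4 DDR4 channels
NUM_BANKS_MAX = 4
GLOBAL_OFFSET = 0                  # assumption: legacy PLATFORM_MEMORY_OFFSET default


def per_bank_offsets(overrides=None, global_offset=GLOBAL_OFFSET):
    """Return one offset per bank; a bank without an override
    falls back to the legacy global offset."""
    overrides = overrides or {}
    return [overrides.get(i, global_offset) for i in range(NUM_BANKS_MAX)]


def to_axi_addr(vortex_addr, bank, offsets):
    """Address driven on the bank's AW/AR channel: Vortex address plus bank offset."""
    return vortex_addr + offsets[bank]


def decodes(axi_addr, base=U250_BANK0_BASE, size=BANK_SIZE):
    """True if the address falls inside the window the fabric maps to bank 0."""
    return base <= axi_addr < base + size


legacy = per_bank_offsets()                     # no per-bank overrides
fixed = per_bank_offsets({0: U250_BANK0_BASE})  # bank 0 offset set, as for U250

for name, addr in [("STARTUP_ADDR", STARTUP_ADDR), ("STACK_BASE_ADDR", STACK_BASE_ADDR)]:
    before = to_axi_addr(addr, 0, legacy)
    after = to_axi_addr(addr, 0, fixed)
    print(f"{name}: {before:#x} decodes={decodes(before)} -> "
          f"{after:#x} decodes={decodes(after)}")
```

Under these assumptions both addresses miss the bank-0 window without the offset and land inside it once the bank-0 offset is applied, while banks without an override keep the legacy global value.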
Three small files:

`hw/rtl/afu/xrt/vortex_afu.vh` (+18 / −0)
Introduce per-bank macros. Each defaults to the legacy global `PLATFORM_MEMORY_OFFSET` so existing single-channel platforms are unchanged.

`hw/rtl/afu/xrt/VX_afu_wrap.sv` (+14 / −2)
Build a 4-entry array from those macros and add the bank-i offset to each `m_axi_mem_<i>` outgoing AW/AR address (replaces the single global add).

`hw/syn/xilinx/xrt/platforms.mk` (+6 / −2)
Switch the U250 entry to single-channel with the offset hard-coded so the build is deployable out of the box.

(`0x40_0000_0000` is the empirical XRT base for `DDR[0]` on the `xilinx_u250_gen3x16_xdma_4_1_202210_1` shell. Multi-bank, i.e. the full 64 GB across 4 DDR4 channels, needs the AFU to learn each bo's actual XRT VA at runtime, since each `xrt::bo` is placed by the runtime allocator. That follow-up will be submitted as a separate PR with a DCR-based runtime path; this PR is intentionally minimal and out-of-the-box deployable.)

Verification on real Alveo U250
Built with this patch (default DSP FPU, `NUM_CORES=2 NUM_WARPS=4 XLEN=64`, 200 MHz) on real U250 silicon: XRT 2.19.194, shell `xilinx_u250_gen3x16_xdma_4_1_202210_1`. The PLP load (`xdma_shell_4_1`) is required once per cold boot as documented in the README.

- Without the patch: hang at `ap_start`, `vx_busy=1` forever
- `vecadd` (n = 16…16384): pass
- `dogfood` Test0..Test20 (iadd, imul, idiv, fadd, fsub, fmul, fmadd, fnmadd, fdiv, fsqrt, ftoi, itof, fclamp, iclamp, …): pass
- Regression `sgemm`, `dotproduct`, `demo`, `dropout`, `conv3`, `io_addr`, `fence`, `diverge`: pass
- OpenCL `saxpy`, `vecadd`, `sgemm`, `sgemm2`, `sgemm3`, `stencil`, `sfilter`, `spmv`, `psort`, `oclprintf`: pass
- Single-rank MPI `mpi_vecadd`, `mpi_dotproduct`, `mpi_diverge`, `mpi_put_dotproduct`: pass

That spans integer arithmetic, single-precision FP, vector reductions, matrix multiply, convolution, ML-style ops (dropout), and the full OpenCL/MPI execution stacks. Across all of those there is no observed regression introduced by the patch.
(`dogfood` Test21-trig produces FP64 NaNs because the default DSP FPU is FP32-only. That is a pre-existing FPU-implementation issue, not addressed here.)

Build numbers: total xclbin link 2 h 30 m, WNS = +0.057 ns @ 200 MHz (positive, no impl-strategy tuning), area added by this patch ≈ 4 × 36-bit adders. Negligible.
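For a sense of how much margin the reported WNS leaves, the figure can be converted to an approximate maximum frequency. The sketch below assumes the usual single-clock setup-slack relation (critical path ≈ clock period − WNS); the function name is hypothetical and the result is an estimate, not a tool-reported number.

```python
def estimated_fmax_mhz(target_mhz: float, wns_ns: float) -> float:
    """Approximate Fmax from the target clock and the worst negative slack.

    Assumes a single clock domain where critical path = period - WNS.
    """
    period_ns = 1e3 / target_mhz           # 200 MHz -> 5.0 ns
    critical_path_ns = period_ns - wns_ns  # positive WNS shortens the path
    return 1e3 / critical_path_ns


fmax = estimated_fmax_mhz(target_mhz=200.0, wns_ns=0.057)
margin_pct = 100.0 * 0.057 / (1e3 / 200.0)  # slack as a share of the period
print(f"estimated Fmax ~ {fmax:.1f} MHz, slack ~ {margin_pct:.2f}% of the period")
```

With the numbers quoted above this gives roughly 202.3 MHz, i.e. timing closes at 200 MHz with a little over 1% of the clock period to spare.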
Backward compatibility
- `PLATFORM_MEMORY_OFFSET_<i>` defaults to `PLATFORM_MEMORY_OFFSET`, so any platform that didn't define per-bank offsets sees the same effective offset on every bank as before. HBM platforms (U280/U55C/U50, all `PLATFORM_MERGED_MEMORY_INTERFACE`), VCK5000 single-channel, and Zynq UltraScale+ are byte-for-byte identical.
- U250 changes from `NUM_BANKS=4 (no connectivity)` to `NUM_BANKS=1 (DDR[0] + offset)`. The previous configuration produced an xclbin that didn't actually run, so this is a strict improvement; any user wishing to revisit four-channel deployment needs the runtime VA plumbing of the follow-up PR anyway.

Files touched
- `hw/rtl/afu/xrt/vortex_afu.vh`: +18 / −0
- `hw/rtl/afu/xrt/VX_afu_wrap.sv`: +14 / −2
- `hw/syn/xilinx/xrt/platforms.mk`: +6 / −2

References: closes (or substantially mitigates) #262, #263, #278.