hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340
Open
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
Open
hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
Conversation
1) AW-vs-AR ordering. AW and AR fired with different IDs are not ordered by AXI4 (A5), so on real DDR4 controllers a load issued right after a writethrough store on the same line can return the pre-store value. This silently corrupted any function-call-boundary stack spill/fill on Alveo U250 (e.g. dogfood Test21-trig sinf, regression `diverge` ~9/10 fail) while simx and rtlsim — whose memory models commit writes synchronously — passed. Fix: per-bank counter of outstanding writes (AW handshakes not yet matched by B) gates AR on the same bank only; cross-bank parallelism is preserved. New parameter WRITE_BEFORE_READ defaults to 1; verification environments that already serialize writes can set it to 0. 2) Multi-bank-out address generation. With NUM_BANKS_OUT > 1 (separate AXI master per DDR channel, e.g. U200/U250 4-bank), the previous formula `byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZE` only matched a single-master HBM-style fabric. With per-channel masters each DDR controller has its own [0, capacity) space and the host BANK_INTERLEAVE math at runtime/xrt/vortex.cpp:get_bank_info places data at compact offset `(block_addr / NUM_BANKS) * LINE_SIZE` in bo[bank] — the AFU and host disagreed by a stride that grew with VA, the long-standing 4-bank corruption. Fix: emit `byte_addr = bank_addr * LINE_SIZE` per channel. With NUM_BANKS_OUT = 1 this is bit-identical to the previous behaviour, so HBM single-master configurations (U280/U55C/U50) and VCK5000 single-channel are unchanged. Verified on real Alveo U250 (XRT 2.19, xdma_shell_4_1) single-bank build with default DSP FPU at 200 MHz: regression `diverge` 10/10 PASS (was ~1/10), dogfood Test0–Test20, vecadd, sgemm, dotproduct, demo, dropout, conv3, io_addr, fence, plus OpenCL (saxpy, vecadd, sgemm, sgemm2, sgemm3, stencil, sfilter, spmv, psort, oclprintf) and single-rank MPI (mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct) — all pass. Final WNS = +0.057 ns; the patch adds ~7 LUT/FF per bank. Refs: vortexgpgpu#202, vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278, vortexgpgpu#333.
36676aa to
8057dd5
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why this matters: U250 has been silently broken in upstream
The Alveo U250 is one of the canonical FPGA targets advertised in Vortex's documentation, but in
mastertoday it produces silently incorrect results on real silicon. Two distinct AXI4 master correctness bugs inVX_axi_adaptermake the entire U250 path effectively unusable for real GPGPU workloads:dogfood Test21-trig(sinf) produces NaN; regressiondivergefails ~9/10 runs; any function-call-boundary stack spill/fill is corrupti = 0masks the broken term in the formulaThis is precisely the failure pattern that has been reported and re-reported in #202, #262, #263, #278, #333. The race-style bug also showed up under VCK5000 multi-bank investigation — same root cause family. Without these two fixes Vortex on U250 is, in practice, correctness-broken.
What this PR changes
A single file (
hw/rtl/libs/VX_axi_adapter.sv) plus a new Verilator unit test. No interface changes, fully back-compat with single-bank/HBM platforms (U280/U55C/U50/VCK5000 single-channel) — they synthesize byte-for-byte identically to before.1. AW-vs-AR ordering — store-to-load hazard fix
Root cause (verified independently with two third-party LLMs as a sanity check on the AXI4 spec):
m_axi_awid[i] = xbar_tag_outandm_axi_arid[i] = xbar_tag_r_outuse different ID spaces.vx_fence()empirically papers over the bug (it stalls LSU dispatch until the fence response returns, draining the AXI write queue). That is the smoking gun.Fix: per-bank counter of outstanding writes (AW handshakes not yet matched by B). While the counter is non-zero, gate AR (and the read-side xbar
ready_out) for that bank only. This serializes same-bank AW→AR while preserving cross-bank parallelism. A newWRITE_BEFORE_READ=1parameter (default on) lets verification environments with synchronous-commit memory models opt out (=0).2. Multi-bank-out address generation — separate-DDR-channel fix
Root cause:
NUM_BANKS_OUT > 1with separate AXI masters per DDR channel (U200 4-bank, U250 4-bank), eachm_axi_mem_<i>connects to its own DDR controller viaconnectivity.sp DDR[<i>].byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZEputs data at strided per-channel addresses — that math assumes a single AXI master with downstream HBM-style routing, not separate per-channel DDR controllers.BANK_INTERLEAVE(inruntime/xrt/vortex.cpp:get_bank_info) puts data at offset(block_addr / NUM_BANKS) * LINE_SIZEwithinbo[bank]— a compact per-bank address space.VA = 0(1/256 cache lines).Fix: emit
byte_addr = bank_addr * LINE_SIZEper channel. WithNUM_BANKS_OUT = 1,iis always zero, so the new code is bit-identical to the prior single-master path — HBM platforms (U280/U55C/U50) and VCK5000 are unchanged.Verification
Verilator unit test (new, included)
hw/unittest/axi_adapter_top/instantiates the patchedVX_axi_adapterdirectly withNUM_BANKS_OUT=4and:BANK_INTERLEAVEmath for cla 0..15 across all 4 banks (32/32 cases pass).B); fires immediately afterBdrains.Real Alveo U250 hardware
Built an
xclbinwith this patch (single-bank, NUM_CORES=2, NUM_WARPS=4, XLEN=64, 200 MHz, default DSP FPU) and ran the upstream regression and OpenCL/MPI suites on real U250 silicon (XRT 2.19.194, shellxilinx_u250_gen3x16_xdma_4_1_202210_1):diverge× 10 runsTwo test categories were exercised that were previously known to fail intermittently:
diverge'shacker(&key, task_id)thenkey.userreload): 10/10 PASS after.divergeand withindogfood's warp-divergent cases.Timing: WNS = +0.057 ns, WHS = +0.010 ns at 200 MHz on U250 (first build). The patch adds a small per-bank counter (~7 LUTs/bank + 7 FFs/bank) and one combinational
~writes_pendinggate on AR — negligible area, no timing regression observed.Simulation regression (upstream rtlsim)
tests/regression/divergePASS 5/5 in rtlsim (already passed before the patch). No change because rtlsim'srtlsim_shiminstantiatesVortexdirectly, notVortex_axi, so this codepath isn't exercised in sim. The patch only affects the FPGA AXI master path — there is no risk to existing simulator-tested behavior.Files touched
hw/rtl/libs/VX_axi_adapter.sv— +47 / −13 lines (the two fixes)hw/rtl/libs/VX_axi_adapter_top.sv— new (unit-test wrapper, lint-off-only stylistic changes)hw/unittest/axi_adapter_top/Makefile— new (unit-test build using the existinghw/unittest/common.mk)hw/unittest/axi_adapter_top/main.cpp— new (Verilator C++ harness, 4 testcases)No
runtime/, no kernel changes, no platform mk changes. PR is intentionally minimal and self-contained so it can land independently and immediately unblock U250 users.What's NOT in this PR (but is needed for full 4-bank deployment)
The address-generation fix above makes 4-bank logically correct at the AFU boundary. To actually deploy 4-bank on U250/U200 you also need a way to communicate each bank's
xrt::bo::address()back to the AFU at runtime (so per-bankPLATFORM_MEMORY_OFFSET_<i>matches the actual XRT allocation). I have a runtime-DCR-based plumbing patch for that and will submit it as a separate PR after this one lands, since (a) it touchesruntime/,VX_afu_wrap.sv,VX_types.vh, andplatforms.mk— different review scope; (b) PR1 already gives single-bank U250 users full correctness today, which is the bigger immediate need.References: closes (or substantially mitigates) #202, #262, #263, #278, #333.