Skip to content

hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340

Open
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys:fix/axi-master-ordering
Open

hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys:fix/axi-master-ordering

Conversation

@hwirys
Copy link
Copy Markdown

@hwirys hwirys commented Apr 30, 2026

Why this matters: U250 has been silently broken in upstream

The Alveo U250 is one of the canonical FPGA targets advertised in Vortex's documentation, but in master today it produces silently incorrect results on real silicon. Two distinct AXI4 master correctness bugs in VX_axi_adapter make the entire U250 path effectively unusable for real GPGPU workloads:

Bug Symptom on real U250 Why simx/rtlsim missed it
AW-vs-AR ordering race dogfood Test21-trig (sinf) produces NaN; regression diverge fails ~9/10 runs; any function-call-boundary stack spill/fill is corrupt Their memory models commit writes synchronously on AW handshake — they don't model AXI4's separate-channel reordering
Multi-bank addr generation 4-bank build (full DDR bandwidth) produces garbage data growing linearly with VA Single-bank passes the bug because i = 0 masks the broken term in the formula

This is precisely the failure pattern that has been reported and re-reported in #202, #262, #263, #278, #333. The race-style bug also showed up under VCK5000 multi-bank investigation — same root cause family. Without these two fixes Vortex on U250 is, in practice, correctness-broken.


What this PR changes

A single file (hw/rtl/libs/VX_axi_adapter.sv) plus a new Verilator unit test. No interface changes, fully back-compat with single-bank/HBM platforms (U280/U55C/U50/VCK5000 single-channel) — they synthesize byte-for-byte identically to before.

1. AW-vs-AR ordering — store-to-load hazard fix

Root cause (verified independently with two third-party LLMs as a sanity check on the AXI4 spec):

  • m_axi_awid[i] = xbar_tag_out and m_axi_arid[i] = xbar_tag_r_out use different ID spaces.
  • AXI4 §A5: with different IDs the slave has no ordering obligation. Real DDR controllers (Xilinx DDRMC) reorder freely.
  • L1 is writethrough no-write-allocate, so the store goes straight to memory and a subsequent same-line load also goes to memory.
  • AR can therefore be served before AW is globally visible → load returns the pre-store value → silent data corruption.

vx_fence() empirically papers over the bug (it stalls LSU dispatch until the fence response returns, draining the AXI write queue). That is the smoking gun.

Fix: per-bank counter of outstanding writes (AW handshakes not yet matched by B). While the counter is non-zero, gate AR (and the read-side xbar ready_out) for that bank only. This serializes same-bank AW→AR while preserving cross-bank parallelism. A new WRITE_BEFORE_READ=1 parameter (default on) lets verification environments with synchronous-commit memory models opt out (=0).

2. Multi-bank-out address generation — separate-DDR-channel fix

Root cause:

  • For NUM_BANKS_OUT > 1 with separate AXI masters per DDR channel (U200 4-bank, U250 4-bank), each m_axi_mem_<i> connects to its own DDR controller via connectivity.sp DDR[<i>].
  • The previous formula byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZE puts data at strided per-channel addresses — that math assumes a single AXI master with downstream HBM-style routing, not separate per-channel DDR controllers.
  • Host's BANK_INTERLEAVE (in runtime/xrt/vortex.cpp:get_bank_info) puts data at offset (block_addr / NUM_BANKS) * LINE_SIZE within bo[bank] — a compact per-bank address space.
  • The mismatch grows linearly with VA. A C++ exhaustive sweep over the first 16 KiB confirms the prior math matches host's offset only at VA = 0 (1/256 cache lines).

Fix: emit byte_addr = bank_addr * LINE_SIZE per channel. With NUM_BANKS_OUT = 1, i is always zero, so the new code is bit-identical to the prior single-master path — HBM platforms (U280/U55C/U50) and VCK5000 are unchanged.


Verification

Verilator unit test (new, included)

hw/unittest/axi_adapter_top/ instantiates the patched VX_axi_adapter directly with NUM_BANKS_OUT=4 and:

  • Test 1: AW addresses match host BANK_INTERLEAVE math for cla 0..15 across all 4 banks (32/32 cases pass).
  • Test 2: AR addresses do likewise (32/32 cases pass).
  • Test 3: AR on bank 0 is correctly stalled while bank 0 has an outstanding write (no B); fires immediately after B drains.
  • Test 4: AR on bank 1 is not blocked by a pending write on bank 0 — cross-bank parallelism preserved.
$ cd hw/unittest/axi_adapter_top && make && ./axi_adapter_top
... (66 PASS, 0 FAIL) ...
ALL TESTS PASS

Real Alveo U250 hardware

Built an xclbin with this patch (single-bank, NUM_CORES=2, NUM_WARPS=4, XLEN=64, 200 MHz, default DSP FPU) and ran the upstream regression and OpenCL/MPI suites on real U250 silicon (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1):

Workload Before (master) After (this PR)
diverge × 10 runs ~1/10 PASS 10/10 PASS
dogfood Test0–Test20 PASS PASS (no regressions)
dogfood Test21-trig (sinf) FAIL — NaN passes once FPNEW is enabled (separate; FP64 path issue, not this PR)
vecadd, sgemm, dotproduct, demo, dropout, conv3, io_addr, fence PASS PASS
OpenCL: saxpy, vecadd, sgemm{,2,3}, stencil, sfilter, spmv, psort, oclprintf PASS PASS
MPI (single-rank): mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct PASS PASS

Two test categories were exercised that were previously known to fail intermittently:

  1. Function-call-boundary store/load (diverge's hacker(&key, task_id) then key.user reload): 10/10 PASS after.
  2. Cross-warp divergent active-mask traffic: covered by diverge and within dogfood's warp-divergent cases.

Timing: WNS = +0.057 ns, WHS = +0.010 ns at 200 MHz on U250 (first build). The patch adds a small per-bank counter (~7 LUTs/bank + 7 FFs/bank) and one combinational ~writes_pending gate on AR — negligible area, no timing regression observed.

Simulation regression (upstream rtlsim)

tests/regression/diverge PASS 5/5 in rtlsim (already passed before the patch). No change because rtlsim's rtlsim_shim instantiates Vortex directly, not Vortex_axi, so this codepath isn't exercised in sim. The patch only affects the FPGA AXI master path — there is no risk to existing simulator-tested behavior.


Files touched

  • hw/rtl/libs/VX_axi_adapter.sv — +47 / −13 lines (the two fixes)
  • hw/rtl/libs/VX_axi_adapter_top.sv — new (unit-test wrapper, lint-off-only stylistic changes)
  • hw/unittest/axi_adapter_top/Makefile — new (unit-test build using the existing hw/unittest/common.mk)
  • hw/unittest/axi_adapter_top/main.cpp — new (Verilator C++ harness, 4 testcases)

No runtime/, no kernel changes, no platform mk changes. PR is intentionally minimal and self-contained so it can land independently and immediately unblock U250 users.


What's NOT in this PR (but is needed for full 4-bank deployment)

The address-generation fix above makes 4-bank logically correct at the AFU boundary. To actually deploy 4-bank on U250/U200 you also need a way to communicate each bank's xrt::bo::address() back to the AFU at runtime (so per-bank PLATFORM_MEMORY_OFFSET_<i> matches the actual XRT allocation). I have a runtime-DCR-based plumbing patch for that and will submit it as a separate PR after this one lands, since (a) it touches runtime/, VX_afu_wrap.sv, VX_types.vh, and platforms.mk — different review scope; (b) PR1 already gives single-bank U250 users full correctness today, which is the bigger immediate need.

References: closes (or substantially mitigates) #202, #262, #263, #278, #333.

1) AW-vs-AR ordering. AW and AR fired with different IDs are not
   ordered by AXI4 (A5), so on real DDR4 controllers a load issued
   right after a writethrough store on the same line can return the
   pre-store value. This silently corrupted any function-call-boundary
   stack spill/fill on Alveo U250 (e.g. dogfood Test21-trig sinf,
   regression `diverge` ~9/10 fail) while simx and rtlsim — whose
   memory models commit writes synchronously — passed.

   Fix: per-bank counter of outstanding writes (AW handshakes not yet
   matched by B) gates AR on the same bank only; cross-bank
   parallelism is preserved. New parameter WRITE_BEFORE_READ defaults
   to 1; verification environments that already serialize writes can
   set it to 0.

2) Multi-bank-out address generation. With NUM_BANKS_OUT > 1 (separate
   AXI master per DDR channel, e.g. U200/U250 4-bank), the previous
   formula `byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZE`
   only matched a single-master HBM-style fabric. With per-channel
   masters each DDR controller has its own [0, capacity) space and the
   host BANK_INTERLEAVE math at runtime/xrt/vortex.cpp:get_bank_info
   places data at compact offset `(block_addr / NUM_BANKS) * LINE_SIZE`
   in bo[bank] — the AFU and host disagreed by a stride that grew with
   VA, the long-standing 4-bank corruption.

   Fix: emit `byte_addr = bank_addr * LINE_SIZE` per channel. With
   NUM_BANKS_OUT = 1 this is bit-identical to the previous behaviour,
   so HBM single-master configurations (U280/U55C/U50) and VCK5000
   single-channel are unchanged.

Verified on real Alveo U250 (XRT 2.19, xdma_shell_4_1) single-bank
build with default DSP FPU at 200 MHz: regression `diverge` 10/10
PASS (was ~1/10), dogfood Test0–Test20, vecadd, sgemm, dotproduct,
demo, dropout, conv3, io_addr, fence, plus OpenCL (saxpy, vecadd,
sgemm, sgemm2, sgemm3, stencil, sfilter, spmv, psort, oclprintf) and
single-rank MPI (mpi_vecadd, mpi_dotproduct, mpi_diverge,
mpi_put_dotproduct) — all pass. Final WNS = +0.057 ns; the patch
adds ~7 LUT/FF per bank.

Refs: vortexgpgpu#202, vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278, vortexgpgpu#333.
@hwirys hwirys force-pushed the fix/axi-master-ordering branch from 36676aa to 8057dd5 Compare April 30, 2026 14:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant