hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr) by hwirys · Pull Request #340 · vortexgpgpu/vortex

hwirys · 2026-04-30T13:34:32Z

Why this matters: U250 has been silently broken in upstream

The Alveo U250 is one of the canonical FPGA targets advertised in Vortex's documentation, but in master today it produces silently incorrect results on real silicon. Two distinct AXI4 master correctness bugs in VX_axi_adapter make the entire U250 path effectively unusable for real GPGPU workloads:

Bug	Symptom on real U250	Why simx/rtlsim missed it
AW-vs-AR ordering race	`dogfood Test21-trig` (`sinf`) produces NaN; regression `diverge` fails ~9/10 runs; any function-call-boundary stack spill/fill is corrupt	Their memory models commit writes synchronously on AW handshake — they don't model AXI4's separate-channel reordering
Multi-bank addr generation	4-bank build (full DDR bandwidth) produces garbage data growing linearly with VA	Single-bank passes the bug because `i = 0` masks the broken term in the formula

This is precisely the failure pattern that has been reported and re-reported in #202, #262, #263, #278, #333. The race-style bug also showed up under VCK5000 multi-bank investigation — same root cause family. Without these two fixes Vortex on U250 is, in practice, correctness-broken.

What this PR changes

A single file (hw/rtl/libs/VX_axi_adapter.sv) plus a new Verilator unit test. No interface changes, fully back-compat with single-bank/HBM platforms (U280/U55C/U50/VCK5000 single-channel) — they synthesize byte-for-byte identically to before.

1. AW-vs-AR ordering — store-to-load hazard fix

Root cause (verified independently with two third-party LLMs as a sanity check on the AXI4 spec):

m_axi_awid[i] = xbar_tag_out and m_axi_arid[i] = xbar_tag_r_out use different ID spaces.
AXI4 §A5: with different IDs the slave has no ordering obligation. Real DDR controllers (Xilinx DDRMC) reorder freely.
L1 is writethrough no-write-allocate, so the store goes straight to memory and a subsequent same-line load also goes to memory.
AR can therefore be served before AW is globally visible → load returns the pre-store value → silent data corruption.

vx_fence() empirically papers over the bug (it stalls LSU dispatch until the fence response returns, draining the AXI write queue). That is the smoking gun.

Fix: per-bank counter of outstanding writes (AW handshakes not yet matched by B). While the counter is non-zero, gate AR (and the read-side xbar ready_out) for that bank only. This serializes same-bank AW→AR while preserving cross-bank parallelism. A new WRITE_BEFORE_READ=1 parameter (default on) lets verification environments with synchronous-commit memory models opt out (=0).

2. Multi-bank-out address generation — separate-DDR-channel fix

Root cause:

For NUM_BANKS_OUT > 1 with separate AXI masters per DDR channel (U200 4-bank, U250 4-bank), each m_axi_mem_<i> connects to its own DDR controller via connectivity.sp DDR[<i>].
The previous formula byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZE puts data at strided per-channel addresses — that math assumes a single AXI master with downstream HBM-style routing, not separate per-channel DDR controllers.
Host's BANK_INTERLEAVE (in runtime/xrt/vortex.cpp:get_bank_info) puts data at offset (block_addr / NUM_BANKS) * LINE_SIZE within bo[bank] — a compact per-bank address space.
The mismatch grows linearly with VA. A C++ exhaustive sweep over the first 16 KiB confirms the prior math matches host's offset only at VA = 0 (1/256 cache lines).

Fix: emit byte_addr = bank_addr * LINE_SIZE per channel. With NUM_BANKS_OUT = 1, i is always zero, so the new code is bit-identical to the prior single-master path — HBM platforms (U280/U55C/U50) and VCK5000 are unchanged.

Verification

Verilator unit test (new, included)

hw/unittest/axi_adapter_top/ instantiates the patched VX_axi_adapter directly with NUM_BANKS_OUT=4 and:

Test 1: AW addresses match host BANK_INTERLEAVE math for cla 0..15 across all 4 banks (32/32 cases pass).
Test 2: AR addresses do likewise (32/32 cases pass).
Test 3: AR on bank 0 is correctly stalled while bank 0 has an outstanding write (no B); fires immediately after B drains.
Test 4: AR on bank 1 is not blocked by a pending write on bank 0 — cross-bank parallelism preserved.

$ cd hw/unittest/axi_adapter_top && make && ./axi_adapter_top
... (66 PASS, 0 FAIL) ...
ALL TESTS PASS

Real Alveo U250 hardware

Built an xclbin with this patch (single-bank, NUM_CORES=2, NUM_WARPS=4, XLEN=64, 200 MHz, default DSP FPU) and ran the upstream regression and OpenCL/MPI suites on real U250 silicon (XRT 2.19.194, shell xilinx_u250_gen3x16_xdma_4_1_202210_1):

Workload	Before (master)	After (this PR)
`diverge` × 10 runs	~1/10 PASS	10/10 PASS
dogfood Test0–Test20	PASS	PASS (no regressions)
dogfood Test21-trig (sinf)	FAIL — NaN	passes once FPNEW is enabled (separate; FP64 path issue, not this PR)
vecadd, sgemm, dotproduct, demo, dropout, conv3, io_addr, fence	PASS	PASS
OpenCL: saxpy, vecadd, sgemm{,2,3}, stencil, sfilter, spmv, psort, oclprintf	PASS	PASS
MPI (single-rank): mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct	PASS	PASS

Two test categories were exercised that were previously known to fail intermittently:

Function-call-boundary store/load (diverge's hacker(&key, task_id) then key.user reload): 10/10 PASS after.
Cross-warp divergent active-mask traffic: covered by diverge and within dogfood's warp-divergent cases.

Timing: WNS = +0.057 ns, WHS = +0.010 ns at 200 MHz on U250 (first build). The patch adds a small per-bank counter (~7 LUTs/bank + 7 FFs/bank) and one combinational ~writes_pending gate on AR — negligible area, no timing regression observed.

Simulation regression (upstream rtlsim)

tests/regression/diverge PASS 5/5 in rtlsim (already passed before the patch). No change because rtlsim's rtlsim_shim instantiates Vortex directly, not Vortex_axi, so this codepath isn't exercised in sim. The patch only affects the FPGA AXI master path — there is no risk to existing simulator-tested behavior.

Files touched

hw/rtl/libs/VX_axi_adapter.sv — +47 / −13 lines (the two fixes)
hw/rtl/libs/VX_axi_adapter_top.sv — new (unit-test wrapper, lint-off-only stylistic changes)
hw/unittest/axi_adapter_top/Makefile — new (unit-test build using the existing hw/unittest/common.mk)
hw/unittest/axi_adapter_top/main.cpp — new (Verilator C++ harness, 4 testcases)

No runtime/, no kernel changes, no platform mk changes. PR is intentionally minimal and self-contained so it can land independently and immediately unblock U250 users.

What's NOT in this PR (but is needed for full 4-bank deployment)

The address-generation fix above makes 4-bank logically correct at the AFU boundary. To actually deploy 4-bank on U250/U200 you also need a way to communicate each bank's xrt::bo::address() back to the AFU at runtime (so per-bank PLATFORM_MEMORY_OFFSET_<i> matches the actual XRT allocation). I have a runtime-DCR-based plumbing patch for that and will submit it as a separate PR after this one lands, since (a) it touches runtime/, VX_afu_wrap.sv, VX_types.vh, and platforms.mk — different review scope; (b) PR1 already gives single-bank U250 users full correctness today, which is the bigger immediate need.

References: closes (or substantially mitigates) #202, #262, #263, #278, #333.

1) AW-vs-AR ordering. AW and AR fired with different IDs are not ordered by AXI4 (A5), so on real DDR4 controllers a load issued right after a writethrough store on the same line can return the pre-store value. This silently corrupted any function-call-boundary stack spill/fill on Alveo U250 (e.g. dogfood Test21-trig sinf, regression `diverge` ~9/10 fail) while simx and rtlsim — whose memory models commit writes synchronously — passed. Fix: per-bank counter of outstanding writes (AW handshakes not yet matched by B) gates AR on the same bank only; cross-bank parallelism is preserved. New parameter WRITE_BEFORE_READ defaults to 1; verification environments that already serialize writes can set it to 0. 2) Multi-bank-out address generation. With NUM_BANKS_OUT > 1 (separate AXI master per DDR channel, e.g. U200/U250 4-bank), the previous formula `byte_addr = bank_addr * NUM_BANKS * LINE_SIZE + i * LINE_SIZE` only matched a single-master HBM-style fabric. With per-channel masters each DDR controller has its own [0, capacity) space and the host BANK_INTERLEAVE math at runtime/xrt/vortex.cpp:get_bank_info places data at compact offset `(block_addr / NUM_BANKS) * LINE_SIZE` in bo[bank] — the AFU and host disagreed by a stride that grew with VA, the long-standing 4-bank corruption. Fix: emit `byte_addr = bank_addr * LINE_SIZE` per channel. With NUM_BANKS_OUT = 1 this is bit-identical to the previous behaviour, so HBM single-master configurations (U280/U55C/U50) and VCK5000 single-channel are unchanged. Verified on real Alveo U250 (XRT 2.19, xdma_shell_4_1) single-bank build with default DSP FPU at 200 MHz: regression `diverge` 10/10 PASS (was ~1/10), dogfood Test0–Test20, vecadd, sgemm, dotproduct, demo, dropout, conv3, io_addr, fence, plus OpenCL (saxpy, vecadd, sgemm, sgemm2, sgemm3, stencil, sfilter, spmv, psort, oclprintf) and single-rank MPI (mpi_vecadd, mpi_dotproduct, mpi_diverge, mpi_put_dotproduct) — all pass. Final WNS = +0.057 ns; the patch adds ~7 LUT/FF per bank. Refs: vortexgpgpu#202, vortexgpgpu#262, vortexgpgpu#263, vortexgpgpu#278, vortexgpgpu#333.

hwirys force-pushed the fix/axi-master-ordering branch from 36676aa to 8057dd5 Compare April 30, 2026 14:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340

hw: VX_axi_adapter — fix AXI4 master correctness on real U250 (LSU race + multi-bank addr)#340
hwirys wants to merge 1 commit intovortexgpgpu:masterfrom
hwirys:fix/axi-master-ordering

hwirys commented Apr 30, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hwirys commented Apr 30, 2026

Why this matters: U250 has been silently broken in upstream

What this PR changes

1. AW-vs-AR ordering — store-to-load hazard fix

2. Multi-bank-out address generation — separate-DDR-channel fix

Verification

Verilator unit test (new, included)

Real Alveo U250 hardware

Simulation regression (upstream rtlsim)

Files touched

What's NOT in this PR (but is needed for full 4-bank deployment)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant