
Refactor: decouple AICore L2Perf writes via stable per-core staging ring (#709)

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoZheng109:a2a3/l2perf on May 7, 2026

Conversation

@ChaoZheng109 (Collaborator) commented May 6, 2026

Why

The previous design coupled AICore L2Perf writes to AICPU's L2PerfBuffer rotation: AICPU exposed a public l2_perf_aicpu_switch_buffer(buffer_idx) and AICore reloaded the buffer address every iteration via the handshake. Two problems:

  1. Leaky API. switch_buffer was a public collector method driven by dispatch-side counters (CoreExecState::dispatch_count, AicpuExecutor::core_dispatch_counts_). Buffer rotation is an AICPU-internal concern, but the dispatch path had to know about it and thread a Runtime* through SchedulerContext::dispatch_* just to reach the API.

  2. Two rotation races that lost records:

    • Premature switch — buffer not actually full. Rotation was driven by dispatched task count, not by completed records. AICPU could swap to a new buffer while the previous one still had empty slots — those slots were never reclaimed by the host, inflating the apparent "drop" count.
    • In-flight (WIP) writes lost on switch. Because AICore reloaded the records-buffer address from the handshake every iteration, an AICPU-side switch racing with an AICore write in progress could land the record in a buffer the host had already drained-and-recycled (or a different buffer than the one AICPU was about to read), silently losing the record with no mismatch signal.
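The premature-switch problem is pure accounting: slots left unwritten at swap time were charged as drops even though no buffer was over capacity. A minimal sketch of that arithmetic (names like `SwitchAccounting` are illustrative, not the real identifiers):

```cpp
#include <cstdint>

// Under the old dispatch-count-driven rotation, AICPU swapped buffers after a
// fixed number of *dispatched* tasks, even if some records had not yet landed.
struct SwitchAccounting {
    uint64_t buffer_capacity;   // slots in the records buffer
    uint64_t records_written;   // records actually completed before the swap
};

// Slots still empty at swap time were never reclaimed by the host, so they
// surfaced as apparent "drops" despite no real capacity pressure.
uint64_t apparent_drops_on_switch(const SwitchAccounting& s) {
    return s.buffer_capacity - s.records_written;
}
```

For example, a capacity-8 buffer swapped after 8 dispatches but only 5 completions reports 3 phantom drops.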

What

Decouple AICore from L2PerfBuffer rotation by introducing a stable per-core staging ring.

  • New L2PerfAicoreRing — written exclusively by AICore at dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]. Address is published once via Handshake::l2_perf_aicore_ring_addr (renamed from l2_perf_records_addr) and never reassigned. AICore caches it once after init; the per-iteration handshake reload + dcci(my_hank, …) in host_build_graph go away.
  • L2PerfBuffer rotation becomes AICPU-internal — it happens inside l2_perf_aicpu_complete_record the moment a buffer fills, replacing the dispatch-side counters. Public l2_perf_aicpu_switch_buffer API is removed; rotation is now a private switch_records_buffer.
  • L2Perf init moves to before handshake_all_cores so AICore observes a non-zero ring address the moment aicpu_ready=1 unblocks Phase 1.
  • New mismatch_record_count bucket on L2PerfBufferState — distinguishes ring/task_id invariant violations (hard error, DEV_ERROR-logged) from capacity-driven dropped_record_count. Reconcile arithmetic becomes collected + dropped + mismatch == device_total.
  • Drop the now-unused Runtime* parameter from SchedulerContext dispatch_* helpers (it was only threaded through to reach the old switch API).
  • docs/l2-swimlane-profiling.md updated to match.
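The new flow can be sketched in a single-threaded model — a hedged approximation, not the real implementation: `kRingSize`, `L2PerfRecord`, `StagingRing`, and `Collector` are hypothetical stand-ins for `PLATFORM_L2_AICORE_RING_SIZE`, `L2PerfAicoreRing`, and the AICPU collector, and the real code uses cache-coherency operations rather than plain stores.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr std::size_t kRingSize = 8;  // stands in for PLATFORM_L2_AICORE_RING_SIZE

struct L2PerfRecord {
    uint64_t task_id = 0;
    uint64_t cycles  = 0;
    bool     valid   = false;
};

// Written exclusively by AICore at slots[task_id % kRingSize]; the address is
// published once and never reassigned, so AICore caches the pointer after init.
struct StagingRing {
    std::array<L2PerfRecord, kRingSize> slots{};

    void publish(uint64_t task_id, uint64_t cycles) {
        L2PerfRecord& slot = slots[task_id % kRingSize];
        slot.task_id = task_id;
        slot.cycles  = cycles;
        slot.valid   = true;  // a release-store/flush on real hardware
    }
};

// AICPU side: drains the ring in completion order and rotates its records
// buffer internally the moment it fills -- no public switch_buffer call.
struct Collector {
    std::vector<L2PerfRecord> records_buffer;
    std::size_t buffer_capacity = 4;
    std::size_t rotations = 0;
    uint64_t mismatch_record_count = 0;

    void complete_record(StagingRing& ring, uint64_t expected_task_id) {
        L2PerfRecord& slot = ring.slots[expected_task_id % kRingSize];
        if (!slot.valid || slot.task_id != expected_task_id) {
            ++mismatch_record_count;  // ring/task_id invariant violation: hard error
            return;
        }
        records_buffer.push_back(slot);
        slot.valid = false;
        if (records_buffer.size() == buffer_capacity) {
            records_buffer.clear();   // private rotation, invisible to AICore
            ++rotations;
        }
    }
};
```

Because rotation now happens inside the completion path, a buffer can only switch when it is actually full, and AICore never observes the swap at all.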

Testing

  • Simulation tests pass
  • Hardware tests pass
  • L2Perf reconcile counters check out on a multi-core run (collected + dropped + mismatch == device_total, mismatch == 0)
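The reconcile check above is just three counters summing to the device total. A sketch, assuming field names that mirror the PR text rather than the real `L2PerfBufferState` layout:

```cpp
#include <cstdint>

struct L2PerfCounters {
    uint64_t collected_record_count;
    uint64_t dropped_record_count;   // capacity-driven, expected under load
    uint64_t mismatch_record_count;  // invariant violation, must stay zero
    uint64_t device_total;           // records the device claims to have emitted
};

// Every device-emitted record must land in exactly one bucket.
bool reconcile(const L2PerfCounters& c) {
    return c.collected_record_count + c.dropped_record_count +
           c.mismatch_record_count == c.device_total;
}
```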


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a unified profiling framework for the a2a3 and a5 platforms, consolidating the L2 swimlane, PMU, and Tensor Dump collectors. It replaces per-subsystem thread management with a shared ProfilerBase and BufferPoolManager architecture, improves the AICore-to-AICPU timing publication protocol by using a stable staging ring, and enhances error reporting for buffer management and invariant violations. My feedback highlights potential performance bottlenecks in the management loop's global scan and suggests evaluating whether silent record drops on invariant violations should trigger more explicit error handling.

Comment thread src/a2a3/platform/include/host/profiling_common/profiler_base.h
Comment thread src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
Commit message (551a79c):

Introduce L2PerfAicoreRing — a per-core ring written exclusively by AICore
at dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]. Its address is
published once via Handshake::l2_perf_aicore_ring_addr and never reassigned,
so AICore is fully decoupled from the AICPU's records-buffer rotation:

- AICore caches the ring pointer once after init; the per-iteration handshake
  reload + dcci on host_build_graph go away.
- AICPU rotates L2PerfBuffer internally inside l2_perf_aicpu_complete_record
  the moment a buffer fills, instead of relying on dispatch-side counters
  (CoreExecState::dispatch_count / AicpuExecutor::core_dispatch_counts_)
  driving an externally visible buffer swap. The public switch_buffer API is
  gone; rotation is a private switch_records_buffer.
- L2Perf init moves before handshake_all_cores so AICore observes a non-zero
  ring address the moment aicpu_ready=1 unblocks Phase 1.
- Add mismatch_record_count to L2PerfBufferState — the runtime's
  completion-before-dispatch invariant says ring/task_id mismatch must never
  happen, so it gets its own bucket (hard error, DEV_ERROR-logged) distinct
  from capacity-driven dropped_record_count. Reconcile arithmetic becomes
  collected + dropped + mismatch == device_total.
- Drop the now-unused Runtime* parameter from SchedulerContext dispatch_*
  helpers (it was only threaded through to reach the old switch API).

Doc + comments in docs/l2-swimlane-profiling.md updated to match.
@ChaoWao ChaoWao merged commit 551a79c into hw-native-sys:main May 7, 2026
13 checks passed