
Refactor: decouple AICore L2Perf writes via stable per-core staging ring (#709)

Merged
ChaoWao merged 1 commit into hw-native-sys:main from ChaoZheng109:a2a3/l2perf on May 7, 2026

Conversation

@ChaoZheng109 (Collaborator) commented May 6, 2026

Why

The previous design coupled AICore L2Perf writes to AICPU's L2PerfBuffer rotation: AICPU exposed a public l2_perf_aicpu_switch_buffer(buffer_idx) and AICore reloaded the buffer address every iteration via the handshake. Two problems:

  1. Leaky API. switch_buffer was a public collector method driven by dispatch-side counters (CoreExecState::dispatch_count, AicpuExecutor::core_dispatch_counts_). Buffer rotation is an AICPU-internal concern, but the dispatch path had to know about it and thread a Runtime* through SchedulerContext::dispatch_* just to reach the API.

  2. Two rotation races that lost records:

    • Premature switch — buffer not actually full. Rotation was driven by dispatched task count, not by completed records. AICPU could swap to a new buffer while the previous one still had empty slots — those slots were never reclaimed by the host, inflating the apparent "drop" count.
    • In-flight (WIP) writes lost on switch. Because AICore reloaded the records-buffer address from the handshake every iteration, an AICPU-side switch racing with an AICore write in progress could land the record in a buffer the host had already drained-and-recycled (or a different buffer than the one AICPU was about to read), silently losing the record with no mismatch signal.
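The premature-switch problem is pure accounting: slots left unwritten at swap time were charged as drops even though no buffer was over capacity. A minimal sketch of that arithmetic (names like `SwitchAccounting` are illustrative, not the real identifiers):

```cpp
#include <cstdint>

// Under the old dispatch-count-driven rotation, AICPU swapped buffers after a
// fixed number of *dispatched* tasks, even if some records had not yet landed.
struct SwitchAccounting {
    uint64_t buffer_capacity;   // slots in the records buffer
    uint64_t records_written;   // records actually completed before the swap
};

// Slots still empty at swap time were never reclaimed by the host, so they
// surfaced as apparent "drops" despite no real capacity pressure.
uint64_t apparent_drops_on_switch(const SwitchAccounting& s) {
    return s.buffer_capacity - s.records_written;
}
```

For example, a capacity-8 buffer swapped after 8 dispatches but only 5 completions reports 3 phantom drops.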

What

Decouple AICore from L2PerfBuffer rotation by introducing a stable per-core staging ring.

  • New L2PerfAicoreRing — written exclusively by AICore at dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]. Address is published once via Handshake::l2_perf_aicore_ring_addr (renamed from l2_perf_records_addr) and never reassigned. AICore caches it once after init; the per-iteration handshake reload + dcci(my_hank, …) in host_build_graph go away.
  • L2PerfBuffer rotation becomes AICPU-internal — it happens inside l2_perf_aicpu_complete_record the moment a buffer fills, replacing the dispatch-side counters. Public l2_perf_aicpu_switch_buffer API is removed; rotation is now a private switch_records_buffer.
  • L2Perf init moves to before handshake_all_cores so AICore observes a non-zero ring address the moment aicpu_ready=1 unblocks Phase 1.
  • New mismatch_record_count bucket on L2PerfBufferState — distinguishes ring/task_id invariant violations (hard error, DEV_ERROR-logged) from capacity-driven dropped_record_count. Reconcile arithmetic becomes collected + dropped + mismatch == device_total.
  • Drop the now-unused Runtime* parameter from SchedulerContext dispatch_* helpers (it was only threaded through to reach the old switch API).
  • docs/l2-swimlane-profiling.md updated to match.
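The new flow can be sketched in a single-threaded model — a hedged approximation, not the real implementation: `kRingSize`, `L2PerfRecord`, `StagingRing`, and `Collector` are hypothetical stand-ins for `PLATFORM_L2_AICORE_RING_SIZE`, `L2PerfAicoreRing`, and the AICPU collector, and the real code uses cache-coherency operations rather than plain stores.

```cpp
#include <array>
#include <cstdint>
#include <vector>

constexpr std::size_t kRingSize = 8;  // stands in for PLATFORM_L2_AICORE_RING_SIZE

struct L2PerfRecord {
    uint64_t task_id = 0;
    uint64_t cycles  = 0;
    bool     valid   = false;
};

// Written exclusively by AICore at slots[task_id % kRingSize]; the address is
// published once and never reassigned, so AICore caches the pointer after init.
struct StagingRing {
    std::array<L2PerfRecord, kRingSize> slots{};

    void publish(uint64_t task_id, uint64_t cycles) {
        L2PerfRecord& slot = slots[task_id % kRingSize];
        slot.task_id = task_id;
        slot.cycles  = cycles;
        slot.valid   = true;  // a release-store/flush on real hardware
    }
};

// AICPU side: drains the ring in completion order and rotates its records
// buffer internally the moment it fills -- no public switch_buffer call.
struct Collector {
    std::vector<L2PerfRecord> records_buffer;
    std::size_t buffer_capacity = 4;
    std::size_t rotations = 0;
    uint64_t mismatch_record_count = 0;

    void complete_record(StagingRing& ring, uint64_t expected_task_id) {
        L2PerfRecord& slot = ring.slots[expected_task_id % kRingSize];
        if (!slot.valid || slot.task_id != expected_task_id) {
            ++mismatch_record_count;  // ring/task_id invariant violation: hard error
            return;
        }
        records_buffer.push_back(slot);
        slot.valid = false;
        if (records_buffer.size() == buffer_capacity) {
            records_buffer.clear();   // private rotation, invisible to AICore
            ++rotations;
        }
    }
};
```

Because rotation now happens inside the completion path, a buffer can only switch when it is actually full, and AICore never observes the swap at all.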

Testing

  • Simulation tests pass
  • Hardware tests pass
  • L2Perf reconcile counters check out on a multi-core run (collected + dropped + mismatch == device_total, mismatch == 0)
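The reconcile check above is just three counters summing to the device total. A sketch, assuming field names that mirror the PR text rather than the real `L2PerfBufferState` layout:

```cpp
#include <cstdint>

struct L2PerfCounters {
    uint64_t collected_record_count;
    uint64_t dropped_record_count;   // capacity-driven, expected under load
    uint64_t mismatch_record_count;  // invariant violation, must stay zero
    uint64_t device_total;           // records the device claims to have emitted
};

// Every device-emitted record must land in exactly one bucket.
bool reconcile(const L2PerfCounters& c) {
    return c.collected_record_count + c.dropped_record_count +
           c.mismatch_record_count == c.device_total;
}
```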


@gemini-code-assist (Bot) left a comment


Code Review

This pull request introduces a unified profiling framework for the a2a3 and a5 platforms, consolidating the L2 swimlane, PMU, and Tensor Dump collectors. It replaces per-subsystem thread management with a shared ProfilerBase and BufferPoolManager architecture, improves the AICore-to-AICPU timing publication protocol by using a stable staging ring, and enhances error reporting for buffer management and invariant violations. My feedback highlights potential performance bottlenecks in the management loop's global scan and suggests evaluating whether silent record drops on invariant violations should trigger more explicit error handling.

Comment thread src/a2a3/platform/include/host/profiling_common/profiler_base.h
Comment thread src/a2a3/platform/src/aicpu/l2_perf_collector_aicpu.cpp
Commit message (551a79c):

Introduce L2PerfAicoreRing — a per-core ring written exclusively by AICore
at dual_issue_slots[task_id % PLATFORM_L2_AICORE_RING_SIZE]. Its address is
published once via Handshake::l2_perf_aicore_ring_addr and never reassigned,
so AICore is fully decoupled from the AICPU's records-buffer rotation:

- AICore caches the ring pointer once after init; the per-iteration handshake
  reload + dcci on host_build_graph go away.
- AICPU rotates L2PerfBuffer internally inside l2_perf_aicpu_complete_record
  the moment a buffer fills, instead of relying on dispatch-side counters
  (CoreExecState::dispatch_count / AicpuExecutor::core_dispatch_counts_)
  driving an externally visible buffer swap. The public switch_buffer API is
  gone; rotation is a private switch_records_buffer.
- L2Perf init moves before handshake_all_cores so AICore observes a non-zero
  ring address the moment aicpu_ready=1 unblocks Phase 1.
- Add mismatch_record_count to L2PerfBufferState — the runtime's
  completion-before-dispatch invariant says ring/task_id mismatch must never
  happen, so it gets its own bucket (hard error, DEV_ERROR-logged) distinct
  from capacity-driven dropped_record_count. Reconcile arithmetic becomes
  collected + dropped + mismatch == device_total.
- Drop the now-unused Runtime* parameter from SchedulerContext dispatch_*
  helpers (it was only threaded through to reach the old switch API).

Doc + comments in docs/l2-swimlane-profiling.md updated to match.
@ChaoWao ChaoWao merged commit 551a79c into hw-native-sys:main May 7, 2026
13 checks passed