Refactor(a2a3): decouple profiling from runtime, own it in platform #714
Merged
ChaoWao merged 1 commit on May 8, 2026
Code Review
This pull request implements a unified host-side profiling framework and integrates it with L2 Swimlane, PMU, and Tensor Dump subsystems across multiple architectures. It introduces stable per-core staging rings to decouple AICore timing writes from AICPU buffer management and consolidates profiling enablement signals within KernelArgs. Feedback identifies a portability issue in the logging logic of the L2 and PMU collectors, where casting 64-bit addresses to unsigned long may lead to truncation on certain platforms; it is recommended to use the %p format specifier or PRIx64 macro instead.
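The portability concern raised in the review can be illustrated with a small sketch. `unsigned long` is only 32 bits wide on some ABIs (for example 64-bit Windows and most 32-bit hosts), so a `%lx` cast can drop the upper half of a device address. The helper name `format_device_addr` below is hypothetical and is not taken from the collectors in this PR; it only shows the `PRIx64` pattern the review recommends.

```cpp
#include <cassert>
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper (not from the PR): format a 64-bit address without
// going through `unsigned long`, whose width depends on the platform ABI.
std::string format_device_addr(uint64_t addr) {
    char buf[32];
    // PRIx64 expands to the correct length modifier for uint64_t on every
    // conforming platform, so no truncating cast is needed.
    int n = std::snprintf(buf, sizeof(buf), "0x%016" PRIx64, addr);
    assert(n == 18);  // "0x" plus 16 hex digits
    (void)n;          // silence unused-variable warnings when NDEBUG is set
    return std::string(buf);
}
```

When the value being logged is an actual host pointer rather than a `uint64_t` device address, `%p` with a `void*` argument is the other option the review mentions; its output format is implementation-defined.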
Profiling becomes a platform-layer concern instead of part of the runtime/AICore handshake contract. `enable_profiling_flag` and `l2_perf_aicore_ring_addr` move out of the runtime's `Handshake` struct and into `KernelArgs` + a platform-owned per-core state surface (`set_aicore_profiling_flag` / `set_aicore_l2_perf_ring` with matching getters), mirroring the AICPU-side `set_l2_swimlane_enabled` / `set_pmu_enabled` pattern.

Effect:
- The runtime/AICore handshake carries only synchronization + identity fields. Adding a new profiling sub-feature no longer touches `Handshake` or `aicore_execute`'s signature.
- Profiling lifetime is fully owned by the platform: AICore kernel entry indexes its per-core `L2PerfAicoreRing*` from `KernelArgs::aicore_ring_addr` and publishes it via the setters; AICore code reads via the getters; runtime never sees the storage.

- Add `aicore/aicore_profiling_state.h` (set/get for flag + ring)
- Onboard backing: `[[block_local]]` statics in onboard aicore/kernel.cpp (weak symbols dedup across AIC/AIV)
- Sim backing: pthread TLS in sim aicore/kernel.cpp; sim launch ABI extended with `enable_profiling_flag` + `aicore_ring_addr`
- KernelArgs gains `aicore_ring_addr` + `enable_profiling_flag`; bit layout doc moves here from runtime.h
- Host `L2PerfCollector` publishes the per-core ring table that KernelArgs forwards
- Runtime `Handshake` shrinks to just sync + identity fields in both runtimes
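The per-core state surface described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it uses C++ `thread_local` as a stand-in for the sim backend's pthread TLS, the getter names and the `publish_profiling_state` helper are hypothetical, the flag is assumed to be a `uint32_t`, and `L2PerfAicoreRing` and `KernelArgs` are reduced to placeholders carrying only the fields named in the description.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder: the real ring layout is not shown in the PR description.
struct L2PerfAicoreRing { uint64_t slots[4]; };

// Only the two profiling fields named in the PR; field types are assumptions.
struct KernelArgs {
    uint64_t aicore_ring_addr;       // address of a uint64_t[num_aicore] table
    uint32_t enable_profiling_flag;  // bit layout is documented in the real header
};

namespace {
// Per-"core" storage. thread_local stands in for the sim's pthread TLS;
// the onboard build is described as using [[block_local]] statics instead.
thread_local uint32_t g_profiling_flag = 0;
thread_local L2PerfAicoreRing* g_l2_perf_ring = nullptr;
}  // namespace

// Setter names come from the PR text; getter names are assumed.
void set_aicore_profiling_flag(uint32_t flag) { g_profiling_flag = flag; }
uint32_t get_aicore_profiling_flag() { return g_profiling_flag; }
void set_aicore_l2_perf_ring(L2PerfAicoreRing* ring) { g_l2_perf_ring = ring; }
L2PerfAicoreRing* get_aicore_l2_perf_ring() { return g_l2_perf_ring; }

// Hypothetical kernel-entry step: look up this core's ring in the table that
// KernelArgs points at, then publish flag and ring through the setters.
void publish_profiling_state(const KernelArgs& args, uint32_t core_id,
                             uint32_t num_aicore) {
    assert(core_id < num_aicore);
    (void)num_aicore;  // only used by the assert above
    set_aicore_profiling_flag(args.enable_profiling_flag);
    L2PerfAicoreRing* ring = nullptr;
    if (args.enable_profiling_flag != 0 && args.aicore_ring_addr != 0) {
        const uint64_t* table = reinterpret_cast<const uint64_t*>(
            static_cast<uintptr_t>(args.aicore_ring_addr));
        ring = reinterpret_cast<L2PerfAicoreRing*>(
            static_cast<uintptr_t>(table[core_id]));
    }
    set_aicore_l2_perf_ring(ring);
}
```

With this shape, code running after kernel entry only calls the getters, so nothing in the runtime-facing `Handshake` needs to carry the flag or the ring pointer.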
ChaoWao approved these changes on May 8, 2026
Summary
Profiling becomes a platform-layer concern instead of part of the runtime/AICore handshake contract.
`enable_profiling_flag` and `l2_perf_aicore_ring_addr` move out of the runtime's `Handshake` struct and into `KernelArgs` plus a platform-owned per-core state surface (`set_aicore_profiling_flag` / `set_aicore_l2_perf_ring`, with matching getters). Mirrors the AICPU-side `set_l2_swimlane_enabled` / `set_pmu_enabled` pattern.

After this change:
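- The runtime/AICore handshake carries only synchronization + identity fields. Adding a new profiling sub-feature no longer touches `Handshake` or `aicore_execute`'s signature.
- Profiling lifetime is fully owned by the platform: AICore kernel entry indexes its per-core `L2PerfAicoreRing*` from `KernelArgs::aicore_ring_addr` and publishes it via the setters; AICore code reads via the getters; the runtime never sees the storage.

Key changes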
- `src/a2a3/platform/include/aicore/aicore_profiling_state.h`: set/get for the per-core profiling flag and L2Perf ring pointer
- Onboard backing: `[[block_local]]` statics in `onboard/aicore/kernel.cpp`; setters/getters use weak linkage to dedup across the AIC + AIV compilation units linked into one AICore binary
- Sim backing: pthread TLS in `sim/aicore/kernel.cpp`; the sim launch ABI gains `enable_profiling_flag` + `aicore_ring_addr` so the wrapper can populate slots before `aicore_execute`
- `KernelArgs` gains `aicore_ring_addr` (device ptr to a `uint64_t[num_aicore]` table of per-core `L2PerfAicoreRing*`) + `enable_profiling_flag`; the bit-layout doc moves here from `runtime.h`
- Host `L2PerfCollector` publishes the per-core ring address table that `KernelArgs::aicore_ring_addr` points at
- Runtime `Handshake` shrinks in both `host_build_graph/runtime/runtime.h` and `tensormap_and_ringbuffer/runtime/runtime.h`: `l2_perf_aicore_ring_addr` and `enable_profiling_flag` removed

Merge order
This PR is the third in a chain. Please merge in this order:
Currently this PR is opened against `main`. After #705 and #709 land, I will rebase this branch onto the new `main` so the diff reflects only the runtime → platform decoupling.

Testing