
[NE16] Add GAP9_w_NE16 platform (NE16 accelerator on GAP9)#1

Open
runwangdl wants to merge 24 commits into devel from gap9-ne16

Conversation


runwangdl (Owner) commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer. _weightEncode ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode; see the packing sketch after this list).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py + DeeployTest/test_gap9_ne16_tiled_config.py — runner + kernel test config.
  • DeeployTest/test_platforms.py — pytest functions test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer under marker gap9_w_ne16_tiled.
  • .github/workflows/{ci-platform-gap9-w-ne16-tiled.yml,_runner-gap9-w-ne16-tiled.yml} — CI jobs (single + double buffer L2).
  • TargetLibraries/GAP9/CMakeLists.txt — add_subdirectory(pulp-nnx) with USE_NE16=ON for GAP9_w_NE16.
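
For orientation, a toy numpy sketch of the bit-serial packing idea behind _weightEncode (illustrative only; the ported encoder additionally handles weight offsets and the cin-major/minor split):

```python
import numpy as np

# Toy sketch only: pack one CIN_SUBTILE=16 channel group into NE16 bit-planes.
# For each weight bit (qw of them), NE16 stores a 16-channel bitmap (2 bytes),
# giving 3*3*2 = 18 bytes per bit-plane for a 3x3 kernel (144 bytes at qw=8).
def pack_bitplanes(w_u8: np.ndarray, qw: int = 8) -> np.ndarray:
    cout, cin, H, W = w_u8.shape                         # offset-encoded (unsigned) weights
    assert cin == 16                                     # exactly one CIN_SUBTILE
    bits = (w_u8[..., None] >> np.arange(qw)) & 1        # (cout, cin, H, W, qw)
    bits = bits.transpose(0, 4, 2, 3, 1)                 # (cout, qw, H, W, cin)
    return np.packbits(bits.reshape(cout, qw, H * W, cin),
                       axis=-1, bitorder="little")       # (cout, qw, H*W, 2)
```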

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in names/mapPlatform/setupMemoryPlatform/mapDeployer.
  • DeeployTest/testMVP.py — wrap deployer with EngineColoringDeployerWrapper for GAP9_w_NE16 (without it NE16 nodes never get an engine color and parsing fails).
  • DeeployTest/testUtils/core/execution.py — append the GAP9 SDK image build target for GAP9_w_NE16 (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — NE16 subtile constants per ne16_task_defs.h: CIN_SUBTILE = 16, output subtile 3, DW/Dense weight stride d0 = 3*3*weight_d0_stride_mode8 = 18 (PW: qw * weight_d0_stride = 16; this arithmetic is written out after this list). Emits the top-level ne16_task_t fields (weight_d0_stride, qw, subtile_output_channel, kernel_shape, depthwise) that the HW reads at dispatch time.
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py — DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so _weightEncode sees the standard (cout, 1, H, W) layout and produces the correct (1, 1, packed_bytes) single-block output expected by the NE16 HW.
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py — DW weight is a single packed block (not per-cout); constrain weightOutChannelVar == Max and reuse the same HyperRectangle((0,0,0), weightShape) for every output-channel tile.
  • Deeploy/Targets/NE16/Parsers.py — drop the group == shape[1] check in NE16DWConv2DParser (invalid under the post-encode rank-3 layout).
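
Written out, the stride arithmetic from the ConvTemplate bullet (constants as quoted from ne16_task_defs.h in this PR; a restatement for the reviewer, not code from the tree):

```python
CIN_SUBTILE = 16
qw = 8                                        # weight bits
weight_d0_stride_mode8 = CIN_SUBTILE // 8     # 2 bytes: one 16-channel bitmap
dw_dense_d0 = 3 * 3 * weight_d0_stride_mode8  # 18: one bit-plane of a 3x3 kernel
pw_d0 = qw * weight_d0_stride_mode8           # 16: all 8 bit-planes of a 1x1 kernel
assert (dw_dense_d0, pw_d0) == (18, 16)
```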

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
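
The workaround is one line; a sketch of its shape (the import paths below are assumptions, since the PR only names PointerClass and float32_t):

```python
# float32_tPtr is not exported by Deeploy.AbstractDataTypes, so build the
# pointer type locally in FloatGemmTemplate.py.
# (Import locations are assumptions; only PointerClass/float32_t are per the PR.)
from Deeploy.AbstractDataTypes import PointerClass
from Deeploy.CommonExtensions.DataTypes import float32_t

float32_tPtr = PointerClass(float32_t)
```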

Test plan

Run on gvsoc gap9.evk inside ghcr.io/pulp-platform/deeploy-gap9:devel. Dispatch verified for every NE16-routed node (ne16_nnx_dispatch appears in the generated Network.c):

| Test | L1 | Buffer | Errors | Runtime (cycles) |
| --- | --- | --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | single | 0 / 1152 | ~900k |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D | 32000 | single | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | single | 0 / 1280 | ~27k |
| Kernels/Integer/Conv/DW_2D_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/StriddedPadded_2D_RQ | 32000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | double | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | double | 0 | |

Follow-up (out of scope):

  • PW_2D_RQ/Unsigned_RQ uses int8 input. Ne16TestConf.py only supports uint8, and the NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign propagation (shift int8 → uint8 and adjust weight_offset; see the sketch after this list).
  • 3x3 dense-conv kernel tests don't exist in Tests/Kernels/Integer/Conv/ today (Regular_2D_RQ is 8×8); coverage comes via the model path once the remaining tiling-system edge cases are resolved.
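
For reference, the sign-propagation identity behind that follow-up, checked with a toy numpy dot product standing in for the conv (nothing below is NE16 code):

```python
import numpy as np

# conv(x_s8, w) == conv(x_s8 + 128, w) - 128 * sum(w)  (per output channel),
# so shifting the input into uint8 range can be compensated in the bias.
rng = np.random.default_rng(0)
x_s8 = rng.integers(-128, 128, size=(16,), dtype=np.int64)
w = rng.integers(-128, 128, size=(16,), dtype=np.int64)
bias = 7

x_u8 = x_s8 + 128                  # now in [0, 255], NE16-friendly
bias_adj = bias - 128 * w.sum()    # compensation folded into the bias
assert x_s8 @ w + bias == x_u8 @ w + bias_adj
```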

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl force-pushed the gap9-ne16 branch 12 times, most recently from 4edb011 to 748707a (April 14, 2026, 08:54)
runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift (a scalar sketch follows this list)
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer
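
A scalar sketch of the folding rule DequantQuantMergePass applies (toy arithmetic only; the real pass rewrites graph nodes and derives RequantShift parameters per Deeploy's conventions):

```python
import numpy as np

def fold_dequant_quant(x_int: np.ndarray, s_deq: float, s_q: float):
    # Dequant computes x * s_deq; the following Quant computes round(x / s_q).
    if np.isclose(s_deq, s_q):
        return x_int              # adjacent pair folds to identity
    # Otherwise fold into one integer rescale, RequantShift-style: (x * mul) >> shift.
    shift = 16
    mul = int(round(s_deq / s_q * (1 << shift)))
    return (x_int.astype(np.int64) * mul) >> shift
```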

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 (April 14, 2026, 10:50)
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)
The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9
SDK autotiler CNN_Libraries_HWPE directory), but this path was missing
from the cmake include directories, causing build failures on plain GAP9.
KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK but
the template was using (v4s){0,0,0,0} (signed), causing a compilation
error on GCC with -Werror.
Runs each NE16 conv kernel with --profileTiling after the normal test
suite to collect cycle counts from gvsoc.
profileTiling generates code calling getCycles() in Network.c but the
header declaring it was not included. Add CycleCounter.h to both GAP9
and NE16 platform include lists, and expose the GAP9 inc/ directory to
the network target so the header is found at compile time.
GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The
profiling step needs a clean rebuild with LTO disabled to avoid
conflicts with the cached LTO-enabled build from the test step.
The --enable-3x3 flag was parsed by the runner script but never passed
to generateNetwork.py, so NE16Engine.enable3x3 was always False. DW 3x3
and Dense 3x3 convolutions silently fell back to the PULP cluster
instead of dispatching to NE16. Add the flag and set it on the engine.
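
The fix itself is plain flag plumbing; a minimal sketch, assuming the runner gathers generator arguments into a gen_args list as the commit text describes:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enable-3x3", dest="enable3x3", action="store_true")
args, _ = parser.parse_known_args()

gen_args = []
if args.enable3x3:
    gen_args.append("--enable-3x3")   # forwarded on to generateNetwork.py
```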
Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly
benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8)
is too small — NE16 dispatch overhead dominates at only 12.8%
utilization. Also wire --enable-3x3 through deeployRunner gen_args.
…r model test

Adds Models/MLPerf/VisualWakeWords (MobileNetV1 96x96, 27 Convs) to the NE16
L2 single-buffer model test set with l1=60000, and switches the model test
fixture to gen_args=[] (no --enable-3x3) so that:

- 13 pointwise 1x1 convs dispatch to NE16 (~82.7% of total MACs)
- 1 stride-2 dense + 13 depthwise convs fall back to the GAP9 cluster

The full --enable-3x3 path with mixed-engine graphs (some convs on NE16,
others on cluster) still has two known issues — the NE16Deployer global swap
to NCHWtoNHWCPass leaves cluster-fallback DW weights in NE16 NHWC layout,
and the PULP DW tile constraint assumes NCHW input layout — so this commit
intentionally avoids that path and runs DW/dense on the cluster.

gvsoc gap9.evk: 0 errors / 2 outputs, runtime 1 847 256 cycles
(MAC/Cyc = 4.05, vs 35 theoretical peak for full-NE16 MobileNetV1).

All existing NE16 kernel tests still pass (10/10 in gap9_w_ne16_tiled suite).
…able-3x3

The previous design replaced the global PULPNCHWtoNHWCPass with NCHWtoNHWCPass
whenever --enable-3x3 was on, which forced *every* depthwise conv (including
ones that fall back to the GAP9 cluster for stride > 1) into the NE16 NHWC
weight layout (cin/g=1, H, W, cout). PULPDWConv2DParser checks
group == weight.shape[0] and rejected those nodes, breaking codegen for any
graph that mixes NE16 and cluster DW convs.

Switch back to the default PULPNCHWtoNHWCPass — which produces cluster-friendly
weights (cout, H, W, cin/g) and skips the input transpose — and add a new
NE16-only fixup pass inside NE16OptimizationPass that, only for engine == "NE16"
DW convs:
  - transposes the weight to NE16 NHWC layout (1, H, W, cout)
  - inserts the NCHW -> NHWC input transpose NE16 needs

Cluster-colored DW convs are left in the PULP layout so PULPDWConv2DParser
and its DW tile constraint (NCHW input) keep working.
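
For reference, a numpy illustration of the two DW weight layouts in play (toy shapes; axis orders per the commit text above):

```python
import numpy as np

cout, H, W = 8, 3, 3
w_pulp = np.arange(cout * H * W).reshape(cout, H, W, 1)  # cluster layout (cout, H, W, cin/g)
w_ne16 = np.transpose(w_pulp, (3, 1, 2, 0))              # NE16 NHWC layout (1, H, W, cout)
assert w_ne16.shape == (1, H, W, cout)
# Only engine == "NE16" DW convs get this transpose (plus the NCHW -> NHWC
# input transpose); cluster-colored DW convs keep w_pulp.
```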

Restores --enable-3x3 for the model fixture and re-enables DW + PW on NE16
for MobileNetV1 (MLPerf VisualWakeWords).

gvsoc gap9.evk:
  - PW-only baseline: 1 847 256 cycles, MAC/Cyc 4.05
  - PW + DW on NE16:  1 190 437 cycles, MAC/Cyc 6.29  (-35.6% cycles, +55% MAC/Cyc)
  - NE16 dispatch covers 91.3% of total MACs; only 6 stride-2 layers remain on cluster.

All 10 existing NE16 tests (9 kernels + MobileNetV1) still pass.
With the engine-aware DW NHWC fixup pass in place, the previous saturation
failure on stride-2 NE16 dispatch is gone — the root cause was the global
NHWC swap forcing wrong layout, not a HAL-level bug. Add --enableStrides
alongside --enable-3x3 to the model fixture so all 27 MobileNetV1 convs go
to NE16 (no cluster fallback).

gvsoc gap9.evk:
  - PW-only:                            1 847 256 cyc  MAC/Cyc 4.05
  - PW + DW-s1 (--enable-3x3):          1 190 437 cyc  MAC/Cyc 6.29
  - All convs (--enable-3x3 + Strides):   845 217 cyc  MAC/Cyc 8.86

Final speedup vs PW-only baseline: 2.19x (-54.2% cycles).
NE16 dispatch count goes from 14 -> 28 (all 27 Convs + the final Gemm-as-PW),
cluster path runs only the residual MaxPool.

All 10 NE16 tests still pass (9 kernels + MobileNetV1).
The NE16 Dense 3x3 tile constraint constructed a rank-3 WeightCube
(COffset, 0, 0), (CSize, weightShape[-2], weightShape[-1]) for a rank-4
encoded weight buffer (cout, cinMajor, bits, H*W*cinMinorBytes). With
single-buffer everything fit in one tile so the offset stayed 0 and the
left-padded rectangle landed safely. Double-buffer splits cout across
multiple tiles, so the cout offset shifted onto the cinMajor axis (size 1)
during _legalizeTransfers and tripped the rectangle-bounds assertion.

Construct a rank-matching rectangle from the start. Same fix applies to
both the standard L2 path and the WeightMemory_SRAM path.
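
The shape of the fix, with a plain (offsets, dims) tuple standing in for Deeploy's HyperRectangle (variable names mirror the commit text; toy values):

```python
# Encoded DW/Dense weight buffer is rank-4: (cout, cinMajor, qw, H*W*cinMinorBytes).
weightShape = (64, 1, 8, 18)
COffset, CSize = 32, 32        # cout tile picked by the solver under double buffering

# Before: rank-3 rectangle against a rank-4 buffer; during _legalizeTransfers the
# cout offset lands on the cinMajor axis (size 1) and trips the bounds assertion.
rect_bad = ((COffset, 0, 0), (CSize, weightShape[-2], weightShape[-1]))

# After: rank-matching rectangle; the cout offset stays on the cout axis.
rect_good = ((COffset, 0, 0, 0), (CSize,) + weightShape[1:])
```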

Also register the missing test_gap9_w_ne16_tiled_models_l2_doublebuffer
pytest function and L2_DOUBLEBUFFER_MODELS config so MobileNetV1
(MLPerf/VisualWakeWords) covers both buffer modes.

gvsoc gap9.evk:
  - L1=128000 single buffer: 791 461 cyc, MAC/Cyc 9.46
  - L1=90000  double buffer: 855 488 cyc, MAC/Cyc 8.75

Double buffer at full 128KB L1 hits a runtime allocator failure because
the encoded-weight footprint plus double-buffered tile pairs exceeds the
budget; capping double-buffer L1 at 90KB leaves runtime headroom while
the tile solver still picks reasonable tile sizes. For this small VWW
0.25x model (7.5M MACs, 96x96) double buffering doesn't beat single
buffer — the layers are too compute-light to amortize the smaller tiles.
The infrastructure is now in place for larger models where DB pays off.

All 11 NE16 tests pass (9 kernels + 2 models).
Mirrors ci-platform-gap9-tiled.yml's coverage — the NE16 workflow only had
kernel jobs, so the new MobileNetV1 (MLPerf/VisualWakeWords) model tests
weren't running on upstream's runners.
…lows)

The NE16 CI runs on ecab71e and 7f6ce26 tripped on git's "dubious
ownership" check for TargetLibraries/CMSIS/third_party/CMSIS-NN during
submodule fetch, killing the job before any pytest started. The CMSIS-NN
submodule SHA hasn't changed since 822dd32 (where this workflow last
passed), so this is purely a runner workspace-ownership issue.

Add the same 'mark workspace as safe' step that every other platform's
_runner-*.yml already uses (siracusa, snitch, gap9, gap9-tiled, chimera,
generic, cortexm, mempool, siracusa-neureka-tiled, ...).
runwangdl added 4 commits May 14, 2026 21:01
The pass was added during the NE16 Linear PR integration (6c8ae2b) and
matches every Gemm/RequantizedGemm node without checking the engine
attribute, so cluster-bound GEMMs (e.g. MLPerf AnomalyDetection's 10
Gemm+RQ layers — they never run on NE16) had their mul/bias rewritten
into NE16 scale/scale_n/shift-diff layout. The cluster pulp_nn_linear
kernel then consumed the rewritten constants under its original integer
contract and produced ±1 mismatches versus the int8 reference outputs.

Mirror the existing NE16AdjustWeightMemoryLayoutPass: bail out for
nodes whose engine attr isn't "NE16". Pure-GAP9 cluster Gemms keep
Deeploy's Generic + PULPGEMMRequantMergePass layout (including the
bias += div/2 rounding compensation), matching the reference.

gvsoc gap9.evk (Models/MLPerf/AnomalyDetection L1=64000):
  - before: 33/640 errors (all ±1), Runtime 89110 cycles
  - after:    0/640 errors,         Runtime 79332 cycles
  - devel base 3b011bb (where bug doesn't exist): 0/640, 78500 cycles

The gap9_tiled L2 single-buffer model suite goes from 9/11 to 10/11 passing. The
remaining failure (MLPerf/ImageClassification, parser backtracking on
a standalone RequantShift node) is unrelated to GEMM and pre-dates
this fix.
GAP9Platform's loweringPasses had PULPAddRequantMergePass commented out
without an explanation. Without it, RequantShifts that follow an Add
(residual quantize step in ResNet-like blocks) never get folded into a
RequantizedAdd, so they survive standalone into the backend with their
float32 scalar mul/add intact. The PULP RequantShift bindings only
accept int32 mul/add, so parser-side type checking rejects every
binding and the whole graph backtracks. MLPerf/ImageClassification
(int8 ResNet, 14 RequantShifts, 3 Adds) was the visible victim — at
the devel base 3b011bb it passed because PULPAddRequantMergePass was
active there.

Restore the pass to the lowering chain alongside its conv/gemm/matmul
peers. The model now parses, builds and runs cleanly:
  - Models/MLPerf/ImageClassification on gvsoc gap9.evk:
      0 / 10 errors, 1 365 882 cycles
      (devel base: 0 / 10 errors, 1 368 399 cycles — bit-equivalent)

Also tighten RQSSplitPass._split_rqs_fun: when it duplicates the
mul/add constants for each downstream RQ, cast their values to int32
if they are integer-valued. Source ONNX often stores RequantShift's
scalar mul/add as float32 because they get folded into Conv/Gemm bias
later. With the new split, they survive standalone, so type-checking
needs them to already be int32. This is a defense-in-depth fix —
PULPAddRequantMergePass alone already unblocks the failing model.
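
The defensive cast, sketched as a standalone helper (the helper name is illustrative; the real change lives inline in RQSSplitPass._split_rqs_fun):

```python
import numpy as np

def as_int32_if_integral(arr: np.ndarray) -> np.ndarray:
    # RequantShift's scalar mul/add often arrive as float32 in the source ONNX;
    # once they survive standalone, they must type-check as int32.
    if np.issubdtype(arr.dtype, np.floating) and np.all(arr == np.round(arr)):
        return arr.astype(np.int32)
    return arr
```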

gap9_tiled and models and singlebuffer and l2: 11 / 11 pass
(was 9 / 11 before, then 10 / 11 after the GEMM engine-check fix).
Two FP32 kernel tests regressed since the NE16 Linear PR integration:

1. FP32 GEMM/Regular: the new GAP9_NE16GEMMInt32Mapper was prepended to
   the plain Gemm op binding list. It uses the same GEMMParser class as
   the FloatGEMM and GEMMDequant mappers, and Deeploy keys candidate
   bindings by parser class — so listing the NE16Int32 mapper first
   masked the other two. FP32 inputs failed the NE16Int32 type check
   and there was no backtrack to the FloatGEMM mapper.

   Drop GAP9_NE16GEMMInt32Mapper from the Gemm op list (the int8/uint8
   path is already covered by RequantizedGemm with its own NE16 mapper).

2. FP32 Reshape/SkipConnection: MatMulAddMergePass fused MatMul+Add
   into a Gemm with alpha=beta=1 and no transA/transB. The MatMul inputs
   in this test don't share Gemm semantics, so the merge produced
   wrong outputs (16/16 errors). Devel base doesn't include this pass
   in the GAP9 lowering chain and the test passes there. Disable it
   for GAP9 to match.

gvsoc gap9.evk:
  - FP32 GEMM/Regular:           0/1024 errors, 28987 cycles
  - FP32 Reshape/SkipConnection: 0/16 errors,    8343 cycles
  - devel base: 0/1024, 28k-ish; 0/16, 8020 cycles — same shape.

gap9_tiled suite: 96 passed / 1 failed (was 92/5).
Remaining failure (Models/CCT/FP32/CCT_2_32_32_128) also fails at
3b011bb devel base — pre-existing, unrelated to this PR.