Conversation
force-pushed from 4edb011 to 748707a
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer (reuses the ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers, Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses). The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split); a rough sketch follows this message. ConvTemplate subtile constants are set per ne16_task_defs.h (output 3x3, weight stride bytes PW=16, DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms list, mapPlatform, setupMemoryPlatform, and mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper branch (without it, NE16AdjustWeightMemoryLayoutPass never fires and parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for GAP9_w_NE16 too (so chip.soc.mram.bin is produced before the gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16 alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for the GAP9_w_NE16 platform, add_subdirectory on pulp-nnx with USE_NE16=ON and link it into deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define it locally via PointerClass(float32_t) to unblock the import chain reached by NE16Platform.

Verified on gvsoc gap9.evk:
- PW 1x1 RQ (Regular_RQ): 0/1152 errors, 901917 cycles
- DW 3x3 RQ (DW_2D_RQ): 0/1280 errors, 27339 cycles (--enable-3x3)
- Dense 3x3 (Regular_2D_RQ): 0/6372 errors, 244595 cycles (--enable-3x3)
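For orientation, here is a rough, hypothetical numpy sketch of the bit-plane packing that _weightEncode implies, based only on the description above (single CIN_SUBTILE=16 mode; the rank-4 (cout, cinMajor, bits, H*W*cinMinorBytes) output matches the encoded layout referenced later in this thread). The authoritative axis order is in pulp-nnx/test/Ne16Weight.py.

```python
import numpy as np

def ne16_weight_encode(weight: np.ndarray, qw: int) -> np.ndarray:
    # Hypothetical paraphrase, not the actual port. weight: (cout, cin, H, W)
    # with qw-bit entries already stored as uint8.
    cout, cin, h, w = weight.shape
    cin_subtile = 16  # single CIN_SUBTILE=16 mode
    cin_major = (cin + cin_subtile - 1) // cin_subtile
    # Zero-pad cin up to a multiple of the 16-channel subtile.
    padded = np.zeros((cout, cin_major * cin_subtile, h, w), dtype=np.uint8)
    padded[:, :cin] = weight.astype(np.uint8)
    # (cout, cinMajor, cinMinor, H*W)
    padded = padded.reshape(cout, cin_major, cin_subtile, h * w)
    # Split each weight into its qw bit planes (little-endian bit order).
    bits = np.unpackbits(padded[..., None], axis=-1, bitorder="little")[..., :qw]
    # Reorder to (cout, cinMajor, qw, H*W, cinMinor) so each (bit plane,
    # spatial position) pair carries the 16 channel bits of one subtile...
    bits = bits.transpose(0, 1, 4, 3, 2)
    # ...then pack those 16 bits into 2 bytes per position.
    packed = np.packbits(bits, axis=-1, bitorder="little")
    # Rank-4 encoded buffer: (cout, cinMajor, bits, H*W*cinMinorBytes).
    return packed.reshape(cout, cin_major, qw, h * w * 2)
```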
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings.
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format.
- The template detects whether the input is signed; if so, it adds a +128 offset to the input at C runtime and compensates via the bias (see the sketch after this list).
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates.
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift.
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer.

Bug fixes:
- Add an output signedness check in QuantChecker.
- Fix the L3 DMA template (add proper casts) and remove the blocking L3 DMA hack.
- Isolate the dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3.
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers.

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
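The +128 offset trick is standard zero-point algebra; a minimal sketch (hypothetical helper name, assuming a (cout, cin) int8 weight matrix):

```python
import numpy as np

def compensate_bias_for_unsigned_input(weight: np.ndarray,
                                       bias: np.ndarray) -> np.ndarray:
    # NE16 consumes uint8 activations. Shifting an int8 input x_s to
    # x_u = x_s + 128 at runtime changes each accumulator by
    # 128 * sum(W[row]), so the correction is folded into the bias offline:
    #   W @ x_s = W @ (x_u - 128) = W @ x_u - 128 * W.sum(axis=1)
    return bias - 128 * weight.astype(np.int32).sum(axis=1)
```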
force-pushed from b8087fc to b3f40e5
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 → CNN_Libraries_HWPE (the actual gap9-sdk path); skip the SDK CNN_BasicKernels_NE16.c source for the GAP9_w_NE16 platform (it uses the pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine with a trimmed includeList (no CNN_BasicKernels_NE16.h / ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull in the SDK NE16 header alongside pulp-nnx's ne16_task_defs.h: the NE16_REG_* macros are defined in both and trigger -Werror redefinitions.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's private GitHub Container Registry. Only upstream's self-hosted runners have credentials to pull it; on fork CI runs (ubuntu-latest) the docker pull fails with 'Error response from daemon: denied' and the whole job is reported as a failure. Guard the select-env entry of all three gap9 workflows (ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they skip cleanly on forks instead of failing. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires opSigned == outputTypeSigned. The existing Generic and PULPOpen bindings only registered the signed-int8 output variant, so any Quant pattern with signed=0 (e.g. 4-bit unsigned quantization in Models/Transformer_DeepQuant) had no candidate and parsing exhausted its backtracking. Add a uint8 output to BasicQuantBindings and a uint8 input to BasicDequantBindings in both Targets/Generic/Bindings.py and Targets/PULPOpen/Bindings.py. Verified: Models/Transformer_DeepQuant network generation now succeeds for both the Generic and Siracusa platforms.
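For reference, the constraint itself is trivial; a paraphrase of the check described above (not QuantChecker's literal code):

```python
def check_output_type(op_signed: bool, output_type_signed: bool) -> bool:
    # A Quant op with signed=0 only admits unsigned output bindings, and
    # vice versa; without a uint8 variant registered, signed=0 had no match.
    return op_signed == output_type_signed
```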
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner ('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist workers compile in parallel. Two workers leave enough headroom on the standard 7-GB runner. (Pre-existing flake; it surfaced as a hard fail in CI runs that happen to land both heavy FP32 GEMM compilations on adjacent workers.)
The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9 SDK autotiler CNN_Libraries_HWPE directory), but this path was missing from the CMake include directories, causing build failures on plain GAP9.
KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK, but the template was using (v4s){0,0,0,0} (signed), causing a compilation error on GCC with -Werror.
Runs each NE16 conv kernel with --profileTiling after the normal test suite to collect cycle counts from gvsoc.
profileTiling generates code calling getCycles() in Network.c but the header declaring it was not included. Add CycleCounter.h to both GAP9 and NE16 platform include lists, and expose the GAP9 inc/ directory to the network target so the header is found at compile time.
GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The profiling step needs a clean rebuild with LTO disabled to avoid conflicts with the cached LTO-enabled build from the test step.
The --enable-3x3 flag was parsed by the deeployRunner script but never forwarded to generateNetwork.py's gen_args, so NE16Engine.enable3x3 was always False: DW 3x3 and Dense 3x3 convolutions silently fell back to the PULP cluster instead of dispatching to NE16. Forward the flag and set it on the engine (a minimal sketch follows).
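A minimal sketch of the forwarding fix, with hypothetical variable names (the real runner's plumbing differs):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enable-3x3", action="store_true")
args, _ = parser.parse_known_args()

gen_args = []          # extra arguments handed to generateNetwork.py
if args.enable_3x3:    # previously parsed here but then dropped
    gen_args.append("--enable-3x3")  # now reaches NE16Engine.enable3x3
```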
Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8) is too small — NE16 dispatch overhead dominates at only 12.8% utilization. Also wire --enable-3x3 through deeployRunner gen_args.
…r model test

Adds Models/MLPerf/VisualWakeWords (MobileNetV1 96x96, 27 Convs) to the NE16 L2 single-buffer model test set with l1=60000, and switches the model test fixture to gen_args=[] (no --enable-3x3) so that:
- 13 pointwise 1x1 convs dispatch to NE16 (~82.7% of total MACs)
- 1 stride-2 dense + 13 depthwise convs fall back to the GAP9 cluster

The full --enable-3x3 path with mixed-engine graphs (some convs on NE16, others on the cluster) still has two known issues: the NE16Deployer global swap to NCHWtoNHWCPass leaves cluster-fallback DW weights in NE16 NHWC layout, and the PULP DW tile constraint assumes NCHW input layout. This commit therefore intentionally avoids that path and runs DW/dense on the cluster.

gvsoc gap9.evk: 0 errors / 2 outputs, runtime 1 847 256 cycles (MAC/Cyc = 4.05, vs 35 theoretical peak for full-NE16 MobileNetV1). All existing NE16 kernel tests still pass (10/10 in the gap9_w_ne16_tiled suite).
…able-3x3

The previous design replaced the global PULPNCHWtoNHWCPass with NCHWtoNHWCPass whenever --enable-3x3 was on, which forced *every* depthwise conv (including ones that fall back to the GAP9 cluster for stride > 1) into the NE16 NHWC weight layout (cin/g=1, H, W, cout). PULPDWConv2DParser checks group == weight.shape[0] and rejected those nodes, breaking codegen for any graph that mixes NE16 and cluster DW convs.

Switch back to the default PULPNCHWtoNHWCPass (which produces cluster-friendly weights (cout, H, W, cin/g) and skips the input transpose) and add a new NE16-only fixup pass inside NE16OptimizationPass that, only for engine == "NE16" DW convs (sketched after this message):
- transposes the weight to the NE16 NHWC layout (1, H, W, cout)
- inserts the NCHW -> NHWC input transpose NE16 needs

Cluster-colored DW convs are left in the PULP layout so PULPDWConv2DParser and its DW tile constraint (NCHW input) keep working. Restores --enable-3x3 for the model fixture and re-enables DW + PW on NE16 for MobileNetV1 (MLPerf VisualWakeWords).

gvsoc gap9.evk:
- PW-only baseline: 1 847 256 cycles, MAC/Cyc 4.05
- PW + DW on NE16: 1 190 437 cycles, MAC/Cyc 6.29 (-35.6% cycles, +55% MAC/Cyc)
- NE16 dispatch covers 91.3% of total MACs; only 6 stride-2 layers remain on the cluster.

All 10 existing NE16 tests (9 kernels + MobileNetV1) still pass.
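A schematic of the engine-scoped fixup (hypothetical node API; only the gating and the transpose axes come from the description above):

```python
import numpy as np

def ne16_dw_fixup(node) -> None:
    # Only NE16-colored depthwise convs get the NE16 NHWC treatment;
    # cluster-colored ones keep the PULP layout (cout, H, W, cin/g).
    if node.attrs.get("engine") != "NE16":
        return
    w = node.weight  # cluster layout (cout, H, W, cin/g) with cin/g == 1
    node.weight = np.transpose(w, (3, 1, 2, 0))  # NE16 NHWC: (1, H, W, cout)
    insert_transpose_before(node, perm="NCHW->NHWC")  # hypothetical graph edit
```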
With the engine-aware DW NHWC fixup pass in place, the previous saturation failure on stride-2 NE16 dispatch is gone: the root cause was the global NHWC swap forcing the wrong layout, not a HAL-level bug. Add --enableStrides alongside --enable-3x3 to the model fixture so all 27 MobileNetV1 convs go to NE16 (no cluster fallback).

gvsoc gap9.evk:
- PW-only: 1 847 256 cyc, MAC/Cyc 4.05
- PW + DW-s1 (--enable-3x3): 1 190 437 cyc, MAC/Cyc 6.29
- All convs (--enable-3x3 + --enableStrides): 845 217 cyc, MAC/Cyc 8.86

Final speedup vs the PW-only baseline: 2.19x (-54.2% cycles). The NE16 dispatch count goes from 14 to 28 (all 27 Convs + the final Gemm-as-PW); the cluster path runs only the residual MaxPool. All 10 NE16 tests still pass (9 kernels + MobileNetV1).
The NE16 Dense 3x3 tile constraint constructed a rank-3 WeightCube ((COffset, 0, 0), (CSize, weightShape[-2], weightShape[-1])) for a rank-4 encoded weight buffer (cout, cinMajor, bits, H*W*cinMinorBytes). With a single buffer everything fit in one tile, so the offset stayed 0 and the left-padded rectangle landed safely. Double buffering splits cout across multiple tiles, so the cout offset shifted onto the cinMajor axis (size 1) during _legalizeTransfers and tripped the rectangle-bounds assertion. Construct a rank-matching rectangle from the start (see the sketch after this message). The same fix applies to both the standard L2 path and the WeightMemory_SRAM path.

Also register the missing test_gap9_w_ne16_tiled_models_l2_doublebuffer pytest function and the L2_DOUBLEBUFFER_MODELS config so MobileNetV1 (MLPerf/VisualWakeWords) covers both buffer modes.

gvsoc gap9.evk:
- L1=128000, single buffer: 791 461 cyc, MAC/Cyc 9.46
- L1=90000, double buffer: 855 488 cyc, MAC/Cyc 8.75

Double buffering at the full 128KB L1 hits a runtime allocator failure because the encoded-weight footprint plus double-buffered tile pairs exceeds the budget; capping double-buffer L1 at 90KB leaves runtime headroom while the tile solver still picks reasonable tile sizes. For this small VWW 0.25x model (7.5M MACs, 96x96) double buffering doesn't beat single buffering: the layers are too compute-light to amortize the smaller tiles. The infrastructure is now in place for larger models where DB pays off.

All 11 NE16 tests pass (9 kernels + 2 models).
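An illustration of the rank mismatch (HyperRectangle here is a stand-in namedtuple, not Deeploy's class; shapes are example values consistent with the encoded layout above):

```python
from collections import namedtuple

HyperRectangle = namedtuple("HyperRectangle", ["offset", "dims"])

# Rank-4 encoded weight: (cout, cinMajor, bits, H*W*cinMinorBytes).
weightShape = (64, 2, 8, 18)
cOffset, cSize = 32, 32  # cout tile chosen by the solver (example values)

# Buggy rank-3 rectangle: under double buffering the cout offset later
# shifts onto the size-1 cinMajor axis and trips the bounds assertion.
bad = HyperRectangle((cOffset, 0, 0),
                     (cSize, weightShape[-2], weightShape[-1]))

# Fix: build a rectangle whose rank matches the buffer from the start.
good = HyperRectangle((cOffset, 0, 0, 0),
                      (cSize, weightShape[1], weightShape[2], weightShape[3]))
```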
Mirrors ci-platform-gap9-tiled.yml's coverage — the NE16 workflow only had kernel jobs, so the new MobileNetV1 (MLPerf/VisualWakeWords) model tests weren't running on upstream's runners.
…lows)

The NE16 CI runs on ecab71e and 7f6ce26 tripped on git's "dubious ownership" check for TargetLibraries/CMSIS/third_party/CMSIS-NN during the submodule fetch, killing the job before any pytest started. The CMSIS-NN submodule SHA hasn't changed since 822dd32 (where this workflow last passed), so this is purely a runner workspace-ownership issue. Add the same "Mark workspace as safe" step that every other platform's _runner-*.yml already uses (siracusa, snitch, gap9, gap9-tiled, chimera, generic, cortexm, mempool, siracusa-neureka-tiled, ...).
The pass was added during the NE16 Linear PR integration (6c8ae2b) and matches every Gemm/RequantizedGemm node without checking the engine attribute, so cluster-bound GEMMs (e.g. MLPerf AnomalyDetection's 10 Gemm+RQ layers, which never run on NE16) had their mul/bias rewritten into the NE16 scale/scale_n/shift-diff layout. The cluster pulp_nn_linear kernel then consumed the rewritten constants under its original integer contract and produced ±1 mismatches versus the int8 reference outputs.

Mirror the existing NE16AdjustWeightMemoryLayoutPass: bail out for nodes whose engine attr isn't "NE16" (sketched after this message). Pure-GAP9 cluster Gemms keep Deeploy's Generic + PULPGEMMRequantMergePass layout (including the bias += div/2 rounding compensation), matching the reference.

gvsoc gap9.evk (Models/MLPerf/AnomalyDetection, L1=64000):
- before: 33/640 errors (all ±1), runtime 89110 cycles
- after: 0/640 errors, runtime 79332 cycles
- devel base 3b011bb (where the bug doesn't exist): 0/640 errors, 78500 cycles

The gap9_tiled L2 single-buffer model suite goes from 9/11 to 10/11 passing. The remaining failure (MLPerf/ImageClassification, parser backtracking on a standalone RequantShift node) is unrelated to GEMM and pre-dates this fix.
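The guard itself is a one-line bail-out; a hypothetical sketch of the shape it takes (Deeploy's actual pass machinery differs):

```python
def adjust_gemm_requant_for_ne16(node) -> None:
    # Mirror NE16AdjustWeightMemoryLayoutPass: leave cluster-bound Gemms
    # in the Generic + PULPGEMMRequantMergePass layout.
    if node.attrs.get("engine") != "NE16":
        return
    rewrite_requant_to_ne16_layout(node)  # hypothetical helper
```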
GAP9Platform's loweringPasses had PULPAddRequantMergePass commented out without explanation. Without it, RequantShifts that follow an Add (the residual quantize step in ResNet-like blocks) never get folded into a RequantizedAdd, so they survive standalone into the backend with their float32 scalar mul/add intact. The PULP RequantShift bindings only accept int32 mul/add, so parser-side type checking rejects every binding and the whole graph backtracks. MLPerf/ImageClassification (int8 ResNet, 14 RequantShifts, 3 Adds) was the visible victim; at the devel base 3b011bb it passed because PULPAddRequantMergePass was active there.

Restore the pass to the lowering chain alongside its conv/gemm/matmul peers. The model now parses, builds and runs cleanly:
- Models/MLPerf/ImageClassification on gvsoc gap9.evk: 0/10 errors, 1 365 882 cycles (devel base: 0/10 errors, 1 368 399 cycles, bit-equivalent)

Also tighten RQSSplitPass._split_rqs_fun: when it duplicates the mul/add constants for each downstream RQ, cast their values to int32 if they are integer-valued (see the sketch after this message). Source ONNX often stores RequantShift's scalar mul/add as float32 because they get folded into the Conv/Gemm bias later; with the new split they survive standalone, so type checking needs them to already be int32. This is a defense-in-depth fix; PULPAddRequantMergePass alone already unblocks the failing model.

"gap9_tiled and models and singlebuffer and l2": 11/11 pass (was 9/11 before, then 10/11 after the GEMM engine-check fix).
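A sketch of the defensive cast, assuming numpy constant tensors (helper name hypothetical):

```python
import numpy as np

def cast_integral_floats_to_int32(values: np.ndarray) -> np.ndarray:
    # RequantShift mul/add often arrive as float32 holding exact integers;
    # once they survive the split standalone, the PULP bindings need int32.
    if np.issubdtype(values.dtype, np.floating) and np.all(values == np.round(values)):
        return values.astype(np.int32)
    return values
```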
Two FP32 kernel tests regressed since the NE16 Linear PR integration:

1. FP32 GEMM/Regular: the new GAP9_NE16GEMMInt32Mapper was prepended to the plain Gemm op binding list. It uses the same GEMMParser class as the FloatGEMM and GEMMDequant mappers, and Deeploy keys candidate bindings by parser class, so listing the NE16Int32 mapper first masked the other two (illustrated after this message). FP32 inputs failed the NE16Int32 type check and there was no backtrack to the FloatGEMM mapper. Drop GAP9_NE16GEMMInt32Mapper from the Gemm op list (the int8/uint8 path is already covered by RequantizedGemm with its own NE16 mapper).

2. FP32 Reshape/SkipConnection: MatMulAddMergePass fused MatMul+Add into a Gemm with alpha=beta=1 and no transA/transB. The MatMul inputs in this test don't share Gemm semantics, so the merge produced wrong outputs (16/16 errors). The devel base doesn't include this pass in the GAP9 lowering chain and the test passes there. Disable it for GAP9 to match.

gvsoc gap9.evk:
- FP32 GEMM/Regular: 0/1024 errors, 28987 cycles
- FP32 Reshape/SkipConnection: 0/16 errors, 8343 cycles
- devel base: 0/1024, ~28k cycles; 0/16, 8020 cycles (same shape)

gap9_tiled suite: 96 passed / 1 failed (was 92/5). The remaining failure (Models/CCT/FP32/CCT_2_32_32_128) also fails at the 3b011bb devel base: pre-existing and unrelated to this PR.
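A toy reproduction of the masking (a plain dict standing in for Deeploy's binding registry):

```python
from collections import namedtuple

Mapper = namedtuple("Mapper", ["name", "parser_class"])
mappers = [  # registration order: the NE16 mapper was prepended
    Mapper("GAP9_NE16GEMMInt32", "GEMMParser"),
    Mapper("FloatGEMM", "GEMMParser"),
    Mapper("GEMMDequant", "GEMMParser"),
]

candidates = {}
for m in mappers:
    candidates.setdefault(m.parser_class, m)  # first registration wins

print(candidates["GEMMParser"].name)
# -> "GAP9_NE16GEMMInt32": the two FP32-capable mappers are masked, so an
# FP32 Gemm fails the int32 type check with nothing to backtrack to.
```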
Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added
- Deeploy/Targets/NE16/: full Target (Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses). NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer. _weightEncode is ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode).
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py + DeeployTest/test_gap9_ne16_tiled_config.py: runner + kernel test config.
- DeeployTest/test_platforms.py: pytest functions test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer under marker gap9_w_ne16_tiled.
- .github/workflows/{ci-platform-gap9-w-ne16-tiled.yml, _runner-gap9-w-ne16-tiled.yml}: CI jobs (single + double buffer L2).
- TargetLibraries/GAP9/CMakeLists.txt: add_subdirectory(pulp-nnx) with USE_NE16=ON for GAP9_w_NE16.

Changed
- DeeployTest/testUtils/platformMapping.py: register GAP9_w_NE16 in names/mapPlatform/setupMemoryPlatform/mapDeployer.
- DeeployTest/testMVP.py: wrap the deployer with EngineColoringDeployerWrapper for GAP9_w_NE16 (without it NE16 nodes never get an engine color and parsing fails).
- DeeployTest/testUtils/core/execution.py: append the GAP9 SDK image build target for GAP9_w_NE16 (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16 alongside GAP9 in the platform branches.
- Deeploy/Targets/NE16/Templates/ConvTemplate.py: NE16 subtile constants per ne16_task_defs.h: CIN_SUBTILE 16, output 3, weight stride d0 = 3*3*weight_d0_stride_mode8 = 18 for DW/Dense (PW: qw * weight_d0_stride = 16). Emit the top-level ne16_task_t fields (weight_d0_stride, qw, subtile_output_channel, kernel_shape, depthwise) that the HW reads at dispatch time.
- Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py: DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so _weightEncode sees the standard (cout, 1, H, W) layout and produces the correct (1, 1, packed_bytes) single-block output expected by the NE16 HW.
- Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py: the DW weight is a single packed block (not per-cout); constrain weightOutChannelVar == Max and reuse the same HyperRectangle((0,0,0), weightShape) for every output-channel tile.
- Deeploy/Targets/NE16/Parsers.py: drop the group == shape[1] check in NE16DWConv2DParser (invalid under the post-encode rank-3 layout).

Fixed
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py: work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
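A sketch of the local workaround; the PointerClass and float32_t names come from the description above, while the float32_t import location is an assumption:

```python
from Deeploy.AbstractDataTypes import PointerClass
from Deeploy.CommonExtensions.DataTypes import float32_t  # assumed module path

# Local definition replacing the broken `float32_tPtr` import:
float32_tPtr = PointerClass(float32_t)
```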
Test plan

Run on gvsoc gap9.evk inside ghcr.io/pulp-platform/deeploy-gap9:devel. All verified dispatches (ne16_nnx_dispatch appears in the generated Network.c for NE16-routed nodes):
- Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ
- Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ
- Kernels/Integer/Conv/PW_2D
- Kernels/Integer/Conv/DW_2D_RQ
- Kernels/Integer/Conv/DW_2D_RQ
- Kernels/Integer/Conv/StriddedPadded_2D_RQ
- Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ
- Kernels/Integer/Conv/DW_2D_RQ

Follow-up (out of scope):
- PW_2D_RQ/Unsigned_RQ uses int8 input. Ne16TestConf.py only supports uint8 and the NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign-propagation (shift int8 → uint8 + adjust weight_offset).
- No large Dense Conv test exists under Tests/Kernels/Integer/Conv/ today (Regular_2D_RQ is 8×8); coverage is via the model path once the remaining tiling-system edge cases are resolved.

PR Merge Checklist
- The PR is based on a recent devel commit and pointing to devel.
- The CHANGELOG.md file has been updated.