
[NE16] Add GAP9_w_NE16 platform (NE16 accelerator on GAP9)#1

Open
runwangdl wants to merge 24 commits into devel from gap9-ne16

Conversation


runwangdl (Owner) commented Apr 13, 2026

Adds the NE16 neural engine as an accelerator Engine on top of the existing GAP9 platform, registered as a new composite platform GAP9_w_NE16 that mirrors the Siracusa_w_neureka pattern.

Added

  • Deeploy/Targets/NE16/ — full Target: Platform/Engine/Bindings/Parsers/Tiler/Deployer/Templates/TileConstraints/TopologyOptimizationPasses. NE16Platform extends GAP9Platform with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends GAP9Deployer. _weightEncode ported from pulp-nnx/test/Ne16Weight.py (single CIN_SUBTILE=16 mode; see the packing sketch after this list).
  • DeeployTest/deeployRunner_tiled_gap9_w_ne16.py + DeeployTest/test_gap9_ne16_tiled_config.py — runner + kernel test config.
  • DeeployTest/test_platforms.py — pytest functions test_gap9_w_ne16_tiled_kernels_l2_{single,double}buffer under marker gap9_w_ne16_tiled.
  • .github/workflows/{ci-platform-gap9-w-ne16-tiled.yml,_runner-gap9-w-ne16-tiled.yml} — CI jobs (single + double buffer L2).
  • TargetLibraries/GAP9/CMakeLists.txt — add_subdirectory(pulp-nnx) with USE_NE16=ON for GAP9_w_NE16.
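
For orientation, a toy numpy sketch of the bit-serial packing idea behind _weightEncode (illustrative only; the ported encoder additionally handles weight offsets and the cin-major/minor split):

```python
import numpy as np

# Toy sketch only: pack one CIN_SUBTILE=16 channel group into NE16 bit-planes.
# For each weight bit (qw of them), NE16 stores a 16-channel bitmap (2 bytes),
# giving 3*3*2 = 18 bytes per bit-plane for a 3x3 kernel (144 bytes at qw=8).
def pack_bitplanes(w_u8: np.ndarray, qw: int = 8) -> np.ndarray:
    cout, cin, H, W = w_u8.shape                         # offset-encoded (unsigned) weights
    assert cin == 16                                     # exactly one CIN_SUBTILE
    bits = (w_u8[..., None] >> np.arange(qw)) & 1        # (cout, cin, H, W, qw)
    bits = bits.transpose(0, 4, 2, 3, 1)                 # (cout, qw, H, W, cin)
    return np.packbits(bits.reshape(cout, qw, H * W, cin),
                       axis=-1, bitorder="little")       # (cout, qw, H*W, 2)
```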

Changed

  • DeeployTest/testUtils/platformMapping.py — register GAP9_w_NE16 in names/mapPlatform/setupMemoryPlatform/mapDeployer.
  • DeeployTest/testMVP.py — wrap deployer with EngineColoringDeployerWrapper for GAP9_w_NE16 (without it NE16 nodes never get an engine color and parsing fails).
  • DeeployTest/testUtils/core/execution.py — append the GAP9 SDK image build target for GAP9_w_NE16 (so chip.soc.mram.bin is produced before gvsoc run).
  • CMakeLists.txt, DeeployTest/CMakeLists.txt — accept GAP9_w_NE16 alongside GAP9 in the platform branches.
  • Deeploy/Targets/NE16/Templates/ConvTemplate.py — NE16 subtile constants per ne16_task_defs.h: CIN_SUBTILE = 16, output subtile 3, DW/Dense weight stride d0 = 3*3*weight_d0_stride_mode8 = 18 (PW: qw * weight_d0_stride = 16; this arithmetic is written out after this list). Emits the top-level ne16_task_t fields (weight_d0_stride, qw, subtile_output_channel, kernel_shape, depthwise) that the HW reads at dispatch time.
  • Deeploy/Targets/NE16/TopologyOptimizationPasses/Passes.py — DW weight layout: after Deeploy's NHWC→NCHW transpose, swap axes 0/1 once more so _weightEncode sees the standard (cout, 1, H, W) layout and produces the correct (1, 1, packed_bytes) single-block output expected by the NE16 HW.
  • Deeploy/Targets/NE16/TileConstraints/NE16DepthwiseConstraint.py — DW weight is a single packed block (not per-cout); constrain weightOutChannelVar == Max and reuse the same HyperRectangle((0,0,0), weightShape) for every output-channel tile.
  • Deeploy/Targets/NE16/Parsers.py — drop the group == shape[1] check in NE16DWConv2DParser (invalid under the post-encode rank-3 layout).
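
Written out, the stride arithmetic from the ConvTemplate bullet (constants as quoted from ne16_task_defs.h in this PR; a restatement for the reviewer, not code from the tree):

```python
CIN_SUBTILE = 16
qw = 8                                        # weight bits
weight_d0_stride_mode8 = CIN_SUBTILE // 8     # 2 bytes: one 16-channel bitmap
dw_dense_d0 = 3 * 3 * weight_d0_stride_mode8  # 18: one bit-plane of a 3x3 kernel
pw_d0 = qw * weight_d0_stride_mode8           # 16: all 8 bit-planes of a 1x1 kernel
assert (dw_dense_d0, pw_d0) == (18, 16)
```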

Fixed

  • Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py — work around a pre-existing ImportError: cannot import name 'float32_tPtr' from 'Deeploy.AbstractDataTypes' by defining it locally via PointerClass(float32_t).
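
The workaround is one line; a sketch of its shape (the import paths below are assumptions, since the PR only names PointerClass and float32_t):

```python
# float32_tPtr is not exported by Deeploy.AbstractDataTypes, so build the
# pointer type locally in FloatGemmTemplate.py.
# (Import locations are assumptions; only PointerClass/float32_t are per the PR.)
from Deeploy.AbstractDataTypes import PointerClass
from Deeploy.CommonExtensions.DataTypes import float32_t

float32_tPtr = PointerClass(float32_t)
```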

Test plan

Run on gvsoc gap9.evk inside ghcr.io/pulp-platform/deeploy-gap9:devel. Dispatch verified for every NE16-routed node (ne16_nnx_dispatch appears in the generated Network.c):

| Test | L1 | Buffer | Errors | Runtime (cycles) |
| --- | --- | --- | --- | --- |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | single | 0 / 1152 | ~900k |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D | 32000 | single | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | single | 0 / 1280 | ~27k |
| Kernels/Integer/Conv/DW_2D_RQ | 16000 | single | 0 | |
| Kernels/Integer/Conv/StriddedPadded_2D_RQ | 32000 | single | 0 | |
| Kernels/Integer/Conv/PW_2D_RQ/Regular_RQ | 32000 | double | 0 | |
| Kernels/Integer/Conv/DW_2D_RQ | 32000 | double | 0 | |

Follow-up (out of scope):

  • PW_2D_RQ/Unsigned_RQ uses int8 input. Ne16TestConf.py only supports uint8, and the NE16 HAL doesn't expose a signed-input conf0 flag; proper support needs sign propagation (shift int8 → uint8 and adjust weight_offset; see the sketch after this list).
  • 3x3 dense-conv kernel tests don't exist in Tests/Kernels/Integer/Conv/ today (Regular_2D_RQ is 8×8); coverage comes via the model path once the remaining tiling-system edge cases are resolved.
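
For reference, the sign-propagation identity behind that follow-up, checked with a toy numpy dot product standing in for the conv (nothing below is NE16 code):

```python
import numpy as np

# conv(x_s8, w) == conv(x_s8 + 128, w) - 128 * sum(w)  (per output channel),
# so shifting the input into uint8 range can be compensated in the bias.
rng = np.random.default_rng(0)
x_s8 = rng.integers(-128, 128, size=(16,), dtype=np.int64)
w = rng.integers(-128, 128, size=(16,), dtype=np.int64)
bias = 7

x_u8 = x_s8 + 128                  # now in [0, 255], NE16-friendly
bias_adj = bias - 128 * w.sum()    # compensation folded into the bias
assert x_s8 @ w + bias == x_u8 @ w + bias_adj
```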

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and points to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, change its link back after review.

runwangdl force-pushed the gap9-ne16 branch 12 times, most recently from 4edb011 to 748707a (April 14, 2026, 08:54)
runwangdl and others added 2 commits April 14, 2026 10:43
Mirrors the Siracusa_w_neureka pattern. NE16Platform extends GAP9Platform
with engines=[NE16Engine, GAP9ClusterEngine]; NE16Deployer extends
GAP9Deployer (reuses ClDma transformers via GAP9Bindings).

New Target: Deeploy/Targets/NE16/ (Platform, Engine, Bindings, Parsers,
Tiler, Deployer, Templates, TileConstraints, TopologyOptimizationPasses).
The _weightEncode function is ported from pulp-nnx/test/Ne16Weight.py
(single CIN_SUBTILE=16 mode, no 1x1 vs 3x3 split). ConvTemplate subtile
constants set per ne16_task_defs.h (output 3x3, weight stride bytes
PW=16 DW/Dense=144).

New test infrastructure:
- DeeployTest/deeployRunner_tiled_gap9_w_ne16.py
- DeeployTest/test_gap9_ne16_tiled_config.py (PW/DW/Dense RQ Conv)

DeeployTest wiring:
- testUtils/platformMapping.py: register GAP9_w_NE16 in the platforms
  list, mapPlatform, setupMemoryPlatform, mapDeployer.
- testMVP.py: include GAP9_w_NE16 in the EngineColoringDeployerWrapper
  branch (without it NE16AdjustWeightMemoryLayoutPass never fires and
  parsing backtracks to exhaustion).
- testUtils/core/execution.py: build the GAP9 SDK 'image' target for
  GAP9_w_NE16 too (so chip.soc.mram.bin is produced before gvsoc run).
- CMakeLists.txt, DeeployTest/CMakeLists.txt: accept GAP9_w_NE16
  alongside GAP9 in the platform branches.
- TargetLibraries/GAP9/CMakeLists.txt: for GAP9_w_NE16 platform,
  add_subdirectory on pulp-nnx with USE_NE16=ON and link it into
  deeploygap9.

Fix: Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py referenced
an undefined symbol float32_tPtr from Deeploy.AbstractDataTypes; define
it locally via PointerClass(float32_t) to unblock the import chain
reached by NE16Platform.

Verified on gvsoc gap9.evk:
  PW 1x1 RQ  (Regular_RQ):    0/1152 errors, 901917 cycles
  DW 3x3 RQ  (DW_2D_RQ):      0/1280 errors, 27339  cycles  (--enable-3x3)
  Dense 3x3  (Regular_2D_RQ): 0/6372 errors, 244595 cycles  (--enable-3x3)
- Add NE16 linear layer kernels, including a topology pass, NE16 templates, parsers, tile constraints, and bindings
- The topology pass recognizes NE16-compatible GEMM layers, adjusts the weight layout for the NE16, and converts the requant shift/scale to the NE16 format
- The template detects whether the input is signed; if so, it adds a +128 offset to the input during C runtime and compensates via the bias
- Add GAP9 SDK-based Dequant/Quant templates using CNN_Copy.c kernels, replacing the generic templates
- Add a generic DequantQuantMergePass that folds adjacent Dequant→Quant pairs into identity or RequantShift (a scalar sketch follows this list)
- Add a GAP9-specific TopologyOptimizer (GAP9Optimizer) to replace PULPOptimizer
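
A scalar sketch of the folding rule DequantQuantMergePass applies (toy arithmetic only; the real pass rewrites graph nodes and derives RequantShift parameters per Deeploy's conventions):

```python
import numpy as np

def fold_dequant_quant(x_int: np.ndarray, s_deq: float, s_q: float):
    # Dequant computes x * s_deq; the following Quant computes round(x / s_q).
    if np.isclose(s_deq, s_q):
        return x_int              # adjacent pair folds to identity
    # Otherwise fold into one integer rescale, RequantShift-style: (x * mul) >> shift.
    shift = 16
    mul = int(round(s_deq / s_q * (1 << shift)))
    return (x_int.astype(np.int64) * mul) >> shift
```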

Bug fixes:
- Add output signedness check in QuantChecker
- Fix L3 DMA template (add proper casts) and remove the blocking L3 DMA hack
- Isolate dory memory functions from other libraries in CMakeLists so they compile with -Og while compute kernels compile with -O3
- Disable PULPAddRequantMergePass due to incorrect pattern matching when Add has multiple consumers

Co-authored-by: runwangdl <samanthawangdl@gmail.com>
runwangdl force-pushed the gap9-ne16 branch 2 times, most recently from b8087fc to b3f40e5 (April 14, 2026, 10:50)
- TargetLibraries/GAP9/CMakeLists.txt: rename CNN_Libraries_NE16 →
  CNN_Libraries_HWPE (the actual gap9-sdk path); skip SDK
  CNN_BasicKernels_NE16.c source for GAP9_w_NE16 platform (it uses the
  pulp-nnx ne16 stack, so the SDK NE16 kernels are not needed).
- Deeploy/Targets/NE16/Platform.py: instantiate the GAP9ClusterEngine
  with a trimmed includeList (no CNN_BasicKernels_NE16.h /
  ne16_utils.h / CNN_Copy.h) so the generated Network.c does not pull
  in the SDK NE16 header alongside pulp-nnx ne16_task_defs.h — the
  NE16_REG_* macros are defined in both and trigger -Werror redefs.
ghcr.io/pulp-platform/deeploy-gap9:* is hosted in pulp-platform's
private GitHub Container Registry. Only upstream's self-hosted
runners have credentials to pull it; on fork CI runs (ubuntu-latest)
the docker pull fails with 'Error response from daemon: denied' and
the whole job is reported as failure.

Guard the select-env entry of all three gap9 workflows
(ci-platform-gap9.yml, -tiled.yml, -w-ne16-tiled.yml) so they SKIP
cleanly on forks instead of FAILING. Upstream behaviour is unchanged.
QuantChecker.checkOutputType (added by the NE16-Linear PR) requires
opSigned == outputTypeSigned. Existing Generic and PULPOpen bindings
only registered the signed-int8 output variant, so any Quant pattern
with signed=0 (e.g. 4-bit unsigned quantization in
Models/Transformer_DeepQuant) had no candidate and parsing exhausted
backtracking.

Add uint8 output to BasicQuantBindings and uint8 input to
BasicDequantBindings in both Targets/Generic/Bindings.py and
Targets/PULPOpen/Bindings.py.

Verified: Models/Transformer_DeepQuant network gen now succeeds for
both Generic and Siracusa platforms.
The Snitch FP32 GEMM/TransB-5000 build OOMs the GitHub-hosted runner
('std::bad_alloc' from the C compiler driver) when 4 pytest-xdist
workers compile in parallel. Two workers leave enough headroom on
the standard 7-GB runner.

(Pre-existing flake; surfaced as a hard fail in CI runs that happen
to land both heavy FP32 GEMM compilations on adjacent workers.)
The generated Network.c includes CNN_BasicKernels_NE16.h (from the GAP9
SDK autotiler CNN_Libraries_HWPE directory), but this path was missing
from the cmake include directories, causing build failures on plain GAP9.
KerConv_NE16_T.Pad is declared as v4u (unsigned) in the GAP9 SDK but
the template was using (v4s){0,0,0,0} (signed), causing a compilation
error on GCC with -Werror.
Runs each NE16 conv kernel with --profileTiling after the normal test
suite to collect cycle counts from gvsoc.
profileTiling generates code calling getCycles() in Network.c but the
header declaring it was not included. Add CycleCounter.h to both GAP9
and NE16 platform include lists, and expose the GAP9 inc/ directory to
the network target so the header is found at compile time.
GCC 7.1.1 has LTO linking bugs with the GAP9 SDK PMSIS library. The
profiling step needs a clean rebuild with LTO disabled to avoid
conflicts with the cached LTO-enabled build from the test step.
The --enable-3x3 flag was parsed by the runner script but never passed
to generateNetwork.py, so NE16Engine.enable3x3 was always False. DW 3x3
and Dense 3x3 convolutions silently fell back to the PULP cluster
instead of dispatching to NE16. Add the flag and set it on the engine.
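
The fix itself is plain flag plumbing; a minimal sketch, assuming the runner gathers generator arguments into a gen_args list as the commit text describes:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enable-3x3", dest="enable3x3", action="store_true")
args, _ = parser.parse_known_args()

gen_args = []
if args.enable3x3:
    gen_args.append("--enable-3x3")   # forwarded on to generateNetwork.py
```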
Add a 64×64×32×32 Dense 3x3 RQ Conv test case (75M ops) to properly
benchmark NE16 throughput. The existing Dense_2D_RQ test (16×16×8×8)
is too small — NE16 dispatch overhead dominates at only 12.8%
utilization. Also wire --enable-3x3 through deeployRunner gen_args.
…r model test

Adds Models/MLPerf/VisualWakeWords (MobileNetV1 96x96, 27 Convs) to the NE16
L2 single-buffer model test set with l1=60000, and switches the model test
fixture to gen_args=[] (no --enable-3x3) so that:

- 13 pointwise 1x1 convs dispatch to NE16 (~82.7% of total MACs)
- 1 stride-2 dense + 13 depthwise convs fall back to the GAP9 cluster

The full --enable-3x3 path with mixed-engine graphs (some convs on NE16,
others on cluster) still has two known issues — the NE16Deployer global swap
to NCHWtoNHWCPass leaves cluster-fallback DW weights in NE16 NHWC layout,
and the PULP DW tile constraint assumes NCHW input layout — so this commit
intentionally avoids that path and runs DW/dense on the cluster.

gvsoc gap9.evk: 0 errors / 2 outputs, runtime 1 847 256 cycles
(MAC/Cyc = 4.05, vs 35 theoretical peak for full-NE16 MobileNetV1).

All existing NE16 kernel tests still pass (10/10 in gap9_w_ne16_tiled suite).
…able-3x3

The previous design replaced the global PULPNCHWtoNHWCPass with NCHWtoNHWCPass
whenever --enable-3x3 was on, which forced *every* depthwise conv (including
ones that fall back to the GAP9 cluster for stride > 1) into the NE16 NHWC
weight layout (cin/g=1, H, W, cout). PULPDWConv2DParser checks
group == weight.shape[0] and rejected those nodes, breaking codegen for any
graph that mixes NE16 and cluster DW convs.

Switch back to the default PULPNCHWtoNHWCPass — which produces cluster-friendly
weights (cout, H, W, cin/g) and skips the input transpose — and add a new
NE16-only fixup pass inside NE16OptimizationPass that, only for engine == "NE16"
DW convs:
  - transposes the weight to NE16 NHWC layout (1, H, W, cout)
  - inserts the NCHW -> NHWC input transpose NE16 needs

Cluster-colored DW convs are left in the PULP layout so PULPDWConv2DParser
and its DW tile constraint (NCHW input) keep working.
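
For reference, a numpy illustration of the two DW weight layouts in play (toy shapes; axis orders per the commit text above):

```python
import numpy as np

cout, H, W = 8, 3, 3
w_pulp = np.arange(cout * H * W).reshape(cout, H, W, 1)  # cluster layout (cout, H, W, cin/g)
w_ne16 = np.transpose(w_pulp, (3, 1, 2, 0))              # NE16 NHWC layout (1, H, W, cout)
assert w_ne16.shape == (1, H, W, cout)
# Only engine == "NE16" DW convs get this transpose (plus the NCHW -> NHWC
# input transpose); cluster-colored DW convs keep w_pulp.
```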

Restores --enable-3x3 for the model fixture and re-enables DW + PW on NE16
for MobileNetV1 (MLPerf VisualWakeWords).

gvsoc gap9.evk:
  - PW-only baseline: 1 847 256 cycles, MAC/Cyc 4.05
  - PW + DW on NE16:  1 190 437 cycles, MAC/Cyc 6.29  (-35.6% cycles, +55% MAC/Cyc)
  - NE16 dispatch covers 91.3% of total MACs; only 6 stride-2 layers remain on cluster.

All 10 existing NE16 tests (9 kernels + MobileNetV1) still pass.
With the engine-aware DW NHWC fixup pass in place, the previous saturation
failure on stride-2 NE16 dispatch is gone — the root cause was the global
NHWC swap forcing wrong layout, not a HAL-level bug. Add --enableStrides
alongside --enable-3x3 to the model fixture so all 27 MobileNetV1 convs go
to NE16 (no cluster fallback).

gvsoc gap9.evk:
  - PW-only:                            1 847 256 cyc  MAC/Cyc 4.05
  - PW + DW-s1 (--enable-3x3):          1 190 437 cyc  MAC/Cyc 6.29
  - All convs (--enable-3x3 + Strides):   845 217 cyc  MAC/Cyc 8.86

Final speedup vs PW-only baseline: 2.19x (-54.2% cycles).
NE16 dispatch count goes from 14 -> 28 (all 27 Convs + the final Gemm-as-PW),
cluster path runs only the residual MaxPool.

All 10 NE16 tests still pass (9 kernels + MobileNetV1).
The NE16 Dense 3x3 tile constraint constructed a rank-3 WeightCube
(COffset, 0, 0), (CSize, weightShape[-2], weightShape[-1]) for a rank-4
encoded weight buffer (cout, cinMajor, bits, H*W*cinMinorBytes). With
single-buffer everything fit in one tile so the offset stayed 0 and the
left-padded rectangle landed safely. Double-buffer splits cout across
multiple tiles, so the cout offset shifted onto the cinMajor axis (size 1)
during _legalizeTransfers and tripped the rectangle-bounds assertion.

Construct a rank-matching rectangle from the start. Same fix applies to
both the standard L2 path and the WeightMemory_SRAM path.
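
The shape of the fix, with a plain (offsets, dims) tuple standing in for Deeploy's HyperRectangle (variable names mirror the commit text; toy values):

```python
# Encoded DW/Dense weight buffer is rank-4: (cout, cinMajor, qw, H*W*cinMinorBytes).
weightShape = (64, 1, 8, 18)
COffset, CSize = 32, 32        # cout tile picked by the solver under double buffering

# Before: rank-3 rectangle against a rank-4 buffer; during _legalizeTransfers the
# cout offset lands on the cinMajor axis (size 1) and trips the bounds assertion.
rect_bad = ((COffset, 0, 0), (CSize, weightShape[-2], weightShape[-1]))

# After: rank-matching rectangle; the cout offset stays on the cout axis.
rect_good = ((COffset, 0, 0, 0), (CSize,) + weightShape[1:])
```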

Also register the missing test_gap9_w_ne16_tiled_models_l2_doublebuffer
pytest function and L2_DOUBLEBUFFER_MODELS config so MobileNetV1
(MLPerf/VisualWakeWords) covers both buffer modes.

gvsoc gap9.evk:
  - L1=128000 single buffer: 791 461 cyc, MAC/Cyc 9.46
  - L1=90000  double buffer: 855 488 cyc, MAC/Cyc 8.75

Double buffer at full 128KB L1 hits a runtime allocator failure because
the encoded-weight footprint plus double-buffered tile pairs exceeds the
budget; capping double-buffer L1 at 90KB leaves runtime headroom while
the tile solver still picks reasonable tile sizes. For this small VWW
0.25x model (7.5M MACs, 96x96) double buffering doesn't beat single
buffer — the layers are too compute-light to amortize the smaller tiles.
The infrastructure is now in place for larger models where DB pays off.

All 11 NE16 tests pass (9 kernels + 2 models).
Mirrors ci-platform-gap9-tiled.yml's coverage — the NE16 workflow only had
kernel jobs, so the new MobileNetV1 (MLPerf/VisualWakeWords) model tests
weren't running on upstream's runners.
…lows)

The NE16 CI runs on ecab71e and 7f6ce26 tripped on git's "dubious
ownership" check for TargetLibraries/CMSIS/third_party/CMSIS-NN during
submodule fetch, killing the job before any pytest started. The CMSIS-NN
submodule SHA hasn't changed since 822dd32 (where this workflow last
passed), so this is purely a runner workspace-ownership issue.

Add the same 'mark workspace as safe' step that every other platform's
_runner-*.yml already uses (siracusa, snitch, gap9, gap9-tiled, chimera,
generic, cortexm, mempool, siracusa-neureka-tiled, ...).
runwangdl added 4 commits May 14, 2026 21:01
The pass was added during the NE16 Linear PR integration (6c8ae2b) and
matches every Gemm/RequantizedGemm node without checking the engine
attribute, so cluster-bound GEMMs (e.g. MLPerf AnomalyDetection's 10
Gemm+RQ layers — they never run on NE16) had their mul/bias rewritten
into NE16 scale/scale_n/shift-diff layout. The cluster pulp_nn_linear
kernel then consumed the rewritten constants under its original integer
contract and produced ±1 mismatches versus the int8 reference outputs.

Mirror the existing NE16AdjustWeightMemoryLayoutPass: bail out for
nodes whose engine attr isn't "NE16". Pure-GAP9 cluster Gemms keep
Deeploy's Generic + PULPGEMMRequantMergePass layout (including the
bias += div/2 rounding compensation), matching the reference.

gvsoc gap9.evk (Models/MLPerf/AnomalyDetection L1=64000):
  - before: 33/640 errors (all ±1), Runtime 89110 cycles
  - after:    0/640 errors,         Runtime 79332 cycles
  - devel base 3b011bb (where bug doesn't exist): 0/640, 78500 cycles

The gap9_tiled L2 single-buffer model suite goes from 9/11 to 10/11 passing. The
remaining failure (MLPerf/ImageClassification, parser backtracking on
a standalone RequantShift node) is unrelated to GEMM and pre-dates
this fix.
GAP9Platform's loweringPasses had PULPAddRequantMergePass commented out
without an explanation. Without it, RequantShifts that follow an Add
(residual quantize step in ResNet-like blocks) never get folded into a
RequantizedAdd, so they survive standalone into the backend with their
float32 scalar mul/add intact. The PULP RequantShift bindings only
accept int32 mul/add, so parser-side type checking rejects every
binding and the whole graph backtracks. MLPerf/ImageClassification
(int8 ResNet, 14 RequantShifts, 3 Adds) was the visible victim — at
the devel base 3b011bb it passed because PULPAddRequantMergePass was
active there.

Restore the pass to the lowering chain alongside its conv/gemm/matmul
peers. The model now parses, builds and runs cleanly:
  - Models/MLPerf/ImageClassification on gvsoc gap9.evk:
      0 / 10 errors, 1 365 882 cycles
      (devel base: 0 / 10 errors, 1 368 399 cycles — bit-equivalent)

Also tighten RQSSplitPass._split_rqs_fun: when it duplicates the
mul/add constants for each downstream RQ, cast their values to int32
if they are integer-valued. Source ONNX often stores RequantShift's
scalar mul/add as float32 because they get folded into Conv/Gemm bias
later. With the new split, they survive standalone, so type-checking
needs them to already be int32. This is a defense-in-depth fix —
PULPAddRequantMergePass alone already unblocks the failing model.
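
The defensive cast, sketched as a standalone helper (the helper name is illustrative; the real change lives inline in RQSSplitPass._split_rqs_fun):

```python
import numpy as np

def as_int32_if_integral(arr: np.ndarray) -> np.ndarray:
    # RequantShift's scalar mul/add often arrive as float32 in the source ONNX;
    # once they survive standalone, they must type-check as int32.
    if np.issubdtype(arr.dtype, np.floating) and np.all(arr == np.round(arr)):
        return arr.astype(np.int32)
    return arr
```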

gap9_tiled and models and singlebuffer and l2: 11 / 11 pass
(was 9 / 11 before, then 10 / 11 after the GEMM engine-check fix).
Two FP32 kernel tests regressed since the NE16 Linear PR integration:

1. FP32 GEMM/Regular: the new GAP9_NE16GEMMInt32Mapper was prepended to
   the plain Gemm op binding list. It uses the same GEMMParser class as
   the FloatGEMM and GEMMDequant mappers, and Deeploy keys candidate
   bindings by parser class — so listing the NE16Int32 mapper first
   masked the other two. FP32 inputs failed the NE16Int32 type check
   and there was no backtrack to the FloatGEMM mapper.

   Drop GAP9_NE16GEMMInt32Mapper from the Gemm op list (the int8/uint8
   path is already covered by RequantizedGemm with its own NE16 mapper).

2. FP32 Reshape/SkipConnection: MatMulAddMergePass fused MatMul+Add
   into a Gemm with alpha=beta=1 and no transA/transB. The MatMul inputs
   in this test don't share Gemm semantics, so the merge produced
   wrong outputs (16/16 errors). Devel base doesn't include this pass
   in the GAP9 lowering chain and the test passes there. Disable it
   for GAP9 to match.

gvsoc gap9.evk:
  - FP32 GEMM/Regular:           0/1024 errors, 28987 cycles
  - FP32 Reshape/SkipConnection: 0/16 errors,    8343 cycles
  - devel base: 0/1024, 28k-ish; 0/16, 8020 cycles — same shape.

gap9_tiled suite: 96 passed / 1 failed (was 92/5).
Remaining failure (Models/CCT/FP32/CCT_2_32_32_128) also fails at
3b011bb devel base — pre-existing, unrelated to this PR.