
[Draft] Redmule platform#67

Draft
runwangdl wants to merge 9 commits into pulp-platform:devel from runwangdl:redmule_platform

Conversation


@runwangdl runwangdl commented May 8, 2025

Redmule Platform
(Rebased on Picolib imf PR and CCT optim PR)

Added

  • Redmule Platform, Engine, Tiler, Deployer, Binding
  • Matmul with Redmule tileConstraint, template, kernel
  • Conv Im2col with Redmule tileConstraint, template, kernel
  • Pass for Conv im2col weight transpose
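Both the Conv lowering and the weight-transpose pass hinge on the im2col-as-GEMM trick. A minimal numpy sketch of the idea (illustrative only: `im2col_hwc` is a hypothetical name, not Deeploy's API; valid padding and stride 1 assumed):

```python
import numpy as np

def im2col_hwc(x, P, Q):
    # x: [H, W, C] input (HWC layout); P, Q: kernel height/width.
    # Valid padding, stride 1: result is [H_out * W_out, P * Q * C].
    H, W, C = x.shape
    H_out, W_out = H - P + 1, W - Q + 1
    cols = np.empty((H_out * W_out, P * Q * C), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            # Each row is one flattened receptive field.
            cols[i * W_out + j] = x[i:i + P, j:j + Q, :].ravel()
    return cols

# Weights arrive as [F, P, Q, C]; the transpose pass permutes them to
# [P, Q, C, F] so that, flattened to [P*Q*C, F], one GEMM
# cols @ w_flat produces the [H_out * W_out, F] conv output.
x = np.random.rand(8, 8, 3).astype(np.float32)
w = np.random.rand(4, 3, 3, 3).astype(np.float32)        # [F, P, Q, C]
w_flat = w.transpose(1, 2, 3, 0).reshape(-1, w.shape[0]) # [P*Q*C, F]
out = im2col_hwc(x, 3, 3) @ w_flat                       # [36, 4]
```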

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR has been reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the Docker image was modified, restore its original link after review.

@runwangdl runwangdl force-pushed the redmule_platform branch 3 times, most recently from 985a645 to 9ef9cc2 Compare May 9, 2025 14:05
@Victor-Jung Victor-Jung added the Feature (Addition of new features) and Milestone labels May 18, 2025
@Victor-Jung Victor-Jung removed this from Deeploy May 22, 2025
@Xeratec Xeratec added this to Deeploy May 22, 2025
@Xeratec Xeratec added this to the Release 0.2.0 milestone May 22, 2025
@Xeratec Xeratec removed the Milestone label May 22, 2025
@Victor-Jung Victor-Jung moved this to In review in Deeploy May 22, 2025
@Victor-Jung Victor-Jung moved this from In review to In progress in Deeploy Jun 19, 2025
@Xeratec Xeratec modified the milestones: Release xxx, Release 0.3.0 Nov 19, 2025

Minimal port of RedMulE-platform code from the user's redmule_platform
branch (which had accumulated unrelated CCT_Optim merges) onto a clean
devel base.

What landed:
- New target Deeploy/Targets/Redmule/ (Platform, Engine, Deployer,
  Bindings, Parsers, Tiler, Templates, TileConstraints,
  TopologyOptimizationPasses).
- FP32 RedMulE matmul kernel: TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c.
- Test runner DeeployTest/testRunner_tiled_siracusa_w_redmule.py plus
  Float test fixtures (testFloat{Matmul,MatmulLarge,MatmulLarge256,2DConvolution,2dConvLarge,GEMM,GEMMtransB}).
- Wiring in platformMapping.py, top-level CMakeLists.txt,
  DeeployTest/CMakeLists.txt, TargetLibraries/PULPOpen/CMakeLists.txt.
- Makefile: GVSOC_COMMIT_HASH points at runwangdl/gvsoc fork 35d00d1
  (carries the light_redmule vendored copy + Siracusa cluster wiring).

Fixes / portings required for devel compatibility:
- Deeploy/Targets/PULPOpen/Templates/FloatGemmTemplate.py: define
  float32_tPtr locally (unresolved import left on devel).
- Deeploy/Targets/Redmule/TopologyOptimizationPasses/Passes.py: switch
  from the retired _permuteLastTwoDims / _appendTransposeNode helpers
  to upstream's _appendTranspose.
- Add empty __init__.py to Targets/{Chimera,Redmule,SoftHier}.

What intentionally did NOT land:
- CCT_Optim-era edits to PULPOpen Templates (Add/Conv/GELU/Layernorm/
  MatMul/MaxPool/Relu/Softmax), Generic Layers.py computeOps, CCT test
  suites, parallel/unroll rewrites.
- Buggy -march=rv32imc inside meson-build-script-rv32imf.txt.
- Hard-to-merge edits to DeeployTest/Platforms/Siracusa/src/deeploytest.c.
- The old-style .github/workflows/TestRunnerTiledSiracusaWithRedmule.yml;
  new-style ci-platform-siracusa-redmule-tiled.yml TBD.

Verified end-to-end: testFloatMatmul on GVSoC (runwangdl/gvsoc@35d00d1,
pulp submodule @ 371772c) passes with 'Errors: 0 out of 256'.

The Tests/ directory layout on devel was reorganized into Kernels/,
Models/, Others/ subdirectories. Drop the flat-path Float test inputs
ported from redmule_platform; they'll be re-added under the new
structure in a follow-up.

Mirrors the neureka-tiled pattern:
- DeeployTest/test_siracusa_redmule_tiled_config.py with empty
  L2_{SINGLE,DOUBLE}BUFFER_KERNELS dicts (to be populated once Float
  kernel test fixtures land under Tests/Kernels/Float/).
- conftest.py: register 'siracusa_redmule_tiled' pytest marker.
- test_platforms.py: two parametrized test functions (L2 single- and
  double-buffer) for the redmule platform.
- .github/workflows/_runner-siracusa-redmule-tiled.yml: reusable runner
  mirroring _runner-siracusa-neureka-tiled.yml.
- .github/workflows/ci-platform-siracusa-redmule-tiled.yml: top-level
  trigger, defaults to ghcr.io/runwangdl/deeploy:redmule Docker image.

With empty configs the tests collect and skip cleanly (pytest 'got
empty parameter set'). No wmem variants since RedMulE does not use
Neureka weight memory.

- yapf / isort / autoflake / trailing-whitespace across the Redmule
  Python target and platformMapping wiring.
- clang-format over TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c.
- Add SPDX/license header to Matmul_fp32_Redmule.c (reuse hook).

The GAP9 CI uses ghcr.io/pulp-platform/deeploy-gap9:devel, which is
only pullable with pulp-platform org credentials. On a fork the job
fails at 'Initialize containers'. Add github.repository_owner guard
so forks skip the jobs cleanly.
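Such a guard is a single `if` on the affected job; a minimal sketch using the standard GitHub Actions syntax (job name and steps are illustrative, not the actual workflow):

```yaml
jobs:
  gap9-test:
    # Skip on forks: the deeploy-gap9 image is only pullable with
    # pulp-platform org credentials.
    if: github.repository_owner == 'pulp-platform'
    runs-on: ubuntu-latest
    container:
      image: ghcr.io/pulp-platform/deeploy-gap9:devel
    steps:
      - uses: actions/checkout@v4
```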

The docs workflow publishes to gh-pages, which on a fork races with
external pushes and lacks origin remote setup. Gate on
github.repository_owner == 'pulp-platform' so only upstream publishes.

Point the redmule tiled CI config at existing upstream FP32 kernel
test fixtures under Tests/Kernels/FP32/GEMM (Regular, TransB). Both
single-buffer and double-buffer variants verified locally end-to-end
on GVSoC (Errors: 0 / 256, runtime ~4k cycles).

Without this fallback _select-env.yml resolves to the upstream
pulp-platform/deeploy:devel image, which ships a GVSoC build that
does not include the light_redmule model — the redmule test runner
then hangs. Point the default at the fork's custom image so push
events get the correct GVSoC build.

ghcr.io/runwangdl/deeploy:redmule is a private package; add
credentials block using the workflow's GITHUB_TOKEN so the runner
container step can pull it.
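The credentials block follows the standard `jobs.<job_id>.container.credentials` workflow syntax; a sketch (the surrounding job wiring is illustrative):

```yaml
container:
  image: ghcr.io/runwangdl/deeploy:redmule
  credentials:
    username: ${{ github.actor }}
    password: ${{ secrets.GITHUB_TOKEN }}
```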
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request May 10, 2026
The Siracusa+RedMulE training CI on 1782a88 got past Python codegen but
failed at link time:

    ld.lld: error: undefined symbol:
        Conv2d_Im2Col_fp32_fp32_fp32_HWC_8_Redmule
    >>> referenced by TrainingNetwork.c:5386 in
        _node_1_tokenizer_..._Conv_cluster_fork

The original RedMulE PR (pulp-platform/Deeploy#67) shipped only the
matmul kernel TargetLibraries/PULPOpen/src/Matmul_fp32_Redmule.c.  The
ConvTemplate references a `Conv2d_Im2Col_..._8_Redmule` kernel that has
no corresponding source in the tree, and 67b754b already deleted the
testFloat2DConvolution / testFloat2dConvLarge fixtures that would have
exercised the Redmule Conv path.  So the Conv binding has always been
load-bearing only for non-test models like CCT_train, and on those it
breaks the link.

Two coupled changes route Conv through the existing PULPClusterEngine
(which has a working PULP_Conv2d_Im2Col_fp32_fp32_fp32_HWC):

- Drop 'Conv' from RedmuleMapping.  Without it Conv falls through to
  the second engine in RedmulePlatform's engine list (PULPCluster).
- Drop RedMuleAdjustWeightMemoryLayoutPass from the lowering passes.
  That pass transposed Conv weights from [F,H,W,Cin] to [H,W,Cin,F]
  for the RedMulE accelerator's expected layout; once Conv is on the
  PULPCluster engine, PULP expects [F,H,W,Cin] and the pre-applied
  transpose makes Tiling produce out-of-bounds tile rectangles
  (locally repro'd: AssertionError "Rectangle offset should be zero
  when the dimensions are the same. Received rectangle
  HyperRectangle(offset=(3, 0, 0, 0), dims=(3, 3, 3, 32))" in
  TilingCodegen.minimizeRectangle).
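The layout mismatch behind that assertion can be sketched in numpy (shapes taken from the error message above; purely illustrative, not Deeploy code):

```python
import numpy as np

F, P, Q, C = 3, 3, 3, 32                      # per the failing rectangle
w = np.zeros((F, P, Q, C), dtype=np.float32)  # layout PULPCluster expects

# RedMuleAdjustWeightMemoryLayoutPass permutes to the accelerator layout:
w_redmule = w.transpose(1, 2, 3, 0)           # [P, Q, C, F] = (3, 3, 32, 3)

# A consumer still assuming [F, P, Q, C] now sees the wrong geometry,
# so tile rectangles computed against one layout index out of bounds
# in the other.
assert w.shape == (3, 3, 3, 32)
assert w_redmule.shape == (3, 3, 32, 3)
assert w_redmule.shape != w.shape
```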

Both are clearly marked in-source as "restore when the RedMulE Conv
kernel lands."  Locally validated end-to-end:
- testMVPTraining.py    -> exit 0 (TrainingNetwork.c emits
  PULP_Conv2d_Im2Col_fp32_fp32_fp32_HWC for the tokenizer Conv).
- testMVPOptimizer.py   -> exit 0.

Matmul / Gemm continue to bind to RedMulE as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
runwangdl added a commit to runwangdl/TrainDeeploy that referenced this pull request May 10, 2026
Implements the kernel symbol that
Deeploy/Targets/Redmule/Templates/ConvTemplate.py has been pointing at
since the original pulp-platform/Deeploy#67 port -- it was a
declared-but-never-defined dangling reference, which is why 78a05d4 had to
unmap Conv from RedmuleMapping and route it through PULPCluster.

- TargetLibraries/PULPOpen/src/Conv2d_Im2Col_fp32_Redmule.c
  All 8 cluster cores cooperatively build the [N_out, P*Q*C] im2col
  matrix in the hoisted L1 transient buffer (contiguous slices of
  output positions, zero-pad when h_in/w_in fall outside the input).
  Core 0 then triggers a single RedMulE GEMM
      [N_out, K] @ [K, F]  ->  [N_out, F]
  via MatMul_*_Redmule / Gemm_*_Redmule from Matmul_fp32_Redmule.c.
  When has_bias is true the [F] bias is broadcast in-place into pOut
  and Gemm runs with y_addr = z_addr = pOut (same pattern the existing
  MatMul kernel already uses for its Y=Z=pDstY zero-init).

- Conv.h declares the new symbol.

- ConvTemplate.py:
  * forwards ${bias} and ${has_bias} (PULPFPConv2DParser already
    populates them) -- the previous template silently dropped bias.
  * sizes the im2col transient buffer to the full per-tile
    H_out * W_out * (C*P*Q) footprint instead of the prior 8-row
    scratch; one big GEMM amortises RedMulE's MMIO setup cost.

- Engine.RedmuleMapping restores 'Conv': ConvLayer([Conv2DRedmuleMapper]).

- Deployer.py restores RedMuleAdjustWeightMemoryLayoutPass -- it
  permutes Conv weights from [F,P,Q,C] to [P,Q,C,F] = flat [P*Q*C, F],
  exactly the right operand the im2col GEMM consumes.  Both Conv and
  the layout pass were disabled together in 78a05d4 (PULPCluster
  fallback expects [F,P,Q,C]); both come back together now.
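The broadcast-bias, accumulate-into-output GEMM pattern described above can be modeled in numpy (a data-flow sketch, not the C kernel; `conv_via_im2col_gemm` is a hypothetical name):

```python
import numpy as np

def conv_via_im2col_gemm(cols, w_flat, bias=None):
    # cols:   [N_out, K] im2col matrix (K = P*Q*C)
    # w_flat: [K, F] weights, permuted to [P, Q, C, F] and flattened
    # bias:   optional [F] vector
    N_out, F = cols.shape[0], w_flat.shape[1]
    out = np.zeros((N_out, F), dtype=cols.dtype)
    if bias is not None:
        out[:] = bias    # broadcast the bias in-place into the output
    # Z = X @ W + Y with Y and Z aliased to the same buffer, mirroring
    # the kernel's y_addr = z_addr = pOut accumulate-into-output GEMM.
    out += cols @ w_flat
    return out
```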

Locally validated: testMVPTraining.py + testMVPOptimizer.py both exit 0
on Models/Training/CCT/cct_train @ Siracusa_w_redmule; generated
TrainingNetwork.c now emits Conv2d_Im2Col_fp32_fp32_fp32_HWC_8_Redmule
for the tokenizer Conv (was PULP_Conv2d_Im2Col_*_HWC).

GVSoC numerical tolerance still has to be checked on CI -- this is a
new kernel, not a wrapper around an existing one, and the broadcasted-
bias path was never exercised before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Labels

Feature (Addition of new features)

Projects

Status: In progress


3 participants