perf: --relocatable direct selector bypasses synth-opt — general codegen optimization (research + stats tracking) #209

@avrabe

Tracking issue for the general codegen-optimization effort (not a gale-specific tweak — improvements must help every --relocatable compile). gale to post on-target statistics + regression feedback here; I post research/findings + land general optimizations validated against the wasmtime-vs-unicorn oracle (scripts/repro/wake_path_differential.py).

Baseline (gale's z_impl_k_sem_give, --target cortex-m4 --relocatable, v0.11.15)

  • .text: 219 instructions / 694 bytes
  • ~35% of instructions are loads/stores/register-moves (memory traffic + shuffling)
  • 18% are SP-relative frame spill/reload (ldr/str [sp,#k])
  • 0 adjacent redundancies (str[k];ldr[k], ldr[k];ldr[k], mov rX,rX) → the waste is non-local (a local/param reloaded on each local.get; values re-materialized across the function), so a naive adjacent-peephole won't help — it needs a small local dataflow pass.
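A minimal sketch of how the baseline numbers above could be reproduced from a disassembly listing. It assumes one instruction per line in plain `ldr rX, [sp, #k]` / `str rX, [sp, #k]` / `mov rX, rY` text form; the function names (`parse`, `frame_stats`) and the sample listing are hypothetical and are not part of the repo. An "adjacent pair" is read here as a `str` or `ldr` immediately followed by an `ldr` of the same slot, which is one interpretation of the patterns listed above.

```python
import re

# Assumed textual format: "ldr r0, [sp, #8]" or "str r1, [sp]".
SP_ACCESS = re.compile(r"^(ldr|str)\s+(r\d+),\s*\[sp(?:,\s*#(\d+))?\]$")
MOV = re.compile(r"^mov\s+(r\d+),\s*(r\d+)$")


def parse(line):
    """Return (op, reg, slot) for SP-relative ldr/str, ("mov", dst, src) for mov, else None."""
    text = line.strip().lower()
    m = SP_ACCESS.match(text)
    if m:
        return (m.group(1), m.group(2), int(m.group(3) or 0))
    m = MOV.match(text)
    if m:
        return ("mov", m.group(1), m.group(2))
    return None


def frame_stats(lines):
    """Count SP-relative frame traffic and purely adjacent redundancies."""
    parsed = [parse(line) for line in lines]
    sp_accesses = sum(1 for p in parsed if p and p[0] in ("ldr", "str"))
    adjacent_pairs = 0
    for prev, cur in zip(parsed, parsed[1:]):
        if not prev or not cur:
            continue
        # str[k];ldr[k] or ldr[k];ldr[k]: the second access re-reads a slot
        # whose value is necessarily still in a register.
        if prev[0] in ("ldr", "str") and cur[0] == "ldr" and prev[2] == cur[2]:
            adjacent_pairs += 1
    self_moves = sum(1 for p in parsed if p and p[0] == "mov" and p[1] == p[2])
    total = len(lines)
    return {
        "total": total,
        "sp_accesses": sp_accesses,
        "sp_fraction": sp_accesses / total if total else 0.0,
        "adjacent_pairs": adjacent_pairs,
        "self_moves": self_moves,
    }


# Hypothetical listing: slot 4 is reloaded, but not adjacently.
sample = [
    "ldr r0, [sp, #4]",
    "adds r0, r0, #1",
    "str r0, [sp, #8]",
    "ldr r1, [sp, #4]",
    "mov r2, r2",
]
print(frame_stats(sample))
```

In the sample, the second load of slot 4 is separated from the first by two instructions, so the adjacent counter stays at zero even though the reload is redundant. That is the non-local pattern described above.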

Root cause

--relocatable routes to select_with_stack (the direct selector, #197) which bypasses the synth-opt IR optimizer (CSE/DCE/const-fold/regalloc). It emits straight-line stack-machine code: every operand materialized to a register, every local.get a fresh reload, param frame-backing (#204) adds a reload per read.
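To illustrate the shape of the problem, here is a toy model of a direct stack-machine selector. It is not the actual `select_with_stack` implementation: the function name `select_direct`, the tuple encoding of wasm ops, the slot map, and the never-recycled register numbering are all assumptions made for the example.

```python
def select_direct(wasm_ops, slot_of):
    """Toy direct selector: one pass, no memory of what registers already hold.

    wasm_ops: list of ("local.get", name), ("local.set", name) or ("i32.add",)
    slot_of:  hypothetical map from local name to SP-relative byte offset
    """
    out, stack, next_reg = [], [], 0
    for op in wasm_ops:
        if op[0] == "local.get":
            # Every local.get is a fresh reload from the frame.
            reg = f"r{next_reg}"
            next_reg += 1
            out.append(f"ldr {reg}, [sp, #{slot_of[op[1]]}]")
            stack.append(reg)
        elif op[0] == "i32.add":
            rhs, lhs = stack.pop(), stack.pop()
            out.append(f"add {lhs}, {lhs}, {rhs}")
            stack.append(lhs)
        elif op[0] == "local.set":
            out.append(f"str {stack.pop()}, [sp, #{slot_of[op[1]]}]")
        else:
            raise ValueError(f"unsupported op: {op[0]}")
    return out


# z = x + y + x
ops = [
    ("local.get", "x"),
    ("local.get", "y"),
    ("i32.add",),
    ("local.get", "x"),
    ("i32.add",),
    ("local.set", "z"),
]
for line in select_direct(ops, {"x": 0, "y": 4, "z": 8}):
    print(line)
```

The second read of `x` becomes a second `ldr` from slot 0, separated from the first by other instructions. An optimizer with CSE or register allocation would normally remove it; a selector that keeps no such state cannot.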

General optimization options (ranked by leverage vs risk)

  1. Local redundant-memory elimination on the selector output — track which register currently holds each frame slot; rewrite a reload to a mov (or drop it) when the value is still live in a register; drop dead stores. General, local, low-risk, oracle-checkable. (Recommended first.)
  2. Keep multi-read locals/params in registers in select_with_stack (load once, reuse) instead of reload-per-local.get. Higher payoff, touches the allocator.
  3. Route --relocatable through synth-opt (the big lever: reuse the real optimizer) once its ABI is made relocatable-correct (the reason #197 bypassed it: absolute linmem base + non-preserving calls). Highest payoff, highest risk.
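A minimal sketch of option 1, under stated assumptions: instructions are modelled as tuples rather than the selector's real output type, frame slots are word-sized, non-overlapping, and only touched through SP-relative `ldr`/`str`, and anything that could invalidate that (calls, labels and branches, SP adjustments, stores through pointers) is represented by a conservative `("barrier",)` entry. The names `eliminate_redundant_reloads` and `run` are hypothetical. Dead-store elimination is left out to keep the example short.

```python
def _forget(slot_in_reg, reg):
    """Drop every slot whose tracked copy lived in `reg` (it is being overwritten)."""
    for slot in [s for s, r in slot_in_reg.items() if r == reg]:
        del slot_in_reg[slot]


def eliminate_redundant_reloads(insns):
    """Rewrite reloads of a frame slot whose value is still held in a register.

    Tuple forms (assumed for this sketch):
      ("ldr", rd, slot)   ("str", rs, slot)   ("barrier",)
      (opcode, rd, ...)   any other instruction that writes rd
    """
    slot_in_reg = {}  # slot -> register known to hold that slot's current value
    out = []
    for insn in insns:
        kind = insn[0]
        if kind == "ldr":
            _, rd, slot = insn
            holder = slot_in_reg.get(slot)
            if holder == rd:
                continue  # value already in rd: drop the reload
            _forget(slot_in_reg, rd)
            if holder is not None:
                out.append(("mov", rd, holder))  # reload becomes a register move
            else:
                out.append(insn)
                slot_in_reg[slot] = rd
        elif kind == "str":
            _, rs, slot = insn
            out.append(insn)
            slot_in_reg[slot] = rs  # the slot now mirrors rs
        elif kind == "barrier":
            slot_in_reg.clear()  # forget everything at calls, labels, etc.
            out.append(insn)
        else:
            _forget(slot_in_reg, insn[1])
            out.append(insn)
    return out


def run(insns, regs, frame):
    """Tiny reference interpreter used to compare before/after behaviour."""
    regs, frame = dict(regs), dict(frame)
    for insn in insns:
        kind = insn[0]
        if kind == "ldr":
            regs[insn[1]] = frame[insn[2]]
        elif kind == "str":
            frame[insn[2]] = regs[insn[1]]
        elif kind == "mov":
            regs[insn[1]] = regs[insn[2]]
        elif kind == "add":
            regs[insn[1]] = regs[insn[2]] + regs[insn[3]]
    return regs, frame


code = [
    ("ldr", "r0", 4),
    ("ldr", "r1", 8),
    ("add", "r2", "r0", "r1"),
    ("str", "r2", 12),
    ("ldr", "r3", 4),   # slot 4 is still in r0
    ("ldr", "r0", 4),   # r0 already holds slot 4
    ("ldr", "r1", 12),  # slot 12 was just stored from r2
]
optimized = eliminate_redundant_reloads(code)
regs0 = {f"r{i}": 0 for i in range(4)}
frame0 = {4: 10, 8: 20, 12: 0}
assert run(code, regs0, frame0) == run(optimized, regs0, frame0)
print(len(code), "->", len(optimized), "instructions")
```

The final assertion is a small-scale stand-in for the differential check: the original and rewritten sequences are executed from the same initial state and must end in the same registers and frame. The real gate remains the wasmtime-vs-unicorn oracle named above.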

Asks for gale

  • Post the on-target measurement (cycles/instructions for the hot functions, ideally a per-function or per-region breakdown) so I optimize what's actually hot, not what merely looks redundant.
  • Flag any correctness regression from an optimization immediately (the oracle harness is my pre-merge guard, but on-hardware is ground truth).

Each optimization ships as a normal bugfix-cadence release with a falsification statement; correctness is gated by the differential oracle + full suite + the #193/#186 fuzz.
