perf: --relocatable direct selector bypasses synth-opt — general codegen optimization (research + stats tracking) #209

Tracking issue for the **general** codegen-optimization effort. This is not a gale-specific tweak: each improvement must help every `--relocatable` compile. gale posts on-target **statistics + regression feedback** here; I post research/findings and land general optimizations validated against the wasmtime-vs-unicorn oracle (`scripts/repro/wake_path_differential.py`).

## Baseline (gale's `z_impl_k_sem_give`, `--target cortex-m4 --relocatable`, v0.11.15)
- `.text`: **219 instructions / 694 bytes**
- **~35%** of instructions are loads/stores/register-moves (memory traffic + shuffling)
- **18%** are SP-relative frame spill/reload (`ldr/str [sp,#k]`)
- **0** *adjacent* redundancies (`str[k];ldr[k]`, `ldr[k];ldr[k]`, `mov rX,rX`) → the waste is **non-local**: a local/param is reloaded on each `local.get`, and values are re-materialized across the function. A naive adjacent peephole won't help; this needs a small local dataflow pass.
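
For anyone reproducing these numbers on another function, here is a minimal sketch of how the frame-traffic share and the adjacent-redundancy count could be computed. It assumes the disassembly has already been normalized to one instruction string per entry (e.g. `ldr r0, [sp, #8]`); the function name `frame_stats` and the input format are hypothetical, not part of the repo.

```python
import re

# Matches SP-relative word loads/stores such as "ldr r0, [sp, #8]" or "str.w r1, [sp]".
SP_ACCESS = re.compile(r"^(ldr|str)(?:\.w)?\s+(\w+),\s*\[sp(?:,\s*#(\d+))?\]$")
# Matches register-to-register moves such as "mov r2, r2".
MOV = re.compile(r"^mov(?:s|\.w)?\s+(\w+),\s*(\w+)$")


def frame_stats(instrs):
    """Count SP-relative accesses and adjacent redundancies in a listing.

    instrs: list of normalized instruction strings (assumed input format).
    """
    parsed = [SP_ACCESS.match(i.strip()) for i in instrs]
    sp_count = sum(1 for m in parsed if m)

    adjacent = 0
    # str[k];ldr[k] and ldr[k];ldr[k]: a load right after an access to the same slot.
    for prev, cur in zip(parsed, parsed[1:]):
        if prev and cur and cur.group(1) == "ldr":
            if (prev.group(3) or "0") == (cur.group(3) or "0"):
                adjacent += 1
    # mov rX,rX: self-moves.
    for i in instrs:
        m = MOV.match(i.strip())
        if m and m.group(1) == m.group(2):
            adjacent += 1

    total = len(instrs)
    return {
        "total": total,
        "sp_accesses": sp_count,
        "sp_share": sp_count / total if total else 0.0,
        "adjacent_redundant": adjacent,
    }
```

A result with a high `sp_share` and `adjacent_redundant == 0` matches the pattern described above: lots of frame traffic that an adjacent-only peephole cannot see.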

## Root cause
`--relocatable` routes to `select_with_stack` (the direct selector, #197), which **bypasses the `synth-opt` IR optimizer** (CSE/DCE/const-fold/regalloc). It emits straight-line stack-machine code: every operand is materialized to a register, every `local.get` is a fresh reload, and param frame-backing (#204) adds a reload per param read.

## General optimization options (ranked by leverage vs risk)
1. **Local redundant-memory elimination on the selector output**: track which register currently holds each frame slot; rewrite a reload to a `mov` when the value is still live in another register, or drop it when it is already in the destination register; drop dead stores. General, local, low-risk, oracle-checkable. *(Recommended first.)*
2. **Keep multi-read locals/params in registers** in `select_with_stack` (load once, reuse) instead of reload-per-`local.get`. Higher payoff, touches the allocator.
3. **Route `--relocatable` through `synth-opt`** (the big lever: reuse the real optimizer) once its ABI is made relocatable-correct. #197 bypassed it because of the absolute linmem base and non-preserving calls. Highest payoff, highest risk.
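
To make option 1 concrete, here is a minimal sketch of the load-forwarding half of the pass over a toy instruction list. Everything in it is an assumption for illustration: the tuple encoding (`("ldr", reg, slot)`, `("str", reg, slot)`, `("op", dst, *srcs)`), the name `forward_frame_loads`, and the premise that frame slots are only ever touched by SP-relative `ldr`/`str`. It is not the selector's real data structure. Dead-store removal is left out because it needs liveness information.

```python
def forward_frame_loads(instrs):
    """Forward frame-slot values held in registers within straight-line code.

    Toy encoding (hypothetical):
      ("ldr", reg, slot)   load frame slot into reg
      ("str", reg, slot)   store reg into frame slot
      ("op", dst, *srcs)   any register op that writes dst
      ("mov", dst, src)    register move
      anything else        treated as a barrier (call, label, branch)
    """
    slot_reg = {}  # frame slot -> register known to hold its current value
    out = []

    def clobber(reg):
        # reg is about to be overwritten: forget slots it was mirroring.
        for slot in [s for s, r in slot_reg.items() if r == reg]:
            del slot_reg[slot]

    for ins in instrs:
        kind = ins[0]
        if kind == "str":
            _, reg, slot = ins
            slot_reg[slot] = reg  # reg now mirrors the slot
            out.append(ins)
        elif kind == "ldr":
            _, reg, slot = ins
            src = slot_reg.get(slot)
            if src == reg:
                continue  # value already in the destination: drop the reload
            clobber(reg)
            if src is None:
                out.append(ins)  # unknown slot contents: keep the load
                slot_reg[slot] = reg
            else:
                out.append(("mov", reg, src))  # forward from the live register
        elif kind in ("op", "mov"):
            clobber(ins[1])
            out.append(ins)
        else:
            # Conservative: calls, labels, and branches invalidate everything.
            slot_reg.clear()
            out.append(ins)
    return out
```

Clearing all state at every barrier keeps the pass local to a basic block, which is what makes it easy to check against the differential oracle: the rewritten code should be observably identical to the selector's original output.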

## Asks for gale
- Post the **on-target measurement** (cycles/instructions for the hot functions, ideally a per-function or per-region breakdown) so I optimize what's actually hot, not what merely looks redundant.
- Flag any **correctness regression** from an optimization immediately (the oracle harness is my pre-merge guard, but on-hardware is ground truth).

Each optimization ships as a normal bugfix-cadence release with a falsification statement; correctness is gated by the differential oracle + full suite + the #193/#186 fuzz.
