Skip to content

perf: Replace BTreeMap with sorted vec#1

Closed
michael-weigelt wants to merge 2 commits into
mwe/conflict_setfrom
mwe/live_range_set
Closed

perf: Replace BTreeMap with sorted vec#1
michael-weigelt wants to merge 2 commits into
mwe/conflict_setfrom
mwe/live_range_set

Conversation

@michael-weigelt

@michael-weigelt michael-weigelt commented Jun 5, 2026

Copy link
Copy Markdown
Owner

Follow-up to bytecodealliance#261.

A flamegraph showed that during (non-optimizing) compilation of a Wasm function with 1000s of locals, try_to_allocate_bundle_to_reg spends a lot of time (14% overall) peeking and walking a BTreeMap. After bytecodealliance#261, those 14% are more like 30%, so this PR replaces that BTreeMap by a sorted vec (with a BTreeMap-like interface), which is much more cache-friendly. The trade-off is that mutations are O(log n) instead of O(1), but they seem to be rare enough compared to walking the tree that it's worth it.

Note that I have not found a benchmark for the general-purpose case, so I can only report the speedups of my slightly degenerate usecase of many-locals-functions, compiled with all optimizations turned off.

Compiling a Wasm function with X locals before and after bytecodealliance#261 and after this PR:

                                  PR 261's improvement      this PR's improvement
1000 locals: 780ms compilation -> 363ms   -53%   -> (PR2) 216ms  -40%  overall: -72%
10k  locals:  66s compilation ->   23s    -65%   ->        10s   -56%  overall: -84%
40k  locals: 952s compilation ->  356s    -62%   ->       166s   -53%  overall: -82%

@michael-weigelt michael-weigelt marked this pull request as draft June 5, 2026 20:30
@cfallin

cfallin commented Jun 5, 2026

Copy link
Copy Markdown

(I saw this referenced from the other PR, I hope you don't mind my comments here)

While the speedup is indeed impressive on your particular use-case, I think this is a change we will not be able to take, for an algorithmic-complexity reason: the allocation map per preg will see a lot of insertions into the middle, and can see removals (on backtracking) as well. A sorted Vec is great if we can sort once and then rely on bsearch to locate keys and scan from there, but will degrade to quadratic behavior overall in any workload that allocates a lot of liveranges in any order that is not top-to-bottom (and in general liveranges are sorted by weight/score so they'll be processed in some arbitrary order). Thus this will pose a huge blowup risk to compile time.

(Nevertheless I'll be curious what you find on Sightglass with other benchmarks, especially large ones like SpiderMonkey; that will tell us how common this blowup risk is and can still feed into thinking.)

@michael-weigelt

Copy link
Copy Markdown
Owner Author

Not at all, thanks for having an early look.

the allocation map per preg will see a lot of insertions into the middle

Yeah given this information I agree. I was not sure if that would be the case, so trying it out seemed faster than trying to understand the bigger picture (for now). So thanks for the input. I don't see a way to turn this PR into something usable then. I am nevertheless running the benchmarks for main+this PR (without the conflict_set changes) and will report them here. Let me know if you'd also like to see results from main+conflict_set+this PR.

@cfallin

cfallin commented Jun 9, 2026

Copy link
Copy Markdown

Sure, data is always welcome at least to point us toward what we might be missing (and could be useful if we decide to e.g. do an adaptive thing in the future).

@michael-weigelt

Copy link
Copy Markdown
Owner Author

main + only this PR
all_normal_pr2.txt

@michael-weigelt

Copy link
Copy Markdown
Owner Author

main + bytecodealliance#261 + this PR
all_normal_both.txt

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants