Skip to content

ci(miri): parallelize with cargo-nextest so Miri uses the cores (not 45 min on one)#521

Open
avrabe wants to merge 5 commits into
mainfrom
ci/miri-nextest-parallel
Open

ci(miri): parallelize with cargo-nextest so Miri uses the cores (not 45 min on one)#521
avrabe wants to merge 5 commits into
mainfrom
ci/miri-nextest-parallel

Conversation

@avrabe

@avrabe avrabe commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

The measured answer to "where do CPU/mem vanish?"

From the #520 run timeline:

  • Peak 15 jobs ran concurrently, queues ~22 s → the pool is using the hardware; capacity isn't the bottleneck.
  • Miri ran 45 m 28 s — 5× the next-slowest job — as one cargo miri test process. Miri is a single-threaded interpreter, so it pins one core for 45 min while the rest of the box is idle, and hogs the scarce lean-mem runner the whole time (that's the "contention" we saw — a symptom, not the cause). The 24 GB is Miri shadow memory, not a leak.

So the resources don't vanish — Miri just can't use them.

Fix: cargo miri nextest run (process-per-test)

nextest runs each test in its own process → many Miri processes in parallel → the cores get used and wall-time collapses toward the slowest single test.

  • --test-threads 4: conservative start so concurrent Miri shadow-memory stays under 24 GB; raise once a green run shows real peak-RSS headroom.
  • Same test selection — the libtest --skip <m> list is translated to nextest -E 'not (test(<m>) | ...)'. Verified locally: 721 run / 412 excluded, identical modules.
  • Adds a Show runner resources step (nproc, free -h) so the log makes the available CPU/mem explicit.

What to watch on this PR's CI

  • The Miri job wall-time (expect ~17–26 min at 4 threads vs 45) and that it stays green.
  • The resource-print confirming the box's real cores/RAM.

If 4 threads lands comfortably under memory, I'll raise --test-threads in a follow-up to push the wall-time down further (and can then tighten the timeout). Builds on REQ-215; Refs #509.

🤖 Generated with Claude Code

…le cores

Measured root cause of the 45-min Miri job + lean-mem contention: `cargo miri
test` runs ONE Miri process — a single-threaded interpreter — pinning one core
for 45 min while the rest of the (ample) box sits idle. The CPU/mem doesn't
vanish into contention (peak 15 concurrent jobs, ~22s queues); Miri just can't
use it, and it hogs the scarce lean-mem runner for 45 min so other lean-mem
jobs queue behind it.

Switch to `cargo miri nextest run`: each test runs in its own process, so the
runner's cores are actually used and wall-time collapses toward the slowest
single test. `--test-threads 4` bounds concurrent Miri processes under the 24G
shadow-memory ceiling (conservative start; raise after a green run shows peak
RSS). Same test selection: the libtest `--skip <m>` list is translated to
nextest's `-E 'not (test(<m>) | ...)'` (verified locally: 721 run / 412
excluded, identical modules). Adds a resource-print step so the log shows the
box's real nproc/free.

Refs: REQ-215, #509
Trace: skip

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark 'Rivet Criterion Benchmarks'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.20.

Benchmark suite Current: b4d5cd2 Previous: e60a3a9 Ratio
traceability_matrix/1000 56223 ns/iter (± 2317) 43193 ns/iter (± 499) 1.30
query/10000 321001 ns/iter (± 1729) 236806 ns/iter (± 4501) 1.36

This comment was automatically generated by workflow using github-action-benchmark.

@codecov

codecov Bot commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

avrabe added 2 commits June 10, 2026 15:20
…not 24

Measured the runner directly (the new resource-print step): 32 cores, 125 GiB
RAM — the "lean-mem" label is a misnomer and memory is a non-constraint. At
--test-threads 4 the parallel Miri job still timed out at 45 min (~678 tests ×
~20s ÷ 4 ≈ 56 min); 28 of 32 cores sat idle. Raise to 24 threads:
~226 core-min ÷ 24 ≈ ~9-10 min, using ~72 GiB (well under 125). This is where
the CPU/mem was "vanishing": into idle cores, because nothing was using them.

Refs: REQ-215, #509
Trace: skip
Correction to the earlier "24 threads thrash" claim: measured effective
concurrency is 23x — parallelism works fine. The real problem is work VOLUME:
the 678-test sweep is ~930 test-minutes (mean 104s/test, worst 557s) of mostly
safe business logic (regex/glob/HTML) that Miri runs ~100-1000x slower than
native and that has no `unsafe` to validate. No thread count fits that in a
per-PR budget.

So split it (the #498 nightly pattern):
- PR/push `miri` → only the real UB surface: the SyntaxKind `transmute`s in the
  rowan CST parsers (sexpr + yaml_cst), 41 tests, ~3 min. This is what Miri is
  FOR.
- nightly/manual `miri-full` → the full Miri-compatible sweep, 24 threads on the
  125 GiB runner, 90-min budget, off the per-PR path.

Refs: REQ-215, #509
Trace: skip
avrabe added 2 commits June 10, 2026 19:14
When I split Miri into scoped (PR) + full (nightly) I kept MIRIFLAGS on
miri-full but dropped it from the scoped job, so it ran under default Stacked
Borrows and surfaced rowan's thin-token SharedReadOnly zero-size-retag UB —
which I mistakenly reported as a fresh finding. Research (subagent, source-
grounded) confirms: rust-analyzer/rowan #210/#211/#212 are all closed UNMERGED
(upstream is rewriting rowan, not patching); the maintainer holds that rowan
should pass Tree Borrows and the failing Stacked Borrows rule is unlikely to
survive Rust's final aliasing model. Our fork pin is sound under Tree Borrows,
which the original job already used.

Restore `MIRIFLAGS: -Zmiri-disable-isolation -Zmiri-tree-borrows` on the scoped
job (matching miri-full and the pre-split job). Update the rowan pin comment in
Cargo.toml to reflect the closed-unmerged/rewrite reality.

Refs: REQ-215, #509
Trace: skip
…ri SB-sound

Adopts phall1's unmerged rust-analyzer/rowan#212 into our fork as
fix/miri-soundness-v3 (= v2 + `48a1b5e` GreenToken-as-DST + `9e7abd1` cursor SB
fix). v2 was only sound under Tree Borrows; the residual thin-token
SharedReadOnly zero-size-retag UB is now fixed, so rivet's rowan parsers are
sound under BOTH Stacked Borrows and Tree Borrows.

- Cargo.toml: pin branch v2 → v3; comment rewritten to the closed-unmerged /
  upstream-rewrite reality + the SB-soundness.
- Cargo.lock: rowan git rev → 9e7abd1.
- ci.yml: the scoped PR Miri job now gates under STACKED BORROWS (strictest;
  drop -Zmiri-tree-borrows) since v3 makes it sound. miri-full stays on Tree
  Borrows for the broader (incl. mutable-path) sweep.

Confirmed: rivet-core builds against v3; native sexpr::/yaml_cst:: tests pass;
rowan fork v3 builds + carries phall1's miri_node_cache_rehash / miri_cursor_free_sb
regression tests; actionlint clean.

Refs: REQ-215, #509
Trace: skip
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant