feat: multi-node scale-out for SLURM deployments (image sync, manifest restore, best-CSV aggregation) by mkuznet1 · Pull Request #112 · ROCm/madengine

mkuznet1 · 2026-04-23T15:56:47Z

Summary

Closes the multi-node coordination gap in the SLURM path. The upstream refactor
made madengine multi-node observable (per-rank log staging, login-node
collect_results, partial manifest restore). It did not yet coordinate
which node owns the image, when workers can enter docker run,
whose CSV the aggregator should pick, and which manifest fields a worker
needs to reconstruct the builder's environment.
This PR adds that coordination, plus a handful of single-node issues that
only hurt under scale-out. Single-node behavior is unchanged — every new
path is gated on MAD_DOCKER_BUILDS, NNODES > 1, manifest presence, or
failure flow.

What's new

Multi-node coordination

Shared local-image tar cache (MAD_DOCKER_BUILDS): rank 0 docker saves,
workers docker load. No per-node rebuilds or registry re-pulls.
TCP rendezvous barrier between rank 0 and workers so no one enters
docker run against a half-loaded image. No shared FS required.
Rebuild-from-manifest fallback on workers where the local image is missing.
Best-match multi-node CSV picker in collect_results (ranks candidates
by non-empty performance rows). Multi-node perf validation defers the
empty-CSV verdict to the login-node step instead of failing the first rank.
Manifest restore on workers now covers docker_mounts, docker_build_arg,
docker_gpus, gpu_vendor, guest_os — runtime values keep priority.
Bug fix on the restore path
docker_env_vars restore no longer overwrites os.environ-sourced values
(e.g. MAD_SECRETS_HFTOKEN) with unexpanded ${VAR} literals from an
old manifest. Fixes HF 401 on Primus / gated models on rerun.
Single-node issues exposed by scale-out
madengine --version / --help are side-effect-free (MAD_SETUP_MODEL_DIR
is skipped; SLURM preflight probes run under env -u MODEL_DIR). No more
GBs × N hosts on every job.
Duplicate-mount protection when a model supplies its own -v in
additional_docker_run_options.
Expanded SLURM → Docker env pass-through for vLLM/SGLang disagg and Primus
(xP, yD, PROXY_TYPE, PD_SYNC_ROOT, BARRIER_TIMEOUT_S,
REQUIRE_RDMA, MODEL_NAME, MODEL_DIR, …).
Container-failure diagnostics: ps, sockets (ss/netstat/lsof),
log tails of /run_logs and /myworkspace/<model_dir> captured on
docker exec failure; rocEnvTool console timeouts added to the
benign-pattern list; failure-path artifact preservation for
perf_*.csv, perf-*.csv, benchmark_*_CONCURRENCY.log.

Validation

End-to-end on a real SLURM cluster (oci_cx7), three RDMA-enabled workloads:
MLPerf Llama-3.1 8B training (2 × 8 GPUs, 8 IB HCAs), Primus Megatron-LM
Llama-3.1 8B training (2 × 8 GPUs), vLLM disaggregated inference Llama-3.1 8B
(3 × 8 GPUs, UCX RDMA-CM). New integration tests cover the shared-tar-cache
paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).

Copilot

Pull request overview

This PR adds missing multi-node coordination for SLURM-based distributed runs, focusing on making “local image mode” usable at scale (shared image tar caching + rendezvous barrier), improving manifest restore fidelity on workers, and making login-node result collection choose the most complete CSV across nodes.

Changes:

Add shared local-image tar cache support (MAD_DOCKER_BUILDS) with rank-0 save and worker load, plus a TCP rendezvous barrier to prevent premature docker run.
Expand manifest restore to include additional runtime context (mounts/build args/GPU config) while preventing manifest docker env vars from overwriting env-sourced secrets.
Improve multi-node result collection by selecting the “best” multiple-results CSV (most non-empty performance rows) and deferring empty-CSV verdicts to aggregation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file

File	Description
tests/integration/test_container_execution.py	Adds integration coverage for shared tar-cache image availability logic.
src/madengine/orchestration/run_orchestrator.py	Restores additional runtime context fields from manifest and avoids overwriting env-sourced `docker_env_vars`.
src/madengine/execution/container_runner.py	Implements image tar caching, TCP barrier synchronization, duplicate-mount avoidance, extended env passthrough, and extra failure diagnostics/artifact staging.
src/madengine/deployment/templates/slurm/job.sh.j2	Makes `madengine --help/--version` probes side-effect-free by unsetting `MODEL_DIR`.
src/madengine/deployment/slurm.py	Adds “best CSV” selection for multi-node `multiple_results` aggregation.
src/madengine/core/constants.py	Skips MODEL_DIR setup and suppresses verbose config prints for `--help/--version` invocations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

…esults _select_best_multiple_results_csv() stripped whitespace from fieldnames to detect a "performance" column but then read row values via row.get("performance") against the unstripped DictReader keys. If a node emitted the header as " performance", detection succeeded while every row scored 0 and ranking became order-dependent. Normalize per-row keys (mirroring the {k.strip(): v} pattern used elsewhere in collect_results) so detection and lookup agree. Addresses Copilot review threads r3132174282 and r3162090619 on PR ROCm#112.

Context auto-imports secrets whose keys contain MAD_SECRETS_* (note the plural). The comment referenced the non-existent MAD_SECRET_HFTOKEN, which would mislead anyone debugging HF auth issues during manifest restore. Comment-only change; no behavior impact. Addresses Copilot review threads r3132174316 and r3162090646 on PR ROCm#112.

Both ContainerRunner._get_build_args() and DockerBuilder.get_build_arg() interpolated build-arg keys/values into a `docker build` command using manual single-quote wrapping. Console.sh executes with shell=True, so values containing quotes, whitespace, `$`, or other shell metacharacters would break the command and, when docker_build_arg is sourced from manifests / user context, become a shell-injection vector. Wrap each KEY=VALUE pair with shlex.quote() in both call sites. For simple alphanumeric values the wrapper is a no-op (shlex.quote treats `=` as a safe character), so the on-the-wire format only changes when the original wrapping was actually unsafe. Update the affected DockerBuilder integration tests to match the new shell-safe output. Addresses Copilot review threads r3132174338 and r3162090405 on PR ROCm#112.

_build_local_image_from_manifest() built its `docker build ...` command by concatenating unquoted run_image, dockerfile, and docker_context. Because Console.sh runs with shell=True, paths containing spaces would fail the build and manifest-controlled values could inject additional shell tokens. Wrap each component with shlex.quote(). Also quote run_image in the subsequent `docker image inspect` call for consistency with the other inspect sites in this module. Addresses Copilot review threads r3132174397 and r3162090481 on PR ROCm#112.

… barrier _tcp_image_ready_barrier() did `rank_int = int(node_rank)` without guarding, so an unset or malformed NODE_RANK / RANK env var would abort the run with an opaque ValueError mid-barrier. Wrap the conversion in try/except and raise RuntimeError with the offending value. A fallback to rank 0 is intentionally not used here: silently promoting a worker to rank 0 would make it bind a listener and deadlock the other peers. Addresses Copilot review threads r3132174416 and r3162090578 on PR ROCm#112.

…ADDR when possible The TCP image-ready barrier in ContainerRunner._tcp_image_ready_barrier had two shared-cluster weaknesses: 1. The barrier token was derived only from SLURM_JOB_ID (`token = f"JOB{job_id}"`) and is therefore predictable. Another user or process on the master node could send a forged `READY JOB<id> <rank>` message and consume a worker slot. 2. The listener bound to `0.0.0.0`, gratuitously exposing the barrier on every interface of the master host. Mitigations: * Allow callers to set `MAD_BARRIER_TOKEN` to an opaque secret (generated once in the SLURM submit script and exported to all ranks). When unset, fall back to the previous `JOB<id>` token for backward compatibility. * Try to bind to the resolved IP of `MASTER_ADDR` first; fall back to `0.0.0.0` only when that binding fails (e.g. on single-node / loopback test scenarios). CHANGELOG updated to document the new env var and the bind behaviour. A larger fix (auto-generated nonce broadcast via SLURM env, mutual auth) is deferred — it would require coordination with the submit script. Addresses Copilot review thread r3132174443 on PR ROCm#112.

…emove redundant call Two intertwined bugs in the multi-node local-image flow: 1. _ensure_local_image_available() previously called the TCP barrier only on the `tar_missing_at_start` branch. If nodes disagreed on tar existence (e.g. shared MAD_DOCKER_BUILDS with NFS visibility lag, or stale cache state), different ranks took different numbers of barriers and the participants could deadlock until timeout. 2. run_models_from_manifest() then called _sync_after_local_image_ready() again immediately after _ensure_local_image_available() returned, adding a redundant second rendezvous on the slow path and increasing the chance of a flaky timeout. Refactor _ensure_local_image_available() into three explicit phases: (a) Primary makes the image (and tar, if cache configured) available. (b) All ranks cross _sync_after_local_image_ready() exactly once; the function is already a no-op for single-node runs. (c) Workers consume the artefact (load from tar or fall back to build/pull when no shared cache is configured). Drop the redundant outer barrier in run_models_from_manifest() and document the new invariant in the docstring. Existing tests in tests/integration/test_container_execution.py continue to pass: the worker-waits test still observes exactly one _sync_after_local_image_ready() call; the existing-tar test never sets NNODES so the barrier is a no-op. Addresses Copilot review threads r3132174463 and r3162090536 on PR ROCm#112.

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

…y barrier - Worker now sends and compares against int(node_rank) so leading-zero values like NODE_RANK="01" still match the master's ACK (always built from int(parts[2])). Previously the worker would time out waiting for the literal "GO <tok> 01" while the master responded with "GO <tok> 1". - Master path now logs the joined peer list after the rendezvous so a previously dead `peers` accumulator becomes useful diagnostic output for stuck multi-node startups. Addresses PR ROCm#112 Copilot threads r3227113648 (rank mismatch) and r3227113591 (unused `peers`).

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

…on, shell-escape hardening, manifest restore, and multi-node CSV aggregation This squashes the PR ROCm#112 work into a single commit on top of upstream develop (aa94fb7). The PR adds the multi-node SLURM `localimage` path together with the safety/quality follow-ups raised in Copilot and self-review. Multi-node local-image preparation - Build missing local images from the manifest on primary, then sync all nodes via a lightweight TCP image-ready barrier so workers only start pulling/loading once the image actually exists. - Shared-tar cache for local images (MAD_DOCKER_BUILDS): primary ``docker save`` s the freshly-built image into the shared dir, workers ``docker load`` from the same tar instead of rebuilding. Tar is now written atomically via a sibling ``.tmp.<pid>`` file + ``os.replace`` so a primary crash mid-save never leaves a half-written tar that the next run would happily load. - Barrier participation is now symmetric across nodes (single call site in ``_ensure_local_image_available``) and the redundant call site in ``run_models_from_manifest`` was removed. - ``MAD_BARRIER_TOKEN`` can override the default token (which is derived from MASTER_ADDR / port so it doesn't leak across concurrent jobs), and the master socket is bound to MASTER_ADDR's resolved IP when possible (fallback to 0.0.0.0). - Worker ranks are normalized (``int(NODE_RANK)``) for both READY and ACK messages so leading-zero values like ``NODE_RANK="01"`` no longer silently time out against the master's int-parsed ACK string. - Strict env parsing: malformed ``NODE_RANK`` / ``RANK`` or ``NNODES`` / ``WORLD_SIZE`` raises ``RuntimeError`` instead of silently defaulting to 0/1, which previously let a worker take the primary code path (image build, tar save) and deadlock the barrier — or skip the barrier entirely. - On master barrier timeout, any worker connections that did manage to ACK are now closed so workers see EOF instead of a hanging half-open socket, surfacing the real error in worker logs. - The accepted-peer list is logged on success for easier multi-node deadlock diagnosis from the master log alone. Shell-escape hardening - ``ContainerRunner._get_build_args`` and ``DockerBuilder.get_build_arg`` quote every build-arg key/value through ``shlex.quote`` to match ``Console.sh``'s ``shell=True`` execution path. - ``_build_local_image_from_manifest`` quotes the image tag, Dockerfile, and build context for the same reason. - ``model_dir`` is now passed through ``_bash_quote_path`` before interpolation into the on-failure container diagnostics ``bash -lc`` block so model dirs with whitespace or shell metacharacters do not break the diagnostic copy. - ``cp -- /run_logs/...`` for benchmark logs only quotes a concrete ``SLURM_JOB_ID`` and leaves the default ``*`` as a real glob (instead of a quoted literal that silently no-op'd on non-SLURM runs). Manifest restore / docker_env_vars precedence - Runtime context fields (docker mounts, build args, env vars) are restored from the manifest on execution so per-model overrides survive a slurm submission round-trip. - ``docker_env_vars`` from the manifest never overwrite runtime values already populated by ``Context`` (e.g. ``MAD_SECRETS_*`` read from ``os.environ``). Manifest entries may still contain unexpanded ``${VAR}`` placeholders, and the runtime-resolved value must win. - Comment corrected from ``MAD_SECRET_HFTOKEN`` to ``MAD_SECRETS_HFTOKEN``. Multi-node CSV aggregation - ``_select_best_multiple_results_csv`` now strips whitespace from ``DictReader`` header names and from row keys when computing the "non-empty performance" score, so multi-node multiple_results.csv files emitted by some Primus configs (which include a leading space before column names) are scored correctly instead of always tying at zero. - Empty-perf detection is deferred to the aggregated result in multi-node mode so a single empty per-node CSV no longer fails the whole job before aggregation has a chance to run. Mount/env pass-through & CLI isolation - Docker mount list is deduplicated so multiple mount sources mapping to the same target only produce one ``-v`` flag. - SLURM env var pass-through is widened to include ``SLURM_JOB_ID``, ``MASTER_ADDR``, ``NODE_RANK``, ``NNODES``, ``WORLD_SIZE`` and the ``MAD_*`` family, so containers see a consistent rendezvous environment. - ``--version`` / ``--help`` lightweight CLI invocations no longer mutate ``MODEL_DIR`` global state, which previously polluted subsequent full runs in the same shell. Container diagnostics - On model script failure, collect a bounded snapshot from the failed container (``/run_logs``, ``/myworkspace/<model_dir>``) so multi-node failures aren't opaque — paired with the model_dir quoting above. Touched: ``src/madengine/execution/container_runner.py``, ``src/madengine/execution/docker_builder.py``, ``src/madengine/orchestration/run_orchestrator.py``, ``src/madengine/deployment/slurm.py``, ``src/madengine/deployment/templates/slurm/job.sh.j2``, ``src/madengine/core/constants.py``, ``tests/integration/test_container_execution.py``, ``CHANGELOG.md``.

mkuznet1 self-assigned this Apr 23, 2026

Copilot AI review requested due to automatic review settings April 23, 2026 15:56

mkuznet1 requested a review from gargrahul as a code owner April 23, 2026 15:56

mkuznet1 added the enhancement New feature or request label Apr 23, 2026

mkuznet1 requested review from Rohan138, coketaste and leconcio as code owners April 23, 2026 15:56

mkuznet1 added the help wanted Extra attention is needed label Apr 23, 2026

mkuznet1 requested a review from Cemberk as a code owner April 23, 2026 15:56

Copilot started reviewing on behalf of mkuznet1 April 23, 2026 15:58 View session

Copilot AI reviewed Apr 23, 2026

View reviewed changes

coketaste requested a review from Copilot April 29, 2026 15:05

Copilot started reviewing on behalf of coketaste April 29, 2026 15:06 View session

Copilot AI reviewed Apr 29, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 12, 2026 14:06

Copilot started reviewing on behalf of mkuznet1 May 12, 2026 14:08 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

Copilot AI review requested due to automatic review settings May 12, 2026 14:35

Copilot started reviewing on behalf of mkuznet1 May 12, 2026 14:37 View session

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread src/madengine/execution/container_runner.py Outdated

Comment thread src/madengine/execution/container_runner.py Outdated

Comment thread src/madengine/execution/container_runner.py Outdated

mkuznet1 force-pushed the pr-aicomnet-squash branch from e36284a to beb8070 Compare May 12, 2026 15:14

Conversation

mkuznet1 commented Apr 23, 2026

Summary

What's new

Validation

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants