Skip to content

feat: multi-node scale-out for SLURM deployments (image sync, manifest restore, best-CSV aggregation)#112

Open
mkuznet1 wants to merge 1 commit into
ROCm:developfrom
mkuznet1:pr-aicomnet-squash
Open

feat: multi-node scale-out for SLURM deployments (image sync, manifest restore, best-CSV aggregation)#112
mkuznet1 wants to merge 1 commit into
ROCm:developfrom
mkuznet1:pr-aicomnet-squash

Conversation

@mkuznet1
Copy link
Copy Markdown

Summary

Closes the multi-node coordination gap in the SLURM path. The upstream refactor
made madengine multi-node observable (per-rank log staging, login-node
collect_results, partial manifest restore). It did not yet coordinate
which node owns the image, when workers can enter docker run,
whose CSV the aggregator should pick, and which manifest fields a worker
needs to reconstruct the builder's environment.
This PR adds that coordination, plus a handful of single-node issues that
only hurt under scale-out. Single-node behavior is unchanged — every new
path is gated on MAD_DOCKER_BUILDS, NNODES > 1, manifest presence, or
failure flow.

What's new

Multi-node coordination

  • Shared local-image tar cache (MAD_DOCKER_BUILDS): rank 0 docker saves,
    workers docker load. No per-node rebuilds or registry re-pulls.
  • TCP rendezvous barrier between rank 0 and workers so no one enters
    docker run against a half-loaded image. No shared FS required.
  • Rebuild-from-manifest fallback on workers where the local image is missing.
  • Best-match multi-node CSV picker in collect_results (ranks candidates
    by non-empty performance rows). Multi-node perf validation defers the
    empty-CSV verdict to the login-node step instead of failing the first rank.
  • Manifest restore on workers now covers docker_mounts, docker_build_arg,
    docker_gpus, gpu_vendor, guest_os — runtime values keep priority.
    Bug fix on the restore path
  • docker_env_vars restore no longer overwrites os.environ-sourced values
    (e.g. MAD_SECRETS_HFTOKEN) with unexpanded ${VAR} literals from an
    old manifest. Fixes HF 401 on Primus / gated models on rerun.
    Single-node issues exposed by scale-out
  • madengine --version / --help are side-effect-free (MAD_SETUP_MODEL_DIR
    is skipped; SLURM preflight probes run under env -u MODEL_DIR). No more
    GBs × N hosts on every job.
  • Duplicate-mount protection when a model supplies its own -v in
    additional_docker_run_options.
  • Expanded SLURM → Docker env pass-through for vLLM/SGLang disagg and Primus
    (xP, yD, PROXY_TYPE, PD_SYNC_ROOT, BARRIER_TIMEOUT_S,
    REQUIRE_RDMA, MODEL_NAME, MODEL_DIR, …).
  • Container-failure diagnostics: ps, sockets (ss/netstat/lsof),
    log tails of /run_logs and /myworkspace/<model_dir> captured on
    docker exec failure; rocEnvTool console timeouts added to the
    benign-pattern list; failure-path artifact preservation for
    perf_*.csv, perf-*.csv, benchmark_*_CONCURRENCY.log.

Validation

End-to-end on a real SLURM cluster (oci_cx7), three RDMA-enabled workloads:
MLPerf Llama-3.1 8B training (2 × 8 GPUs, 8 IB HCAs), Primus Megatron-LM
Llama-3.1 8B training (2 × 8 GPUs), vLLM disaggregated inference Llama-3.1 8B
(3 × 8 GPUs, UCX RDMA-CM). New integration tests cover the shared-tar-cache
paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).

@mkuznet1 mkuznet1 self-assigned this Apr 23, 2026
Copilot AI review requested due to automatic review settings April 23, 2026 15:56
@mkuznet1 mkuznet1 requested a review from gargrahul as a code owner April 23, 2026 15:56
@mkuznet1 mkuznet1 added the enhancement New feature or request label Apr 23, 2026
@mkuznet1 mkuznet1 added the help wanted Extra attention is needed label Apr 23, 2026
@mkuznet1 mkuznet1 requested a review from Cemberk as a code owner April 23, 2026 15:56
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds missing multi-node coordination for SLURM-based distributed runs, focusing on making “local image mode” usable at scale (shared image tar caching + rendezvous barrier), improving manifest restore fidelity on workers, and making login-node result collection choose the most complete CSV across nodes.

Changes:

  • Add shared local-image tar cache support (MAD_DOCKER_BUILDS) with rank-0 save and worker load, plus a TCP rendezvous barrier to prevent premature docker run.
  • Expand manifest restore to include additional runtime context (mounts/build args/GPU config) while preventing manifest docker env vars from overwriting env-sourced secrets.
  • Improve multi-node result collection by selecting the “best” multiple-results CSV (most non-empty performance rows) and deferring empty-CSV verdicts to aggregation.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.

Show a summary per file
File Description
tests/integration/test_container_execution.py Adds integration coverage for shared tar-cache image availability logic.
src/madengine/orchestration/run_orchestrator.py Restores additional runtime context fields from manifest and avoids overwriting env-sourced docker_env_vars.
src/madengine/execution/container_runner.py Implements image tar caching, TCP barrier synchronization, duplicate-mount avoidance, extended env passthrough, and extra failure diagnostics/artifact staging.
src/madengine/deployment/templates/slurm/job.sh.j2 Makes madengine --help/--version probes side-effect-free by unsetting MODEL_DIR.
src/madengine/deployment/slurm.py Adds “best CSV” selection for multi-node multiple_results aggregation.
src/madengine/core/constants.py Skips MODEL_DIR setup and suppresses verbose config prints for --help/--version invocations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/deployment/slurm.py Outdated
Comment thread src/madengine/orchestration/run_orchestrator.py Outdated
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py Outdated
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/deployment/slurm.py Outdated
Comment thread src/madengine/orchestration/run_orchestrator.py Outdated
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
…esults

_select_best_multiple_results_csv() stripped whitespace from fieldnames to
detect a "performance" column but then read row values via
row.get("performance") against the unstripped DictReader keys. If a node
emitted the header as " performance", detection succeeded while every row
scored 0 and ranking became order-dependent.

Normalize per-row keys (mirroring the {k.strip(): v} pattern used elsewhere
in collect_results) so detection and lookup agree.

Addresses Copilot review threads r3132174282 and r3162090619 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
Context auto-imports secrets whose keys contain MAD_SECRETS_* (note the
plural). The comment referenced the non-existent MAD_SECRET_HFTOKEN, which
would mislead anyone debugging HF auth issues during manifest restore.

Comment-only change; no behavior impact.

Addresses Copilot review threads r3132174316 and r3162090646 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
Both ContainerRunner._get_build_args() and DockerBuilder.get_build_arg()
interpolated build-arg keys/values into a `docker build` command using
manual single-quote wrapping. Console.sh executes with shell=True, so
values containing quotes, whitespace, `$`, or other shell metacharacters
would break the command and, when docker_build_arg is sourced from
manifests / user context, become a shell-injection vector.

Wrap each KEY=VALUE pair with shlex.quote() in both call sites. For
simple alphanumeric values the wrapper is a no-op (shlex.quote treats
`=` as a safe character), so the on-the-wire format only changes when
the original wrapping was actually unsafe.

Update the affected DockerBuilder integration tests to match the new
shell-safe output.

Addresses Copilot review threads r3132174338 and r3162090405 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
_build_local_image_from_manifest() built its `docker build ...` command by
concatenating unquoted run_image, dockerfile, and docker_context. Because
Console.sh runs with shell=True, paths containing spaces would fail the
build and manifest-controlled values could inject additional shell tokens.

Wrap each component with shlex.quote(). Also quote run_image in the
subsequent `docker image inspect` call for consistency with the other
inspect sites in this module.

Addresses Copilot review threads r3132174397 and r3162090481 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
… barrier

_tcp_image_ready_barrier() did `rank_int = int(node_rank)` without guarding,
so an unset or malformed NODE_RANK / RANK env var would abort the run with
an opaque ValueError mid-barrier.

Wrap the conversion in try/except and raise RuntimeError with the offending
value. A fallback to rank 0 is intentionally not used here: silently
promoting a worker to rank 0 would make it bind a listener and deadlock the
other peers.

Addresses Copilot review threads r3132174416 and r3162090578 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
…ADDR when possible

The TCP image-ready barrier in ContainerRunner._tcp_image_ready_barrier had
two shared-cluster weaknesses:

1. The barrier token was derived only from SLURM_JOB_ID
   (`token = f"JOB{job_id}"`) and is therefore predictable. Another user or
   process on the master node could send a forged `READY JOB<id> <rank>`
   message and consume a worker slot.
2. The listener bound to `0.0.0.0`, gratuitously exposing the barrier on
   every interface of the master host.

Mitigations:

* Allow callers to set `MAD_BARRIER_TOKEN` to an opaque secret (generated
  once in the SLURM submit script and exported to all ranks). When unset,
  fall back to the previous `JOB<id>` token for backward compatibility.
* Try to bind to the resolved IP of `MASTER_ADDR` first; fall back to
  `0.0.0.0` only when that binding fails (e.g. on single-node / loopback
  test scenarios).

CHANGELOG updated to document the new env var and the bind behaviour.

A larger fix (auto-generated nonce broadcast via SLURM env, mutual auth)
is deferred — it would require coordination with the submit script.

Addresses Copilot review thread r3132174443 on PR ROCm#112.
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
…emove redundant call

Two intertwined bugs in the multi-node local-image flow:

1. _ensure_local_image_available() previously called the TCP barrier only on
   the `tar_missing_at_start` branch. If nodes disagreed on tar existence
   (e.g. shared MAD_DOCKER_BUILDS with NFS visibility lag, or stale cache
   state), different ranks took different numbers of barriers and the
   participants could deadlock until timeout.
2. run_models_from_manifest() then called _sync_after_local_image_ready()
   again immediately after _ensure_local_image_available() returned,
   adding a redundant second rendezvous on the slow path and increasing
   the chance of a flaky timeout.

Refactor _ensure_local_image_available() into three explicit phases:
  (a) Primary makes the image (and tar, if cache configured) available.
  (b) All ranks cross _sync_after_local_image_ready() exactly once;
      the function is already a no-op for single-node runs.
  (c) Workers consume the artefact (load from tar or fall back to
      build/pull when no shared cache is configured).

Drop the redundant outer barrier in run_models_from_manifest() and
document the new invariant in the docstring.

Existing tests in tests/integration/test_container_execution.py continue
to pass: the worker-waits test still observes exactly one
_sync_after_local_image_ready() call; the existing-tar test never sets
NNODES so the barrier is a no-op.

Addresses Copilot review threads r3132174463 and r3162090536 on PR ROCm#112.
Copilot AI review requested due to automatic review settings May 12, 2026 14:06
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.

Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
Comment thread src/madengine/execution/container_runner.py
mkuznet1 added a commit to mkuznet1/madengine that referenced this pull request May 12, 2026
…y barrier

- Worker now sends and compares against int(node_rank) so leading-zero
  values like NODE_RANK="01" still match the master's ACK (always built
  from int(parts[2])). Previously the worker would time out waiting for
  the literal "GO <tok> 01" while the master responded with "GO <tok> 1".
- Master path now logs the joined peer list after the rendezvous so a
  previously dead `peers` accumulator becomes useful diagnostic output
  for stuck multi-node startups.

Addresses PR ROCm#112 Copilot threads r3227113648 (rank mismatch) and
r3227113591 (unused `peers`).
Copilot AI review requested due to automatic review settings May 12, 2026 14:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py Outdated
Comment thread src/madengine/execution/container_runner.py Outdated
…on, shell-escape hardening, manifest restore, and multi-node CSV aggregation

This squashes the PR ROCm#112 work into a single commit on top of upstream
develop (aa94fb7). The PR adds the multi-node SLURM `localimage` path
together with the safety/quality follow-ups raised in Copilot and
self-review.

Multi-node local-image preparation
- Build missing local images from the manifest on primary, then sync all
  nodes via a lightweight TCP image-ready barrier so workers only start
  pulling/loading once the image actually exists.
- Shared-tar cache for local images (MAD_DOCKER_BUILDS): primary
  ``docker save`` s the freshly-built image into the shared dir, workers
  ``docker load`` from the same tar instead of rebuilding. Tar is now
  written atomically via a sibling ``.tmp.<pid>`` file + ``os.replace``
  so a primary crash mid-save never leaves a half-written tar that the
  next run would happily load.
- Barrier participation is now symmetric across nodes (single call site
  in ``_ensure_local_image_available``) and the redundant call site in
  ``run_models_from_manifest`` was removed.
- ``MAD_BARRIER_TOKEN`` can override the default token (which is derived
  from MASTER_ADDR / port so it doesn't leak across concurrent jobs),
  and the master socket is bound to MASTER_ADDR's resolved IP when
  possible (fallback to 0.0.0.0).
- Worker ranks are normalized (``int(NODE_RANK)``) for both READY and
  ACK messages so leading-zero values like ``NODE_RANK="01"`` no longer
  silently time out against the master's int-parsed ACK string.
- Strict env parsing: malformed ``NODE_RANK`` / ``RANK`` or
  ``NNODES`` / ``WORLD_SIZE`` raises ``RuntimeError`` instead of silently
  defaulting to 0/1, which previously let a worker take the primary code
  path (image build, tar save) and deadlock the barrier — or skip the
  barrier entirely.
- On master barrier timeout, any worker connections that did manage to
  ACK are now closed so workers see EOF instead of a hanging half-open
  socket, surfacing the real error in worker logs.
- The accepted-peer list is logged on success for easier multi-node
  deadlock diagnosis from the master log alone.

Shell-escape hardening
- ``ContainerRunner._get_build_args`` and
  ``DockerBuilder.get_build_arg`` quote every build-arg key/value
  through ``shlex.quote`` to match ``Console.sh``'s ``shell=True``
  execution path.
- ``_build_local_image_from_manifest`` quotes the image tag,
  Dockerfile, and build context for the same reason.
- ``model_dir`` is now passed through ``_bash_quote_path`` before
  interpolation into the on-failure container diagnostics ``bash -lc``
  block so model dirs with whitespace or shell metacharacters do not
  break the diagnostic copy.
- ``cp -- /run_logs/...`` for benchmark logs only quotes a concrete
  ``SLURM_JOB_ID`` and leaves the default ``*`` as a real glob (instead
  of a quoted literal that silently no-op'd on non-SLURM runs).

Manifest restore / docker_env_vars precedence
- Runtime context fields (docker mounts, build args, env vars) are
  restored from the manifest on execution so per-model overrides survive
  a slurm submission round-trip.
- ``docker_env_vars`` from the manifest never overwrite runtime values
  already populated by ``Context`` (e.g. ``MAD_SECRETS_*`` read from
  ``os.environ``). Manifest entries may still contain unexpanded
  ``${VAR}`` placeholders, and the runtime-resolved value must win.
- Comment corrected from ``MAD_SECRET_HFTOKEN`` to ``MAD_SECRETS_HFTOKEN``.

Multi-node CSV aggregation
- ``_select_best_multiple_results_csv`` now strips whitespace from
  ``DictReader`` header names and from row keys when computing the
  "non-empty performance" score, so multi-node multiple_results.csv
  files emitted by some Primus configs (which include a leading space
  before column names) are scored correctly instead of always tying at
  zero.
- Empty-perf detection is deferred to the aggregated result in
  multi-node mode so a single empty per-node CSV no longer fails the
  whole job before aggregation has a chance to run.

Mount/env pass-through & CLI isolation
- Docker mount list is deduplicated so multiple mount sources mapping
  to the same target only produce one ``-v`` flag.
- SLURM env var pass-through is widened to include
  ``SLURM_JOB_ID``, ``MASTER_ADDR``, ``NODE_RANK``, ``NNODES``,
  ``WORLD_SIZE`` and the ``MAD_*`` family, so containers see a
  consistent rendezvous environment.
- ``--version`` / ``--help`` lightweight CLI invocations no longer mutate
  ``MODEL_DIR`` global state, which previously polluted subsequent full
  runs in the same shell.

Container diagnostics
- On model script failure, collect a bounded snapshot from the failed
  container (``/run_logs``, ``/myworkspace/<model_dir>``) so multi-node
  failures aren't opaque — paired with the model_dir quoting above.

Touched: ``src/madengine/execution/container_runner.py``,
``src/madengine/execution/docker_builder.py``,
``src/madengine/orchestration/run_orchestrator.py``,
``src/madengine/deployment/slurm.py``,
``src/madengine/deployment/templates/slurm/job.sh.j2``,
``src/madengine/core/constants.py``,
``tests/integration/test_container_execution.py``, ``CHANGELOG.md``.
@mkuznet1 mkuznet1 force-pushed the pr-aicomnet-squash branch from e36284a to beb8070 Compare May 12, 2026 15:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request help wanted Extra attention is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants