feat: multi-node scale-out for SLURM deployments (image sync, manifest restore, best-CSV aggregation)#112
feat: multi-node scale-out for SLURM deployments (image sync, manifest restore, best-CSV aggregation)#112mkuznet1 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds missing multi-node coordination for SLURM-based distributed runs, focusing on making “local image mode” usable at scale (shared image tar caching + rendezvous barrier), improving manifest restore fidelity on workers, and making login-node result collection choose the most complete CSV across nodes.
Changes:
- Add shared local-image tar cache support (
MAD_DOCKER_BUILDS) with rank-0 save and worker load, plus a TCP rendezvous barrier to prevent prematuredocker run. - Expand manifest restore to include additional runtime context (mounts/build args/GPU config) while preventing manifest docker env vars from overwriting env-sourced secrets.
- Improve multi-node result collection by selecting the “best” multiple-results CSV (most non-empty
performancerows) and deferring empty-CSV verdicts to aggregation.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/integration/test_container_execution.py | Adds integration coverage for shared tar-cache image availability logic. |
| src/madengine/orchestration/run_orchestrator.py | Restores additional runtime context fields from manifest and avoids overwriting env-sourced docker_env_vars. |
| src/madengine/execution/container_runner.py | Implements image tar caching, TCP barrier synchronization, duplicate-mount avoidance, extended env passthrough, and extra failure diagnostics/artifact staging. |
| src/madengine/deployment/templates/slurm/job.sh.j2 | Makes madengine --help/--version probes side-effect-free by unsetting MODEL_DIR. |
| src/madengine/deployment/slurm.py | Adds “best CSV” selection for multi-node multiple_results aggregation. |
| src/madengine/core/constants.py | Skips MODEL_DIR setup and suppresses verbose config prints for --help/--version invocations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 6 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
…esults
_select_best_multiple_results_csv() stripped whitespace from fieldnames to
detect a "performance" column but then read row values via
row.get("performance") against the unstripped DictReader keys. If a node
emitted the header as " performance", detection succeeded while every row
scored 0 and ranking became order-dependent.
Normalize per-row keys (mirroring the {k.strip(): v} pattern used elsewhere
in collect_results) so detection and lookup agree.
Addresses Copilot review threads r3132174282 and r3162090619 on PR ROCm#112.
Context auto-imports secrets whose keys contain MAD_SECRETS_* (note the plural). The comment referenced the non-existent MAD_SECRET_HFTOKEN, which would mislead anyone debugging HF auth issues during manifest restore. Comment-only change; no behavior impact. Addresses Copilot review threads r3132174316 and r3162090646 on PR ROCm#112.
Both ContainerRunner._get_build_args() and DockerBuilder.get_build_arg() interpolated build-arg keys/values into a `docker build` command using manual single-quote wrapping. Console.sh executes with shell=True, so values containing quotes, whitespace, `$`, or other shell metacharacters would break the command and, when docker_build_arg is sourced from manifests / user context, become a shell-injection vector. Wrap each KEY=VALUE pair with shlex.quote() in both call sites. For simple alphanumeric values the wrapper is a no-op (shlex.quote treats `=` as a safe character), so the on-the-wire format only changes when the original wrapping was actually unsafe. Update the affected DockerBuilder integration tests to match the new shell-safe output. Addresses Copilot review threads r3132174338 and r3162090405 on PR ROCm#112.
_build_local_image_from_manifest() built its `docker build ...` command by concatenating unquoted run_image, dockerfile, and docker_context. Because Console.sh runs with shell=True, paths containing spaces would fail the build and manifest-controlled values could inject additional shell tokens. Wrap each component with shlex.quote(). Also quote run_image in the subsequent `docker image inspect` call for consistency with the other inspect sites in this module. Addresses Copilot review threads r3132174397 and r3162090481 on PR ROCm#112.
… barrier _tcp_image_ready_barrier() did `rank_int = int(node_rank)` without guarding, so an unset or malformed NODE_RANK / RANK env var would abort the run with an opaque ValueError mid-barrier. Wrap the conversion in try/except and raise RuntimeError with the offending value. A fallback to rank 0 is intentionally not used here: silently promoting a worker to rank 0 would make it bind a listener and deadlock the other peers. Addresses Copilot review threads r3132174416 and r3162090578 on PR ROCm#112.
…ADDR when possible
The TCP image-ready barrier in ContainerRunner._tcp_image_ready_barrier had
two shared-cluster weaknesses:
1. The barrier token was derived only from SLURM_JOB_ID
(`token = f"JOB{job_id}"`) and is therefore predictable. Another user or
process on the master node could send a forged `READY JOB<id> <rank>`
message and consume a worker slot.
2. The listener bound to `0.0.0.0`, gratuitously exposing the barrier on
every interface of the master host.
Mitigations:
* Allow callers to set `MAD_BARRIER_TOKEN` to an opaque secret (generated
once in the SLURM submit script and exported to all ranks). When unset,
fall back to the previous `JOB<id>` token for backward compatibility.
* Try to bind to the resolved IP of `MASTER_ADDR` first; fall back to
`0.0.0.0` only when that binding fails (e.g. on single-node / loopback
test scenarios).
CHANGELOG updated to document the new env var and the bind behaviour.
A larger fix (auto-generated nonce broadcast via SLURM env, mutual auth)
is deferred — it would require coordination with the submit script.
Addresses Copilot review thread r3132174443 on PR ROCm#112.
…emove redundant call
Two intertwined bugs in the multi-node local-image flow:
1. _ensure_local_image_available() previously called the TCP barrier only on
the `tar_missing_at_start` branch. If nodes disagreed on tar existence
(e.g. shared MAD_DOCKER_BUILDS with NFS visibility lag, or stale cache
state), different ranks took different numbers of barriers and the
participants could deadlock until timeout.
2. run_models_from_manifest() then called _sync_after_local_image_ready()
again immediately after _ensure_local_image_available() returned,
adding a redundant second rendezvous on the slow path and increasing
the chance of a flaky timeout.
Refactor _ensure_local_image_available() into three explicit phases:
(a) Primary makes the image (and tar, if cache configured) available.
(b) All ranks cross _sync_after_local_image_ready() exactly once;
the function is already a no-op for single-node runs.
(c) Workers consume the artefact (load from tar or fall back to
build/pull when no shared cache is configured).
Drop the redundant outer barrier in run_models_from_manifest() and
document the new invariant in the docstring.
Existing tests in tests/integration/test_container_execution.py continue
to pass: the worker-waits test still observes exactly one
_sync_after_local_image_ready() call; the existing-tar test never sets
NNODES so the barrier is a no-op.
Addresses Copilot review threads r3132174463 and r3162090536 on PR ROCm#112.
…y barrier - Worker now sends and compares against int(node_rank) so leading-zero values like NODE_RANK="01" still match the master's ACK (always built from int(parts[2])). Previously the worker would time out waiting for the literal "GO <tok> 01" while the master responded with "GO <tok> 1". - Master path now logs the joined peer list after the rendezvous so a previously dead `peers` accumulator becomes useful diagnostic output for stuck multi-node startups. Addresses PR ROCm#112 Copilot threads r3227113648 (rank mismatch) and r3227113591 (unused `peers`).
…on, shell-escape hardening, manifest restore, and multi-node CSV aggregation This squashes the PR ROCm#112 work into a single commit on top of upstream develop (aa94fb7). The PR adds the multi-node SLURM `localimage` path together with the safety/quality follow-ups raised in Copilot and self-review. Multi-node local-image preparation - Build missing local images from the manifest on primary, then sync all nodes via a lightweight TCP image-ready barrier so workers only start pulling/loading once the image actually exists. - Shared-tar cache for local images (MAD_DOCKER_BUILDS): primary ``docker save`` s the freshly-built image into the shared dir, workers ``docker load`` from the same tar instead of rebuilding. Tar is now written atomically via a sibling ``.tmp.<pid>`` file + ``os.replace`` so a primary crash mid-save never leaves a half-written tar that the next run would happily load. - Barrier participation is now symmetric across nodes (single call site in ``_ensure_local_image_available``) and the redundant call site in ``run_models_from_manifest`` was removed. - ``MAD_BARRIER_TOKEN`` can override the default token (which is derived from MASTER_ADDR / port so it doesn't leak across concurrent jobs), and the master socket is bound to MASTER_ADDR's resolved IP when possible (fallback to 0.0.0.0). - Worker ranks are normalized (``int(NODE_RANK)``) for both READY and ACK messages so leading-zero values like ``NODE_RANK="01"`` no longer silently time out against the master's int-parsed ACK string. - Strict env parsing: malformed ``NODE_RANK`` / ``RANK`` or ``NNODES`` / ``WORLD_SIZE`` raises ``RuntimeError`` instead of silently defaulting to 0/1, which previously let a worker take the primary code path (image build, tar save) and deadlock the barrier — or skip the barrier entirely. - On master barrier timeout, any worker connections that did manage to ACK are now closed so workers see EOF instead of a hanging half-open socket, surfacing the real error in worker logs. - The accepted-peer list is logged on success for easier multi-node deadlock diagnosis from the master log alone. Shell-escape hardening - ``ContainerRunner._get_build_args`` and ``DockerBuilder.get_build_arg`` quote every build-arg key/value through ``shlex.quote`` to match ``Console.sh``'s ``shell=True`` execution path. - ``_build_local_image_from_manifest`` quotes the image tag, Dockerfile, and build context for the same reason. - ``model_dir`` is now passed through ``_bash_quote_path`` before interpolation into the on-failure container diagnostics ``bash -lc`` block so model dirs with whitespace or shell metacharacters do not break the diagnostic copy. - ``cp -- /run_logs/...`` for benchmark logs only quotes a concrete ``SLURM_JOB_ID`` and leaves the default ``*`` as a real glob (instead of a quoted literal that silently no-op'd on non-SLURM runs). Manifest restore / docker_env_vars precedence - Runtime context fields (docker mounts, build args, env vars) are restored from the manifest on execution so per-model overrides survive a slurm submission round-trip. - ``docker_env_vars`` from the manifest never overwrite runtime values already populated by ``Context`` (e.g. ``MAD_SECRETS_*`` read from ``os.environ``). Manifest entries may still contain unexpanded ``${VAR}`` placeholders, and the runtime-resolved value must win. - Comment corrected from ``MAD_SECRET_HFTOKEN`` to ``MAD_SECRETS_HFTOKEN``. Multi-node CSV aggregation - ``_select_best_multiple_results_csv`` now strips whitespace from ``DictReader`` header names and from row keys when computing the "non-empty performance" score, so multi-node multiple_results.csv files emitted by some Primus configs (which include a leading space before column names) are scored correctly instead of always tying at zero. - Empty-perf detection is deferred to the aggregated result in multi-node mode so a single empty per-node CSV no longer fails the whole job before aggregation has a chance to run. Mount/env pass-through & CLI isolation - Docker mount list is deduplicated so multiple mount sources mapping to the same target only produce one ``-v`` flag. - SLURM env var pass-through is widened to include ``SLURM_JOB_ID``, ``MASTER_ADDR``, ``NODE_RANK``, ``NNODES``, ``WORLD_SIZE`` and the ``MAD_*`` family, so containers see a consistent rendezvous environment. - ``--version`` / ``--help`` lightweight CLI invocations no longer mutate ``MODEL_DIR`` global state, which previously polluted subsequent full runs in the same shell. Container diagnostics - On model script failure, collect a bounded snapshot from the failed container (``/run_logs``, ``/myworkspace/<model_dir>``) so multi-node failures aren't opaque — paired with the model_dir quoting above. Touched: ``src/madengine/execution/container_runner.py``, ``src/madengine/execution/docker_builder.py``, ``src/madengine/orchestration/run_orchestrator.py``, ``src/madengine/deployment/slurm.py``, ``src/madengine/deployment/templates/slurm/job.sh.j2``, ``src/madengine/core/constants.py``, ``tests/integration/test_container_execution.py``, ``CHANGELOG.md``.
e36284a to
beb8070
Compare
Summary
Closes the multi-node coordination gap in the SLURM path. The upstream refactor
made madengine multi-node observable (per-rank log staging, login-node
collect_results, partial manifest restore). It did not yet coordinatewhich node owns the image, when workers can enter
docker run,whose CSV the aggregator should pick, and which manifest fields a worker
needs to reconstruct the builder's environment.
This PR adds that coordination, plus a handful of single-node issues that
only hurt under scale-out. Single-node behavior is unchanged — every new
path is gated on
MAD_DOCKER_BUILDS,NNODES > 1, manifest presence, orfailure flow.
What's new
Multi-node coordination
MAD_DOCKER_BUILDS): rank 0docker saves,workers
docker load. No per-node rebuilds or registry re-pulls.docker runagainst a half-loaded image. No shared FS required.collect_results(ranks candidatesby non-empty
performancerows). Multi-node perf validation defers theempty-CSV verdict to the login-node step instead of failing the first rank.
docker_mounts,docker_build_arg,docker_gpus,gpu_vendor,guest_os— runtime values keep priority.Bug fix on the restore path
docker_env_varsrestore no longer overwritesos.environ-sourced values(e.g.
MAD_SECRETS_HFTOKEN) with unexpanded${VAR}literals from anold manifest. Fixes HF 401 on Primus / gated models on rerun.
Single-node issues exposed by scale-out
madengine --version/--helpare side-effect-free (MAD_SETUP_MODEL_DIRis skipped; SLURM preflight probes run under
env -u MODEL_DIR). No moreGBs × N hosts on every job.
-vinadditional_docker_run_options.(
xP,yD,PROXY_TYPE,PD_SYNC_ROOT,BARRIER_TIMEOUT_S,REQUIRE_RDMA,MODEL_NAME,MODEL_DIR, …).ps, sockets (ss/netstat/lsof),log tails of
/run_logsand/myworkspace/<model_dir>captured ondocker execfailure;rocEnvToolconsole timeouts added to thebenign-pattern list; failure-path artifact preservation for
perf_*.csv,perf-*.csv,benchmark_*_CONCURRENCY.log.Validation
End-to-end on a real SLURM cluster (
oci_cx7), three RDMA-enabled workloads:MLPerf Llama-3.1 8B training (2 × 8 GPUs, 8 IB HCAs), Primus Megatron-LM
Llama-3.1 8B training (2 × 8 GPUs), vLLM disaggregated inference Llama-3.1 8B
(3 × 8 GPUs, UCX RDMA-CM). New integration tests cover the shared-tar-cache
paths (primary-saves, existing-tar-is-loaded, worker-waits-for-primary).