
feat(slurm): add slurm_multi multi-node SLURM launcher (minimal-additive port from PR #86)#124

Open
raviguptaamd wants to merge 8 commits into ROCm:develop from raviguptaamd:add_slurm_multi_launcher

Conversation


@raviguptaamd raviguptaamd commented May 10, 2026

Summary

This PR adds a new SLURM launcher, slurm_multi, for self-managed multi-node workloads (proxy + prefill + decode topologies, e.g., SGLang Disaggregated).

The discipline for this branch is strictly additive: no develop functionality is removed unless absolutely required for slurm_multi to work, and every changed line vs develop is justified by one of the contract points below.

Diff size vs develop

 12 files changed, 2071 insertions(+), 5 deletions(-)

The handful of deletions are line modifications at agreed touchpoints (e.g., adding a new `elif` branch to a dispatch, or extending an existing `if scripts_arg.endswith(".sh")` check to also accept `.slurm`).
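A minimal sketch of those two touchpoint edits; the set contents and helper names here are illustrative assumptions, not madengine's actual code:

```python
# Hypothetical sketch of the two agreed touchpoints; VALID_LAUNCHERS
# membership and both function names are illustrative.
VALID_LAUNCHERS = {"docker", "slurm", "sglang-disagg", "slurm_multi"}

def normalize_launcher(launcher: str) -> str:
    """Map the hyphen alias onto the canonical launcher name."""
    return "slurm_multi" if launcher == "slurm-multi" else launcher

def is_model_script(scripts_arg: str) -> bool:
    """The extended extension check: .slurm accepted alongside .sh."""
    return scripts_arg.endswith((".sh", ".slurm"))
```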

What landed

- `src/madengine/deployment/common.py` (MODIFY): Register slurm_multi in VALID_LAUNCHERS; add hyphen alias normalization. All other develop content (functools.lru_cache, tools_include_rocprof_family, _ROCPROF_FAMILY_TOOL_NAMES, configure_multi_node_profiling, sglang-disagg membership) preserved.
- `src/madengine/deployment/slurm.py` (MODIFY): SLURM_MULTI_ALIASES; allocation detection in __init__; allocation helper methods; slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets _is_slurm_multi=True for the deploy gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to existing SlurmNodeSelector(...); cwd perf.csv aggregation so the dashboard reporter finds slurm_multi runs.
- `src/madengine/deployment/slurm_node_selector.py` (MODIFY): Optional reservation parameter so the reservation propagates to srun health/cleanup commands.
- `src/madengine/deployment/base.py` (MODIFY): Add skip_monitoring: bool = False to the DeploymentResult dataclass; honor it in BaseDeployment.execute() so the slurm_multi bash branch can correctly bypass the SLURM monitor poll.
- `src/madengine/execution/container_runner.py` (MODIFY): SLURM_MULTI_ALIASES / SELF_MANAGED_LAUNCHERS; _run_self_managed (runs the model script directly on baremetal so srun/scontrol work; propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container(); .slurm extension recognized alongside .sh. Develop helpers preserved verbatim (_print_run_env_table, _resolve_docker_image, _ENV_KEY_RE, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, MAD_OUTPUT_CSV passthrough, PRIMUS_* env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- `src/madengine/execution/docker_builder.py` (MODIFY): After build, write the resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time srun docker pull "$DOCKER_IMAGE_NAME" on each compute node finds the registry image.
- `src/madengine/orchestration/build_orchestrator.py` (MODIFY): use_image and build_on_compute parameters on execute(); mutex validation; use_image dispatch (incl. auto sentinel); build_on_compute dispatch; slurm_multi registry gate; _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks built_on_compute: true in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and MAD_SYSTEM_GPU_ARCHITECTURE resolved print all preserved.
- `src/madengine/cli/commands/build.py` (MODIFY): New --use-image [IMAGE | auto] and --build-on-compute flags + 3 mutex validation blocks; pass through to validate_additional_context, create_args_namespace, BuildOrchestrator.execute(). Forward-compat TODO comment for the Typer is_flag=False, flag_value="auto" deprecation.
- `src/madengine/cli/validators.py` (MODIFY): Add use_image: Optional[str] = None parameter (signature + docstring only; body unchanged) so build.py's call site does not TypeError.
- `tests/unit/test_slurm_multi.py` (NEW): Three locked-in contracts: (1) slurm_multi registered in VALID_LAUNCHERS; (2) hyphen alias normalizes; (3) wrapper SBATCH script exports every model_info.env_vars key. Fixture mirrors MAD-private PR #186's pyt_sglang_disagg_qwen3-32b_short entry verbatim.
- `examples/slurm-configs/minimal/slurm-multi-minimal.json` (NEW): Minimal reference config alongside develop's existing *-minimal.json examples.
- `CHANGELOG.md` (MODIFY): New [Unreleased] section documenting all of the above.

raviguptaamd and others added 5 commits May 9, 2026 08:10
Introduce the `slurm_multi` self-managed multi-node SLURM launcher,
fully additive vs develop. Verbatim minimal-delta port from PR ROCm#86
limited to the slurm_multi surface; no develop functionality removed.

- common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen
  alias normalization (`slurm-multi` -> `slurm_multi`). All other
  develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES,
  tools_include_rocprof_family, full configure_multi_node_profiling
  rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved.
- slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation
  detection in __init__; helper methods (_get_allocation_node_count,
  _validate_allocation_nodes); slurm_multi early dispatch in prepare();
  _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for
  deploy() gate); _generate_slurm_multi_command; new elif in
  _generate_launcher_command (existing primus/sglang-disagg dispatch
  preserved); gated bash-in-salloc branch in deploy() (slurm_multi
  only -- non-slurm_multi launchers continue to use sbatch);
  _run_inside_existing_allocation; _collect_slurm_multi_results +
  early dispatch in collect_results(); reservation= kwarg threaded
  to existing SlurmNodeSelector(...) call.
- slurm_node_selector.py: optional `reservation` parameter so the
  reservation propagates to srun health/cleanup commands.
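The reservation threading can be sketched as follows; `build_health_check_cmd` is a hypothetical stand-in for SlurmNodeSelector's actual command construction, and the `rocm-smi` probe is an assumed example:

```python
from typing import List, Optional

def build_health_check_cmd(node: str, reservation: Optional[str] = None) -> List[str]:
    """srun health-check command, optionally scoped to a reservation."""
    cmd = ["srun", "--nodes=1", f"--nodelist={node}"]
    if reservation:
        # Without this flag, srun targets the general partition and the
        # probe can fail or land outside the reserved nodes.
        cmd.append(f"--reservation={reservation}")
    cmd += ["bash", "-c", "rocm-smi"]
    return cmd
```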

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions
from develop.

Co-authored-by: Cursor <cursoragent@cursor.com>
…or slurm_multi

Wire the build/run orchestration paths needed by the slurm_multi
launcher. All existing flows for non-slurm_multi launchers continue
to work unchanged; new behavior is gated on
`distributed.launcher == "slurm_multi"` (or `slurm-multi`).

- execution/container_runner.py: SLURM_MULTI_ALIASES,
  SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs
  the model script directly on baremetal so srun/scontrol work
  inside it, propagates env from model_info.env_vars and
  additional_context.env_vars); self-managed launcher early dispatch
  in run_container() that fires only when launcher matches
  SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside
  `.sh` in script-extension detection (T-C2). Develop helpers
  preserved verbatim (_print_run_env_table, _docker_image_exists_locally,
  _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image,
  _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough,
  PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist,
  PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- execution/docker_builder.py: after build, write resolved registry
  image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the
  run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute
  node finds the registry image. Pure addition (one block before
  manifest assembly).
- orchestration/build_orchestrator.py: import shutil; capture
  user-supplied slurm keys (_original_user_slurm_keys); add
  `use_image: Optional[str]` and `build_on_compute: bool` to
  execute(); mutex validation between use_image and build_on_compute;
  use_image dispatch (incl. "auto" sentinel that resolves from model
  card via _resolve_image_from_model_card); build_on_compute
  dispatch (registry required, validated in _execute_build_on_compute);
  slurm_multi registry gate that runs DiscoverModels early to detect
  slurm_multi tags and either raises a structured ConfigurationError
  with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the
  model card env_vars (implicit --use-image fallback);
  _execute_with_prebuilt_image; _resolve_image_from_model_card;
  _execute_build_on_compute (sbatch wrapper that builds + pushes from
  one compute node, polls completion, marks `built_on_compute: true`
  in the manifest). Develop's load_credentials top-level usage,
  detect_local_gpu_arch parameter, and resolved_arch print all
  preserved unchanged.
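The docker_builder.py step above (writing the resolved registry image into each built model's env_vars) could look roughly like this; the manifest shape is simplified and the helper name is hypothetical:

```python
def propagate_registry_image(manifest: dict, registry_image: str) -> None:
    """Record the resolved registry image in every built model's env_vars
    so each compute node's `docker pull "$DOCKER_IMAGE_NAME"` can find it."""
    for entry in manifest.get("built_models", {}).values():
        entry.setdefault("env_vars", {})["DOCKER_IMAGE_NAME"] = registry_image
```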

Note: PR ROCm#86's universal merge of model_info.env_vars into
docker_env_vars in run_container() was DELIBERATELY DROPPED -- it
is not slurm_multi-essential (slurm_multi gets env_vars via
_prepare_slurm_multi_script's wrapper export block and via
_run_self_managed's own merge). Applying it universally would
silently change behavior for non-slurm_multi launchers and has an
incorrect precedence vs additional_context.env_vars; it can be
revisited as a separate PR if needed.

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions
from develop except the agreed `.sh` -> `.sh|.slurm` line modification
(T-C2: harmless extension to script-extension recognition).

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface the orchestrator paths from the previous commit through the
`madengine build` CLI:

- cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the
  local Docker build and use the named image; `auto` resolves from
  the model card's DOCKER_IMAGE_NAME); add `--build-on-compute`
  (build on a SLURM compute node and push to the configured
  registry; manifest records `built_on_compute: true`); 3 mutex
  validation blocks (--use-image vs --registry, --use-image vs
  --build-on-compute, --build-on-compute requires --registry); pass
  both new kwargs through to validate_additional_context,
  create_args_namespace, and BuildOrchestrator.execute().
- cli/validators.py: add `use_image: Optional[str] = None`
  parameter (signature + docstring only -- body unchanged) so
  build.py's call site does not TypeError. PR ROCm#86's other
  validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH
  validation, dockerfile_matched fallback removal) are NOT
  inherited; develop validation is preserved.
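A sketch of the three mutex validation blocks, using a plain ValueError in place of the CLI's actual error handling (function name and messages illustrative):

```python
from typing import Optional

def validate_build_flags(
    use_image: Optional[str],
    registry: Optional[str],
    build_on_compute: bool,
) -> None:
    """Reject invalid combinations of the new build flags."""
    if use_image and registry:
        raise ValueError("--use-image and --registry are mutually exclusive")
    if use_image and build_on_compute:
        raise ValueError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        raise ValueError("--build-on-compute requires --registry")
```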

Required for the slurm_multi launcher's documented build options
in MAD-private #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
- tests/unit/test_slurm_multi.py: nine tests across three classes
  - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
    SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
  - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
    normalize to `slurm_multi`; unknown values still fall through to
    the existing `slurm -> docker` default.
  - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
    MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
    verbatim. Asserts that prepare() takes the slurm_multi early-
    dispatch path, sets `_is_slurm_multi = True` for the deploy()
    gate, emits an SBATCH header that reflects the model card slurm
    block, and emits every model_info.env_vars key as
    `export KEY="value"` in the wrapper script. This is the explicit
    safety net for the Q2 decision to drop PR ROCm#86's universal
    docker_env_vars merge -- if slurm_multi ever stops propagating
    those env_vars to the workload, this test fails loudly.
  Existing tests in tests/unit/test_deployment.py and
  tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
  small reference config alongside develop's existing
  *-minimal.json examples (`-f` add because *.json is gitignored,
  matching how existing files in this dir were originally added).
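The env-var export contract those tests lock in can be sketched as a small assertion helper (hypothetical name; not the actual test code, which works from a full model-card fixture):

```python
def assert_env_vars_exported(script_text: str, env_vars: dict) -> None:
    """Every model_info.env_vars key must appear as export KEY="value"."""
    for key, value in env_vars.items():
        line = f'export {key}="{value}"'
        assert line in script_text, f"wrapper script missing export for {key}"
```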

Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.
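The two edits can be sketched as follows, with unrelated DeploymentResult fields elided and the monitor guard factored into a standalone function purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DeploymentResult:
    success: bool
    skip_monitoring: bool = False  # new field; the default keeps old behavior

def should_monitor(config_monitor: bool, result: DeploymentResult) -> bool:
    # develop guard: `if self.config.monitor:`
    # new guard: also honor skip_monitoring so an inline (synchronous)
    # bash-in-salloc run is not polled as if it were an sbatch job
    return config_monitor and not result.skip_monitoring
```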

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
raviguptaamd and others added 2 commits May 10, 2026 03:00
…eporter

The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.

Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
  * if cwd 'perf.csv' is absent: shutil.copy() the per-job file
  * if it exists (from previous runs): append data rows (skip the
    per-job header so the cwd file stays single-headed)

Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.
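A minimal sketch of the copy-or-append logic, assuming plain CSV files with a single header row (helper name hypothetical; the real code also wraps this in try/except):

```python
import os
import shutil

def aggregate_perf_csv(per_job_csv: str, cwd_csv: str = "perf.csv") -> None:
    """Mirror a per-job perf CSV into the conventional cwd perf.csv."""
    if not os.path.exists(cwd_csv):
        shutil.copy(per_job_csv, cwd_csv)  # first run: copy the whole file
        return
    with open(per_job_csv) as src, open(cwd_csv, "a") as dst:
        next(src)                  # skip the per-job header row
        for row in src:            # append data rows only, keeping the
            dst.write(row)         # cwd file single-headed
```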

Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).

Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).

Co-authored-by: Cursor <cursoragent@cursor.com>
- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi
  launcher, --use-image / --build-on-compute CLI flags, the slurm_multi
  build registry gate, the bash-in-salloc execution path, the
  DeploymentResult.skip_monitoring field, the SlurmNodeSelector
  reservation parameter, the new contract tests + minimal example,
  and the cwd perf.csv aggregation fix.
- cli/commands/build.py: comment near the --use-image Typer option
  flagging that `is_flag=False, flag_value="auto"` is being deprecated
  by Typer, with the planned migration path. Behavior unchanged
  (matches MAD-private PR #186's documented UX).

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd raviguptaamd marked this pull request as ready for review May 11, 2026 00:39
@raviguptaamd raviguptaamd requested a review from leconcio as a code owner May 11, 2026 00:39
Copilot AI review requested due to automatic review settings May 11, 2026 00:39
@raviguptaamd raviguptaamd requested a review from Cemberk as a code owner May 11, 2026 00:39
@raviguptaamd raviguptaamd marked this pull request as draft May 11, 2026 00:41
@raviguptaamd raviguptaamd marked this pull request as ready for review May 11, 2026 00:43

Copilot AI left a comment


Pull request overview

Adds a new additive slurm_multi SLURM launcher path intended for self-managed multi-node workloads (wrapper SBATCH + baremetal model script execution), plus supporting build-time options to make registry/pull flows workable.

Changes:

  • Introduces slurm_multi launcher registration/normalization and SLURM deployment support (including salloc “bash” execution + perf.csv aggregation).
  • Adds madengine build --use-image [IMAGE|auto] and --build-on-compute build modes, plus registry/image propagation for run-phase parallel pulls.
  • Adds contract tests and a minimal example config for slurm_multi.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
- `src/madengine/deployment/common.py`: Registers slurm_multi and normalizes `slurm-multi` -> `slurm_multi`.
- `src/madengine/deployment/slurm.py`: Implements slurm_multi wrapper generation, allocation-aware deploy path, and results collection.
- `src/madengine/deployment/slurm_node_selector.py`: Threads reservation through srun health/cleanup commands.
- `src/madengine/deployment/base.py`: Adds DeploymentResult.skip_monitoring and honors it in execute().
- `src/madengine/execution/container_runner.py`: Adds self-managed launcher path to run scripts on host instead of inside madengine Docker.
- `src/madengine/execution/docker_builder.py`: Propagates registry_image to built_models[*].env_vars.DOCKER_IMAGE_NAME.
- `src/madengine/orchestration/build_orchestrator.py`: Adds --use-image / --build-on-compute flows and slurm_multi registry gating logic.
- `src/madengine/cli/commands/build.py`: Adds CLI flags and mutex validation for the new build modes.
- `src/madengine/cli/validators.py`: Extends validate_additional_context signature to accept use_image.
- `tests/unit/test_slurm_multi.py`: Adds contract tests for launcher registration/aliasing and wrapper env-var exports.
- `examples/slurm-configs/minimal/slurm-multi-minimal.json`: Provides a minimal reference config for slurm_multi.
- `CHANGELOG.md`: Documents new launcher and build/run behaviors.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:321

  • The validate_additional_context docstring says use_image "skips required field validation", but the parameter is currently unused and validation is still always performed via finalize_additional_context_dict(context). Please either implement the documented behavior or update the docstring to avoid misleading callers.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context



import json
import os
import shutil
Comment on lines +523 to +536

        # Key by model_name (not use_image) so multiple models sharing
        # the same pre-built image are all preserved in the manifest.
        for model in models:
            model_name = model.get("name", "unknown")
            model_distributed = model.get("distributed", {})

            # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
            model_env_vars = model.get("env_vars", {}).copy()
            model_env_vars["DOCKER_IMAGE_NAME"] = use_image

            manifest["built_models"][model_name] = {
                "name": model_name,
                "image": use_image,
                "docker_image": use_image,
Comment on lines +570 to +575

        # Merge model's distributed config from the first model.
        # If multiple models have differing distributed configs, warn — only the first wins here.
        if len(models) > 1:
            distinct_distributed = {tuple(sorted((m.get("distributed") or {}).items())) for m in models}
            if len(distinct_distributed) > 1:
                self.rich_console.print(
Comment on lines +809 to +810
import shutil

Comment on lines +873 to +887

        # Prepare environment
        env = os.environ.copy()
        env.update(run_env)

        # Add model-specific env vars from model_info
        if "env_vars" in model_info and model_info["env_vars"]:
            for key, value in model_info["env_vars"].items():
                env[key] = str(value)
                print(f"  ENV: {key}={value}")

        # Add env vars from additional_context
        if self.additional_context and "env_vars" in self.additional_context:
            for key, value in self.additional_context["env_vars"].items():
                env[key] = str(value)
Comment on lines +462 to +556

            "",
            f"# slurm_multi launcher script for {model_info['name']}",
            f"# Generated by madengine for slurm_multi",
            "",
            "set -e",
            "",
            "# Environment variables",
        ])

        for key, value in env_vars.items():
            script_lines.append(f"export {key}=\"{value}\"")

        script_lines.append("")
        script_lines.extend([
            "echo '=========================================='",
            "echo 'slurm_multi Launcher'",
            "echo '=========================================='",
            f"echo 'Model: {model_info['name']}'",
            f"echo 'Script: {model_script_path}'",
            "echo 'SLURM_JOB_ID:' $SLURM_JOB_ID",
            "echo 'SLURM_NNODES:' $SLURM_NNODES",
            "echo 'SLURM_NODELIST:' $SLURM_NODELIST",
            "echo ''",
        ])

        # Check if image needs parallel pull on all nodes
        # Pull if: image is from registry (contains / or .) and not a local ci-* build
        docker_image = env_vars.get("DOCKER_IMAGE_NAME", "")
        is_registry_image = docker_image and not docker_image.startswith("ci-") and ("/" in docker_image or "." in docker_image)

        if is_registry_image:
            # Add parallel docker pull on all nodes
            # This ensures all nodes have the image before running
            script_lines.extend([
                "",
                "# Pull Docker image in parallel on all nodes",
                "echo '=========================================='",
                "echo 'Pulling Docker image on all nodes in parallel'",
                "echo '=========================================='",
                f"echo 'Image: {docker_image}'",
                "echo ''",
                "",
                f"srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES bash -c \"",
                f"  echo \\\"[\\$(hostname)] Pulling {docker_image}...\\\"",
                f"  docker pull {docker_image}",
                "  PULL_RC=\\$?",
                "  if [ \\$PULL_RC -eq 0 ]; then",
                "    echo \\\"[\\$(hostname)] Pull SUCCESS\\\"",
                "  else",
                "    echo \\\"[\\$(hostname)] Pull FAILED with exit code \\$PULL_RC\\\"",
                "  fi",
                "  exit \\$PULL_RC",
                "\"",
                "PULL_EXIT=$?",
                "",
                "if [ $PULL_EXIT -ne 0 ]; then",
                "  echo 'Docker pull failed on one or more nodes'",
                "  exit $PULL_EXIT",
                "fi",
                "",
                "echo ''",
                "echo 'Docker image pulled on all nodes'",
                "echo ''",
            ])

        # Create completion marker path for robust completion detection.
        # Namespace by SLURM_JOB_ID so concurrent / repeat runs of the same model
        # tag don't collide on each other's marker files. monitor() reconstructs
        # the same path using the deployment_id returned by sbatch.
        completion_marker_dir = self.output_dir.resolve()
        completion_marker_template = (
            completion_marker_dir
            / f"madengine_{model_info['name']}_${{SLURM_JOB_ID:-local}}.complete"
        )

        script_lines.extend([
            "",
            "# Change to script directory",
            f"cd {model_script_path.parent}",
            "",
            "# Run the model script directly on the host",
            f"echo 'Executing: bash {model_script_path.name} {model_args}'",
            f"bash {model_script_path.name} {model_args}",
            "SCRIPT_EXIT_CODE=$?",
            "",
            "echo ''",
            "echo 'Script completed.'",
            "",
            "# Write completion marker for madengine to detect (job-id namespaced)",
            f"echo \"exit_code=$SCRIPT_EXIT_CODE\" > {completion_marker_template}",
            f"echo \"timestamp=$(date -Iseconds)\" >> {completion_marker_template}",
            f"echo \"Completion marker written: {completion_marker_template}\"",
            "",
            "exit $SCRIPT_EXIT_CODE",
        ])
Comment on lines +822 to +829

            script_name = os.path.basename(scripts_arg)
        elif scripts_arg.endswith(".py"):
            script_path = scripts_arg
            script_name = os.path.basename(scripts_arg)
        else:
            # Directory specified - look for run.sh
            script_path = os.path.join(scripts_arg, "run.sh")
            script_name = "run.sh"

        else:
            # Private/internal registry - may not need auth
            self.rich_console.print(f"  Auth: Private registry (auth may not be required)")
            requires_auth = dockerhub_user and dockerhub_password

            use_image=implicit_image,
            manifest_output=manifest_output,
        )
        # No card image (or divergent images across models): fall through to error.
Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
  entry per model, all pointing at the same use_image) so it matches
  built_models and the rest of the codebase's invariant that
  ContainerRunner.run_models_from_manifest() relies on at
  built_models.get(image_name, {}). Without this, --use-image (and the
  implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
  model in the run phase.
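The restored invariant can be sketched with a simplified manifest shape (helper name hypothetical; the real entries carry more fields):

```python
def build_prebuilt_manifest(models: list, use_image: str) -> dict:
    """Key both maps by model_name so their key sets stay identical."""
    manifest = {"built_models": {}, "built_images": {}}
    for model in models:
        name = model.get("name", "unknown")
        # One entry per model, all pointing at the same pre-built image.
        manifest["built_models"][name] = {"name": name, "docker_image": use_image}
        manifest["built_images"][name] = {"image": use_image}
    return manifest
```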

Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
  same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
  The old pattern raised TypeError: unhashable type: 'dict' as soon as a
  model's distributed block contained nested dicts (sglang_disagg /
  vllm_disagg).
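A self-contained illustration of why the old pattern fails on nested dicts and the json.dumps replacement works (function name illustrative):

```python
import json

def distinct_configs(models: list, key: str = "distributed") -> set:
    """Canonical, hashable form of each model's config block."""
    return {
        json.dumps(m.get(key) or {}, sort_keys=True, default=str)
        for m in models
    }
```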

slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
  from the model script terminated the wrapper before SCRIPT_EXIT_CODE
  was captured and the completion marker was written -- failed runs
  looked like hangs to monitor(). Wrap the bash invocation in
  `set +e` / `set -e` so the marker is always written.
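The guarded invocation can be sketched as the wrapper-script lines it emits (helper name hypothetical; the real generator interleaves more echo lines):

```python
def guarded_model_invocation(script_name: str, args: str = "") -> list:
    """Return the wrapper-script lines around the model script call."""
    invoke = f"bash {script_name} {args}".rstrip()
    return [
        "set +e",               # suspend errexit: a failing model script must
                                # not abort the wrapper before the marker write
        invoke,
        "SCRIPT_EXIT_CODE=$?",  # capture the model script's exit code
        "set -e",               # restore errexit for the rest of the wrapper
    ]
```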

Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
  is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
  credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.
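The redaction can be sketched as (helper name hypothetical):

```python
def format_env_log(env_vars: dict) -> list:
    """Log which env vars are set without echoing their values."""
    return [f"  ENV: {key}=<set>" for key in env_vars]
```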

Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
  _execute_build_on_compute (assigned but never read).

Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
  informational only (parameter retained for build.py call-site
  compatibility); the docstring previously claimed it skipped required-
  field validation, which the body never implemented. Tightened the
  docstring to match reality.

New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
  set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
  using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
  for C3 using two sglang_disagg-style cards with differing nested
  dicts in `distributed`; would TypeError under the old code.

Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.

Tests:
- pytest tests/unit/test_slurm_multi.py -v  -> 12 passed (was 9).
- pytest tests/unit -q  -> 436 passed / 2 failed; the 2 failures are
  the same PermissionError environmental baseline that exists on
  develop (test_upload_file_to_mongodb_file_not_found and
  test_validate_additional_context_file_not_found, both raise
  PermissionError on '/nonexistent/file.json' instead of FileNotFound
  on this filesystem). Zero new failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd (Author)

Addressed in commit 68d0bf3:

- C1: build_orchestrator.py:13 unused `import shutil`. Removed.
- C2: _execute_with_prebuilt_image manifest schema mismatch. Bug fix: built_images now keyed by model_name (one entry per model, all pointing at the same use_image), restoring the built_images.keys() == built_models.keys() invariant that ContainerRunner.run_models_from_manifest() relies on at line 2041. Without this, --use-image and the implicit slurm_multi DOCKER_IMAGE_NAME gate silently skipped every model in the run phase.
- C3: tuple(sorted(items())) unhashable for nested-dict distributed/slurm. Bug fix: replaced with json.dumps(..., sort_keys=True, default=str) in both spots (otherwise sglang_disagg / vllm_disagg cards triggered TypeError: unhashable type: 'dict').
- C4: container_runner.py:809 unused `import shutil` (in _run_self_managed). Removed.
- C5: _run_self_managed env-var value logging (security). Log ENV: KEY=<set> (keys only) so HF_TOKEN / MAD_DOCKERHUB_PASSWORD / etc. carried in model_info.env_vars don't leak into the run log.
- C6: `set -e` + SCRIPT_EXIT_CODE=$? interaction in the slurm_multi wrapper. Bug fix: wrap the `bash <model_script>` call in set +e / set -e so a non-zero model-script exit doesn't terminate the wrapper before the completion marker is written (without this, failures looked like hangs to monitor()).
- C7: unused local script_name in _run_self_managed. Removed (only script_path is consumed by the function).
- C8: unused requires_auth in _execute_build_on_compute. Removed (assigned twice, never read).
- C10: validators.py:309 docstring lie about use_image skipping validation. Tightened to reflect actual behavior (parameter is informational only, retained for build.py call-site compatibility).

New contract tests in tests/unit/test_slurm_multi.py lock in the C2, C3, and C6 invariants:

  • test_built_images_and_models_share_keys -- 2-model fixture, asserts set(built_images) == set(built_models) and that every entry's docker_image == use_image
  • test_multi_model_nested_dict_distributed_does_not_raise -- 2 sglang_disagg-style cards with differing nested distributed.sglang_disagg; would TypeError under the old code
  • test_wrapper_disables_set_e_around_model_script -- asserts set +e / SCRIPT_EXIT_CODE=$? / set -e ordering around the bash invocation

Skipped: C9 (broader BuildOrchestrator branch tests for --use-image / --build-on-compute / registry gating). The slurm_multi-specific contract suite (12 tests, all passing) plus the multi-node cluster smoke (3-node SGLang Disagg llama-3.1-8b end-to-end on Distributed_Inference_CI, RC=0, perf table populated) already cover the slurm_multi surface end-to-end. Broader orchestrator-branch coverage for the generic --use-image / --build-on-compute paths is a worthwhile follow-up but out of scope for this minimal-additive PR.

Test results

  • pytest tests/unit/test_slurm_multi.py -v -- 12 passed (was 9; +3 new contracts)
  • pytest tests/unit -q -- 436 passed / 2 failed; the 2 failures are the same PermissionError environmental baseline that exists on develop (test_upload_file_to_mongodb_file_not_found, test_validate_additional_context_file_not_found; both raise PermissionError on '/nonexistent/file.json' instead of FileNotFoundError on this filesystem). Zero new failures.

