
feat(slurm): add slurm_multi multi-node SLURM launcher (minimal-additive port from PR #86)#124

Open
raviguptaamd wants to merge 8 commits into ROCm:develop from raviguptaamd:add_slurm_multi_launcher

Conversation


@raviguptaamd raviguptaamd commented May 10, 2026

Summary

This PR adds a new SLURM launcher, slurm_multi, for self-managed multi-node workloads (proxy + prefill + decode topologies, e.g., SGLang Disaggregated).

The discipline for this branch is strictly additive: no develop functionality is removed unless absolutely required for slurm_multi to work, and every changed line vs develop is justified by one of the contract points below.

Diff size vs develop

 12 files changed, 2071 insertions(+), 5 deletions(-)

The handful of deletions are line modifications at agreed touchpoints (e.g., adding a new `elif` branch to a dispatch, or extending an existing `if scripts_arg.endswith(".sh")` check to also accept `.slurm`).
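A minimal sketch of those two touchpoint edits; the set contents and helper names here are illustrative assumptions, not madengine's actual code:

```python
# Hypothetical sketch of the two agreed touchpoints; VALID_LAUNCHERS
# membership and both function names are illustrative.
VALID_LAUNCHERS = {"docker", "slurm", "sglang-disagg", "slurm_multi"}

def normalize_launcher(launcher: str) -> str:
    """Map the hyphen alias onto the canonical launcher name."""
    return "slurm_multi" if launcher == "slurm-multi" else launcher

def is_model_script(scripts_arg: str) -> bool:
    """The extended extension check: .slurm accepted alongside .sh."""
    return scripts_arg.endswith((".sh", ".slurm"))
```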

What landed

- `src/madengine/deployment/common.py` (MODIFY): Register slurm_multi in VALID_LAUNCHERS; add hyphen alias normalization. All other develop content (functools.lru_cache, tools_include_rocprof_family, _ROCPROF_FAMILY_TOOL_NAMES, configure_multi_node_profiling, sglang-disagg membership) preserved.
- `src/madengine/deployment/slurm.py` (MODIFY): SLURM_MULTI_ALIASES; allocation detection in __init__; allocation helper methods; slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets _is_slurm_multi=True for the deploy gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to existing SlurmNodeSelector(...); cwd perf.csv aggregation so the dashboard reporter finds slurm_multi runs.
- `src/madengine/deployment/slurm_node_selector.py` (MODIFY): Optional reservation parameter so the reservation propagates to srun health/cleanup commands.
- `src/madengine/deployment/base.py` (MODIFY): Add skip_monitoring: bool = False to the DeploymentResult dataclass; honor it in BaseDeployment.execute() so the slurm_multi bash branch can correctly bypass the SLURM monitor poll.
- `src/madengine/execution/container_runner.py` (MODIFY): SLURM_MULTI_ALIASES / SELF_MANAGED_LAUNCHERS; _run_self_managed (runs the model script directly on baremetal so srun/scontrol work; propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container(); .slurm extension recognized alongside .sh. Develop helpers preserved verbatim (_print_run_env_table, _resolve_docker_image, _ENV_KEY_RE, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, MAD_OUTPUT_CSV passthrough, PRIMUS_* env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- `src/madengine/execution/docker_builder.py` (MODIFY): After build, write the resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time srun docker pull "$DOCKER_IMAGE_NAME" on each compute node finds the registry image.
- `src/madengine/orchestration/build_orchestrator.py` (MODIFY): use_image and build_on_compute parameters on execute(); mutex validation; use_image dispatch (incl. auto sentinel); build_on_compute dispatch; slurm_multi registry gate; _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks built_on_compute: true in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and MAD_SYSTEM_GPU_ARCHITECTURE resolved print all preserved.
- `src/madengine/cli/commands/build.py` (MODIFY): New --use-image [IMAGE | auto] and --build-on-compute flags + 3 mutex validation blocks; pass through to validate_additional_context, create_args_namespace, BuildOrchestrator.execute(). Forward-compat TODO comment for the Typer is_flag=False, flag_value="auto" deprecation.
- `src/madengine/cli/validators.py` (MODIFY): Add use_image: Optional[str] = None parameter (signature + docstring only; body unchanged) so build.py's call site does not TypeError.
- `tests/unit/test_slurm_multi.py` (NEW): Three locked-in contracts: (1) slurm_multi registered in VALID_LAUNCHERS; (2) hyphen alias normalizes; (3) wrapper SBATCH script exports every model_info.env_vars key. Fixture mirrors MAD-private PR #186's pyt_sglang_disagg_qwen3-32b_short entry verbatim.
- `examples/slurm-configs/minimal/slurm-multi-minimal.json` (NEW): Minimal reference config alongside develop's existing *-minimal.json examples.
- `CHANGELOG.md` (MODIFY): New [Unreleased] section documenting all of the above.

raviguptaamd and others added 5 commits May 9, 2026 08:10
Introduce the `slurm_multi` self-managed multi-node SLURM launcher,
fully additive vs develop. Verbatim minimal-delta port from PR ROCm#86
limited to the slurm_multi surface; no develop functionality removed.

- common.py: register `slurm_multi` in VALID_LAUNCHERS, add hyphen
  alias normalization (`slurm-multi` -> `slurm_multi`). All other
  develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES,
  tools_include_rocprof_family, full configure_multi_node_profiling
  rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved.
- slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation
  detection in __init__; helper methods (_get_allocation_node_count,
  _validate_allocation_nodes); slurm_multi early dispatch in prepare();
  _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for
  deploy() gate); _generate_slurm_multi_command; new elif in
  _generate_launcher_command (existing primus/sglang-disagg dispatch
  preserved); gated bash-in-salloc branch in deploy() (slurm_multi
  only -- non-slurm_multi launchers continue to use sbatch);
  _run_inside_existing_allocation; _collect_slurm_multi_results +
  early dispatch in collect_results(); reservation= kwarg threaded
  to existing SlurmNodeSelector(...) call.
- slurm_node_selector.py: optional `reservation` parameter so the
  reservation propagates to srun health/cleanup commands.
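The reservation threading can be sketched as follows; `build_health_check_cmd` is a hypothetical stand-in for SlurmNodeSelector's actual command construction, and the `rocm-smi` probe is an assumed example:

```python
from typing import List, Optional

def build_health_check_cmd(node: str, reservation: Optional[str] = None) -> List[str]:
    """srun health-check command, optionally scoped to a reservation."""
    cmd = ["srun", "--nodes=1", f"--nodelist={node}"]
    if reservation:
        # Without this flag, srun targets the general partition and the
        # probe can fail or land outside the reserved nodes.
        cmd.append(f"--reservation={reservation}")
    cmd += ["bash", "-c", "rocm-smi"]
    return cmd
```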

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions
from develop.

Co-authored-by: Cursor <cursoragent@cursor.com>
…or slurm_multi

Wire the build/run orchestration paths needed by the slurm_multi
launcher. All existing flows for non-slurm_multi launchers continue
to work unchanged; new behavior is gated on
`distributed.launcher == "slurm_multi"` (or `slurm-multi`).

- execution/container_runner.py: SLURM_MULTI_ALIASES,
  SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs
  the model script directly on baremetal so srun/scontrol work
  inside it, propagates env from model_info.env_vars and
  additional_context.env_vars); self-managed launcher early dispatch
  in run_container() that fires only when launcher matches
  SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside
  `.sh` in script-extension detection (T-C2). Develop helpers
  preserved verbatim (_print_run_env_table, _docker_image_exists_locally,
  _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image,
  _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough,
  PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist,
  PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- execution/docker_builder.py: after build, write resolved registry
  image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the
  run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute
  node finds the registry image. Pure addition (one block before
  manifest assembly).
- orchestration/build_orchestrator.py: import shutil; capture
  user-supplied slurm keys (_original_user_slurm_keys); add
  `use_image: Optional[str]` and `build_on_compute: bool` to
  execute(); mutex validation between use_image and build_on_compute;
  use_image dispatch (incl. "auto" sentinel that resolves from model
  card via _resolve_image_from_model_card); build_on_compute
  dispatch (registry required, validated in _execute_build_on_compute);
  slurm_multi registry gate that runs DiscoverModels early to detect
  slurm_multi tags and either raises a structured ConfigurationError
  with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the
  model card env_vars (implicit --use-image fallback);
  _execute_with_prebuilt_image; _resolve_image_from_model_card;
  _execute_build_on_compute (sbatch wrapper that builds + pushes from
  one compute node, polls completion, marks `built_on_compute: true`
  in the manifest). Develop's load_credentials top-level usage,
  detect_local_gpu_arch parameter, and resolved_arch print all
  preserved unchanged.
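The docker_builder.py step above (writing the resolved registry image into each built model's env_vars) could look roughly like this; the manifest shape is simplified and the helper name is hypothetical:

```python
def propagate_registry_image(manifest: dict, registry_image: str) -> None:
    """Record the resolved registry image in every built model's env_vars
    so each compute node's `docker pull "$DOCKER_IMAGE_NAME"` can find it."""
    for entry in manifest.get("built_models", {}).values():
        entry.setdefault("env_vars", {})["DOCKER_IMAGE_NAME"] = registry_image
```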

Note: PR ROCm#86's universal merge of model_info.env_vars into
docker_env_vars in run_container() was DELIBERATELY DROPPED -- it
is not slurm_multi-essential (slurm_multi gets env_vars via
_prepare_slurm_multi_script's wrapper export block and via
_run_self_managed's own merge). Applying it universally would
silently change behavior for non-slurm_multi launchers and has an
incorrect precedence vs additional_context.env_vars; it can be
revisited as a separate PR if needed.

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions
from develop except the agreed `.sh` -> `.sh|.slurm` line modification
(T-C2: harmless extension to script-extension recognition).

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface the orchestrator paths from the previous commit through the
`madengine build` CLI:

- cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the
  local Docker build and use the named image; `auto` resolves from
  the model card's DOCKER_IMAGE_NAME); add `--build-on-compute`
  (build on a SLURM compute node and push to the configured
  registry; manifest records `built_on_compute: true`); 3 mutex
  validation blocks (--use-image vs --registry, --use-image vs
  --build-on-compute, --build-on-compute requires --registry); pass
  both new kwargs through to validate_additional_context,
  create_args_namespace, and BuildOrchestrator.execute().
- cli/validators.py: add `use_image: Optional[str] = None`
  parameter (signature + docstring only -- body unchanged) so
  build.py's call site does not TypeError. PR ROCm#86's other
  validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH
  validation, dockerfile_matched fallback removal) are NOT
  inherited; develop validation is preserved.
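A sketch of the three mutex validation blocks, using a plain ValueError in place of the CLI's actual error handling (function name and messages illustrative):

```python
from typing import Optional

def validate_build_flags(
    use_image: Optional[str],
    registry: Optional[str],
    build_on_compute: bool,
) -> None:
    """Reject invalid combinations of the new build flags."""
    if use_image and registry:
        raise ValueError("--use-image and --registry are mutually exclusive")
    if use_image and build_on_compute:
        raise ValueError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        raise ValueError("--build-on-compute requires --registry")
```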

Required for the slurm_multi launcher's documented build options
in MAD-private #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
- tests/unit/test_slurm_multi.py: nine tests across three classes
  - TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
    SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
  - TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
    normalize to `slurm_multi`; unknown values still fall through to
    the existing `slurm -> docker` default.
  - TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
    MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
    verbatim. Asserts that prepare() takes the slurm_multi early-
    dispatch path, sets `_is_slurm_multi = True` for the deploy()
    gate, emits an SBATCH header that reflects the model card slurm
    block, and emits every model_info.env_vars key as
    `export KEY="value"` in the wrapper script. This is the explicit
    safety net for the Q2 decision to drop PR ROCm#86's universal
    docker_env_vars merge -- if slurm_multi ever stops propagating
    those env_vars to the workload, this test fails loudly.
  Existing tests in tests/unit/test_deployment.py and
  tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
  small reference config alongside develop's existing
  *-minimal.json examples (`-f` add because *.json is gitignored,
  matching how existing files in this dir were originally added).
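The env-var export contract those tests lock in can be sketched as a small assertion helper (hypothetical name; not the actual test code, which works from a full model-card fixture):

```python
def assert_env_vars_exported(script_text: str, env_vars: dict) -> None:
    """Every model_info.env_vars key must appear as export KEY="value"."""
    for key, value in env_vars.items():
        line = f'export {key}="{value}"'
        assert line in script_text, f"wrapper script missing export for {key}"
```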

Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation`
verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch).
That method constructs `DeploymentResult(..., skip_monitoring=True)` to
signal that the script ran synchronously and the monitor phase should
be skipped. Develop's `DeploymentResult` dataclass did not have that
field, so the construction raised `TypeError` at runtime, the failure
was swallowed by the orchestrator, and `madengine run` exited 0 even
when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from
  `if self.config.monitor:` to
  `if self.config.monitor and not result.skip_monitoring:`
  so the bash branch can correctly bypass the SLURM job poll
  (there is no SLURM job to poll — the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never
construct DeploymentResult with skip_monitoring=True, the field
defaults to False, and the monitor() call still fires as before.
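The two edits can be sketched as follows, with unrelated DeploymentResult fields elided and the monitor guard factored into a standalone function purely for illustration:

```python
from dataclasses import dataclass

@dataclass
class DeploymentResult:
    success: bool
    skip_monitoring: bool = False  # new field; the default keeps old behavior

def should_monitor(config_monitor: bool, result: DeploymentResult) -> bool:
    # develop guard: `if self.config.monitor:`
    # new guard: also honor skip_monitoring so an inline (synchronous)
    # bash-in-salloc run is not polled as if it were an sbatch job
    return config_monitor and not result.skip_monitoring
```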

Discovered during R5.1b cluster smoke test of
`pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
raviguptaamd and others added 2 commits May 10, 2026 03:00
…eporter

The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.

Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
  * if cwd 'perf.csv' is absent: shutil.copy() the per-job file
  * if it exists (from previous runs): append data rows (skip the
    per-job header so the cwd file stays single-headed)

Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.
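A minimal sketch of the copy-or-append logic, assuming plain CSV files with a single header row (helper name hypothetical; the real code also wraps this in try/except):

```python
import os
import shutil

def aggregate_perf_csv(per_job_csv: str, cwd_csv: str = "perf.csv") -> None:
    """Mirror a per-job perf CSV into the conventional cwd perf.csv."""
    if not os.path.exists(cwd_csv):
        shutil.copy(per_job_csv, cwd_csv)  # first run: copy the whole file
        return
    with open(per_job_csv) as src, open(cwd_csv, "a") as dst:
        next(src)                  # skip the per-job header row
        for row in src:            # append data rows only, keeping the
            dst.write(row)         # cwd file single-headed
```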

Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).

Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).

Co-authored-by: Cursor <cursoragent@cursor.com>
- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi
  launcher, --use-image / --build-on-compute CLI flags, the slurm_multi
  build registry gate, the bash-in-salloc execution path, the
  DeploymentResult.skip_monitoring field, the SlurmNodeSelector
  reservation parameter, the new contract tests + minimal example,
  and the cwd perf.csv aggregation fix.
- cli/commands/build.py: comment near the --use-image Typer option
  flagging that `is_flag=False, flag_value="auto"` is being deprecated
  by Typer, with the planned migration path. Behavior unchanged
  (matches MAD-private PR #186's documented UX).

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd raviguptaamd marked this pull request as ready for review May 11, 2026 00:39
@raviguptaamd raviguptaamd requested a review from leconcio as a code owner May 11, 2026 00:39
Copilot AI review requested due to automatic review settings May 11, 2026 00:39
@raviguptaamd raviguptaamd requested a review from Cemberk as a code owner May 11, 2026 00:39
@raviguptaamd raviguptaamd marked this pull request as draft May 11, 2026 00:41
@raviguptaamd raviguptaamd marked this pull request as ready for review May 11, 2026 00:43

Copilot AI left a comment


Pull request overview

Adds a new additive slurm_multi SLURM launcher path intended for self-managed multi-node workloads (wrapper SBATCH + baremetal model script execution), plus supporting build-time options to make registry/pull flows workable.

Changes:

  • Introduces slurm_multi launcher registration/normalization and SLURM deployment support (including salloc “bash” execution + perf.csv aggregation).
  • Adds madengine build --use-image [IMAGE|auto] and --build-on-compute build modes, plus registry/image propagation for run-phase parallel pulls.
  • Adds contract tests and a minimal example config for slurm_multi.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.

Show a summary per file
- `src/madengine/deployment/common.py`: Registers slurm_multi and normalizes `slurm-multi` -> `slurm_multi`.
- `src/madengine/deployment/slurm.py`: Implements slurm_multi wrapper generation, allocation-aware deploy path, and results collection.
- `src/madengine/deployment/slurm_node_selector.py`: Threads reservation through srun health/cleanup commands.
- `src/madengine/deployment/base.py`: Adds DeploymentResult.skip_monitoring and honors it in execute().
- `src/madengine/execution/container_runner.py`: Adds self-managed launcher path to run scripts on host instead of inside madengine Docker.
- `src/madengine/execution/docker_builder.py`: Propagates registry_image to built_models[*].env_vars.DOCKER_IMAGE_NAME.
- `src/madengine/orchestration/build_orchestrator.py`: Adds --use-image / --build-on-compute flows and slurm_multi registry gating logic.
- `src/madengine/cli/commands/build.py`: Adds CLI flags and mutex validation for the new build modes.
- `src/madengine/cli/validators.py`: Extends validate_additional_context signature to accept use_image.
- `tests/unit/test_slurm_multi.py`: Adds contract tests for launcher registration/aliasing and wrapper env-var exports.
- `examples/slurm-configs/minimal/slurm-multi-minimal.json`: Provides a minimal reference config for slurm_multi.
- `CHANGELOG.md`: Documents new launcher and build/run behaviors.
Comments suppressed due to low confidence (1)

src/madengine/cli/validators.py:321

  • The validate_additional_context docstring says use_image "skips required field validation", but the parameter is currently unused and validation is still always performed via finalize_additional_context_dict(context). Please either implement the documented behavior or update the docstring to avoid misleading callers.
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context



import json
import os
import shutil
Comment on lines +523 to +536

        # Key by model_name (not use_image) so multiple models sharing
        # the same pre-built image are all preserved in the manifest.
        for model in models:
            model_name = model.get("name", "unknown")
            model_distributed = model.get("distributed", {})

            # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
            model_env_vars = model.get("env_vars", {}).copy()
            model_env_vars["DOCKER_IMAGE_NAME"] = use_image

            manifest["built_models"][model_name] = {
                "name": model_name,
                "image": use_image,
                "docker_image": use_image,
Comment on lines +570 to +575

        # Merge model's distributed config from the first model.
        # If multiple models have differing distributed configs, warn — only the first wins here.
        if len(models) > 1:
            distinct_distributed = {tuple(sorted((m.get("distributed") or {}).items())) for m in models}
            if len(distinct_distributed) > 1:
                self.rich_console.print(
Comment on lines +809 to +810
import shutil

Comment on lines +873 to +887

        # Prepare environment
        env = os.environ.copy()
        env.update(run_env)

        # Add model-specific env vars from model_info
        if "env_vars" in model_info and model_info["env_vars"]:
            for key, value in model_info["env_vars"].items():
                env[key] = str(value)
                print(f"  ENV: {key}={value}")

        # Add env vars from additional_context
        if self.additional_context and "env_vars" in self.additional_context:
            for key, value in self.additional_context["env_vars"].items():
                env[key] = str(value)
Comment on lines +462 to +556

            "",
            f"# slurm_multi launcher script for {model_info['name']}",
            f"# Generated by madengine for slurm_multi",
            "",
            "set -e",
            "",
            "# Environment variables",
        ])

        for key, value in env_vars.items():
            script_lines.append(f"export {key}=\"{value}\"")

        script_lines.append("")
        script_lines.extend([
            "echo '=========================================='",
            "echo 'slurm_multi Launcher'",
            "echo '=========================================='",
            f"echo 'Model: {model_info['name']}'",
            f"echo 'Script: {model_script_path}'",
            "echo 'SLURM_JOB_ID:' $SLURM_JOB_ID",
            "echo 'SLURM_NNODES:' $SLURM_NNODES",
            "echo 'SLURM_NODELIST:' $SLURM_NODELIST",
            "echo ''",
        ])

        # Check if image needs parallel pull on all nodes
        # Pull if: image is from registry (contains / or .) and not a local ci-* build
        docker_image = env_vars.get("DOCKER_IMAGE_NAME", "")
        is_registry_image = docker_image and not docker_image.startswith("ci-") and ("/" in docker_image or "." in docker_image)

        if is_registry_image:
            # Add parallel docker pull on all nodes
            # This ensures all nodes have the image before running
            script_lines.extend([
                "",
                "# Pull Docker image in parallel on all nodes",
                "echo '=========================================='",
                "echo 'Pulling Docker image on all nodes in parallel'",
                "echo '=========================================='",
                f"echo 'Image: {docker_image}'",
                "echo ''",
                "",
                f"srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES bash -c \"",
                f"  echo \\\"[\\$(hostname)] Pulling {docker_image}...\\\"",
                f"  docker pull {docker_image}",
                "  PULL_RC=\\$?",
                "  if [ \\$PULL_RC -eq 0 ]; then",
                "    echo \\\"[\\$(hostname)] Pull SUCCESS\\\"",
                "  else",
                "    echo \\\"[\\$(hostname)] Pull FAILED with exit code \\$PULL_RC\\\"",
                "  fi",
                "  exit \\$PULL_RC",
                "\"",
                "PULL_EXIT=$?",
                "",
                "if [ $PULL_EXIT -ne 0 ]; then",
                "  echo 'Docker pull failed on one or more nodes'",
                "  exit $PULL_EXIT",
                "fi",
                "",
                "echo ''",
                "echo 'Docker image pulled on all nodes'",
                "echo ''",
            ])

        # Create completion marker path for robust completion detection.
        # Namespace by SLURM_JOB_ID so concurrent / repeat runs of the same model
        # tag don't collide on each other's marker files. monitor() reconstructs
        # the same path using the deployment_id returned by sbatch.
        completion_marker_dir = self.output_dir.resolve()
        completion_marker_template = (
            completion_marker_dir
            / f"madengine_{model_info['name']}_${{SLURM_JOB_ID:-local}}.complete"
        )

        script_lines.extend([
            "",
            "# Change to script directory",
            f"cd {model_script_path.parent}",
            "",
            "# Run the model script directly on the host",
            f"echo 'Executing: bash {model_script_path.name} {model_args}'",
            f"bash {model_script_path.name} {model_args}",
            "SCRIPT_EXIT_CODE=$?",
            "",
            "echo ''",
            "echo 'Script completed.'",
            "",
            "# Write completion marker for madengine to detect (job-id namespaced)",
            f"echo \"exit_code=$SCRIPT_EXIT_CODE\" > {completion_marker_template}",
            f"echo \"timestamp=$(date -Iseconds)\" >> {completion_marker_template}",
            f"echo \"Completion marker written: {completion_marker_template}\"",
            "",
            "exit $SCRIPT_EXIT_CODE",
        ])
Comment on lines +822 to +829

            script_name = os.path.basename(scripts_arg)
        elif scripts_arg.endswith(".py"):
            script_path = scripts_arg
            script_name = os.path.basename(scripts_arg)
        else:
            # Directory specified - look for run.sh
            script_path = os.path.join(scripts_arg, "run.sh")
            script_name = "run.sh"

        else:
            # Private/internal registry - may not need auth
            self.rich_console.print(f"  Auth: Private registry (auth may not be required)")
            requires_auth = dockerhub_user and dockerhub_password

            use_image=implicit_image,
            manifest_output=manifest_output,
        )
        # No card image (or divergent images across models): fall through to error.
Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
  entry per model, all pointing at the same use_image) so it matches
  built_models and the rest of the codebase's invariant that
  ContainerRunner.run_models_from_manifest() relies on at
  built_models.get(image_name, {}). Without this, --use-image (and the
  implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
  model in the run phase.
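The restored invariant can be sketched with a simplified manifest shape (helper name hypothetical; the real entries carry more fields):

```python
def build_prebuilt_manifest(models: list, use_image: str) -> dict:
    """Key both maps by model_name so their key sets stay identical."""
    manifest = {"built_models": {}, "built_images": {}}
    for model in models:
        name = model.get("name", "unknown")
        # One entry per model, all pointing at the same pre-built image.
        manifest["built_models"][name] = {"name": name, "docker_image": use_image}
        manifest["built_images"][name] = {"image": use_image}
    return manifest
```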

Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
  same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
  The old pattern raised TypeError: unhashable type: 'dict' as soon as a
  model's distributed block contained nested dicts (sglang_disagg /
  vllm_disagg).
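A self-contained illustration of why the old pattern fails on nested dicts and the json.dumps replacement works (function name illustrative):

```python
import json

def distinct_configs(models: list, key: str = "distributed") -> set:
    """Canonical, hashable form of each model's config block."""
    return {
        json.dumps(m.get(key) or {}, sort_keys=True, default=str)
        for m in models
    }
```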

slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
  from the model script terminated the wrapper before SCRIPT_EXIT_CODE
  was captured and the completion marker was written -- failed runs
  looked like hangs to monitor(). Wrap the bash invocation in
  `set +e` / `set -e` so the marker is always written.
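The guarded invocation can be sketched as the wrapper-script lines it emits (helper name hypothetical; the real generator interleaves more echo lines):

```python
def guarded_model_invocation(script_name: str, args: str = "") -> list:
    """Return the wrapper-script lines around the model script call."""
    invoke = f"bash {script_name} {args}".rstrip()
    return [
        "set +e",               # suspend errexit: a failing model script must
                                # not abort the wrapper before the marker write
        invoke,
        "SCRIPT_EXIT_CODE=$?",  # capture the model script's exit code
        "set -e",               # restore errexit for the rest of the wrapper
    ]
```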

Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
  is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
  credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.
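The redaction can be sketched as (helper name hypothetical):

```python
def format_env_log(env_vars: dict) -> list:
    """Log which env vars are set without echoing their values."""
    return [f"  ENV: {key}=<set>" for key in env_vars]
```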

Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
  _execute_build_on_compute (assigned but never read).

Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
  informational only (parameter retained for build.py call-site
  compatibility); the docstring previously claimed it skipped required-
  field validation, which the body never implemented. Tightened the
  docstring to match reality.

New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
  set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
  using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
  for C3 using two sglang_disagg-style cards with differing nested
  dicts in `distributed`; would TypeError under the old code.

Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.

Tests:
- pytest tests/unit/test_slurm_multi.py -v  -> 12 passed (was 9).
- pytest tests/unit -q  -> 436 passed / 2 failed; the 2 failures are
  the same PermissionError environmental baseline that exists on
  develop (test_upload_file_to_mongodb_file_not_found and
  test_validate_additional_context_file_not_found, both raise
  PermissionError on '/nonexistent/file.json' instead of FileNotFound
  on this filesystem). Zero new failures.

Co-authored-by: Cursor <cursoragent@cursor.com>
@raviguptaamd (Author)

Addressed in commit 68d0bf3:

- C1: build_orchestrator.py:13 unused `import shutil`. Removed.
- C2: _execute_with_prebuilt_image manifest schema mismatch. Bug fix: built_images now keyed by model_name (one entry per model, all pointing at the same use_image), restoring the built_images.keys() == built_models.keys() invariant that ContainerRunner.run_models_from_manifest() relies on at line 2041. Without this, --use-image and the implicit slurm_multi DOCKER_IMAGE_NAME gate silently skipped every model in the run phase.
- C3: tuple(sorted(items())) unhashable for nested-dict distributed/slurm. Bug fix: replaced with json.dumps(..., sort_keys=True, default=str) in both spots (otherwise sglang_disagg / vllm_disagg cards triggered TypeError: unhashable type: 'dict').
- C4: container_runner.py:809 unused `import shutil` (in _run_self_managed). Removed.
- C5: _run_self_managed env-var value logging (security). Log ENV: KEY=<set> (keys only) so HF_TOKEN / MAD_DOCKERHUB_PASSWORD / etc. carried in model_info.env_vars don't leak into the run log.
- C6: `set -e` + SCRIPT_EXIT_CODE=$? interaction in the slurm_multi wrapper. Bug fix: wrap the `bash <model_script>` call in set +e / set -e so a non-zero model-script exit doesn't terminate the wrapper before the completion marker is written (without this, failures looked like hangs to monitor()).
- C7: unused local script_name in _run_self_managed. Removed (only script_path is consumed by the function).
- C8: unused requires_auth in _execute_build_on_compute. Removed (assigned twice, never read).
- C10: validators.py:309 docstring lie about use_image skipping validation. Tightened to reflect actual behavior (parameter is informational only, retained for build.py call-site compatibility).

New contract tests in tests/unit/test_slurm_multi.py lock in the C2, C3, and C6 invariants:

  • test_built_images_and_models_share_keys -- 2-model fixture, asserts set(built_images) == set(built_models) and that every entry's docker_image == use_image
  • test_multi_model_nested_dict_distributed_does_not_raise -- 2 sglang_disagg-style cards with differing nested distributed.sglang_disagg; would TypeError under the old code
  • test_wrapper_disables_set_e_around_model_script -- asserts set +e / SCRIPT_EXIT_CODE=$? / set -e ordering around the bash invocation

Skipped: C9 (broader BuildOrchestrator branch tests for --use-image / --build-on-compute / registry gating). The slurm_multi-specific contract suite (12 tests, all passing) plus the multi-node cluster smoke (3-node SGLang Disagg llama-3.1-8b end-to-end on Distributed_Inference_CI, RC=0, perf table populated) already cover the slurm_multi surface end-to-end. Broader orchestrator-branch coverage for the generic --use-image / --build-on-compute paths is a worthwhile follow-up but out of scope for this minimal-additive PR.

Test results

  • pytest tests/unit/test_slurm_multi.py -v -- 12 passed (was 9; +3 new contracts)
  • pytest tests/unit -q -- 436 passed / 2 failed; the 2 failures are the same PermissionError environmental baseline that exists on develop (test_upload_file_to_mongodb_file_not_found, test_validate_additional_context_file_not_found; both raise PermissionError on '/nonexistent/file.json' instead of FileNotFoundError on this filesystem). Zero new failures.

