feat(slurm): add slurm_multi multi-node SLURM launcher (minimal-additive port from PR #86) #124
raviguptaamd wants to merge 8 commits into
Conversation
Introduce the `slurm_multi` self-managed multi-node SLURM launcher, fully additive vs develop. Verbatim minimal-delta port from PR ROCm#86 limited to the slurm_multi surface; no develop functionality removed.

- common.py: register `slurm_multi` in VALID_LAUNCHERS; add hyphen alias normalization (`slurm-multi` -> `slurm_multi`). All other develop content (functools.lru_cache, _ROCPROF_FAMILY_TOOL_NAMES, tools_include_rocprof_family, full configure_multi_node_profiling rocprof-family handling, sglang-disagg in VALID_LAUNCHERS) preserved.
- slurm.py: SLURM_MULTI_ALIASES; reservation field + allocation detection in __init__; helper methods (_get_allocation_node_count, _validate_allocation_nodes); slurm_multi early dispatch in prepare(); _prepare_slurm_multi_script (sets `_is_slurm_multi = True` for the deploy() gate); _generate_slurm_multi_command; new elif in _generate_launcher_command (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in deploy() (slurm_multi only; non-slurm_multi launchers continue to use sbatch); _run_inside_existing_allocation; _collect_slurm_multi_results + early dispatch in collect_results(); reservation= kwarg threaded to the existing SlurmNodeSelector(...) call.
- slurm_node_selector.py: optional `reservation` parameter so the reservation propagates to srun health/cleanup commands.

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions from develop.

Co-authored-by: Cursor <cursoragent@cursor.com>
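The alias normalization described above can be sketched as follows; the function name and shape are illustrative, not the exact madengine implementation:

```python
# Hedged sketch of common.py's alias handling: hyphenated form maps to
# the canonical launcher key, everything else passes through unchanged.
SLURM_MULTI_ALIASES = {"slurm_multi", "slurm-multi"}


def normalize_launcher(name: str) -> str:
    """Map the hyphen alias onto the canonical `slurm_multi` key."""
    if name in SLURM_MULTI_ALIASES:
        return "slurm_multi"
    return name


assert normalize_launcher("slurm-multi") == "slurm_multi"
```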
…or slurm_multi

Wire the build/run orchestration paths needed by the slurm_multi launcher. All existing flows for non-slurm_multi launchers continue to work unchanged; new behavior is gated on `distributed.launcher == "slurm_multi"` (or `slurm-multi`).

- execution/container_runner.py: SLURM_MULTI_ALIASES and SELF_MANAGED_LAUNCHERS constants; `_run_self_managed` method (runs the model script directly on baremetal so srun/scontrol work inside it; propagates env from model_info.env_vars and additional_context.env_vars); self-managed launcher early dispatch in run_container() that fires only when the launcher matches SELF_MANAGED_LAUNCHERS; `.slurm` extension recognized alongside `.sh` in script-extension detection (T-C2). Develop helpers preserved verbatim (_print_run_env_table, _docker_image_exists_locally, _bash_quote_path, _cp_model_dir_file_to_cwd_cmd, _resolve_docker_image, _ENV_KEY_RE env-var validation, MAD_OUTPUT_CSV passthrough, PRIMUS_CONFIG_PATH/PRIMUS_CLI_EXTRA in env-var allowlist, PERFORMANCE_LOG_PATTERN regex, CSV fieldname stripping).
- execution/docker_builder.py: after build, write the resolved registry image into built_models[*].env_vars.DOCKER_IMAGE_NAME so the run-time `srun docker pull "\$DOCKER_IMAGE_NAME"` on each compute node finds the registry image. Pure addition (one block before manifest assembly).
- orchestration/build_orchestrator.py: import shutil; capture user-supplied slurm keys (_original_user_slurm_keys); add `use_image: Optional[str]` and `build_on_compute: bool` to execute(); mutex validation between use_image and build_on_compute; use_image dispatch (incl. an "auto" sentinel that resolves from the model card via _resolve_image_from_model_card); build_on_compute dispatch (registry required, validated in _execute_build_on_compute); slurm_multi registry gate that runs DiscoverModels early to detect slurm_multi tags and either raises a structured ConfigurationError with helpful suggestions or auto-uses DOCKER_IMAGE_NAME from the model card env_vars (implicit --use-image fallback); _execute_with_prebuilt_image; _resolve_image_from_model_card; _execute_build_on_compute (sbatch wrapper that builds + pushes from one compute node, polls completion, marks `built_on_compute: true` in the manifest). Develop's load_credentials top-level usage, detect_local_gpu_arch parameter, and resolved_arch print all preserved unchanged.

Note: PR ROCm#86's universal merge of model_info.env_vars into docker_env_vars in run_container() was DELIBERATELY DROPPED -- it is not slurm_multi-essential (slurm_multi gets env_vars via _prepare_slurm_multi_script's wrapper export block and via _run_self_managed's own merge). Applying it universally would silently change behavior for non-slurm_multi launchers and has an incorrect precedence vs additional_context.env_vars; it can be revisited as a separate PR if needed.

Source: ROCm/madengine PR ROCm#86 (slurm_multi surface only). 0 deletions from develop except the agreed `.sh` -> `.sh|.slurm` line modification (T-C2: harmless extension to script-extension recognition).

Co-authored-by: Cursor <cursoragent@cursor.com>
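The docker_builder.py change amounts to a small post-build propagation step; a hedged sketch (the `built_models` shape and helper name here are assumptions, not verbatim madengine code):

```python
# Sketch: after a successful build+push, stamp the resolved registry image
# into each built model's env_vars so the run-phase
# `srun docker pull "$DOCKER_IMAGE_NAME"` on every compute node finds it.
def propagate_image_name(built_models: dict, registry_image: str) -> None:
    for model_entry in built_models.values():
        env_vars = model_entry.setdefault("env_vars", {})
        env_vars["DOCKER_IMAGE_NAME"] = registry_image


manifest_models = {
    "model_a": {"env_vars": {"HF_HOME": "/cache"}},
    "model_b": {},  # no env_vars yet; setdefault creates the dict
}
propagate_image_name(manifest_models, "registry.example.com/team/img:tag")
```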
Surface the orchestrator paths from the previous commit through the `madengine build` CLI:

- cli/commands/build.py: add `--use-image [IMAGE | auto]` (skip the local Docker build and use the named image; `auto` resolves from the model card's DOCKER_IMAGE_NAME); add `--build-on-compute` (build on a SLURM compute node and push to the configured registry; manifest records `built_on_compute: true`); 3 mutex validation blocks (--use-image vs --registry, --use-image vs --build-on-compute, --build-on-compute requires --registry); pass both new kwargs through to validate_additional_context, create_args_namespace, and BuildOrchestrator.execute().
- cli/validators.py: add `use_image: Optional[str] = None` parameter (signature + docstring only; body unchanged) so build.py's call site does not TypeError. PR ROCm#86's other validators.py rewrites (drops of ROCM_PATH / MAD_ROCM_PATH validation, dockerfile_matched fallback removal) are NOT inherited; develop validation is preserved.

Required for the slurm_multi launcher's documented build options in MAD-private #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
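The three mutual-exclusion checks can be sketched as plain Python, stripped of Typer specifics (function name and the use of ValueError are stand-ins; the real code raises through the CLI's error path):

```python
# Hedged sketch of the three mutex validation blocks described above.
from typing import Optional


def validate_build_flags(
    use_image: Optional[str],
    build_on_compute: bool,
    registry: Optional[str],
) -> None:
    if use_image and registry:
        raise ValueError("--use-image and --registry are mutually exclusive")
    if use_image and build_on_compute:
        raise ValueError("--use-image and --build-on-compute are mutually exclusive")
    if build_on_compute and not registry:
        raise ValueError("--build-on-compute requires --registry")


validate_build_flags(None, False, None)  # plain local build: no error
```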
- tests/unit/test_slurm_multi.py: nine tests across three classes
- TestSlurmMultiRegistration: `slurm_multi` is in VALID_LAUNCHERS;
SLURM_MULTI_ALIASES contains both canonical and hyphen forms.
- TestNormalizeSlurmMultiAliases: canonical and hyphen alias both
normalize to `slurm_multi`; unknown values still fall through to
the existing `slurm -> docker` default.
- TestSlurmMultiPrepareScript: end-to-end fixture that mirrors
MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry
verbatim. Asserts that prepare() takes the slurm_multi early-
dispatch path, sets `_is_slurm_multi = True` for the deploy()
gate, emits an SBATCH header that reflects the model card slurm
block, and emits every model_info.env_vars key as
`export KEY="value"` in the wrapper script. This is the explicit
safety net for the Q2 decision to drop PR ROCm#86's universal
docker_env_vars merge -- if slurm_multi ever stops propagating
those env_vars to the workload, this test fails loudly.
Existing tests in tests/unit/test_deployment.py and
tests/unit/test_container_runner.py are NOT modified.
- examples/slurm-configs/minimal/slurm-multi-minimal.json:
small reference config alongside develop's existing
*-minimal.json examples (`-f` add because *.json is gitignored,
matching how existing files in this dir were originally added).
Co-authored-by: Cursor <cursoragent@cursor.com>
…lti bash branch

The earlier feat(deployment) commit ported `_run_inside_existing_allocation` verbatim from PR ROCm#86 (used by the slurm_multi bash-in-salloc branch). That method constructs `DeploymentResult(..., skip_monitoring=True)` to signal that the script ran synchronously and the monitor phase should be skipped. Develop's `DeploymentResult` dataclass did not have that field, so the construction raised `TypeError` at runtime, the failure was swallowed by the orchestrator, and `madengine run` exited 0 even when the wrapper script failed.

Two minimal additive edits to deployment/base.py:

- DeploymentResult: add `skip_monitoring: bool = False` field.
- BaseDeployment.execute(): change the monitor guard from `if self.config.monitor:` to `if self.config.monitor and not result.skip_monitoring:` so the bash branch can correctly bypass the SLURM job poll (there is no SLURM job to poll; the script ran inline).

Behavior for non-slurm_multi launchers is unchanged: they never construct DeploymentResult with skip_monitoring=True, the field defaults to False, and the monitor() call still fires as before.

Discovered during R5.1b cluster smoke test of `pyt_sglang_disagg_qwen3-32b_short` from MAD-private PR #186.

Co-authored-by: Cursor <cursoragent@cursor.com>
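The two edits above reduce to a defaulted dataclass field plus one extra condition in the guard; a minimal sketch (the surrounding class structure is simplified, and `monitor_should_run` is a stand-in for the inline guard in `BaseDeployment.execute()`):

```python
# Sketch of the additive DeploymentResult field and the widened monitor guard.
from dataclasses import dataclass


@dataclass
class DeploymentResult:
    success: bool
    skip_monitoring: bool = False  # new field; default keeps old callers working


def monitor_should_run(monitor_enabled: bool, result: DeploymentResult) -> bool:
    # Old guard: `if self.config.monitor:`. The fix adds the skip flag so a
    # synchronously-executed bash branch can bypass the SLURM job poll.
    return monitor_enabled and not result.skip_monitoring
```

Because the field defaults to False, every existing `DeploymentResult(...)` construction keeps its old behavior, which is what makes the change purely additive.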
…eporter
The slurm_multi model script (e.g., MAD-private's run_xPyD_models.slurm)
writes its perf CSV to /shared_inference/$USER/$JOBID/perf.csv.
`_collect_slurm_multi_results` correctly finds and ingests that CSV via
`_collect_results_parse_perf_csv` (results['successful_runs'] is
populated, manifest exit codes are right). However the CLI's
post-run reporter (`display_performance_table` in cli/utils.py) reads
from cwd 'perf.csv' by default, so users saw a cosmetic
"Performance CSV not found: perf.csv" warning even on a successful
slurm_multi run. Local and classic-SLURM flows already leave a
cumulative perf.csv in cwd via update_perf_csv(); slurm_multi did not.
Fix: inside `_collect_slurm_multi_results`, after the per-job CSV is
located and parsed, also write its rows into the conventional cwd
perf.csv path:
* if cwd 'perf.csv' is absent: shutil.copy() the per-job file
* if it exists (from previous runs): append data rows (skip the
per-job header so the cwd file stays single-headed)
Original per-job CSV at /shared_inference/$USER/$JOBID/perf.csv is
not modified or deleted. Wrapped in try/except so any I/O failure
degrades to a yellow warning rather than failing the run.
Affects only `_collect_slurm_multi_results`, which only runs for the
slurm_multi launcher dispatch added earlier on this branch. Zero
effect on non-slurm_multi launchers (they still take develop's
classic collect_results path which writes cwd perf.csv via
update_perf_csv).
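The copy-or-append logic above can be sketched as a small helper; the function name and paths are illustrative, not the exact madengine code, and the real version emits a rich-console warning rather than a print:

```python
# Hedged sketch of the cwd perf.csv aggregation: copy on first run, append
# data rows (skipping the per-job header) on subsequent runs, and degrade
# to a warning on any I/O failure rather than failing the run.
import shutil
from pathlib import Path


def aggregate_perf_csv(per_job_csv: Path, cwd_csv: Path) -> None:
    try:
        if not cwd_csv.exists():
            shutil.copy(per_job_csv, cwd_csv)        # first run: straight copy
        else:
            rows = per_job_csv.read_text().splitlines(keepends=True)
            with cwd_csv.open("a") as out:
                out.writelines(rows[1:])             # skip per-job header
    except OSError as exc:
        print(f"Warning: could not update {cwd_csv}: {exc}")
```

The per-job source file is never modified, matching the commit's guarantee that the original CSV under /shared_inference stays intact.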
Validated end-to-end: Llama-3.1-8B smoke on 3 reservation nodes
(Distributed_Inference_CI), bash branch in salloc, parallel docker
pull on all nodes, proxy/prefill/decode came up, benchmark
concurrency=8 / 32-32 ISL/OSL ran to completion, run exit 0, cwd
perf.csv now contains the per-job row (THROUGHPUT/TTFT/ITL columns
match per-job source).
Co-authored-by: Cursor <cursoragent@cursor.com>
- CHANGELOG.md: new [Unreleased] section documenting the slurm_multi launcher, --use-image / --build-on-compute CLI flags, the slurm_multi build registry gate, the bash-in-salloc execution path, the DeploymentResult.skip_monitoring field, the SlurmNodeSelector reservation parameter, the new contract tests + minimal example, and the cwd perf.csv aggregation fix.
- cli/commands/build.py: comment near the --use-image Typer option flagging that `is_flag=False, flag_value="auto"` is being deprecated by Typer, with the planned migration path. Behavior unchanged (matches MAD-private PR #186's documented UX).

Co-authored-by: Cursor <cursoragent@cursor.com>
Pull request overview
Adds a new additive slurm_multi SLURM launcher path intended for self-managed multi-node workloads (wrapper SBATCH + baremetal model script execution), plus supporting build-time options to make registry/pull flows workable.
Changes:
- Introduces `slurm_multi` launcher registration/normalization and SLURM deployment support (including salloc "bash" execution + perf.csv aggregation).
- Adds `madengine build --use-image [IMAGE|auto]` and `--build-on-compute` build modes, plus registry/image propagation for run-phase parallel pulls.
- Adds contract tests and a minimal example config for `slurm_multi`.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/madengine/deployment/common.py | Registers slurm_multi and normalizes slurm-multi → slurm_multi. |
| src/madengine/deployment/slurm.py | Implements slurm_multi wrapper generation, allocation-aware deploy path, and results collection. |
| src/madengine/deployment/slurm_node_selector.py | Threads reservation through srun health/cleanup commands. |
| src/madengine/deployment/base.py | Adds DeploymentResult.skip_monitoring and honors it in execute(). |
| src/madengine/execution/container_runner.py | Adds self-managed launcher path to run scripts on host instead of inside madengine Docker. |
| src/madengine/execution/docker_builder.py | Propagates registry_image to built_models[*].env_vars.DOCKER_IMAGE_NAME. |
| src/madengine/orchestration/build_orchestrator.py | Adds --use-image / --build-on-compute flows and slurm_multi registry gating logic. |
| src/madengine/cli/commands/build.py | Adds CLI flags and mutex validation for the new build modes. |
| src/madengine/cli/validators.py | Extends validate_additional_context signature to accept use_image. |
| tests/unit/test_slurm_multi.py | Adds contract tests for launcher registration/aliasing and wrapper env-var exports. |
| examples/slurm-configs/minimal/slurm-multi-minimal.json | Provides a minimal reference config for slurm_multi. |
| CHANGELOG.md | Documents new launcher and build/run behaviors. |
Comments suppressed due to low confidence (1)
src/madengine/cli/validators.py:321
- The `validate_additional_context` docstring says `use_image` "skips required field validation", but the parameter is currently unused and validation is still always performed via `finalize_additional_context_dict(context)`. Please either implement the documented behavior or update the docstring to avoid misleading callers.
```python
def validate_additional_context(
    additional_context: str,
    additional_context_file: Optional[str] = None,
    use_image: Optional[str] = None,
) -> Dict[str, Any]:
    """
    Validate and parse additional context.

    Args:
        additional_context: JSON string containing additional context
        additional_context_file: Optional file containing additional context
        use_image: Optional pre-built image to use (skips required field validation)

    Returns:
        Dict containing parsed additional context

    Raises:
        typer.Exit: If validation fails
    """
    context, _ = merge_additional_context_from_sources(
        additional_context, additional_context_file
    )
    finalize_additional_context_dict(context)
    return context
```
```python
import json
import os
import shutil
```

```python
# Key by model_name (not use_image) so multiple models sharing
# the same pre-built image are all preserved in the manifest.
for model in models:
    model_name = model.get("name", "unknown")
    model_distributed = model.get("distributed", {})

    # Merge DOCKER_IMAGE_NAME into env_vars for parallel pull in run phase
    model_env_vars = model.get("env_vars", {}).copy()
    model_env_vars["DOCKER_IMAGE_NAME"] = use_image

    manifest["built_models"][model_name] = {
        "name": model_name,
        "image": use_image,
        "docker_image": use_image,
```

```python
# Merge model's distributed config from the first model.
# If multiple models have differing distributed configs, warn — only the first wins here.
if len(models) > 1:
    distinct_distributed = {tuple(sorted((m.get("distributed") or {}).items())) for m in models}
    if len(distinct_distributed) > 1:
        self.rich_console.print(
```
```python
import shutil

# Prepare environment
env = os.environ.copy()
env.update(run_env)

# Add model-specific env vars from model_info
if "env_vars" in model_info and model_info["env_vars"]:
    for key, value in model_info["env_vars"].items():
        env[key] = str(value)
        print(f"  ENV: {key}={value}")

# Add env vars from additional_context
if self.additional_context and "env_vars" in self.additional_context:
    for key, value in self.additional_context["env_vars"].items():
        env[key] = str(value)
```
| "", | ||
| f"# slurm_multi launcher script for {model_info['name']}", | ||
| f"# Generated by madengine for slurm_multi", | ||
| "", | ||
| "set -e", | ||
| "", | ||
| "# Environment variables", | ||
| ]) | ||
|
|
||
| for key, value in env_vars.items(): | ||
| script_lines.append(f"export {key}=\"{value}\"") | ||
|
|
||
| script_lines.append("") | ||
| script_lines.extend([ | ||
| "echo '=========================================='", | ||
| "echo 'slurm_multi Launcher'", | ||
| "echo '=========================================='", | ||
| f"echo 'Model: {model_info['name']}'", | ||
| f"echo 'Script: {model_script_path}'", | ||
| "echo 'SLURM_JOB_ID:' $SLURM_JOB_ID", | ||
| "echo 'SLURM_NNODES:' $SLURM_NNODES", | ||
| "echo 'SLURM_NODELIST:' $SLURM_NODELIST", | ||
| "echo ''", | ||
| ]) | ||
|
|
||
| # Check if image needs parallel pull on all nodes | ||
| # Pull if: image is from registry (contains / or .) and not a local ci-* build | ||
| docker_image = env_vars.get("DOCKER_IMAGE_NAME", "") | ||
| is_registry_image = docker_image and not docker_image.startswith("ci-") and ("/" in docker_image or "." in docker_image) | ||
|
|
||
| if is_registry_image: | ||
| # Add parallel docker pull on all nodes | ||
| # This ensures all nodes have the image before running | ||
| script_lines.extend([ | ||
| "", | ||
| "# Pull Docker image in parallel on all nodes", | ||
| "echo '=========================================='", | ||
| "echo 'Pulling Docker image on all nodes in parallel'", | ||
| "echo '=========================================='", | ||
| f"echo 'Image: {docker_image}'", | ||
| "echo ''", | ||
| "", | ||
| f"srun --nodes=$SLURM_NNODES --ntasks=$SLURM_NNODES bash -c \"", | ||
| f" echo \\\"[\\$(hostname)] Pulling {docker_image}...\\\"", | ||
| f" docker pull {docker_image}", | ||
| " PULL_RC=\\$?", | ||
| " if [ \\$PULL_RC -eq 0 ]; then", | ||
| " echo \\\"[\\$(hostname)] Pull SUCCESS\\\"", | ||
| " else", | ||
| " echo \\\"[\\$(hostname)] Pull FAILED with exit code \\$PULL_RC\\\"", | ||
| " fi", | ||
| " exit \\$PULL_RC", | ||
| "\"", | ||
| "PULL_EXIT=$?", | ||
| "", | ||
| "if [ $PULL_EXIT -ne 0 ]; then", | ||
| " echo 'Docker pull failed on one or more nodes'", | ||
| " exit $PULL_EXIT", | ||
| "fi", | ||
| "", | ||
| "echo ''", | ||
| "echo 'Docker image pulled on all nodes'", | ||
| "echo ''", | ||
| ]) | ||
|
|
||
| # Create completion marker path for robust completion detection. | ||
| # Namespace by SLURM_JOB_ID so concurrent / repeat runs of the same model | ||
| # tag don't collide on each other's marker files. monitor() reconstructs | ||
| # the same path using the deployment_id returned by sbatch. | ||
| completion_marker_dir = self.output_dir.resolve() | ||
| completion_marker_template = ( | ||
| completion_marker_dir | ||
| / f"madengine_{model_info['name']}_${{SLURM_JOB_ID:-local}}.complete" | ||
| ) | ||
|
|
||
| script_lines.extend([ | ||
| "", | ||
| "# Change to script directory", | ||
| f"cd {model_script_path.parent}", | ||
| "", | ||
| "# Run the model script directly on the host", | ||
| f"echo 'Executing: bash {model_script_path.name} {model_args}'", | ||
| f"bash {model_script_path.name} {model_args}", | ||
| "SCRIPT_EXIT_CODE=$?", | ||
| "", | ||
| "echo ''", | ||
| "echo 'Script completed.'", | ||
| "", | ||
| "# Write completion marker for madengine to detect (job-id namespaced)", | ||
| f"echo \"exit_code=$SCRIPT_EXIT_CODE\" > {completion_marker_template}", | ||
| f"echo \"timestamp=$(date -Iseconds)\" >> {completion_marker_template}", | ||
| f"echo \"Completion marker written: {completion_marker_template}\"", | ||
| "", | ||
| "exit $SCRIPT_EXIT_CODE", | ||
| ]) |
```python
    script_name = os.path.basename(scripts_arg)
elif scripts_arg.endswith(".py"):
    script_path = scripts_arg
    script_name = os.path.basename(scripts_arg)
else:
    # Directory specified - look for run.sh
    script_path = os.path.join(scripts_arg, "run.sh")
    script_name = "run.sh"
```
```python
else:
    # Private/internal registry - may not need auth
    self.rich_console.print(f"  Auth: Private registry (auth may not be required)")
    requires_auth = dockerhub_user and dockerhub_password
```
```python
    use_image=implicit_image,
    manifest_output=manifest_output,
)
# No card image (or divergent images across models): fall through to error.
```
Manifest-shape correctness (Copilot C2):
- _execute_with_prebuilt_image now keys built_images by model_name (one
entry per model, all pointing at the same use_image) so it matches
built_models and the rest of the codebase's invariant that
ContainerRunner.run_models_from_manifest() relies on at
built_models.get(image_name, {}). Without this, --use-image (and the
implicit slurm_multi DOCKER_IMAGE_NAME gate) silently skipped every
model in the run phase.
Multi-model dedup TypeError fix (Copilot C3):
- Replace tuple(sorted((m.get('distributed') or {}).items())) and the
same pattern for slurm with json.dumps(..., sort_keys=True, default=str).
The old pattern raised TypeError: unhashable type: 'dict' as soon as a
model's distributed block contained nested dicts (sglang_disagg /
vllm_disagg).
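The failure mode is easy to reproduce in isolation: `sorted()` succeeds (string keys decide the ordering), but hashing the resulting tuple touches its dict values, while a canonical JSON string is always hashable. A minimal demonstration with illustrative `distributed` blocks:

```python
# Why the C3 switch works: the old tuple-of-items key is unhashable once a
# `distributed` block contains nested dicts; json.dumps(..., sort_keys=True)
# gives a stable, hashable dedup key.
import json

blocks = [
    {"launcher": "slurm_multi", "sglang_disagg": {"prefill": 2, "decode": 1}},
    {"launcher": "slurm_multi", "sglang_disagg": {"prefill": 2, "decode": 1}},
]

try:
    {tuple(sorted(b.items())) for b in blocks}   # old pattern
    old_pattern_ok = True
except TypeError:                                # nested dict is unhashable
    old_pattern_ok = False

distinct = {json.dumps(b, sort_keys=True, default=str) for b in blocks}
assert old_pattern_ok is False and len(distinct) == 1
```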
slurm_multi wrapper completion-marker on failure (Copilot C6):
- The wrapper script enables `set -e` at the top, so a non-zero exit
from the model script terminated the wrapper before SCRIPT_EXIT_CODE
was captured and the completion marker was written -- failed runs
looked like hangs to monitor(). Wrap the bash invocation in
`set +e` / `set -e` so the marker is always written.
Self-managed launcher hygiene (Copilot C4, C5, C7):
- Drop unused `import shutil` inside _run_self_managed.
- Drop unused script_name local in _run_self_managed (only script_path
is consumed by the function).
- Redact env_vars values in the run log (`ENV: KEY=<set>`) so model-card
credentials (HF_TOKEN, MAD_DOCKERHUB_PASSWORD, etc.) don't leak.
Build orchestrator hygiene (Copilot C1, C8):
- Drop top-level `import shutil` (unused).
- Drop the two `requires_auth = ...` assignments in
_execute_build_on_compute (assigned but never read).
Validators docstring honesty (Copilot C10):
- validate_additional_context's `use_image` parameter is currently
informational only (parameter retained for build.py call-site
compatibility); the docstring previously claimed it skipped required-
field validation, which the body never implemented. Tightened the
docstring to match reality.
New contract tests (tests/unit/test_slurm_multi.py):
- test_wrapper_disables_set_e_around_model_script: locks in C6's
set +e/SCRIPT_EXIT_CODE/set -e ordering around the bash invocation.
- test_built_images_and_models_share_keys: locks in C2's invariant
using a fake 2-model fixture and a stubbed BuildOrchestrator.
- test_multi_model_nested_dict_distributed_does_not_raise: regression
for C3 using two sglang_disagg-style cards with differing nested
dicts in `distributed`; would TypeError under the old code.
Skipped: Copilot C9 (broader BuildOrchestrator branch tests for
--use-image / --build-on-compute / registry-gating). The slurm_multi
contract suite + multi-node cluster smoke already cover the slurm_multi
surface end-to-end; broader orchestrator-branch coverage is a worthwhile
follow-up but out of scope for this minimal-additive PR.
Tests:
- pytest tests/unit/test_slurm_multi.py -v -> 12 passed (was 9).
- pytest tests/unit -q -> 436 passed / 2 failed; the 2 failures are
the same PermissionError environmental baseline that exists on
develop (test_upload_file_to_mongodb_file_not_found and
test_validate_additional_context_file_not_found, both raise
PermissionError on '/nonexistent/file.json' instead of FileNotFound
on this filesystem). Zero new failures.
Co-authored-by: Cursor <cursoragent@cursor.com>
Summary
This PR adds a new SLURM launcher, `slurm_multi`, for self-managed multi-node workloads (proxy + prefill + decode topologies, e.g., SGLang Disaggregated). The discipline for this branch is strictly additive: no develop functionality is removed unless absolutely required for slurm_multi to work, and every line vs develop is justified by one of the contract points below.
Diff size vs develop
The handful of deletions are line-modifications at agreed touchpoints (e.g., adding a new `elif` branch to dispatch, extending an existing `if scripts_arg.endswith(".sh")` to also accept `.slurm`).

What landed
- src/madengine/deployment/common.py: `slurm_multi` in `VALID_LAUNCHERS`; hyphen alias normalization. All other develop content (`functools.lru_cache`, `tools_include_rocprof_family`, `_ROCPROF_FAMILY_TOOL_NAMES`, `configure_multi_node_profiling`, `sglang-disagg` membership) preserved.
- src/madengine/deployment/slurm.py: `SLURM_MULTI_ALIASES`; allocation detection in `__init__`; allocation helper methods; `slurm_multi` early dispatch in `prepare()`; `_prepare_slurm_multi_script` (sets `_is_slurm_multi=True` for the deploy gate); `_generate_slurm_multi_command`; new `elif` in `_generate_launcher_command` (existing primus/sglang-disagg dispatch preserved); gated bash-in-salloc branch in `deploy()` (slurm_multi only); `_run_inside_existing_allocation`; `_collect_slurm_multi_results` + early dispatch in `collect_results()`; `reservation=` kwarg threaded to the existing `SlurmNodeSelector(...)`; cwd `perf.csv` aggregation so the dashboard reporter finds slurm_multi runs.
- src/madengine/deployment/slurm_node_selector.py: `reservation` parameter so the reservation propagates to srun health/cleanup commands.
- src/madengine/deployment/base.py: add `skip_monitoring: bool = False` to the `DeploymentResult` dataclass; honor it in `BaseDeployment.execute()` so the slurm_multi bash branch can correctly bypass the SLURM monitor poll.
- src/madengine/execution/container_runner.py: `SLURM_MULTI_ALIASES` / `SELF_MANAGED_LAUNCHERS`; `_run_self_managed` (runs the model script directly on baremetal so srun/scontrol work; propagates env from `model_info.env_vars` and `additional_context.env_vars`); self-managed launcher early dispatch in `run_container()`; `.slurm` extension recognized alongside `.sh`. Develop helpers preserved verbatim (`_print_run_env_table`, `_resolve_docker_image`, `_ENV_KEY_RE`, `_bash_quote_path`, `_cp_model_dir_file_to_cwd_cmd`, `MAD_OUTPUT_CSV` passthrough, `PRIMUS_*` env-var allowlist, `PERFORMANCE_LOG_PATTERN` regex, CSV fieldname stripping).
- src/madengine/execution/docker_builder.py: write the resolved registry image into `built_models[*].env_vars.DOCKER_IMAGE_NAME` so the run-time `srun docker pull "$DOCKER_IMAGE_NAME"` on each compute node finds the registry image.
- src/madengine/orchestration/build_orchestrator.py: `use_image` and `build_on_compute` parameters on `execute()`; mutex validation; use_image dispatch (incl. `auto` sentinel); build_on_compute dispatch; slurm_multi registry gate; `_execute_with_prebuilt_image`; `_resolve_image_from_model_card`; `_execute_build_on_compute` (sbatch wrapper that builds + pushes from one compute node, polls completion, marks `built_on_compute: true` in the manifest). Develop's `load_credentials` top-level usage, `detect_local_gpu_arch` parameter, and `MAD_SYSTEM_GPU_ARCHITECTURE resolved` print all preserved.
- src/madengine/cli/commands/build.py: `--use-image [IMAGE | auto]` and `--build-on-compute` flags + 3 mutex validation blocks; pass through to `validate_additional_context`, `create_args_namespace`, `BuildOrchestrator.execute()`. Forward-compat TODO comment for the Typer `is_flag=False, flag_value="auto"` deprecation.
- src/madengine/cli/validators.py: `use_image: Optional[str] = None` parameter (signature + docstring only; body unchanged) so build.py's call site does not TypeError.
- tests/unit/test_slurm_multi.py: (1) `slurm_multi` registered in `VALID_LAUNCHERS`; (2) hyphen alias normalizes; (3) wrapper SBATCH script exports every `model_info.env_vars` key. Fixture mirrors MAD-private PR #186's `pyt_sglang_disagg_qwen3-32b_short` entry verbatim.
- examples/slurm-configs/minimal/slurm-multi-minimal.json: minimal reference config alongside develop's existing `*-minimal.json` examples.
- CHANGELOG.md: `[Unreleased]` section documenting all of the above.