Updates to slurm launcher by raviguptaamd · Pull Request #86 · ROCm/madengine

raviguptaamd · 2026-03-12T20:19:59Z

Summary

Adds a slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from madengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs `develop`)

This PR is now focused strictly on slurm_multi and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against develop) and will rebase after this PR lands.

Layer	Highlights
Deployment	`deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}`, `templates/slurm/job.sh.j2`
Orchestration / Build / Run	`orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py` — manifest-driven `--build-on-compute` / `--use-image`, deep-merged `deployment_config`, registry validation
Execution	`execution/{container_runner, container_runner_helpers, docker_builder}.py`
CLI	`cli/{app, commands/run, commands/build, constants, utils, validators}.py`
Core / Utils / Reporting / DB / Scripts	shared infra used by the new path: `core/`, `utils/`, `reporting/`, `database/mongodb.py`, `scripts/common/`
Examples / Docs / Tests	`examples/slurm-configs/`, `docs/launchers.md`, `docs/deployment.md`, `docs/configuration.md`, `tests/{unit,integration,e2e,fixtures}/`

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to origin/develop byte-for-byte and tightened 4 more in-scope code paths).

Commit shape

Logical commits on top of develop:

deployment: add slurm_multi launcher, node selector, SBATCH template (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
orchestration/execution/cli: wire slurm_multi end-to-end (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
examples/docs/tests: slurm_multi configs, README, and test coverage (50e8895) — operator-facing docs, examples renamed sglang-disagg-* → slurm-multi-* (one launcher drives multiple flows), and test additions.
PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on Distributed_Inference_CI).
slurm: namespace completion marker by SLURM_JOB_ID (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
Revert out-of-scope changes flagged by Copilot Round-2 review (9109f52) — 11 files reverted to origin/develop byte-for-byte; surgical reverts in cli/utils.py and execution/docker_builder.py keep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAME propagation for parallel pull, build_gpu_arch_display helper, Workload column rename).
Address Copilot Round-2 slurm_multi-related findings (63a2eae) — 6 real fixes: module-level recursive _deep_merge used on the manifest path; manifest.context.docker_env_vars restored into runtime context; RunOrchestrator._load_credentials delegated to madengine.core.auth.load_credentials (object-validates credential.json); configure_multi_node_profiling only filters rocprof when rocprofv3 is missing (keeps rccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens); "megatron" and "primus" added to VALID_LAUNCHERS.
docs: fix broken Next-Steps links and example paths (0156d33) — distributed-launchers.md → launchers.md; user-guide.md → usage.md; how-to-quick-start.md → cli-reference.md; dropped K8s slurm-multi-*.json paths that don't exist.
Address Copilot Round-3 small findings (2200cbd) — examples/profiling-configs/README.md multinode → multi_node; drop unused import pytest from tests/unit/test_reporting.py; revert tests/unit/test_rocm_path.py to origin/develop (matches the Context revert in 9109f52).
build_orchestrator: lowercase --build-on-compute local image name (ae1996c) — surfaced live during --build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench: Path(dockerfile_path).name for scripts/.../Dockerfile (no .ubuntu.amd.Dockerfile suffix to strip) yielded Dockerfile, producing ci-..._Dockerfile and docker build -t rejection. .lower() the assembled local image name; verified end-to-end (Slurm 22936, image pushed to docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).
build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/) (98f1ac5) — surfaced live with --registry rocm/pytorch-private: registry.split("/")[0] was treated as the docker login host (docker login rocm -> DNS NXDOMAIN). _execute_build_on_compute now normalizes shorthand <ns>/<repo> to docker.io/<ns>/<repo> with a transparent [dim]auto-prefixing 'docker.io/'[/dim] notice, and rejects malformed input (whitespace, @, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, @bad/foo rejected without sbatch).
Address Copilot Round-4 findings: 5 P0 + 3 hardening (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in origin/develop-identical files unchanged by this PR):
- run_orchestrator.py: drop now-unused Context(rocm_path=...) kwarg from _init_runtime_context (line 138) and _create_manifest_from_local_image (line 456) call sites; drop from madengine.core.constants import get_rocm_path and unused handle_error imports. Latent TypeError on madengine run against local target after the Cat B core/context.py revert.
- container_runner.py: widen perf_pattern to mirror the deployment/base.py:_parse_node_log_for_perf regex (R2-Distributed solution - new madengine CLI #11) so tokens/sec, tok/s, samples_per_second, 1.23E+4 no longer get silently dropped.
- container_runner.py: login_to_registry delegates to madengine.core.auth.login_to_registry instead of f"echo '{password}' | docker login ..." inline (password no longer in ps//proc-visible argv even with --password-stdin); same delegation DockerBuilder.login_to_registry already uses.
- container_runner.py: get_env_arg no longer prints assembled --env KEY='value' to stdout (HF tokens, registry passwords, MAD_SECRETS_* etc. could leak); print only the count.
- container_runner.py: self-managed launcher (_run_self_managed) now shlex.quotes script_path and shlex.split + per-arg shlex.quotes model_args, so embedded shell metacharacters ($(), backticks, ;, ...) become literal arguments to the script rather than host-shell directives.
- docs/deployment.md: second occurrence of [Distributed Launchers Guide](distributed-launchers.md) (line 145) → (launchers.md) (the 0156d33 commit fixed only line 543).

Copilot review status

All four rounds resolved or replied-to inline:

Round 1 (9 comments) — addressed in d32b06b/5ec55ff; replies posted under each thread.
Round 2 (24 comments) — fully addressed across 9109f52 / 63a2eae / 0156d33 and replied-to inline:
- slurm.py: drop duplicate Optional import; honour slurm_config.exclusive.
- slurm_node_selector.py: cleanup_node merged srun; preserves --job-name and --reservation.
- build_orchestrator.py: built_models keyed by model_name; warn on divergent configs; early registry None guard; per-registry auth hints.
- run_orchestrator.py: real recursive _deep_merge for deployment_config ↔ additional_context; manifest's context.docker_env_vars restored; _load_credentials delegates to central auth helper.
- container_runner.py: documented shell=True trust boundary on inner subprocess.run.
- deployment/common.py: rocprofv3-fallback only filters rocprof; VALID_LAUNCHERS covers "megatron"/"primus".
- deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.
- 15 outside-scope changes (core/{context,docker,dataprovider,timeout}.py, cli/app.py, utils/{rocm_tool_manager,discover_models,config_parser}.py, orchestration/image_filtering.py, reporting/update_perf_csv.py, README.md) reverted to origin/develop.
- 3 broken doc links / example paths fixed (docs/{installation,deployment,launchers}.md).
Round 3 (13 comments) — 4 fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in origin/develop: container_runner.get_env_arg env-arg shell-escape hardening; Console.sh/subprocess.run(timeout=…) translation of 0 → None).
Round 4 (24 comments) — 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/test_gpu_management.py, tests/unit/test_reporting_superset.py, core/errors.py, database/mongodb.py, deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.

Sequencing

Updates to slurm launcher #86 (this PR) → develop.
Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
After both land, follow-up cherry-picks the combined work into main per repo policy.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Updates to slurm launcher#86

Updates to slurm launcher#86
raviguptaamd wants to merge 15 commits into
ROCm:developfrom
raviguptaamd:raviguptaamd/update_slurm_launcher

raviguptaamd commented Mar 12, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

raviguptaamd commented Mar 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Scope (75 files vs develop)

Commit shape

Copilot review status

Sequencing

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

raviguptaamd commented Mar 12, 2026 •

edited

Loading

Scope (75 files vs `develop`)