Skip to content

Updates to slurm launcher#86

Open
raviguptaamd wants to merge 15 commits into
ROCm:developfrom
raviguptaamd:raviguptaamd/update_slurm_launcher
Open

Updates to slurm launcher#86
raviguptaamd wants to merge 15 commits into
ROCm:developfrom
raviguptaamd:raviguptaamd/update_slurm_launcher

Conversation

@raviguptaamd
Copy link
Copy Markdown

@raviguptaamd raviguptaamd commented Mar 12, 2026

Summary

Adds a slurm_multi deployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it from madengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.

Scope (75 files vs develop)

This PR is now focused strictly on slurm_multi and its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR against develop) and will rebase after this PR lands.

Layer Highlights
Deployment deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py}, templates/slurm/job.sh.j2
Orchestration / Build / Run orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py — manifest-driven --build-on-compute / --use-image, deep-merged deployment_config, registry validation
Execution execution/{container_runner, container_runner_helpers, docker_builder}.py
CLI cli/{app, commands/run, commands/build, constants, utils, validators}.py
Core / Utils / Reporting / DB / Scripts shared infra used by the new path: core/, utils/, reporting/, database/mongodb.py, scripts/common/
Examples / Docs / Tests examples/slurm-configs/**, docs/launchers.md, docs/deployment.md, docs/configuration.md, tests/{unit,integration,e2e,fixtures}/**

Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to origin/develop byte-for-byte and tightened 4 more in-scope code paths).

Commit shape

Logical commits on top of develop:

  1. deployment: add slurm_multi launcher, node selector, SBATCH template (8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.
  2. orchestration/execution/cli: wire slurm_multi end-to-end (6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.
  3. examples/docs/tests: slurm_multi configs, README, and test coverage (50e8895) — operator-facing docs, examples renamed sglang-disagg-*slurm-multi-* (one launcher drives multiple flows), and test additions.
  4. PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection (5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 on Distributed_Inference_CI).
  5. slurm: namespace completion marker by SLURM_JOB_ID (710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.
  6. Revert out-of-scope changes flagged by Copilot Round-2 review (9109f52) — 11 files reverted to origin/develop byte-for-byte; surgical reverts in cli/utils.py and execution/docker_builder.py keep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAME propagation for parallel pull, build_gpu_arch_display helper, Workload column rename).
  7. Address Copilot Round-2 slurm_multi-related findings (63a2eae) — 6 real fixes: module-level recursive _deep_merge used on the manifest path; manifest.context.docker_env_vars restored into runtime context; RunOrchestrator._load_credentials delegated to madengine.core.auth.load_credentials (object-validates credential.json); configure_multi_node_profiling only filters rocprof when rocprofv3 is missing (keeps rccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens); "megatron" and "primus" added to VALID_LAUNCHERS.
  8. docs: fix broken Next-Steps links and example paths (0156d33) — distributed-launchers.mdlaunchers.md; user-guide.mdusage.md; how-to-quick-start.mdcli-reference.md; dropped K8s slurm-multi-*.json paths that don't exist.
  9. Address Copilot Round-3 small findings (2200cbd) — examples/profiling-configs/README.md multinodemulti_node; drop unused import pytest from tests/unit/test_reporting.py; revert tests/unit/test_rocm_path.py to origin/develop (matches the Context revert in 9109f52).
  10. build_orchestrator: lowercase --build-on-compute local image name (ae1996c) — surfaced live during --build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench: Path(dockerfile_path).name for scripts/.../Dockerfile (no .ubuntu.amd.Dockerfile suffix to strip) yielded Dockerfile, producing ci-..._Dockerfile and docker build -t rejection. .lower() the assembled local image name; verified end-to-end (Slurm 22936, image pushed to docker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).
  11. build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/) (98f1ac5) — surfaced live with --registry rocm/pytorch-private: registry.split("/")[0] was treated as the docker login host (docker login rocm -> DNS NXDOMAIN). _execute_build_on_compute now normalizes shorthand <ns>/<repo> to docker.io/<ns>/<repo> with a transparent [dim]auto-prefixing 'docker.io/'[/dim] notice, and rejects malformed input (whitespace, @, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded, @bad/foo rejected without sbatch).
  12. Address Copilot Round-4 findings: 5 P0 + 3 hardening (3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits in origin/develop-identical files unchanged by this PR):
    • run_orchestrator.py: drop now-unused Context(rocm_path=...) kwarg from _init_runtime_context (line 138) and _create_manifest_from_local_image (line 456) call sites; drop from madengine.core.constants import get_rocm_path and unused handle_error imports. Latent TypeError on madengine run against local target after the Cat B core/context.py revert.
    • container_runner.py: widen perf_pattern to mirror the deployment/base.py:_parse_node_log_for_perf regex (R2-Distributed solution - new madengine CLI #11) so tokens/sec, tok/s, samples_per_second, 1.23E+4 no longer get silently dropped.
    • container_runner.py: login_to_registry delegates to madengine.core.auth.login_to_registry instead of f"echo '{password}' | docker login ..." inline (password no longer in ps//proc-visible argv even with --password-stdin); same delegation DockerBuilder.login_to_registry already uses.
    • container_runner.py: get_env_arg no longer prints assembled --env KEY='value' to stdout (HF tokens, registry passwords, MAD_SECRETS_* etc. could leak); print only the count.
    • container_runner.py: self-managed launcher (_run_self_managed) now shlex.quotes script_path and shlex.split + per-arg shlex.quotes model_args, so embedded shell metacharacters ($(), backticks, ;, ...) become literal arguments to the script rather than host-shell directives.
    • docs/deployment.md: second occurrence of [Distributed Launchers Guide](distributed-launchers.md) (line 145) → (launchers.md) (the 0156d33 commit fixed only line 543).

Copilot review status

All four rounds resolved or replied-to inline:

  • Round 1 (9 comments) — addressed in d32b06b/5ec55ff; replies posted under each thread.
  • Round 2 (24 comments) — fully addressed across 9109f52 / 63a2eae / 0156d33 and replied-to inline:
    • slurm.py: drop duplicate Optional import; honour slurm_config.exclusive.
    • slurm_node_selector.py: cleanup_node merged srun; preserves --job-name and --reservation.
    • build_orchestrator.py: built_models keyed by model_name; warn on divergent configs; early registry None guard; per-registry auth hints.
    • run_orchestrator.py: real recursive _deep_merge for deployment_configadditional_context; manifest's context.docker_env_vars restored; _load_credentials delegates to central auth helper.
    • container_runner.py: documented shell=True trust boundary on inner subprocess.run.
    • deployment/common.py: rocprofv3-fallback only filters rocprof; VALID_LAUNCHERS covers &#34;megatron&#34;/&#34;primus&#34;.
    • deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.
    • 15 outside-scope changes (core/{context,docker,dataprovider,timeout}.py, cli/app.py, utils/{rocm_tool_manager,discover_models,config_parser}.py, orchestration/image_filtering.py, reporting/update_perf_csv.py, README.md) reverted to origin/develop.
    • 3 broken doc links / example paths fixed (docs/{installation,deployment,launchers}.md).
  • Round 3 (13 comments) — 4 fixed in 2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns in origin/develop: container_runner.get_env_arg env-arg shell-escape hardening; Console.sh/subprocess.run(timeout=…) translation of 0None).
  • Round 4 (24 comments) — 8 fixed in 3fc7587 (5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical to origin/develop (utils/{ops,log_formatting}.py, scripts/common/{tools,pre_scripts}/**, tests/fixtures/utils.py, tests/integration/test_gpu_management.py, tests/unit/test_reporting_superset.py, core/errors.py, database/mongodb.py, deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.

Sequencing

  1. Updates to slurm launcher #86 (this PR) → develop.
  2. Kubernetes deployment (split from PR #86) #118 (Kubernetes) rebases onto develop and merges.
  3. After both land, follow-up cherry-picks the combined work into main per repo policy.

Test plan

  • Import-only smoke check: DeploymentFactory.available_deployments() returns slurm (and k8s/kubernetes from develop's unchanged code, which this PR does not touch).
  • normalize_launcher round-trip for slurm_multi/slurm-multi/megatron/primus/vllm/sglang.
  • _deep_merge nested-leaf preservation (verified: an override touching slurm.node_selector.reservation keeps a sibling slurm.node_selector.partition from the manifest).
  • Context() rejects rocm_path= kwarg post-Cat-B-revert (verified via inspect.signature).
  • --registry normalization across all input shapes (rocm/pytorch-private, rocmshared, docker.io/rocm/pytorch-private, ghcr.io/myorg/myrepo, localhost:5000/myrepo, 192.168.1.10:5000/myrepo, registry.example.com:5000/myrepo, @bad/foo rejected, empty rejected).
  • End-to-end SGLang-disagg DSV3 1Px8/1Dx8 (24 GPUs / 3 nodes) on Distributed_Inference_CI (jobs 22607/22615): 5,977 tok/s (Llama-3.1-8B-Instruct). perf.csv written to /shared_inference/<user>/<jobid>/perf.csv and discovered by _collect_slurm_multi_results.
  • End-to-end SGLang-disagg Llama-3.3-70b-fp8 (24 GPUs / 3 nodes) via --build-on-compute --registry docker.io/rocm/pytorch-private: 1,878 tok/s @ CON=8, 3,644 tok/s @ CON=16 (ISL/OSL=1024/1024). Image pushed to docker.io/rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8.
  • End-to-end pyt_large_ep_benchmark_2n (16 GPUs / 2 nodes) via --build-on-compute --registry: 28 perf rows (DeepEP / MoRI EP dispatch + combine across FP8/BF16 RDMA/NVL).
  • End-to-end pyt_kvcache_transfer_bench (16 GPUs / 2 nodes): 57 perf rows (mooncake + MORI + RIXL backends, 4KB → 1GB sweep).
  • Re-smoke after Round-4 fixes: pyt_large_ep_benchmark_2n on [008,024] via slurm_multi, 8 min, 28 ✅ Success rows. Confirms the Round-4 changes didn't regress the run path.
  • CI on the rebased branch.
  • Optional vLLM-MoRI end-to-end (pyt_vllm_disagg_llama3.1-8b / pyt_vllm_disagg_deepseek-v3_short) once the image-side NIXL_ERR_BACKEND (orthogonal to this PR; observed on rocm/pytorch-private:20260407_itej89_vllm_mori_docker) is resolved.

Loading
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants