Updates to slurm launcher#86
Open
raviguptaamd wants to merge 15 commits into
Open
Conversation
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Adds a
slurm_multideployment launcher for self-managed multi-node disaggregated workloads (vLLM-disagg, SGLang-disagg, large-EP, kvcache-transfer) and the orchestration / build / run / CLI hooks that drive it frommadengine's manifest pipeline. Companion model cards and submit scripts that exercise this launcher live in ROCm/MAD-private PR #186.Scope (75 files vs
develop)This PR is now focused strictly on
slurm_multiand its supporting refactor. The Kubernetes deployment work originally bundled here has been moved to #118 (independent PR againstdevelop) and will rebase after this PR lands.deployment/{slurm.py, slurm_node_selector.py, base.py, common.py, config_loader.py, factory.py},templates/slurm/job.sh.j2orchestration/{build_orchestrator, run_orchestrator, image_filtering}.py— manifest-driven--build-on-compute/--use-image, deep-mergeddeployment_config, registry validationexecution/{container_runner, container_runner_helpers, docker_builder}.pycli/{app, commands/run, commands/build, constants, utils, validators}.pycore/,utils/,reporting/,database/mongodb.py,scripts/common/examples/slurm-configs/**,docs/launchers.md,docs/deployment.md,docs/configuration.md,tests/{unit,integration,e2e,fixtures}/**Net diff: +2,877 / -2,179 across 75 files (down from +3,014 / -2,624 across 87 files at the squash-rebase head, after Round-2 / Round-3 / Round-4 review fixes restored 12 files to
origin/developbyte-for-byte and tightened 4 more in-scope code paths).Commit shape
Logical commits on top of
develop:deployment: add slurm_multi launcher, node selector, SBATCH template(8815e24) — SLURM deployment core; absorbs the slurm-side Copilot Round-1 review fixes.orchestration/execution/cli: wire slurm_multi end-to-end(6f2e11b) — manifest deep-merge, registry guard, per-registry auth hints, prebuilt-image multi-model fix, subprocess shell trust comment, and the rest of the orchestration/CLI plumbing.examples/docs/tests: slurm_multi configs, README, and test coverage(50e8895) — operator-facing docs, examples renamedsglang-disagg-*→slurm-multi-*(one launcher drives multiple flows), and test additions.PR #86 follow-up: ErrorContext kwargs, model-card image fallback, perf.csv collection(5ec55ff) — fixes surfaced during the live SGLang-disagg smoke run (job 22607 / 22615 onDistributed_Inference_CI).slurm: namespace completion marker by SLURM_JOB_ID(710b51a) — second-run safety: per-deploy completion marker now job-id namespaced.Revert out-of-scope changes flagged by Copilot Round-2 review(9109f52) — 11 files reverted toorigin/developbyte-for-byte; surgical reverts incli/utils.pyandexecution/docker_builder.pykeep the slurm_multi-relevant bits (registry_image → DOCKER_IMAGE_NAMEpropagation for parallel pull,build_gpu_arch_displayhelper,Workloadcolumn rename).Address Copilot Round-2 slurm_multi-related findings(63a2eae) — 6 real fixes: module-level recursive_deep_mergeused on the manifest path;manifest.context.docker_env_varsrestored into runtime context;RunOrchestrator._load_credentialsdelegated tomadengine.core.auth.load_credentials(object-validatescredential.json);configure_multi_node_profilingonly filtersrocprofwhenrocprofv3is missing (keepsrccl_trace/rocblas_trace); perf-line regex widened (scientific notation + arbitrary metric tokens);"megatron"and"primus"added toVALID_LAUNCHERS.docs: fix broken Next-Steps links and example paths(0156d33) —distributed-launchers.md→launchers.md;user-guide.md→usage.md;how-to-quick-start.md→cli-reference.md; dropped K8sslurm-multi-*.jsonpaths that don't exist.Address Copilot Round-3 small findings(2200cbd) —examples/profiling-configs/README.mdmultinode→multi_node; drop unusedimport pytestfromtests/unit/test_reporting.py; reverttests/unit/test_rocm_path.pytoorigin/develop(matches theContextrevert in 9109f52).build_orchestrator: lowercase --build-on-compute local image name(ae1996c) — surfaced live during--build-on-compute --registry docker.io/rocm/pytorch-private --tags pyt_kvcache_transfer_bench:Path(dockerfile_path).nameforscripts/.../Dockerfile(no.ubuntu.amd.Dockerfilesuffix to strip) yieldedDockerfile, producingci-..._Dockerfileanddocker build -trejection..lower()the assembled local image name; verified end-to-end (Slurm 22936, image pushed todocker.io/rocm/pytorch-private:pyt_kvcache_transfer_bench).build_orchestrator: normalize and validate --registry shape (auto-prepend docker.io/)(98f1ac5) — surfaced live with--registry rocm/pytorch-private:registry.split("/")[0]was treated as the docker login host (docker login rocm-> DNS NXDOMAIN)._execute_build_on_computenow normalizes shorthand<ns>/<repo>todocker.io/<ns>/<repo>with a transparent[dim]auto-prefixing 'docker.io/'[/dim]notice, and rejects malformed input (whitespace,@, empty after trim) up front with usage hints for Dockerhub / GHCR / Quay / NGC / self-hosted / localhost. Verified end-to-end across all four input shapes (sbatch 22954/22956 succeeded,@bad/foorejected without sbatch).Address Copilot Round-4 findings: 5 P0 + 3 hardening(3fc7587) — 8 fixes covering all in-scope Round-4 items (the other 14 R4 findings are pre-existing flake8/style nits inorigin/develop-identical files unchanged by this PR):run_orchestrator.py: drop now-unusedContext(rocm_path=...)kwarg from_init_runtime_context(line 138) and_create_manifest_from_local_image(line 456) call sites; dropfrom madengine.core.constants import get_rocm_pathand unusedhandle_errorimports. LatentTypeErroronmadengine runagainst local target after the Cat Bcore/context.pyrevert.container_runner.py: widenperf_patternto mirror thedeployment/base.py:_parse_node_log_for_perfregex (R2-Distributed solution - new madengine CLI #11) sotokens/sec,tok/s,samples_per_second,1.23E+4no longer get silently dropped.container_runner.py:login_to_registrydelegates tomadengine.core.auth.login_to_registryinstead off"echo '{password}' | docker login ..."inline (password no longer inps//proc-visible argv even with--password-stdin); same delegationDockerBuilder.login_to_registryalready uses.container_runner.py:get_env_argno longer prints assembled--env KEY='value'to stdout (HF tokens, registry passwords,MAD_SECRETS_*etc. could leak); print only the count.container_runner.py: self-managed launcher (_run_self_managed) nowshlex.quotesscript_pathandshlex.split+ per-argshlex.quotesmodel_args, so embedded shell metacharacters ($(), backticks,;, ...) become literal arguments to the script rather than host-shell directives.docs/deployment.md: second occurrence of[Distributed Launchers Guide](distributed-launchers.md)(line 145) →(launchers.md)(the 0156d33 commit fixed only line 543).Copilot review status
All four rounds resolved or replied-to inline:
d32b06b/5ec55ff; replies posted under each thread.9109f52/63a2eae/0156d33and replied-to inline:slurm.py: drop duplicateOptionalimport; honourslurm_config.exclusive.slurm_node_selector.py: cleanup_node merged srun; preserves--job-nameand--reservation.build_orchestrator.py: built_models keyed bymodel_name; warn on divergent configs; early registry None guard; per-registry auth hints.run_orchestrator.py: real recursive_deep_mergefordeployment_config↔additional_context; manifest'scontext.docker_env_varsrestored;_load_credentialsdelegates to central auth helper.container_runner.py: documentedshell=Truetrust boundary on innersubprocess.run.deployment/common.py:rocprofv3-fallback only filtersrocprof;VALID_LAUNCHERScovers"megatron"/"primus".deployment/base.py: perf-line regex accepts scientific notation + arbitrary metric tokens.core/{context,docker,dataprovider,timeout}.py,cli/app.py,utils/{rocm_tool_manager,discover_models,config_parser}.py,orchestration/image_filtering.py,reporting/update_perf_csv.py,README.md) reverted toorigin/develop.docs/{installation,deployment,launchers}.md).2200cbd, 7 stale-snapshot replies (already addressed in earlier rounds), 2 deferred to follow-up issues (pre-existing concerns inorigin/develop:container_runner.get_env_argenv-arg shell-escape hardening;Console.sh/subprocess.run(timeout=…)translation of0→None).3fc7587(5 P0 + 3 hardening, see commit 12 above), 14 are pre-existing flake8/style nits in files byte-identical toorigin/develop(utils/{ops,log_formatting}.py,scripts/common/{tools,pre_scripts}/**,tests/fixtures/utils.py,tests/integration/test_gpu_management.py,tests/unit/test_reporting_superset.py,core/errors.py,database/mongodb.py,deployment/base.py:645) — not introduced by this PR; deferred to a separate cleanup PR.Sequencing
develop.developand merges.mainper repo policy.Test plan
DeploymentFactory.available_deployments()returnsslurm(andk8s/kubernetesfromdevelop's unchanged code, which this PR does not touch).normalize_launcherround-trip forslurm_multi/slurm-multi/megatron/primus/vllm/sglang._deep_mergenested-leaf preservation (verified: an override touchingslurm.node_selector.reservationkeeps a siblingslurm.node_selector.partitionfrom the manifest).Context()rejectsrocm_path=kwarg post-Cat-B-revert (verified viainspect.signature).--registrynormalization across all input shapes (rocm/pytorch-private,rocmshared,docker.io/rocm/pytorch-private,ghcr.io/myorg/myrepo,localhost:5000/myrepo,192.168.1.10:5000/myrepo,registry.example.com:5000/myrepo,@bad/foorejected, empty rejected).Distributed_Inference_CI(jobs 22607/22615): 5,977 tok/s (Llama-3.1-8B-Instruct).perf.csvwritten to/shared_inference/<user>/<jobid>/perf.csvand discovered by_collect_slurm_multi_results.--build-on-compute --registry docker.io/rocm/pytorch-private: 1,878 tok/s @ CON=8, 3,644 tok/s @ CON=16 (ISL/OSL=1024/1024). Image pushed todocker.io/rocm/pytorch-private:pyt_sglang_disagg_llama-3.3-70b-fp8.--build-on-compute --registry: 28 perf rows (DeepEP / MoRI EP dispatch + combine across FP8/BF16 RDMA/NVL).pyt_large_ep_benchmark_2non [008,024] via slurm_multi, 8 min, 28 ✅ Success rows. Confirms the Round-4 changes didn't regress the run path.pyt_vllm_disagg_llama3.1-8b/pyt_vllm_disagg_deepseek-v3_short) once the image-sideNIXL_ERR_BACKEND(orthogonal to this PR; observed onrocm/pytorch-private:20260407_itej89_vllm_mori_docker) is resolved.