From 537beb92dab467a7f7a0a057674fbfde10dc0c8a Mon Sep 17 00:00:00 2001
From: William Chen <57119977+OCWC22@users.noreply.github.com>
Date: Sat, 2 May 2026 12:34:13 -0700
Subject: [PATCH 1/5] fix(dsv4): gate h200 reasoning parser flag

---
 benchmarks/single_node/dsv4_fp8_h200.sh | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/benchmarks/single_node/dsv4_fp8_h200.sh b/benchmarks/single_node/dsv4_fp8_h200.sh
index 167a50a57..9b381fed2 100644
--- a/benchmarks/single_node/dsv4_fp8_h200.sh
+++ b/benchmarks/single_node/dsv4_fp8_h200.sh
@@ -25,6 +25,7 @@ hf download "$MODEL"
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
+ENABLE_DSV4_REASONING_PARSER=${ENABLE_DSV4_REASONING_PARSER:-false}
 
 # DeepSeek-V4-Pro weights are large; engine startup can exceed the default
 # 600s. Give it an hour to load.
@@ -37,6 +38,11 @@ else
   MAX_MODEL_LEN_ARG="--max-model-len 800000"
 fi
 
+REASONING_PARSER_ARGS=()
+if [[ "${ENABLE_DSV4_REASONING_PARSER}" == "true" ]]; then
+  REASONING_PARSER_ARGS+=(--reasoning-parser deepseek_v4)
+fi
+
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
 
@@ -60,7 +66,7 @@
 $MAX_MODEL_LEN_ARG \
 --tokenizer-mode deepseek_v4 \
 --tool-call-parser deepseek_v4 \
 --enable-auto-tool-choice \
---reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
+"${REASONING_PARSER_ARGS[@]}" > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
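The pattern in the patch above — collecting optional server flags in a bash array so the flag vanishes entirely when disabled — can be sketched in isolation. This is a standalone illustration, not the benchmark script itself:

```shell
#!/usr/bin/env bash
# Standalone sketch of the flag-gating pattern used in PATCH 1/5.
# An empty bash array expands to zero words under "${arr[@]}", so the
# optional flag disappears from the server command line when disabled.
# Caveat: under `set -u`, empty-array expansion requires bash >= 4.4.
set -euo pipefail

ENABLE_DSV4_REASONING_PARSER=${ENABLE_DSV4_REASONING_PARSER:-false}

REASONING_PARSER_ARGS=()
if [[ "${ENABLE_DSV4_REASONING_PARSER}" == "true" ]]; then
  REASONING_PARSER_ARGS+=(--reasoning-parser deepseek_v4)
fi

# Print one word per line to show exactly what the server would receive;
# with the default (false), only "launch-server" is printed.
printf '%s\n' launch-server "${REASONING_PARSER_ARGS[@]}"
```

The same array-append approach extends to any number of conditional flags without nested quoting problems.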
From df9aa0cd88b3591fa89a4b127685c25862acbb02 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:25:13 -0700 Subject: [PATCH 2/5] docs(agentic): add GMI truth matrix [skip-sweep] --- AGENTIC_TRUTH_MATRIX.md | 247 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 247 insertions(+) create mode 100644 AGENTIC_TRUTH_MATRIX.md diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md new file mode 100644 index 000000000..fdd91f139 --- /dev/null +++ b/AGENTIC_TRUTH_MATRIX.md @@ -0,0 +1,247 @@ +# SemiAnalysis InferenceX Agentic/WEKA Truth Matrix + +Date: 2026-05-03 + +Scope: local `InferenceX` checkout, focused on the PR path that adds the `agentic-coding` scenario and WEKA trace replay. This is a truth matrix for deciding what still needs to be built before a GMI Cloud or other neocloud platform engineer can use the harness to evaluate real long-context chat and coding inference workloads. + +## Bottom Line + +The current InferenceX implementation is a real but experimental **agentic trace replay harness**. It replays recorded WEKA coding/chat traces against an OpenAI-compatible serving endpoint and emits latency, throughput, cache, workload-distribution, and artifact outputs. + +It is **not yet** a complete GMI/neocloud evaluation harness for DeepSeek-V4 on B200/B300/GB200. The biggest gap is that `agentic-coding` is wired for some DeepSeek-R1 and GPT-OSS/Kimi paths, while the DeepSeek-V4 GB200/B300/B200 surface is still mostly fixed-sequence or srt-slurm recipe driven. The harness also does not yet produce the full cluster, network, reliability, cost, and operator-readiness evidence that a cloud platform engineer would need. + +## Actual Code Path Today + +| Step | What happens | Actual code | Truth status | +|---|---|---|---| +| 1 | Config declares an optional `agentic-coding` scenario. 
| `.github/configs/CONFIGS.md` | Exists | +| 2 | NVIDIA/AMD master configs include a small number of `agentic-coding` entries. | `.github/configs/nvidia-master.yaml`, `.github/configs/amd-master.yaml` | Exists, narrow | +| 3 | Matrix generator expands agentic entries across concurrency, TP, EP, DP attention, offload, runner, image, model, and duration. | `utils/matrix_logic/generate_sweep_configs.py` | Exists | +| 4 | GitHub workflow sets agentic routing env vars. | `.github/workflows/benchmark-tmpl.yml` | Exists | +| 5 | Runner selects `benchmarks/single_node/agentic/...` instead of normal fixed-seq scripts. | `runners/launch_*.sh` via `SCENARIO_SUBDIR=agentic/` | Exists | +| 6 | Shared library resolves WEKA trace source and builds the replay command. | `benchmarks/benchmark_lib.sh` | Exists | +| 7 | Agentic script starts the serving backend and runs trace replay. | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`, peers | Exists | +| 8 | Multi-node agentic path runs client-only replay against an already-started srt-slurm frontend. | `benchmarks/multi_node/agentic_srt.sh` | Exists, experimental | +| 9 | Aggregator turns replay CSVs into InferenceX-like JSON. | `utils/process_agentic_result.py` | Exists | +| 10 | Workflow uploads raw and aggregated artifacts. 
| `.github/workflows/benchmark-tmpl.yml`, `.github/workflows/e2e-tests.yml` | Exists | + +## Actual Code Snippets + +The trace source is hardcoded to a Hugging Face dataset: + +```bash +local dataset="semianalysisai/cc-traces-weka-042026" +TRACE_SOURCE_FLAG="--hf-dataset $dataset" +``` + +Source: `benchmarks/benchmark_lib.sh` + +Agentic replay is built as a client workload against the local serving endpoint: + +```bash +REPLAY_CMD="python3 $TRACE_REPLAY_DIR/trace_replay_tester.py" +REPLAY_CMD+=" --api-endpoint http://localhost:$PORT" +REPLAY_CMD+=" $TRACE_SOURCE_FLAG" +REPLAY_CMD+=" --output-dir $result_dir/trace_replay" +REPLAY_CMD+=" --start-users $CONC" +REPLAY_CMD+=" --max-users $CONC" +REPLAY_CMD+=" --test-duration $duration" +REPLAY_CMD+=" --recycle" +REPLAY_CMD+=" --warmup-enabled" +REPLAY_CMD+=" --seed 42" +``` + +Source: `benchmarks/benchmark_lib.sh` + +The workflow routes agentic jobs by setting: + +```yaml +SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || '' }} +IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }} +RESULT_DIR: /workspace/results +``` + +Source: `.github/workflows/benchmark-tmpl.yml` + +The B200 DeepSeek-R1 agentic script starts SGLang, waits for readiness, runs replay, then aggregates: + +```bash +resolve_trace_source +install_agentic_deps +python3 -m sglang.launch_server ... 
--enable-metrics > "$SERVER_LOG" 2>&1 & +wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +build_replay_cmd "$RESULT_DIR" +$REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" || true +write_agentic_result_json "$RESULT_DIR" +``` + +Source: `benchmarks/single_node/agentic/dsr1_fp4_b200.sh` + +Aggregated JSON includes scenario identity, topology, success counts, latency, throughput, token distributions, cache stats, and per-GPU throughput: + +```python +agg = { + "hw": os.environ.get('RUNNER_TYPE', ''), + "conc": conc, + "model": os.environ.get('MODEL', ''), + "framework": os.environ.get('FRAMEWORK', ''), + "scenario_type": "agentic-coding", + "is_multinode": is_multinode, + "tp": tp, + "ep": ep, + "offloading": os.environ.get('OFFLOADING', 'none'), + "num_requests_total": len(rows), + "num_requests_successful": len(successful), +} +``` + +Source: `utils/process_agentic_result.py` + +## What The Use Case Actually Is + +The current use case is: + +| Use case | Current behavior | +|---|---| +| Replay realistic coding/chat request traces | Yes, via `semianalysisai/cc-traces-weka-042026`. | +| Drive a serving endpoint with concurrent users | Yes, with `--start-users $CONC` and `--max-users $CONC`. | +| Measure request-level TTFT / E2E / ITL / TPOT | Yes, from `trace_replay/detailed_results.csv`. | +| Measure throughput and throughput per GPU | Yes, from completed request timestamps and configured GPU counts. | +| Measure input/output token distribution | Yes, from replay rows. | +| Estimate cache reuse | Partially. It reports theoretical replay cache hit rate and server prefix-cache counters when metrics exist. | +| Evaluate real autonomous coding agent behavior | No. It replays traces; it does not run an agent loop with tools, repo edits, tests, retries, or feedback. | +| Evaluate GMI customer traffic | No, unless GMI traffic is converted into the same trace-replay format. 
| + +## Current Coverage Matrix + +| Surface | Current status | Notes | +|---|---|---| +| DeepSeek-R1 FP4 B200 SGLang single-node agentic | Exists | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`. | +| DeepSeek-R1 FP4 B200 Dynamo/TRT multi-node agentic | Exists, experimental | Uses a special `cquil11/srt-slurm-nv` branch and a `128k_agentic` recipe. | +| DeepSeek-R1 FP4 MI355X SGLang single-node agentic | Exists | AMD entry in `.github/configs/amd-master.yaml`. | +| GPT-OSS FP4 H100/H200/MI300X/MI325X agentic scripts | Exists as scripts | Need config coverage and live validation per target. | +| Kimi K2.5 FP4 B200 agentic script | Exists as script | Need config coverage and live validation. | +| DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | +| DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | +| DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | +| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | +| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | +| LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | + +## What A GMI/Neocloud Platform Engineer Actually Cares About + +| Category | What they need to decide | Current harness answer | Gap | +|---|---|---|---| +| Capacity planning | How many concurrent coding/chat sessions per node or rack before SLO violation? | Partial: concurrency sweep and request latencies. | Needs SLO pass/fail curves, saturation point, and capacity recommendations. | +| Latency SLO | P50/P90/P99 TTFT, TPOT, E2E for long-context chat/coding. | Partial: computes latency stats. | Needs explicit SLO config, pass/fail, and stable run windows. 
| +| Long context | How 32k/64k/128k/256k+ context behaves under realistic reuse. | Partial: WEKA traces may include realistic shapes, but context buckets are not first-class in matrix. | Needs explicit context-length stratification and reporting. | +| Coding workload realism | Does traffic resemble coding assistants, repo Q&A, edits, tests, tool calls? | Partial: recorded traces, but no task taxonomy shown in core benchmark output. | Needs workload classes: code chat, repo QA, patch generation, test/debug loop, long-doc coding. | +| Cache value | Does prefix/KV reuse improve latency, cost, and throughput? | Partial: theoretical and server prefix-cache hit metrics. | Needs engine-specific cache event metrics, eviction, residency, fragmentation, reuse distance, cache salt/isolation. | +| Multi-tenant isolation | Does one tenant poison or evict another tenant's cache? | Missing. | Needs tenant IDs, cache salts, fairness and isolation reports. | +| Memory pressure | When do KV cache, CPU offload, swap, or SSD tiers collapse? | Partial: offloading field and a few counters. | Needs GPU memory, HBM pressure, CPU memory, SSD bandwidth, eviction storms, OOM attribution. | +| Slurm operator flow | Can an operator dry-run, submit, monitor, cancel, and collect artifacts? | Partial in InferenceX CI and srt-slurm paths. | Needs portable Slurm matrix runner, sbatch rendering, env-only cluster config, and artifact contract. | +| Network health | Are NCCL/RDMA/NVLink/IB topology problems caught before benchmark? | Missing in InferenceX agentic path. | Needs preflight topology and network smoke checks. | +| Reproducibility | Can results be traced to image digest, repo SHA, GPU inventory, driver, topology, and versions? | Partial: CI has image/model/framework fields. | Needs full provenance captured per job. | +| Reliability | Do runs survive cold start, warmup, long duration, failed requests, server restarts? | Partial: success counts and raw logs. 
| Needs failure taxonomy, retry policy, health timeline, and soak tests. | +| Cost model | Which hardware/runtime gives best $/successful-session or $/M tokens at SLO? | Missing. | Needs GPU-hour pricing input and cost-per-SLO report. | +| Hardware comparison | B200 vs B300 vs GB200 for the same DSV4 workload. | Missing for agentic. | Need same workload across same engines and configs. | +| Runtime comparison | vLLM vs SGLang vs TRT/Dynamo under identical trace replay. | Partial for some models. | Need normalized DSV4 matrix and identical trace/scheduler settings. | +| Production readiness | What config should GMI actually offer customers? | Missing. | Needs recommended SKUs, caveats, and no-go thresholds. | + +## Truth Matrix: Current vs Required + +Legend: + +- Yes: implemented in the local InferenceX path. +- Partial: implemented but too narrow, experimental, or missing key evidence. +- No: not implemented. +- Unknown: cannot be proven from this repo without live cluster results or external data. + +| Requirement | Current truth | Evidence | Needed build | +|---|---:|---|---| +| Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | +| WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | +| Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | +| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | +| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. 
| +| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | +| B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | +| vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | +| Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | +| Coding workload taxonomy | Partial | Trace replay exists; distribution plot exists. | Add task labels and per-class metrics. | +| TTFT/TPOT/E2E latency metrics | Yes | `compute_latency_stats`. | Add SLO pass/fail summary. | +| Throughput per GPU | Yes | `tput_per_gpu` in processor. | Add SLO-qualified throughput, not just raw throughput. | +| Failed request taxonomy | Partial | Success count exists. | Add HTTP error class, timeout, OOM, scheduler reject, engine crash. | +| Prefix/KV cache hit metrics | Partial | Theoretical + server prefix counters when present. | Add LMCache/TensorMesh/vLLM/SGLang metric adapters with measured-vs-inferred flags. | +| Eviction/fragmentation proof | No | No live cache event schema in agentic path. | Add engine metric scraping and artifact schema. | +| Multi-tenant cache isolation | No | No tenant IDs or cache salt model. | Add multi-tenant trace mode and isolation metrics. | +| CPU/SSD offload analysis | Partial | `offloading` field and some counters in processor. | Add tier residency, bandwidth, latency, and failure attribution. | +| Slurm dry-run matrix generation | Partial | InferenceX CI/srt-slurm flow exists; not a portable GMI operator runner. | Add portable Slurm matrix runner and artifact contract. | +| NCCL/RDMA/topology preflight | No | Not in agentic path. 
| Add pre-benchmark smoke checks. | +| Full provenance capture | Partial | JSON includes image/model/framework; raw logs upload. | Add digest, repo SHA, driver, CUDA, GPU inventory, topology, package versions. | +| Cost and capacity report | No | No pricing or recommendation layer. | Add cost inputs and capacity planning report. | +| Customer-ready operator report | No | Raw/aggregated artifacts only. | Add one-page operator brief with recommendations and caveats. | + +## Recommended Build Matrix For GMI/Neocloud Evaluation + +This is the minimum useful matrix for a GMI cloud engineer evaluating long-context chat and coding workloads. It is intentionally smaller than a full combinatorial sweep. + +| Axis | Required values | Why it matters | +|---|---|---| +| Hardware | B200, B300, GB200 | These are the procurement/deployment choices. | +| Model | DeepSeek-V4-Pro first; DeepSeek-R1 as control | DSV4 is the target; DSR1 provides existing harness continuity. | +| Runtime | vLLM, SGLang, Dynamo/TRT where supported | GMI needs runtime/SKU decision data. | +| Topology | single-node, multi-node disagg | Long context and MoE behavior differ sharply by topology. | +| Context bucket | 8k, 32k, 64k, 128k, 256k+ | Cloud operators need max supported context and degradation curve. | +| Workload type | long chat, repo QA, code generation, test/debug loop, multi-turn agent | Coding traffic is not one workload. | +| Concurrency | 1, 2, 4, 8, 16, 32, 64, 128, then saturation search | Finds knee of curve and failure region. | +| Arrival mode | closed-loop and burst/open-loop | Closed-loop measures users; open-loop exposes queue collapse. | +| Cache mode | cache off, engine prefix cache, LMCache/TensorMesh if available | Proves whether cache stack actually helps. | +| Tenant mode | single tenant, multi-tenant with cache salt | Proves isolation and fairness. | +| Duration | 10 min smoke, 30 min curve, 2-4 hr soak | Separates launch success from operational stability. 
| + +## What To Build Next + +| Priority | Build item | Acceptance criteria | +|---:|---|---| +| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | +| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | +| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | +| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | +| P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | +| P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | +| P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. | +| P1 | NCCL/RDMA/topology preflight | Preflight emits pass/fail/skipped before benchmark execution. | +| P1 | Cache metrics adapters | vLLM/SGLang/LMCache/TensorMesh metrics are normalized with measured vs inferred labels. | +| P2 | Multi-tenant replay mode | Tenant IDs, cache salt/isolation, fairness, noisy-neighbor metrics. | +| P2 | Cost model | Add GPU-hour price input and output $/successful-session, $/M input tokens, $/M output tokens at SLO. | +| P2 | Operator brief | Generate a human-readable recommendation with caveats: best config, no-go configs, saturation point, and missing proof. | + +## Non-Claims To Preserve + +Do not claim any of the following until live artifacts prove them: + +- DeepSeek-V4 agentic performance on GB200. +- B200/B300/GB200 parity under the same long-context trace replay. +- LMCache/TensorMesh benefit. 
+- Cache eviction or fragmentation behavior. +- Multi-tenant isolation. +- Production readiness for GMI customer workloads. +- Autonomous agent performance; this is trace replay, not a tool-using agent loop. + +## Proposed File/Code Changes For The Next PR + +| Area | Candidate files | +|---|---| +| DSV4 agentic configs | `.github/configs/nvidia-master.yaml`, possibly a separate GMI/GPU pilot config. | +| DSV4 single-node launchers | `benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh`, `benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh`, vLLM variants if supported. | +| GB200 multi-node agentic | `benchmarks/multi_node/agentic_srt.sh`, `benchmarks/multi_node/srt-slurm-recipes/.../deepseek-v4/...`, `runners/launch_gb200-nv.sh`. | +| Slurm operator harness | `scripts/slurm/`, `scripts/run_agentic_slurm_matrix.py`, `configs/agentic_slurm_matrix.yaml`. | +| Metrics schema | `utils/process_agentic_result.py` plus a new normalized metrics schema module. | +| Artifact contract tests | `utils/matrix_logic/test_*.py` or new repo-level tests for dry-run contract. | +| Operator report | new `utils/summarize_agentic.py` or integration into `utils/summarize.py`. | + +## Decision + +For GMI/neocloud evaluation, the current InferenceX PR is a **good starting mechanism**, not a finished benchmark product. Build the missing DSV4+B200/B300/GB200 agentic Slurm surface, add provenance/preflight/cache/SLO reporting, and keep every unmeasured claim explicitly labeled as unproven. 
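The P1 "SLO/capacity report" build item in the truth matrix above asks, for each matrix cell, for the maximum concurrency that still meets a latency SLO. A minimal sketch of that computation follows; the function name and the (concurrency, TTFT) row layout are hypothetical, since the real `trace_replay/detailed_results.csv` schema is not shown in this excerpt:

```python
from collections import defaultdict

def max_concurrency_at_slo(rows, ttft_slo_s=2.0, quantile=0.99):
    """Given replay rows as (concurrency, ttft_seconds) pairs, return the
    highest concurrency whose TTFT quantile still meets the SLO, or None
    if even the lowest concurrency violates it."""
    by_conc = defaultdict(list)
    for conc, ttft in rows:
        by_conc[conc].append(ttft)
    passing = None
    for conc in sorted(by_conc):
        ttfts = sorted(by_conc[conc])
        # Nearest-rank quantile; adequate for a report sketch.
        idx = min(len(ttfts) - 1, int(quantile * len(ttfts)))
        if ttfts[idx] <= ttft_slo_s:
            passing = conc
        else:
            break  # assumes latency is monotone non-decreasing in concurrency
    return passing

# Synthetic example: TTFT degrades linearly with concurrency.
rows = [(c, 0.05 * c) for c in (1, 2, 4, 8, 16, 32, 64) for _ in range(100)]
print(max_concurrency_at_slo(rows, ttft_slo_s=2.0))  # -> 32 (1.6s passes; 64 gives 3.2s)
```

A real report would run the same reduction per (hardware, engine, context bucket) cell and also record the failure reason beyond the knee, as the matrix requires.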
From bfea80549194a4649124f9df73c46f5682a33e40 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:35:05 -0700 Subject: [PATCH 3/5] feat(agentic): add GMI DSV4 Slurm harness [skip-sweep] --- .github/configs/nvidia-master.yaml | 8 + AGENTIC_TRUTH_MATRIX.md | 20 +- benchmarks/benchmark_lib.sh | 2 +- .../agentic/dsv4_fp4_b200_sglang.sh | 17 + .../agentic/dsv4_fp4_b300_sglang.sh | 17 + benchmarks/single_node/dsv4_fp4_b200.sh | 19 + .../single_node/dsv4_fp4_b300_sglang.sh | 19 + configs/agentic_slurm_matrix.json | 73 ++++ configs/agentic_slurm_matrix.yaml | 67 +++ scripts/run_agentic_slurm_matrix.py | 381 ++++++++++++++++++ scripts/slurm/agentic_job.sbatch.tmpl | 89 ++++ utils/test_agentic_slurm_matrix.py | 84 ++++ 12 files changed, 785 insertions(+), 11 deletions(-) create mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh create mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh create mode 100644 configs/agentic_slurm_matrix.json create mode 100644 configs/agentic_slurm_matrix.yaml create mode 100755 scripts/run_agentic_slurm_matrix.py create mode 100644 scripts/slurm/agentic_job.sbatch.tmpl create mode 100644 utils/test_agentic_slurm_matrix.py diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index 38d1101f3..3b31a65e8 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -1732,6 +1732,10 @@ dsv4-fp4-b200-sglang: - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } # DP-attention (DP_ATTENTION=true) — max-throughput CONC range - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 } + agentic-coding: + - duration: 1800 + search-space: + - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } dsv4-fp4-b200-vllm: image: vllm/vllm-openai:v0.20.0-cu130 @@ -1951,6 +1955,10 @@ dsv4-fp4-b300-sglang: - { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 
512 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 2048 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 } + agentic-coding: + - duration: 1800 + search-space: + - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } # DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is # selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md index fdd91f139..b23f598b1 100644 --- a/AGENTIC_TRUTH_MATRIX.md +++ b/AGENTIC_TRUTH_MATRIX.md @@ -124,8 +124,8 @@ The current use case is: | DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | | DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | | DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | -| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | +| DeepSeek-V4 B200/B300 SGLang agentic trace replay | Wired, needs live validation | Added DSV4 single-node agentic wrappers and conservative `agentic-coding` config rows. | +| DeepSeek-V4 GB200 agentic trace replay | Dry-run harness only | Portable Slurm matrix can render GB200 agentic jobs, but live srt-slurm recipe behavior is still unproven. | | LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | ## What A GMI/Neocloud Platform Engineer Actually Cares About @@ -161,11 +161,11 @@ Legend: |---|---:|---|---| | Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | | WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. 
| Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Run live DSV4 B200/B300 validation. | | Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | -| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. | -| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | +| DeepSeek-V4 B200 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B200 Slurm and attach artifacts. | +| DeepSeek-V4 B300 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B300 Slurm and attach artifacts. | +| DeepSeek-V4 GB200 agentic | Partial | Portable matrix renders GB200 jobs; live srt-slurm behavior remains unproven. | Add/validate GB200 agentic srt-slurm recipe on real hardware. | | B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | | vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | | Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. 
| @@ -205,10 +205,10 @@ This is the minimum useful matrix for a GMI cloud engineer evaluating long-conte | Priority | Build item | Acceptance criteria | |---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | -| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | -| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | -| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | +| P0 | DSV4 `agentic-coding` configs for B200/B300 | Implemented; requires live run artifacts. | +| P0 | DSV4 agentic launchers for B200/B300 | Implemented; reuses existing SGLang server recipes and switches the client to WEKA trace replay. | +| P0 | Portable Slurm matrix runner | Implemented for dry-run/submit; no hardcoded cluster IDs; all cluster settings via env/JSON/YAML. | +| P0 | Artifact contract | Implemented expected-path manifest; live runs must still produce the artifacts before claims. | | P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | | P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | | P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. 
| diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh index 4c0c8642e..994111bad 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -892,7 +892,7 @@ ensure_hf_cli() { } resolve_trace_source() { - local dataset="semianalysisai/cc-traces-weka-042026" + local dataset="${TRACE_SOURCE:-semianalysisai/cc-traces-weka-042026}" TRACE_SOURCE_FLAG="--hf-dataset $dataset" echo "Loading traces from Hugging Face dataset: $dataset" # Pre-download the dataset into the shared HF_HUB_CACHE (same mount used diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh new file mode 100755 index 000000000..dcac3bdb3 --- /dev/null +++ b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B200 with SGLang. +# The server recipe lives in ../dsv4_fp4_b200.sh; AGENTIC_MODE switches the +# post-ready client from fixed random prompts to WEKA trace replay. + +export AGENTIC_MODE=1 +export ISL="${ISL:-8192}" +export OSL="${OSL:-1024}" +export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" +export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b200_sglang}" + +REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" +export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" + +exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b200.sh" diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh new file mode 100755 index 000000000..a5dc2387c --- /dev/null +++ b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B300 with SGLang. 
+# The server recipe lives in ../dsv4_fp4_b300_sglang.sh; AGENTIC_MODE switches +# the post-ready client from fixed random prompts to WEKA trace replay. + +export AGENTIC_MODE=1 +export ISL="${ISL:-8192}" +export OSL="${OSL:-1024}" +export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" +export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b300_sglang}" + +REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" +export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" + +exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b300_sglang.sh" diff --git a/benchmarks/single_node/dsv4_fp4_b200.sh b/benchmarks/single_node/dsv4_fp4_b200.sh index df1259deb..6577b7791 100755 --- a/benchmarks/single_node/dsv4_fp4_b200.sh +++ b/benchmarks/single_node/dsv4_fp4_b200.sh @@ -100,6 +100,25 @@ SERVER_PID=$! wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +if [ "${AGENTIC_MODE:-0}" = "1" ]; then + RESULT_DIR="${RESULT_DIR:-$PWD/results}" + mkdir -p "$RESULT_DIR" + cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true + resolve_trace_source + install_agentic_deps + build_replay_cmd "$RESULT_DIR" + echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" + set +e + $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" + REPLAY_RC=${PIPESTATUS[0]} + set -e + write_agentic_result_json "$RESULT_DIR" + python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ + "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true + stop_gpu_monitor + exit "$REPLAY_RC" +fi + pip install -q datasets pandas run_benchmark_serving \ diff --git a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh index 8f43ea8a3..2a053ae8f 100755 --- a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh +++ b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh @@ -186,6 +186,25 @@ SERVER_PID=$! 
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +if [ "${AGENTIC_MODE:-0}" = "1" ]; then + RESULT_DIR="${RESULT_DIR:-$PWD/results}" + mkdir -p "$RESULT_DIR" + cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true + resolve_trace_source + install_agentic_deps + build_replay_cmd "$RESULT_DIR" + echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" + set +e + $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" + REPLAY_RC=${PIPESTATUS[0]} + set -e + write_agentic_result_json "$RESULT_DIR" + python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ + "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true + stop_gpu_monitor + exit "$REPLAY_RC" +fi + pip install -q datasets pandas run_benchmark_serving \ diff --git a/configs/agentic_slurm_matrix.json b/configs/agentic_slurm_matrix.json new file mode 100644 index 000000000..792e16f38 --- /dev/null +++ b/configs/agentic_slurm_matrix.json @@ -0,0 +1,73 @@ +{ + "defaults": { + "partition_env": "GMI_SLURM_PARTITION", + "account_env": "GMI_SLURM_ACCOUNT", + "results_root_env": "GMI_RESULTS_ROOT", + "container_image_env": "GMI_CONTAINER_IMAGE", + "model_path_env": "GMI_MODEL_PATH", + "time_limit": "04:00:00", + "cpus_per_task": 64, + "gpus_per_node": 8, + "trace_source": "semianalysisai/cc-traces-weka-042026", + "duration_seconds": 1800, + "arrival_modes": ["closed_loop"], + "cache_modes": ["engine_prefix_cache"], + "tenant_modes": ["single_tenant"], + "context_buckets": ["8k", "32k", "64k", "128k"], + "concurrency": [1, 2, 4, 8, 16, 32, 64] + }, + "hardware": { + "b200": { + "enabled": true, + "runner_script": "runners/launch_b200-nb.sh", + "runner_name": "b200-gmi-agentic", + "slurm_nodes": 1, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "sglang", + "topology": "single_node", + "tp": 8, + "ep": 8, + "dp_attention": true, + "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" + }, + "b300": { + "enabled": true, + 
"runner_script": "runners/launch_b300-nv.sh", + "runner_name": "b300-gmi-agentic", + "slurm_nodes": 1, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "sglang", + "topology": "single_node", + "tp": 8, + "ep": 8, + "dp_attention": true, + "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" + }, + "gb200": { + "enabled": true, + "runner_script": "runners/launch_gb200-nv.sh", + "runner_name": "gb200-gmi-agentic", + "slurm_nodes": 5, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "dynamo-vllm", + "topology": "multi_node_disagg", + "prefill": { + "num_worker": 1, + "tp": 8, + "ep": 8, + "dp_attention": true + }, + "decode": { + "num_worker": 1, + "tp": 8, + "ep": 1, + "dp_attention": false + }, + "config_file": "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" + } + } +} + diff --git a/configs/agentic_slurm_matrix.yaml b/configs/agentic_slurm_matrix.yaml new file mode 100644 index 000000000..7cfeb2be5 --- /dev/null +++ b/configs/agentic_slurm_matrix.yaml @@ -0,0 +1,67 @@ +defaults: + partition_env: GMI_SLURM_PARTITION + account_env: GMI_SLURM_ACCOUNT + results_root_env: GMI_RESULTS_ROOT + container_image_env: GMI_CONTAINER_IMAGE + model_path_env: GMI_MODEL_PATH + time_limit: "04:00:00" + cpus_per_task: 64 + gpus_per_node: 8 + trace_source: "semianalysisai/cc-traces-weka-042026" + duration_seconds: 1800 + arrival_modes: ["closed_loop"] + cache_modes: ["engine_prefix_cache"] + tenant_modes: ["single_tenant"] + context_buckets: ["8k", "32k", "64k", "128k"] + concurrency: [1, 2, 4, 8, 16, 32, 64] + +hardware: + b200: + enabled: true + runner_script: "runners/launch_b200-nb.sh" + runner_name: "b200-gmi-agentic" + slurm_nodes: 1 + model_prefix: "dsv4" + precision: "fp4" + framework: "sglang" + topology: "single_node" + tp: 8 + ep: 8 + dp_attention: true + script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" + + b300: + enabled: true + runner_script: "runners/launch_b300-nv.sh" + 
runner_name: "b300-gmi-agentic" + slurm_nodes: 1 + model_prefix: "dsv4" + precision: "fp4" + framework: "sglang" + topology: "single_node" + tp: 8 + ep: 8 + dp_attention: true + script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" + + gb200: + enabled: true + runner_script: "runners/launch_gb200-nv.sh" + runner_name: "gb200-gmi-agentic" + slurm_nodes: 5 + model_prefix: "dsv4" + precision: "fp4" + framework: "dynamo-vllm" + topology: "multi_node_disagg" + prefill: + num_worker: 1 + tp: 8 + ep: 8 + dp_attention: true + decode: + num_worker: 1 + tp: 8 + ep: 1 + dp_attention: false + config_file: "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" + diff --git a/scripts/run_agentic_slurm_matrix.py b/scripts/run_agentic_slurm_matrix.py new file mode 100755 index 000000000..de8351d5a --- /dev/null +++ b/scripts/run_agentic_slurm_matrix.py @@ -0,0 +1,381 @@ +#!/usr/bin/env python3 +"""Generate and optionally submit a GMI-facing agentic Slurm benchmark matrix. + +The runner is intentionally dry-run-first: it renders sbatch files, a matrix +plan, and an expected artifact contract without claiming GPU behavior. 
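The matrix configs above cross hardware with context buckets, concurrency, and arrival/cache/tenant axes, and the generator derives a short stable job id from a sha1 of each cell's key. A minimal sketch of that expansion, reduced to three axes to show the shape:

```python
import hashlib
from itertools import product

def expand_matrix(hardware, context_buckets, concurrency):
    # Minimal sketch of the sweep expansion in run_agentic_slurm_matrix.py:
    # each (hardware, context, concurrency) cell gets a 10-hex-char job id
    # taken from the sha1 of a "|"-joined key, so re-rendering the same
    # matrix always produces the same job directories.
    jobs = []
    for hw, ctx, conc in product(hardware, context_buckets, concurrency):
        key = "|".join([hw, str(ctx), str(conc)])
        jobs.append({
            "job_id": hashlib.sha1(key.encode()).hexdigest()[:10],
            "hardware": hw,
            "context_bucket": ctx,
            "concurrency": conc,
        })
    return jobs
```

The real generator also folds framework, arrival mode, cache mode, and tenant mode into the key; the stable hash is what lets a dry-run plan and a later submit agree on directory names.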
+""" + +from __future__ import annotations + +import argparse +import hashlib +import json +import os +import re +import subprocess +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +REPO_ROOT = Path(__file__).resolve().parents[1] +DEFAULT_CONFIG = REPO_ROOT / "configs" / "agentic_slurm_matrix.json" +SBATCH_TEMPLATE = REPO_ROOT / "scripts" / "slurm" / "agentic_job.sbatch.tmpl" + + +def _as_list(value: Any) -> list[Any]: + if value is None: + return [] + if isinstance(value, list): + return value + return [value] + + +def _parse_csv_ints(value: str | None) -> list[int] | None: + if not value: + return None + return [int(item.strip()) for item in value.split(",") if item.strip()] + + +def _parse_csv_strings(value: str | None) -> list[str] | None: + if not value: + return None + return [item.strip() for item in value.split(",") if item.strip()] + + +def _slug(value: str) -> str: + value = value.lower() + value = re.sub(r"[^a-z0-9._-]+", "-", value) + return value.strip("-") + + +def _load_config(path: Path) -> dict[str, Any]: + with path.open() as handle: + if path.suffix == ".json": + data = json.load(handle) + else: + try: + import yaml # type: ignore + except ModuleNotFoundError as exc: + raise RuntimeError( + f"{path} requires PyYAML. Use the default JSON config or install pyyaml." 
+ ) from exc + data = yaml.safe_load(handle) + if not isinstance(data, dict): + raise ValueError(f"{path} must contain a YAML mapping") + return data + + +@dataclass(frozen=True) +class MatrixJob: + job_id: str + hardware: str + framework: str + topology: str + context_bucket: str + concurrency: int + arrival_mode: str + cache_mode: str + tenant_mode: str + duration_seconds: int + runner_script: str + runner_name: str + slurm_nodes: int + gpus_per_node: int + cpus_per_task: int + time_limit: str + model_prefix: str + precision: str + tp: int + ep: int + dp_attention: bool + disagg: bool + config_file: str + is_multinode: bool + prefill_num_workers: int + prefill_tp: int + prefill_ep: int + prefill_dp_attention: bool + decode_num_workers: int + decode_tp: int + decode_ep: int + decode_dp_attention: bool + trace_source: str + + @property + def exp_name(self) -> str: + return ( + f"{self.model_prefix}_{self.hardware}_{self.framework}_" + f"{self.context_bucket}_conc{self.concurrency}" + ) + + @property + def result_filename(self) -> str: + return _slug( + f"agentic_{self.exp_name}_{self.arrival_mode}_" + f"{self.cache_mode}_{self.tenant_mode}" + ) + + def to_dict(self) -> dict[str, Any]: + return { + "job_id": self.job_id, + "hardware": self.hardware, + "framework": self.framework, + "topology": self.topology, + "context_bucket": self.context_bucket, + "concurrency": self.concurrency, + "arrival_mode": self.arrival_mode, + "cache_mode": self.cache_mode, + "tenant_mode": self.tenant_mode, + "duration_seconds": self.duration_seconds, + "runner_script": self.runner_script, + "runner_name": self.runner_name, + "slurm_nodes": self.slurm_nodes, + "gpus_per_node": self.gpus_per_node, + "cpus_per_task": self.cpus_per_task, + "time_limit": self.time_limit, + "model_prefix": self.model_prefix, + "precision": self.precision, + "tp": self.tp, + "ep": self.ep, + "dp_attention": self.dp_attention, + "disagg": self.disagg, + "config_file": self.config_file, + "is_multinode": 
self.is_multinode, + "trace_source": self.trace_source, + "result_filename": self.result_filename, + "exp_name": self.exp_name, + "prefill_num_workers": self.prefill_num_workers, + "prefill_tp": self.prefill_tp, + "prefill_ep": self.prefill_ep, + "prefill_dp_attention": self.prefill_dp_attention, + "decode_num_workers": self.decode_num_workers, + "decode_tp": self.decode_tp, + "decode_ep": self.decode_ep, + "decode_dp_attention": self.decode_dp_attention, + } + + +def expand_jobs(config: dict[str, Any], args: argparse.Namespace) -> list[MatrixJob]: + defaults = config.get("defaults", {}) + hardware_cfg = config.get("hardware", {}) + if not isinstance(hardware_cfg, dict): + raise ValueError("hardware must be a mapping") + + selected_hw = set(_parse_csv_strings(args.hardware) or hardware_cfg.keys()) + contexts = _parse_csv_strings(args.context_buckets) or _as_list(defaults.get("context_buckets")) + concurrencies = _parse_csv_ints(args.concurrency) or _as_list(defaults.get("concurrency")) + arrival_modes = _parse_csv_strings(args.arrival_modes) or _as_list(defaults.get("arrival_modes")) + cache_modes = _parse_csv_strings(args.cache_modes) or _as_list(defaults.get("cache_modes")) + tenant_modes = _parse_csv_strings(args.tenant_modes) or _as_list(defaults.get("tenant_modes")) + + jobs: list[MatrixJob] = [] + for hardware, hw in hardware_cfg.items(): + if hardware not in selected_hw or not hw.get("enabled", True): + continue + runner_script = REPO_ROOT / str(hw["runner_script"]) + if not runner_script.exists(): + raise FileNotFoundError(f"runner_script not found for {hardware}: {runner_script}") + script_expected = hw.get("script_expected") + if script_expected and not (REPO_ROOT / str(script_expected)).exists(): + raise FileNotFoundError(f"script_expected not found for {hardware}: {script_expected}") + + is_multinode = hw.get("topology") == "multi_node_disagg" + prefill = hw.get("prefill", {}) + decode = hw.get("decode", {}) + tp = int(hw.get("tp", prefill.get("tp", 
1))) + ep = int(hw.get("ep", prefill.get("ep", 1))) + dp_attention = bool(hw.get("dp_attention", prefill.get("dp_attention", False))) + + for context_bucket in contexts: + for concurrency in concurrencies: + for arrival_mode in arrival_modes: + for cache_mode in cache_modes: + for tenant_mode in tenant_modes: + key = "|".join( + [ + hardware, + str(hw["framework"]), + str(context_bucket), + str(concurrency), + str(arrival_mode), + str(cache_mode), + str(tenant_mode), + ] + ) + job_id = hashlib.sha1(key.encode()).hexdigest()[:10] + jobs.append( + MatrixJob( + job_id=job_id, + hardware=hardware, + framework=str(hw["framework"]), + topology=str(hw["topology"]), + context_bucket=str(context_bucket), + concurrency=int(concurrency), + arrival_mode=str(arrival_mode), + cache_mode=str(cache_mode), + tenant_mode=str(tenant_mode), + duration_seconds=int(args.duration or defaults.get("duration_seconds", 1800)), + runner_script=str(hw["runner_script"]), + runner_name=str(hw["runner_name"]), + slurm_nodes=int(hw.get("slurm_nodes", 1)), + gpus_per_node=int(defaults.get("gpus_per_node", 8)), + cpus_per_task=int(defaults.get("cpus_per_task", 64)), + time_limit=str(defaults.get("time_limit", "04:00:00")), + model_prefix=str(hw["model_prefix"]), + precision=str(hw["precision"]), + tp=tp, + ep=ep, + dp_attention=dp_attention, + disagg=is_multinode, + config_file=str(hw.get("config_file", "")), + is_multinode=is_multinode, + prefill_num_workers=int(prefill.get("num_worker", 0)), + prefill_tp=int(prefill.get("tp", 0)), + prefill_ep=int(prefill.get("ep", 0)), + prefill_dp_attention=bool(prefill.get("dp_attention", False)), + decode_num_workers=int(decode.get("num_worker", 0)), + decode_tp=int(decode.get("tp", 0)), + decode_ep=int(decode.get("ep", 0)), + decode_dp_attention=bool(decode.get("dp_attention", False)), + trace_source=str(defaults.get("trace_source", "")), + ) + ) + + if args.max_jobs is not None: + jobs = jobs[: args.max_jobs] + return jobs + + +def expected_paths(job: 
MatrixJob) -> list[str]: + return [ + f"{job.job_id}/{job.result_filename}.json", + f"{job.job_id}/results/benchmark.log", + f"{job.job_id}/results/benchmark_command.txt", + f"{job.job_id}/results/server.log", + f"{job.job_id}/results/trace_replay/detailed_results.csv", + f"{job.job_id}/results/trace_replay/debug_trace.jsonl", + f"{job.job_id}/preflight.log", + f"{job.job_id}/provenance_preflight.jsonl", + ] + + +def render_sbatch(job: MatrixJob, config: dict[str, Any], results_root: Path, dry_run_guard: bool) -> str: + defaults = config.get("defaults", {}) + template = SBATCH_TEMPLATE.read_text() + job_dir = results_root / job.job_id + values = { + **job.to_dict(), + "job_name": f"agentic-{job.hardware}-{job.job_id}", + "job_dir": str(job_dir), + "partition_env": defaults.get("partition_env", "GMI_SLURM_PARTITION"), + "account_env": defaults.get("account_env", "GMI_SLURM_ACCOUNT"), + "results_root_env": defaults.get("results_root_env", "GMI_RESULTS_ROOT"), + "container_image_env": defaults.get("container_image_env", "GMI_CONTAINER_IMAGE"), + "model_path_env": defaults.get("model_path_env", "GMI_MODEL_PATH"), + "dp_attention": str(job.dp_attention).lower(), + "disagg": str(job.disagg).lower(), + "is_multinode": "1" if job.is_multinode else "0", + "prefill_dp_attention": str(job.prefill_dp_attention).lower(), + "decode_dp_attention": str(job.decode_dp_attention).lower(), + "dry_run_guard": "1" if dry_run_guard else "0", + } + rendered = template + for key, value in values.items(): + rendered = rendered.replace("{" + key + "}", str(value)) + return rendered + + +def write_outputs(config: dict[str, Any], jobs: list[MatrixJob], results_root: Path, dry_run_guard: bool) -> None: + sbatch_dir = results_root / "sbatch" + sbatch_dir.mkdir(parents=True, exist_ok=True) + for job in jobs: + job_dir = results_root / job.job_id + job_dir.mkdir(parents=True, exist_ok=True) + sbatch_text = render_sbatch(job, config, results_root, dry_run_guard) + (sbatch_dir / 
f"{job.job_id}.sbatch").write_text(sbatch_text) + + plan = { + "scenario": "agentic-coding", + "total_jobs": len(jobs), + "jobs": [job.to_dict() for job in jobs], + } + (results_root / "matrix_plan.json").write_text(json.dumps(plan, indent=2) + "\n") + + contract = { + "scenario": "agentic-coding", + "total_jobs": len(jobs), + "per_job": [ + { + "job_id": job.job_id, + "result_filename": job.result_filename, + "expected_paths": expected_paths(job), + "required_before_claiming_success": [ + f"{job.job_id}/{job.result_filename}.json", + f"{job.job_id}/results/trace_replay/detailed_results.csv", + f"{job.job_id}/preflight.log", + f"{job.job_id}/provenance_preflight.jsonl", + ], + } + for job in jobs + ], + } + (results_root / "expected_artifact_contract.json").write_text(json.dumps(contract, indent=2) + "\n") + + +def submit_jobs(config: dict[str, Any], results_root: Path, jobs: list[MatrixJob]) -> None: + defaults = config.get("defaults", {}) + partition_env = defaults.get("partition_env", "GMI_SLURM_PARTITION") + account_env = defaults.get("account_env", "GMI_SLURM_ACCOUNT") + partition = os.environ.get(partition_env) + account = os.environ.get(account_env) + if not partition: + raise RuntimeError(f"{partition_env} must be set when --submit is used") + for job in jobs: + sbatch_path = results_root / "sbatch" / f"{job.job_id}.sbatch" + cmd = ["sbatch", "--partition", partition] + if account: + cmd.extend(["--account", account]) + cmd.append(str(sbatch_path)) + subprocess.run(cmd, check=True) + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--config", type=Path, default=DEFAULT_CONFIG) + parser.add_argument("--results-root", type=Path, default=Path(os.environ.get("GMI_RESULTS_ROOT", "agentic-slurm-results"))) + parser.add_argument("--hardware", help="Comma-separated hardware filter, e.g. 
b200,b300") + parser.add_argument("--context-buckets", help="Comma-separated context buckets") + parser.add_argument("--concurrency", help="Comma-separated concurrency values") + parser.add_argument("--arrival-modes", help="Comma-separated arrival modes") + parser.add_argument("--cache-modes", help="Comma-separated cache modes") + parser.add_argument("--tenant-modes", help="Comma-separated tenant modes") + parser.add_argument("--duration", type=int) + parser.add_argument("--max-jobs", type=int) + parser.add_argument("--dry-run", action="store_true", help="Render files only") + parser.add_argument("--submit", action="store_true", help="Submit rendered sbatch jobs") + return parser + + +def main(argv: list[str] | None = None) -> int: + parser = build_parser() + args = parser.parse_args(argv) + if args.submit and args.dry_run: + parser.error("--submit and --dry-run are mutually exclusive") + + config = _load_config(args.config) + jobs = expand_jobs(config, args) + args.results_root.mkdir(parents=True, exist_ok=True) + write_outputs(config, jobs, args.results_root, dry_run_guard=args.dry_run or not args.submit) + + if args.submit: + submit_jobs(config, args.results_root, jobs) + + print(f"Wrote {len(jobs)} agentic Slurm jobs to {args.results_root}") + print(f"Matrix plan: {args.results_root / 'matrix_plan.json'}") + print(f"Artifact contract: {args.results_root / 'expected_artifact_contract.json'}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/slurm/agentic_job.sbatch.tmpl b/scripts/slurm/agentic_job.sbatch.tmpl new file mode 100644 index 000000000..b3b052c38 --- /dev/null +++ b/scripts/slurm/agentic_job.sbatch.tmpl @@ -0,0 +1,89 @@ +#!/usr/bin/env bash +#SBATCH --job-name={job_name} +#SBATCH --nodes={slurm_nodes} +#SBATCH --gpus-per-node={gpus_per_node} +#SBATCH --cpus-per-task={cpus_per_task} +#SBATCH --time={time_limit} +#SBATCH --output={job_dir}/slurm-%j.out +#SBATCH --error={job_dir}/slurm-%j.err + +set -euo pipefail + +mkdir 
-p "{job_dir}" + +required_env=( + "{partition_env}" + "{results_root_env}" + "{container_image_env}" + "{model_path_env}" +) +for env_name in "${required_env[@]}"; do + if [[ -z "${!env_name:-}" ]]; then + echo "FATAL: required environment variable ${env_name} is not set" >&2 + exit 2 + fi +done + +export IMAGE="${{container_image_env}}" +export MODEL="${{model_path_env}}" +export GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-$(pwd)}" +export MODEL_PREFIX="{model_prefix}" +export PRECISION="{precision}" +export FRAMEWORK="{framework}" +export RUNNER_NAME="{runner_name}" +export RUNNER_TYPE="{hardware}" +export EXP_NAME="{exp_name}" +export RESULT_FILENAME="{result_filename}" +export RESULT_DIR="{job_dir}/results" +export AGENTIC_OUTPUT_DIR="{job_dir}" +export SCENARIO_TYPE="agentic-coding" +export SCENARIO_SUBDIR="agentic/" +export IS_AGENTIC="1" +export CONC="{concurrency}" +export DURATION="{duration_seconds}" +export TRACE_SOURCE="{trace_source}" +export AGENTIC_CONTEXT_BUCKET="{context_bucket}" +export AGENTIC_ARRIVAL_MODE="{arrival_mode}" +export AGENTIC_CACHE_MODE="{cache_mode}" +export AGENTIC_TENANT_MODE="{tenant_mode}" +export TP="{tp}" +export EP_SIZE="{ep}" +export DP_ATTENTION="{dp_attention}" +export SPEC_DECODING="none" +export DISAGG="{disagg}" +export CONFIG_FILE="{config_file}" +export IS_MULTINODE="{is_multinode}" +export PREFILL_NUM_WORKERS="{prefill_num_workers}" +export PREFILL_TP="{prefill_tp}" +export PREFILL_EP="{prefill_ep}" +export PREFILL_DP_ATTN="{prefill_dp_attention}" +export DECODE_NUM_WORKERS="{decode_num_workers}" +export DECODE_TP="{decode_tp}" +export DECODE_EP="{decode_ep}" +export DECODE_DP_ATTN="{decode_dp_attention}" + +cat > "{job_dir}/provenance_preflight.jsonl" < "{job_dir}/preflight.log" 2>&1 + +if [[ "{dry_run_guard}" == "1" ]]; then + echo "Dry-run sbatch rendered successfully; not executing runner." 
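The sbatch template uses plain `{key}` placeholders that `render_sbatch` fills with repeated `str.replace`. Shell indirection such as `${{container_image_env}}` survives rendering because only the inner `{key}` span is substituted. A minimal sketch of that renderer:

```python
def render_template(template: str, values: dict) -> str:
    # Same substitution scheme as render_sbatch in run_agentic_slurm_matrix.py:
    # every "{key}" is replaced literally, so a template fragment such as
    # ${{container_image_env}} renders to a shell indirection like
    # ${GMI_CONTAINER_IMAGE} -- only the inner {key} braces are consumed.
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{" + key + "}", str(value))
    return rendered
```

This is deliberately simpler than `str.format`: unmatched braces in the embedded shell script pass through untouched instead of raising `KeyError`.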
+ exit 0 +fi + +mkdir -p "$RESULT_DIR" +bash "{runner_script}" diff --git a/utils/test_agentic_slurm_matrix.py b/utils/test_agentic_slurm_matrix.py new file mode 100644 index 000000000..40c332a63 --- /dev/null +++ b/utils/test_agentic_slurm_matrix.py @@ -0,0 +1,84 @@ +import importlib.util +import json +import sys +from pathlib import Path + + +REPO_ROOT = Path(__file__).resolve().parents[1] +SCRIPT = REPO_ROOT / "scripts" / "run_agentic_slurm_matrix.py" + + +def load_runner(): + spec = importlib.util.spec_from_file_location("run_agentic_slurm_matrix", SCRIPT) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def test_agentic_slurm_dry_run_writes_plan_contract_and_sbatch(tmp_path): + runner = load_runner() + rc = runner.main( + [ + "--dry-run", + "--results-root", + str(tmp_path), + "--hardware", + "b200", + "--context-buckets", + "8k", + "--concurrency", + "1,2", + ] + ) + assert rc == 0 + + plan = json.loads((tmp_path / "matrix_plan.json").read_text()) + assert plan["scenario"] == "agentic-coding" + assert plan["total_jobs"] == 2 + assert {job["concurrency"] for job in plan["jobs"]} == {1, 2} + assert all(job["model_prefix"] == "dsv4" for job in plan["jobs"]) + + contract = json.loads((tmp_path / "expected_artifact_contract.json").read_text()) + assert contract["total_jobs"] == 2 + required = contract["per_job"][0]["required_before_claiming_success"] + assert any(path.endswith("/trace_replay/detailed_results.csv") for path in required) + assert any(path.endswith("/provenance_preflight.jsonl") for path in required) + + sbatch_files = sorted((tmp_path / "sbatch").glob("*.sbatch")) + assert len(sbatch_files) == 2 + rendered = sbatch_files[0].read_text() + assert 'SCENARIO_TYPE="agentic-coding"' in rendered + assert 'MODEL_PREFIX="dsv4"' in rendered + assert 'TRACE_SOURCE="semianalysisai/cc-traces-weka-042026"' in rendered + assert "nvidia-smi topo 
-m" in rendered + assert "all_reduce_perf" in rendered + assert "Dry-run sbatch rendered successfully" in rendered + + +def test_agentic_slurm_matrix_can_filter_gb200_multinode(tmp_path): + runner = load_runner() + rc = runner.main( + [ + "--dry-run", + "--results-root", + str(tmp_path), + "--hardware", + "gb200", + "--context-buckets", + "8k", + "--concurrency", + "1", + "--max-jobs", + "1", + ] + ) + assert rc == 0 + + plan = json.loads((tmp_path / "matrix_plan.json").read_text()) + assert plan["total_jobs"] == 1 + job = plan["jobs"][0] + assert job["hardware"] == "gb200" + assert job["is_multinode"] is True + assert job["config_file"].endswith("disagg-gb200-low-latency.yaml") From 7d7d2624a61c118737557aa66df97af07be095d2 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:36:36 -0700 Subject: [PATCH 4/5] Revert "feat(agentic): add GMI DSV4 Slurm harness [skip-sweep]" This reverts commit bfea80549194a4649124f9df73c46f5682a33e40. 
--- .github/configs/nvidia-master.yaml | 8 - AGENTIC_TRUTH_MATRIX.md | 20 +- benchmarks/benchmark_lib.sh | 2 +- .../agentic/dsv4_fp4_b200_sglang.sh | 17 - .../agentic/dsv4_fp4_b300_sglang.sh | 17 - benchmarks/single_node/dsv4_fp4_b200.sh | 19 - .../single_node/dsv4_fp4_b300_sglang.sh | 19 - configs/agentic_slurm_matrix.json | 73 ---- configs/agentic_slurm_matrix.yaml | 67 --- scripts/run_agentic_slurm_matrix.py | 381 ------------------ scripts/slurm/agentic_job.sbatch.tmpl | 89 ---- utils/test_agentic_slurm_matrix.py | 84 ---- 12 files changed, 11 insertions(+), 785 deletions(-) delete mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh delete mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh delete mode 100644 configs/agentic_slurm_matrix.json delete mode 100644 configs/agentic_slurm_matrix.yaml delete mode 100755 scripts/run_agentic_slurm_matrix.py delete mode 100644 scripts/slurm/agentic_job.sbatch.tmpl delete mode 100644 utils/test_agentic_slurm_matrix.py diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index 3b31a65e8..38d1101f3 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -1732,10 +1732,6 @@ dsv4-fp4-b200-sglang: - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } # DP-attention (DP_ATTENTION=true) — max-throughput CONC range - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 } - agentic-coding: - - duration: 1800 - search-space: - - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } dsv4-fp4-b200-vllm: image: vllm/vllm-openai:v0.20.0-cu130 @@ -1955,10 +1951,6 @@ dsv4-fp4-b300-sglang: - { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 2048 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 } - agentic-coding: - - duration: 1800 - search-space: - - { tp: 8, ep: 8, dp-attn: true, offloading: 
none, conc-list: [1, 2, 4, 8, 16, 32, 64] } # DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is # selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md index b23f598b1..fdd91f139 100644 --- a/AGENTIC_TRUTH_MATRIX.md +++ b/AGENTIC_TRUTH_MATRIX.md @@ -124,8 +124,8 @@ The current use case is: | DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | | DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | | DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 B200/B300 SGLang agentic trace replay | Wired, needs live validation | Added DSV4 single-node agentic wrappers and conservative `agentic-coding` config rows. | -| DeepSeek-V4 GB200 agentic trace replay | Dry-run harness only | Portable Slurm matrix can render GB200 agentic jobs, but live srt-slurm recipe behavior is still unproven. | +| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | +| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | | LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | ## What A GMI/Neocloud Platform Engineer Actually Cares About @@ -161,11 +161,11 @@ Legend: |---|---:|---|---| | Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | | WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Run live DSV4 B200/B300 validation. 
| +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | | Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B200 Slurm and attach artifacts. | -| DeepSeek-V4 B300 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B300 Slurm and attach artifacts. | -| DeepSeek-V4 GB200 agentic | Partial | Portable matrix renders GB200 jobs; live srt-slurm behavior remains unproven. | Add/validate GB200 agentic srt-slurm recipe on real hardware. | +| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | +| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. | +| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | | B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | | vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | | Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | @@ -205,10 +205,10 @@ This is the minimum useful matrix for a GMI cloud engineer evaluating long-conte | Priority | Build item | Acceptance criteria | |---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300 | Implemented; requires live run artifacts. 
| -| P0 | DSV4 agentic launchers for B200/B300 | Implemented; reuses existing SGLang server recipes and switches the client to WEKA trace replay. | -| P0 | Portable Slurm matrix runner | Implemented for dry-run/submit; no hardcoded cluster IDs; all cluster settings via env/JSON/YAML. | -| P0 | Artifact contract | Implemented expected-path manifest; live runs must still produce the artifacts before claims. | +| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | +| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | +| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | +| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | | P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | | P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | | P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. 
| diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh index 994111bad..4c0c8642e 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -892,7 +892,7 @@ ensure_hf_cli() { } resolve_trace_source() { - local dataset="${TRACE_SOURCE:-semianalysisai/cc-traces-weka-042026}" + local dataset="semianalysisai/cc-traces-weka-042026" TRACE_SOURCE_FLAG="--hf-dataset $dataset" echo "Loading traces from Hugging Face dataset: $dataset" # Pre-download the dataset into the shared HF_HUB_CACHE (same mount used diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh deleted file mode 100755 index dcac3bdb3..000000000 --- a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B200 with SGLang. -# The server recipe lives in ../dsv4_fp4_b200.sh; AGENTIC_MODE switches the -# post-ready client from fixed random prompts to WEKA trace replay. - -export AGENTIC_MODE=1 -export ISL="${ISL:-8192}" -export OSL="${OSL:-1024}" -export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" -export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b200_sglang}" - -REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" -export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" - -exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b200.sh" diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh deleted file mode 100755 index a5dc2387c..000000000 --- a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B300 with SGLang. 
-# The server recipe lives in ../dsv4_fp4_b300_sglang.sh; AGENTIC_MODE switches
-# the post-ready client from fixed random prompts to WEKA trace replay.
-
-export AGENTIC_MODE=1
-export ISL="${ISL:-8192}"
-export OSL="${OSL:-1024}"
-export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}"
-export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b300_sglang}"
-
-REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
-export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}"
-
-exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b300_sglang.sh"
diff --git a/benchmarks/single_node/dsv4_fp4_b200.sh b/benchmarks/single_node/dsv4_fp4_b200.sh
index 6577b7791..df1259deb 100755
--- a/benchmarks/single_node/dsv4_fp4_b200.sh
+++ b/benchmarks/single_node/dsv4_fp4_b200.sh
@@ -100,25 +100,6 @@ SERVER_PID=$!
 
 wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
 
-if [ "${AGENTIC_MODE:-0}" = "1" ]; then
-  RESULT_DIR="${RESULT_DIR:-$PWD/results}"
-  mkdir -p "$RESULT_DIR"
-  cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true
-  resolve_trace_source
-  install_agentic_deps
-  build_replay_cmd "$RESULT_DIR"
-  echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt"
-  set +e
-  $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log"
-  REPLAY_RC=${PIPESTATUS[0]}
-  set -e
-  write_agentic_result_json "$RESULT_DIR"
-  python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \
-    "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true
-  stop_gpu_monitor
-  exit "$REPLAY_RC"
-fi
-
 pip install -q datasets pandas
 
 run_benchmark_serving \
diff --git a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
index 2a053ae8f..8f43ea8a3 100755
--- a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
+++ b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
@@ -186,25 +186,6 @@ SERVER_PID=$!
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" -if [ "${AGENTIC_MODE:-0}" = "1" ]; then - RESULT_DIR="${RESULT_DIR:-$PWD/results}" - mkdir -p "$RESULT_DIR" - cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true - resolve_trace_source - install_agentic_deps - build_replay_cmd "$RESULT_DIR" - echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" - set +e - $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" - REPLAY_RC=${PIPESTATUS[0]} - set -e - write_agentic_result_json "$RESULT_DIR" - python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ - "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true - stop_gpu_monitor - exit "$REPLAY_RC" -fi - pip install -q datasets pandas run_benchmark_serving \ diff --git a/configs/agentic_slurm_matrix.json b/configs/agentic_slurm_matrix.json deleted file mode 100644 index 792e16f38..000000000 --- a/configs/agentic_slurm_matrix.json +++ /dev/null @@ -1,73 +0,0 @@ -{ - "defaults": { - "partition_env": "GMI_SLURM_PARTITION", - "account_env": "GMI_SLURM_ACCOUNT", - "results_root_env": "GMI_RESULTS_ROOT", - "container_image_env": "GMI_CONTAINER_IMAGE", - "model_path_env": "GMI_MODEL_PATH", - "time_limit": "04:00:00", - "cpus_per_task": 64, - "gpus_per_node": 8, - "trace_source": "semianalysisai/cc-traces-weka-042026", - "duration_seconds": 1800, - "arrival_modes": ["closed_loop"], - "cache_modes": ["engine_prefix_cache"], - "tenant_modes": ["single_tenant"], - "context_buckets": ["8k", "32k", "64k", "128k"], - "concurrency": [1, 2, 4, 8, 16, 32, 64] - }, - "hardware": { - "b200": { - "enabled": true, - "runner_script": "runners/launch_b200-nb.sh", - "runner_name": "b200-gmi-agentic", - "slurm_nodes": 1, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "sglang", - "topology": "single_node", - "tp": 8, - "ep": 8, - "dp_attention": true, - "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" - }, - "b300": { - "enabled": true, - 
"runner_script": "runners/launch_b300-nv.sh", - "runner_name": "b300-gmi-agentic", - "slurm_nodes": 1, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "sglang", - "topology": "single_node", - "tp": 8, - "ep": 8, - "dp_attention": true, - "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" - }, - "gb200": { - "enabled": true, - "runner_script": "runners/launch_gb200-nv.sh", - "runner_name": "gb200-gmi-agentic", - "slurm_nodes": 5, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "dynamo-vllm", - "topology": "multi_node_disagg", - "prefill": { - "num_worker": 1, - "tp": 8, - "ep": 8, - "dp_attention": true - }, - "decode": { - "num_worker": 1, - "tp": 8, - "ep": 1, - "dp_attention": false - }, - "config_file": "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" - } - } -} - diff --git a/configs/agentic_slurm_matrix.yaml b/configs/agentic_slurm_matrix.yaml deleted file mode 100644 index 7cfeb2be5..000000000 --- a/configs/agentic_slurm_matrix.yaml +++ /dev/null @@ -1,67 +0,0 @@ -defaults: - partition_env: GMI_SLURM_PARTITION - account_env: GMI_SLURM_ACCOUNT - results_root_env: GMI_RESULTS_ROOT - container_image_env: GMI_CONTAINER_IMAGE - model_path_env: GMI_MODEL_PATH - time_limit: "04:00:00" - cpus_per_task: 64 - gpus_per_node: 8 - trace_source: "semianalysisai/cc-traces-weka-042026" - duration_seconds: 1800 - arrival_modes: ["closed_loop"] - cache_modes: ["engine_prefix_cache"] - tenant_modes: ["single_tenant"] - context_buckets: ["8k", "32k", "64k", "128k"] - concurrency: [1, 2, 4, 8, 16, 32, 64] - -hardware: - b200: - enabled: true - runner_script: "runners/launch_b200-nb.sh" - runner_name: "b200-gmi-agentic" - slurm_nodes: 1 - model_prefix: "dsv4" - precision: "fp4" - framework: "sglang" - topology: "single_node" - tp: 8 - ep: 8 - dp_attention: true - script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" - - b300: - enabled: true - runner_script: "runners/launch_b300-nv.sh" - 
runner_name: "b300-gmi-agentic" - slurm_nodes: 1 - model_prefix: "dsv4" - precision: "fp4" - framework: "sglang" - topology: "single_node" - tp: 8 - ep: 8 - dp_attention: true - script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" - - gb200: - enabled: true - runner_script: "runners/launch_gb200-nv.sh" - runner_name: "gb200-gmi-agentic" - slurm_nodes: 5 - model_prefix: "dsv4" - precision: "fp4" - framework: "dynamo-vllm" - topology: "multi_node_disagg" - prefill: - num_worker: 1 - tp: 8 - ep: 8 - dp_attention: true - decode: - num_worker: 1 - tp: 8 - ep: 1 - dp_attention: false - config_file: "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" - diff --git a/scripts/run_agentic_slurm_matrix.py b/scripts/run_agentic_slurm_matrix.py deleted file mode 100755 index de8351d5a..000000000 --- a/scripts/run_agentic_slurm_matrix.py +++ /dev/null @@ -1,381 +0,0 @@ -#!/usr/bin/env python3 -"""Generate and optionally submit a GMI-facing agentic Slurm benchmark matrix. - -The runner is intentionally dry-run-first: it renders sbatch files, a matrix -plan, and an expected artifact contract without claiming GPU behavior. 
-"""
-
-from __future__ import annotations
-
-import argparse
-import hashlib
-import json
-import os
-import re
-import subprocess
-import sys
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-REPO_ROOT = Path(__file__).resolve().parents[1]
-DEFAULT_CONFIG = REPO_ROOT / "configs" / "agentic_slurm_matrix.json"
-SBATCH_TEMPLATE = REPO_ROOT / "scripts" / "slurm" / "agentic_job.sbatch.tmpl"
-
-
-def _as_list(value: Any) -> list[Any]:
-    if value is None:
-        return []
-    if isinstance(value, list):
-        return value
-    return [value]
-
-
-def _parse_csv_ints(value: str | None) -> list[int] | None:
-    if not value:
-        return None
-    return [int(item.strip()) for item in value.split(",") if item.strip()]
-
-
-def _parse_csv_strings(value: str | None) -> list[str] | None:
-    if not value:
-        return None
-    return [item.strip() for item in value.split(",") if item.strip()]
-
-
-def _slug(value: str) -> str:
-    value = value.lower()
-    value = re.sub(r"[^a-z0-9._-]+", "-", value)
-    return value.strip("-")
-
-
-def _load_config(path: Path) -> dict[str, Any]:
-    with path.open() as handle:
-        if path.suffix == ".json":
-            data = json.load(handle)
-        else:
-            try:
-                import yaml  # type: ignore
-            except ModuleNotFoundError as exc:
-                raise RuntimeError(
-                    f"{path} requires PyYAML. Use the default JSON config or install pyyaml."
- ) from exc - data = yaml.safe_load(handle) - if not isinstance(data, dict): - raise ValueError(f"{path} must contain a YAML mapping") - return data - - -@dataclass(frozen=True) -class MatrixJob: - job_id: str - hardware: str - framework: str - topology: str - context_bucket: str - concurrency: int - arrival_mode: str - cache_mode: str - tenant_mode: str - duration_seconds: int - runner_script: str - runner_name: str - slurm_nodes: int - gpus_per_node: int - cpus_per_task: int - time_limit: str - model_prefix: str - precision: str - tp: int - ep: int - dp_attention: bool - disagg: bool - config_file: str - is_multinode: bool - prefill_num_workers: int - prefill_tp: int - prefill_ep: int - prefill_dp_attention: bool - decode_num_workers: int - decode_tp: int - decode_ep: int - decode_dp_attention: bool - trace_source: str - - @property - def exp_name(self) -> str: - return ( - f"{self.model_prefix}_{self.hardware}_{self.framework}_" - f"{self.context_bucket}_conc{self.concurrency}" - ) - - @property - def result_filename(self) -> str: - return _slug( - f"agentic_{self.exp_name}_{self.arrival_mode}_" - f"{self.cache_mode}_{self.tenant_mode}" - ) - - def to_dict(self) -> dict[str, Any]: - return { - "job_id": self.job_id, - "hardware": self.hardware, - "framework": self.framework, - "topology": self.topology, - "context_bucket": self.context_bucket, - "concurrency": self.concurrency, - "arrival_mode": self.arrival_mode, - "cache_mode": self.cache_mode, - "tenant_mode": self.tenant_mode, - "duration_seconds": self.duration_seconds, - "runner_script": self.runner_script, - "runner_name": self.runner_name, - "slurm_nodes": self.slurm_nodes, - "gpus_per_node": self.gpus_per_node, - "cpus_per_task": self.cpus_per_task, - "time_limit": self.time_limit, - "model_prefix": self.model_prefix, - "precision": self.precision, - "tp": self.tp, - "ep": self.ep, - "dp_attention": self.dp_attention, - "disagg": self.disagg, - "config_file": self.config_file, - "is_multinode": 
self.is_multinode, - "trace_source": self.trace_source, - "result_filename": self.result_filename, - "exp_name": self.exp_name, - "prefill_num_workers": self.prefill_num_workers, - "prefill_tp": self.prefill_tp, - "prefill_ep": self.prefill_ep, - "prefill_dp_attention": self.prefill_dp_attention, - "decode_num_workers": self.decode_num_workers, - "decode_tp": self.decode_tp, - "decode_ep": self.decode_ep, - "decode_dp_attention": self.decode_dp_attention, - } - - -def expand_jobs(config: dict[str, Any], args: argparse.Namespace) -> list[MatrixJob]: - defaults = config.get("defaults", {}) - hardware_cfg = config.get("hardware", {}) - if not isinstance(hardware_cfg, dict): - raise ValueError("hardware must be a mapping") - - selected_hw = set(_parse_csv_strings(args.hardware) or hardware_cfg.keys()) - contexts = _parse_csv_strings(args.context_buckets) or _as_list(defaults.get("context_buckets")) - concurrencies = _parse_csv_ints(args.concurrency) or _as_list(defaults.get("concurrency")) - arrival_modes = _parse_csv_strings(args.arrival_modes) or _as_list(defaults.get("arrival_modes")) - cache_modes = _parse_csv_strings(args.cache_modes) or _as_list(defaults.get("cache_modes")) - tenant_modes = _parse_csv_strings(args.tenant_modes) or _as_list(defaults.get("tenant_modes")) - - jobs: list[MatrixJob] = [] - for hardware, hw in hardware_cfg.items(): - if hardware not in selected_hw or not hw.get("enabled", True): - continue - runner_script = REPO_ROOT / str(hw["runner_script"]) - if not runner_script.exists(): - raise FileNotFoundError(f"runner_script not found for {hardware}: {runner_script}") - script_expected = hw.get("script_expected") - if script_expected and not (REPO_ROOT / str(script_expected)).exists(): - raise FileNotFoundError(f"script_expected not found for {hardware}: {script_expected}") - - is_multinode = hw.get("topology") == "multi_node_disagg" - prefill = hw.get("prefill", {}) - decode = hw.get("decode", {}) - tp = int(hw.get("tp", prefill.get("tp", 
1))) - ep = int(hw.get("ep", prefill.get("ep", 1))) - dp_attention = bool(hw.get("dp_attention", prefill.get("dp_attention", False))) - - for context_bucket in contexts: - for concurrency in concurrencies: - for arrival_mode in arrival_modes: - for cache_mode in cache_modes: - for tenant_mode in tenant_modes: - key = "|".join( - [ - hardware, - str(hw["framework"]), - str(context_bucket), - str(concurrency), - str(arrival_mode), - str(cache_mode), - str(tenant_mode), - ] - ) - job_id = hashlib.sha1(key.encode()).hexdigest()[:10] - jobs.append( - MatrixJob( - job_id=job_id, - hardware=hardware, - framework=str(hw["framework"]), - topology=str(hw["topology"]), - context_bucket=str(context_bucket), - concurrency=int(concurrency), - arrival_mode=str(arrival_mode), - cache_mode=str(cache_mode), - tenant_mode=str(tenant_mode), - duration_seconds=int(args.duration or defaults.get("duration_seconds", 1800)), - runner_script=str(hw["runner_script"]), - runner_name=str(hw["runner_name"]), - slurm_nodes=int(hw.get("slurm_nodes", 1)), - gpus_per_node=int(defaults.get("gpus_per_node", 8)), - cpus_per_task=int(defaults.get("cpus_per_task", 64)), - time_limit=str(defaults.get("time_limit", "04:00:00")), - model_prefix=str(hw["model_prefix"]), - precision=str(hw["precision"]), - tp=tp, - ep=ep, - dp_attention=dp_attention, - disagg=is_multinode, - config_file=str(hw.get("config_file", "")), - is_multinode=is_multinode, - prefill_num_workers=int(prefill.get("num_worker", 0)), - prefill_tp=int(prefill.get("tp", 0)), - prefill_ep=int(prefill.get("ep", 0)), - prefill_dp_attention=bool(prefill.get("dp_attention", False)), - decode_num_workers=int(decode.get("num_worker", 0)), - decode_tp=int(decode.get("tp", 0)), - decode_ep=int(decode.get("ep", 0)), - decode_dp_attention=bool(decode.get("dp_attention", False)), - trace_source=str(defaults.get("trace_source", "")), - ) - ) - - if args.max_jobs is not None: - jobs = jobs[: args.max_jobs] - return jobs - - -def expected_paths(job: 
MatrixJob) -> list[str]: - return [ - f"{job.job_id}/{job.result_filename}.json", - f"{job.job_id}/results/benchmark.log", - f"{job.job_id}/results/benchmark_command.txt", - f"{job.job_id}/results/server.log", - f"{job.job_id}/results/trace_replay/detailed_results.csv", - f"{job.job_id}/results/trace_replay/debug_trace.jsonl", - f"{job.job_id}/preflight.log", - f"{job.job_id}/provenance_preflight.jsonl", - ] - - -def render_sbatch(job: MatrixJob, config: dict[str, Any], results_root: Path, dry_run_guard: bool) -> str: - defaults = config.get("defaults", {}) - template = SBATCH_TEMPLATE.read_text() - job_dir = results_root / job.job_id - values = { - **job.to_dict(), - "job_name": f"agentic-{job.hardware}-{job.job_id}", - "job_dir": str(job_dir), - "partition_env": defaults.get("partition_env", "GMI_SLURM_PARTITION"), - "account_env": defaults.get("account_env", "GMI_SLURM_ACCOUNT"), - "results_root_env": defaults.get("results_root_env", "GMI_RESULTS_ROOT"), - "container_image_env": defaults.get("container_image_env", "GMI_CONTAINER_IMAGE"), - "model_path_env": defaults.get("model_path_env", "GMI_MODEL_PATH"), - "dp_attention": str(job.dp_attention).lower(), - "disagg": str(job.disagg).lower(), - "is_multinode": "1" if job.is_multinode else "0", - "prefill_dp_attention": str(job.prefill_dp_attention).lower(), - "decode_dp_attention": str(job.decode_dp_attention).lower(), - "dry_run_guard": "1" if dry_run_guard else "0", - } - rendered = template - for key, value in values.items(): - rendered = rendered.replace("{" + key + "}", str(value)) - return rendered - - -def write_outputs(config: dict[str, Any], jobs: list[MatrixJob], results_root: Path, dry_run_guard: bool) -> None: - sbatch_dir = results_root / "sbatch" - sbatch_dir.mkdir(parents=True, exist_ok=True) - for job in jobs: - job_dir = results_root / job.job_id - job_dir.mkdir(parents=True, exist_ok=True) - sbatch_text = render_sbatch(job, config, results_root, dry_run_guard) - (sbatch_dir / 
f"{job.job_id}.sbatch").write_text(sbatch_text) - - plan = { - "scenario": "agentic-coding", - "total_jobs": len(jobs), - "jobs": [job.to_dict() for job in jobs], - } - (results_root / "matrix_plan.json").write_text(json.dumps(plan, indent=2) + "\n") - - contract = { - "scenario": "agentic-coding", - "total_jobs": len(jobs), - "per_job": [ - { - "job_id": job.job_id, - "result_filename": job.result_filename, - "expected_paths": expected_paths(job), - "required_before_claiming_success": [ - f"{job.job_id}/{job.result_filename}.json", - f"{job.job_id}/results/trace_replay/detailed_results.csv", - f"{job.job_id}/preflight.log", - f"{job.job_id}/provenance_preflight.jsonl", - ], - } - for job in jobs - ], - } - (results_root / "expected_artifact_contract.json").write_text(json.dumps(contract, indent=2) + "\n") - - -def submit_jobs(config: dict[str, Any], results_root: Path, jobs: list[MatrixJob]) -> None: - defaults = config.get("defaults", {}) - partition_env = defaults.get("partition_env", "GMI_SLURM_PARTITION") - account_env = defaults.get("account_env", "GMI_SLURM_ACCOUNT") - partition = os.environ.get(partition_env) - account = os.environ.get(account_env) - if not partition: - raise RuntimeError(f"{partition_env} must be set when --submit is used") - for job in jobs: - sbatch_path = results_root / "sbatch" / f"{job.job_id}.sbatch" - cmd = ["sbatch", "--partition", partition] - if account: - cmd.extend(["--account", account]) - cmd.append(str(sbatch_path)) - subprocess.run(cmd, check=True) - - -def build_parser() -> argparse.ArgumentParser: - parser = argparse.ArgumentParser(description=__doc__) - parser.add_argument("--config", type=Path, default=DEFAULT_CONFIG) - parser.add_argument("--results-root", type=Path, default=Path(os.environ.get("GMI_RESULTS_ROOT", "agentic-slurm-results"))) - parser.add_argument("--hardware", help="Comma-separated hardware filter, e.g. 
b200,b300") - parser.add_argument("--context-buckets", help="Comma-separated context buckets") - parser.add_argument("--concurrency", help="Comma-separated concurrency values") - parser.add_argument("--arrival-modes", help="Comma-separated arrival modes") - parser.add_argument("--cache-modes", help="Comma-separated cache modes") - parser.add_argument("--tenant-modes", help="Comma-separated tenant modes") - parser.add_argument("--duration", type=int) - parser.add_argument("--max-jobs", type=int) - parser.add_argument("--dry-run", action="store_true", help="Render files only") - parser.add_argument("--submit", action="store_true", help="Submit rendered sbatch jobs") - return parser - - -def main(argv: list[str] | None = None) -> int: - parser = build_parser() - args = parser.parse_args(argv) - if args.submit and args.dry_run: - parser.error("--submit and --dry-run are mutually exclusive") - - config = _load_config(args.config) - jobs = expand_jobs(config, args) - args.results_root.mkdir(parents=True, exist_ok=True) - write_outputs(config, jobs, args.results_root, dry_run_guard=args.dry_run or not args.submit) - - if args.submit: - submit_jobs(config, args.results_root, jobs) - - print(f"Wrote {len(jobs)} agentic Slurm jobs to {args.results_root}") - print(f"Matrix plan: {args.results_root / 'matrix_plan.json'}") - print(f"Artifact contract: {args.results_root / 'expected_artifact_contract.json'}") - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/scripts/slurm/agentic_job.sbatch.tmpl b/scripts/slurm/agentic_job.sbatch.tmpl deleted file mode 100644 index b3b052c38..000000000 --- a/scripts/slurm/agentic_job.sbatch.tmpl +++ /dev/null @@ -1,89 +0,0 @@ -#!/usr/bin/env bash -#SBATCH --job-name={job_name} -#SBATCH --nodes={slurm_nodes} -#SBATCH --gpus-per-node={gpus_per_node} -#SBATCH --cpus-per-task={cpus_per_task} -#SBATCH --time={time_limit} -#SBATCH --output={job_dir}/slurm-%j.out -#SBATCH --error={job_dir}/slurm-%j.err - -set -euo pipefail - 
-mkdir -p "{job_dir}"
-
-required_env=(
-  "{partition_env}"
-  "{results_root_env}"
-  "{container_image_env}"
-  "{model_path_env}"
-)
-for env_name in "${required_env[@]}"; do
-  if [[ -z "${!env_name:-}" ]]; then
-    echo "FATAL: required environment variable ${env_name} is not set" >&2
-    exit 2
-  fi
-done
-
-export IMAGE="${{container_image_env}}"
-export MODEL="${{model_path_env}}"
-export GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-$(pwd)}"
-export MODEL_PREFIX="{model_prefix}"
-export PRECISION="{precision}"
-export FRAMEWORK="{framework}"
-export RUNNER_NAME="{runner_name}"
-export RUNNER_TYPE="{hardware}"
-export EXP_NAME="{exp_name}"
-export RESULT_FILENAME="{result_filename}"
-export RESULT_DIR="{job_dir}/results"
-export AGENTIC_OUTPUT_DIR="{job_dir}"
-export SCENARIO_TYPE="agentic-coding"
-export SCENARIO_SUBDIR="agentic/"
-export IS_AGENTIC="1"
-export CONC="{concurrency}"
-export DURATION="{duration_seconds}"
-export TRACE_SOURCE="{trace_source}"
-export AGENTIC_CONTEXT_BUCKET="{context_bucket}"
-export AGENTIC_ARRIVAL_MODE="{arrival_mode}"
-export AGENTIC_CACHE_MODE="{cache_mode}"
-export AGENTIC_TENANT_MODE="{tenant_mode}"
-export TP="{tp}"
-export EP_SIZE="{ep}"
-export DP_ATTENTION="{dp_attention}"
-export SPEC_DECODING="none"
-export DISAGG="{disagg}"
-export CONFIG_FILE="{config_file}"
-export IS_MULTINODE="{is_multinode}"
-export PREFILL_NUM_WORKERS="{prefill_num_workers}"
-export PREFILL_TP="{prefill_tp}"
-export PREFILL_EP="{prefill_ep}"
-export PREFILL_DP_ATTN="{prefill_dp_attention}"
-export DECODE_NUM_WORKERS="{decode_num_workers}"
-export DECODE_TP="{decode_tp}"
-export DECODE_EP="{decode_ep}"
-export DECODE_DP_ATTN="{decode_dp_attention}"
-
-cat > "{job_dir}/provenance_preflight.jsonl" < "{job_dir}/preflight.log" 2>&1
-
-if [[ "{dry_run_guard}" == "1" ]]; then
-  echo "Dry-run sbatch rendered successfully; not executing runner."
- exit 0 -fi - -mkdir -p "$RESULT_DIR" -bash "{runner_script}" diff --git a/utils/test_agentic_slurm_matrix.py b/utils/test_agentic_slurm_matrix.py deleted file mode 100644 index 40c332a63..000000000 --- a/utils/test_agentic_slurm_matrix.py +++ /dev/null @@ -1,84 +0,0 @@ -import importlib.util -import json -import sys -from pathlib import Path - - -REPO_ROOT = Path(__file__).resolve().parents[1] -SCRIPT = REPO_ROOT / "scripts" / "run_agentic_slurm_matrix.py" - - -def load_runner(): - spec = importlib.util.spec_from_file_location("run_agentic_slurm_matrix", SCRIPT) - module = importlib.util.module_from_spec(spec) - assert spec.loader is not None - sys.modules[spec.name] = module - spec.loader.exec_module(module) - return module - - -def test_agentic_slurm_dry_run_writes_plan_contract_and_sbatch(tmp_path): - runner = load_runner() - rc = runner.main( - [ - "--dry-run", - "--results-root", - str(tmp_path), - "--hardware", - "b200", - "--context-buckets", - "8k", - "--concurrency", - "1,2", - ] - ) - assert rc == 0 - - plan = json.loads((tmp_path / "matrix_plan.json").read_text()) - assert plan["scenario"] == "agentic-coding" - assert plan["total_jobs"] == 2 - assert {job["concurrency"] for job in plan["jobs"]} == {1, 2} - assert all(job["model_prefix"] == "dsv4" for job in plan["jobs"]) - - contract = json.loads((tmp_path / "expected_artifact_contract.json").read_text()) - assert contract["total_jobs"] == 2 - required = contract["per_job"][0]["required_before_claiming_success"] - assert any(path.endswith("/trace_replay/detailed_results.csv") for path in required) - assert any(path.endswith("/provenance_preflight.jsonl") for path in required) - - sbatch_files = sorted((tmp_path / "sbatch").glob("*.sbatch")) - assert len(sbatch_files) == 2 - rendered = sbatch_files[0].read_text() - assert 'SCENARIO_TYPE="agentic-coding"' in rendered - assert 'MODEL_PREFIX="dsv4"' in rendered - assert 'TRACE_SOURCE="semianalysisai/cc-traces-weka-042026"' in rendered - assert "nvidia-smi 
topo -m" in rendered - assert "all_reduce_perf" in rendered - assert "Dry-run sbatch rendered successfully" in rendered - - -def test_agentic_slurm_matrix_can_filter_gb200_multinode(tmp_path): - runner = load_runner() - rc = runner.main( - [ - "--dry-run", - "--results-root", - str(tmp_path), - "--hardware", - "gb200", - "--context-buckets", - "8k", - "--concurrency", - "1", - "--max-jobs", - "1", - ] - ) - assert rc == 0 - - plan = json.loads((tmp_path / "matrix_plan.json").read_text()) - assert plan["total_jobs"] == 1 - job = plan["jobs"][0] - assert job["hardware"] == "gb200" - assert job["is_multinode"] is True - assert job["config_file"].endswith("disagg-gb200-low-latency.yaml") From 0156eb17bf361b86d7c4c32b2104067bdb93dde1 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:36:37 -0700 Subject: [PATCH 5/5] Revert "docs(agentic): add GMI truth matrix [skip-sweep]" This reverts commit df9aa0cd88b3591fa89a4b127685c25862acbb02. --- AGENTIC_TRUTH_MATRIX.md | 247 ---------------------------------------- 1 file changed, 247 deletions(-) delete mode 100644 AGENTIC_TRUTH_MATRIX.md diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md deleted file mode 100644 index fdd91f139..000000000 --- a/AGENTIC_TRUTH_MATRIX.md +++ /dev/null @@ -1,247 +0,0 @@ -# SemiAnalysis InferenceX Agentic/WEKA Truth Matrix - -Date: 2026-05-03 - -Scope: local `InferenceX` checkout, focused on the PR path that adds the `agentic-coding` scenario and WEKA trace replay. This is a truth matrix for deciding what still needs to be built before a GMI Cloud or other neocloud platform engineer can use the harness to evaluate real long-context chat and coding inference workloads. - -## Bottom Line - -The current InferenceX implementation is a real but experimental **agentic trace replay harness**. 
It replays recorded WEKA coding/chat traces against an OpenAI-compatible serving endpoint and emits latency, throughput, cache, workload-distribution, and artifact outputs. - -It is **not yet** a complete GMI/neocloud evaluation harness for DeepSeek-V4 on B200/B300/GB200. The biggest gap is that `agentic-coding` is wired for some DeepSeek-R1 and GPT-OSS/Kimi paths, while the DeepSeek-V4 GB200/B300/B200 surface is still mostly fixed-sequence or srt-slurm recipe driven. The harness also does not yet produce the full cluster, network, reliability, cost, and operator-readiness evidence that a cloud platform engineer would need. - -## Actual Code Path Today - -| Step | What happens | Actual code | Truth status | -|---|---|---|---| -| 1 | Config declares an optional `agentic-coding` scenario. | `.github/configs/CONFIGS.md` | Exists | -| 2 | NVIDIA/AMD master configs include a small number of `agentic-coding` entries. | `.github/configs/nvidia-master.yaml`, `.github/configs/amd-master.yaml` | Exists, narrow | -| 3 | Matrix generator expands agentic entries across concurrency, TP, EP, DP attention, offload, runner, image, model, and duration. | `utils/matrix_logic/generate_sweep_configs.py` | Exists | -| 4 | GitHub workflow sets agentic routing env vars. | `.github/workflows/benchmark-tmpl.yml` | Exists | -| 5 | Runner selects `benchmarks/single_node/agentic/...` instead of normal fixed-seq scripts. | `runners/launch_*.sh` via `SCENARIO_SUBDIR=agentic/` | Exists | -| 6 | Shared library resolves WEKA trace source and builds the replay command. | `benchmarks/benchmark_lib.sh` | Exists | -| 7 | Agentic script starts the serving backend and runs trace replay. | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`, peers | Exists | -| 8 | Multi-node agentic path runs client-only replay against an already-started srt-slurm frontend. | `benchmarks/multi_node/agentic_srt.sh` | Exists, experimental | -| 9 | Aggregator turns replay CSVs into InferenceX-like JSON. 
| `utils/process_agentic_result.py` | Exists | -| 10 | Workflow uploads raw and aggregated artifacts. | `.github/workflows/benchmark-tmpl.yml`, `.github/workflows/e2e-tests.yml` | Exists | - -## Actual Code Snippets - -The trace source is hardcoded to a Hugging Face dataset: - -```bash -local dataset="semianalysisai/cc-traces-weka-042026" -TRACE_SOURCE_FLAG="--hf-dataset $dataset" -``` - -Source: `benchmarks/benchmark_lib.sh` - -Agentic replay is built as a client workload against the local serving endpoint: - -```bash -REPLAY_CMD="python3 $TRACE_REPLAY_DIR/trace_replay_tester.py" -REPLAY_CMD+=" --api-endpoint http://localhost:$PORT" -REPLAY_CMD+=" $TRACE_SOURCE_FLAG" -REPLAY_CMD+=" --output-dir $result_dir/trace_replay" -REPLAY_CMD+=" --start-users $CONC" -REPLAY_CMD+=" --max-users $CONC" -REPLAY_CMD+=" --test-duration $duration" -REPLAY_CMD+=" --recycle" -REPLAY_CMD+=" --warmup-enabled" -REPLAY_CMD+=" --seed 42" -``` - -Source: `benchmarks/benchmark_lib.sh` - -The workflow routes agentic jobs by setting: - -```yaml -SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || '' }} -IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }} -RESULT_DIR: /workspace/results -``` - -Source: `.github/workflows/benchmark-tmpl.yml` - -The B200 DeepSeek-R1 agentic script starts SGLang, waits for readiness, runs replay, then aggregates: - -```bash -resolve_trace_source -install_agentic_deps -python3 -m sglang.launch_server ... 
--enable-metrics > "$SERVER_LOG" 2>&1 & -wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" -build_replay_cmd "$RESULT_DIR" -$REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" || true -write_agentic_result_json "$RESULT_DIR" -``` - -Source: `benchmarks/single_node/agentic/dsr1_fp4_b200.sh` - -Aggregated JSON includes scenario identity, topology, success counts, latency, throughput, token distributions, cache stats, and per-GPU throughput: - -```python -agg = { - "hw": os.environ.get('RUNNER_TYPE', ''), - "conc": conc, - "model": os.environ.get('MODEL', ''), - "framework": os.environ.get('FRAMEWORK', ''), - "scenario_type": "agentic-coding", - "is_multinode": is_multinode, - "tp": tp, - "ep": ep, - "offloading": os.environ.get('OFFLOADING', 'none'), - "num_requests_total": len(rows), - "num_requests_successful": len(successful), -} -``` - -Source: `utils/process_agentic_result.py` - -## What The Use Case Actually Is - -The current use case is: - -| Use case | Current behavior | -|---|---| -| Replay realistic coding/chat request traces | Yes, via `semianalysisai/cc-traces-weka-042026`. | -| Drive a serving endpoint with concurrent users | Yes, with `--start-users $CONC` and `--max-users $CONC`. | -| Measure request-level TTFT / E2E / ITL / TPOT | Yes, from `trace_replay/detailed_results.csv`. | -| Measure throughput and throughput per GPU | Yes, from completed request timestamps and configured GPU counts. | -| Measure input/output token distribution | Yes, from replay rows. | -| Estimate cache reuse | Partially. It reports theoretical replay cache hit rate and server prefix-cache counters when metrics exist. | -| Evaluate real autonomous coding agent behavior | No. It replays traces; it does not run an agent loop with tools, repo edits, tests, retries, or feedback. | -| Evaluate GMI customer traffic | No, unless GMI traffic is converted into the same trace-replay format. 
| - -## Current Coverage Matrix - -| Surface | Current status | Notes | -|---|---|---| -| DeepSeek-R1 FP4 B200 SGLang single-node agentic | Exists | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`. | -| DeepSeek-R1 FP4 B200 Dynamo/TRT multi-node agentic | Exists, experimental | Uses a special `cquil11/srt-slurm-nv` branch and a `128k_agentic` recipe. | -| DeepSeek-R1 FP4 MI355X SGLang single-node agentic | Exists | AMD entry in `.github/configs/amd-master.yaml`. | -| GPT-OSS FP4 H100/H200/MI300X/MI325X agentic scripts | Exists as scripts | Need config coverage and live validation per target. | -| Kimi K2.5 FP4 B200 agentic script | Exists as script | Need config coverage and live validation. | -| DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | -| DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | -| DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | -| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | -| LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | - -## What A GMI/Neocloud Platform Engineer Actually Cares About - -| Category | What they need to decide | Current harness answer | Gap | -|---|---|---|---| -| Capacity planning | How many concurrent coding/chat sessions per node or rack before SLO violation? | Partial: concurrency sweep and request latencies. | Needs SLO pass/fail curves, saturation point, and capacity recommendations. | -| Latency SLO | P50/P90/P99 TTFT, TPOT, E2E for long-context chat/coding. | Partial: computes latency stats. | Needs explicit SLO config, pass/fail, and stable run windows. 
| -| Long context | How 32k/64k/128k/256k+ context behaves under realistic reuse. | Partial: WEKA traces may include realistic shapes, but context buckets are not first-class in matrix. | Needs explicit context-length stratification and reporting. | -| Coding workload realism | Does traffic resemble coding assistants, repo Q&A, edits, tests, tool calls? | Partial: recorded traces, but no task taxonomy shown in core benchmark output. | Needs workload classes: code chat, repo QA, patch generation, test/debug loop, long-doc coding. | -| Cache value | Does prefix/KV reuse improve latency, cost, and throughput? | Partial: theoretical and server prefix-cache hit metrics. | Needs engine-specific cache event metrics, eviction, residency, fragmentation, reuse distance, cache salt/isolation. | -| Multi-tenant isolation | Does one tenant poison or evict another tenant's cache? | Missing. | Needs tenant IDs, cache salts, fairness and isolation reports. | -| Memory pressure | When do KV cache, CPU offload, swap, or SSD tiers collapse? | Partial: offloading field and a few counters. | Needs GPU memory, HBM pressure, CPU memory, SSD bandwidth, eviction storms, OOM attribution. | -| Slurm operator flow | Can an operator dry-run, submit, monitor, cancel, and collect artifacts? | Partial in InferenceX CI and srt-slurm paths. | Needs portable Slurm matrix runner, sbatch rendering, env-only cluster config, and artifact contract. | -| Network health | Are NCCL/RDMA/NVLink/IB topology problems caught before benchmark? | Missing in InferenceX agentic path. | Needs preflight topology and network smoke checks. | -| Reproducibility | Can results be traced to image digest, repo SHA, GPU inventory, driver, topology, and versions? | Partial: CI has image/model/framework fields. | Needs full provenance captured per job. | -| Reliability | Do runs survive cold start, warmup, long duration, failed requests, server restarts? | Partial: success counts and raw logs. 
| Needs failure taxonomy, retry policy, health timeline, and soak tests. | -| Cost model | Which hardware/runtime gives best $/successful-session or $/M tokens at SLO? | Missing. | Needs GPU-hour pricing input and cost-per-SLO report. | -| Hardware comparison | B200 vs B300 vs GB200 for the same DSV4 workload. | Missing for agentic. | Needs the same workload run across the same engines and configs. | -| Runtime comparison | vLLM vs SGLang vs TRT/Dynamo under identical trace replay. | Partial for some models. | Needs a normalized DSV4 matrix and identical trace/scheduler settings. | -| Production readiness | What config should GMI actually offer customers? | Missing. | Needs recommended SKUs, caveats, and no-go thresholds. | - -## Truth Matrix: Current vs Required - -Legend: - -- Yes: implemented in the local InferenceX path. -- Partial: implemented but too narrow, experimental, or missing key evidence. -- No: not implemented. -- Unknown: cannot be proven from this repo without live cluster results or external data. - -| Requirement | Current truth | Evidence | Needed build | -|---|---|---|---| -| Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | -| WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | -| Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | -| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. 
| -| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | -| B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | -| vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | -| Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | -| Coding workload taxonomy | Partial | Trace replay exists; distribution plot exists. | Add task labels and per-class metrics. | -| TTFT/TPOT/E2E latency metrics | Yes | `compute_latency_stats`. | Add SLO pass/fail summary. | -| Throughput per GPU | Yes | `tput_per_gpu` in processor. | Add SLO-qualified throughput, not just raw throughput. | -| Failed request taxonomy | Partial | Success count exists. | Add HTTP error class, timeout, OOM, scheduler reject, engine crash. | -| Prefix/KV cache hit metrics | Partial | Theoretical + server prefix counters when present. | Add LMCache/TensorMesh/vLLM/SGLang metric adapters with measured-vs-inferred flags. | -| Eviction/fragmentation proof | No | No live cache event schema in agentic path. | Add engine metric scraping and artifact schema. | -| Multi-tenant cache isolation | No | No tenant IDs or cache salt model. | Add multi-tenant trace mode and isolation metrics. | -| CPU/SSD offload analysis | Partial | `offloading` field and some counters in processor. | Add tier residency, bandwidth, latency, and failure attribution. | -| Slurm dry-run matrix generation | Partial | InferenceX CI/srt-slurm flow exists; not a portable GMI operator runner. | Add portable Slurm matrix runner and artifact contract. | -| NCCL/RDMA/topology preflight | No | Not in agentic path. 
| Add pre-benchmark smoke checks. | -| Full provenance capture | Partial | JSON includes image/model/framework; raw logs upload. | Add digest, repo SHA, driver, CUDA, GPU inventory, topology, package versions. | -| Cost and capacity report | No | No pricing or recommendation layer. | Add cost inputs and capacity planning report. | -| Customer-ready operator report | No | Raw/aggregated artifacts only. | Add one-page operator brief with recommendations and caveats. | - -## Recommended Build Matrix For GMI/Neocloud Evaluation - -This is the minimum useful matrix for a GMI cloud engineer evaluating long-context chat and coding workloads. It is intentionally smaller than a full combinatorial sweep. - -| Axis | Required values | Why it matters | -|---|---|---| -| Hardware | B200, B300, GB200 | These are the procurement/deployment choices. | -| Model | DeepSeek-V4-Pro first; DeepSeek-R1 as control | DSV4 is the target; DSR1 provides existing harness continuity. | -| Runtime | vLLM, SGLang, Dynamo/TRT where supported | GMI needs runtime/SKU decision data. | -| Topology | single-node, multi-node disagg | Long context and MoE behavior differ sharply by topology. | -| Context bucket | 8k, 32k, 64k, 128k, 256k+ | Cloud operators need max supported context and degradation curve. | -| Workload type | long chat, repo QA, code generation, test/debug loop, multi-turn agent | Coding traffic is not one workload. | -| Concurrency | 1, 2, 4, 8, 16, 32, 64, 128, then saturation search | Finds knee of curve and failure region. | -| Arrival mode | closed-loop and burst/open-loop | Closed-loop measures users; open-loop exposes queue collapse. | -| Cache mode | cache off, engine prefix cache, LMCache/TensorMesh if available | Proves whether cache stack actually helps. | -| Tenant mode | single tenant, multi-tenant with cache salt | Proves isolation and fairness. | -| Duration | 10 min smoke, 30 min curve, 2-4 hr soak | Separates launch success from operational stability. 
| - -## What To Build Next - -| Priority | Build item | Acceptance criteria | -|---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | -| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | -| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | -| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | -| P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | -| P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | -| P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. | -| P1 | NCCL/RDMA/topology preflight | Preflight emits pass/fail/skipped before benchmark execution. | -| P1 | Cache metrics adapters | vLLM/SGLang/LMCache/TensorMesh metrics are normalized with measured vs inferred labels. | -| P2 | Multi-tenant replay mode | Tenant IDs, cache salt/isolation, fairness, noisy-neighbor metrics. | -| P2 | Cost model | Add GPU-hour price input and output $/successful-session, $/M input tokens, $/M output tokens at SLO. | -| P2 | Operator brief | Generate a human-readable recommendation with caveats: best config, no-go configs, saturation point, and missing proof. | - -## Non-Claims To Preserve - -Do not claim any of the following until live artifacts prove them: - -- DeepSeek-V4 agentic performance on GB200. -- B200/B300/GB200 parity under the same long-context trace replay. -- LMCache/TensorMesh benefit. 
-- Cache eviction or fragmentation behavior. -- Multi-tenant isolation. -- Production readiness for GMI customer workloads. -- Autonomous agent performance; this is trace replay, not a tool-using agent loop. - -## Proposed File/Code Changes For The Next PR - -| Area | Candidate files | -|---|---| -| DSV4 agentic configs | `.github/configs/nvidia-master.yaml`, possibly a separate GMI/GPU pilot config. | -| DSV4 single-node launchers | `benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh`, `benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh`, vLLM variants if supported. | -| GB200 multi-node agentic | `benchmarks/multi_node/agentic_srt.sh`, `benchmarks/multi_node/srt-slurm-recipes/.../deepseek-v4/...`, `runners/launch_gb200-nv.sh`. | -| Slurm operator harness | `scripts/slurm/`, `scripts/run_agentic_slurm_matrix.py`, `configs/agentic_slurm_matrix.yaml`. | -| Metrics schema | `utils/process_agentic_result.py` plus a new normalized metrics schema module. | -| Artifact contract tests | `utils/matrix_logic/test_*.py` or new repo-level tests for dry-run contract. | -| Operator report | new `utils/summarize_agentic.py` or integration into `utils/summarize.py`. | - -## Decision - -For GMI/neocloud evaluation, the current InferenceX PR is a **good starting mechanism**, not a finished benchmark product. Build the missing DSV4+B200/B300/GB200 agentic Slurm surface, add provenance/preflight/cache/SLO reporting, and keep every unmeasured claim explicitly labeled as unproven.
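
## Appendix: SLO/Capacity Report Sketch

The P1 "SLO/capacity report" build item can be made concrete with a short sketch. Everything below is illustrative, not existing harness code: the row shape, the field names (`conc`, `ttft_s`, `e2e_s`), the function names, and the SLO thresholds are all assumptions. Real input rows would be parsed from the per-request `trace_replay/detailed_results.csv` across a concurrency sweep.

```python
# Hypothetical sketch of the P1 SLO/capacity report; field names and
# thresholds are assumptions, not part of the current InferenceX harness.
from statistics import quantiles


def p99(values):
    # 99th percentile; statistics.quantiles needs at least two samples.
    return quantiles(values, n=100, method="inclusive")[98]


def max_concurrency_at_slo(rows, ttft_slo_s=2.0, e2e_slo_s=60.0):
    """Highest tested concurrency whose P99 TTFT and E2E meet the SLO,
    plus a per-concurrency pass/fail table for the operator report."""
    by_conc = {}
    for row in rows:
        by_conc.setdefault(row["conc"], []).append(row)

    best, table = None, []
    for conc in sorted(by_conc):
        ttfts = [r["ttft_s"] for r in by_conc[conc]]
        e2es = [r["e2e_s"] for r in by_conc[conc]]
        ok = p99(ttfts) <= ttft_slo_s and p99(e2es) <= e2e_slo_s
        table.append({"conc": conc, "p99_ttft_s": p99(ttfts),
                      "p99_e2e_s": p99(e2es), "slo_pass": ok})
        if ok:
            best = conc
    return best, table
```

`best` comes back as `None` when no tested concurrency meets the SLO, which maps directly onto the no-go thresholds the production-readiness row asks for; the per-concurrency pass/fail table is the raw material for the saturation-point and capacity-recommendation outputs named under capacity planning.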