From 537beb92dab467a7f7a0a057674fbfde10dc0c8a Mon Sep 17 00:00:00 2001
From: William Chen <57119977+OCWC22@users.noreply.github.com>
Date: Sat, 2 May 2026 12:34:13 -0700
Subject: [PATCH 1/5] fix(dsv4): gate h200 reasoning parser flag

---
 benchmarks/single_node/dsv4_fp8_h200.sh | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/benchmarks/single_node/dsv4_fp8_h200.sh b/benchmarks/single_node/dsv4_fp8_h200.sh
index 167a50a57..9b381fed2 100644
--- a/benchmarks/single_node/dsv4_fp8_h200.sh
+++ b/benchmarks/single_node/dsv4_fp8_h200.sh
@@ -25,6 +25,7 @@ hf download "$MODEL"
 
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
+ENABLE_DSV4_REASONING_PARSER=${ENABLE_DSV4_REASONING_PARSER:-false}
 
 # DeepSeek-V4-Pro weights are large; engine startup can exceed the default
 # 600s. Give it an hour to load.
@@ -37,6 +38,11 @@ else
   MAX_MODEL_LEN_ARG="--max-model-len 800000"
 fi
 
+REASONING_PARSER_ARGS=()
+if [[ "${ENABLE_DSV4_REASONING_PARSER}" == "true" ]]; then
+  REASONING_PARSER_ARGS+=(--reasoning-parser deepseek_v4)
+fi
+
 # Start GPU monitoring (power, temperature, clocks every second)
 start_gpu_monitor
 
@@ -60,7 +66,7 @@
 $MAX_MODEL_LEN_ARG \
 --tokenizer-mode deepseek_v4 \
 --tool-call-parser deepseek_v4 \
 --enable-auto-tool-choice \
---reasoning-parser deepseek_v4 > $SERVER_LOG 2>&1 &
+"${REASONING_PARSER_ARGS[@]}" > $SERVER_LOG 2>&1 &
 
 SERVER_PID=$!
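The pattern in the patch above — collecting optional server flags in a bash array so the flag vanishes entirely when disabled — can be sketched in isolation. This is a standalone illustration, not the benchmark script itself:

```shell
#!/usr/bin/env bash
# Standalone sketch of the flag-gating pattern used in PATCH 1/5.
# An empty bash array expands to zero words under "${arr[@]}", so the
# optional flag disappears from the server command line when disabled.
# Caveat: under `set -u`, empty-array expansion requires bash >= 4.4.
set -euo pipefail

ENABLE_DSV4_REASONING_PARSER=${ENABLE_DSV4_REASONING_PARSER:-false}

REASONING_PARSER_ARGS=()
if [[ "${ENABLE_DSV4_REASONING_PARSER}" == "true" ]]; then
  REASONING_PARSER_ARGS+=(--reasoning-parser deepseek_v4)
fi

# Print one word per line to show exactly what the server would receive;
# with the default (false), only "launch-server" is printed.
printf '%s\n' launch-server "${REASONING_PARSER_ARGS[@]}"
```

The same array-append approach extends to any number of conditional flags without nested quoting problems.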
From df9aa0cd88b3591fa89a4b127685c25862acbb02 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:25:13 -0700 Subject: [PATCH 2/5] docs(agentic): add GMI truth matrix [skip-sweep] --- AGENTIC_TRUTH_MATRIX.md | 247 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 247 insertions(+) create mode 100644 AGENTIC_TRUTH_MATRIX.md diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md new file mode 100644 index 000000000..fdd91f139 --- /dev/null +++ b/AGENTIC_TRUTH_MATRIX.md @@ -0,0 +1,247 @@ +# SemiAnalysis InferenceX Agentic/WEKA Truth Matrix + +Date: 2026-05-03 + +Scope: local `InferenceX` checkout, focused on the PR path that adds the `agentic-coding` scenario and WEKA trace replay. This is a truth matrix for deciding what still needs to be built before a GMI Cloud or other neocloud platform engineer can use the harness to evaluate real long-context chat and coding inference workloads. + +## Bottom Line + +The current InferenceX implementation is a real but experimental **agentic trace replay harness**. It replays recorded WEKA coding/chat traces against an OpenAI-compatible serving endpoint and emits latency, throughput, cache, workload-distribution, and artifact outputs. + +It is **not yet** a complete GMI/neocloud evaluation harness for DeepSeek-V4 on B200/B300/GB200. The biggest gap is that `agentic-coding` is wired for some DeepSeek-R1 and GPT-OSS/Kimi paths, while the DeepSeek-V4 GB200/B300/B200 surface is still mostly fixed-sequence or srt-slurm recipe driven. The harness also does not yet produce the full cluster, network, reliability, cost, and operator-readiness evidence that a cloud platform engineer would need. + +## Actual Code Path Today + +| Step | What happens | Actual code | Truth status | +|---|---|---|---| +| 1 | Config declares an optional `agentic-coding` scenario. 
| `.github/configs/CONFIGS.md` | Exists | +| 2 | NVIDIA/AMD master configs include a small number of `agentic-coding` entries. | `.github/configs/nvidia-master.yaml`, `.github/configs/amd-master.yaml` | Exists, narrow | +| 3 | Matrix generator expands agentic entries across concurrency, TP, EP, DP attention, offload, runner, image, model, and duration. | `utils/matrix_logic/generate_sweep_configs.py` | Exists | +| 4 | GitHub workflow sets agentic routing env vars. | `.github/workflows/benchmark-tmpl.yml` | Exists | +| 5 | Runner selects `benchmarks/single_node/agentic/...` instead of normal fixed-seq scripts. | `runners/launch_*.sh` via `SCENARIO_SUBDIR=agentic/` | Exists | +| 6 | Shared library resolves WEKA trace source and builds the replay command. | `benchmarks/benchmark_lib.sh` | Exists | +| 7 | Agentic script starts the serving backend and runs trace replay. | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`, peers | Exists | +| 8 | Multi-node agentic path runs client-only replay against an already-started srt-slurm frontend. | `benchmarks/multi_node/agentic_srt.sh` | Exists, experimental | +| 9 | Aggregator turns replay CSVs into InferenceX-like JSON. | `utils/process_agentic_result.py` | Exists | +| 10 | Workflow uploads raw and aggregated artifacts. 
| `.github/workflows/benchmark-tmpl.yml`, `.github/workflows/e2e-tests.yml` | Exists | + +## Actual Code Snippets + +The trace source is hardcoded to a Hugging Face dataset: + +```bash +local dataset="semianalysisai/cc-traces-weka-042026" +TRACE_SOURCE_FLAG="--hf-dataset $dataset" +``` + +Source: `benchmarks/benchmark_lib.sh` + +Agentic replay is built as a client workload against the local serving endpoint: + +```bash +REPLAY_CMD="python3 $TRACE_REPLAY_DIR/trace_replay_tester.py" +REPLAY_CMD+=" --api-endpoint http://localhost:$PORT" +REPLAY_CMD+=" $TRACE_SOURCE_FLAG" +REPLAY_CMD+=" --output-dir $result_dir/trace_replay" +REPLAY_CMD+=" --start-users $CONC" +REPLAY_CMD+=" --max-users $CONC" +REPLAY_CMD+=" --test-duration $duration" +REPLAY_CMD+=" --recycle" +REPLAY_CMD+=" --warmup-enabled" +REPLAY_CMD+=" --seed 42" +``` + +Source: `benchmarks/benchmark_lib.sh` + +The workflow routes agentic jobs by setting: + +```yaml +SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || '' }} +IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }} +RESULT_DIR: /workspace/results +``` + +Source: `.github/workflows/benchmark-tmpl.yml` + +The B200 DeepSeek-R1 agentic script starts SGLang, waits for readiness, runs replay, then aggregates: + +```bash +resolve_trace_source +install_agentic_deps +python3 -m sglang.launch_server ... 
--enable-metrics > "$SERVER_LOG" 2>&1 & +wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +build_replay_cmd "$RESULT_DIR" +$REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" || true +write_agentic_result_json "$RESULT_DIR" +``` + +Source: `benchmarks/single_node/agentic/dsr1_fp4_b200.sh` + +Aggregated JSON includes scenario identity, topology, success counts, latency, throughput, token distributions, cache stats, and per-GPU throughput: + +```python +agg = { + "hw": os.environ.get('RUNNER_TYPE', ''), + "conc": conc, + "model": os.environ.get('MODEL', ''), + "framework": os.environ.get('FRAMEWORK', ''), + "scenario_type": "agentic-coding", + "is_multinode": is_multinode, + "tp": tp, + "ep": ep, + "offloading": os.environ.get('OFFLOADING', 'none'), + "num_requests_total": len(rows), + "num_requests_successful": len(successful), +} +``` + +Source: `utils/process_agentic_result.py` + +## What The Use Case Actually Is + +The current use case is: + +| Use case | Current behavior | +|---|---| +| Replay realistic coding/chat request traces | Yes, via `semianalysisai/cc-traces-weka-042026`. | +| Drive a serving endpoint with concurrent users | Yes, with `--start-users $CONC` and `--max-users $CONC`. | +| Measure request-level TTFT / E2E / ITL / TPOT | Yes, from `trace_replay/detailed_results.csv`. | +| Measure throughput and throughput per GPU | Yes, from completed request timestamps and configured GPU counts. | +| Measure input/output token distribution | Yes, from replay rows. | +| Estimate cache reuse | Partially. It reports theoretical replay cache hit rate and server prefix-cache counters when metrics exist. | +| Evaluate real autonomous coding agent behavior | No. It replays traces; it does not run an agent loop with tools, repo edits, tests, retries, or feedback. | +| Evaluate GMI customer traffic | No, unless GMI traffic is converted into the same trace-replay format. 
| + +## Current Coverage Matrix + +| Surface | Current status | Notes | +|---|---|---| +| DeepSeek-R1 FP4 B200 SGLang single-node agentic | Exists | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`. | +| DeepSeek-R1 FP4 B200 Dynamo/TRT multi-node agentic | Exists, experimental | Uses a special `cquil11/srt-slurm-nv` branch and a `128k_agentic` recipe. | +| DeepSeek-R1 FP4 MI355X SGLang single-node agentic | Exists | AMD entry in `.github/configs/amd-master.yaml`. | +| GPT-OSS FP4 H100/H200/MI300X/MI325X agentic scripts | Exists as scripts | Need config coverage and live validation per target. | +| Kimi K2.5 FP4 B200 agentic script | Exists as script | Need config coverage and live validation. | +| DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | +| DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | +| DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | +| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | +| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | +| LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | + +## What A GMI/Neocloud Platform Engineer Actually Cares About + +| Category | What they need to decide | Current harness answer | Gap | +|---|---|---|---| +| Capacity planning | How many concurrent coding/chat sessions per node or rack before SLO violation? | Partial: concurrency sweep and request latencies. | Needs SLO pass/fail curves, saturation point, and capacity recommendations. | +| Latency SLO | P50/P90/P99 TTFT, TPOT, E2E for long-context chat/coding. | Partial: computes latency stats. | Needs explicit SLO config, pass/fail, and stable run windows. 
| +| Long context | How 32k/64k/128k/256k+ context behaves under realistic reuse. | Partial: WEKA traces may include realistic shapes, but context buckets are not first-class in matrix. | Needs explicit context-length stratification and reporting. | +| Coding workload realism | Does traffic resemble coding assistants, repo Q&A, edits, tests, tool calls? | Partial: recorded traces, but no task taxonomy shown in core benchmark output. | Needs workload classes: code chat, repo QA, patch generation, test/debug loop, long-doc coding. | +| Cache value | Does prefix/KV reuse improve latency, cost, and throughput? | Partial: theoretical and server prefix-cache hit metrics. | Needs engine-specific cache event metrics, eviction, residency, fragmentation, reuse distance, cache salt/isolation. | +| Multi-tenant isolation | Does one tenant poison or evict another tenant's cache? | Missing. | Needs tenant IDs, cache salts, fairness and isolation reports. | +| Memory pressure | When do KV cache, CPU offload, swap, or SSD tiers collapse? | Partial: offloading field and a few counters. | Needs GPU memory, HBM pressure, CPU memory, SSD bandwidth, eviction storms, OOM attribution. | +| Slurm operator flow | Can an operator dry-run, submit, monitor, cancel, and collect artifacts? | Partial in InferenceX CI and srt-slurm paths. | Needs portable Slurm matrix runner, sbatch rendering, env-only cluster config, and artifact contract. | +| Network health | Are NCCL/RDMA/NVLink/IB topology problems caught before benchmark? | Missing in InferenceX agentic path. | Needs preflight topology and network smoke checks. | +| Reproducibility | Can results be traced to image digest, repo SHA, GPU inventory, driver, topology, and versions? | Partial: CI has image/model/framework fields. | Needs full provenance captured per job. | +| Reliability | Do runs survive cold start, warmup, long duration, failed requests, server restarts? | Partial: success counts and raw logs. 
| Needs failure taxonomy, retry policy, health timeline, and soak tests. | +| Cost model | Which hardware/runtime gives best $/successful-session or $/M tokens at SLO? | Missing. | Needs GPU-hour pricing input and cost-per-SLO report. | +| Hardware comparison | B200 vs B300 vs GB200 for the same DSV4 workload. | Missing for agentic. | Need same workload across same engines and configs. | +| Runtime comparison | vLLM vs SGLang vs TRT/Dynamo under identical trace replay. | Partial for some models. | Need normalized DSV4 matrix and identical trace/scheduler settings. | +| Production readiness | What config should GMI actually offer customers? | Missing. | Needs recommended SKUs, caveats, and no-go thresholds. | + +## Truth Matrix: Current vs Required + +Legend: + +- Yes: implemented in the local InferenceX path. +- Partial: implemented but too narrow, experimental, or missing key evidence. +- No: not implemented. +- Unknown: cannot be proven from this repo without live cluster results or external data. + +| Requirement | Current truth | Evidence | Needed build | +|---|---:|---|---| +| Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | +| WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | +| Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | +| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | +| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. 
| +| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | +| B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | +| vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | +| Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | +| Coding workload taxonomy | Partial | Trace replay exists; distribution plot exists. | Add task labels and per-class metrics. | +| TTFT/TPOT/E2E latency metrics | Yes | `compute_latency_stats`. | Add SLO pass/fail summary. | +| Throughput per GPU | Yes | `tput_per_gpu` in processor. | Add SLO-qualified throughput, not just raw throughput. | +| Failed request taxonomy | Partial | Success count exists. | Add HTTP error class, timeout, OOM, scheduler reject, engine crash. | +| Prefix/KV cache hit metrics | Partial | Theoretical + server prefix counters when present. | Add LMCache/TensorMesh/vLLM/SGLang metric adapters with measured-vs-inferred flags. | +| Eviction/fragmentation proof | No | No live cache event schema in agentic path. | Add engine metric scraping and artifact schema. | +| Multi-tenant cache isolation | No | No tenant IDs or cache salt model. | Add multi-tenant trace mode and isolation metrics. | +| CPU/SSD offload analysis | Partial | `offloading` field and some counters in processor. | Add tier residency, bandwidth, latency, and failure attribution. | +| Slurm dry-run matrix generation | Partial | InferenceX CI/srt-slurm flow exists; not a portable GMI operator runner. | Add portable Slurm matrix runner and artifact contract. | +| NCCL/RDMA/topology preflight | No | Not in agentic path. 
| Add pre-benchmark smoke checks. | +| Full provenance capture | Partial | JSON includes image/model/framework; raw logs upload. | Add digest, repo SHA, driver, CUDA, GPU inventory, topology, package versions. | +| Cost and capacity report | No | No pricing or recommendation layer. | Add cost inputs and capacity planning report. | +| Customer-ready operator report | No | Raw/aggregated artifacts only. | Add one-page operator brief with recommendations and caveats. | + +## Recommended Build Matrix For GMI/Neocloud Evaluation + +This is the minimum useful matrix for a GMI cloud engineer evaluating long-context chat and coding workloads. It is intentionally smaller than a full combinatorial sweep. + +| Axis | Required values | Why it matters | +|---|---|---| +| Hardware | B200, B300, GB200 | These are the procurement/deployment choices. | +| Model | DeepSeek-V4-Pro first; DeepSeek-R1 as control | DSV4 is the target; DSR1 provides existing harness continuity. | +| Runtime | vLLM, SGLang, Dynamo/TRT where supported | GMI needs runtime/SKU decision data. | +| Topology | single-node, multi-node disagg | Long context and MoE behavior differ sharply by topology. | +| Context bucket | 8k, 32k, 64k, 128k, 256k+ | Cloud operators need max supported context and degradation curve. | +| Workload type | long chat, repo QA, code generation, test/debug loop, multi-turn agent | Coding traffic is not one workload. | +| Concurrency | 1, 2, 4, 8, 16, 32, 64, 128, then saturation search | Finds knee of curve and failure region. | +| Arrival mode | closed-loop and burst/open-loop | Closed-loop measures users; open-loop exposes queue collapse. | +| Cache mode | cache off, engine prefix cache, LMCache/TensorMesh if available | Proves whether cache stack actually helps. | +| Tenant mode | single tenant, multi-tenant with cache salt | Proves isolation and fairness. | +| Duration | 10 min smoke, 30 min curve, 2-4 hr soak | Separates launch success from operational stability. 
| + +## What To Build Next + +| Priority | Build item | Acceptance criteria | +|---:|---|---| +| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | +| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | +| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | +| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | +| P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | +| P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | +| P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. | +| P1 | NCCL/RDMA/topology preflight | Preflight emits pass/fail/skipped before benchmark execution. | +| P1 | Cache metrics adapters | vLLM/SGLang/LMCache/TensorMesh metrics are normalized with measured vs inferred labels. | +| P2 | Multi-tenant replay mode | Tenant IDs, cache salt/isolation, fairness, noisy-neighbor metrics. | +| P2 | Cost model | Add GPU-hour price input and output $/successful-session, $/M input tokens, $/M output tokens at SLO. | +| P2 | Operator brief | Generate a human-readable recommendation with caveats: best config, no-go configs, saturation point, and missing proof. | + +## Non-Claims To Preserve + +Do not claim any of the following until live artifacts prove them: + +- DeepSeek-V4 agentic performance on GB200. +- B200/B300/GB200 parity under the same long-context trace replay. +- LMCache/TensorMesh benefit. 
+- Cache eviction or fragmentation behavior. +- Multi-tenant isolation. +- Production readiness for GMI customer workloads. +- Autonomous agent performance; this is trace replay, not a tool-using agent loop. + +## Proposed File/Code Changes For The Next PR + +| Area | Candidate files | +|---|---| +| DSV4 agentic configs | `.github/configs/nvidia-master.yaml`, possibly a separate GMI/GPU pilot config. | +| DSV4 single-node launchers | `benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh`, `benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh`, vLLM variants if supported. | +| GB200 multi-node agentic | `benchmarks/multi_node/agentic_srt.sh`, `benchmarks/multi_node/srt-slurm-recipes/.../deepseek-v4/...`, `runners/launch_gb200-nv.sh`. | +| Slurm operator harness | `scripts/slurm/`, `scripts/run_agentic_slurm_matrix.py`, `configs/agentic_slurm_matrix.yaml`. | +| Metrics schema | `utils/process_agentic_result.py` plus a new normalized metrics schema module. | +| Artifact contract tests | `utils/matrix_logic/test_*.py` or new repo-level tests for dry-run contract. | +| Operator report | new `utils/summarize_agentic.py` or integration into `utils/summarize.py`. | + +## Decision + +For GMI/neocloud evaluation, the current InferenceX PR is a **good starting mechanism**, not a finished benchmark product. Build the missing DSV4+B200/B300/GB200 agentic Slurm surface, add provenance/preflight/cache/SLO reporting, and keep every unmeasured claim explicitly labeled as unproven. 
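The P1 "SLO/capacity report" build item in the truth matrix above asks, for each matrix cell, for the maximum concurrency that still meets a latency SLO. A minimal sketch of that computation follows; the function name and the (concurrency, TTFT) row layout are hypothetical, since the real `trace_replay/detailed_results.csv` schema is not shown in this excerpt:

```python
from collections import defaultdict

def max_concurrency_at_slo(rows, ttft_slo_s=2.0, quantile=0.99):
    """Given replay rows as (concurrency, ttft_seconds) pairs, return the
    highest concurrency whose TTFT quantile still meets the SLO, or None
    if even the lowest concurrency violates it."""
    by_conc = defaultdict(list)
    for conc, ttft in rows:
        by_conc[conc].append(ttft)
    passing = None
    for conc in sorted(by_conc):
        ttfts = sorted(by_conc[conc])
        # Nearest-rank quantile; adequate for a report sketch.
        idx = min(len(ttfts) - 1, int(quantile * len(ttfts)))
        if ttfts[idx] <= ttft_slo_s:
            passing = conc
        else:
            break  # assumes latency is monotone non-decreasing in concurrency
    return passing

# Synthetic example: TTFT degrades linearly with concurrency.
rows = [(c, 0.05 * c) for c in (1, 2, 4, 8, 16, 32, 64) for _ in range(100)]
print(max_concurrency_at_slo(rows, ttft_slo_s=2.0))  # -> 32 (1.6s passes; 64 gives 3.2s)
```

A real report would run the same reduction per (hardware, engine, context bucket) cell and also record the failure reason beyond the knee, as the matrix requires.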
From bfea80549194a4649124f9df73c46f5682a33e40 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:35:05 -0700 Subject: [PATCH 3/5] feat(agentic): add GMI DSV4 Slurm harness [skip-sweep] --- .github/configs/nvidia-master.yaml | 8 + AGENTIC_TRUTH_MATRIX.md | 20 +- benchmarks/benchmark_lib.sh | 2 +- .../agentic/dsv4_fp4_b200_sglang.sh | 17 + .../agentic/dsv4_fp4_b300_sglang.sh | 17 + benchmarks/single_node/dsv4_fp4_b200.sh | 19 + .../single_node/dsv4_fp4_b300_sglang.sh | 19 + configs/agentic_slurm_matrix.json | 73 ++++ configs/agentic_slurm_matrix.yaml | 67 +++ scripts/run_agentic_slurm_matrix.py | 381 ++++++++++++++++++ scripts/slurm/agentic_job.sbatch.tmpl | 89 ++++ utils/test_agentic_slurm_matrix.py | 84 ++++ 12 files changed, 785 insertions(+), 11 deletions(-) create mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh create mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh create mode 100644 configs/agentic_slurm_matrix.json create mode 100644 configs/agentic_slurm_matrix.yaml create mode 100755 scripts/run_agentic_slurm_matrix.py create mode 100644 scripts/slurm/agentic_job.sbatch.tmpl create mode 100644 utils/test_agentic_slurm_matrix.py diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index 38d1101f3..3b31a65e8 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -1732,6 +1732,10 @@ dsv4-fp4-b200-sglang: - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } # DP-attention (DP_ATTENTION=true) — max-throughput CONC range - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 } + agentic-coding: + - duration: 1800 + search-space: + - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } dsv4-fp4-b200-vllm: image: vllm/vllm-openai:v0.20.0-cu130 @@ -1951,6 +1955,10 @@ dsv4-fp4-b300-sglang: - { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 
512 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 2048 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 } + agentic-coding: + - duration: 1800 + search-space: + - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } # DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is # selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md index fdd91f139..b23f598b1 100644 --- a/AGENTIC_TRUTH_MATRIX.md +++ b/AGENTIC_TRUTH_MATRIX.md @@ -124,8 +124,8 @@ The current use case is: | DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | | DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | | DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | -| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | +| DeepSeek-V4 B200/B300 SGLang agentic trace replay | Wired, needs live validation | Added DSV4 single-node agentic wrappers and conservative `agentic-coding` config rows. | +| DeepSeek-V4 GB200 agentic trace replay | Dry-run harness only | Portable Slurm matrix can render GB200 agentic jobs, but live srt-slurm recipe behavior is still unproven. | | LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | ## What A GMI/Neocloud Platform Engineer Actually Cares About @@ -161,11 +161,11 @@ Legend: |---|---:|---|---| | Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | | WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. 
| Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Run live DSV4 B200/B300 validation. | | Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | -| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. | -| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | +| DeepSeek-V4 B200 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B200 Slurm and attach artifacts. | +| DeepSeek-V4 B300 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B300 Slurm and attach artifacts. | +| DeepSeek-V4 GB200 agentic | Partial | Portable matrix renders GB200 jobs; live srt-slurm behavior remains unproven. | Add/validate GB200 agentic srt-slurm recipe on real hardware. | | B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | | vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | | Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. 
| @@ -205,10 +205,10 @@ This is the minimum useful matrix for a GMI cloud engineer evaluating long-conte | Priority | Build item | Acceptance criteria | |---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | -| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | -| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | -| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | +| P0 | DSV4 `agentic-coding` configs for B200/B300 | Implemented; requires live run artifacts. | +| P0 | DSV4 agentic launchers for B200/B300 | Implemented; reuses existing SGLang server recipes and switches the client to WEKA trace replay. | +| P0 | Portable Slurm matrix runner | Implemented for dry-run/submit; no hardcoded cluster IDs; all cluster settings via env/JSON/YAML. | +| P0 | Artifact contract | Implemented expected-path manifest; live runs must still produce the artifacts before claims. | | P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | | P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | | P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. 
| diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh index 4c0c8642e..994111bad 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -892,7 +892,7 @@ ensure_hf_cli() { } resolve_trace_source() { - local dataset="semianalysisai/cc-traces-weka-042026" + local dataset="${TRACE_SOURCE:-semianalysisai/cc-traces-weka-042026}" TRACE_SOURCE_FLAG="--hf-dataset $dataset" echo "Loading traces from Hugging Face dataset: $dataset" # Pre-download the dataset into the shared HF_HUB_CACHE (same mount used diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh new file mode 100755 index 000000000..dcac3bdb3 --- /dev/null +++ b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B200 with SGLang. +# The server recipe lives in ../dsv4_fp4_b200.sh; AGENTIC_MODE switches the +# post-ready client from fixed random prompts to WEKA trace replay. + +export AGENTIC_MODE=1 +export ISL="${ISL:-8192}" +export OSL="${OSL:-1024}" +export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" +export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b200_sglang}" + +REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" +export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" + +exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b200.sh" diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh new file mode 100755 index 000000000..a5dc2387c --- /dev/null +++ b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh @@ -0,0 +1,17 @@ +#!/usr/bin/env bash +set -euo pipefail + +# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B300 with SGLang. 
+# The server recipe lives in ../dsv4_fp4_b300_sglang.sh; AGENTIC_MODE switches +# the post-ready client from fixed random prompts to WEKA trace replay. + +export AGENTIC_MODE=1 +export ISL="${ISL:-8192}" +export OSL="${OSL:-1024}" +export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" +export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b300_sglang}" + +REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" +export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" + +exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b300_sglang.sh" diff --git a/benchmarks/single_node/dsv4_fp4_b200.sh b/benchmarks/single_node/dsv4_fp4_b200.sh index df1259deb..6577b7791 100755 --- a/benchmarks/single_node/dsv4_fp4_b200.sh +++ b/benchmarks/single_node/dsv4_fp4_b200.sh @@ -100,6 +100,25 @@ SERVER_PID=$! wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +if [ "${AGENTIC_MODE:-0}" = "1" ]; then + RESULT_DIR="${RESULT_DIR:-$PWD/results}" + mkdir -p "$RESULT_DIR" + cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true + resolve_trace_source + install_agentic_deps + build_replay_cmd "$RESULT_DIR" + echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" + set +e + $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" + REPLAY_RC=${PIPESTATUS[0]} + set -e + write_agentic_result_json "$RESULT_DIR" + python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ + "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true + stop_gpu_monitor + exit "$REPLAY_RC" +fi + pip install -q datasets pandas run_benchmark_serving \ diff --git a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh index 8f43ea8a3..2a053ae8f 100755 --- a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh +++ b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh @@ -186,6 +186,25 @@ SERVER_PID=$! 
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" +if [ "${AGENTIC_MODE:-0}" = "1" ]; then + RESULT_DIR="${RESULT_DIR:-$PWD/results}" + mkdir -p "$RESULT_DIR" + cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true + resolve_trace_source + install_agentic_deps + build_replay_cmd "$RESULT_DIR" + echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" + set +e + $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" + REPLAY_RC=${PIPESTATUS[0]} + set -e + write_agentic_result_json "$RESULT_DIR" + python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ + "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true + stop_gpu_monitor + exit "$REPLAY_RC" +fi + pip install -q datasets pandas run_benchmark_serving \ diff --git a/configs/agentic_slurm_matrix.json b/configs/agentic_slurm_matrix.json new file mode 100644 index 000000000..792e16f38 --- /dev/null +++ b/configs/agentic_slurm_matrix.json @@ -0,0 +1,73 @@ +{ + "defaults": { + "partition_env": "GMI_SLURM_PARTITION", + "account_env": "GMI_SLURM_ACCOUNT", + "results_root_env": "GMI_RESULTS_ROOT", + "container_image_env": "GMI_CONTAINER_IMAGE", + "model_path_env": "GMI_MODEL_PATH", + "time_limit": "04:00:00", + "cpus_per_task": 64, + "gpus_per_node": 8, + "trace_source": "semianalysisai/cc-traces-weka-042026", + "duration_seconds": 1800, + "arrival_modes": ["closed_loop"], + "cache_modes": ["engine_prefix_cache"], + "tenant_modes": ["single_tenant"], + "context_buckets": ["8k", "32k", "64k", "128k"], + "concurrency": [1, 2, 4, 8, 16, 32, 64] + }, + "hardware": { + "b200": { + "enabled": true, + "runner_script": "runners/launch_b200-nb.sh", + "runner_name": "b200-gmi-agentic", + "slurm_nodes": 1, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "sglang", + "topology": "single_node", + "tp": 8, + "ep": 8, + "dp_attention": true, + "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" + }, + "b300": { + "enabled": true, + 
"runner_script": "runners/launch_b300-nv.sh", + "runner_name": "b300-gmi-agentic", + "slurm_nodes": 1, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "sglang", + "topology": "single_node", + "tp": 8, + "ep": 8, + "dp_attention": true, + "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" + }, + "gb200": { + "enabled": true, + "runner_script": "runners/launch_gb200-nv.sh", + "runner_name": "gb200-gmi-agentic", + "slurm_nodes": 5, + "model_prefix": "dsv4", + "precision": "fp4", + "framework": "dynamo-vllm", + "topology": "multi_node_disagg", + "prefill": { + "num_worker": 1, + "tp": 8, + "ep": 8, + "dp_attention": true + }, + "decode": { + "num_worker": 1, + "tp": 8, + "ep": 1, + "dp_attention": false + }, + "config_file": "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" + } + } +} + diff --git a/configs/agentic_slurm_matrix.yaml b/configs/agentic_slurm_matrix.yaml new file mode 100644 index 000000000..7cfeb2be5 --- /dev/null +++ b/configs/agentic_slurm_matrix.yaml @@ -0,0 +1,67 @@ +defaults: + partition_env: GMI_SLURM_PARTITION + account_env: GMI_SLURM_ACCOUNT + results_root_env: GMI_RESULTS_ROOT + container_image_env: GMI_CONTAINER_IMAGE + model_path_env: GMI_MODEL_PATH + time_limit: "04:00:00" + cpus_per_task: 64 + gpus_per_node: 8 + trace_source: "semianalysisai/cc-traces-weka-042026" + duration_seconds: 1800 + arrival_modes: ["closed_loop"] + cache_modes: ["engine_prefix_cache"] + tenant_modes: ["single_tenant"] + context_buckets: ["8k", "32k", "64k", "128k"] + concurrency: [1, 2, 4, 8, 16, 32, 64] + +hardware: + b200: + enabled: true + runner_script: "runners/launch_b200-nb.sh" + runner_name: "b200-gmi-agentic" + slurm_nodes: 1 + model_prefix: "dsv4" + precision: "fp4" + framework: "sglang" + topology: "single_node" + tp: 8 + ep: 8 + dp_attention: true + script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" + + b300: + enabled: true + runner_script: "runners/launch_b300-nv.sh" + 
runner_name: "b300-gmi-agentic" + slurm_nodes: 1 + model_prefix: "dsv4" + precision: "fp4" + framework: "sglang" + topology: "single_node" + tp: 8 + ep: 8 + dp_attention: true + script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" + + gb200: + enabled: true + runner_script: "runners/launch_gb200-nv.sh" + runner_name: "gb200-gmi-agentic" + slurm_nodes: 5 + model_prefix: "dsv4" + precision: "fp4" + framework: "dynamo-vllm" + topology: "multi_node_disagg" + prefill: + num_worker: 1 + tp: 8 + ep: 8 + dp_attention: true + decode: + num_worker: 1 + tp: 8 + ep: 1 + dp_attention: false + config_file: "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" + diff --git a/scripts/run_agentic_slurm_matrix.py b/scripts/run_agentic_slurm_matrix.py new file mode 100755 index 000000000..de8351d5a --- /dev/null +++ b/scripts/run_agentic_slurm_matrix.py @@ -0,0 +1,381 @@ +#!/usr/bin/env python3 +"""Generate and optionally submit a GMI-facing agentic Slurm benchmark matrix. + +The runner is intentionally dry-run-first: it renders sbatch files, a matrix +plan, and an expected artifact contract without claiming GPU behavior. 
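The matrix configs above cross hardware with context buckets, concurrency, and arrival/cache/tenant axes, and the generator derives a short stable job id from a sha1 of each cell's key. A minimal sketch of that expansion, reduced to three axes to show the shape:

```python
import hashlib
from itertools import product

def expand_matrix(hardware, context_buckets, concurrency):
    # Minimal sketch of the sweep expansion in run_agentic_slurm_matrix.py:
    # each (hardware, context, concurrency) cell gets a 10-hex-char job id
    # taken from the sha1 of a "|"-joined key, so re-rendering the same
    # matrix always produces the same job directories.
    jobs = []
    for hw, ctx, conc in product(hardware, context_buckets, concurrency):
        key = "|".join([hw, str(ctx), str(conc)])
        jobs.append({
            "job_id": hashlib.sha1(key.encode()).hexdigest()[:10],
            "hardware": hw,
            "context_bucket": ctx,
            "concurrency": conc,
        })
    return jobs
```

The real generator also folds framework, arrival mode, cache mode, and tenant mode into the key; the stable hash is what lets a dry-run plan and a later submit agree on directory names.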
+""" + +from __future__ import annotations + +import argparse +import hashlib +import json +import os +import re +import subprocess +import sys +from dataclasses import dataclass +from pathlib import Path +from typing import Any + +REPO_ROOT = Path(__file__).resolve().parents[1] +DEFAULT_CONFIG = REPO_ROOT / "configs" / "agentic_slurm_matrix.json" +SBATCH_TEMPLATE = REPO_ROOT / "scripts" / "slurm" / "agentic_job.sbatch.tmpl" + + +def _as_list(value: Any) -> list[Any]: + if value is None: + return [] + if isinstance(value, list): + return value + return [value] + + +def _parse_csv_ints(value: str | None) -> list[int] | None: + if not value: + return None + return [int(item.strip()) for item in value.split(",") if item.strip()] + + +def _parse_csv_strings(value: str | None) -> list[str] | None: + if not value: + return None + return [item.strip() for item in value.split(",") if item.strip()] + + +def _slug(value: str) -> str: + value = value.lower() + value = re.sub(r"[^a-z0-9._-]+", "-", value) + return value.strip("-") + + +def _load_config(path: Path) -> dict[str, Any]: + with path.open() as handle: + if path.suffix == ".json": + data = json.load(handle) + else: + try: + import yaml # type: ignore + except ModuleNotFoundError as exc: + raise RuntimeError( + f"{path} requires PyYAML. Use the default JSON config or install pyyaml." 
+ ) from exc + data = yaml.safe_load(handle) + if not isinstance(data, dict): + raise ValueError(f"{path} must contain a YAML mapping") + return data + + +@dataclass(frozen=True) +class MatrixJob: + job_id: str + hardware: str + framework: str + topology: str + context_bucket: str + concurrency: int + arrival_mode: str + cache_mode: str + tenant_mode: str + duration_seconds: int + runner_script: str + runner_name: str + slurm_nodes: int + gpus_per_node: int + cpus_per_task: int + time_limit: str + model_prefix: str + precision: str + tp: int + ep: int + dp_attention: bool + disagg: bool + config_file: str + is_multinode: bool + prefill_num_workers: int + prefill_tp: int + prefill_ep: int + prefill_dp_attention: bool + decode_num_workers: int + decode_tp: int + decode_ep: int + decode_dp_attention: bool + trace_source: str + + @property + def exp_name(self) -> str: + return ( + f"{self.model_prefix}_{self.hardware}_{self.framework}_" + f"{self.context_bucket}_conc{self.concurrency}" + ) + + @property + def result_filename(self) -> str: + return _slug( + f"agentic_{self.exp_name}_{self.arrival_mode}_" + f"{self.cache_mode}_{self.tenant_mode}" + ) + + def to_dict(self) -> dict[str, Any]: + return { + "job_id": self.job_id, + "hardware": self.hardware, + "framework": self.framework, + "topology": self.topology, + "context_bucket": self.context_bucket, + "concurrency": self.concurrency, + "arrival_mode": self.arrival_mode, + "cache_mode": self.cache_mode, + "tenant_mode": self.tenant_mode, + "duration_seconds": self.duration_seconds, + "runner_script": self.runner_script, + "runner_name": self.runner_name, + "slurm_nodes": self.slurm_nodes, + "gpus_per_node": self.gpus_per_node, + "cpus_per_task": self.cpus_per_task, + "time_limit": self.time_limit, + "model_prefix": self.model_prefix, + "precision": self.precision, + "tp": self.tp, + "ep": self.ep, + "dp_attention": self.dp_attention, + "disagg": self.disagg, + "config_file": self.config_file, + "is_multinode": 
self.is_multinode, + "trace_source": self.trace_source, + "result_filename": self.result_filename, + "exp_name": self.exp_name, + "prefill_num_workers": self.prefill_num_workers, + "prefill_tp": self.prefill_tp, + "prefill_ep": self.prefill_ep, + "prefill_dp_attention": self.prefill_dp_attention, + "decode_num_workers": self.decode_num_workers, + "decode_tp": self.decode_tp, + "decode_ep": self.decode_ep, + "decode_dp_attention": self.decode_dp_attention, + } + + +def expand_jobs(config: dict[str, Any], args: argparse.Namespace) -> list[MatrixJob]: + defaults = config.get("defaults", {}) + hardware_cfg = config.get("hardware", {}) + if not isinstance(hardware_cfg, dict): + raise ValueError("hardware must be a mapping") + + selected_hw = set(_parse_csv_strings(args.hardware) or hardware_cfg.keys()) + contexts = _parse_csv_strings(args.context_buckets) or _as_list(defaults.get("context_buckets")) + concurrencies = _parse_csv_ints(args.concurrency) or _as_list(defaults.get("concurrency")) + arrival_modes = _parse_csv_strings(args.arrival_modes) or _as_list(defaults.get("arrival_modes")) + cache_modes = _parse_csv_strings(args.cache_modes) or _as_list(defaults.get("cache_modes")) + tenant_modes = _parse_csv_strings(args.tenant_modes) or _as_list(defaults.get("tenant_modes")) + + jobs: list[MatrixJob] = [] + for hardware, hw in hardware_cfg.items(): + if hardware not in selected_hw or not hw.get("enabled", True): + continue + runner_script = REPO_ROOT / str(hw["runner_script"]) + if not runner_script.exists(): + raise FileNotFoundError(f"runner_script not found for {hardware}: {runner_script}") + script_expected = hw.get("script_expected") + if script_expected and not (REPO_ROOT / str(script_expected)).exists(): + raise FileNotFoundError(f"script_expected not found for {hardware}: {script_expected}") + + is_multinode = hw.get("topology") == "multi_node_disagg" + prefill = hw.get("prefill", {}) + decode = hw.get("decode", {}) + tp = int(hw.get("tp", prefill.get("tp", 
1))) + ep = int(hw.get("ep", prefill.get("ep", 1))) + dp_attention = bool(hw.get("dp_attention", prefill.get("dp_attention", False))) + + for context_bucket in contexts: + for concurrency in concurrencies: + for arrival_mode in arrival_modes: + for cache_mode in cache_modes: + for tenant_mode in tenant_modes: + key = "|".join( + [ + hardware, + str(hw["framework"]), + str(context_bucket), + str(concurrency), + str(arrival_mode), + str(cache_mode), + str(tenant_mode), + ] + ) + job_id = hashlib.sha1(key.encode()).hexdigest()[:10] + jobs.append( + MatrixJob( + job_id=job_id, + hardware=hardware, + framework=str(hw["framework"]), + topology=str(hw["topology"]), + context_bucket=str(context_bucket), + concurrency=int(concurrency), + arrival_mode=str(arrival_mode), + cache_mode=str(cache_mode), + tenant_mode=str(tenant_mode), + duration_seconds=int(args.duration or defaults.get("duration_seconds", 1800)), + runner_script=str(hw["runner_script"]), + runner_name=str(hw["runner_name"]), + slurm_nodes=int(hw.get("slurm_nodes", 1)), + gpus_per_node=int(defaults.get("gpus_per_node", 8)), + cpus_per_task=int(defaults.get("cpus_per_task", 64)), + time_limit=str(defaults.get("time_limit", "04:00:00")), + model_prefix=str(hw["model_prefix"]), + precision=str(hw["precision"]), + tp=tp, + ep=ep, + dp_attention=dp_attention, + disagg=is_multinode, + config_file=str(hw.get("config_file", "")), + is_multinode=is_multinode, + prefill_num_workers=int(prefill.get("num_worker", 0)), + prefill_tp=int(prefill.get("tp", 0)), + prefill_ep=int(prefill.get("ep", 0)), + prefill_dp_attention=bool(prefill.get("dp_attention", False)), + decode_num_workers=int(decode.get("num_worker", 0)), + decode_tp=int(decode.get("tp", 0)), + decode_ep=int(decode.get("ep", 0)), + decode_dp_attention=bool(decode.get("dp_attention", False)), + trace_source=str(defaults.get("trace_source", "")), + ) + ) + + if args.max_jobs is not None: + jobs = jobs[: args.max_jobs] + return jobs + + +def expected_paths(job: 
MatrixJob) -> list[str]: + return [ + f"{job.job_id}/{job.result_filename}.json", + f"{job.job_id}/results/benchmark.log", + f"{job.job_id}/results/benchmark_command.txt", + f"{job.job_id}/results/server.log", + f"{job.job_id}/results/trace_replay/detailed_results.csv", + f"{job.job_id}/results/trace_replay/debug_trace.jsonl", + f"{job.job_id}/preflight.log", + f"{job.job_id}/provenance_preflight.jsonl", + ] + + +def render_sbatch(job: MatrixJob, config: dict[str, Any], results_root: Path, dry_run_guard: bool) -> str: + defaults = config.get("defaults", {}) + template = SBATCH_TEMPLATE.read_text() + job_dir = results_root / job.job_id + values = { + **job.to_dict(), + "job_name": f"agentic-{job.hardware}-{job.job_id}", + "job_dir": str(job_dir), + "partition_env": defaults.get("partition_env", "GMI_SLURM_PARTITION"), + "account_env": defaults.get("account_env", "GMI_SLURM_ACCOUNT"), + "results_root_env": defaults.get("results_root_env", "GMI_RESULTS_ROOT"), + "container_image_env": defaults.get("container_image_env", "GMI_CONTAINER_IMAGE"), + "model_path_env": defaults.get("model_path_env", "GMI_MODEL_PATH"), + "dp_attention": str(job.dp_attention).lower(), + "disagg": str(job.disagg).lower(), + "is_multinode": "1" if job.is_multinode else "0", + "prefill_dp_attention": str(job.prefill_dp_attention).lower(), + "decode_dp_attention": str(job.decode_dp_attention).lower(), + "dry_run_guard": "1" if dry_run_guard else "0", + } + rendered = template + for key, value in values.items(): + rendered = rendered.replace("{" + key + "}", str(value)) + return rendered + + +def write_outputs(config: dict[str, Any], jobs: list[MatrixJob], results_root: Path, dry_run_guard: bool) -> None: + sbatch_dir = results_root / "sbatch" + sbatch_dir.mkdir(parents=True, exist_ok=True) + for job in jobs: + job_dir = results_root / job.job_id + job_dir.mkdir(parents=True, exist_ok=True) + sbatch_text = render_sbatch(job, config, results_root, dry_run_guard) + (sbatch_dir / 
f"{job.job_id}.sbatch").write_text(sbatch_text) + + plan = { + "scenario": "agentic-coding", + "total_jobs": len(jobs), + "jobs": [job.to_dict() for job in jobs], + } + (results_root / "matrix_plan.json").write_text(json.dumps(plan, indent=2) + "\n") + + contract = { + "scenario": "agentic-coding", + "total_jobs": len(jobs), + "per_job": [ + { + "job_id": job.job_id, + "result_filename": job.result_filename, + "expected_paths": expected_paths(job), + "required_before_claiming_success": [ + f"{job.job_id}/{job.result_filename}.json", + f"{job.job_id}/results/trace_replay/detailed_results.csv", + f"{job.job_id}/preflight.log", + f"{job.job_id}/provenance_preflight.jsonl", + ], + } + for job in jobs + ], + } + (results_root / "expected_artifact_contract.json").write_text(json.dumps(contract, indent=2) + "\n") + + +def submit_jobs(config: dict[str, Any], results_root: Path, jobs: list[MatrixJob]) -> None: + defaults = config.get("defaults", {}) + partition_env = defaults.get("partition_env", "GMI_SLURM_PARTITION") + account_env = defaults.get("account_env", "GMI_SLURM_ACCOUNT") + partition = os.environ.get(partition_env) + account = os.environ.get(account_env) + if not partition: + raise RuntimeError(f"{partition_env} must be set when --submit is used") + for job in jobs: + sbatch_path = results_root / "sbatch" / f"{job.job_id}.sbatch" + cmd = ["sbatch", "--partition", partition] + if account: + cmd.extend(["--account", account]) + cmd.append(str(sbatch_path)) + subprocess.run(cmd, check=True) + + +def build_parser() -> argparse.ArgumentParser: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument("--config", type=Path, default=DEFAULT_CONFIG) + parser.add_argument("--results-root", type=Path, default=Path(os.environ.get("GMI_RESULTS_ROOT", "agentic-slurm-results"))) + parser.add_argument("--hardware", help="Comma-separated hardware filter, e.g. 
b200,b300") + parser.add_argument("--context-buckets", help="Comma-separated context buckets") + parser.add_argument("--concurrency", help="Comma-separated concurrency values") + parser.add_argument("--arrival-modes", help="Comma-separated arrival modes") + parser.add_argument("--cache-modes", help="Comma-separated cache modes") + parser.add_argument("--tenant-modes", help="Comma-separated tenant modes") + parser.add_argument("--duration", type=int) + parser.add_argument("--max-jobs", type=int) + parser.add_argument("--dry-run", action="store_true", help="Render files only") + parser.add_argument("--submit", action="store_true", help="Submit rendered sbatch jobs") + return parser + + +def main(argv: list[str] | None = None) -> int: + parser = build_parser() + args = parser.parse_args(argv) + if args.submit and args.dry_run: + parser.error("--submit and --dry-run are mutually exclusive") + + config = _load_config(args.config) + jobs = expand_jobs(config, args) + args.results_root.mkdir(parents=True, exist_ok=True) + write_outputs(config, jobs, args.results_root, dry_run_guard=args.dry_run or not args.submit) + + if args.submit: + submit_jobs(config, args.results_root, jobs) + + print(f"Wrote {len(jobs)} agentic Slurm jobs to {args.results_root}") + print(f"Matrix plan: {args.results_root / 'matrix_plan.json'}") + print(f"Artifact contract: {args.results_root / 'expected_artifact_contract.json'}") + return 0 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/scripts/slurm/agentic_job.sbatch.tmpl b/scripts/slurm/agentic_job.sbatch.tmpl new file mode 100644 index 000000000..b3b052c38 --- /dev/null +++ b/scripts/slurm/agentic_job.sbatch.tmpl @@ -0,0 +1,89 @@ +#!/usr/bin/env bash +#SBATCH --job-name={job_name} +#SBATCH --nodes={slurm_nodes} +#SBATCH --gpus-per-node={gpus_per_node} +#SBATCH --cpus-per-task={cpus_per_task} +#SBATCH --time={time_limit} +#SBATCH --output={job_dir}/slurm-%j.out +#SBATCH --error={job_dir}/slurm-%j.err + +set -euo pipefail + +mkdir 
-p "{job_dir}" + +required_env=( + "{partition_env}" + "{results_root_env}" + "{container_image_env}" + "{model_path_env}" +) +for env_name in "${required_env[@]}"; do + if [[ -z "${!env_name:-}" ]]; then + echo "FATAL: required environment variable ${env_name} is not set" >&2 + exit 2 + fi +done + +export IMAGE="${{container_image_env}}" +export MODEL="${{model_path_env}}" +export GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-$(pwd)}" +export MODEL_PREFIX="{model_prefix}" +export PRECISION="{precision}" +export FRAMEWORK="{framework}" +export RUNNER_NAME="{runner_name}" +export RUNNER_TYPE="{hardware}" +export EXP_NAME="{exp_name}" +export RESULT_FILENAME="{result_filename}" +export RESULT_DIR="{job_dir}/results" +export AGENTIC_OUTPUT_DIR="{job_dir}" +export SCENARIO_TYPE="agentic-coding" +export SCENARIO_SUBDIR="agentic/" +export IS_AGENTIC="1" +export CONC="{concurrency}" +export DURATION="{duration_seconds}" +export TRACE_SOURCE="{trace_source}" +export AGENTIC_CONTEXT_BUCKET="{context_bucket}" +export AGENTIC_ARRIVAL_MODE="{arrival_mode}" +export AGENTIC_CACHE_MODE="{cache_mode}" +export AGENTIC_TENANT_MODE="{tenant_mode}" +export TP="{tp}" +export EP_SIZE="{ep}" +export DP_ATTENTION="{dp_attention}" +export SPEC_DECODING="none" +export DISAGG="{disagg}" +export CONFIG_FILE="{config_file}" +export IS_MULTINODE="{is_multinode}" +export PREFILL_NUM_WORKERS="{prefill_num_workers}" +export PREFILL_TP="{prefill_tp}" +export PREFILL_EP="{prefill_ep}" +export PREFILL_DP_ATTN="{prefill_dp_attention}" +export DECODE_NUM_WORKERS="{decode_num_workers}" +export DECODE_TP="{decode_tp}" +export DECODE_EP="{decode_ep}" +export DECODE_DP_ATTN="{decode_dp_attention}" + +cat > "{job_dir}/provenance_preflight.jsonl" < "{job_dir}/preflight.log" 2>&1 + +if [[ "{dry_run_guard}" == "1" ]]; then + echo "Dry-run sbatch rendered successfully; not executing runner." 
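The sbatch template uses plain `{key}` placeholders that `render_sbatch` fills with repeated `str.replace`. Shell indirection such as `${{container_image_env}}` survives rendering because only the inner `{key}` span is substituted. A minimal sketch of that renderer:

```python
def render_template(template: str, values: dict) -> str:
    # Same substitution scheme as render_sbatch in run_agentic_slurm_matrix.py:
    # every "{key}" is replaced literally, so a template fragment such as
    # ${{container_image_env}} renders to a shell indirection like
    # ${GMI_CONTAINER_IMAGE} -- only the inner {key} braces are consumed.
    rendered = template
    for key, value in values.items():
        rendered = rendered.replace("{" + key + "}", str(value))
    return rendered
```

This is deliberately simpler than `str.format`: unmatched braces in the embedded shell script pass through untouched instead of raising `KeyError`.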
+ exit 0 +fi + +mkdir -p "$RESULT_DIR" +bash "{runner_script}" diff --git a/utils/test_agentic_slurm_matrix.py b/utils/test_agentic_slurm_matrix.py new file mode 100644 index 000000000..40c332a63 --- /dev/null +++ b/utils/test_agentic_slurm_matrix.py @@ -0,0 +1,84 @@ +import importlib.util +import json +import sys +from pathlib import Path + + +REPO_ROOT = Path(__file__).resolve().parents[1] +SCRIPT = REPO_ROOT / "scripts" / "run_agentic_slurm_matrix.py" + + +def load_runner(): + spec = importlib.util.spec_from_file_location("run_agentic_slurm_matrix", SCRIPT) + module = importlib.util.module_from_spec(spec) + assert spec.loader is not None + sys.modules[spec.name] = module + spec.loader.exec_module(module) + return module + + +def test_agentic_slurm_dry_run_writes_plan_contract_and_sbatch(tmp_path): + runner = load_runner() + rc = runner.main( + [ + "--dry-run", + "--results-root", + str(tmp_path), + "--hardware", + "b200", + "--context-buckets", + "8k", + "--concurrency", + "1,2", + ] + ) + assert rc == 0 + + plan = json.loads((tmp_path / "matrix_plan.json").read_text()) + assert plan["scenario"] == "agentic-coding" + assert plan["total_jobs"] == 2 + assert {job["concurrency"] for job in plan["jobs"]} == {1, 2} + assert all(job["model_prefix"] == "dsv4" for job in plan["jobs"]) + + contract = json.loads((tmp_path / "expected_artifact_contract.json").read_text()) + assert contract["total_jobs"] == 2 + required = contract["per_job"][0]["required_before_claiming_success"] + assert any(path.endswith("/trace_replay/detailed_results.csv") for path in required) + assert any(path.endswith("/provenance_preflight.jsonl") for path in required) + + sbatch_files = sorted((tmp_path / "sbatch").glob("*.sbatch")) + assert len(sbatch_files) == 2 + rendered = sbatch_files[0].read_text() + assert 'SCENARIO_TYPE="agentic-coding"' in rendered + assert 'MODEL_PREFIX="dsv4"' in rendered + assert 'TRACE_SOURCE="semianalysisai/cc-traces-weka-042026"' in rendered + assert "nvidia-smi topo 
-m" in rendered + assert "all_reduce_perf" in rendered + assert "Dry-run sbatch rendered successfully" in rendered + + +def test_agentic_slurm_matrix_can_filter_gb200_multinode(tmp_path): + runner = load_runner() + rc = runner.main( + [ + "--dry-run", + "--results-root", + str(tmp_path), + "--hardware", + "gb200", + "--context-buckets", + "8k", + "--concurrency", + "1", + "--max-jobs", + "1", + ] + ) + assert rc == 0 + + plan = json.loads((tmp_path / "matrix_plan.json").read_text()) + assert plan["total_jobs"] == 1 + job = plan["jobs"][0] + assert job["hardware"] == "gb200" + assert job["is_multinode"] is True + assert job["config_file"].endswith("disagg-gb200-low-latency.yaml") From 7d7d2624a61c118737557aa66df97af07be095d2 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:36:36 -0700 Subject: [PATCH 4/5] Revert "feat(agentic): add GMI DSV4 Slurm harness [skip-sweep]" This reverts commit bfea80549194a4649124f9df73c46f5682a33e40. 
--- .github/configs/nvidia-master.yaml | 8 - AGENTIC_TRUTH_MATRIX.md | 20 +- benchmarks/benchmark_lib.sh | 2 +- .../agentic/dsv4_fp4_b200_sglang.sh | 17 - .../agentic/dsv4_fp4_b300_sglang.sh | 17 - benchmarks/single_node/dsv4_fp4_b200.sh | 19 - .../single_node/dsv4_fp4_b300_sglang.sh | 19 - configs/agentic_slurm_matrix.json | 73 ---- configs/agentic_slurm_matrix.yaml | 67 --- scripts/run_agentic_slurm_matrix.py | 381 ------------------ scripts/slurm/agentic_job.sbatch.tmpl | 89 ---- utils/test_agentic_slurm_matrix.py | 84 ---- 12 files changed, 11 insertions(+), 785 deletions(-) delete mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh delete mode 100755 benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh delete mode 100644 configs/agentic_slurm_matrix.json delete mode 100644 configs/agentic_slurm_matrix.yaml delete mode 100755 scripts/run_agentic_slurm_matrix.py delete mode 100644 scripts/slurm/agentic_job.sbatch.tmpl delete mode 100644 utils/test_agentic_slurm_matrix.py diff --git a/.github/configs/nvidia-master.yaml b/.github/configs/nvidia-master.yaml index 3b31a65e8..38d1101f3 100644 --- a/.github/configs/nvidia-master.yaml +++ b/.github/configs/nvidia-master.yaml @@ -1732,10 +1732,6 @@ dsv4-fp4-b200-sglang: - { tp: 8, ep: 8, dp-attn: true, conc-start: 64, conc-end: 128 } # DP-attention (DP_ATTENTION=true) — max-throughput CONC range - { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 512 } - agentic-coding: - - duration: 1800 - search-space: - - { tp: 8, ep: 8, dp-attn: true, offloading: none, conc-list: [1, 2, 4, 8, 16, 32, 64] } dsv4-fp4-b200-vllm: image: vllm/vllm-openai:v0.20.0-cu130 @@ -1955,10 +1951,6 @@ dsv4-fp4-b300-sglang: - { tp: 4, ep: 4, dp-attn: true, conc-start: 512, conc-end: 512 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 2048, conc-end: 2048 } - { tp: 8, ep: 8, dp-attn: true, conc-start: 4096, conc-end: 4096 } - agentic-coding: - - duration: 1800 - search-space: - - { tp: 8, ep: 8, dp-attn: true, offloading: 
none, conc-list: [1, 2, 4, 8, 16, 32, 64] } # DeepSeek-V4-Pro on B300 with EAGLE/MTP speculative decoding. Recipe is # selected inside benchmarks/single_node/dsv4_fp4_b300_sglang_mtp.sh by diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md index b23f598b1..fdd91f139 100644 --- a/AGENTIC_TRUTH_MATRIX.md +++ b/AGENTIC_TRUTH_MATRIX.md @@ -124,8 +124,8 @@ The current use case is: | DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | | DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | | DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 B200/B300 SGLang agentic trace replay | Wired, needs live validation | Added DSV4 single-node agentic wrappers and conservative `agentic-coding` config rows. | -| DeepSeek-V4 GB200 agentic trace replay | Dry-run harness only | Portable Slurm matrix can render GB200 agentic jobs, but live srt-slurm recipe behavior is still unproven. | +| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | +| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | | LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | ## What A GMI/Neocloud Platform Engineer Actually Cares About @@ -161,11 +161,11 @@ Legend: |---|---:|---|---| | Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | | WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Run live DSV4 B200/B300 validation. 
| +| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | | Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B200 Slurm and attach artifacts. | -| DeepSeek-V4 B300 agentic | Partial | Config + launcher are wired; no live GPU artifact yet. | Run on B300 Slurm and attach artifacts. | -| DeepSeek-V4 GB200 agentic | Partial | Portable matrix renders GB200 jobs; live srt-slurm behavior remains unproven. | Add/validate GB200 agentic srt-slurm recipe on real hardware. | +| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | +| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. | +| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | | B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | | vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | | Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | @@ -205,10 +205,10 @@ This is the minimum useful matrix for a GMI cloud engineer evaluating long-conte | Priority | Build item | Acceptance criteria | |---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300 | Implemented; requires live run artifacts. 
| -| P0 | DSV4 agentic launchers for B200/B300 | Implemented; reuses existing SGLang server recipes and switches the client to WEKA trace replay. | -| P0 | Portable Slurm matrix runner | Implemented for dry-run/submit; no hardcoded cluster IDs; all cluster settings via env/JSON/YAML. | -| P0 | Artifact contract | Implemented expected-path manifest; live runs must still produce the artifacts before claims. | +| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | +| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | +| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | +| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | | P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | | P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | | P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. 
| diff --git a/benchmarks/benchmark_lib.sh b/benchmarks/benchmark_lib.sh index 994111bad..4c0c8642e 100644 --- a/benchmarks/benchmark_lib.sh +++ b/benchmarks/benchmark_lib.sh @@ -892,7 +892,7 @@ ensure_hf_cli() { } resolve_trace_source() { - local dataset="${TRACE_SOURCE:-semianalysisai/cc-traces-weka-042026}" + local dataset="semianalysisai/cc-traces-weka-042026" TRACE_SOURCE_FLAG="--hf-dataset $dataset" echo "Loading traces from Hugging Face dataset: $dataset" # Pre-download the dataset into the shared HF_HUB_CACHE (same mount used diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh deleted file mode 100755 index dcac3bdb3..000000000 --- a/benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B200 with SGLang. -# The server recipe lives in ../dsv4_fp4_b200.sh; AGENTIC_MODE switches the -# post-ready client from fixed random prompts to WEKA trace replay. - -export AGENTIC_MODE=1 -export ISL="${ISL:-8192}" -export OSL="${OSL:-1024}" -export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}" -export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b200_sglang}" - -REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)" -export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}" - -exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b200.sh" diff --git a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh deleted file mode 100755 index a5dc2387c..000000000 --- a/benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh +++ /dev/null @@ -1,17 +0,0 @@ -#!/usr/bin/env bash -set -euo pipefail - -# Agentic trace replay wrapper for DeepSeek-V4-Pro FP4 on B300 with SGLang. 
-# The server recipe lives in ../dsv4_fp4_b300_sglang.sh; AGENTIC_MODE switches
-# the post-ready client from fixed random prompts to WEKA trace replay.
-
-export AGENTIC_MODE=1
-export ISL="${ISL:-8192}"
-export OSL="${OSL:-1024}"
-export RANDOM_RANGE_RATIO="${RANDOM_RANGE_RATIO:-1}"
-export RESULT_FILENAME="${RESULT_FILENAME:-agentic_dsv4_fp4_b300_sglang}"
-
-REPO_ROOT="$(cd "$(dirname "$0")/../../.." && pwd)"
-export INFMAX_CONTAINER_WORKSPACE="${INFMAX_CONTAINER_WORKSPACE:-$REPO_ROOT}"
-
-exec "$REPO_ROOT/benchmarks/single_node/dsv4_fp4_b300_sglang.sh"
diff --git a/benchmarks/single_node/dsv4_fp4_b200.sh b/benchmarks/single_node/dsv4_fp4_b200.sh
index 6577b7791..df1259deb 100755
--- a/benchmarks/single_node/dsv4_fp4_b200.sh
+++ b/benchmarks/single_node/dsv4_fp4_b200.sh
@@ -100,25 +100,6 @@ SERVER_PID=$!
 
 wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"
 
-if [ "${AGENTIC_MODE:-0}" = "1" ]; then
-  RESULT_DIR="${RESULT_DIR:-$PWD/results}"
-  mkdir -p "$RESULT_DIR"
-  cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true
-  resolve_trace_source
-  install_agentic_deps
-  build_replay_cmd "$RESULT_DIR"
-  echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt"
-  set +e
-  $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log"
-  REPLAY_RC=${PIPESTATUS[0]}
-  set -e
-  write_agentic_result_json "$RESULT_DIR"
-  python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \
-    "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true
-  stop_gpu_monitor
-  exit "$REPLAY_RC"
-fi
-
 pip install -q datasets pandas
 
 run_benchmark_serving \
diff --git a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
index 2a053ae8f..8f43ea8a3 100755
--- a/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
+++ b/benchmarks/single_node/dsv4_fp4_b300_sglang.sh
@@ -186,25 +186,6 @@ SERVER_PID=$!
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" -if [ "${AGENTIC_MODE:-0}" = "1" ]; then - RESULT_DIR="${RESULT_DIR:-$PWD/results}" - mkdir -p "$RESULT_DIR" - cp "$SERVER_LOG" "$RESULT_DIR/server.log" 2>/dev/null || true - resolve_trace_source - install_agentic_deps - build_replay_cmd "$RESULT_DIR" - echo "$REPLAY_CMD" > "$RESULT_DIR/benchmark_command.txt" - set +e - $REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" - REPLAY_RC=${PIPESTATUS[0]} - set -e - write_agentic_result_json "$RESULT_DIR" - python3 "$AGENTIC_DIR/scripts/analyze_benchmark_distributions.py" \ - "$RESULT_DIR/trace_replay" -o "$RESULT_DIR" 2>&1 || true - stop_gpu_monitor - exit "$REPLAY_RC" -fi - pip install -q datasets pandas run_benchmark_serving \ diff --git a/configs/agentic_slurm_matrix.json b/configs/agentic_slurm_matrix.json deleted file mode 100644 index 792e16f38..000000000 --- a/configs/agentic_slurm_matrix.json +++ /dev/null @@ -1,73 +0,0 @@ -{ - "defaults": { - "partition_env": "GMI_SLURM_PARTITION", - "account_env": "GMI_SLURM_ACCOUNT", - "results_root_env": "GMI_RESULTS_ROOT", - "container_image_env": "GMI_CONTAINER_IMAGE", - "model_path_env": "GMI_MODEL_PATH", - "time_limit": "04:00:00", - "cpus_per_task": 64, - "gpus_per_node": 8, - "trace_source": "semianalysisai/cc-traces-weka-042026", - "duration_seconds": 1800, - "arrival_modes": ["closed_loop"], - "cache_modes": ["engine_prefix_cache"], - "tenant_modes": ["single_tenant"], - "context_buckets": ["8k", "32k", "64k", "128k"], - "concurrency": [1, 2, 4, 8, 16, 32, 64] - }, - "hardware": { - "b200": { - "enabled": true, - "runner_script": "runners/launch_b200-nb.sh", - "runner_name": "b200-gmi-agentic", - "slurm_nodes": 1, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "sglang", - "topology": "single_node", - "tp": 8, - "ep": 8, - "dp_attention": true, - "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" - }, - "b300": { - "enabled": true, - 
"runner_script": "runners/launch_b300-nv.sh", - "runner_name": "b300-gmi-agentic", - "slurm_nodes": 1, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "sglang", - "topology": "single_node", - "tp": 8, - "ep": 8, - "dp_attention": true, - "script_expected": "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" - }, - "gb200": { - "enabled": true, - "runner_script": "runners/launch_gb200-nv.sh", - "runner_name": "gb200-gmi-agentic", - "slurm_nodes": 5, - "model_prefix": "dsv4", - "precision": "fp4", - "framework": "dynamo-vllm", - "topology": "multi_node_disagg", - "prefill": { - "num_worker": 1, - "tp": 8, - "ep": 8, - "dp_attention": true - }, - "decode": { - "num_worker": 1, - "tp": 8, - "ep": 1, - "dp_attention": false - }, - "config_file": "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" - } - } -} - diff --git a/configs/agentic_slurm_matrix.yaml b/configs/agentic_slurm_matrix.yaml deleted file mode 100644 index 7cfeb2be5..000000000 --- a/configs/agentic_slurm_matrix.yaml +++ /dev/null @@ -1,67 +0,0 @@ -defaults: - partition_env: GMI_SLURM_PARTITION - account_env: GMI_SLURM_ACCOUNT - results_root_env: GMI_RESULTS_ROOT - container_image_env: GMI_CONTAINER_IMAGE - model_path_env: GMI_MODEL_PATH - time_limit: "04:00:00" - cpus_per_task: 64 - gpus_per_node: 8 - trace_source: "semianalysisai/cc-traces-weka-042026" - duration_seconds: 1800 - arrival_modes: ["closed_loop"] - cache_modes: ["engine_prefix_cache"] - tenant_modes: ["single_tenant"] - context_buckets: ["8k", "32k", "64k", "128k"] - concurrency: [1, 2, 4, 8, 16, 32, 64] - -hardware: - b200: - enabled: true - runner_script: "runners/launch_b200-nb.sh" - runner_name: "b200-gmi-agentic" - slurm_nodes: 1 - model_prefix: "dsv4" - precision: "fp4" - framework: "sglang" - topology: "single_node" - tp: 8 - ep: 8 - dp_attention: true - script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh" - - b300: - enabled: true - runner_script: "runners/launch_b300-nv.sh" - 
runner_name: "b300-gmi-agentic" - slurm_nodes: 1 - model_prefix: "dsv4" - precision: "fp4" - framework: "sglang" - topology: "single_node" - tp: 8 - ep: 8 - dp_attention: true - script_expected: "benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh" - - gb200: - enabled: true - runner_script: "runners/launch_gb200-nv.sh" - runner_name: "gb200-gmi-agentic" - slurm_nodes: 5 - model_prefix: "dsv4" - precision: "fp4" - framework: "dynamo-vllm" - topology: "multi_node_disagg" - prefill: - num_worker: 1 - tp: 8 - ep: 8 - dp_attention: true - decode: - num_worker: 1 - tp: 8 - ep: 1 - dp_attention: false - config_file: "recipes/vllm/deepseek-v4/8k1k/disagg-gb200-low-latency.yaml" - diff --git a/scripts/run_agentic_slurm_matrix.py b/scripts/run_agentic_slurm_matrix.py deleted file mode 100755 index de8351d5a..000000000 --- a/scripts/run_agentic_slurm_matrix.py +++ /dev/null @@ -1,381 +0,0 @@ -#!/usr/bin/env python3 -"""Generate and optionally submit a GMI-facing agentic Slurm benchmark matrix. - -The runner is intentionally dry-run-first: it renders sbatch files, a matrix -plan, and an expected artifact contract without claiming GPU behavior. 
-"""
-
-from __future__ import annotations
-
-import argparse
-import hashlib
-import json
-import os
-import re
-import subprocess
-import sys
-from dataclasses import dataclass
-from pathlib import Path
-from typing import Any
-
-REPO_ROOT = Path(__file__).resolve().parents[1]
-DEFAULT_CONFIG = REPO_ROOT / "configs" / "agentic_slurm_matrix.json"
-SBATCH_TEMPLATE = REPO_ROOT / "scripts" / "slurm" / "agentic_job.sbatch.tmpl"
-
-
-def _as_list(value: Any) -> list[Any]:
-    if value is None:
-        return []
-    if isinstance(value, list):
-        return value
-    return [value]
-
-
-def _parse_csv_ints(value: str | None) -> list[int] | None:
-    if not value:
-        return None
-    return [int(item.strip()) for item in value.split(",") if item.strip()]
-
-
-def _parse_csv_strings(value: str | None) -> list[str] | None:
-    if not value:
-        return None
-    return [item.strip() for item in value.split(",") if item.strip()]
-
-
-def _slug(value: str) -> str:
-    value = value.lower()
-    value = re.sub(r"[^a-z0-9._-]+", "-", value)
-    return value.strip("-")
-
-
-def _load_config(path: Path) -> dict[str, Any]:
-    with path.open() as handle:
-        if path.suffix == ".json":
-            data = json.load(handle)
-        else:
-            try:
-                import yaml  # type: ignore
-            except ModuleNotFoundError as exc:
-                raise RuntimeError(
-                    f"{path} requires PyYAML. Use the default JSON config or install pyyaml."
- ) from exc - data = yaml.safe_load(handle) - if not isinstance(data, dict): - raise ValueError(f"{path} must contain a YAML mapping") - return data - - -@dataclass(frozen=True) -class MatrixJob: - job_id: str - hardware: str - framework: str - topology: str - context_bucket: str - concurrency: int - arrival_mode: str - cache_mode: str - tenant_mode: str - duration_seconds: int - runner_script: str - runner_name: str - slurm_nodes: int - gpus_per_node: int - cpus_per_task: int - time_limit: str - model_prefix: str - precision: str - tp: int - ep: int - dp_attention: bool - disagg: bool - config_file: str - is_multinode: bool - prefill_num_workers: int - prefill_tp: int - prefill_ep: int - prefill_dp_attention: bool - decode_num_workers: int - decode_tp: int - decode_ep: int - decode_dp_attention: bool - trace_source: str - - @property - def exp_name(self) -> str: - return ( - f"{self.model_prefix}_{self.hardware}_{self.framework}_" - f"{self.context_bucket}_conc{self.concurrency}" - ) - - @property - def result_filename(self) -> str: - return _slug( - f"agentic_{self.exp_name}_{self.arrival_mode}_" - f"{self.cache_mode}_{self.tenant_mode}" - ) - - def to_dict(self) -> dict[str, Any]: - return { - "job_id": self.job_id, - "hardware": self.hardware, - "framework": self.framework, - "topology": self.topology, - "context_bucket": self.context_bucket, - "concurrency": self.concurrency, - "arrival_mode": self.arrival_mode, - "cache_mode": self.cache_mode, - "tenant_mode": self.tenant_mode, - "duration_seconds": self.duration_seconds, - "runner_script": self.runner_script, - "runner_name": self.runner_name, - "slurm_nodes": self.slurm_nodes, - "gpus_per_node": self.gpus_per_node, - "cpus_per_task": self.cpus_per_task, - "time_limit": self.time_limit, - "model_prefix": self.model_prefix, - "precision": self.precision, - "tp": self.tp, - "ep": self.ep, - "dp_attention": self.dp_attention, - "disagg": self.disagg, - "config_file": self.config_file, - "is_multinode": 
self.is_multinode, - "trace_source": self.trace_source, - "result_filename": self.result_filename, - "exp_name": self.exp_name, - "prefill_num_workers": self.prefill_num_workers, - "prefill_tp": self.prefill_tp, - "prefill_ep": self.prefill_ep, - "prefill_dp_attention": self.prefill_dp_attention, - "decode_num_workers": self.decode_num_workers, - "decode_tp": self.decode_tp, - "decode_ep": self.decode_ep, - "decode_dp_attention": self.decode_dp_attention, - } - - -def expand_jobs(config: dict[str, Any], args: argparse.Namespace) -> list[MatrixJob]: - defaults = config.get("defaults", {}) - hardware_cfg = config.get("hardware", {}) - if not isinstance(hardware_cfg, dict): - raise ValueError("hardware must be a mapping") - - selected_hw = set(_parse_csv_strings(args.hardware) or hardware_cfg.keys()) - contexts = _parse_csv_strings(args.context_buckets) or _as_list(defaults.get("context_buckets")) - concurrencies = _parse_csv_ints(args.concurrency) or _as_list(defaults.get("concurrency")) - arrival_modes = _parse_csv_strings(args.arrival_modes) or _as_list(defaults.get("arrival_modes")) - cache_modes = _parse_csv_strings(args.cache_modes) or _as_list(defaults.get("cache_modes")) - tenant_modes = _parse_csv_strings(args.tenant_modes) or _as_list(defaults.get("tenant_modes")) - - jobs: list[MatrixJob] = [] - for hardware, hw in hardware_cfg.items(): - if hardware not in selected_hw or not hw.get("enabled", True): - continue - runner_script = REPO_ROOT / str(hw["runner_script"]) - if not runner_script.exists(): - raise FileNotFoundError(f"runner_script not found for {hardware}: {runner_script}") - script_expected = hw.get("script_expected") - if script_expected and not (REPO_ROOT / str(script_expected)).exists(): - raise FileNotFoundError(f"script_expected not found for {hardware}: {script_expected}") - - is_multinode = hw.get("topology") == "multi_node_disagg" - prefill = hw.get("prefill", {}) - decode = hw.get("decode", {}) - tp = int(hw.get("tp", prefill.get("tp", 
1))) - ep = int(hw.get("ep", prefill.get("ep", 1))) - dp_attention = bool(hw.get("dp_attention", prefill.get("dp_attention", False))) - - for context_bucket in contexts: - for concurrency in concurrencies: - for arrival_mode in arrival_modes: - for cache_mode in cache_modes: - for tenant_mode in tenant_modes: - key = "|".join( - [ - hardware, - str(hw["framework"]), - str(context_bucket), - str(concurrency), - str(arrival_mode), - str(cache_mode), - str(tenant_mode), - ] - ) - job_id = hashlib.sha1(key.encode()).hexdigest()[:10] - jobs.append( - MatrixJob( - job_id=job_id, - hardware=hardware, - framework=str(hw["framework"]), - topology=str(hw["topology"]), - context_bucket=str(context_bucket), - concurrency=int(concurrency), - arrival_mode=str(arrival_mode), - cache_mode=str(cache_mode), - tenant_mode=str(tenant_mode), - duration_seconds=int(args.duration or defaults.get("duration_seconds", 1800)), - runner_script=str(hw["runner_script"]), - runner_name=str(hw["runner_name"]), - slurm_nodes=int(hw.get("slurm_nodes", 1)), - gpus_per_node=int(defaults.get("gpus_per_node", 8)), - cpus_per_task=int(defaults.get("cpus_per_task", 64)), - time_limit=str(defaults.get("time_limit", "04:00:00")), - model_prefix=str(hw["model_prefix"]), - precision=str(hw["precision"]), - tp=tp, - ep=ep, - dp_attention=dp_attention, - disagg=is_multinode, - config_file=str(hw.get("config_file", "")), - is_multinode=is_multinode, - prefill_num_workers=int(prefill.get("num_worker", 0)), - prefill_tp=int(prefill.get("tp", 0)), - prefill_ep=int(prefill.get("ep", 0)), - prefill_dp_attention=bool(prefill.get("dp_attention", False)), - decode_num_workers=int(decode.get("num_worker", 0)), - decode_tp=int(decode.get("tp", 0)), - decode_ep=int(decode.get("ep", 0)), - decode_dp_attention=bool(decode.get("dp_attention", False)), - trace_source=str(defaults.get("trace_source", "")), - ) - ) - - if args.max_jobs is not None: - jobs = jobs[: args.max_jobs] - return jobs - - -def expected_paths(job: 
MatrixJob) -> list[str]: - return [ - f"{job.job_id}/{job.result_filename}.json", - f"{job.job_id}/results/benchmark.log", - f"{job.job_id}/results/benchmark_command.txt", - f"{job.job_id}/results/server.log", - f"{job.job_id}/results/trace_replay/detailed_results.csv", - f"{job.job_id}/results/trace_replay/debug_trace.jsonl", - f"{job.job_id}/preflight.log", - f"{job.job_id}/provenance_preflight.jsonl", - ] - - -def render_sbatch(job: MatrixJob, config: dict[str, Any], results_root: Path, dry_run_guard: bool) -> str: - defaults = config.get("defaults", {}) - template = SBATCH_TEMPLATE.read_text() - job_dir = results_root / job.job_id - values = { - **job.to_dict(), - "job_name": f"agentic-{job.hardware}-{job.job_id}", - "job_dir": str(job_dir), - "partition_env": defaults.get("partition_env", "GMI_SLURM_PARTITION"), - "account_env": defaults.get("account_env", "GMI_SLURM_ACCOUNT"), - "results_root_env": defaults.get("results_root_env", "GMI_RESULTS_ROOT"), - "container_image_env": defaults.get("container_image_env", "GMI_CONTAINER_IMAGE"), - "model_path_env": defaults.get("model_path_env", "GMI_MODEL_PATH"), - "dp_attention": str(job.dp_attention).lower(), - "disagg": str(job.disagg).lower(), - "is_multinode": "1" if job.is_multinode else "0", - "prefill_dp_attention": str(job.prefill_dp_attention).lower(), - "decode_dp_attention": str(job.decode_dp_attention).lower(), - "dry_run_guard": "1" if dry_run_guard else "0", - } - rendered = template - for key, value in values.items(): - rendered = rendered.replace("{" + key + "}", str(value)) - return rendered - - -def write_outputs(config: dict[str, Any], jobs: list[MatrixJob], results_root: Path, dry_run_guard: bool) -> None: - sbatch_dir = results_root / "sbatch" - sbatch_dir.mkdir(parents=True, exist_ok=True) - for job in jobs: - job_dir = results_root / job.job_id - job_dir.mkdir(parents=True, exist_ok=True) - sbatch_text = render_sbatch(job, config, results_root, dry_run_guard) - (sbatch_dir / 
f"{job.job_id}.sbatch").write_text(sbatch_text) - - plan = { - "scenario": "agentic-coding", - "total_jobs": len(jobs), - "jobs": [job.to_dict() for job in jobs], - } - (results_root / "matrix_plan.json").write_text(json.dumps(plan, indent=2) + "\n") - - contract = { - "scenario": "agentic-coding", - "total_jobs": len(jobs), - "per_job": [ - { - "job_id": job.job_id, - "result_filename": job.result_filename, - "expected_paths": expected_paths(job), - "required_before_claiming_success": [ - f"{job.job_id}/{job.result_filename}.json", - f"{job.job_id}/results/trace_replay/detailed_results.csv", - f"{job.job_id}/preflight.log", - f"{job.job_id}/provenance_preflight.jsonl", - ], - } - for job in jobs - ], - } - (results_root / "expected_artifact_contract.json").write_text(json.dumps(contract, indent=2) + "\n") - - -def submit_jobs(config: dict[str, Any], results_root: Path, jobs: list[MatrixJob]) -> None: - defaults = config.get("defaults", {}) - partition_env = defaults.get("partition_env", "GMI_SLURM_PARTITION") - account_env = defaults.get("account_env", "GMI_SLURM_ACCOUNT") - partition = os.environ.get(partition_env) - account = os.environ.get(account_env) - if not partition: - raise RuntimeError(f"{partition_env} must be set when --submit is used") - for job in jobs: - sbatch_path = results_root / "sbatch" / f"{job.job_id}.sbatch" - cmd = ["sbatch", "--partition", partition] - if account: - cmd.extend(["--account", account]) - cmd.append(str(sbatch_path)) - subprocess.run(cmd, check=True) - - -def build_parser() -> argparse.ArgumentParser: - parser = argparse.ArgumentParser(description=__doc__) - parser.add_argument("--config", type=Path, default=DEFAULT_CONFIG) - parser.add_argument("--results-root", type=Path, default=Path(os.environ.get("GMI_RESULTS_ROOT", "agentic-slurm-results"))) - parser.add_argument("--hardware", help="Comma-separated hardware filter, e.g. 
b200,b300") - parser.add_argument("--context-buckets", help="Comma-separated context buckets") - parser.add_argument("--concurrency", help="Comma-separated concurrency values") - parser.add_argument("--arrival-modes", help="Comma-separated arrival modes") - parser.add_argument("--cache-modes", help="Comma-separated cache modes") - parser.add_argument("--tenant-modes", help="Comma-separated tenant modes") - parser.add_argument("--duration", type=int) - parser.add_argument("--max-jobs", type=int) - parser.add_argument("--dry-run", action="store_true", help="Render files only") - parser.add_argument("--submit", action="store_true", help="Submit rendered sbatch jobs") - return parser - - -def main(argv: list[str] | None = None) -> int: - parser = build_parser() - args = parser.parse_args(argv) - if args.submit and args.dry_run: - parser.error("--submit and --dry-run are mutually exclusive") - - config = _load_config(args.config) - jobs = expand_jobs(config, args) - args.results_root.mkdir(parents=True, exist_ok=True) - write_outputs(config, jobs, args.results_root, dry_run_guard=args.dry_run or not args.submit) - - if args.submit: - submit_jobs(config, args.results_root, jobs) - - print(f"Wrote {len(jobs)} agentic Slurm jobs to {args.results_root}") - print(f"Matrix plan: {args.results_root / 'matrix_plan.json'}") - print(f"Artifact contract: {args.results_root / 'expected_artifact_contract.json'}") - return 0 - - -if __name__ == "__main__": - sys.exit(main()) diff --git a/scripts/slurm/agentic_job.sbatch.tmpl b/scripts/slurm/agentic_job.sbatch.tmpl deleted file mode 100644 index b3b052c38..000000000 --- a/scripts/slurm/agentic_job.sbatch.tmpl +++ /dev/null @@ -1,89 +0,0 @@ -#!/usr/bin/env bash -#SBATCH --job-name={job_name} -#SBATCH --nodes={slurm_nodes} -#SBATCH --gpus-per-node={gpus_per_node} -#SBATCH --cpus-per-task={cpus_per_task} -#SBATCH --time={time_limit} -#SBATCH --output={job_dir}/slurm-%j.out -#SBATCH --error={job_dir}/slurm-%j.err - -set -euo pipefail - 
-mkdir -p "{job_dir}"
-
-required_env=(
-  "{partition_env}"
-  "{results_root_env}"
-  "{container_image_env}"
-  "{model_path_env}"
-)
-for env_name in "${required_env[@]}"; do
-  if [[ -z "${!env_name:-}" ]]; then
-    echo "FATAL: required environment variable ${env_name} is not set" >&2
-    exit 2
-  fi
-done
-
-export IMAGE="${{container_image_env}}"
-export MODEL="${{model_path_env}}"
-export GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-$(pwd)}"
-export MODEL_PREFIX="{model_prefix}"
-export PRECISION="{precision}"
-export FRAMEWORK="{framework}"
-export RUNNER_NAME="{runner_name}"
-export RUNNER_TYPE="{hardware}"
-export EXP_NAME="{exp_name}"
-export RESULT_FILENAME="{result_filename}"
-export RESULT_DIR="{job_dir}/results"
-export AGENTIC_OUTPUT_DIR="{job_dir}"
-export SCENARIO_TYPE="agentic-coding"
-export SCENARIO_SUBDIR="agentic/"
-export IS_AGENTIC="1"
-export CONC="{concurrency}"
-export DURATION="{duration_seconds}"
-export TRACE_SOURCE="{trace_source}"
-export AGENTIC_CONTEXT_BUCKET="{context_bucket}"
-export AGENTIC_ARRIVAL_MODE="{arrival_mode}"
-export AGENTIC_CACHE_MODE="{cache_mode}"
-export AGENTIC_TENANT_MODE="{tenant_mode}"
-export TP="{tp}"
-export EP_SIZE="{ep}"
-export DP_ATTENTION="{dp_attention}"
-export SPEC_DECODING="none"
-export DISAGG="{disagg}"
-export CONFIG_FILE="{config_file}"
-export IS_MULTINODE="{is_multinode}"
-export PREFILL_NUM_WORKERS="{prefill_num_workers}"
-export PREFILL_TP="{prefill_tp}"
-export PREFILL_EP="{prefill_ep}"
-export PREFILL_DP_ATTN="{prefill_dp_attention}"
-export DECODE_NUM_WORKERS="{decode_num_workers}"
-export DECODE_TP="{decode_tp}"
-export DECODE_EP="{decode_ep}"
-export DECODE_DP_ATTN="{decode_dp_attention}"
-
-cat > "{job_dir}/provenance_preflight.jsonl" < "{job_dir}/preflight.log" 2>&1
-
-if [[ "{dry_run_guard}" == "1" ]]; then
-  echo "Dry-run sbatch rendered successfully; not executing runner."
- exit 0 -fi - -mkdir -p "$RESULT_DIR" -bash "{runner_script}" diff --git a/utils/test_agentic_slurm_matrix.py b/utils/test_agentic_slurm_matrix.py deleted file mode 100644 index 40c332a63..000000000 --- a/utils/test_agentic_slurm_matrix.py +++ /dev/null @@ -1,84 +0,0 @@ -import importlib.util -import json -import sys -from pathlib import Path - - -REPO_ROOT = Path(__file__).resolve().parents[1] -SCRIPT = REPO_ROOT / "scripts" / "run_agentic_slurm_matrix.py" - - -def load_runner(): - spec = importlib.util.spec_from_file_location("run_agentic_slurm_matrix", SCRIPT) - module = importlib.util.module_from_spec(spec) - assert spec.loader is not None - sys.modules[spec.name] = module - spec.loader.exec_module(module) - return module - - -def test_agentic_slurm_dry_run_writes_plan_contract_and_sbatch(tmp_path): - runner = load_runner() - rc = runner.main( - [ - "--dry-run", - "--results-root", - str(tmp_path), - "--hardware", - "b200", - "--context-buckets", - "8k", - "--concurrency", - "1,2", - ] - ) - assert rc == 0 - - plan = json.loads((tmp_path / "matrix_plan.json").read_text()) - assert plan["scenario"] == "agentic-coding" - assert plan["total_jobs"] == 2 - assert {job["concurrency"] for job in plan["jobs"]} == {1, 2} - assert all(job["model_prefix"] == "dsv4" for job in plan["jobs"]) - - contract = json.loads((tmp_path / "expected_artifact_contract.json").read_text()) - assert contract["total_jobs"] == 2 - required = contract["per_job"][0]["required_before_claiming_success"] - assert any(path.endswith("/trace_replay/detailed_results.csv") for path in required) - assert any(path.endswith("/provenance_preflight.jsonl") for path in required) - - sbatch_files = sorted((tmp_path / "sbatch").glob("*.sbatch")) - assert len(sbatch_files) == 2 - rendered = sbatch_files[0].read_text() - assert 'SCENARIO_TYPE="agentic-coding"' in rendered - assert 'MODEL_PREFIX="dsv4"' in rendered - assert 'TRACE_SOURCE="semianalysisai/cc-traces-weka-042026"' in rendered - assert "nvidia-smi 
topo -m" in rendered - assert "all_reduce_perf" in rendered - assert "Dry-run sbatch rendered successfully" in rendered - - -def test_agentic_slurm_matrix_can_filter_gb200_multinode(tmp_path): - runner = load_runner() - rc = runner.main( - [ - "--dry-run", - "--results-root", - str(tmp_path), - "--hardware", - "gb200", - "--context-buckets", - "8k", - "--concurrency", - "1", - "--max-jobs", - "1", - ] - ) - assert rc == 0 - - plan = json.loads((tmp_path / "matrix_plan.json").read_text()) - assert plan["total_jobs"] == 1 - job = plan["jobs"][0] - assert job["hardware"] == "gb200" - assert job["is_multinode"] is True - assert job["config_file"].endswith("disagg-gb200-low-latency.yaml") From 0156eb17bf361b86d7c4c32b2104067bdb93dde1 Mon Sep 17 00:00:00 2001 From: William Chen <57119977+OCWC22@users.noreply.github.com> Date: Sun, 3 May 2026 03:36:37 -0700 Subject: [PATCH 5/5] Revert "docs(agentic): add GMI truth matrix [skip-sweep]" This reverts commit df9aa0cd88b3591fa89a4b127685c25862acbb02. --- AGENTIC_TRUTH_MATRIX.md | 247 ---------------------------------------- 1 file changed, 247 deletions(-) delete mode 100644 AGENTIC_TRUTH_MATRIX.md diff --git a/AGENTIC_TRUTH_MATRIX.md b/AGENTIC_TRUTH_MATRIX.md deleted file mode 100644 index fdd91f139..000000000 --- a/AGENTIC_TRUTH_MATRIX.md +++ /dev/null @@ -1,247 +0,0 @@ -# SemiAnalysis InferenceX Agentic/WEKA Truth Matrix - -Date: 2026-05-03 - -Scope: local `InferenceX` checkout, focused on the PR path that adds the `agentic-coding` scenario and WEKA trace replay. This is a truth matrix for deciding what still needs to be built before a GMI Cloud or other neocloud platform engineer can use the harness to evaluate real long-context chat and coding inference workloads. - -## Bottom Line - -The current InferenceX implementation is a real but experimental **agentic trace replay harness**. 
It replays recorded WEKA coding/chat traces against an OpenAI-compatible serving endpoint and emits latency, throughput, cache, workload-distribution, and artifact outputs. - -It is **not yet** a complete GMI/neocloud evaluation harness for DeepSeek-V4 on B200/B300/GB200. The biggest gap is that `agentic-coding` is wired for some DeepSeek-R1 and GPT-OSS/Kimi paths, while the DeepSeek-V4 GB200/B300/B200 surface is still mostly fixed-sequence or srt-slurm recipe driven. The harness also does not yet produce the full cluster, network, reliability, cost, and operator-readiness evidence that a cloud platform engineer would need. - -## Actual Code Path Today - -| Step | What happens | Actual code | Truth status | -|---|---|---|---| -| 1 | Config declares an optional `agentic-coding` scenario. | `.github/configs/CONFIGS.md` | Exists | -| 2 | NVIDIA/AMD master configs include a small number of `agentic-coding` entries. | `.github/configs/nvidia-master.yaml`, `.github/configs/amd-master.yaml` | Exists, narrow | -| 3 | Matrix generator expands agentic entries across concurrency, TP, EP, DP attention, offload, runner, image, model, and duration. | `utils/matrix_logic/generate_sweep_configs.py` | Exists | -| 4 | GitHub workflow sets agentic routing env vars. | `.github/workflows/benchmark-tmpl.yml` | Exists | -| 5 | Runner selects `benchmarks/single_node/agentic/...` instead of normal fixed-seq scripts. | `runners/launch_*.sh` via `SCENARIO_SUBDIR=agentic/` | Exists | -| 6 | Shared library resolves WEKA trace source and builds the replay command. | `benchmarks/benchmark_lib.sh` | Exists | -| 7 | Agentic script starts the serving backend and runs trace replay. | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`, peers | Exists | -| 8 | Multi-node agentic path runs client-only replay against an already-started srt-slurm frontend. | `benchmarks/multi_node/agentic_srt.sh` | Exists, experimental | -| 9 | Aggregator turns replay CSVs into InferenceX-like JSON. 
| `utils/process_agentic_result.py` | Exists | -| 10 | Workflow uploads raw and aggregated artifacts. | `.github/workflows/benchmark-tmpl.yml`, `.github/workflows/e2e-tests.yml` | Exists | - -## Actual Code Snippets - -The trace source is hardcoded to a Hugging Face dataset: - -```bash -local dataset="semianalysisai/cc-traces-weka-042026" -TRACE_SOURCE_FLAG="--hf-dataset $dataset" -``` - -Source: `benchmarks/benchmark_lib.sh` - -Agentic replay is built as a client workload against the local serving endpoint: - -```bash -REPLAY_CMD="python3 $TRACE_REPLAY_DIR/trace_replay_tester.py" -REPLAY_CMD+=" --api-endpoint http://localhost:$PORT" -REPLAY_CMD+=" $TRACE_SOURCE_FLAG" -REPLAY_CMD+=" --output-dir $result_dir/trace_replay" -REPLAY_CMD+=" --start-users $CONC" -REPLAY_CMD+=" --max-users $CONC" -REPLAY_CMD+=" --test-duration $duration" -REPLAY_CMD+=" --recycle" -REPLAY_CMD+=" --warmup-enabled" -REPLAY_CMD+=" --seed 42" -``` - -Source: `benchmarks/benchmark_lib.sh` - -The workflow routes agentic jobs by setting: - -```yaml -SCENARIO_SUBDIR: ${{ inputs.scenario-type == 'agentic-coding' && 'agentic/' || '' }} -IS_AGENTIC: ${{ inputs.scenario-type == 'agentic-coding' && '1' || '0' }} -RESULT_DIR: /workspace/results -``` - -Source: `.github/workflows/benchmark-tmpl.yml` - -The B200 DeepSeek-R1 agentic script starts SGLang, waits for readiness, runs replay, then aggregates: - -```bash -resolve_trace_source -install_agentic_deps -python3 -m sglang.launch_server ... 
--enable-metrics > "$SERVER_LOG" 2>&1 & -wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" -build_replay_cmd "$RESULT_DIR" -$REPLAY_CMD 2>&1 | tee "$RESULT_DIR/benchmark.log" || true -write_agentic_result_json "$RESULT_DIR" -``` - -Source: `benchmarks/single_node/agentic/dsr1_fp4_b200.sh` - -Aggregated JSON includes scenario identity, topology, success counts, latency, throughput, token distributions, cache stats, and per-GPU throughput: - -```python -agg = { - "hw": os.environ.get('RUNNER_TYPE', ''), - "conc": conc, - "model": os.environ.get('MODEL', ''), - "framework": os.environ.get('FRAMEWORK', ''), - "scenario_type": "agentic-coding", - "is_multinode": is_multinode, - "tp": tp, - "ep": ep, - "offloading": os.environ.get('OFFLOADING', 'none'), - "num_requests_total": len(rows), - "num_requests_successful": len(successful), -} -``` - -Source: `utils/process_agentic_result.py` - -## What The Use Case Actually Is - -The current use case is: - -| Use case | Current behavior | -|---|---| -| Replay realistic coding/chat request traces | Yes, via `semianalysisai/cc-traces-weka-042026`. | -| Drive a serving endpoint with concurrent users | Yes, with `--start-users $CONC` and `--max-users $CONC`. | -| Measure request-level TTFT / E2E / ITL / TPOT | Yes, from `trace_replay/detailed_results.csv`. | -| Measure throughput and throughput per GPU | Yes, from completed request timestamps and configured GPU counts. | -| Measure input/output token distribution | Yes, from replay rows. | -| Estimate cache reuse | Partially. It reports theoretical replay cache hit rate and server prefix-cache counters when metrics exist. | -| Evaluate real autonomous coding agent behavior | No. It replays traces; it does not run an agent loop with tools, repo edits, tests, retries, or feedback. | -| Evaluate GMI customer traffic | No, unless GMI traffic is converted into the same trace-replay format. 
| - -## Current Coverage Matrix - -| Surface | Current status | Notes | -|---|---|---| -| DeepSeek-R1 FP4 B200 SGLang single-node agentic | Exists | `benchmarks/single_node/agentic/dsr1_fp4_b200.sh`. | -| DeepSeek-R1 FP4 B200 Dynamo/TRT multi-node agentic | Exists, experimental | Uses a special `cquil11/srt-slurm-nv` branch and a `128k_agentic` recipe. | -| DeepSeek-R1 FP4 MI355X SGLang single-node agentic | Exists | AMD entry in `.github/configs/amd-master.yaml`. | -| GPT-OSS FP4 H100/H200/MI300X/MI325X agentic scripts | Exists as scripts | Need config coverage and live validation per target. | -| Kimi K2.5 FP4 B200 agentic script | Exists as script | Need config coverage and live validation. | -| DeepSeek-V4 B200/B300 SGLang fixed-seq | Exists | Fixed 1k/8k surfaces, not agentic trace replay. | -| DeepSeek-V4 B200/B300 vLLM fixed-seq/MTP | Exists | Fixed-seq path with DSV4 chat encoding. | -| DeepSeek-V4 GB200 vLLM srt-slurm recipes | Exists | Recipe set for 8k1k, not agentic trace replay. | -| DeepSeek-V4 GB200 agentic trace replay | Missing | No `agentic-coding` config or DSV4-specific agentic launcher found. | -| B300 agentic trace replay | Mostly missing | B300 has fixed-seq DSR1/DSV4 surfaces, not a clear agentic path. | -| LMCache/TensorMesh agentic comparison | Missing | No direct LMCache/TensorMesh metrics integration in InferenceX agentic path. | - -## What A GMI/Neocloud Platform Engineer Actually Cares About - -| Category | What they need to decide | Current harness answer | Gap | -|---|---|---|---| -| Capacity planning | How many concurrent coding/chat sessions per node or rack before SLO violation? | Partial: concurrency sweep and request latencies. | Needs SLO pass/fail curves, saturation point, and capacity recommendations. | -| Latency SLO | P50/P90/P99 TTFT, TPOT, E2E for long-context chat/coding. | Partial: computes latency stats. | Needs explicit SLO config, pass/fail, and stable run windows. 
| -| Long context | How 32k/64k/128k/256k+ context behaves under realistic reuse. | Partial: WEKA traces may include realistic shapes, but context buckets are not first-class in matrix. | Needs explicit context-length stratification and reporting. | -| Coding workload realism | Does traffic resemble coding assistants, repo Q&A, edits, tests, tool calls? | Partial: recorded traces, but no task taxonomy shown in core benchmark output. | Needs workload classes: code chat, repo QA, patch generation, test/debug loop, long-doc coding. | -| Cache value | Does prefix/KV reuse improve latency, cost, and throughput? | Partial: theoretical and server prefix-cache hit metrics. | Needs engine-specific cache event metrics, eviction, residency, fragmentation, reuse distance, cache salt/isolation. | -| Multi-tenant isolation | Does one tenant poison or evict another tenant's cache? | Missing. | Needs tenant IDs, cache salts, fairness and isolation reports. | -| Memory pressure | When do KV cache, CPU offload, swap, or SSD tiers collapse? | Partial: offloading field and a few counters. | Needs GPU memory, HBM pressure, CPU memory, SSD bandwidth, eviction storms, OOM attribution. | -| Slurm operator flow | Can an operator dry-run, submit, monitor, cancel, and collect artifacts? | Partial in InferenceX CI and srt-slurm paths. | Needs portable Slurm matrix runner, sbatch rendering, env-only cluster config, and artifact contract. | -| Network health | Are NCCL/RDMA/NVLink/IB topology problems caught before benchmark? | Missing in InferenceX agentic path. | Needs preflight topology and network smoke checks. | -| Reproducibility | Can results be traced to image digest, repo SHA, GPU inventory, driver, topology, and versions? | Partial: CI has image/model/framework fields. | Needs full provenance captured per job. | -| Reliability | Do runs survive cold start, warmup, long duration, failed requests, server restarts? | Partial: success counts and raw logs. 
| Needs failure taxonomy, retry policy, health timeline, and soak tests. | -| Cost model | Which hardware/runtime gives best $/successful-session or $/M tokens at SLO? | Missing. | Needs GPU-hour pricing input and cost-per-SLO report. | -| Hardware comparison | B200 vs B300 vs GB200 for the same DSV4 workload. | Missing for agentic. | Needs the same workload run across the same engines and configs. | -| Runtime comparison | vLLM vs SGLang vs TRT/Dynamo under identical trace replay. | Partial for some models. | Needs a normalized DSV4 matrix and identical trace/scheduler settings. | -| Production readiness | What config should GMI actually offer customers? | Missing. | Needs recommended SKUs, caveats, and no-go thresholds. | - -## Truth Matrix: Current vs Required - -Legend: - -- Yes: implemented in the local InferenceX path. -- Partial: implemented but too narrow, experimental, or missing key evidence. -- No: not implemented. -- Unknown: cannot be proven from this repo without live cluster results or external data. - -| Requirement | Current truth | Evidence | Needed build | -|---|---|---|---| -| Agentic scenario flag and config schema | Yes | `agentic-coding` in config docs and validation. | Keep. | -| WEKA trace replay source | Yes | `semianalysisai/cc-traces-weka-042026` in `resolve_trace_source`. | Make dataset configurable; keep WEKA as default/example. | -| Single-node trace replay execution | Yes | `benchmarks/single_node/agentic/*.sh`. | Add DSV4 B200/B300 launchers. | -| Multi-node trace replay execution | Partial | `benchmarks/multi_node/agentic_srt.sh`; special srt-slurm branch. | First-class srt-slurm support, no special private branch dependency. | -| DeepSeek-V4 B200 agentic | No | DSV4 B200 configs are fixed-seq, not `agentic-coding`. | Add config + launcher + validated run. | -| DeepSeek-V4 B300 agentic | No | B300 has DSV4 fixed-seq scripts/recipes, not agentic. | Add config + launcher + validated run. 
| -| DeepSeek-V4 GB200 agentic | No | GB200 DSV4 recipes exist, but no agentic scenario. | Add srt-slurm agentic recipe and config. | -| B200/B300/GB200 apples-to-apples matrix | No | Current surfaces differ by model/runtime/scenario. | Build normalized matrix over hardware, engine, context, concurrency. | -| vLLM/SGLang/TRT/Dynamo comparison for same workload | Partial | Some engines covered for some models. | Normalize exact model, precision, prompt encoding, trace, and duration. | -| Long-context buckets | Partial | Fixed-seq has 1k/8k; trace replay may have varied token lengths. | Add explicit 8k/32k/64k/128k/256k+ bins in reports and optional filters. | -| Coding workload taxonomy | Partial | Trace replay exists; distribution plot exists. | Add task labels and per-class metrics. | -| TTFT/TPOT/E2E latency metrics | Yes | `compute_latency_stats`. | Add SLO pass/fail summary. | -| Throughput per GPU | Yes | `tput_per_gpu` in processor. | Add SLO-qualified throughput, not just raw throughput. | -| Failed request taxonomy | Partial | Success count exists. | Add HTTP error class, timeout, OOM, scheduler reject, engine crash. | -| Prefix/KV cache hit metrics | Partial | Theoretical + server prefix counters when present. | Add LMCache/TensorMesh/vLLM/SGLang metric adapters with measured-vs-inferred flags. | -| Eviction/fragmentation proof | No | No live cache event schema in agentic path. | Add engine metric scraping and artifact schema. | -| Multi-tenant cache isolation | No | No tenant IDs or cache salt model. | Add multi-tenant trace mode and isolation metrics. | -| CPU/SSD offload analysis | Partial | `offloading` field and some counters in processor. | Add tier residency, bandwidth, latency, and failure attribution. | -| Slurm dry-run matrix generation | Partial | InferenceX CI/srt-slurm flow exists; not a portable GMI operator runner. | Add portable Slurm matrix runner and artifact contract. | -| NCCL/RDMA/topology preflight | No | Not in agentic path. 
| Add pre-benchmark smoke checks. | -| Full provenance capture | Partial | JSON includes image/model/framework; raw logs upload. | Add digest, repo SHA, driver, CUDA, GPU inventory, topology, package versions. | -| Cost and capacity report | No | No pricing or recommendation layer. | Add cost inputs and capacity planning report. | -| Customer-ready operator report | No | Raw/aggregated artifacts only. | Add one-page operator brief with recommendations and caveats. | - -## Recommended Build Matrix For GMI/Neocloud Evaluation - -This is the minimum useful matrix for a GMI cloud engineer evaluating long-context chat and coding workloads. It is intentionally smaller than a full combinatorial sweep. - -| Axis | Required values | Why it matters | -|---|---|---| -| Hardware | B200, B300, GB200 | These are the procurement/deployment choices. | -| Model | DeepSeek-V4-Pro first; DeepSeek-R1 as control | DSV4 is the target; DSR1 provides existing harness continuity. | -| Runtime | vLLM, SGLang, Dynamo/TRT where supported | GMI needs runtime/SKU decision data. | -| Topology | single-node, multi-node disagg | Long context and MoE behavior differ sharply by topology. | -| Context bucket | 8k, 32k, 64k, 128k, 256k+ | Cloud operators need max supported context and degradation curve. | -| Workload type | long chat, repo QA, code generation, test/debug loop, multi-turn agent | Coding traffic is not one workload. | -| Concurrency | 1, 2, 4, 8, 16, 32, 64, 128, then saturation search | Finds knee of curve and failure region. | -| Arrival mode | closed-loop and burst/open-loop | Closed-loop measures users; open-loop exposes queue collapse. | -| Cache mode | cache off, engine prefix cache, LMCache/TensorMesh if available | Proves whether cache stack actually helps. | -| Tenant mode | single tenant, multi-tenant with cache salt | Proves isolation and fairness. | -| Duration | 10 min smoke, 30 min curve, 2-4 hr soak | Separates launch success from operational stability. 
| - -## What To Build Next - -| Priority | Build item | Acceptance criteria | -|---:|---|---| -| P0 | DSV4 `agentic-coding` configs for B200/B300/GB200 | Matrix generator emits DSV4 agentic jobs for each target hardware without touching fixed-seq paths. | -| P0 | DSV4 agentic launchers | Single-node launchers exist for B200/B300; GB200 multi-node agentic recipe exists or maps cleanly to srt-slurm custom benchmark. | -| P0 | Portable Slurm matrix runner | GMI operator can dry-run and submit without GitHub Actions; no hardcoded cluster IDs; all cluster settings via env/YAML. | -| P0 | Artifact contract | Every run emits a normalized JSON, raw CSV/JSONL, server log, config, command, provenance, and expected-path manifest. | -| P1 | Workload taxonomy and context buckets | Report breaks down metrics by workload class and context-length bucket. | -| P1 | SLO/capacity report | For each cell, report max concurrency at TTFT/TPOT/E2E SLO and failure reason beyond it. | -| P1 | Provenance capture | Per-job artifact records image digest, repo SHA, CUDA/driver, GPU inventory, topology, runtime versions, Slurm job ID, nodelist. | -| P1 | NCCL/RDMA/topology preflight | Preflight emits pass/fail/skipped before benchmark execution. | -| P1 | Cache metrics adapters | vLLM/SGLang/LMCache/TensorMesh metrics are normalized with measured vs inferred labels. | -| P2 | Multi-tenant replay mode | Tenant IDs, cache salt/isolation, fairness, noisy-neighbor metrics. | -| P2 | Cost model | Add GPU-hour price input and output $/successful-session, $/M input tokens, $/M output tokens at SLO. | -| P2 | Operator brief | Generate a human-readable recommendation with caveats: best config, no-go configs, saturation point, and missing proof. | - -## Non-Claims To Preserve - -Do not claim any of the following until live artifacts prove them: - -- DeepSeek-V4 agentic performance on GB200. -- B200/B300/GB200 parity under the same long-context trace replay. -- LMCache/TensorMesh benefit. 
-- Cache eviction or fragmentation behavior. -- Multi-tenant isolation. -- Production readiness for GMI customer workloads. -- Autonomous agent performance; this is trace replay, not a tool-using agent loop. - -## Proposed File/Code Changes For The Next PR - -| Area | Candidate files | -|---|---| -| DSV4 agentic configs | `.github/configs/nvidia-master.yaml`, possibly a separate GMI/GPU pilot config. | -| DSV4 single-node launchers | `benchmarks/single_node/agentic/dsv4_fp4_b200_sglang.sh`, `benchmarks/single_node/agentic/dsv4_fp4_b300_sglang.sh`, vLLM variants if supported. | -| GB200 multi-node agentic | `benchmarks/multi_node/agentic_srt.sh`, `benchmarks/multi_node/srt-slurm-recipes/.../deepseek-v4/...`, `runners/launch_gb200-nv.sh`. | -| Slurm operator harness | `scripts/slurm/`, `scripts/run_agentic_slurm_matrix.py`, `configs/agentic_slurm_matrix.yaml`. | -| Metrics schema | `utils/process_agentic_result.py` plus a new normalized metrics schema module. | -| Artifact contract tests | `utils/matrix_logic/test_*.py` or new repo-level tests for dry-run contract. | -| Operator report | new `utils/summarize_agentic.py` or integration into `utils/summarize.py`. | - -## Decision - -For GMI/neocloud evaluation, the current InferenceX PR is a **good starting mechanism**, not a finished benchmark product. Build the missing DSV4+B200/B300/GB200 agentic Slurm surface, add provenance/preflight/cache/SLO reporting, and keep every unmeasured claim explicitly labeled as unproven.
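
## Appendix: SLO/Capacity Report Sketch

The P1 "SLO/capacity report" build item can be made concrete with a short sketch. Everything below is illustrative, not existing harness code: the row shape, the field names (`conc`, `ttft_s`, `e2e_s`), the function names, and the SLO thresholds are all assumptions. Real input rows would be parsed from the per-request `trace_replay/detailed_results.csv` across a concurrency sweep.

```python
# Hypothetical sketch of the P1 SLO/capacity report; field names and
# thresholds are assumptions, not part of the current InferenceX harness.
from statistics import quantiles


def p99(values):
    # 99th percentile; statistics.quantiles needs at least two samples.
    return quantiles(values, n=100, method="inclusive")[98]


def max_concurrency_at_slo(rows, ttft_slo_s=2.0, e2e_slo_s=60.0):
    """Highest tested concurrency whose P99 TTFT and E2E meet the SLO,
    plus a per-concurrency pass/fail table for the operator report."""
    by_conc = {}
    for row in rows:
        by_conc.setdefault(row["conc"], []).append(row)

    best, table = None, []
    for conc in sorted(by_conc):
        ttfts = [r["ttft_s"] for r in by_conc[conc]]
        e2es = [r["e2e_s"] for r in by_conc[conc]]
        ok = p99(ttfts) <= ttft_slo_s and p99(e2es) <= e2e_slo_s
        table.append({"conc": conc, "p99_ttft_s": p99(ttfts),
                      "p99_e2e_s": p99(e2es), "slo_pass": ok})
        if ok:
            best = conc
    return best, table
```

`best` comes back as `None` when no tested concurrency meets the SLO, which maps directly onto the no-go thresholds the production-readiness row asks for; the per-concurrency pass/fail table is the raw material for the saturation-point and capacity-recommendation outputs named under capacity planning.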