diff --git a/CHANGELOG.md b/CHANGELOG.md
index e2a0681..da1b259 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ The format follows [Keep a Changelog](https://keepachangelog.com/en/1.1.0/). The

 ### Added

+- **vLLM production deployment notes.** `docs/model-providers/vllm.md` grows a "Production deployment" section covering the `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` gotcha (vLLM's stock 5s uvicorn keep-alive lapses pooled OA-side httpx connections and surfaces as `ProviderUnavailable`; widen to roughly 300s), a systemd unit skeleton, and the three throughput knobs that interact with OA's shared connection pool (`--max-model-len`, `--max-num-seqs`, `--gpu-memory-utilization`). The existing "Tool calling" section grows a `--tool-call-parser` family table verified against vLLM's docs (Llama 3.x / Llama 4 / Mistral / Hermes / Qwen3 / DeepSeek V3 / GPT-OSS), plus explicit "not supported here" callouts for Anthropic / Gemini (proprietary cloud) and mainstream Gemma (no vLLM parser).
 - **Three new patterns docs.** `docs/patterns/state-migration-on-resume.md`, `docs/patterns/caller-supplied-trace-identifiers.md`, and `docs/patterns/observer-state-reconciliation.md` graduate the corresponding entries from `docs/agent/non-obvious-shapes.md` into full pattern recipes with code snippets and "when this is right / when it isn't" guidance. The programmatic patterns API (`openarmature.patterns.list()` / `get(name)`) grows from 4 to 7 entries.
 - **HyperDX OTel integration test path and "Production swap" docs in example 03.** `examples/03-observer-hooks/main.py`'s module docstring grows a "Production swap" section showing how to substitute the demo's `SimpleSpanProcessor` + `ConsoleSpanExporter` for `BatchSpanProcessor` + `OTLPSpanExporter` pointed at HyperDX (or any other OTLP-HTTP collector).
A new opt-in integration test (`tests/integration/test_otel_hyperdx_export.py`, gated by `HYPERDX_API_KEY` + `HYPERDX_OTLP_ENDPOINT` env vars and `@pytest.mark.integration`) drives the same production export path end-to-end against a live endpoint. `opentelemetry-exporter-otlp-proto-http` lands as a dev-only dep; not promoted to a public extras group yet.

diff --git a/docs/model-providers/vllm.md b/docs/model-providers/vllm.md
index d0b5952..3fd2634 100644
--- a/docs/model-providers/vllm.md
+++ b/docs/model-providers/vllm.md
@@ -197,21 +197,150 @@ post-release task: harden OpenAIProvider readiness probe).

 vLLM supports OpenAI-style tool calling when launched with
 `--enable-auto-tool-choice` and a tool-parser flag matching the
-model family (e.g., `--tool-call-parser llama3_json` for Llama 3.1
-Instruct). The wire shape is identical to OpenAI's; from
+model family. The wire shape is identical to OpenAI's; from
 `OpenAIProvider`'s perspective, tool calls Just Work. The
 [fundamentals → tool calling](../concepts/llms.md#tool-calling) page
 covers the OA-side dispatch pattern; no vLLM-specific changes needed.

 ```bash
-# vLLM server — enable tool calling
-python -m vllm.entrypoints.openai.api_server \
-    --model meta-llama/Llama-3.1-8B-Instruct \
+# vLLM server with tool calling enabled
+vllm serve \
     --enable-auto-tool-choice \
-    --tool-call-parser llama3_json
+    --tool-call-parser
+```
+
+The `--tool-call-parser` flag MUST match the model family's training
+format; mismatches produce assistant messages that vLLM tries to
+parse as tool calls and silently returns as content (or vice versa).
+Common families:
+
+| Model family                  | `--tool-call-parser` value |
+|-------------------------------|----------------------------|
+| Llama 3.x Instruct            | `llama3_json`              |
+| Llama 4 (Maverick / Scout)    | `llama4_pythonic`          |
+| Mistral Instruct families     | `mistral`                  |
+| Hermes, Qwen 2.5 tool-use     | `hermes`                   |
+| Qwen3 / Qwen3-Coder           | `qwen3_xml`                |
+| DeepSeek V3                   | `deepseek_v3`              |
+| GPT-OSS (20B / 120B)          | `openai`                   |
+
+Anthropic Claude and Google Gemini models are proprietary cloud APIs,
+not open weights; vLLM doesn't serve them, so they don't appear in
+this table. Use their first-party endpoints (or an OpenAI-compatible
+proxy) and skip the `--tool-call-parser` story entirely.
+
+**Gemma (Google open weights).** Distinct from Gemini, but vLLM does
+not currently ship a tool-call parser for the mainstream Gemma 2,
+Gemma 3, or CodeGemma variants; tool calling is effectively
+unsupported under vLLM for those. The one exception is Google's
+specialized FunctionGemma (270M, edge-focused), which has its own
+`functiongemma` parser. For general-purpose tool-calling workloads,
+pick a model family from the table above rather than Gemma.
+
+**Qwen3-VL specifically.** vLLM's docs don't currently document a
+dedicated parser for the Qwen3-VL variants (`Qwen3-VL-30B-A3B`,
+`Qwen3-VL-72B`). Check vLLM's release notes for the version you're
+pinned to before assuming the Qwen3 row above carries over;
+multimodal-instruct variants sometimes ship parser support behind
+the text-instruct generation.
+
+See vLLM's
+[tool-calling docs](https://docs.vllm.ai/en/latest/features/tool_calling.html)
+for the current full list; the set grows release-over-release.
+
+## Production deployment
+
+The 30-second snippet at the top of this page is enough for a local
+dev box. Production deployments hit three additional gotchas worth
+calling out.
+
+### `VLLM_HTTP_TIMEOUT_KEEP_ALIVE` against `OpenAIProvider`
+
+`OpenAIProvider` keeps one `httpx.AsyncClient` per provider instance
+and reuses connections across concurrent `complete()` calls per the
+standard httpx pool idiom. vLLM's stock uvicorn keep-alive timeout
+is 5 seconds; an idle pooled connection on the OA side can outlive
+that window and the next request lands on a half-closed socket. The
+visible symptom is `httpcore.RemoteProtocolError: Server
+disconnected without sending a response` or
+`httpx.RemoteProtocolError`, surfaced through `OpenAIProvider` as
+`ProviderUnavailable`.
+
+The fix is to widen vLLM's keep-alive window via the
+`VLLM_HTTP_TIMEOUT_KEEP_ALIVE` env var (the value feeds uvicorn's
+`timeout_keep_alive`). 300 seconds covers most pool idle windows in
+practice:
+
+```bash
+VLLM_HTTP_TIMEOUT_KEEP_ALIVE=300 vllm serve --host 0.0.0.0 --port 8001
 ```

+Same applies behind a reverse proxy: the proxy's keep-alive window
+MUST be at least as wide as vLLM's. Otherwise the proxy closes
+connections vLLM still considers alive and the OA-side pool reuses a
+dead socket on the next call.
+
+### systemd unit shape
+
+For long-running vLLM workloads, a systemd unit is the canonical
+launcher. The structural skeleton:
+
+```ini
+# /etc/systemd/system/vllm-.service
+[Unit]
+Description=vLLM serving
+After=network-online.target
+Wants=network-online.target
+
+[Service]
+Type=simple
+User=vllm
+WorkingDirectory=/srv/vllm
+EnvironmentFile=/etc/vllm/.env
+ExecStart=/srv/vllm/.venv/bin/vllm serve \
+    --host 0.0.0.0 --port 8001 \
+    --enable-auto-tool-choice \
+    --tool-call-parser
+Restart=on-failure
+RestartSec=5
+
+[Install]
+WantedBy=multi-user.target
+```
+
+The `EnvironmentFile` pattern keeps `VLLM_HTTP_TIMEOUT_KEEP_ALIVE`,
+`CUDA_VISIBLE_DEVICES`, `HF_HOME`, and other deploy-specific vars
+out of the unit file itself, which makes the unit shippable across
+hosts without per-machine edits.
`journalctl -u vllm-` is
+then the canonical log surface for production triage.
+
+### Throughput knobs and OA concurrency
+
+Three vLLM flags interact directly with how many concurrent
+`complete()` calls an OA graph can land before vLLM starts 429-ing:
+
+- `--max-model-len`: per-request context ceiling. Lower values fit
+  more concurrent requests in the same KV-cache budget; higher
+  values let individual requests carry longer prompts at the cost
+  of concurrent capacity.
+- `--max-num-seqs`: hard cap on concurrent sequences vLLM will
+  schedule. Past this cap, the scheduler queues and (once queue
+  fills) returns 429 with `Retry-After`.
+- `--gpu-memory-utilization`: fraction of GPU VRAM vLLM may use.
+  Higher values widen the KV-cache budget, which lets vLLM schedule
+  closer to its `--max-num-seqs` cap before evicting in-flight
+  sequences; the cap itself doesn't move. Tune cautiously to avoid
+  OOM on the resident model weights.
+
+OA's `OpenAIProvider` shares one connection pool across the whole
+graph, so a fan-out with `concurrency=N` lands N simultaneous wire
+calls. When `N` exceeds `--max-num-seqs` minus vLLM's other
+in-flight traffic, expect `ProviderRateLimit` with
+`retry_after` populated; wrap the LLM-calling node in
+`RetryMiddleware` (or set `concurrency` explicitly on the fan-out)
+to avoid head-of-line stalls.
+
 ## Behaviour to be aware of

 - **Concurrency**: vLLM batches requests internally. `OpenAIProvider`