diff --git a/issues/embedded-ai-embabel/11-ai-provider-abstraction.md b/issues/embedded-ai-embabel/11-ai-provider-abstraction.md new file mode 100644 index 0000000..9331dc9 --- /dev/null +++ b/issues/embedded-ai-embabel/11-ai-provider-abstraction.md @@ -0,0 +1,96 @@ +# Issue: AI Provider Abstraction Layer + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Critical (prerequisite for all other Embabel work) +**Depends on:** None + +## Summary + +Create an `AIProvider` abstraction layer that decouples docgen from direct OpenAI API calls. This enables switching between providers (OpenAI, Embabel via MCP, Ollama for local models) via configuration, and is the prerequisite for all Embabel integration work. + +## Background + +Today, docgen has three hard-coded OpenAI integration points: + +| File | API | Model | Purpose | +|------|-----|-------|---------| +| `wizard.py` | `chat.completions.create` | `gpt-4o` (configurable) | Narration generation | +| `tts.py` | `audio.speech.create` | `gpt-4o-mini-tts` (configurable) | Text-to-speech | +| `timestamps.py` | `audio.transcriptions.create` | `whisper-1` (hard-coded) | Audio timestamps | + +Each creates its own `openai.OpenAI()` client. There is no abstraction layer, no provider switching, and no support for non-OpenAI backends. + +## Acceptance Criteria + +- [ ] Create `AIProvider` protocol with methods: + ```python + class AIProvider(Protocol): + def chat(self, model: str, messages: list[dict], **kwargs) -> str: ... + def tts(self, model: str, voice: str, text: str, instructions: str, output_path: Path) -> Path: ... + def transcribe(self, model: str, audio_path: Path, **kwargs) -> dict: ... 
+ ``` +- [ ] Implement `OpenAIProvider` wrapping all current direct calls +- [ ] Implement `EmbabelProvider` that connects via MCP Python SDK to Embabel server +- [ ] Implement `OllamaProvider` for local model support: + - Chat: Ollama REST API (`/api/chat`) + - TTS: falls back to OpenAI (Ollama doesn't support TTS) + - Transcribe: falls back to OpenAI or local `whisper.cpp` +- [ ] Factory function: `get_ai_provider(config) -> AIProvider` +- [ ] Config in `docgen.yaml`: + ```yaml + ai: + provider: openai # "openai", "embabel", "ollama" + embabel_url: http://localhost:8080/sse + ollama_url: http://localhost:11434 + ollama_model: llama3.2 + ``` +- [ ] Refactor all three call sites to use `AIProvider`: + - `wizard.py:generate_narration_via_llm` → `provider.chat(...)` + - `tts.py:TTSGenerator.generate` → `provider.tts(...)` + - `timestamps.py:TimestampExtractor.extract` → `provider.transcribe(...)` +- [ ] Backward compatible: no config = default to `openai` with existing behavior +- [ ] Make `whisper-1` model configurable (currently hard-coded in `timestamps.py`) +- [ ] Unit tests with mock providers for each implementation + +## Technical Notes + +### Provider resolution order + +1. Explicit `ai.provider` in `docgen.yaml` → use that +2. `DOCGEN_AI_PROVIDER` environment variable → override +3. Default → `openai` + +### EmbabelProvider sketch + +```python +class EmbabelProvider: + def __init__(self, url: str): + self.url = url + self._client = None # lazy MCP client + + async def _connect(self): + from mcp import ClientSession + # Connect to Embabel SSE endpoint + ... + + def chat(self, model, messages, **kwargs): + # Invoke Embabel NarrationAgent tool via MCP + return self._call_tool("generate_narration", {...}) + + def tts(self, model, voice, text, instructions, output_path): + # Invoke Embabel TTSAgent tool via MCP, or fall back to OpenAI + ... 
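    def transcribe(self, model, audio_path, **kwargs):
        # Sketch of the third AIProvider method (assumption): Embabel exposes
        # no transcription tool over MCP, so this would fall back to the
        # OpenAI provider, mirroring the TTS fallback above
        ...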
+``` + +### Narration lint impact + +`NarrationLinter.lint_audio` in `narration_lint.py` also uses `TimestampExtractor` indirectly — it will automatically benefit from the provider abstraction without code changes. + +## Files to Create/Modify + +- **Create:** `src/docgen/ai_provider.py` +- **Modify:** `src/docgen/wizard.py` (use provider instead of direct openai) +- **Modify:** `src/docgen/tts.py` (use provider instead of direct openai) +- **Modify:** `src/docgen/timestamps.py` (use provider instead of direct openai) +- **Modify:** `src/docgen/config.py` (add `ai` config block) +- **Create:** `tests/test_ai_provider.py` diff --git a/issues/embedded-ai-embabel/12-embabel-agent-definitions.md b/issues/embedded-ai-embabel/12-embabel-agent-definitions.md new file mode 100644 index 0000000..3111085 --- /dev/null +++ b/issues/embedded-ai-embabel/12-embabel-agent-definitions.md @@ -0,0 +1,109 @@ +# Issue: Embabel Agent Definitions (JVM Side) + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** High +**Depends on:** Issue 11 (provider abstraction) + +## Summary + +Create the Embabel Spring Boot application that hosts the AI agents for docgen. These agents are exposed as MCP tools that the Python docgen client can invoke for narration generation, TTS orchestration, pipeline management, script generation, and error diagnosis. + +## Background + +Embabel is a JVM-based agent framework that uses Goal-Oriented Action Planning (GOAP) to dynamically plan action sequences. 
By defining docgen-specific agents, we get: + +- **Planning**: the agent figures out the optimal sequence of steps (e.g., "to produce a demo video, I need to generate narration, then TTS, then compose...") +- **Tool use**: agents can invoke docgen CLI commands as tools +- **LLM mixing**: use GPT-4o for narration quality, cheaper models for classification +- **MCP exposure**: all agents automatically available to Python via MCP protocol + +## Acceptance Criteria + +- [ ] Create Embabel Spring Boot project: + - Option A: `docgen-agent/` subdirectory in this repo + - Option B: Companion repo `docgen-agent` (linked from README) +- [ ] Domain model classes (Kotlin data classes): + ```kotlin + data class NarrationRequest(val segment: String, val guidance: String, val sources: List) + data class NarrationResponse(val text: String, val wordCount: Int) + data class TTSRequest(val segment: String, val voice: String, val model: String) + data class PipelineRequest(val steps: List, val options: Map) + data class ScriptRequest(val testFile: String, val segment: String, val description: String) + data class DiagnosisRequest(val errorLog: String, val segment: String, val context: Map) + ``` +- [ ] Agent implementations: + - `NarrationAgent` — generates/revises narration from source docs + guidance + - `TTSAgent` — wraps TTS generation with voice/model selection and preview + - `PipelineAgent` — orchestrates multi-step pipeline via GOAP planning + - `ScriptAgent` — generates Playwright capture scripts or Manim scene code + - `DebugAgent` — analyzes compose/validation errors and suggests fixes +- [ ] All agent goals exported as MCP tools: `@Export(remote = true)` +- [ ] LLM configuration: + - GPT-4o for narration generation (quality-critical) + - Local model via Ollama for simple classification/routing + - Configurable in `application.yml` +- [ ] Docker Compose setup for running Embabel alongside docgen: + ```yaml + services: + docgen-agent: + build: ./docgen-agent + ports: 
["8080:8080"] + environment: + - OPENAI_API_KEY=${OPENAI_API_KEY} + - SPRING_AI_OPENAI_API_KEY=${OPENAI_API_KEY} + ``` +- [ ] Integration tests verifying MCP tool discovery and invocation +- [ ] Health endpoint for connection checking + +## Technical Notes + +### Agent architecture + +Each agent is an `@EmbabelComponent` with `@Action` methods: + +```kotlin +@Agent("Narration generation agent for docgen") +class NarrationAgent { + + @Action("Generate narration from source documents") + @Export(remote = true) + fun generateNarration(request: NarrationRequest): NarrationResponse { + // LLM call with docgen-specific system prompt + } + + @Action("Revise existing narration based on feedback") + @Export(remote = true) + fun reviseNarration(segment: String, currentText: String, feedback: String): NarrationResponse { + // LLM call with revision context + } +} +``` + +### MCP server configuration + +```yaml +# application.yml +spring: + ai: + mcp: + server: + type: SYNC + openai: + api-key: ${OPENAI_API_KEY} +``` + +## Files to Create + +- **Create:** `docgen-agent/` (Spring Boot project) + - `pom.xml` or `build.gradle.kts` + - `src/main/kotlin/com/docgen/agent/` + - `DocgenAgentApplication.kt` + - `agents/NarrationAgent.kt` + - `agents/TTSAgent.kt` + - `agents/PipelineAgent.kt` + - `agents/ScriptAgent.kt` + - `agents/DebugAgent.kt` + - `model/` (domain classes) + - `src/main/resources/application.yml` + - `Dockerfile` +- **Create:** `docker-compose.yml` (root level, optional) diff --git a/issues/embedded-ai-embabel/13-python-mcp-client.md b/issues/embedded-ai-embabel/13-python-mcp-client.md new file mode 100644 index 0000000..cc1c0a7 --- /dev/null +++ b/issues/embedded-ai-embabel/13-python-mcp-client.md @@ -0,0 +1,110 @@ +# Issue: Python MCP Client Integration + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** High +**Depends on:** Issue 11 (provider abstraction), Issue 12 (Embabel agents) + +## Summary + +Implement the Python-side MCP client 
that connects to the Embabel agent server, discovers available tools, and provides the `EmbabelProvider` implementation for the AI provider abstraction layer. + +## Background + +The official MCP Python SDK (`mcp` on PyPI) provides `ClientSession` for connecting to MCP servers. Embabel exposes its agents as MCP tools over SSE (Server-Sent Events) at `http://localhost:8080/sse`. This issue bridges the two by implementing a robust client that handles connection management, tool discovery, invocation, and streaming. + +## Acceptance Criteria + +- [ ] Add `mcp` Python SDK as optional dependency: + ```toml + [project.optional-dependencies] + embabel = ["mcp>=1.0"] + ``` + Install with: `pip install docgen[embabel]` +- [ ] Implement `EmbabelClient` class: + ```python + class EmbabelClient: + def __init__(self, url: str = "http://localhost:8080/sse"): + ... + + async def connect(self) -> None: + """Connect to Embabel SSE endpoint.""" + + async def discover_tools(self) -> list[Tool]: + """List available MCP tools from Embabel.""" + + async def invoke(self, tool_name: str, args: dict) -> Any: + """Invoke an MCP tool and return the result.""" + + async def stream(self, tool_name: str, args: dict) -> AsyncIterator[str]: + """Invoke a tool with streaming response.""" + + async def close(self) -> None: + """Disconnect from Embabel.""" + ``` +- [ ] Auto-reconnect on connection loss (exponential backoff, max 3 retries) +- [ ] Graceful degradation: if Embabel is unavailable, fall back to direct OpenAI provider +- [ ] Tool invocation wrappers for each agent tool: + ```python + async def generate_narration(self, segment: str, guidance: str, sources: list[str]) -> str: + return await self.invoke("generate_narration", {...}) + ``` +- [ ] Handle streaming responses for chat interactions (SSE event stream) +- [ ] Connection health checking (`is_connected`, `ping`) +- [ ] Config integration: read `ai.embabel_url` from `docgen.yaml` +- [ ] Synchronous wrapper for CLI usage (the MCP SDK is 
async, but docgen CLI is sync) +- [ ] Unit tests with mocked MCP server + +## Technical Notes + +### MCP Python SDK usage + +```python +from mcp import ClientSession +from mcp.client.sse import sse_client + +async with sse_client(url="http://localhost:8080/sse") as (read, write): + async with ClientSession(read, write) as session: + await session.initialize() + tools = await session.list_tools() + result = await session.call_tool("generate_narration", arguments={...}) +``` + +### Sync wrapper pattern + +Since docgen CLI uses Click (synchronous), we need a sync wrapper: + +```python +import asyncio + +class EmbabelClientSync: + def __init__(self, url: str): + self._async_client = EmbabelClient(url) + self._loop = asyncio.new_event_loop() + + def invoke(self, tool_name: str, args: dict) -> Any: + return self._loop.run_until_complete( + self._async_client.invoke(tool_name, args) + ) +``` + +### Fallback behavior + +```python +def get_ai_provider(config): + if config.ai_provider == "embabel": + try: + client = EmbabelClientSync(config.embabel_url) + client.connect() + return EmbabelProvider(client) + except ConnectionError: + print("[ai] Embabel unavailable, falling back to OpenAI") + return OpenAIProvider() + ... 
+``` + +## Files to Create/Modify + +- **Create:** `src/docgen/mcp_client.py` +- **Modify:** `src/docgen/ai_provider.py` (implement EmbabelProvider using mcp_client) +- **Modify:** `pyproject.toml` (add `embabel` optional dependency) +- **Create:** `tests/test_mcp_client.py` diff --git a/issues/embedded-ai-embabel/14-cli-chat.md b/issues/embedded-ai-embabel/14-cli-chat.md new file mode 100644 index 0000000..cbb2cce --- /dev/null +++ b/issues/embedded-ai-embabel/14-cli-chat.md @@ -0,0 +1,98 @@ +# Issue: CLI Chat Interface + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Medium +**Depends on:** Issue 11 (provider abstraction), Issue 13 (MCP client) + +## Summary + +Add a `docgen chat` CLI command that provides a terminal-based conversational interface for interacting with the AI agent. Users can generate narration, run pipeline steps, diagnose errors, and iterate on their demo videos through natural language. + +## Background + +Currently, docgen requires users to know specific CLI commands and their flags. A chat interface lets users describe what they want in natural language, and the AI agent translates that into the appropriate tool calls. This is especially valuable for: + +- New users exploring docgen capabilities +- Iterating on narration ("make it more conversational", "shorten this section") +- Debugging pipeline failures ("what went wrong?", "why is the video frozen?") +- Generating code ("write a Playwright capture script for the login flow") + +## Acceptance Criteria + +- [ ] New CLI command: `docgen chat [--provider embabel|openai|ollama]` +- [ ] Terminal-based conversational loop with colored prompt: + ``` + docgen> generate narration for segment 03 about the wizard setup flow + [Agent] I'll generate narration for segment 03. Let me check the source documents... 
+ [Tool: generate_narration] Using sources: docs/setup.md, README.md + [Agent] Here's the draft narration: + + "The docgen wizard provides a local web interface for bootstrapping narration scripts..." + + Would you like me to save this to narration/03-wizard-gui.md? + + docgen> yes, and make it shorter + + [Agent] I'll revise the narration to be more concise... + ``` +- [ ] Support natural language commands mapping to tools: + - "generate narration for segment 03" → `generate_narration` tool + - "run TTS" → `run_tts` tool + - "run the full pipeline" → `run_pipeline` tool + - "what went wrong with compose?" → `diagnose_error` tool + - "write a capture script for test_login.py" → `generate_capture_script` tool +- [ ] Stream responses token-by-token for natural feel +- [ ] Maintain conversation history within session +- [ ] Tool call visualization with status indicators +- [ ] Handle multi-turn conversations with context +- [ ] Exit with `/quit`, `/exit`, or Ctrl+C +- [ ] Special commands: + - `/help` — show available commands and examples + - `/status` — show current project state (segments, which have audio, etc.) + - `/clear` — clear conversation history + - `/provider` — show/switch active AI provider +- [ ] `--non-interactive` mode for scripted usage: + ```bash + echo "generate narration for 03" | docgen chat --non-interactive + ``` + +## Technical Notes + +### Chat loop architecture + +```python +@main.command() +@click.option("--provider", default=None) +@click.pass_context +def chat(ctx, provider): + cfg = ctx.obj["config"] + ai = get_ai_provider(cfg, override_provider=provider) + + history = [] + while True: + user_input = click.prompt("docgen", prompt_suffix="> ") + if user_input.strip() in ("/quit", "/exit"): + break + + history.append({"role": "user", "content": user_input}) + response = ai.chat( + model=cfg.chat_model, + messages=[SYSTEM_PROMPT, *history], + tools=get_docgen_tools(cfg), + ) + # Handle tool calls, stream response + ... 
+``` + +### System prompt + +The chat system prompt should include: +- docgen project context (segments, visual_map, current state) +- Available tools and their descriptions +- Instructions for being helpful with demo video creation + +## Files to Create/Modify + +- **Modify:** `src/docgen/cli.py` (add `chat` command) +- **Create:** `src/docgen/chat.py` (chat loop, history, tool handling) +- **Create:** `tests/test_chat.py` diff --git a/issues/embedded-ai-embabel/15-wizard-chatbot-panel.md b/issues/embedded-ai-embabel/15-wizard-chatbot-panel.md new file mode 100644 index 0000000..08ce49f --- /dev/null +++ b/issues/embedded-ai-embabel/15-wizard-chatbot-panel.md @@ -0,0 +1,83 @@ +# Issue: Wizard Chatbot Panel + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Medium +**Depends on:** Issue 11 (provider abstraction), Issue 13 (MCP client) + +## Summary + +Add a chat panel to the existing wizard web GUI, allowing users to interact with the AI agent while editing narration and configuring their demo project. + +## Background + +The wizard (`docgen wizard`) is a Flask-based local web GUI for bootstrapping narration scripts. Adding a chat panel gives users a conversational way to: +- Ask for narration revisions in context ("make this more concise") +- Generate TTS and preview audio from the chat +- Get explanations of pipeline steps +- Debug issues without leaving the GUI + +## Acceptance Criteria + +- [ ] Add `/api/chat` SSE endpoint to Flask wizard for streaming responses +- [ ] Add `/api/chat/history` endpoint for loading chat history +- [ ] Add chat panel UI to `wizard.html`: + - Collapsible sidebar (default collapsed) or dedicated tab + - Message input with send button + - Message history with user/agent message bubbles + - Tool call indicators (spinner, "Running TTS...", etc.) 
+ - Code blocks in agent responses (for generated scripts) +- [ ] Chat is context-aware: + - Knows which segment is currently active + - Can reference current narration text + - Can suggest edits to the active segment +- [ ] Support commands via chat: + - "revise this narration to be more conversational" + - "generate TTS for this segment" + - "what does this segment's visual map look like?" + - "suggest source documents for segment 03" +- [ ] Show agent tool calls in chat with progress +- [ ] Persist chat history per session (in-memory, reset on server restart) +- [ ] Degrade gracefully: if no AI provider configured, show message with setup instructions +- [ ] Keyboard shortcut to toggle chat panel (e.g., `Ctrl+/`) + +## Technical Notes + +### SSE endpoint for streaming + +```python +@app.route("/api/chat", methods=["POST"]) +def api_chat(): + data = request.json + message = data.get("message", "") + segment = data.get("segment") # current segment context + + def stream(): + provider = get_ai_provider(cfg) + for token in provider.chat_stream(model=..., messages=[...]): + yield f"data: {json.dumps({'token': token})}\n\n" + yield "data: [DONE]\n\n" + + return Response(stream(), mimetype="text/event-stream") +``` + +### Frontend chat component + +```javascript +// wizard.js addition +async function sendChatMessage(message) { + const res = await fetch("/api/chat", { + method: "POST", + headers: {"Content-Type": "application/json"}, + body: JSON.stringify({message, segment: activeSegmentId}) + }); + const reader = res.body.getReader(); + // Stream tokens to chat panel... 
+} +``` + +## Files to Create/Modify + +- **Modify:** `src/docgen/wizard.py` (add chat endpoints) +- **Modify:** `src/docgen/templates/wizard.html` (add chat panel HTML) +- **Modify:** `src/docgen/static/wizard.js` (add chat JS logic) +- **Create/Modify:** `src/docgen/static/wizard.css` (chat panel styles) diff --git a/issues/embedded-ai-embabel/16-script-generation-agent.md b/issues/embedded-ai-embabel/16-script-generation-agent.md new file mode 100644 index 0000000..b5f0b70 --- /dev/null +++ b/issues/embedded-ai-embabel/16-script-generation-agent.md @@ -0,0 +1,80 @@ +# Issue: Script Generation Agent + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Low (high value, but requires mature agent infrastructure) +**Depends on:** Issue 12 (Embabel agents), Issue 14 or 15 (chat interface) + +## Summary + +Implement an Embabel agent (and corresponding Python integration) that generates Playwright capture scripts and Manim scene code from natural language descriptions, test file analysis, and segment narration. + +## Background + +Today, writing Playwright capture scripts or Manim scenes requires manual coding. The Script Generation Agent automates this by: +1. Analyzing existing test files to understand the UI flow +2. Reading the narration to know what needs to be demonstrated +3. Generating code that follows the `PlaywrightRunner` contract (writes MP4 to `DOCGEN_PLAYWRIGHT_OUTPUT`) +4. 
Iterating based on user feedback ("make the animation slower", "add a highlight on the login button") + +## Acceptance Criteria + +- [ ] Embabel `ScriptAgent` that generates Playwright capture scripts: + - Input: test file path, segment narration, desired demo flow description + - Output: Python script compatible with `PlaywrightRunner` contract + - Validates generated code: syntax check, import verification +- [ ] Generate Manim scene code from narration + segment description: + - Input: segment narration, description of desired animation + - Output: Python Manim scene class compatible with `ManimRunner` + - Includes timing from `timing.json` for sync +- [ ] Iterative refinement via chat: + - "make the animation slower" + - "add a highlight on the login button" + - "show the dashboard loading state" +- [ ] Template library for common patterns: + - Form fill + submit + - Navigation + page transition + - Dashboard overview with scroll + - Terminal command execution + - Architecture diagram animation +- [ ] Generated code includes appropriate imports and follows project conventions +- [ ] Python-side wrapper for invoking the agent via MCP + +## Technical Notes + +### Script generation prompt structure + +``` +System: You are a code generation agent for docgen. Generate {Playwright/Manim} scripts that: +1. Follow the contract: {contract details} +2. Demonstrate the flow described in the narration +3. Use timing from timing.json for synchronization +4. Include error handling and cleanup + +User: Generate a Playwright capture script for segment 03 (wizard setup). +Narration: "The wizard provides a local web interface..." 
+Test reference: tests/e2e/test_setup_view.py +``` + +### Validation loop + +```python +def generate_and_validate(description, narration, test_ref): + code = agent.generate_script(description, narration, test_ref) + # Syntax check + try: + compile(code, "", "exec") + except SyntaxError as e: + code = agent.fix_syntax(code, str(e)) + # Import check + missing = check_imports(code) + if missing: + code = agent.fix_imports(code, missing) + return code +``` + +## Files to Create/Modify + +- **Modify:** `docgen-agent/` (add ScriptAgent if using Embabel) +- **Create:** `src/docgen/script_generator.py` (Python-side integration) +- **Create:** `src/docgen/templates/` (script templates for common patterns) +- **Create:** `tests/test_script_generator.py` diff --git a/issues/embedded-ai-embabel/17-error-diagnosis-agent.md b/issues/embedded-ai-embabel/17-error-diagnosis-agent.md new file mode 100644 index 0000000..d0c1d35 --- /dev/null +++ b/issues/embedded-ai-embabel/17-error-diagnosis-agent.md @@ -0,0 +1,48 @@ +# Issue: Error Diagnosis Agent + +**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Low +**Depends on:** Issue 12 (Embabel agents), Issue 14 or 15 (chat interface) + +## Summary + +Implement an Embabel agent that analyzes docgen pipeline errors and provides actionable diagnosis and fix suggestions. Integrates with the chat interface so users can ask "what went wrong?" and get specific, helpful answers. + +## Background + +Common docgen pipeline errors: +- **FREEZE GUARD**: video is much shorter than audio, causing excessive frozen frames +- **Missing audio**: TTS generation failed or was skipped +- **FFmpeg failures**: codec issues, missing files, timeout +- **Playwright timeouts**: capture script or test runs too long +- **VHS errors**: tape commands fail in the real shell +- **Validation failures**: A/V drift, OCR errors, narration lint issues + +Currently, users must manually read error messages and figure out fixes. 
The diagnosis agent understands docgen's pipeline and can map error patterns to specific remediation steps. + +## Acceptance Criteria + +- [ ] Embabel `DebugAgent` that analyzes pipeline errors: + - Input: error log text, segment ID, pipeline stage, relevant config + - Output: diagnosis (what went wrong), suggestion (how to fix it), optional auto-fix +- [ ] Handles all common error types: + | Error | Diagnosis | Suggestion | + |-------|-----------|------------| + | FREEZE GUARD | Manim scene is 5s shorter than narration | "Add `self.wait(5)` at end of scene, or run `docgen generate-all --retry-manim`" | + | Missing audio | TTS failed for segment 03 | "Check OPENAI_API_KEY, run `docgen tts --segment 03`" | + | FFmpeg timeout | Compose timed out at 300s | "Increase `compose.ffmpeg_timeout_sec` or check video file integrity" | + | Playwright timeout | Script exceeded 120s | "Increase `playwright.timeout_sec` or optimize the capture script" | + | A/V drift | Video 2.8s longer than audio | "Video needs trimming; check visual_map source duration" | + | OCR error | "command not found" detected in frame | "VHS tape has a failing command; check line 15 of the tape" | +- [ ] Can auto-fix common issues when given permission: + - Rerun TTS for a failed segment + - Retry Manim with cache cleared + - Increase timeout and retry +- [ ] Integrates with `docgen validate` output for proactive suggestions +- [ ] Python-side wrapper for invoking via chat or CLI + +## Files to Create/Modify + +- **Modify:** `docgen-agent/` (add DebugAgent if using Embabel) +- **Create:** `src/docgen/error_diagnosis.py` (Python-side integration and fallback logic) +- **Create:** `tests/test_error_diagnosis.py` diff --git a/issues/embedded-ai-embabel/18-local-model-support.md b/issues/embedded-ai-embabel/18-local-model-support.md new file mode 100644 index 0000000..cbd6302 --- /dev/null +++ b/issues/embedded-ai-embabel/18-local-model-support.md @@ -0,0 +1,107 @@ +# Issue: Local Model Support (Ollama) + 
+**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface +**Priority:** Medium +**Depends on:** Issue 11 (provider abstraction) + +## Summary + +Implement an `OllamaProvider` that enables docgen to use locally-running LLMs via Ollama for narration generation and chat, reducing cost and removing the OpenAI API dependency for text generation tasks. + +## Background + +Ollama (`ollama.ai`) runs open-source LLMs locally. For many docgen tasks — narration drafting, narration revision, error diagnosis, script generation — a capable local model (Llama 3.2, Mistral, CodeLlama) may be sufficient, especially for iteration and development. + +TTS and Whisper transcription still require OpenAI (or equivalent cloud service) since local TTS quality is not yet competitive for professional narration, but text generation can run fully local. + +## Acceptance Criteria + +- [ ] `OllamaProvider` implementation using Ollama REST API: + ```python + class OllamaProvider: + def __init__(self, url: str = "http://localhost:11434", model: str = "llama3.2"): + ... + + def chat(self, model, messages, **kwargs) -> str: + # POST /api/chat + ... 
+ + def tts(self, model, voice, text, instructions, output_path): + # Falls back to OpenAI + raise NotImplementedError("Use OpenAI for TTS") + + def transcribe(self, model, audio_path, **kwargs): + # Falls back to OpenAI or local whisper.cpp + raise NotImplementedError("Use OpenAI for transcription") + ``` +- [ ] Config in `docgen.yaml`: + ```yaml + ai: + provider: ollama + ollama_url: http://localhost:11434 + ollama_model: llama3.2 + # TTS/transcription still use OpenAI even with Ollama chat + tts_provider: openai + transcribe_provider: openai + ``` +- [ ] Hybrid provider support: Ollama for chat, OpenAI for TTS + transcription +- [ ] Test with common local models: llama3.2, mistral, codellama +- [ ] Streaming support for chat responses +- [ ] Auto-detect Ollama availability on startup +- [ ] Document setup instructions: + ```bash + # Install Ollama + curl -fsSL https://ollama.ai/install.sh | sh + # Pull a model + ollama pull llama3.2 + # Configure docgen + # docgen.yaml: ai.provider: ollama + ``` +- [ ] Performance note in docs: first invocation pulls model (may take minutes), subsequent calls are fast + +## Technical Notes + +### Ollama REST API + +```python +import requests + +response = requests.post( + "http://localhost:11434/api/chat", + json={ + "model": "llama3.2", + "messages": [ + {"role": "system", "content": "..."}, + {"role": "user", "content": "..."}, + ], + "stream": False, + }, +) +result = response.json() +text = result["message"]["content"] +``` + +### Hybrid provider pattern + +```python +class HybridProvider: + """Uses different providers for different capabilities.""" + + def __init__(self, chat_provider, tts_provider, transcribe_provider): + self.chat = chat_provider + self.tts = tts_provider + self.transcribe = transcribe_provider +``` + +This naturally handles the case where chat goes to Ollama but TTS goes to OpenAI. 
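Note that the sketch above assigns the providers to attributes named after the protocol methods, so `hybrid.chat` is a provider object rather than a callable. If `HybridProvider` should itself satisfy the `AIProvider` protocol, a delegating variant may be preferable. A runnable sketch with stand-in backends (all stand-in names here are illustrative, not real docgen classes):

```python
class HybridProvider:
    """Delegates each AIProvider method to a capability-specific backend."""

    def __init__(self, chat_provider, tts_provider, transcribe_provider):
        self._chat_provider = chat_provider
        self._tts_provider = tts_provider
        self._transcribe_provider = transcribe_provider

    def chat(self, model, messages, **kwargs):
        return self._chat_provider.chat(model, messages, **kwargs)

    def tts(self, model, voice, text, instructions, output_path):
        return self._tts_provider.tts(model, voice, text, instructions, output_path)

    def transcribe(self, model, audio_path, **kwargs):
        return self._transcribe_provider.transcribe(model, audio_path, **kwargs)


# Stand-ins for OllamaProvider / OpenAIProvider, purely for illustration:
class _EchoChat:
    def chat(self, model, messages, **kwargs):
        return f"{model}: {messages[-1]['content']}"


class _CloudOnly:
    def tts(self, model, voice, text, instructions, output_path):
        raise NotImplementedError("cloud TTS goes here")

    def transcribe(self, model, audio_path, **kwargs):
        raise NotImplementedError("cloud transcription goes here")


hybrid = HybridProvider(_EchoChat(), _CloudOnly(), _CloudOnly())
print(hybrid.chat("llama3.2", [{"role": "user", "content": "hi"}]))  # llama3.2: hi
```

With this shape, call sites refactored in Issue 11 need no knowledge of which backend serves which capability.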
+ +### Local Whisper alternative + +For full offline support, we could optionally integrate `whisper.cpp` or `faster-whisper` for local transcription. This is a stretch goal — OpenAI Whisper is cheap and reliable. + +## Files to Create/Modify + +- **Modify:** `src/docgen/ai_provider.py` (add OllamaProvider, HybridProvider) +- **Modify:** `src/docgen/config.py` (add per-capability provider config) +- **Create:** `tests/test_ollama_provider.py` +- **Modify:** `README.md` (Ollama setup instructions) diff --git a/issues/playwright-test-integration/01-trace-event-extractor.md b/issues/playwright-test-integration/01-trace-event-extractor.md new file mode 100644 index 0000000..8426ffd --- /dev/null +++ b/issues/playwright-test-integration/01-trace-event-extractor.md @@ -0,0 +1,49 @@ +# Issue: Playwright Trace Event Extractor + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** High (foundational) +**Depends on:** None + +## Summary + +Create `playwright_trace.py` — a module that parses Playwright trace files (`trace.zip`) to extract a timeline of browser actions with precise timestamps. This is the foundational data source for synchronizing narration audio with Playwright test video. + +## Background + +Playwright traces contain a structured JSON log of every user action (clicks, fills, navigations) with precise wall-clock timestamps. By extracting these events, we can map "what happened in the video" to "what the narration says" — the same way `tape_sync.py` maps VHS Sleep blocks to timing.json, and Manim scenes use `wait_until` to sync with Whisper segments. 
+ +## Acceptance Criteria + +- [ ] Parse Playwright `trace.zip` files and extract action events +- [ ] Support action types: `click`, `fill`, `type`, `press`, `goto`/`navigate`, `select_option`, `check`, `uncheck`, `hover`, `dblclick`, `drag_to` +- [ ] Each event includes: timestamp (relative to video start), action type, selector, optional value (typed text, URL) +- [ ] Output `events.json` with normalized timestamps: + ```json + [ + {"t": 0.0, "action": "goto", "url": "http://localhost:8501"}, + {"t": 1.2, "action": "fill", "selector": "#email", "value": "user@example.com"}, + {"t": 3.4, "action": "click", "selector": "button[type=submit]"}, + {"t": 5.1, "action": "goto", "url": "/dashboard"} + ] + ``` +- [ ] Handle multi-page traces and iframes +- [ ] CLI command: `docgen trace-extract [--test test_name] [--output events.json]` +- [ ] Unit tests with fixture trace zip files + +## Technical Notes + +Playwright traces are zip archives containing: +- `trace.trace` — binary trace events (protobuf-like) +- `trace.network` — network events +- Resources (screenshots, etc.) + +The trace format has been stable since Playwright 1.12+. We should parse the action entries from the trace resources JSON, filtering for user-initiated actions vs internal Playwright bookkeeping. + +The `events.json` format should be extensible for future action types and metadata (screenshots at event time, DOM snapshots, etc.). 
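A normalization sketch for the filtering step above. The entry field names (`apiName`, `startTime` in wall-clock milliseconds, `params`) are assumptions about the decoded trace entries and must be verified against the target Playwright version; `t0_ms` is the wall-clock start of the video:

```python
import re

# Action types from the acceptance criteria above.
ACTIONS = {"click", "fill", "type", "press", "goto", "navigate", "select_option",
           "check", "uncheck", "hover", "dblclick", "drag_to"}


def normalize_events(raw_entries: list[dict], t0_ms: float) -> list[dict]:
    """Normalize decoded trace action entries into the events.json schema.

    Assumes entries already decoded from trace.zip with `apiName`
    (e.g. "page.click"), `startTime` (wall-clock ms), and `params` fields.
    """
    events = []
    for entry in raw_entries:
        api = entry.get("apiName", "")
        # camelCase method -> snake_case action ("selectOption" -> "select_option")
        action = re.sub(r"(?<!^)(?=[A-Z])", "_", api.split(".")[-1]).lower()
        if action not in ACTIONS:
            continue  # drop internal Playwright bookkeeping entries
        event = {"t": round((entry["startTime"] - t0_ms) / 1000.0, 2), "action": action}
        params = entry.get("params", {})
        for key in ("selector", "url", "value"):
            if key in params:
                event[key] = params[key]
        events.append(event)
    return sorted(events, key=lambda e: e["t"])
```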
+ +## Files to Create/Modify + +- **Create:** `src/docgen/playwright_trace.py` +- **Modify:** `src/docgen/cli.py` (add `trace-extract` command) +- **Create:** `tests/test_playwright_trace.py` +- **Create:** `tests/fixtures/` (sample trace.zip files) diff --git a/issues/playwright-test-integration/02-test-runner-integration.md b/issues/playwright-test-integration/02-test-runner-integration.md new file mode 100644 index 0000000..e4082e0 --- /dev/null +++ b/issues/playwright-test-integration/02-test-runner-integration.md @@ -0,0 +1,69 @@ +# Issue: Playwright Test Runner Integration + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** High (foundational) +**Depends on:** None (parallel with Issue 1) + +## Summary + +Create `playwright_test_runner.py` — a runner that invokes existing Playwright test suites with video and tracing enabled, then collects the resulting video and trace artifacts for use as docgen visual sources. + +## Background + +The existing `PlaywrightRunner` in `playwright_runner.py` runs custom capture scripts specifically written for docgen. This new runner takes a fundamentally different approach: it runs the project's **existing** Playwright tests as-is, enabling video recording and tracing via Playwright configuration, and harvests the artifacts. + +This supports both Python (`pytest-playwright`) and Node.js (`@playwright/test`) test frameworks. 
+
+## Acceptance Criteria
+
+- [ ] Invoke `pytest` or `npx playwright test` with `--video on --tracing on` flags
+- [ ] Auto-detect test framework from project files (`conftest.py` → pytest, `playwright.config.ts` → Node.js)
+- [ ] Discover and collect video + trace artifacts from test output directories
+- [ ] Support filtering by test name/path to capture specific tests as segments
+- [ ] Auto-detect video output paths from Playwright config
+- [ ] Handle test failures gracefully — capture video even when assertions fail (`--video on` records regardless of outcome; `--tracing retain-on-failure` keeps traces only for failing tests)
+- [ ] Config block in `docgen.yaml`:
+  ```yaml
+  playwright_test:
+    framework: pytest              # or "playwright" for Node.js
+    test_command: "pytest tests/e2e/ --video on --tracing on"
+    test_dir: tests/e2e/
+    video_dir: test-results/videos/
+    trace_dir: test-results/traces/
+    retain_on_failure: true
+  ```
+- [ ] CLI command: `docgen playwright-test [--test test_login.py] [--timeout 300]`
+
+## Technical Notes
+
+### pytest-playwright video config
+
+```python
+# conftest.py: override the plugin's fixture, merging its defaults
+@pytest.fixture(scope="session")
+def browser_context_args(browser_context_args):
+    return {**browser_context_args, "record_video_dir": "test-results/videos/"}
+```
+
+Or via the CLI: `pytest --video on --tracing on`
+
+### @playwright/test video config
+
+```typescript
+// playwright.config.ts
+export default defineConfig({
+  use: {
+    video: 'on',
+    trace: 'on',
+  },
+});
+```
+
+The runner should inject these settings via environment variables or config overrides, without requiring users to modify their test configuration permanently.
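The invocation itself can be sketched as command construction. `--video`/`--tracing` are real pytest-playwright flags and `--trace` is a real `@playwright/test` CLI flag; the helper function is hypothetical:

```python
def build_test_command(framework: str, test_target: str) -> list[str]:
    """Build the test invocation with video + tracing artifacts enabled.
    For Node.js, video is normally set in playwright.config.ts;
    only tracing has a dedicated CLI flag."""
    if framework == "pytest":
        return ["pytest", test_target, "--video", "on", "--tracing", "on"]
    return ["npx", "playwright", "test", test_target, "--trace", "on"]
```

The runner would pass this to `subprocess.run` and treat a non-zero exit code as "tests failed but artifacts may still exist on disk" rather than aborting the pipeline.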
+ +## Files to Create/Modify + +- **Create:** `src/docgen/playwright_test_runner.py` +- **Modify:** `src/docgen/config.py` (add `playwright_test` config properties) +- **Modify:** `src/docgen/cli.py` (add `playwright-test` command) +- **Create:** `tests/test_playwright_test_runner.py` diff --git a/issues/playwright-test-integration/03-event-narration-sync.md b/issues/playwright-test-integration/03-event-narration-sync.md new file mode 100644 index 0000000..3836010 --- /dev/null +++ b/issues/playwright-test-integration/03-event-narration-sync.md @@ -0,0 +1,75 @@ +# Issue: Event-to-Narration Synchronizer + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** High (core sync logic) +**Depends on:** Issue 1 (trace extractor), Issue 2 (test runner) + +## Summary + +Create `playwright_sync.py` — the core synchronization engine that aligns narration audio timing with Playwright browser events extracted from test traces. This is analogous to `tape_sync.py` for VHS and `scenes.py` timing for Manim, but operates on continuous video rather than discrete commands. + +## Background + +The fundamental challenge: a Playwright test runs at its own pace (e.g., clicks at 1.2s, 3.4s, 5.1s), but the narration discusses those actions at different times (e.g., "now we fill in the email" starts at 2.0s, "click submit" at 6.0s). We need to warp the video timeline so the visual actions align with the spoken narration. + +### Sync Algorithm + +``` +Input: + events[] = [{t: 1.2, action: "fill", selector: "#email"}, ...] # from events.json + timing = {segments: [...], words: [...]} # from timing.json + anchors[] = [{narration_anchor: "fill in email", action: "fill"}] # from config + +Algorithm: + 1. For each anchor, find matching event by action + selector + 2. For each anchor, find matching narration timestamp by fuzzy text search in words[] + 3. Build desired_time[] (narration) and actual_time[] (video) pairs + 4. 
Compute speed factors between consecutive anchor pairs:
+       speed[i] = (desired[i+1] - desired[i]) / (actual[i+1] - actual[i])
+       (a PTS stretch factor: > 1 slows the video down, < 1 speeds it up)
+  5. Clamp speed factors to [0.25, 4.0]
+  6. Generate ffmpeg setpts filter for piece-wise speed adjustment
+  7. Output sync_map.json + retimed video
+```
+
+## Acceptance Criteria
+
+- [ ] Load `events.json` + `timing.json` and match anchors to events
+- [ ] Fuzzy keyword matching between narration text and event descriptions
+- [ ] Compute per-segment speed adjustment factors
+- [ ] Generate `sync_map.json`:
+  ```json
+  {
+    "anchors": [
+      {"event_t": 1.2, "narration_t": 2.0, "action": "fill", "text": "fill in the email"},
+      {"event_t": 3.4, "narration_t": 6.0, "action": "click", "text": "click submit"}
+    ],
+    "speed_segments": [
+      {"start": 0.0, "end": 1.2, "factor": 1.67},
+      {"start": 1.2, "end": 3.4, "factor": 1.82}
+    ]
+  }
+  ```
+- [ ] Support sync strategies:
+  - `stretch` — adjust video speed to match narration (default)
+  - `cut` — trim idle periods from video
+  - `pad` — freeze key frames to extend short segments
+- [ ] CLI: `docgen sync-playwright [--segment 03] [--dry-run] [--strategy stretch]`
+- [ ] Validation: warn when event count doesn't match anchor count
+- [ ] Fallback: even distribution when no anchors match (same as VHS default)
+
+## Technical Notes
+
+The speed factor computation is conceptually identical to how `tape_sync.py` distributes `duration / n_blocks` across VHS Type/Enter/Sleep blocks, but applied to continuous video timecodes rather than discrete sleep values.
+
+FFmpeg `setpts` filter for variable-speed playback (factors from the example above: 2.0/1.2 ≈ 1.67 and 4.0/2.2 ≈ 1.82):
+```
+setpts='if(lt(PTS,1.2),PTS*1.67,if(lt(PTS,3.4),(PTS-1.2)*1.82+2.0,...))'
+```
+
+For complex speed profiles, it may be cleaner to split, retime, and concat rather than build one complex setpts expression.
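Steps 4-6 of the algorithm can be sketched as follows, using the stretch-factor convention from the formula above (factor = narration delta / video delta, so > 1 slows the video); the function shape and key names are illustrative:

```python
def speed_segments(anchors: list[dict], video_len: float,
                   min_f: float = 0.25, max_f: float = 4.0) -> list[dict]:
    """Piece-wise PTS stretch factors between consecutive anchors.
    Anchor dicts use sync_map.json keys: "event_t" (video) and "narration_t"."""
    # Implicit anchor at t=0, plus a tail segment that plays at 1x.
    pts = [(0.0, 0.0)] + sorted((a["event_t"], a["narration_t"]) for a in anchors)
    last_e, last_n = pts[-1]
    if video_len > last_e:
        pts.append((video_len, last_n + (video_len - last_e)))
    segments = []
    for (e0, n0), (e1, n1) in zip(pts, pts[1:]):
        if e1 <= e0:
            continue  # coincident anchors: nothing to retime
        factor = max(min_f, min(max_f, (n1 - n0) / (e1 - e0)))
        segments.append({"start": e0, "end": e1, "factor": round(factor, 2)})
    return segments
```

Running this on the example anchors reproduces the `speed_segments` table: 2.0/1.2 ≈ 1.67 for the first window and 4.0/2.2 ≈ 1.82 for the second.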
+ +## Files to Create/Modify + +- **Create:** `src/docgen/playwright_sync.py` +- **Modify:** `src/docgen/cli.py` (add `sync-playwright` command) +- **Create:** `tests/test_playwright_sync.py` diff --git a/issues/playwright-test-integration/04-video-speed-adjustment.md b/issues/playwright-test-integration/04-video-speed-adjustment.md new file mode 100644 index 0000000..f97b4f1 --- /dev/null +++ b/issues/playwright-test-integration/04-video-speed-adjustment.md @@ -0,0 +1,57 @@ +# Issue: Video Speed Adjustment via FFmpeg + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** High (required for compose) +**Depends on:** Issue 3 (sync engine) + +## Summary + +Implement the FFmpeg video retiming logic that applies the speed adjustment factors computed by the sync engine, and integrate `type: playwright_test` into the existing `Composer`. + +## Background + +The sync engine (Issue 3) computes per-segment speed factors. This issue implements the actual video manipulation: splitting the source video into segments, applying `setpts` filters to each, and concatenating the result. The retimed video is then composed with narration audio via the existing `Composer._compose_simple` path. 
+
+## Acceptance Criteria
+
+- [ ] Apply `setpts` filter for piece-wise speed adjustment
+- [ ] Support variable speeds within a single video (different rates for different event windows)
+- [ ] Preserve video quality during retiming (re-encode at source quality/CRF)
+- [ ] Handle audio stripping from source video (Playwright videos may contain system audio or silence)
+- [ ] Handle WebM input (Playwright default) — transcode to MP4 if needed
+- [ ] Frame interpolation option for slowed-down segments (`minterpolate` filter) to improve quality at low FPS
+- [ ] Add `type: playwright_test` handler in `compose.py`:
+  ```python
+  elif vtype == "playwright_test":
+      video_path = self._playwright_test_path(vmap)
+      sync_map = self._load_sync_map(seg_id)
+      if sync_map:
+          video_path = self._retime_video(video_path, sync_map)
+      ok = self._compose_simple(seg_id, video_path, strict=strict)
+  ```
+- [ ] Configurable speed clamps: `min_speed_factor`, `max_speed_factor` in `docgen.yaml`
+
+## Technical Notes
+
+### Piece-wise speed adjustment approach
+
+Rather than a single complex `setpts` expression, use the split-retime-concat approach:
+
+1. Split source video at anchor points using `ffmpeg -ss -to`
+2. Apply `setpts=PTS*factor` to each segment, where factor is the sync engine's stretch factor (narration duration / video duration), so factor > 1 slows the segment
+3. Concat the retimed segments
+4. Feed result to `_compose_simple` for audio muxing
+
+### WebM handling
+
+Playwright records in WebM by default. Two options:
+- Transcode to MP4 early (simple, adds encoding time)
+- Pass WebM through ffmpeg directly (works, but some filters behave differently)
+
+Recommend: transcode to MP4 at collection time (in the test runner, Issue 2).
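The split-retime-concat steps can be sketched as command construction. The helper name is hypothetical (the real logic would live in `compose.py`'s `_retime_video`), and flags like output CRF are omitted for brevity:

```python
from pathlib import Path

def build_retime_commands(src: Path, segments: list[dict],
                          out: Path, workdir: Path) -> list[list[str]]:
    """ffmpeg invocations for split -> retime -> concat.
    `segments` follows sync_map.json: {"start", "end", "factor"}, where
    setpts=PTS*factor stretches (slows) the part when factor > 1."""
    cmds, parts = [], []
    for i, seg in enumerate(segments):
        part = workdir / f"part{i:03d}.mp4"
        cmds.append(["ffmpeg", "-y",
                     "-ss", str(seg["start"]), "-to", str(seg["end"]),
                     "-i", str(src),
                     "-vf", f"setpts=PTS*{seg['factor']}",
                     "-an",  # drop any source audio; narration is muxed later
                     str(part)])
        parts.append(part)
    # Concat demuxer needs a file list; stream-copy since parts are re-encoded.
    concat_list = workdir / "concat.txt"
    concat_list.write_text("".join(f"file '{p}'\n" for p in parts))
    cmds.append(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                 "-i", str(concat_list), "-c", "copy", str(out)])
    return cmds
```

Each command would then be executed with `subprocess.run(cmd, check=True)`; forcing a constant output frame rate on the parts keeps the concat step well behaved.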
+ +## Files to Create/Modify + +- **Modify:** `src/docgen/compose.py` (add `playwright_test` handler, `_retime_video`) +- **Modify:** `src/docgen/config.py` (speed clamp config) +- **Create:** `tests/test_playwright_compose.py` diff --git a/issues/playwright-test-integration/05-config-visual-map.md b/issues/playwright-test-integration/05-config-visual-map.md new file mode 100644 index 0000000..a624a31 --- /dev/null +++ b/issues/playwright-test-integration/05-config-visual-map.md @@ -0,0 +1,53 @@ +# Issue: Config & Visual Map Extensions for playwright_test + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Medium (enables pipeline) +**Depends on:** None (can be done in parallel) + +## Summary + +Extend `Config`, `docgen.yaml` schema, and `docgen init` to support the new `playwright_test` visual source type and its associated configuration. + +## Acceptance Criteria + +- [ ] Add `playwright_test:` configuration block to `Config` dataclass: + ```yaml + playwright_test: + framework: pytest # "pytest" or "playwright" + test_command: "" # custom test command override + test_dir: tests/e2e/ # where tests live + video_dir: test-results/videos/ # where Playwright saves videos + trace_dir: test-results/traces/ # where Playwright saves traces + retain_on_failure: true # capture video even if test fails + transcode_to_mp4: true # convert WebM to MP4 at collection time + default_viewport: + width: 1920 + height: 1080 + ``` +- [ ] New config properties: `playwright_test_framework`, `playwright_test_command`, `playwright_test_dir`, `playwright_test_video_dir`, `playwright_test_trace_dir`, etc. 
+- [ ] Extend `visual_map` to support `type: playwright_test`: + ```yaml + visual_map: + "03": + type: playwright_test + test: tests/e2e/test_wizard.py::test_setup_flow + source: test-results/videos/test_setup_flow.webm + trace: test-results/traces/test_setup_flow/trace.zip + events: + - narration_anchor: "launch the wizard" + action: goto + url: / + - narration_anchor: "select the setup tab" + action: click + selector: "[data-tab=setup]" + ``` +- [ ] Add `sync_playwright_after_timestamps` pipeline option (analogous to `sync_vhs_after_timestamps`) +- [ ] Update `docgen init` to offer Playwright test integration option when tests are detected +- [ ] Validate config: ensure referenced test files, video files, and trace files exist +- [ ] Unit tests for new config parsing + +## Files to Create/Modify + +- **Modify:** `src/docgen/config.py` +- **Modify:** `src/docgen/init.py` (if scaffolding updated) +- **Modify:** `tests/test_config.py` diff --git a/issues/playwright-test-integration/06-pipeline-integration.md b/issues/playwright-test-integration/06-pipeline-integration.md new file mode 100644 index 0000000..8232799 --- /dev/null +++ b/issues/playwright-test-integration/06-pipeline-integration.md @@ -0,0 +1,55 @@ +# Issue: Pipeline Integration for Playwright Tests + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Medium +**Depends on:** Issues 1-5 + +## Summary + +Integrate the Playwright test runner, trace extractor, and sync engine into the docgen pipeline (`generate-all`, `rebuild-after-audio`). + +## Acceptance Criteria + +- [ ] Add Playwright test stages to `Pipeline.run()`: + 1. (existing) TTS → Timestamps + 2. (new) Run Playwright tests → Collect videos + traces + 3. (new) Extract events from traces + 4. (new) Sync events to narration timing + 5. 
(existing) Manim, VHS, Compose, Validate, Concat, Pages +- [ ] Order of operations: tests run after TTS + timestamps (need timing data for sync), before compose +- [ ] Add `--skip-playwright-tests` flag to `docgen generate-all` +- [ ] Retry logic: if test fails, optionally retry once before skipping segment +- [ ] Support mixed pipelines: some segments from Manim, some from VHS, some from Playwright tests +- [ ] Pipeline only runs Playwright test stages if any `visual_map` entry has `type: playwright_test` + +## Technical Notes + +Pipeline flow with Playwright tests: + +```python +def run(self, ..., skip_playwright_tests: bool = False): + # ... TTS, Timestamps, VHS sync (existing) ... + + if not skip_playwright_tests and self._has_playwright_test_segments(): + print("\n=== Stage: Playwright Tests ===") + from docgen.playwright_test_runner import PlaywrightTestRunner + runner = PlaywrightTestRunner(self.config) + runner.run_tests() + + print("\n=== Stage: Trace Extraction ===") + from docgen.playwright_trace import TraceExtractor + TraceExtractor(self.config).extract_all() + + if self.config.sync_playwright_after_timestamps: + print("\n=== Stage: Sync Playwright ===") + from docgen.playwright_sync import PlaywrightSynchronizer + PlaywrightSynchronizer(self.config).sync() + + # ... Manim, VHS, Compose, Validate, Concat, Pages (existing) ... 
+``` + +## Files to Create/Modify + +- **Modify:** `src/docgen/pipeline.py` +- **Modify:** `src/docgen/cli.py` (add `--skip-playwright-tests` flag) +- **Create:** `tests/test_pipeline_playwright.py` diff --git a/issues/playwright-test-integration/07-auto-discovery.md b/issues/playwright-test-integration/07-auto-discovery.md new file mode 100644 index 0000000..7805d20 --- /dev/null +++ b/issues/playwright-test-integration/07-auto-discovery.md @@ -0,0 +1,39 @@ +# Issue: Auto-Discovery of Existing Playwright Tests + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Low (nice-to-have) +**Depends on:** Issues 2, 5 + +## Summary + +Automatically discover existing Playwright tests in a project and suggest `visual_map` entries, both during `docgen init` and in the wizard GUI. + +## Acceptance Criteria + +- [ ] Scan project for Playwright indicators: + - Python: `conftest.py` with `playwright` imports, `pytest-playwright` in dependencies + - Node.js: `playwright.config.ts` or `playwright.config.js`, `@playwright/test` in `package.json` +- [ ] Discover individual test files and test functions +- [ ] Suggest `visual_map` entries based on discovered tests: + ``` + Found 3 Playwright tests: + tests/e2e/test_setup_view.py::test_setup_tab_navigation + tests/e2e/test_setup_view.py::test_bulk_generate + tests/e2e/test_api_integration.py::test_scan_endpoint + + Suggested visual_map entries: + "03": { type: playwright_test, test: "tests/e2e/test_setup_view.py::test_setup_tab_navigation" } + "04": { type: playwright_test, test: "tests/e2e/test_setup_view.py::test_bulk_generate" } + ``` +- [ ] `docgen wizard` integration: show discovered tests as candidate segments in the setup GUI +- [ ] `docgen init` integration: auto-populate visual_map when Playwright tests are found +- [ ] Handle monorepo layouts where tests live in a different directory than docs +- [ ] CLI: `docgen discover-tests [--dir tests/]` + +## Files to Create/Modify + +- **Create:** 
`src/docgen/test_discovery.py` +- **Modify:** `src/docgen/wizard.py` (add test discovery API) +- **Modify:** `src/docgen/init.py` (integrate discovery into scaffolding) +- **Modify:** `src/docgen/cli.py` (add `discover-tests` command) +- **Create:** `tests/test_discovery.py` diff --git a/issues/playwright-test-integration/08-anchor-auto-detection.md b/issues/playwright-test-integration/08-anchor-auto-detection.md new file mode 100644 index 0000000..eda590b --- /dev/null +++ b/issues/playwright-test-integration/08-anchor-auto-detection.md @@ -0,0 +1,51 @@ +# Issue: Narration Anchor Auto-Detection + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Low (nice-to-have) +**Depends on:** Issues 1, 3 + +## Summary + +Automatically detect narration anchors by cross-referencing Playwright action metadata (selectors, URLs, typed text) with the narration transcript, reducing the manual configuration needed for event-to-narration sync. + +## Acceptance Criteria + +- [ ] Analyze Playwright actions to extract descriptive keywords: + - `click "button[type=submit]"` → "submit", "button" + - `fill "#email"` → "email" + - `goto "/dashboard"` → "dashboard" + - `click "[data-testid=save-btn]"` → "save" +- [ ] Cross-reference extracted keywords with narration text (word-level timestamps from Whisper) +- [ ] Use Whisper word-level timestamps for precise alignment +- [ ] Generate suggested anchor mappings: + ```json + { + "auto_anchors": [ + {"event_idx": 0, "action": "fill", "keyword": "email", "narration_word_idx": 12, "narration_t": 2.1, "confidence": 0.85}, + {"event_idx": 1, "action": "click", "keyword": "submit", "narration_word_idx": 28, "narration_t": 5.8, "confidence": 0.92} + ] + } + ``` +- [ ] Fallback: evenly distribute events across narration duration when no anchors match +- [ ] Confidence scoring for each match (exact word match > substring > semantic similarity) +- [ ] CLI: `docgen suggest-anchors --segment 03` to preview auto-detected mappings +- [ ] 
Interactive mode: present suggestions and let user confirm/override + +## Technical Notes + +Keyword extraction from selectors uses heuristics: +- `#email` → strip `#`, split camelCase → "email" +- `[data-testid=save-btn]` → extract value, strip `-btn` suffix → "save" +- `button:has-text("Submit")` → extract text content → "Submit" +- URL paths: `/dashboard/settings` → "dashboard", "settings" + +The confidence scoring considers: +- Exact word match in narration transcript (high) +- Partial match or synonym (medium) +- Positional heuristic: events and narration words should be in the same order (boost) + +## Files to Create/Modify + +- **Create:** `src/docgen/anchor_detection.py` +- **Modify:** `src/docgen/cli.py` (add `suggest-anchors` command) +- **Create:** `tests/test_anchor_detection.py` diff --git a/issues/playwright-test-integration/09-validation-extensions.md b/issues/playwright-test-integration/09-validation-extensions.md new file mode 100644 index 0000000..fa44c7a --- /dev/null +++ b/issues/playwright-test-integration/09-validation-extensions.md @@ -0,0 +1,25 @@ +# Issue: Validation Extensions for Playwright Test Segments + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Medium +**Depends on:** Issues 4, 5 + +## Summary + +Extend the existing `Validator` to handle `playwright_test` segments, including A/V drift checks, event-narration alignment validation, OCR scanning, and test failure detection. 
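The drift and speed-factor checks can be sketched as a pure function over the sync map; the function name is illustrative (the real checks would extend the existing `Validator`), and the thresholds mirror the speed clamp and `max_drift_sec` config values:

```python
def retime_warnings(sync_map: dict, audio_dur: float, video_dur: float,
                    max_drift_sec: float = 0.5,
                    clamp: tuple[float, float] = (0.25, 4.0)) -> list[str]:
    """Collect validation warnings for a retimed playwright_test segment."""
    warnings = []
    for seg in sync_map.get("speed_segments", []):
        # Factors outside the clamp range signal a badly matched anchor.
        if not (clamp[0] <= seg["factor"] <= clamp[1]):
            warnings.append(
                f"extreme speed factor {seg['factor']} "
                f"in [{seg['start']:.1f}s, {seg['end']:.1f}s]")
    # Retimed video should land within max_drift_sec of the narration audio.
    drift = abs(video_dur - audio_dur)
    if drift > max_drift_sec:
        warnings.append(f"retimed video drifts {drift:.2f}s from narration")
    return warnings
```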
+ +## Acceptance Criteria + +- [ ] Extend `Validator` to check `playwright_test` segments for A/V drift (reuse existing `_check_av_drift`) +- [ ] Validate that event count in trace matches expected narration anchor count +- [ ] Warn when speed adjustment factors are extreme (< 0.25 or > 4.0) +- [ ] OCR validation on Playwright test video (reuse existing `_ocr_scan`) +- [ ] Check for test failures in trace data and warn/fail appropriately +- [ ] Verify that retimed video duration matches narration audio duration (within `max_drift_sec`) +- [ ] Add `playwright_test` segment type to `--pre-push` checks +- [ ] Report includes: test name, video source, sync quality metrics + +## Files to Create/Modify + +- **Modify:** `src/docgen/validate.py` +- **Create:** `tests/test_validate_playwright.py` diff --git a/issues/playwright-test-integration/10-documentation-dogfood.md b/issues/playwright-test-integration/10-documentation-dogfood.md new file mode 100644 index 0000000..5446447 --- /dev/null +++ b/issues/playwright-test-integration/10-documentation-dogfood.md @@ -0,0 +1,46 @@ +# Issue: Documentation & Dogfood for Playwright Test Video + +**Milestone:** 4 — Playwright Test Video Integration +**Priority:** Medium +**Depends on:** Issues 1-6 + +## Summary + +Document the Playwright test video integration and dogfood it by converting one of docgen's own e2e tests into a demo video segment. 
+ +## Acceptance Criteria + +- [ ] Add Playwright test integration guide section to README: + - Overview of the approach (reuse existing tests) + - Configuration example (`docgen.yaml` with `type: playwright_test`) + - Step-by-step walkthrough + - Event anchor configuration reference + - Sync strategy documentation +- [ ] Convert one docgen e2e test into a demo segment: + - Candidate: `tests/e2e/test_setup_view.py` (wizard setup flow) + - Add `type: playwright_test` entry to `docs/demos/docgen.yaml` + - Write narration script for the wizard walkthrough synced to test events +- [ ] Update `docs/demos/docgen.yaml` with example `playwright_test` visual_map entry +- [ ] Update milestone spec link in README +- [ ] Add FAQ section: "When to use Playwright test vs custom Playwright script vs Manim vs VHS" + +## Dogfood Plan + +The docgen project already has these e2e tests that exercise the wizard: +- `test_setup_view.py` — navigates tabs, checks headings, verifies file tree +- `test_production_view.py` — switches to production view, tests narration editing +- `test_api_integration.py` — tests API endpoints (scan, generate, etc.) + +The `test_setup_view.py::test_setup_tab_navigation` test is ideal for dogfooding: +1. It opens the wizard +2. Clicks through setup tabs +3. Verifies the file tree renders +4. These are exactly the actions a demo video would show + +Narration would describe: "The wizard provides a local web interface for creating narration scripts. Let's walk through the setup flow — first we see the project overview, then select source documents..." 
+ +## Files to Create/Modify + +- **Modify:** `README.md` +- **Modify:** `docs/demos/docgen.yaml` +- **Create:** `docs/demos/narration/07-wizard-demo.md` (narration for wizard test video) diff --git a/milestones/milestone-4-playwright-test-video.md b/milestones/milestone-4-playwright-test-video.md new file mode 100644 index 0000000..4ca6149 --- /dev/null +++ b/milestones/milestone-4-playwright-test-video.md @@ -0,0 +1,237 @@ +# Milestone 4 — Playwright Test Video Integration + +**Goal:** Allow docgen to piggyback on existing Playwright test suites, using their recorded videos as visual sources instead of (or alongside) Manim animations. Synchronize narration audio to Playwright browser events (clicks, navigations, typed input) the same way Manim scenes sync to `timing.json`. + +## Motivation + +Many projects already have Playwright end-to-end tests that exercise the exact UI flows a demo video would show. Today, docgen requires either: +- **Manim** — programmatic animations (high effort, no real UI) +- **VHS** — terminal recordings (CLI-only) +- **Custom Playwright scripts** — purpose-built capture scripts separate from tests + +This milestone eliminates the duplicate effort by letting teams **reuse existing Playwright tests as-is**, harvesting the video recordings Playwright already produces, and synchronizing narration to the click/navigation events that naturally occur during test execution. + +## Architecture Overview + +``` +┌─────────────────────────────────────────────────────────────────┐ +│ Existing Playwright Tests │ +│ (test_login.py, test_dashboard.py, etc.) 
│ +│ │ +│ Playwright config: video: { dir: 'test-results/videos' } │ +│ ───────────────────────────────────────────────────────────── │ +│ test_login: │ +│ page.goto("/login") │ +│ page.fill("#email", "user@example.com") ← event @ 1.2s │ +│ page.click("button[type=submit]") ← event @ 3.4s │ +│ expect(page).to_have_url("/dashboard") ← event @ 5.1s │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ docgen Playwright Harvester │ +│ │ +│ 1. Run tests with tracing + video enabled │ +│ 2. Extract event timeline from trace.zip │ +│ 3. Produce events.json: │ +│ [ │ +│ {"t": 1.2, "action": "fill", "selector": "#email"}, │ +│ {"t": 3.4, "action": "click", "selector": "button"}, │ +│ {"t": 5.1, "action": "navigate", "url": "/dashboard"} │ +│ ] │ +│ 4. Map events.json → narration segment boundaries │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ docgen Playwright Sync │ +│ (analogous to tape_sync.py for VHS, scenes.py for Manim) │ +│ │ +│ Inputs: │ +│ - events.json (from trace extraction) │ +│ - timing.json (from Whisper timestamps) │ +│ Output: │ +│ - sync_map.json: maps narration words/segments to video │ +│ timestamps where matching UI events occur │ +│ - Speed-adjusted video (ffmpeg setpts) or segment cut-points │ +│ so narration aligns with visual actions │ +└─────────────────────────────────────────────────────────────────┘ + │ + ▼ +┌─────────────────────────────────────────────────────────────────┐ +│ docgen compose (existing) │ +│ │ +│ Muxes speed-adjusted Playwright video + narration audio │ +│ using the same ffmpeg pipeline as other visual types │ +└─────────────────────────────────────────────────────────────────┘ +``` + +## Key Design Decisions + +### 1. 
Trace-Based Event Extraction (not code instrumentation) + +Rather than requiring users to modify their tests, we parse Playwright's **trace files** (`trace.zip`). Playwright traces contain a structured JSON log of every action with precise timestamps. This is non-invasive — tests run exactly as they would normally. + +### 2. Event-to-Narration Alignment Strategy + +Analogous to how `tape_sync.py` distributes narration duration across VHS Type/Enter/Sleep blocks, the Playwright sync will: +- Parse the event timeline from trace data +- Match events to narration segments (by configured mapping or auto-detection) +- Compute speed adjustment factors per segment so that key UI moments align with the corresponding narration words +- Apply `ffmpeg setpts` filters to speed up idle periods and slow down action-heavy periods + +### 3. New Visual Map Type: `playwright_test` + +Distinct from the existing `type: playwright` (custom capture scripts), the new `type: playwright_test` works with pre-existing test suites: + +```yaml +visual_map: + "03": + type: playwright_test + test: tests/e2e/test_wizard.py::test_setup_flow + source: test-results/videos/test_wizard/test_setup_flow.webm + trace: test-results/traces/test_wizard/test_setup_flow/trace.zip + events: + - narration_anchor: "fill in the email" + action: fill + selector: "#email" + - narration_anchor: "click submit" + action: click + selector: "button[type=submit]" +``` + +### 4. Fallback Behavior + +If no trace is available, the system falls back to simple duration-based sync (like the existing Playwright type), just using the test video as a flat visual source. 
+ +## Items + +### Issue 1: Playwright Trace Event Extractor (`playwright_trace.py`) +- [ ] Parse Playwright `trace.zip` files to extract action events with timestamps +- [ ] Support action types: `click`, `fill`, `type`, `press`, `goto`/`navigate`, `select_option`, `check`, `uncheck`, `hover`, `dblclick`, `drag_to` +- [ ] Output `events.json` with normalized timestamps relative to video start +- [ ] Handle multi-page traces and iframes +- [ ] CLI: `docgen trace-extract [--test test_name] [--output events.json]` +- [ ] Unit tests for trace parsing with fixture trace files + +### Issue 2: Playwright Test Runner Integration (`playwright_test_runner.py`) +- [ ] New runner that invokes `pytest` (or `playwright test`) with `--video on --tracing on` flags +- [ ] Discover and collect video + trace artifacts from test output directories +- [ ] Support filtering by test name/path to capture specific tests as segments +- [ ] Auto-detect Playwright config (`playwright.config.ts`, `conftest.py`) and video output paths +- [ ] Support both Python (`pytest-playwright`) and Node.js (`@playwright/test`) test frameworks +- [ ] Handle test failures gracefully — capture video even if assertions fail +- [ ] Config: `playwright_test:` block in `docgen.yaml` with `test_command`, `test_dir`, `video_dir`, `trace_dir` + +### Issue 3: Event-to-Narration Synchronizer (`playwright_sync.py`) +- [ ] Load `events.json` (from trace extraction) + `timing.json` (from Whisper) +- [ ] Match narration anchors to video events by configured mapping or fuzzy keyword matching +- [ ] Compute per-segment speed adjustment factors (analogous to `tape_sync.py` window distribution) +- [ ] Generate `sync_map.json` mapping narration timestamps → video timestamps +- [ ] Support configurable sync strategies: `stretch` (adjust video speed), `cut` (trim idle), `pad` (freeze on key frames) +- [ ] CLI: `docgen sync-playwright [--segment 03] [--dry-run] [--strategy stretch]` +- [ ] Validation: warn when events and 
narration are mismatched in count or order + +### Issue 4: Video Speed Adjustment via FFmpeg (`playwright_compose.py`) +- [ ] Apply `setpts` filter to speed up/slow down video segments to match narration timing +- [ ] Support piece-wise speed adjustment (different rates for different event windows) +- [ ] Preserve video quality during retiming (re-encode at source quality) +- [ ] Handle audio stripping from source video (Playwright videos may have no audio, or system audio) +- [ ] Integrate with existing `Composer._compose_simple` for final audio muxing +- [ ] Add `type: playwright_test` handler in `compose.py` + +### Issue 5: Config & Visual Map Extensions +- [ ] Add `playwright_test:` configuration block to `Config` dataclass +- [ ] New config properties: `playwright_test_command`, `playwright_test_dir`, `playwright_test_video_dir`, `playwright_test_trace_dir`, `playwright_test_framework` (`pytest` or `playwright`) +- [ ] Extend `visual_map` to support `type: playwright_test` with fields: `test`, `source`, `trace`, `events` (anchor mappings) +- [ ] Add `sync_playwright_after_timestamps` pipeline option (analogous to `sync_vhs_after_timestamps`) +- [ ] Update `docgen.yaml` schema documentation +- [ ] Update `docgen init` scaffolding to offer Playwright test integration option + +### Issue 6: Pipeline Integration +- [ ] Add `playwright_test` stages to `Pipeline.run()`: test execution → trace extraction → sync → compose +- [ ] Add `--skip-playwright-tests` flag to `docgen generate-all` +- [ ] Retry logic: if test fails, optionally retry once before skipping segment +- [ ] Support mixed pipelines: some segments from Manim, some from VHS, some from Playwright tests +- [ ] Order-of-operations: tests run after TTS + timestamps (need timing data for sync), before compose + +### Issue 7: Auto-Discovery of Existing Playwright Tests +- [ ] Scan project for `conftest.py` with `playwright` imports, or `playwright.config.ts` +- [ ] Suggest visual_map entries based on 
discovered test files +- [ ] `docgen wizard` integration: show discovered tests as candidate segments in the setup GUI +- [ ] `docgen init` integration: auto-populate visual_map when Playwright tests are found +- [ ] Handle monorepo layouts where tests live in a different directory than docs + +### Issue 8: Narration Anchor Auto-Detection +- [ ] Analyze Playwright actions (selectors, URLs, typed text) to suggest narration keywords +- [ ] Cross-reference with narration text to auto-map events to spoken words +- [ ] Use Whisper word-level timestamps for precise alignment +- [ ] Fallback: evenly distribute events across narration duration when no anchors match +- [ ] CLI: `docgen suggest-anchors --segment 03` to preview auto-detected mappings + +### Issue 9: Validation Extensions +- [ ] Extend `Validator` to check `playwright_test` segments for A/V drift +- [ ] Validate that event count in trace matches expected narration anchor count +- [ ] OCR validation on Playwright test video (reuse existing `_ocr_scan`) +- [ ] Check for test failures in trace data and warn/fail appropriately +- [ ] Add `playwright_test` segment type to `--pre-push` checks + +### Issue 10: Documentation & Dogfood +- [ ] Add Playwright test integration guide to README +- [ ] Convert one docgen e2e test (`test_setup_view.py` or `test_api_integration.py`) into a demo segment +- [ ] Add example `visual_map` entry using `type: playwright_test` in `docs/demos/docgen.yaml` +- [ ] Update milestone spec link in README +- [ ] Write a narration script that describes the wizard workflow, synced to the e2e test video + +## Event-to-Audio Sync: Detailed Algorithm + +The core synchronization algorithm (Issue 3) works as follows: + +``` +Input: + events[] = [{t: 1.2, action: "fill"}, {t: 3.4, action: "click"}, ...] + timing = {segments: [{start: 0, end: 4.5, text: "..."}], words: [...]} + anchors = [{narration_anchor: "fill in email", action: "fill", selector: "#email"}, ...] + +Algorithm: + 1. 
For each anchor, find the matching event in events[] by action + selector + 2. For each anchor, find the matching word/segment in timing by fuzzy text match + 3. Compute desired_time[i] = timing match timestamp for anchor i + 4. Compute actual_time[i] = event timestamp for anchor i + 5. Build speed segments between consecutive anchor pairs: + speed_factor[i] = (desired_time[i+1] - desired_time[i]) / (actual_time[i+1] - actual_time[i]) + 6. Clamp speed factors to [0.25, 4.0] to avoid extreme distortion + 7. Generate ffmpeg filter: setpts with piece-wise PTS adjustment + 8. Apply filter to produce retimed video + +Output: + sync_map.json with per-anchor timing + retimed video file +``` + +This is conceptually identical to how `tape_sync.py` distributes narration duration across VHS blocks, but operates on continuous video rather than discrete Sleep directives. + +## Integration into Existing Projects + +For projects already using Playwright tests, the integration path is: + +1. `pip install docgen` (or add to dev dependencies) +2. `docgen init` → detects existing Playwright tests, suggests visual_map entries +3. Write narration Markdown for each test-as-segment +4. `docgen generate-all` → runs tests with video+tracing, extracts events, syncs, composes +5. Demo videos are produced using actual app recordings from real tests + +No changes to existing tests are required. The only new artifact is `docgen.yaml` configuration. + +## Dependencies + +- Playwright trace format documentation (stable since Playwright 1.12+) +- `zipfile` stdlib for trace.zip parsing +- `ffmpeg` `setpts` filter for speed adjustment (already a dependency) +- Existing docgen infrastructure: `Config`, `Composer`, `Validator`, `Pipeline` + +## Risks + +- **Trace format stability**: Playwright's internal trace format may change between versions. Mitigation: pin to trace format v3+ and version-detect. +- **Video quality**: Playwright test videos are typically low framerate (may be 5-10 FPS in CI). 
Mitigation: configurable upscale/interpolation via ffmpeg `minterpolate`.
+- **Test flakiness**: Flaky tests produce inconsistent videos. Mitigation: retry logic + deterministic test fixtures.
+- **Timing precision**: Browser rendering introduces variable delays. Mitigation: tolerance windows in anchor matching + clamp speed factors.
diff --git a/milestones/milestone-5-embedded-ai-embabel.md b/milestones/milestone-5-embedded-ai-embabel.md
new file mode 100644
index 0000000..0c280aa
--- /dev/null
+++ b/milestones/milestone-5-embedded-ai-embabel.md
@@ -0,0 +1,226 @@
+# Milestone 5 — Embedded AI via Embabel & Chatbot Interface
+
+**Goal:** Remove the dependency on Cursor (or any specific IDE/tool) for AI-powered code and audio generation by embedding our own AI agent layer. Use the Embabel agent framework to provide a chatbot interface that can generate narration scripts, Python capture code, and orchestrate the full docgen pipeline conversationally.
+
+## Motivation
+
+Today, docgen relies on direct OpenAI API calls hard-coded in three places:
+
+1. **`wizard.py`** — `openai.chat.completions` for narration generation (model: `gpt-4o`)
+2. **`tts.py`** — `openai.audio.speech` for TTS (model: `gpt-4o-mini-tts`)
+3. 
**`timestamps.py`** — `openai.audio.transcriptions` for Whisper timestamps (model: `whisper-1`) + +This creates several problems: +- **No provider flexibility** — locked to OpenAI; no Azure, Anthropic, local model support +- **No agent intelligence** — the wizard is a simple request/response; it cannot plan multi-step workflows, use tools, or adapt to user feedback +- **No chatbot interface** — users must use the CLI or the wizard web GUI; there is no conversational interface for iterating on narration, debugging compose failures, or exploring the pipeline +- **IDE dependency** — teams currently rely on Cursor or similar AI-assisted editors to generate the Python capture scripts, Manim scenes, and configuration that docgen needs + +Embabel provides a JVM-based agent framework with planning (GOAP), tool use (MCP), and chatbot support. By running Embabel as an MCP server and connecting docgen as a Python MCP client, we get: +- A conversational AI that understands the docgen domain +- Tool-based orchestration of pipeline steps +- Provider abstraction (Embabel supports OpenAI, local models via Ollama, etc.) +- A chatbot UI for non-technical users + +## Architecture Overview + +``` +┌──────────────────────────────────────────────────────────────────┐ +│ User Interface │ +│ │ +│ ┌─────────────┐ ┌──────────────┐ ┌────────────────────────┐ │ +│ │ CLI Chat │ │ Wizard GUI │ │ Standalone Chatbot UI │ │ +│ │ docgen chat │ │ /chat tab │ │ (Vaadin/React/HTML) │ │ +│ └──────┬──────┘ └──────┬───────┘ └───────────┬────────────┘ │ +│ │ │ │ │ +│ └────────────────┼───────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ docgen Python MCP Client │ │ +│ │ │ │ +│ │ - Connects to Embabel MCP server via SSE │ │ +│ │ - Discovers available tools (narration, tts, compose...) 
│ │ +│ │ - Manages conversation state │ │ +│ │ - Streams responses to UI │ │ +│ └───────────────────────┬───────────────────────────────────┘ │ +│ │ MCP (SSE / Streamable HTTP) │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ Embabel Agent (JVM / Spring Boot) │ │ +│ │ │ │ +│ │ Agents: │ │ +│ │ NarrationAgent — draft/revise narration from docs │ │ +│ │ TTSAgent — generate speech, manage voices │ │ +│ │ PipelineAgent — orchestrate generate-all steps │ │ +│ │ ScriptAgent — generate Playwright/Manim code │ │ +│ │ DebugAgent — diagnose compose/validation errors │ │ +│ │ │ │ +│ │ Tools (MCP-exposed): │ │ +│ │ @Export generate_narration(segment, guidance, sources) │ │ +│ │ @Export run_tts(segment, voice, model) │ │ +│ │ @Export run_pipeline(steps, options) │ │ +│ │ @Export generate_capture_script(test_file, segment) │ │ +│ │ @Export diagnose_error(error_log, segment) │ │ +│ │ @Export suggest_visual_map(project_dir) │ │ +│ │ │ │ +│ │ Planning: GOAP selects optimal action sequence │ │ +│ │ LLM: Configurable — OpenAI, Anthropic, Ollama, Azure │ │ +│ └───────────────────────────────────────────────────────────┘ │ +│ │ │ +│ ▼ │ +│ ┌───────────────────────────────────────────────────────────┐ │ +│ │ Existing docgen Pipeline │ │ +│ │ tts.py, timestamps.py, compose.py, validate.py, etc. │ │ +│ └───────────────────────────────────────────────────────────┘ │ +└──────────────────────────────────────────────────────────────────┘ +``` + +## Key Design Decisions + +### 1. Embabel as MCP Server, docgen as MCP Client + +Embabel runs as a separate Spring Boot service exposing tools via MCP over SSE. The Python docgen codebase connects as an MCP client using the official `mcp` Python SDK. This keeps the two codebases cleanly separated: + +- **Embabel side (JVM):** Agent definitions, planning, LLM abstraction, domain models +- **docgen side (Python):** Pipeline execution, file I/O, ffmpeg, Playwright, existing CLI + +### 2. 
Provider Abstraction Layer + +Before Embabel integration, we first need to abstract the current direct OpenAI calls: + +```python +# Current (hard-coded): +client = openai.OpenAI() +response = client.chat.completions.create(model="gpt-4o", ...) + +# Target (abstracted): +provider = get_ai_provider(config) # OpenAI, Anthropic, Ollama, or Embabel-via-MCP +response = provider.chat(model=config.llm_model, messages=[...]) +``` + +This abstraction benefits docgen even without Embabel — it enables local model usage, Azure OpenAI, etc. + +### 3. Chatbot Interface Options + +Three tiers of chatbot integration: + +1. **CLI chat** (`docgen chat`) — terminal-based conversational interface using the MCP client +2. **Wizard chat tab** — add a `/chat` endpoint and chat panel to the existing Flask wizard +3. **Standalone chatbot UI** — Embabel's Vaadin-based chatbot template or a custom React/HTML frontend + +### 4. Progressive Enhancement + +The Embabel integration is additive, not a rewrite: +- All existing CLI commands continue to work +- Direct OpenAI calls remain as the default provider +- Embabel is an optional backend activated via config +- The chatbot is a new interface, not a replacement for the wizard + +## Items + +### Issue 11: AI Provider Abstraction Layer (`ai_provider.py`) +- [ ] Create `AIProvider` protocol/interface with methods: `chat()`, `tts()`, `transcribe()` +- [ ] Implement `OpenAIProvider` wrapping current direct calls +- [ ] Implement `EmbabelProvider` connecting via MCP client to Embabel server +- [ ] Implement `OllamaProvider` for local model support (chat only; TTS falls back to OpenAI) +- [ ] Config: `ai.provider` in `docgen.yaml` (`openai`, `embabel`, `ollama`) +- [ ] Config: `ai.embabel_url` for MCP server endpoint (default `http://localhost:8080/sse`) +- [ ] Config: `ai.ollama_url` for local Ollama endpoint (default `http://localhost:11434`) +- [ ] Refactor `wizard.py`, `tts.py`, `timestamps.py` to use `AIProvider` instead of direct `openai` calls 
+- [ ] Maintain backward compatibility: if no provider configured, default to `openai` with existing behavior +- [ ] Unit tests with mock providers + +### Issue 12: Embabel Agent Definitions (JVM Side) +- [ ] Create Embabel Spring Boot project (`docgen-agent/`) in repo or as a companion repo +- [ ] Define domain model classes: `NarrationRequest`, `TTSRequest`, `PipelineRequest`, `ScriptRequest` +- [ ] Implement `NarrationAgent` — accepts source docs + guidance, generates narration via LLM +- [ ] Implement `TTSAgent` — wraps TTS generation with voice/model selection +- [ ] Implement `PipelineAgent` — orchestrates multi-step pipeline via GOAP planning +- [ ] Implement `ScriptAgent` — generates Playwright capture scripts or Manim scene code +- [ ] Implement `DebugAgent` — analyzes compose/validation errors and suggests fixes +- [ ] Export all agent goals as MCP tools with `@Export(remote = true)` +- [ ] Configure LLM mixing: use GPT-4o for narration, local model for simple classification +- [ ] Docker Compose setup for running Embabel alongside docgen +- [ ] Integration tests verifying MCP tool discovery and invocation + +### Issue 13: Python MCP Client Integration (`mcp_client.py`) +- [ ] Add `mcp` Python SDK as optional dependency (`pip install docgen[embabel]`) +- [ ] Implement `EmbabelClient` class wrapping MCP `ClientSession` +- [ ] Connect to Embabel SSE endpoint with auto-reconnect +- [ ] Discover available tools on connection +- [ ] Implement tool invocation wrappers for each agent tool +- [ ] Handle streaming responses for chat interactions +- [ ] Connection health checking and graceful degradation (fall back to direct OpenAI if Embabel unavailable) +- [ ] Config integration: read `ai.embabel_url` from `docgen.yaml` +- [ ] Unit tests with mocked MCP server + +### Issue 14: CLI Chat Interface (`docgen chat`) +- [ ] New CLI command: `docgen chat [--provider embabel|openai|ollama]` +- [ ] Terminal-based conversational loop with prompt +- [ ] Support natural 
language commands: "generate narration for segment 03", "run the pipeline", "what went wrong with compose?" +- [ ] Stream responses token-by-token to terminal +- [ ] Maintain conversation history within session +- [ ] Tool call visualization: show when the agent invokes pipeline tools +- [ ] Handle multi-turn conversations with context +- [ ] Exit with `/quit`, `/exit`, or Ctrl+C +- [ ] `--non-interactive` mode for scripted usage: `echo "generate narration for 03" | docgen chat` + +### Issue 15: Wizard Chatbot Panel +- [ ] Add `/chat` route to Flask wizard +- [ ] Add chat panel UI to `wizard.html` (collapsible sidebar or tab) +- [ ] Server-Sent Events (SSE) endpoint for streaming chat responses +- [ ] Chat can reference current segment context (active tab in wizard) +- [ ] Support commands: "revise this narration to be more conversational", "generate TTS for this segment", "what's wrong with the compose output?" +- [ ] Show agent tool calls in chat (e.g., "Running TTS... done" with progress) +- [ ] Persist chat history per session +- [ ] Degrade gracefully: if no AI provider configured, show helpful setup message + +### Issue 16: Script Generation Agent +- [ ] Embabel agent that generates Playwright capture scripts from test file analysis +- [ ] Input: test file path, segment narration, desired demo flow +- [ ] Output: Python script compatible with `PlaywrightRunner` contract (writes MP4 to `DOCGEN_PLAYWRIGHT_OUTPUT`) +- [ ] Generate Manim scene code from narration + segment description +- [ ] Validate generated code: syntax check, import verification +- [ ] Iterative refinement: "make the animation slower", "add a highlight on the login button" +- [ ] Template library: common patterns (form fill, navigation, dashboard overview) + +### Issue 17: Error Diagnosis Agent +- [ ] Embabel agent that analyzes pipeline errors and suggests fixes +- [ ] Input: error log, segment config, pipeline state +- [ ] Handles: FREEZE GUARD, missing audio, ffmpeg failures, Playwright 
timeouts, VHS errors +- [ ] Suggests concrete fixes: "your Manim scene is 5s shorter than narration, add a 5s wait at the end" +- [ ] Can auto-fix common issues when given permission +- [ ] Integrates with `docgen validate` output for proactive suggestions + +### Issue 18: Local Model Support (Ollama) +- [ ] `OllamaProvider` implementation using Ollama REST API +- [ ] Support chat/completion for narration generation +- [ ] TTS fallback to OpenAI (Ollama doesn't do TTS natively) +- [ ] Whisper fallback to OpenAI or local whisper.cpp +- [ ] Config: `ai.ollama_model` (default: `llama3.2`) +- [ ] Test with common local models: llama3.2, mistral, codellama +- [ ] Document setup instructions for Ollama + +## Integration with Milestone 4 (Playwright Test Video) + +The Embabel agents enhance the Playwright test video integration from Milestone 4: + +- **ScriptAgent** can analyze existing Playwright tests and generate the `visual_map` configuration +- **ScriptAgent** can generate narration anchor mappings by reading test selectors and suggesting matching phrases +- **DebugAgent** can diagnose sync issues between test video events and narration timing +- **PipelineAgent** orchestrates the full test-to-video flow conversationally + +## Dependencies + +- Embabel Agent Framework 0.3.x (JVM, Spring Boot) +- MCP Python SDK (`mcp` on PyPI) +- Java 21+ runtime (for Embabel) +- Docker (optional, for containerized Embabel deployment) + +## Risks + +- **Operational complexity**: Running a JVM service alongside a Python CLI adds deployment burden. Mitigation: Docker Compose, optional activation, graceful fallback. +- **Embabel maturity**: Framework is relatively new (Rod Johnson / Spring lineage, but early versions). Mitigation: thin integration layer, easy to swap for alternative MCP server. +- **Latency**: MCP round-trips add latency vs direct API calls. Mitigation: async operations, streaming, caching. 
+- **Model quality for code generation**: LLM-generated Playwright/Manim code may need iteration. Mitigation: validation loops, template library, human review step.