96 changes: 96 additions & 0 deletions issues/embedded-ai-embabel/11-ai-provider-abstraction.md
@@ -0,0 +1,96 @@
# Issue: AI Provider Abstraction Layer

**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface
**Priority:** Critical (prerequisite for all other Embabel work)
**Depends on:** None

## Summary

Create an `AIProvider` abstraction layer that decouples docgen from direct OpenAI API calls. This enables switching between providers (OpenAI, Embabel via MCP, Ollama for local models) via configuration, and is the prerequisite for all Embabel integration work.

## Background

Today, docgen has three hard-coded OpenAI integration points:

| File | API | Model | Purpose |
|------|-----|-------|---------|
| `wizard.py` | `chat.completions.create` | `gpt-4o` (configurable) | Narration generation |
| `tts.py` | `audio.speech.create` | `gpt-4o-mini-tts` (configurable) | Text-to-speech |
| `timestamps.py` | `audio.transcriptions.create` | `whisper-1` (hard-coded) | Audio timestamps |

Each creates its own `openai.OpenAI()` client. There is no abstraction layer, no provider switching, and no support for non-OpenAI backends.

## Acceptance Criteria

- [ ] Create `AIProvider` protocol with methods:
```python
from pathlib import Path
from typing import Protocol

class AIProvider(Protocol):
def chat(self, model: str, messages: list[dict], **kwargs) -> str: ...
def tts(self, model: str, voice: str, text: str, instructions: str, output_path: Path) -> Path: ...
def transcribe(self, model: str, audio_path: Path, **kwargs) -> dict: ...
```
- [ ] Implement `OpenAIProvider` wrapping all current direct calls
- [ ] Implement `EmbabelProvider` that connects via MCP Python SDK to Embabel server
- [ ] Implement `OllamaProvider` for local model support:
- Chat: Ollama REST API (`/api/chat`)
- TTS: falls back to OpenAI (Ollama doesn't support TTS)
- Transcribe: falls back to OpenAI or local `whisper.cpp`
- [ ] Factory function: `get_ai_provider(config) -> AIProvider`
- [ ] Config in `docgen.yaml`:
```yaml
ai:
provider: openai # "openai", "embabel", "ollama"
embabel_url: http://localhost:8080/sse
ollama_url: http://localhost:11434
ollama_model: llama3.2
```
- [ ] Refactor all three call sites to use `AIProvider`:
- `wizard.py:generate_narration_via_llm` → `provider.chat(...)`
- `tts.py:TTSGenerator.generate` → `provider.tts(...)`
- `timestamps.py:TimestampExtractor.extract` → `provider.transcribe(...)`
- [ ] Backward compatible: no config = default to `openai` with existing behavior
- [ ] Make `whisper-1` model configurable (currently hard-coded in `timestamps.py`)
- [ ] Unit tests with mock providers for each implementation
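The mock providers called for in the last bullet can be as simple as a call-recording stub that structurally satisfies the protocol. A minimal sketch (the `MockProvider` name and canned return values are illustrative, not part of the planned API):

```python
from pathlib import Path
from typing import Protocol, runtime_checkable

@runtime_checkable
class AIProvider(Protocol):
    def chat(self, model: str, messages: list[dict], **kwargs) -> str: ...
    def tts(self, model: str, voice: str, text: str, instructions: str, output_path: Path) -> Path: ...
    def transcribe(self, model: str, audio_path: Path, **kwargs) -> dict: ...

class MockProvider:
    """Test double: records every call and returns canned responses."""

    def __init__(self):
        self.calls = []

    def chat(self, model, messages, **kwargs):
        self.calls.append(("chat", model))
        return "canned narration"

    def tts(self, model, voice, text, instructions, output_path):
        self.calls.append(("tts", model))
        return output_path  # pretend the audio file was written

    def transcribe(self, model, audio_path, **kwargs):
        self.calls.append(("transcribe", model))
        return {"words": []}
```

Because the protocol is `@runtime_checkable`, tests can assert `isinstance(MockProvider(), AIProvider)` and then verify call sequencing through `calls` without touching any network.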

## Technical Notes

### Provider resolution order

1. `DOCGEN_AI_PROVIDER` environment variable → highest precedence, overrides the file setting
2. Explicit `ai.provider` in `docgen.yaml`
3. Default → `openai`
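A sketch of this resolution, with the environment variable taking precedence over the file setting; the helper name and the shape of `config` (a parsed `docgen.yaml` dict) are illustrative:

```python
import os

def resolve_provider_name(config: dict) -> str:
    """Pick the AI provider name: env var, then docgen.yaml, then default."""
    # Environment variable overrides the file setting
    env_value = os.environ.get("DOCGEN_AI_PROVIDER")
    if env_value:
        return env_value
    # Explicit ai.provider block in docgen.yaml
    file_value = (config.get("ai") or {}).get("provider")
    if file_value:
        return file_value
    # Fall back to current behavior
    return "openai"
```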

### EmbabelProvider sketch

```python
class EmbabelProvider:
def __init__(self, url: str):
self.url = url
self._client = None # lazy MCP client

async def _connect(self):
from mcp import ClientSession
# Connect to Embabel SSE endpoint
...

def chat(self, model, messages, **kwargs):
# Invoke Embabel NarrationAgent tool via MCP
return self._call_tool("generate_narration", {...})

def tts(self, model, voice, text, instructions, output_path):
# Invoke Embabel TTSAgent tool via MCP, or fall back to OpenAI
...
```

### Narration lint impact

`NarrationLinter.lint_audio` in `narration_lint.py` also uses `TimestampExtractor` indirectly — it will automatically benefit from the provider abstraction without code changes.

## Files to Create/Modify

- **Create:** `src/docgen/ai_provider.py`
- **Modify:** `src/docgen/wizard.py` (use provider instead of direct openai)
- **Modify:** `src/docgen/tts.py` (use provider instead of direct openai)
- **Modify:** `src/docgen/timestamps.py` (use provider instead of direct openai)
- **Modify:** `src/docgen/config.py` (add `ai` config block)
- **Create:** `tests/test_ai_provider.py`
109 changes: 109 additions & 0 deletions issues/embedded-ai-embabel/12-embabel-agent-definitions.md
@@ -0,0 +1,109 @@
# Issue: Embabel Agent Definitions (JVM Side)

**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface
**Priority:** High
**Depends on:** Issue 11 (provider abstraction)

## Summary

Create the Embabel Spring Boot application that hosts the AI agents for docgen. These agents are exposed as MCP tools that the Python docgen client can invoke for narration generation, TTS orchestration, pipeline management, script generation, and error diagnosis.

## Background

Embabel is a JVM-based agent framework that uses Goal-Oriented Action Planning (GOAP) to dynamically plan action sequences. By defining docgen-specific agents, we get:

- **Planning**: the agent figures out the optimal sequence of steps (e.g., "to produce a demo video, I need to generate narration, then TTS, then compose...")
- **Tool use**: agents can invoke docgen CLI commands as tools
- **LLM mixing**: use GPT-4o for narration quality, cheaper models for classification
- **MCP exposure**: all agents automatically available to Python via MCP protocol

## Acceptance Criteria

- [ ] Create Embabel Spring Boot project:
- Option A: `docgen-agent/` subdirectory in this repo
- Option B: Companion repo `docgen-agent` (linked from README)
- [ ] Domain model classes (Kotlin data classes):
```kotlin
data class NarrationRequest(val segment: String, val guidance: String, val sources: List<String>)
data class NarrationResponse(val text: String, val wordCount: Int)
data class TTSRequest(val segment: String, val voice: String, val model: String)
data class PipelineRequest(val steps: List<String>, val options: Map<String, Any>)
data class ScriptRequest(val testFile: String, val segment: String, val description: String)
data class DiagnosisRequest(val errorLog: String, val segment: String, val context: Map<String, Any>)
```
- [ ] Agent implementations:
- `NarrationAgent` — generates/revises narration from source docs + guidance
- `TTSAgent` — wraps TTS generation with voice/model selection and preview
- `PipelineAgent` — orchestrates multi-step pipeline via GOAP planning
- `ScriptAgent` — generates Playwright capture scripts or Manim scene code
- `DebugAgent` — analyzes compose/validation errors and suggests fixes
- [ ] All agent goals exported as MCP tools: `@Export(remote = true)`
- [ ] LLM configuration:
- GPT-4o for narration generation (quality-critical)
- Local model via Ollama for simple classification/routing
- Configurable in `application.yml`
- [ ] Docker Compose setup for running Embabel alongside docgen:
```yaml
services:
docgen-agent:
build: ./docgen-agent
ports: ["8080:8080"]
environment:
- OPENAI_API_KEY=${OPENAI_API_KEY}
- SPRING_AI_OPENAI_API_KEY=${OPENAI_API_KEY}
```
- [ ] Integration tests verifying MCP tool discovery and invocation
- [ ] Health endpoint for connection checking

## Technical Notes

### Agent architecture

Each agent is an `@Agent`-annotated class whose `@Action` methods are the plannable steps:

```kotlin
@Agent("Narration generation agent for docgen")
class NarrationAgent {

@Action("Generate narration from source documents")
@Export(remote = true)
fun generateNarration(request: NarrationRequest): NarrationResponse {
// LLM call with docgen-specific system prompt
}

@Action("Revise existing narration based on feedback")
@Export(remote = true)
fun reviseNarration(segment: String, currentText: String, feedback: String): NarrationResponse {
// LLM call with revision context
}
}
```

### MCP server configuration

```yaml
# application.yml
spring:
ai:
mcp:
server:
type: SYNC
openai:
api-key: ${OPENAI_API_KEY}
```
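Since the LLM-configuration bullet calls for a local Ollama model alongside OpenAI, the same `application.yml` could grow an Ollama block. The property names below assume Spring AI's Ollama starter and should be verified against the Embabel/Spring AI versions in use:

```yaml
# application.yml (extended; Ollama property names are an assumption)
spring:
  ai:
    mcp:
      server:
        type: SYNC
    openai:
      api-key: ${OPENAI_API_KEY}
    ollama:
      base-url: http://localhost:11434
      chat:
        options:
          model: llama3.2
```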

## Files to Create

- **Create:** `docgen-agent/` (Spring Boot project)
- `pom.xml` or `build.gradle.kts`
- `src/main/kotlin/com/docgen/agent/`
- `DocgenAgentApplication.kt`
- `agents/NarrationAgent.kt`
- `agents/TTSAgent.kt`
- `agents/PipelineAgent.kt`
- `agents/ScriptAgent.kt`
- `agents/DebugAgent.kt`
- `model/` (domain classes)
- `src/main/resources/application.yml`
- `Dockerfile`
- **Create:** `docker-compose.yml` (root level, optional)
110 changes: 110 additions & 0 deletions issues/embedded-ai-embabel/13-python-mcp-client.md
@@ -0,0 +1,110 @@
# Issue: Python MCP Client Integration

**Milestone:** 5 — Embedded AI via Embabel & Chatbot Interface
**Priority:** High
**Depends on:** Issue 11 (provider abstraction), Issue 12 (Embabel agents)

## Summary

Implement the Python-side MCP client that connects to the Embabel agent server, discovers available tools, and provides the `EmbabelProvider` implementation for the AI provider abstraction layer.

## Background

The official MCP Python SDK (`mcp` on PyPI) provides `ClientSession` for connecting to MCP servers. Embabel exposes its agents as MCP tools over SSE (Server-Sent Events) at `http://localhost:8080/sse`. This issue bridges the two by implementing a robust client that handles connection management, tool discovery, invocation, and streaming.

## Acceptance Criteria

- [ ] Add `mcp` Python SDK as optional dependency:
```toml
[project.optional-dependencies]
embabel = ["mcp>=1.0"]
```
Install with: `pip install "docgen[embabel]"` (quotes keep shells like zsh from globbing the brackets)
- [ ] Implement `EmbabelClient` class:
```python
class EmbabelClient:
def __init__(self, url: str = "http://localhost:8080/sse"):
...

async def connect(self) -> None:
"""Connect to Embabel SSE endpoint."""

async def discover_tools(self) -> list[Tool]:
"""List available MCP tools from Embabel."""

async def invoke(self, tool_name: str, args: dict) -> Any:
"""Invoke an MCP tool and return the result."""

async def stream(self, tool_name: str, args: dict) -> AsyncIterator[str]:
"""Invoke a tool with streaming response."""

async def close(self) -> None:
"""Disconnect from Embabel."""
```
- [ ] Auto-reconnect on connection loss (exponential backoff, max 3 retries)
- [ ] Graceful degradation: if Embabel is unavailable, fall back to direct OpenAI provider
- [ ] Tool invocation wrappers for each agent tool:
```python
async def generate_narration(self, segment: str, guidance: str, sources: list[str]) -> str:
return await self.invoke("generate_narration", {...})
```
- [ ] Handle streaming responses for chat interactions (SSE event stream)
- [ ] Connection health checking (`is_connected`, `ping`)
- [ ] Config integration: read `ai.embabel_url` from `docgen.yaml`
- [ ] Synchronous wrapper for CLI usage (the MCP SDK is async, but docgen CLI is sync)
- [ ] Unit tests with mocked MCP server
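The auto-reconnect bullet (exponential backoff, max 3 retries) can be isolated into a small helper so it is testable without a live server. A sketch under the assumption that `connect` is any awaitable factory; names are illustrative:

```python
import asyncio

async def connect_with_retry(connect, max_retries: int = 3, base_delay: float = 1.0):
    """Await connect(); on ConnectionError, retry with exponential backoff.

    Delays are base_delay * 2**attempt; the last failure is re-raised.
    """
    for attempt in range(max_retries + 1):
        try:
            return await connect()
        except ConnectionError:
            if attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))
```

`EmbabelClient.connect` could delegate to this helper, and the graceful-degradation bullet then reduces to catching the final `ConnectionError` in the provider factory.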

## Technical Notes

### MCP Python SDK usage

```python
from mcp import ClientSession
from mcp.client.sse import sse_client

async with sse_client(url="http://localhost:8080/sse") as (read, write):
async with ClientSession(read, write) as session:
await session.initialize()
tools = await session.list_tools()
result = await session.call_tool("generate_narration", arguments={...})
```

### Sync wrapper pattern

Since docgen CLI uses Click (synchronous), we need a sync wrapper:

```python
import asyncio
from typing import Any

class EmbabelClientSync:
    def __init__(self, url: str):
        self._async_client = EmbabelClient(url)
        # A dedicated, long-lived loop keeps the MCP session alive across
        # calls; asyncio.run() would create and tear down a loop per call.
        self._loop = asyncio.new_event_loop()

    def invoke(self, tool_name: str, args: dict) -> Any:
        return self._loop.run_until_complete(
            self._async_client.invoke(tool_name, args)
        )

    def close(self) -> None:
        self._loop.run_until_complete(self._async_client.close())
        self._loop.close()
```

### Fallback behavior

```python
def get_ai_provider(config):
if config.ai_provider == "embabel":
try:
client = EmbabelClientSync(config.embabel_url)
client.connect()
return EmbabelProvider(client)
except ConnectionError:
print("[ai] Embabel unavailable, falling back to OpenAI")
return OpenAIProvider()
...
```

## Files to Create/Modify

- **Create:** `src/docgen/mcp_client.py`
- **Modify:** `src/docgen/ai_provider.py` (implement EmbabelProvider using mcp_client)
- **Modify:** `pyproject.toml` (add `embabel` optional dependency)
- **Create:** `tests/test_mcp_client.py`