Testing & Selector Sync

Real ChatGPT Validation

Run a prompt against the real interface:

node dist/cli.js run "Hello"

Note: Chrome always uses the dedicated Oracle profile at ~/.oracle/chrome. If you're not logged in yet, use oracle open.

If the run reports needs_user: login, open a visible browser to log in, then resume the run:

node dist/cli.js open

node dist/cli.js resume <run_id>

Mock Sync Process

The mock server (scripts/mock-server.js) mirrors core UI hooks used by automation:

Prompt input: #prompt-textarea
Send button: button[data-testid="send-button"]
Message wrappers: [data-message-author-role="user|assistant"]
Stop button: a button whose text or aria attribute contains "Stop"
Action buttons: [data-testid="good-response-turn-action-button"], [data-testid="bad-response-turn-action-button"]
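
As an illustration, the hooks listed above could be collected in a single typed map so that the automation and the mock reference the same strings. This is a minimal sketch, not the contents of src/browser/chatgpt.ts; the names SELECTORS and messageSelector are hypothetical. It also spells out that "user|assistant" stands for one of two role values rather than a literal attribute value.

```typescript
// Hypothetical names for illustration; the real module may be organized differently.
type AuthorRole = "user" | "assistant";

// Selector strings copied verbatim from the list above.
const SELECTORS = {
  promptInput: "#prompt-textarea",
  sendButton: 'button[data-testid="send-button"]',
  goodResponse: '[data-testid="good-response-turn-action-button"]',
  badResponse: '[data-testid="bad-response-turn-action-button"]',
} as const;

// Build the message-wrapper selector for a single author role.
function messageSelector(role: AuthorRole): string {
  return `[data-message-author-role="${role}"]`;
}
```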

When ChatGPT changes:

Update selectors in src/browser/chatgpt.ts.
Update mock HTML so selectors stay aligned.
Re-run mock tests and a real ChatGPT validation.
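
One way to catch drift between the selectors and the mock HTML before running the full suite is a quick textual check. The sketch below is an assumption, not part of the repository: hooksIn and missingHooks are hypothetical helpers, and the check is a plain substring match on ids and attribute values, not a CSS selector engine.

```typescript
// Extract the literal hooks (an id and/or one attribute pair) from a simple selector.
function hooksIn(selector: string): string[] {
  const hooks: string[] = [];
  const id = /#([\w-]+)/.exec(selector);
  if (id) hooks.push(`id="${id[1]}"`);
  const attr = /\[([\w-]+)="([^"]+)"\]/.exec(selector);
  if (attr) hooks.push(`${attr[1]}="${attr[2]}"`);
  return hooks;
}

// Return the selectors whose hooks do not appear anywhere in the given HTML text.
function missingHooks(selectors: string[], html: string): string[] {
  return selectors.filter((selector) =>
    hooksIn(selector).some((hook) => html.indexOf(hook) === -1)
  );
}
```

A non-empty result suggests the mock HTML needs updating for the listed selectors; an empty result does not replace the mock tests or a real ChatGPT validation.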

Mock Server Scenarios

The mock server supports query parameters to simulate different ChatGPT behaviors:

Parameter	Description
`scenario=stall`	Stop button stays visible forever → `ResponseStalledError`
`scenario=fail`	Stop disappears without copy button → `ResponseFailedError`
`scenario=no_generation`	Nothing happens after submit → `ResponseTimeoutError`
`scenario=error_text`	ChatGPT error message in response text (with copy button)
`scenario=slow_start`	5s delay before prompt input becomes available
`scenario=no_pro`	Pro model unavailable → `ModelNotAvailableError`
`durationMs=N`	Control total streaming duration (milliseconds)
`delayMs=N`	Control per-character streaming delay (default: 50ms)
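
For test code that asserts on outcomes, the scenario rows above can be transcribed into a lookup. This is a sketch under the assumption that a test wants the error class name as a string; EXPECTED_ERROR and expectedErrorFor are hypothetical names, and null marks the scenarios for which the table names no error class.

```typescript
// Transcribed from the table above; null means the table lists no error class.
const EXPECTED_ERROR: Record<string, string | null> = {
  stall: "ResponseStalledError",
  fail: "ResponseFailedError",
  no_generation: "ResponseTimeoutError",
  error_text: null,
  slow_start: null,
  no_pro: "ModelNotAvailableError",
};

// Look up the expected error name, rejecting scenario names the table does not list.
function expectedErrorFor(scenario: string): string | null {
  if (!Object.prototype.hasOwnProperty.call(EXPECTED_ERROR, scenario)) {
    throw new Error(`unknown scenario: ${scenario}`);
  }
  const value = EXPECTED_ERROR[scenario];
  return value === undefined ? null : value;
}
```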

Examples:

# Stall — streaming never completes, stop button stays visible
ORACLE_DEV=1 node dist/cli.js run "stall" --base-url "http://127.0.0.1:7777/?scenario=stall" --timeout-ms 120000

# No generation — nothing happens after submit (use short timeout)
ORACLE_DEV=1 node dist/cli.js run "timeout" --base-url "http://127.0.0.1:7777/?scenario=no_generation" --timeout-ms 15000

# Fail — response generation fails mid-stream
ORACLE_DEV=1 node dist/cli.js run "fail" --base-url "http://127.0.0.1:7777/?scenario=fail"

# Long streaming run (2h+)
ORACLE_DEV=1 node dist/cli.js run "long" --base-url "http://127.0.0.1:7777/?durationMs=7200000" --timeout-ms 7800000
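
The --base-url values in these examples are the mock server address plus query parameters, and the long-run example pairs a 7,200,000 ms stream (2 hours) with a 7,800,000 ms timeout, which leaves a 10 minute margin. A minimal sketch of both calculations follows; mockUrl and timeoutFor are hypothetical helpers, not part of the CLI.

```typescript
// Append mock-server query parameters to a base URL that has no query string yet.
function mockUrl(base: string, params: Record<string, string | number>): string {
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(String(params[key]))}`)
    .join("&");
  return query ? `${base}?${query}` : base;
}

// Timeout for a streaming run: the stream duration plus a safety margin.
function timeoutFor(durationMs: number, marginMs: number): number {
  return durationMs + marginMs;
}
```

For example, mockUrl("http://127.0.0.1:7777/", { scenario: "stall" }) yields the URL used in the stall example, and timeoutFor(7200000, 600000) yields the 7800000 used for the long run.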

Structured Output Extraction

Automated extraction tests validate JSON/XML/exact-string outputs using headless Chromium.

First, install the Playwright browser binary (one-time):

npx playwright install chromium

Then run the tests:

npm test

Scenario-Based Agent Evals

The primary eval system runs 38 scenarios (CLI + subagent) through Claude Opus against the mock server:

# Run all scenarios
node scripts/evals/run-scenario-evals.js

# Run specific scenario(s)
node scripts/evals/run-scenario-evals.js --scenario happy-path
node scripts/evals/run-scenario-evals.js --scenarios happy-path,cancel-flow,prune-stale

# Run only CLI or subagent scenarios
node scripts/evals/run-scenario-evals.js --cli-only
node scripts/evals/run-scenario-evals.js --subagent-only

# Resume a partial run (retries failed scenarios, skips passed ones)
node scripts/evals/run-scenario-evals.js --resume scripts/evals/results/scenarios-2026-03-05T18-40-25-372Z.json

# Use a different model
CLAUDE_MODEL=sonnet node scripts/evals/run-scenario-evals.js

Results are written incrementally to scripts/evals/results/scenarios-<timestamp>.json.
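
The example path passed to --resume is consistent with an ISO 8601 UTC timestamp in which ":" and "." are replaced by "-". The sketch below reproduces that naming and the resume rule described above (retry failed, skip passed). It rests on assumptions: the ScenarioResult shape is hypothetical because the JSON schema is not documented here, and treating scenarios with no recorded result as still pending is a guess about what a partial run implies.

```typescript
// Assumed derivation of the results file name from a timestamp.
function resultsFileName(now: Date): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return `scenarios-${stamp}.json`;
}

// Hypothetical result shape; the real results file may differ.
interface ScenarioResult {
  name: string;
  passed: boolean;
}

// On resume, run everything that has not passed: failed scenarios and
// scenarios with no recorded result.
function scenariosToRun(all: string[], previous: ScenarioResult[]): string[] {
  const passed = previous.filter((r) => r.passed).map((r) => r.name);
  return all.filter((name) => passed.indexOf(name) === -1);
}
```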

The scenarios cover: happy path, error handling, file attachments, concurrent runs, run discoverability, health monitoring, prune --stale, prefix matching, and subagent prompt linting (dangling refs, terminology, quality).

Legacy Agent SDK Evals (Codex + Claude)

Runs oracle against the mock ChatGPT server (nothing is sent to chatgpt.com), driven by real agents:

npm run eval:agents

Individual runs:

npm run eval:codex
npm run eval:claude

Notes:

Uses the mock server on 127.0.0.1:7777.
Codex eval uses @openai/codex-sdk (bundled codex binary). Ensure Codex credentials are configured.
Claude eval shells out to claude CLI. Ensure Claude Code is installed and logged in.
Results are written to scripts/evals/results/.
Eval harness strips any *_API_KEY variables from the agent process environment.
Optional: set ORACLE_EVAL_SKIP_CLAUDE=1 or ORACLE_EVAL_SKIP_CODEX=1 to skip that agent and run only the other.
Optional: set ORACLE_EVAL_AGENT_TIMEOUT_MS to cap agent runtime.
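
The environment scrubbing mentioned in the notes can be pictured as a filter over the environment map. This is a sketch of the idea only, not the harness's code; stripApiKeys is a hypothetical name, and the pattern *_API_KEY is read here as "any variable name ending in _API_KEY".

```typescript
// Copy an environment map, dropping every variable whose name ends in "_API_KEY".
function stripApiKeys(
  env: Record<string, string | undefined>
): Record<string, string | undefined> {
  const clean: Record<string, string | undefined> = {};
  for (const name of Object.keys(env)) {
    if (!/_API_KEY$/.test(name)) clean[name] = env[name];
  }
  return clean;
}
```

In Node.js, the filtered map could then be passed as the env option when spawning the agent process, so the agent relies on its own login rather than on keys inherited from the shell.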

Debug Capture

Set ORACLE_CAPTURE_HTML=1 to save completion.html/completion.png for real ChatGPT runs:

ORACLE_CAPTURE_HTML=1 node dist/cli.js run "Hello"
