- Run a prompt against the real interface:
node dist/cli.js run "Hello"
Note: Chrome always uses the dedicated Oracle profile at ~/.oracle/chrome. If you're not logged in yet, use oracle open.
- If `needs_user: login`, open a visible browser to log in:
node dist/cli.js open
Log in, then resume the run:
node dist/cli.js resume <run_id>
The mock server (scripts/mock-server.js) mirrors core UI hooks used by automation:
- Prompt input: `#prompt-textarea`
- Send button: `button[data-testid="send-button"]`
- Message wrappers: `[data-message-author-role="user|assistant"]`
- Stop button: button text/aria containing "Stop"
- Action buttons: `[data-testid="good-response-turn-action-button"]`, `[data-testid="bad-response-turn-action-button"]`
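One way to keep these hooks aligned between the automation and the mock is a single selector map. The sketch below is illustrative only: `SELECTORS` and `messageSelector` are hypothetical names, not the actual contents of `src/browser/chatgpt.ts`. The selector strings themselves are the ones listed above.

```typescript
// Hypothetical selector map. The names are illustrative; the selector
// strings are the UI hooks listed above.
const SELECTORS = {
  promptInput: "#prompt-textarea",
  sendButton: 'button[data-testid="send-button"]',
  goodResponse: '[data-testid="good-response-turn-action-button"]',
  badResponse: '[data-testid="bad-response-turn-action-button"]',
} as const;

type Role = "user" | "assistant";

// Build the message-wrapper selector for a single author role.
function messageSelector(role: Role): string {
  return `[data-message-author-role="${role}"]`;
}
```

Keeping the strings in one place means a ChatGPT markup change touches one file on the automation side and one on the mock side.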
When ChatGPT changes:
- Update selectors in `src/browser/chatgpt.ts`.
- Update mock HTML so selectors stay aligned.
- Re-run mock tests and a real ChatGPT validation.
The mock server supports query parameters to simulate different ChatGPT behaviors:
| Parameter | Description |
|---|---|
| `scenario=stall` | Stop button stays visible forever → ResponseStalledError |
| `scenario=fail` | Stop disappears without copy button → ResponseFailedError |
| `scenario=no_generation` | Nothing happens after submit → ResponseTimeoutError |
| `scenario=error_text` | ChatGPT error message in response text (with copy button) |
| `scenario=slow_start` | 5s delay before prompt input becomes available |
| `scenario=no_pro` | Pro model unavailable → ModelNotAvailableError |
| `durationMs=N` | Control total streaming duration (milliseconds) |
| `delayMs=N` | Control per-character streaming delay (default: 50ms) |
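For test code that needs these parameters programmatically, the table can be captured as a small typed helper. This is a minimal sketch: `Scenario`, `EXPECTED_ERROR`, and `buildMockUrl` are hypothetical names, not part of the oracle codebase, and it assumes the mock server accepts more than one parameter in the same query string.

```typescript
// Scenario names are taken from the table above. The map and helper are
// hypothetical conveniences for tests.
type Scenario =
  | "stall"
  | "fail"
  | "no_generation"
  | "error_text"
  | "slow_start"
  | "no_pro";

// Error each scenario is documented to produce (null where the table names none).
const EXPECTED_ERROR: Record<Scenario, string | null> = {
  stall: "ResponseStalledError",
  fail: "ResponseFailedError",
  no_generation: "ResponseTimeoutError",
  error_text: null,
  slow_start: null,
  no_pro: "ModelNotAvailableError",
};

interface MockParams {
  scenario?: Scenario;
  durationMs?: number;
  delayMs?: number;
}

// Build a value suitable for --base-url.
// Assumption: parameters can be combined with "&".
function buildMockUrl(params: MockParams, base = "http://127.0.0.1:7777/"): string {
  const pairs: string[] = [];
  if (params.scenario !== undefined) pairs.push(`scenario=${params.scenario}`);
  if (params.durationMs !== undefined) pairs.push(`durationMs=${params.durationMs}`);
  if (params.delayMs !== undefined) pairs.push(`delayMs=${params.delayMs}`);
  return pairs.length > 0 ? `${base}?${pairs.join("&")}` : base;
}
```

For example, `buildMockUrl({ scenario: "stall" })` yields `http://127.0.0.1:7777/?scenario=stall`, the URL used in the stall example.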
Examples:
# Stall — streaming never completes, stop button stays visible
ORACLE_DEV=1 node dist/cli.js run "stall" --base-url "http://127.0.0.1:7777/?scenario=stall" --timeout-ms 120000
# No generation — nothing happens after submit (use short timeout)
ORACLE_DEV=1 node dist/cli.js run "timeout" --base-url "http://127.0.0.1:7777/?scenario=no_generation" --timeout-ms 15000
# Fail — response generation fails mid-stream
ORACLE_DEV=1 node dist/cli.js run "fail" --base-url "http://127.0.0.1:7777/?scenario=fail"
# Long streaming run (2h+)
ORACLE_DEV=1 node dist/cli.js run "long" --base-url "http://127.0.0.1:7777/?durationMs=7200000" --timeout-ms 7800000
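The long-run example pairs a 2-hour stream (7,200,000 ms) with a timeout of 7,800,000 ms, which leaves 10 minutes of headroom. The arithmetic can be sketched as follows; `longRunTimings` is a hypothetical helper, and the 10-minute default margin is read off the example rather than taken from any documented rule.

```typescript
const MS_PER_MINUTE = 60_000;

// Compute the stream duration and a timeout that leaves some headroom
// after the stream is expected to finish.
function longRunTimings(
  streamMinutes: number,
  marginMinutes = 10,
): { durationMs: number; timeoutMs: number } {
  const durationMs = streamMinutes * MS_PER_MINUTE;
  const timeoutMs = durationMs + marginMinutes * MS_PER_MINUTE;
  return { durationMs, timeoutMs };
}
```

`longRunTimings(120)` gives `durationMs` 7200000 and `timeoutMs` 7800000, matching the values passed to `durationMs=` and `--timeout-ms` above.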
Automated extraction tests validate JSON/XML/exact-string outputs using headless Chromium.
First, install the Playwright browser binary (one-time):
npx playwright install chromium
Then run the tests:
npm test
The primary eval system runs 38 scenarios (CLI + subagent) through Claude Opus against the mock server:
# Run all scenarios
node scripts/evals/run-scenario-evals.js
# Run specific scenario(s)
node scripts/evals/run-scenario-evals.js --scenario happy-path
node scripts/evals/run-scenario-evals.js --scenarios happy-path,cancel-flow,prune-stale
# Run only CLI or subagent scenarios
node scripts/evals/run-scenario-evals.js --cli-only
node scripts/evals/run-scenario-evals.js --subagent-only
# Resume a partial run (retries failed scenarios, skips passed ones)
node scripts/evals/run-scenario-evals.js --resume scripts/evals/results/scenarios-2026-03-05T18-40-25-372Z.json
# Use a different model
CLAUDE_MODEL=sonnet node scripts/evals/run-scenario-evals.js
Results are written incrementally to scripts/evals/results/scenarios-<timestamp>.json.
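The `--resume` behavior (retry failed scenarios, skip passed ones) amounts to a filter over the previous results. The sketch below assumes a simplified result shape; `ScenarioResult` and `scenariosToRun` are hypothetical, and the real schema of `scenarios-<timestamp>.json` may differ.

```typescript
// Hypothetical shape of one entry in a results file.
interface ScenarioResult {
  name: string;
  passed: boolean;
}

// Given every scenario name and the results of a partial run, return the
// scenarios still to run: those that failed plus those that never ran.
function scenariosToRun(all: string[], previous: ScenarioResult[]): string[] {
  const passed = new Set(previous.filter((r) => r.passed).map((r) => r.name));
  return all.filter((name) => !passed.has(name));
}
```

Because results are written incrementally, a run interrupted partway still leaves a file that this kind of filter can pick up from.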
The scenarios cover: happy path, error handling, file attachments, concurrent runs, run discoverability, health monitoring, prune --stale, prefix matching, and subagent prompt linting (dangling refs, terminology, quality).
The agent evals run oracle against the mock ChatGPT server (not chatgpt.com), driven by real agents:
npm run eval:agents
Individual runs:
npm run eval:codex
npm run eval:claude
Notes:
- Uses the mock server on `127.0.0.1:7777`.
- Codex eval uses `@openai/codex-sdk` (bundled codex binary). Ensure Codex credentials are configured.
- Claude eval shells out to the `claude` CLI. Ensure Claude Code is installed and logged in.
- Results are written to `scripts/evals/results/`.
- Eval harness strips any `*_API_KEY` variables from the agent process environment.
- Optional: set `ORACLE_EVAL_SKIP_CLAUDE=1` or `ORACLE_EVAL_SKIP_CODEX=1` to run one agent.
- Optional: set `ORACLE_EVAL_AGENT_TIMEOUT_MS` to cap agent runtime.
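Stripping `*_API_KEY` variables can be pictured as a filter over the environment before the agent process is spawned. This is a sketch of the idea, not the harness's actual code: `stripApiKeys` is a hypothetical name, and it reads the `*_API_KEY` pattern as a simple suffix match.

```typescript
// Return a copy of env without any variable whose name ends in _API_KEY.
// Assumption: the harness's pattern is a plain suffix match.
function stripApiKeys(
  env: Record<string, string | undefined>,
): Record<string, string | undefined> {
  const clean: Record<string, string | undefined> = {};
  for (const [name, value] of Object.entries(env)) {
    if (!name.endsWith("_API_KEY")) clean[name] = value;
  }
  return clean;
}
```

In a Node harness, the result of `stripApiKeys(process.env)` could be passed as the `env` option when spawning the agent, so the agent relies on its logged-in session rather than a key from the parent shell.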
Set ORACLE_CAPTURE_HTML=1 to save completion.html/completion.png for real ChatGPT runs:
ORACLE_CAPTURE_HTML=1 node dist/cli.js run "Hello"