Testing & Selector Sync

Real ChatGPT Validation

  1. Run a prompt against the real interface:
node dist/cli.js run "Hello"

Note: Chrome always uses the dedicated Oracle profile at ~/.oracle/chrome. If you're not logged in yet, use oracle open.

  2. If the run reports needs_user: login, open a visible browser to log in:
node dist/cli.js open

After logging in, resume the run:

node dist/cli.js resume <run_id>

Mock Sync Process

The mock server (scripts/mock-server.js) mirrors core UI hooks used by automation:

  • Prompt input: #prompt-textarea
  • Send button: button[data-testid="send-button"]
  • Message wrappers: [data-message-author-role="user"] and [data-message-author-role="assistant"]
  • Stop button: a button whose text or ARIA label contains "Stop"
  • Action buttons: [data-testid="good-response-turn-action-button"], [data-testid="bad-response-turn-action-button"]

When ChatGPT changes:

  1. Update selectors in src/browser/chatgpt.ts.
  2. Update mock HTML so selectors stay aligned.
  3. Re-run mock tests and a real ChatGPT validation.
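One way to catch drift between the selectors and the mock HTML is a plain string check over the mock page source. The sketch below is only an illustration: `REQUIRED_HOOKS` and `findMissingHooks` are hypothetical names, not part of src/browser/chatgpt.ts or scripts/mock-server.js, and it assumes the mock page writes attributes with double quotes. Hooks that the mock adds at runtime from script would need a real DOM check instead.

```typescript
// Hypothetical list of UI hooks, taken from the selector list above.
// Each entry pairs an illustrative name with the attribute text the
// mock HTML is assumed to contain (double-quoted attributes).
const REQUIRED_HOOKS: Array<[string, string]> = [
  ["promptInput", 'id="prompt-textarea"'],
  ["sendButton", 'data-testid="send-button"'],
  ["userMessage", 'data-message-author-role="user"'],
  ["assistantMessage", 'data-message-author-role="assistant"'],
  ["goodResponse", 'data-testid="good-response-turn-action-button"'],
  ["badResponse", 'data-testid="bad-response-turn-action-button"'],
];

// Returns the names of hooks whose attribute text does not appear in the HTML.
function findMissingHooks(html: string): string[] {
  return REQUIRED_HOOKS
    .filter(([, fragment]) => html.indexOf(fragment) === -1)
    .map(([name]) => name);
}

// Example: a page that only has the prompt input and the send button.
const sampleHtml =
  '<textarea id="prompt-textarea"></textarea>' +
  '<button data-testid="send-button"></button>';
const missingFromSample = findMissingHooks(sampleHtml);
// missingFromSample lists userMessage, assistantMessage, goodResponse, badResponse
```

A check like this could run before the mock tests, so that a selector update without a matching mock update fails early.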

Mock Server Scenarios

The mock server supports query parameters to simulate different ChatGPT behaviors:

| Parameter | Description |
| --- | --- |
| scenario=stall | Stop button stays visible forever → ResponseStalledError |
| scenario=fail | Stop disappears without copy button → ResponseFailedError |
| scenario=no_generation | Nothing happens after submit → ResponseTimeoutError |
| scenario=error_text | ChatGPT error message in response text (with copy button) |
| scenario=slow_start | 5s delay before prompt input becomes available |
| scenario=no_pro | Pro model unavailable → ModelNotAvailableError |
| durationMs=N | Control total streaming duration (milliseconds) |
| delayMs=N | Control per-character streaming delay (default: 50ms) |
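When several test scripts need these URLs, a small builder avoids hand-typed query strings. The following is a minimal sketch: `MockOptions` and `buildMockUrl` are hypothetical helpers, the default base URL is the mock address used elsewhere in this document, and it assumes the mock server accepts the parameters in combination, which is not confirmed here.

```typescript
// Scenario names as listed in the parameter table.
type Scenario =
  | "stall"
  | "fail"
  | "no_generation"
  | "error_text"
  | "slow_start"
  | "no_pro";

// Hypothetical options object; every field maps to one query parameter.
type MockOptions = {
  scenario?: Scenario;
  durationMs?: number;
  delayMs?: number;
};

// Builds a --base-url value for the mock server from the given options.
function buildMockUrl(
  options: MockOptions,
  base: string = "http://127.0.0.1:7777/"
): string {
  const pairs: string[] = [];
  if (options.scenario !== undefined) {
    pairs.push("scenario=" + encodeURIComponent(options.scenario));
  }
  if (options.durationMs !== undefined) {
    pairs.push("durationMs=" + String(options.durationMs));
  }
  if (options.delayMs !== undefined) {
    pairs.push("delayMs=" + String(options.delayMs));
  }
  return pairs.length > 0 ? base + "?" + pairs.join("&") : base;
}

// "http://127.0.0.1:7777/?scenario=stall"
const stallUrl = buildMockUrl({ scenario: "stall" });
```

The `Scenario` union makes a misspelled scenario name a compile-time error rather than a silently ignored parameter.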

Examples:

# Stall — streaming never completes, stop button stays visible
ORACLE_DEV=1 node dist/cli.js run "stall" --base-url "http://127.0.0.1:7777/?scenario=stall" --timeout-ms 120000

# No generation — nothing happens after submit (use short timeout)
ORACLE_DEV=1 node dist/cli.js run "timeout" --base-url "http://127.0.0.1:7777/?scenario=no_generation" --timeout-ms 15000

# Fail — response generation fails mid-stream
ORACLE_DEV=1 node dist/cli.js run "fail" --base-url "http://127.0.0.1:7777/?scenario=fail"

# Long streaming run (2h+)
ORACLE_DEV=1 node dist/cli.js run "long" --base-url "http://127.0.0.1:7777/?durationMs=7200000" --timeout-ms 7800000

Structured Output Extraction

Automated extraction tests validate JSON/XML/exact-string outputs using headless Chromium.

First, install the Playwright browser binary (one-time):

npx playwright install chromium

Then run the tests:

npm test

Scenario-Based Agent Evals

The primary eval system runs 38 scenarios (CLI + subagent) through Claude Opus against the mock server:

# Run all scenarios
node scripts/evals/run-scenario-evals.js

# Run specific scenario(s)
node scripts/evals/run-scenario-evals.js --scenario happy-path
node scripts/evals/run-scenario-evals.js --scenarios happy-path,cancel-flow,prune-stale

# Run only CLI or subagent scenarios
node scripts/evals/run-scenario-evals.js --cli-only
node scripts/evals/run-scenario-evals.js --subagent-only

# Resume a partial run (retries failed scenarios, skips passed ones)
node scripts/evals/run-scenario-evals.js --resume scripts/evals/results/scenarios-2026-03-05T18-40-25-372Z.json

# Use a different model
CLAUDE_MODEL=sonnet node scripts/evals/run-scenario-evals.js

Results are written incrementally to scripts/evals/results/scenarios-<timestamp>.json.
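The timestamp in the example file name above (2026-03-05T18-40-25-372Z) looks like an ISO 8601 UTC timestamp with the colons and the dot replaced by hyphens. Under that assumption, a file name of the same shape can be produced as sketched below; `resultsFileName` is a hypothetical helper, and the eval script may build the name differently.

```typescript
// Builds a results file name in the scenarios-<timestamp>.json shape.
// Assumption: <timestamp> is Date#toISOString() with ":" and "." turned
// into "-", which matches the example file name shown in this section.
function resultsFileName(now: Date): string {
  const stamp = now.toISOString().replace(/[:.]/g, "-");
  return "scenarios-" + stamp + ".json";
}

// 5 March 2026, 18:40:25.372 UTC (months are zero-based in Date.UTC).
const exampleName = resultsFileName(
  new Date(Date.UTC(2026, 2, 5, 18, 40, 25, 372))
);
// exampleName is "scenarios-2026-03-05T18-40-25-372Z.json"
```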

The scenarios cover: happy path, error handling, file attachments, concurrent runs, run discoverability, health monitoring, prune --stale, prefix matching, and subagent prompt linting (dangling refs, terminology, quality).
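The --resume behaviour described above (retry failed scenarios, skip passed ones) amounts to filtering the scenario list against the earlier results. The sketch below shows that selection step only. `ScenarioResult` is a hypothetical record shape, not the format of the real results file, and it assumes scenarios that never ran in the partial run should also be executed.

```typescript
// Hypothetical result record; the real results JSON may look different.
type ScenarioResult = { id: string; passed: boolean };

// Returns the scenario ids that still need to run: everything that did
// not pass in the previous (partial) run, in the original order.
function scenariosToRun(
  allIds: string[],
  previous: ScenarioResult[]
): string[] {
  const passed: Record<string, boolean> = {};
  previous.forEach((result) => {
    if (result.passed) {
      passed[result.id] = true;
    }
  });
  return allIds.filter((id) => passed[id] !== true);
}

// happy-path passed earlier, cancel-flow failed, prune-stale never ran.
const remaining = scenariosToRun(
  ["happy-path", "cancel-flow", "prune-stale"],
  [
    { id: "happy-path", passed: true },
    { id: "cancel-flow", passed: false },
  ]
);
// remaining is ["cancel-flow", "prune-stale"]
```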

Legacy Agent SDK Evals (Codex + Claude)

These evals run oracle against the mock ChatGPT server (chatgpt.com is never contacted) and drive it with real agents:

npm run eval:agents

Individual runs:

npm run eval:codex
npm run eval:claude

Notes:

  • Uses the mock server on 127.0.0.1:7777.
  • Codex eval uses @openai/codex-sdk (bundled codex binary). Ensure Codex credentials are configured.
  • Claude eval shells out to the claude CLI. Ensure Claude Code is installed and logged in.
  • Results are written to scripts/evals/results/.
  • Eval harness strips any *_API_KEY variables from the agent process environment.
  • Optional: set ORACLE_EVAL_SKIP_CLAUDE=1 or ORACLE_EVAL_SKIP_CODEX=1 to skip that agent and run only the other one.
  • Optional: set ORACLE_EVAL_AGENT_TIMEOUT_MS to cap agent runtime.
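The note about stripping *_API_KEY variables can be pictured as a filter over the environment before the agent process starts. This is a minimal sketch, not the harness code: `stripApiKeys` is a hypothetical name, and it assumes the match is case-sensitive and applies to any variable name ending in _API_KEY.

```typescript
// Environment shape compatible with Node's process.env.
type Env = Record<string, string | undefined>;

// Returns a copy of the environment without any variable whose name
// ends in _API_KEY. The input object is left unchanged.
function stripApiKeys(env: Env): Env {
  const cleaned: Env = {};
  for (const name of Object.keys(env)) {
    if (!/_API_KEY$/.test(name)) {
      cleaned[name] = env[name];
    }
  }
  return cleaned;
}

// Illustrative values only.
const cleanedEnv = stripApiKeys({
  PATH: "/usr/bin",
  OPENAI_API_KEY: "placeholder",
  ANTHROPIC_API_KEY: "placeholder",
});
// Object.keys(cleanedEnv) is ["PATH"]
```

In Node, the cleaned object could be passed as the env option when spawning the agent, so the agent has to rely on its own stored login rather than on keys inherited from the shell.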

Debug Capture

Set ORACLE_CAPTURE_HTML=1 to save completion.html/completion.png for real ChatGPT runs:

ORACLE_CAPTURE_HTML=1 node dist/cli.js run "Hello"