AI-powered browser automation agent that understands the web.
Give it a task in plain English. Watch it execute in a real browser. Take control when needed.
Quick Start • Features • Configuration • How It Works • Architecture • Token Compression • Credential Vault
The fastest way to run SurfAgent is via npx. Ensure you have Node.js installed, then run:
# Set your API key (pick one provider)
export ANTHROPIC_API_KEY=sk-... # Anthropic (Claude) — default
# export OPENAI_API_KEY=sk-... && export LLM_PROVIDER=openai
# export DEEPSEEK_API_KEY=sk-... && export LLM_PROVIDER=deepseek # OpenAI-compatible, text-only
# Run directly
npx surfagent

SurfAgent will automatically launch its backend and frontend, then open the UI in your browser. All logs and saved credentials are stored locally in ~/.surfagent.
Note: the npx flow requires the package to be published to npm. For local development or contributing, use Manual Setup below.
SurfAgent runs a Chromium browser headless by default — the live view streams into the UI, so no separate window pops up (set HEADLESS=false to open the real window for debugging). It takes your natural language instructions and executes them step by step. It navigates websites, fills forms, clicks buttons, handles logins, and completes complex multi-step workflows — all while you watch live via screenshot streaming.
When it hits a CAPTCHA, needs login credentials, or encounters something it can't handle — it pauses and lets you take control directly from the web UI. Click on the screenshot to interact, solve the problem, and hand back.
# Clone
git clone https://github.com/vikast908/Surfagent.git
cd Surfagent
# Install dependencies for both parts
npm run install:all
# Configure the backend
cd backend
cp .env.example .env # add your API key + set LLM_PROVIDER
cd ..
# (optional) configure the frontend
cp frontend/.env.example frontend/.env.local
# Run both (backend + frontend, opens the UI)
npm start
# …or run them separately for development:
# npm run dev:backend # WebSocket server + headless browser on :8080
# npm run dev:frontend # Next.js UI on :3000

Backend config lives in backend/.env (copy from backend/.env.example); frontend config lives in frontend/.env.local (copy from frontend/.env.example).
| Variable | Default | Description |
|---|---|---|
| LLM_PROVIDER | anthropic | anthropic · openai · deepseek |
| ANTHROPIC_API_KEY | — | Required when provider is anthropic |
| OPENAI_API_KEY | — | Required when provider is openai |
| DEEPSEEK_API_KEY | — | Required when provider is deepseek (OpenAI-compatible, text-only) |
| LLM_MODEL | per-provider | Override model (e.g. gpt-4o, deepseek-chat, claude-sonnet-4-…) |
| LLM_BASE_URL | per-provider | Override API base URL (OpenAI-compatible providers) |
| LLM_VISION | on (off for deepseek) | Feed page screenshots to the model (needs a vision-capable model) |
| HEADLESS | true | Run the browser headless (streamed to the UI). false opens a real window for debugging |
| PORT | 8080 | WebSocket server port |
| WS_AUTH_TOKEN | surfagent_dev_token_2026 | Change this. Shared token the UI uses to connect |
| VAULT_SECRET | dev default | Change this. AES key for the credential vault |
| Variable | Default | Description |
|---|---|---|
| NEXT_PUBLIC_WS_URL | ws://localhost:8080 | Backend WebSocket URL |
| NEXT_PUBLIC_WS_AUTH_TOKEN | surfagent_dev_token_2026 | Must match the backend WS_AUTH_TOKEN |
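As a rough illustration of how the backend variables above interact, the sketch below resolves a settings object from an env-like map. It is not SurfAgent's actual config.js: the function names, the returned shape, and the accepted flag spellings ("true", "on", "1") are assumptions made for the example. Only the variable names and defaults come from the tables.

```javascript
// Hypothetical helper for illustration; not SurfAgent's real config.js.
const API_KEY_VARS = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  deepseek: "DEEPSEEK_API_KEY",
};

// Assumption: "true", "on", and "1" (case-insensitive) mean enabled.
function parseFlag(value, fallback) {
  if (value === undefined || value === "") return fallback;
  return ["true", "on", "1"].includes(String(value).toLowerCase());
}

function resolveBackendConfig(env) {
  const provider = (env.LLM_PROVIDER || "anthropic").toLowerCase();
  const keyVar = API_KEY_VARS[provider];
  if (!keyVar) throw new Error(`Unknown LLM_PROVIDER "${provider}"`);
  if (!env[keyVar]) {
    throw new Error(`${keyVar} is required when LLM_PROVIDER=${provider}`);
  }

  return {
    provider,
    apiKey: env[keyVar],
    // Vision defaults to on, except for the text-only deepseek provider.
    vision: parseFlag(env.LLM_VISION, provider !== "deepseek"),
    headless: parseFlag(env.HEADLESS, true),
    port: Number(env.PORT || 8080),
    // Flags the shipped dev token so it can be changed before leaving localhost.
    usesDefaultToken:
      !env.WS_AUTH_TOKEN || env.WS_AUTH_TOKEN === "surfagent_dev_token_2026",
  };
}
```

Calling `resolveBackendConfig(process.env)` at startup would fail fast on a missing API key, which is usually easier to debug than a provider error on the first task.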
| Feature | Description |
|---|---|
| Natural Language Tasks | "Go to LinkedIn and search for product manager jobs in Mumbai" |
| Live Browser Stream | JPEG screenshots at 2fps via WebSocket |
| Interactive Manual Control | Click, scroll, type directly on the browser screenshot |
| 17 Action Types | goto, click, type, hover, scroll, type_slowly, select_option, switch_tab, set_viewport, wait_for_network, check_validation, dismiss_overlay, and more |
| Accessibility Tree | Uses Playwright's a11y API for reliable element targeting |
| Visual Grounding | Feeds the page screenshot to the model so it can see canvas, icons, charts, and layout — not just the DOM (toggle with LLM_VISION) |
| Multi-Provider LLM | OpenAI (GPT-4o), Anthropic (Claude), or DeepSeek (deepseek-chat, OpenAI-compatible, text-only) — set LLM_PROVIDER |
| Feature | Description |
|---|---|
| 30+ Blocker Detection | CAPTCHA, login walls, payment flows, OAuth, OTP, file uploads |
| Auto Login with Vault | Encrypted credential storage, auto-fills on login pages |
| Multi-Step Login | Handles email-first → password-next flows (Google, Coursera) |
| Sensitive Action Guard | Blocks before delete/purchase/checkout/pay — explicit Approve / Skip / Take-over gate |
| Stuck Detection | Detects when page hasn't changed, warns LLM to change approach |
| Failed Action Memory | Tracks what failed, prevents LLM from repeating mistakes |
| Validation Error Detection | Reads form errors, feeds to LLM for re-fill |
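Stuck detection and failed-action memory from the table above can be pictured with a small tracker. This is a minimal sketch under stated assumptions: the class name, the fingerprint format, and the threshold of three unchanged observations are hypothetical and not taken from SurfAgent's code.

```javascript
// Hypothetical sketch of stuck detection plus failed-action memory.
class ProgressTracker {
  constructor(stuckAfter = 3) {
    this.stuckAfter = stuckAfter; // assumed threshold, not a documented value
    this.lastFingerprint = null;
    this.unchangedCount = 0;
    this.failedActions = new Set();
  }

  // Returns true once the page state has stayed identical for
  // `stuckAfter` consecutive observations after the first one.
  observe(state) {
    const fingerprint = `${state.url}|${state.elements.join("\n")}`;
    this.unchangedCount =
      fingerprint === this.lastFingerprint ? this.unchangedCount + 1 : 0;
    this.lastFingerprint = fingerprint;
    return this.unchangedCount >= this.stuckAfter;
  }

  static keyFor(action) {
    return `${action.type}:${action.target ?? ""}`;
  }

  recordFailure(action) {
    this.failedActions.add(ProgressTracker.keyFor(action));
  }

  hasFailed(action) {
    return this.failedActions.has(ProgressTracker.keyFor(action));
  }
}
```

When `observe` returns true, or `hasFailed` matches a planned action, a warning could be added to the next prompt so the model changes approach instead of repeating itself.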
| Feature | Description |
|---|---|
| Stage Rail | Live phase indicator — Navigating → Reading → Acting → Verifying |
| Streamed Partial Results | Extracted data streams into a Results panel as it's gathered |
| Self-Correction Visibility | Surfaces "that didn't work — trying a different approach" on failures/stuck pages |
| "What Changed" Summary | Post-run recap of actions taken, pages visited, and key facts |
| Per-Action Confidence Gate | Before irreversible clicks: Approve / I'll do it / Skip, with the step's risk + confidence |
| Run History + Re-run | Past runs persisted to ~/.surfagent/runs.json; one-click re-run |
| Feature | Description |
|---|---|
| Context Window Management | Tiered memory with Task Ledger — never runs out of context |
| TurboQuant Compression | ~70% fewer tokens per page state |
| Differential Encoding | Only sends changed elements between steps |
| Session Logging | Every action logged with timestamps, metrics, error traces |
| Feature | Description |
|---|---|
| Resizable Panels | Drag divider between chat and browser |
| New Chat | One-click to start fresh |
| Settings Panel | Manage saved credentials (add, edit, delete) |
| Task Persistence | Messages saved to localStorage across sessions |
| Click Feedback | Purple ripple on screenshot clicks |
| Sound Notifications | Tone on task complete/fail |
| Export Session | Download chat as JSON |
| Editable URL Bar | Type a URL and navigate directly |
| Keyboard Shortcuts | Escape to hand back control |
| Reconnection Toast | Notifies when WebSocket recovers |
| Command Palette | ⌘/Ctrl+K to run any action or macro — keyboard-first |
| Instruction History | ↑/↓ in the task input recalls previous instructions |
| Soft-Undo | "Undo" toast restores a cleared chat |
| Offline / Disconnect Banner | Always-visible recovery when the network or server drops |
| Accessibility | Focus-trapped dialogs, skip-to-content, reduced-motion & reduced-transparency aware |
| Stall Signal | "Still working…" when a step runs long, so silence never reads as a freeze |
User instruction
→ LLM plans 1-3 actions
→ Validate actions (skip malformed ones)
→ Check for blockers (CAPTCHA, login, payment)
→ Execute in browser with 60s timeout
→ Capture page state (DOM + accessibility tree)
→ Compress state (~70% token reduction)
→ Feed back to LLM
→ Repeat until done or max steps
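The loop above can be sketched as a single async function. This is an illustrative outline, not the code in agent.js: `planActions`, `detectBlocker`, `executeAction`, `captureState`, and `compressState` are hypothetical injected stand-ins, and the default of 25 steps is an assumption. The 1-3 actions per plan and the 60s timeout come from the flow itself.

```javascript
// Hypothetical sketch of the plan → validate → execute → observe loop.
const ACTION_TIMEOUT_MS = 60_000; // "60s timeout" from the flow above
const MAX_ACTIONS_PER_PLAN = 3; // "LLM plans 1-3 actions"

// Rejects if the promise does not settle within `ms`; always clears the timer.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Action timed out after ${ms} ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Malformed actions are skipped rather than failing the whole plan.
function isWellFormed(action) {
  return Boolean(action) && typeof action.type === "string" && action.type.length > 0;
}

async function runAgentLoop(instruction, deps, maxSteps = 25) {
  let executed = 0;
  let state = deps.compressState(await deps.captureState());

  for (let step = 0; step < maxSteps; step++) {
    const plan = await deps.planActions(instruction, state);
    if (plan.done) return { status: "done", executed };

    const actions = plan.actions.filter(isWellFormed).slice(0, MAX_ACTIONS_PER_PLAN);
    for (const action of actions) {
      // Blockers (CAPTCHA, login, payment) pause the run for the user.
      const blocker = await deps.detectBlocker(state);
      if (blocker) return { status: "paused", blocker, executed };

      await withTimeout(deps.executeAction(action), ACTION_TIMEOUT_MS);
      executed += 1;
      state = deps.compressState(await deps.captureState());
    }
  }
  return { status: "max_steps", executed };
}
```

Passing the browser and LLM calls in as `deps` keeps the control flow testable with stubs, without launching a browser or calling a model.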
The agent uses multiple strategies per attempt, rotating on failure:
1. Accessibility tree → [button] "Apply Now" (most reliable)
2. Visible text match → getByText("Sign In")
3. Role + name → getByRole("link", { name: "Jobs" })
4. Aria-label → [aria-label="Search"]
5. Placeholder → getByPlaceholder("Search")
6. CSS selector → button.submit (last resort)
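One way to picture the rotation is a function that returns the highest-priority strategy that applies to a target and has not already failed. The sketch below only produces locator descriptions as strings and does not call Playwright. The strategy names and the target fields (`role`, `name`, `text`, `ariaLabel`, `placeholder`, `css`) are assumptions for illustration.

```javascript
// Hypothetical sketch: choose the next locator strategy in priority order.
const STRATEGIES = [
  {
    name: "a11y",
    applies: (t) => Boolean(t.role && t.name),
    describe: (t) => `[${t.role}] "${t.name}"`,
  },
  {
    name: "text",
    applies: (t) => Boolean(t.text),
    describe: (t) => `getByText(${JSON.stringify(t.text)})`,
  },
  {
    name: "role",
    applies: (t) => Boolean(t.role && t.name),
    describe: (t) =>
      `getByRole(${JSON.stringify(t.role)}, { name: ${JSON.stringify(t.name)} })`,
  },
  {
    name: "aria-label",
    applies: (t) => Boolean(t.ariaLabel),
    describe: (t) => `[aria-label=${JSON.stringify(t.ariaLabel)}]`,
  },
  {
    name: "placeholder",
    applies: (t) => Boolean(t.placeholder),
    describe: (t) => `getByPlaceholder(${JSON.stringify(t.placeholder)})`,
  },
  { name: "css", applies: (t) => Boolean(t.css), describe: (t) => t.css },
];

// `failed` holds the names of strategies already tried without success.
function nextLocator(target, failed = new Set()) {
  for (const strategy of STRATEGIES) {
    if (!failed.has(strategy.name) && strategy.applies(target)) {
      return { strategy: strategy.name, locator: strategy.describe(target) };
    }
  }
  return null; // every applicable strategy has failed
}
```

A caller would add the returned strategy name to `failed` after an unsuccessful attempt and call `nextLocator` again, falling through to the CSS selector only as a last resort.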
The agent keeps every prompt within a fixed token budget, even on 50+ step tasks:
┌──────────────────────────────────────────┐
│ 90k Token Budget │
├──────────────────────────────────────────┤
│ System Prompt ~4k (fixed) │
│ Task Ledger ~1-3k (structured) │
│ Recent Messages ~8-12k (full text) │
│ Current Page State ~1-3k (compressed) │
│ Extra Context ~500 (warnings) │
│ Response Reserve ~4k (for LLM) │
└──────────────────────────────────────────┘
When conversation exceeds budget, old messages are parsed into a structured Task Ledger (pages visited, actions taken, outcomes) and dropped. The LLM always knows what happened before.
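The folding step described above might look like the following. This is a minimal sketch, assuming each message already carries structured `url`, `action`, and `outcome` fields and that token counts are approximated as characters divided by four. The real parser, tokenizer, and ledger format are not shown in this README.

```javascript
// Hypothetical sketch: fold the oldest messages into a Task Ledger
// until the remaining messages fit the budget.

// Assumption: a rough chars/4 estimate stands in for a real tokenizer.
const estimateTokens = (text) => Math.ceil(text.length / 4);

function fitToBudget(messages, ledger, messageBudget) {
  const kept = [...messages];
  const nextLedger = { pages: [...ledger.pages], actions: [...ledger.actions] };
  let total = kept.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  // Always keep at least the most recent message in full.
  while (total > messageBudget && kept.length > 1) {
    const oldest = kept.shift();
    total -= estimateTokens(oldest.content);

    if (oldest.url && !nextLedger.pages.includes(oldest.url)) {
      nextLedger.pages.push(oldest.url);
    }
    if (oldest.action) {
      nextLedger.actions.push({
        action: oldest.action,
        outcome: oldest.outcome ?? "unknown",
      });
    }
  }
  return { messages: kept, ledger: nextLedger };
}
```

The inputs are copied rather than mutated, so the caller can compare the ledger before and after a trim.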
Inspired by Google's TurboQuant — keep the signal, drop the noise:
| Technique | Example | Savings |
|---|---|---|
| Element shorthand | `<button role="button" type="submit" aria-label="Sign In">Sign In</button>` → `btn "Sign In" [submit]` | 73% |
| JSON key compression | title, url, pageType → t, u, pt | 15% |
| Deduplication | 120 elements → ~80 unique | 33% |
| Differential encoding | Full list → only new/changed + unchanged: 45 | 50% |
Result: ~70% fewer tokens per step = ~70% cost reduction on page state payloads.
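The four techniques in the table compose naturally into one encoding pass. The sketch below is an assumption-laden illustration rather than SurfAgent's encoder: only `btn` and the `t`/`u`/`pt` keys come from the table, while `lnk`, `inp`, the element fields, and the function names are hypothetical.

```javascript
// Hypothetical sketch of shorthand, key compression, dedup, and diffing.
const TAG_SHORTHAND = { button: "btn", a: "lnk", input: "inp" }; // lnk/inp are assumed
const KEY_MAP = { title: "t", url: "u", pageType: "pt" }; // from the table's example

// Element shorthand: { tag: "button", text: "Sign In", type: "submit" }
// becomes: btn "Sign In" [submit]
function toShorthand(el) {
  const tag = TAG_SHORTHAND[el.tag] ?? el.tag;
  const type = el.type ? ` [${el.type}]` : "";
  return `${tag} "${el.text}"${type}`;
}

// JSON key compression: rename known keys, leave the rest untouched.
function compressKeys(obj) {
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [KEY_MAP[key] ?? key, value])
  );
}

// Deduplication plus differential encoding against the previous step.
function encodePageState(elements, previousLines = []) {
  const lines = [...new Set(elements.map(toShorthand))];
  const seen = new Set(previousLines);
  const changed = lines.filter((line) => !seen.has(line));
  return { lines, changed, unchanged: lines.length - changed.length };
}
```

On each step only `changed` and the `unchanged` count would need to be sent to the model, while `lines` is kept locally as the baseline for the next diff.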
Encrypted local credential storage for auto-login:
- AES-256-GCM encryption at rest
- Auto-fill when login page detected — matches by domain
- Multi-step login support (email-first, then password)
- Never sent to LLM — credentials go directly to browser
- Settings panel in UI — add, edit, delete saved logins
- Save on login — checkbox to save credentials when entering them
Settings (gear icon) → + Add
Domain: linkedin.com
Username: your-email
Password: ••••••••
[Save Credentials]
Next time the agent hits LinkedIn login → auto-fills from vault → continues task.
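Matching a saved login to the current page by domain can be sketched as below. The entry shape and the rule that subdomains such as www.linkedin.com match a linkedin.com entry are assumptions, and the helper name is hypothetical. Decryption is out of scope here.

```javascript
// Hypothetical sketch: find the vault entry whose domain matches the page.
function findCredential(pageUrl, entries) {
  const host = new URL(pageUrl).hostname.toLowerCase();
  const match = entries.find((entry) => {
    const domain = entry.domain.toLowerCase();
    // Exact host or a true subdomain; "notlinkedin.com" must not match.
    return host === domain || host.endsWith(`.${domain}`);
  });
  return match ?? null;
}
```

Comparing against `.${domain}` rather than a bare suffix avoids filling credentials on look-alike hosts such as linkedin.com.evil.example.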
SurfAgent/
├── cli.js Root orchestrator for npx usage
├── CLAUDE.md Project guide + UI/UX governance for AI agents
├── docs/UX_SPEC.md Canonical UX specification (v2.1)
├── backend/
│ ├── server.js WebSocket server — routes messages, manages lifecycle
│ ├── agent.js LLM agent loop — planning, execution, recovery
│ ├── browser.js Playwright — 17 actions, a11y tree, blocker detection
│ ├── config.js Global config (paths to ~/.surfagent)
│ ├── logger.js Session logging (stores in ~/.surfagent/logs)
│ ├── vault.js AES-256-GCM encrypted vault (stores in ~/.surfagent)
│ └── .env API keys + vault secret (gitignored)
│
├── frontend/
│ ├── src/app/
│ │ ├── page.tsx Main layout (consumes AgentContext)
│ │ ├── layout.tsx Root layout with AgentProvider
│ │ └── globals.css Animations, ripple effects
│ ├── src/context/
│ │ └── AgentContext.tsx Global state management & WebSocket logic
│ ├── src/components/
│ │ ├── ChatPanel.tsx Chat, steps, credentials, examples, copy, export
│ │ ├── BrowserView.tsx Interactive browser with click/scroll/type
│ │ └── CommandPalette.tsx ⌘/Ctrl+K command palette
│ ├── src/hooks/
│ │ ├── useWebSocket.ts Auto-reconnecting WebSocket with heartbeat
│ │ ├── useFocusTrap.ts Modal focus trap + restore
│ │ └── useOnlineStatus.ts Network reachability
│ ├── src/lib/
│ │ └── telemetry.ts Pluggable interaction telemetry
│ └── tailwind.config.ts Calm Intelligence design system
Calm Intelligence v2 — designed for AI agent interfaces. The full microinteraction, motion, state-machine, and accessibility spec lives in docs/UX_SPEC.md.
| Token | Value | Usage |
|---|---|---|
| bg | #0B0F14 | Primary background |
| surface | #111827 | Panels, cards |
| elevated | #1F2937 | Inputs, hover states |
| accent | #4F46E5 | Primary actions, focus |
| success | #22C55E | Completed, connected |
| warning | #F59E0B | Paused, Take Control |
| error | #EF4444 | Failed, Stop |
Typography: Inter for UI, JetBrains Mono for logs/code. Animations: 150-200ms, functional only.
| Layer | Technology |
|---|---|
| Frontend | Next.js 14, React 18, Tailwind CSS, TypeScript |
| Backend | Node.js (ES Modules) |
| Browser | Playwright (Chromium, headless — streamed to UI; HEADLESS=false for a window) |
| LLM | OpenAI GPT-4o · Anthropic Claude · DeepSeek (OpenAI-compatible) |
| Communication | WebSocket with 30s heartbeat |
| Streaming | JPEG screenshots at 2fps |
| Encryption | AES-256-GCM (Node.js crypto) |
| Doc | What's inside |
|---|---|
| docs/UX_SPEC.md | Canonical UX spec: microinteractions, motion, state machines, accessibility |
| design.txt | "Calm Intelligence" design philosophy |
| GEMINI.md | Engineering & UX strict standards |
| CLAUDE.md | Guide for AI coding agents working in this repo |
| prd.txt | Original product requirements (historical) |
- Credentials never reach the LLM. Saved logins are AES-256-GCM encrypted at rest (~/.surfagent/vault.enc) and injected directly into the browser; they are never sent to the model or written to logs.
- Change the defaults before exposing SurfAgent beyond localhost: set a strong WS_AUTH_TOKEN and VAULT_SECRET in backend/.env.
- backend/.env (your API keys) is gitignored. Never commit it.
- The browser uses automation-detection mitigations but is not a hardened sandbox; run it on machines and accounts you trust.
Contributions are welcome!
- Read GEMINI.md (engineering standards) and docs/UX_SPEC.md (UX standards) before making UI changes.
- Make sure cd frontend && npm run build passes with zero warnings or errors.
- Keep changes surgical and well-scoped, and never commit secrets.
MIT © Vikas
Built with Playwright, Next.js, and LLM APIs.
If the agent understands how websites work, it can operate on any site.