SurfAgent

AI-powered browser automation agent that understands the web.
Give it a task in plain English. Watch it execute in a real browser. Take control when needed.

Quick Start • Features • Configuration • How It Works • Architecture • Token Compression • Credential Vault

⚡ Quick Start (No Install)

The fastest way to run SurfAgent is via npx. Ensure you have Node.js installed, then run:

# Set your API key (pick one provider)
export ANTHROPIC_API_KEY=sk-...                                  # Anthropic (Claude) — default
# export OPENAI_API_KEY=sk-...   && export LLM_PROVIDER=openai
# export DEEPSEEK_API_KEY=sk-... && export LLM_PROVIDER=deepseek # OpenAI-compatible, text-only
# Run directly
npx surfagent

SurfAgent automatically launches its backend and frontend, then opens the UI in your browser. All logs and saved credentials are stored locally in ~/.surfagent.

Note: the npx flow requires the package to be published to npm. For local development or contributing, use Manual Setup below.

What It Does

SurfAgent runs a Chromium browser, headless by default: the live view streams into the UI, so no separate window pops up (set HEADLESS=false to open a real window for debugging). It takes your natural-language instructions and executes them step by step. It navigates websites, fills forms, clicks buttons, handles logins, and completes complex multi-step workflows, all while you watch live via screenshot streaming.

When it hits a CAPTCHA, needs login credentials, or encounters something it can't handle, it pauses and lets you take control directly from the web UI. Click on the screenshot to interact, solve the problem, and hand control back.

🛠️ Manual Setup

# Clone
git clone https://github.com/vikast908/Surfagent.git
cd Surfagent

# Install dependencies for both parts
npm run install:all

# Configure the backend
cd backend
cp .env.example .env     # add your API key + set LLM_PROVIDER
cd ..

# (optional) configure the frontend
cp frontend/.env.example frontend/.env.local

# Run both (backend + frontend, opens the UI)
npm start

# …or run them separately for development:
# npm run dev:backend     # WebSocket server + headless browser on :8080
# npm run dev:frontend    # Next.js UI on :3000

Configuration

Backend config lives in backend/.env (copy from backend/.env.example); frontend config in frontend/.env.local (copy from frontend/.env.example).

Backend — `backend/.env`

Variable	Default	Description
`LLM_PROVIDER`	`anthropic`	`anthropic` · `openai` · `deepseek`
`ANTHROPIC_API_KEY`	—	Required when provider is `anthropic`
`OPENAI_API_KEY`	—	Required when provider is `openai`
`DEEPSEEK_API_KEY`	—	Required when provider is `deepseek` (OpenAI-compatible, text-only)
`LLM_MODEL`	per-provider	Override model (e.g. `gpt-4o`, `deepseek-chat`, `claude-sonnet-4-…`)
`LLM_BASE_URL`	per-provider	Override API base URL (OpenAI-compatible providers)
`LLM_VISION`	`on` (`off` for deepseek)	Feed page screenshots to the model (needs a vision-capable model)
`HEADLESS`	`true`	Run the browser headless (streamed to the UI). `false` opens a real window for debugging
`PORT`	`8080`	WebSocket server port
`WS_AUTH_TOKEN`	`surfagent_dev_token_2026`	Change this — shared token the UI uses to connect
`VAULT_SECRET`	dev default	Change this — AES key for the credential vault

Frontend — `frontend/.env.local`

Variable	Default	Description
`NEXT_PUBLIC_WS_URL`	`ws://localhost:8080`	Backend WebSocket URL
`NEXT_PUBLIC_WS_AUTH_TOKEN`	`surfagent_dev_token_2026`	Must match the backend `WS_AUTH_TOKEN`
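
The backend defaults in the table above can be read as a small resolution function. The sketch below is illustrative only: `resolveBackendConfig` and `API_KEY_VARS` are hypothetical names, not SurfAgent's actual `config.js`, and it assumes `LLM_VISION` takes the literal strings `on` and `off` and `HEADLESS` takes `true` or `false`.

```javascript
// Hypothetical helper that mirrors the documented defaults (not the real config.js).
const API_KEY_VARS = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  deepseek: "DEEPSEEK_API_KEY",
};

function resolveBackendConfig(env) {
  const provider = env.LLM_PROVIDER ?? "anthropic";
  if (!Object.hasOwn(API_KEY_VARS, provider)) {
    throw new Error(`Unknown LLM_PROVIDER: ${provider}`);
  }
  const keyVar = API_KEY_VARS[provider];
  if (!env[keyVar]) {
    throw new Error(`${keyVar} is required when provider is ${provider}`);
  }

  // Vision defaults to on, except for the text-only deepseek provider.
  const visionDefault = provider === "deepseek" ? "off" : "on";
  return {
    provider,
    apiKey: env[keyVar],
    vision: (env.LLM_VISION ?? visionDefault) === "on",
    headless: (env.HEADLESS ?? "true") !== "false",
    port: Number(env.PORT ?? 8080),
  };
}

const example = resolveBackendConfig({
  LLM_PROVIDER: "deepseek",
  DEEPSEEK_API_KEY: "sk-example",
});
console.log(example.vision, example.headless, example.port); // false true 8080
```

In a real process you would pass `process.env` instead of a literal object; a literal is used here so the result is deterministic.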

Features

Core

Feature	Description
Natural Language Tasks	"Go to LinkedIn and search for product manager jobs in Mumbai"
Live Browser Stream	JPEG screenshots at 2fps via WebSocket
Interactive Manual Control	Click, scroll, type directly on the browser screenshot
17 Action Types	goto, click, type, hover, scroll, type_slowly, select_option, switch_tab, set_viewport, wait_for_network, check_validation, dismiss_overlay, and more
Accessibility Tree	Uses Playwright's a11y API for reliable element targeting
Visual Grounding	Feeds the page screenshot to the model so it can see canvas, icons, charts, and layout — not just the DOM (toggle with `LLM_VISION`)
Multi-Provider LLM	OpenAI (GPT-4o), Anthropic (Claude), or DeepSeek (`deepseek-chat`, OpenAI-compatible, text-only) — set `LLM_PROVIDER`

Smart Automation

Feature	Description
30+ Blocker Detection	CAPTCHA, login walls, payment flows, OAuth, OTP, file uploads
Auto Login with Vault	Encrypted credential storage, auto-fills on login pages
Multi-Step Login	Handles email-first → password-next flows (Google, Coursera)
Sensitive Action Guard	Blocks before delete/purchase/checkout/pay — explicit Approve / Skip / Take-over gate
Stuck Detection	Detects when page hasn't changed, warns LLM to change approach
Failed Action Memory	Tracks what failed, prevents LLM from repeating mistakes
Validation Error Detection	Reads form errors, feeds to LLM for re-fill
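
Stuck Detection and Failed Action Memory from the table above can be pictured as a small tracker that feeds warnings back to the LLM. This is a minimal sketch under assumptions: `ProgressTracker`, the page and action shapes, and the threshold of 3 unchanged steps are all hypothetical and not taken from SurfAgent's source.

```javascript
// Hypothetical sketch of stuck detection plus failed-action memory.
class ProgressTracker {
  constructor(stuckThreshold = 3) {
    this.stuckThreshold = stuckThreshold; // assumed value
    this.lastFingerprint = null;
    this.unchangedSteps = 0;
    this.failedActions = new Set();
  }

  // Fingerprint the page by URL plus its element labels.
  static fingerprint(page) {
    return JSON.stringify([page.url, page.elements]);
  }

  // Call once per step with the freshly captured page state.
  recordStep(page) {
    const fp = ProgressTracker.fingerprint(page);
    this.unchangedSteps = fp === this.lastFingerprint ? this.unchangedSteps + 1 : 0;
    this.lastFingerprint = fp;
  }

  // Actions are keyed by their JSON form, so property order matters.
  recordFailure(action) {
    this.failedActions.add(JSON.stringify(action));
  }

  hasFailed(action) {
    return this.failedActions.has(JSON.stringify(action));
  }

  isStuck() {
    return this.unchangedSteps >= this.stuckThreshold;
  }

  // Extra context lines to append to the next LLM prompt.
  warnings() {
    const out = [];
    if (this.isStuck()) {
      out.push(`Page unchanged for ${this.unchangedSteps} steps; try a different approach.`);
    }
    if (this.failedActions.size > 0) {
      out.push(`Do not repeat these failed actions: ${[...this.failedActions].join(", ")}`);
    }
    return out;
  }
}
```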

Agent Observability & Trust (v2.1)

Feature	Description
Stage Rail	Live phase indicator — Navigating → Reading → Acting → Verifying
Streamed Partial Results	Extracted data streams into a Results panel as it's gathered
Self-Correction Visibility	Surfaces "that didn't work — trying a different approach" on failures/stuck pages
"What Changed" Summary	Post-run recap of actions taken, pages visited, and key facts
Per-Action Confidence Gate	Before irreversible clicks: Approve / I'll do it / Skip, with the step's risk + confidence
Run History + Re-run	Past runs persisted to `~/.surfagent/runs.json`; one-click re-run

Context & Cost Management

Feature	Description
Context Window Management	Tiered memory with a Task Ledger keeps long runs within the token budget
TurboQuant Compression	~70% fewer tokens per page state
Differential Encoding	Only sends changed elements between steps
Session Logging	Every action logged with timestamps, metrics, error traces

UI / UX

Feature	Description
Resizable Panels	Drag divider between chat and browser
New Chat	One-click to start fresh
Settings Panel	Manage saved credentials (add, edit, delete)
Task Persistence	Messages saved to localStorage across sessions
Click Feedback	Purple ripple on screenshot clicks
Sound Notifications	Tone on task complete/fail
Export Session	Download chat as JSON
Editable URL Bar	Type a URL and navigate directly
Keyboard Shortcuts	Escape to hand back control
Reconnection Toast	Notifies when WebSocket recovers
Command Palette	⌘/Ctrl+K to run any action or macro — keyboard-first
Instruction History	↑/↓ in the task input recalls previous instructions
Soft-Undo	"Undo" toast restores a cleared chat
Offline / Disconnect Banner	Always-visible recovery when the network or server drops
Accessibility	Focus-trapped dialogs, skip-to-content, reduced-motion & reduced-transparency aware
Stall Signal	"Still working…" when a step runs long, so silence never reads as a freeze

How It Works

Agent Loop

User instruction
  → LLM plans 1-3 actions
    → Validate actions (skip malformed ones)
      → Check for blockers (CAPTCHA, login, payment)
        → Execute in browser with 60s timeout
          → Capture page state (DOM + accessibility tree)
            → Compress state (~70% token reduction)
              → Feed back to LLM
                → Repeat until done or max steps
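
The loop above can be sketched as follows. This is a simplified illustration, not the code in `agent.js`: the `deps` callbacks (`planActions`, `detectBlocker`, `execute`, `capturePageState`, `compressState`), the return shapes, the small `VALID_TYPES` subset, and the default of 25 max steps are assumptions. Only the 60s execution timeout comes from the description above, and error recovery is omitted for brevity.

```javascript
// Subset of action types, for illustration only.
const VALID_TYPES = new Set(["goto", "click", "type", "hover", "scroll"]);

function isValidAction(action) {
  return action !== null && typeof action === "object" && VALID_TYPES.has(action.type);
}

// Reject if the promise does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`Action timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

async function runAgent(instruction, deps, maxSteps = 25) {
  let state = deps.compressState(await deps.capturePageState());

  for (let step = 0; step < maxSteps; step++) {
    // The planner returns { done, actions } with 1-3 actions per turn.
    const plan = await deps.planActions(instruction, state);
    if (plan.done) return { status: "done", steps: step };

    // Malformed actions are skipped rather than failing the whole turn.
    for (const action of plan.actions.filter(isValidAction)) {
      const blocker = await deps.detectBlocker(state);
      if (blocker) return { status: "paused", blocker, steps: step };

      await withTimeout(deps.execute(action), 60_000);
      state = deps.compressState(await deps.capturePageState());
    }
  }
  return { status: "max_steps", steps: maxSteps };
}
```

Passing the browser and LLM operations in as `deps` keeps the control flow testable with stubs and makes the pause-on-blocker path explicit.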

Element Selection Strategy

The agent uses multiple strategies per attempt, rotating on failure:

1. Accessibility tree  →  [button] "Apply Now"       (most reliable)
2. Visible text match  →  getByText("Sign In")
3. Role + name         →  getByRole("link", { name: "Jobs" })
4. Aria-label          →  [aria-label="Search"]
5. Placeholder         →  getByPlaceholder("Search")
6. CSS selector        →  button.submit              (last resort)
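
A fallback chain like the one listed above might look like the sketch below. It uses real Playwright locator methods (`getByText`, `getByRole`, `getByPlaceholder`, `locator`, `count`, `first`), but `resolveLocator`, `candidateLocators`, and the `target` descriptor are hypothetical. Strategy 1 (accessibility tree lookup) is left out because its implementation is not described here.

```javascript
// Build candidate locators in the documented priority order (strategies 2-6).
// `page` is expected to be a Playwright Page; `target` is a hypothetical descriptor.
function candidateLocators(page, target) {
  const candidates = [];
  if (target.text) {
    candidates.push(["text", () => page.getByText(target.text)]);
  }
  if (target.role && target.name) {
    candidates.push(["role", () => page.getByRole(target.role, { name: target.name })]);
  }
  if (target.ariaLabel) {
    // JSON.stringify yields a quoted, escaped string usable in the attribute selector.
    candidates.push(["aria-label", () => page.locator(`[aria-label=${JSON.stringify(target.ariaLabel)}]`)]);
  }
  if (target.placeholder) {
    candidates.push(["placeholder", () => page.getByPlaceholder(target.placeholder)]);
  }
  if (target.css) {
    candidates.push(["css", () => page.locator(target.css)]); // last resort
  }
  return candidates;
}

// Return the first strategy that matches at least one element, or null.
async function resolveLocator(page, target) {
  for (const [strategy, build] of candidateLocators(page, target)) {
    const locator = build();
    if ((await locator.count()) > 0) {
      return { strategy, locator: locator.first() };
    }
  }
  return null;
}
```

A caller could then run `const hit = await resolveLocator(page, { text: "Sign In", css: "button.submit" })` and click `hit.locator` if `hit` is not null.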

Context Window Management

The agent keeps every prompt within a fixed token budget, even on 50+ step tasks:

┌──────────────────────────────────────────┐
│           90k Token Budget               │
├──────────────────────────────────────────┤
│  System Prompt        ~4k   (fixed)      │
│  Task Ledger          ~1-3k (structured) │
│  Recent Messages      ~8-12k (full text) │
│  Current Page State   ~1-3k (compressed) │
│  Extra Context        ~500  (warnings)   │
│  Response Reserve     ~4k   (for LLM)    │
└──────────────────────────────────────────┘

When the conversation exceeds the budget, old messages are parsed into a structured Task Ledger (pages visited, actions taken, outcomes) and then dropped, so the LLM keeps a summary of what happened before.
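
The trimming step can be illustrated with a short sketch. Everything here is an assumption made for illustration: the message shape (`step`, `url`, `action`, `outcome`, `content`), the `fitToBudget` name, and the rough estimate of 4 characters per token, which stands in for a real tokenizer.

```javascript
// Rough token estimate: about 4 characters per token (not a real tokenizer).
const estimateTokens = (text) => Math.ceil(text.length / 4);

// Fold the oldest messages into the ledger until the remainder fits the budget.
function fitToBudget(messages, ledger, recentBudget) {
  const kept = [...messages];
  const nextLedger = [...ledger];
  let total = kept.reduce((sum, m) => sum + estimateTokens(m.content), 0);

  // Always keep at least the most recent message.
  while (kept.length > 1 && total > recentBudget) {
    const oldest = kept.shift();
    total -= estimateTokens(oldest.content);
    // Only the structured facts survive; the full text is dropped.
    nextLedger.push({
      step: oldest.step,
      url: oldest.url,
      action: oldest.action,
      outcome: oldest.outcome,
    });
  }
  return { messages: kept, ledger: nextLedger };
}
```

The ledger entries are much smaller than the messages they replace, which is what lets the history grow with the task while the prompt stays bounded.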

Token Compression (TurboQuant-Inspired)

Inspired by Google's TurboQuant — keep the signal, drop the noise:

Technique	Example	Savings
Element shorthand	`<button role="button" type="submit" aria-label="Sign In">Sign In</button>` → `btn "Sign In" [submit]`	73%
JSON key compression	`title, url, pageType` → `t, u, pt`	15%
Deduplication	120 elements → ~80 unique	33%
Differential encoding	Full list → only new/changed + `unchanged: 45`	50%

Result: ~70% fewer tokens per step, which translates to roughly a 70% cost reduction on page state payloads.
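
Three of the techniques in the table (element shorthand, deduplication, differential encoding) can be sketched in a few lines. The abbreviations in `ROLE_SHORTHAND`, the element shape, and the function names are assumptions for illustration; only the `btn "Sign In" [submit]` output format comes from the table.

```javascript
// Assumed abbreviations; only "btn" appears in the documented example.
const ROLE_SHORTHAND = { button: "btn", link: "lnk", textbox: "inp" };

// { role: "button", name: "Sign In", type: "submit" } -> btn "Sign In" [submit]
function shorthand(el) {
  const tag = ROLE_SHORTHAND[el.role] ?? el.role;
  const suffix = el.type ? ` [${el.type}]` : "";
  return `${tag} "${el.name}"${suffix}`;
}

// Shorthand plus deduplication: identical lines collapse to one entry.
function compressElements(elements) {
  return [...new Set(elements.map(shorthand))];
}

// Differential encoding: send only lines not present in the previous step.
function diffState(previous, current) {
  const seen = new Set(previous);
  const added = current.filter((line) => !seen.has(line));
  return { added, unchanged: current.length - added.length };
}

const before = compressElements([
  { role: "button", name: "Sign In", type: "submit" },
  { role: "link", name: "Jobs" },
]);
const after = compressElements([
  { role: "button", name: "Sign In", type: "submit" },
  { role: "button", name: "Sign In", type: "submit" },
  { role: "link", name: "Help" },
]);
console.log(diffState(before, after)); // { added: [ 'lnk "Help"' ], unchanged: 1 }
```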

Credential Vault

Encrypted local credential storage for auto-login:

AES-256-GCM encryption at rest
Auto-fill when login page detected — matches by domain
Multi-step login support (email-first, then password)
Never sent to LLM — credentials go directly to browser
Settings panel in UI — add, edit, delete saved logins
Save on login — checkbox to save credentials when entering them

Settings (gear icon) → + Add
  Domain: linkedin.com
  Username: your-email
  Password: ••••••••
  [Save Credentials]

Next time the agent hits LinkedIn login → auto-fills from vault → continues task.
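
For readers unfamiliar with AES-256-GCM in Node.js, the sketch below shows one way to encrypt and decrypt a credential with the built-in `crypto` module. It is not the code in `vault.js`: deriving the key from the secret with scrypt, the per-record random salt, and the `salt | iv | tag | ciphertext` layout are assumptions, and `encryptCredential` / `decryptCredential` are hypothetical names.

```javascript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

const SALT_LEN = 16;
const IV_LEN = 12; // 96-bit nonce, the size recommended for GCM
const TAG_LEN = 16; // default GCM auth tag length in Node.js

// Encrypt a string; returns base64 of salt | iv | tag | ciphertext (assumed layout).
function encryptCredential(plaintext, secret) {
  const salt = randomBytes(SALT_LEN);
  const iv = randomBytes(IV_LEN);
  const key = scryptSync(secret, salt, 32); // 32 bytes = 256-bit key
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([salt, iv, tag, ciphertext]).toString("base64");
}

// Decrypt; throws if the secret is wrong or the data was tampered with.
function decryptCredential(encoded, secret) {
  const raw = Buffer.from(encoded, "base64");
  const salt = raw.subarray(0, SALT_LEN);
  const iv = raw.subarray(SALT_LEN, SALT_LEN + IV_LEN);
  const tag = raw.subarray(SALT_LEN + IV_LEN, SALT_LEN + IV_LEN + TAG_LEN);
  const ciphertext = raw.subarray(SALT_LEN + IV_LEN + TAG_LEN);
  const key = scryptSync(secret, salt, 32);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

GCM is an authenticated mode, so a wrong `VAULT_SECRET` or a modified vault file causes decryption to fail with an error instead of returning garbage.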

Architecture

SurfAgent/
├── cli.js                 Root orchestrator for npx usage
├── CLAUDE.md              Project guide + UI/UX governance for AI agents
├── docs/UX_SPEC.md        Canonical UX specification (v2.1)
├── backend/
│   ├── server.js          WebSocket server — routes messages, manages lifecycle
│   ├── agent.js           LLM agent loop — planning, execution, recovery
│   ├── browser.js         Playwright — 17 actions, a11y tree, blocker detection
│   ├── config.js          Global config (paths to ~/.surfagent)
│   ├── logger.js          Session logging (stores in ~/.surfagent/logs)
│   ├── vault.js           AES-256-GCM encrypted vault (stores in ~/.surfagent)
│   └── .env               API keys + vault secret (gitignored)
│
├── frontend/
│   ├── src/app/
│   │   ├── page.tsx       Main layout (consumes AgentContext)
│   │   ├── layout.tsx     Root layout with AgentProvider
│   │   └── globals.css    Animations, ripple effects
│   ├── src/context/
│   │   └── AgentContext.tsx Global state management & WebSocket logic
│   ├── src/components/
│   │   ├── ChatPanel.tsx  Chat, steps, credentials, examples, copy, export
│   │   ├── BrowserView.tsx Interactive browser with click/scroll/type
│   │   └── CommandPalette.tsx ⌘/Ctrl+K command palette
│   ├── src/hooks/
│   │   ├── useWebSocket.ts Auto-reconnecting WebSocket with heartbeat
│   │   ├── useFocusTrap.ts Modal focus trap + restore
│   │   └── useOnlineStatus.ts Network reachability
│   ├── src/lib/
│   │   └── telemetry.ts   Pluggable interaction telemetry
│   └── tailwind.config.ts Calm Intelligence design system

Design System

Calm Intelligence v2 — designed for AI agent interfaces. The full microinteraction, motion, state-machine, and accessibility spec lives in docs/UX_SPEC.md.

Token	Value	Usage
`bg`	`#0B0F14`	Primary background
`surface`	`#111827`	Panels, cards
`elevated`	`#1F2937`	Inputs, hover states
`accent`	`#4F46E5`	Primary actions, focus
`success`	`#22C55E`	Completed, connected
`warning`	`#F59E0B`	Paused, Take Control
`error`	`#EF4444`	Failed, Stop

Typography: Inter for UI, JetBrains Mono for logs/code. Animations: 150-200ms, functional only.

Tech Stack

Layer	Technology
Frontend	Next.js 14, React 18, Tailwind CSS, TypeScript
Backend	Node.js (ES Modules)
Browser	Playwright (Chromium, headless — streamed to UI; `HEADLESS=false` for a window)
LLM	OpenAI GPT-4o · Anthropic Claude · DeepSeek (OpenAI-compatible)
Communication	WebSocket with 30s heartbeat
Streaming	JPEG screenshots at 2fps
Encryption	AES-256-GCM (Node.js crypto)

Documentation

Doc	What's inside
`docs/UX_SPEC.md`	Canonical UX spec — microinteractions, motion, state machines, accessibility
`design.txt`	"Calm Intelligence" design philosophy
`GEMINI.md`	Engineering & UX strict standards
`CLAUDE.md`	Guide for AI coding agents working in this repo
`prd.txt`	Original product requirements (historical)

Security

Credentials never reach the LLM. Saved logins are AES-256-GCM encrypted at rest (~/.surfagent/vault.enc) and injected directly into the browser — never sent to the model or written to logs.
Change the defaults before exposing SurfAgent beyond localhost: set a strong WS_AUTH_TOKEN and VAULT_SECRET in backend/.env.
backend/.env (your API keys) is gitignored — never commit it.
The browser uses automation-detection mitigations but is not a hardened sandbox; run it on machines and accounts you trust.

Contributing

Contributions are welcome!

Read GEMINI.md (engineering standards) and docs/UX_SPEC.md (UX standards) before making UI changes.
Make sure `cd frontend && npm run build` passes with zero warnings or errors.
Keep changes surgical and well-scoped, and never commit secrets.

License

MIT © Vikas

Built with Playwright, Next.js, and LLM APIs.
If the agent understands how websites work, it can operate on any site.
