SurfAgent

AI-powered browser automation agent that understands the web.
Give it a task in plain English. Watch it execute in a real browser. Take control when needed.

npx surfagent

Next.js · Playwright · OpenAI · Anthropic · DeepSeek · TypeScript · MIT License

Quick Start · Features · Configuration · How It Works · Architecture · Token Compression · Credential Vault


⚡ Quick Start (No Install)

The fastest way to run SurfAgent is via npx. Ensure you have Node.js installed, then run:

# Set your API key (pick one provider)
export ANTHROPIC_API_KEY=sk-...                                  # Anthropic (Claude) — default
# export OPENAI_API_KEY=sk-...   && export LLM_PROVIDER=openai
# export DEEPSEEK_API_KEY=sk-... && export LLM_PROVIDER=deepseek # OpenAI-compatible, text-only
# Run directly
npx surfagent

SurfAgent will automatically launch its backend, frontend, and open the UI in your browser. All logs and saved credentials are stored locally in ~/.surfagent.

Note: the npx flow requires the package to be published to npm. For local development or contributing, use Manual Setup below.


What It Does

SurfAgent runs a headless Chromium browser by default. The live view streams into the UI, so no separate window opens (set HEADLESS=false to open a real window for debugging). It takes your natural-language instructions and executes them step by step: it navigates websites, fills forms, clicks buttons, handles logins, and completes multi-step workflows while you watch via live screenshot streaming.

When it hits a CAPTCHA, needs login credentials, or encounters something it can't handle, it pauses and lets you take control directly from the web UI. Click on the screenshot to interact, solve the problem, and hand control back.


🛠️ Manual Setup

# Clone
git clone https://github.com/vikast908/Surfagent.git
cd Surfagent

# Install dependencies for both parts
npm run install:all

# Configure the backend
cd backend
cp .env.example .env     # add your API key + set LLM_PROVIDER
cd ..

# (optional) configure the frontend
cp frontend/.env.example frontend/.env.local

# Run both (backend + frontend, opens the UI)
npm start

# …or run them separately for development:
# npm run dev:backend     # WebSocket server + headless browser on :8080
# npm run dev:frontend    # Next.js UI on :3000

Configuration

Backend config lives in backend/.env (copy from backend/.env.example); frontend config in frontend/.env.local (copy from frontend/.env.example).

Backend — backend/.env

| Variable | Default | Description |
| --- | --- | --- |
| LLM_PROVIDER | anthropic | anthropic · openai · deepseek |
| ANTHROPIC_API_KEY | | Required when provider is anthropic |
| OPENAI_API_KEY | | Required when provider is openai |
| DEEPSEEK_API_KEY | | Required when provider is deepseek (OpenAI-compatible, text-only) |
| LLM_MODEL | per-provider | Override model (e.g. gpt-4o, deepseek-chat, claude-sonnet-4-…) |
| LLM_BASE_URL | per-provider | Override API base URL (OpenAI-compatible providers) |
| LLM_VISION | on (off for deepseek) | Feed page screenshots to the model (needs a vision-capable model) |
| HEADLESS | true | Run the browser headless (streamed to the UI). false opens a real window for debugging |
| PORT | 8080 | WebSocket server port |
| WS_AUTH_TOKEN | surfagent_dev_token_2026 | Change this. Shared token the UI uses to connect |
| VAULT_SECRET | dev default | Change this. AES key for the credential vault |
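
As a rough illustration of how the defaults in this table interact, the sketch below resolves a config object from an environment map. The function and type names (resolveConfig, BackendConfig, parseFlag) are hypothetical, and the rule for parsing boolean flags is an assumption; the actual logic lives in backend/config.js and may differ.

```typescript
// Hypothetical resolver; variable names and defaults follow the table above.
type Provider = "anthropic" | "openai" | "deepseek";

interface BackendConfig {
  provider: Provider;
  apiKey: string;
  vision: boolean;
  headless: boolean;
  port: number;
}

// Which API key variable each provider requires.
const KEY_VARS: Record<Provider, string> = {
  anthropic: "ANTHROPIC_API_KEY",
  openai: "OPENAI_API_KEY",
  deepseek: "DEEPSEEK_API_KEY",
};

function isProvider(value: string): value is Provider {
  return value === "anthropic" || value === "openai" || value === "deepseek";
}

// Assumption: "false", "0", and "off" mean false; any other non-empty value means true.
function parseFlag(raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw === "") return fallback;
  const lowered = raw.toLowerCase();
  return !(lowered === "false" || lowered === "0" || lowered === "off");
}

function resolveConfig(env: Record<string, string | undefined>): BackendConfig {
  const provider = env["LLM_PROVIDER"] ?? "anthropic";
  if (!isProvider(provider)) throw new Error(`Unknown LLM_PROVIDER: ${provider}`);
  const keyVar = KEY_VARS[provider];
  const apiKey = env[keyVar];
  if (!apiKey) throw new Error(`${keyVar} is required when LLM_PROVIDER=${provider}`);
  return {
    provider,
    apiKey,
    vision: parseFlag(env["LLM_VISION"], provider !== "deepseek"), // off by default for deepseek
    headless: parseFlag(env["HEADLESS"], true),
    port: Number(env["PORT"] ?? 8080),
  };
}
```

Passing the environment in as a plain object (for example process.env) keeps the resolver easy to test without touching real environment variables.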

Frontend — frontend/.env.local

| Variable | Default | Description |
| --- | --- | --- |
| NEXT_PUBLIC_WS_URL | ws://localhost:8080 | Backend WebSocket URL |
| NEXT_PUBLIC_WS_AUTH_TOKEN | surfagent_dev_token_2026 | Must match the backend WS_AUTH_TOKEN |

Features

Core

| Feature | Description |
| --- | --- |
| Natural Language Tasks | "Go to LinkedIn and search for product manager jobs in Mumbai" |
| Live Browser Stream | JPEG screenshots at 2 fps via WebSocket |
| Interactive Manual Control | Click, scroll, type directly on the browser screenshot |
| 17 Action Types | goto, click, type, hover, scroll, type_slowly, select_option, switch_tab, set_viewport, wait_for_network, check_validation, dismiss_overlay, and more |
| Accessibility Tree | Uses Playwright's a11y API for reliable element targeting |
| Visual Grounding | Feeds the page screenshot to the model so it can see canvas, icons, charts, and layout, not just the DOM (toggle with LLM_VISION) |
| Multi-Provider LLM | OpenAI (GPT-4o), Anthropic (Claude), or DeepSeek (deepseek-chat, OpenAI-compatible, text-only); set LLM_PROVIDER |

Smart Automation

| Feature | Description |
| --- | --- |
| 30+ Blocker Detection | CAPTCHA, login walls, payment flows, OAuth, OTP, file uploads |
| Auto Login with Vault | Encrypted credential storage, auto-fills on login pages |
| Multi-Step Login | Handles email-first → password-next flows (Google, Coursera) |
| Sensitive Action Guard | Pauses before delete/purchase/checkout/pay actions behind an explicit Approve / Skip / Take-over gate |
| Stuck Detection | Detects when the page hasn't changed and warns the LLM to change approach |
| Failed Action Memory | Tracks what failed so the LLM does not repeat the same mistakes |
| Validation Error Detection | Reads form errors and feeds them to the LLM for re-fill |
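
Stuck Detection and Failed Action Memory can be pictured with the small sketch below. It is only an illustration under stated assumptions: createActionMemory, the Attempt shape, and the threshold of three identical observations are hypothetical and not taken from backend/agent.js.

```typescript
// Hypothetical shape of an attempted action.
interface Attempt {
  type: string;
  target: string;
}

function createActionMemory(stuckThreshold = 3) {
  const failed = new Set<string>();
  let lastState: string | null = null;
  let sameStateCount = 0;
  const keyOf = (a: Attempt): string => `${a.type}:${a.target}`;

  return {
    // Remember an action that did not work.
    recordFailure(a: Attempt): void {
      failed.add(keyOf(a));
    },
    // Lets the planner skip (or warn about) an action that already failed.
    hasFailed(a: Attempt): boolean {
      return failed.has(keyOf(a));
    },
    // Returns true once the same page state has been observed `stuckThreshold` times in a row.
    observe(pageState: string): boolean {
      sameStateCount = pageState === lastState ? sameStateCount + 1 : 1;
      lastState = pageState;
      return sameStateCount >= stuckThreshold;
    },
  };
}
```

In a real loop the page state would more likely be a hash of the compressed DOM than the raw string, and a true result from observe would add a warning to the next LLM prompt.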

Agent Observability & Trust (v2.1)

| Feature | Description |
| --- | --- |
| Stage Rail | Live phase indicator: Navigating → Reading → Acting → Verifying |
| Streamed Partial Results | Extracted data streams into a Results panel as it's gathered |
| Self-Correction Visibility | Surfaces "that didn't work — trying a different approach" on failures/stuck pages |
| "What Changed" Summary | Post-run recap of actions taken, pages visited, and key facts |
| Per-Action Confidence Gate | Before irreversible clicks: Approve / I'll do it / Skip, with the step's risk + confidence |
| Run History + Re-run | Past runs persisted to ~/.surfagent/runs.json; one-click re-run |

Context & Cost Management

| Feature | Description |
| --- | --- |
| Context Window Management | Tiered memory with a Task Ledger that keeps long runs within the token budget |
| TurboQuant Compression | ~70% fewer tokens per page state |
| Differential Encoding | Only sends changed elements between steps |
| Session Logging | Every action logged with timestamps, metrics, error traces |

UI / UX

| Feature | Description |
| --- | --- |
| Resizable Panels | Drag divider between chat and browser |
| New Chat | One-click to start fresh |
| Settings Panel | Manage saved credentials (add, edit, delete) |
| Task Persistence | Messages saved to localStorage across sessions |
| Click Feedback | Purple ripple on screenshot clicks |
| Sound Notifications | Tone on task complete/fail |
| Export Session | Download chat as JSON |
| Editable URL Bar | Type a URL and navigate directly |
| Keyboard Shortcuts | Escape to hand back control |
| Reconnection Toast | Notifies when WebSocket recovers |
| Command Palette | ⌘/Ctrl+K to run any action or macro (keyboard-first) |
| Instruction History | ↑/↓ in the task input recalls previous instructions |
| Soft-Undo | "Undo" toast restores a cleared chat |
| Offline / Disconnect Banner | Always-visible recovery when the network or server drops |
| Accessibility | Focus-trapped dialogs, skip-to-content, reduced-motion & reduced-transparency aware |
| Stall Signal | "Still working…" when a step runs long, so silence never reads as a freeze |

How It Works

Agent Loop

User instruction
  → LLM plans 1-3 actions
    → Validate actions (skip malformed ones)
      → Check for blockers (CAPTCHA, login, payment)
        → Execute in browser with 60s timeout
          → Capture page state (DOM + accessibility tree)
            → Compress state (~70% token reduction)
              → Feed back to LLM
                → Repeat until done or max steps
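
The loop above can be sketched as follows. This is a simplified illustration, not the code in backend/agent.js: the Planner and Executor types, runAgent, and withTimeout are hypothetical names, and blocker checks and state compression are left out for brevity.

```typescript
// Hypothetical shapes; the real agent's action and state types are richer.
interface Action {
  type: string;
  target?: string;
}
interface StepResult {
  pageState: string;
  done: boolean;
}
type Planner = (instruction: string, pageState: string) => Promise<Action[]>;
type Executor = (action: Action) => Promise<StepResult>;

// Rejects if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Minimal validation: an action needs a non-empty type.
function isWellFormed(action: Action): boolean {
  return typeof action.type === "string" && action.type.length > 0;
}

async function runAgent(
  instruction: string,
  plan: Planner,
  execute: Executor,
  maxSteps = 20,
  timeoutMs = 60000,
): Promise<{ steps: number; done: boolean }> {
  let pageState = "";
  for (let step = 1; step <= maxSteps; step++) {
    // The LLM plans up to three actions; malformed ones are skipped.
    const planned = await plan(instruction, pageState);
    const actions = planned.slice(0, 3).filter(isWellFormed);
    for (const action of actions) {
      const result = await withTimeout(execute(action), timeoutMs);
      pageState = result.pageState; // fed back to the planner on the next step
      if (result.done) return { steps: step, done: true };
    }
  }
  return { steps: maxSteps, done: false };
}
```

The planner and executor are injected as functions so the loop can be exercised with stubs, without an LLM or a browser.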

Element Selection Strategy

The agent has several element-targeting strategies and rotates to the next one when an attempt fails:

1. Accessibility tree  →  [button] "Apply Now"       (most reliable)
2. Visible text match  →  getByText("Sign In")
3. Role + name         →  getByRole("link", { name: "Jobs" })
4. Aria-label          →  [aria-label="Search"]
5. Placeholder         →  getByPlaceholder("Search")
6. CSS selector        →  button.submit              (last resort)
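
One way to read "rotating on failure" is an ordered fallback, sketched below. The Strategy interface and locateWithFallback are hypothetical names, and the sketch assumes each strategy either yields an element, yields null, or throws; the real rotation logic may differ.

```typescript
// A strategy is a named lookup that resolves to an element or null.
interface Strategy<T> {
  name: string;
  locate: () => Promise<T | null>;
}

// Tries each strategy in order and returns the first hit, along with which strategy found it.
async function locateWithFallback<T>(
  strategies: Strategy<T>[],
): Promise<{ strategy: string; element: T } | null> {
  for (const { name, locate } of strategies) {
    try {
      const element = await locate();
      if (element !== null) return { strategy: name, element };
    } catch {
      // A throwing strategy counts as a miss; fall through to the next one.
    }
  }
  return null;
}
```

With Playwright, each locate function would wrap a locator call such as page.getByRole("link", { name: "Jobs" }) and resolve to null when nothing matches.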

Context Window Management

The agent is designed to stay within a fixed token budget, even on tasks of 50+ steps:

┌──────────────────────────────────────────┐
│           90k Token Budget               │
├──────────────────────────────────────────┤
│  System Prompt        ~4k   (fixed)      │
│  Task Ledger          ~1-3k (structured) │
│  Recent Messages      ~8-12k (full text) │
│  Current Page State   ~1-3k (compressed) │
│  Extra Context        ~500  (warnings)   │
│  Response Reserve     ~4k   (for LLM)    │
└──────────────────────────────────────────┘

When the conversation exceeds the budget, old messages are parsed into a structured Task Ledger (pages visited, actions taken, outcomes) and then dropped, so the LLM still has a record of what happened earlier.
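
A minimal sketch of that trimming step is shown below. It assumes a crude four-characters-per-token estimate and a ledger entry that is just a truncated summary string; fitToBudget, Message, and LedgerEntry are hypothetical names, and the real ledger is structured by pages, actions, and outcomes.

```typescript
interface Message {
  role: "user" | "assistant";
  text: string;
}
interface LedgerEntry {
  summary: string;
}

// Rough heuristic (assumption): about 4 characters per token. Not the project's tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Returns new arrays; the inputs are not mutated.
function fitToBudget(
  messages: Message[],
  ledger: LedgerEntry[],
  budgetTokens: number,
): { messages: Message[]; ledger: LedgerEntry[] } {
  const kept = messages.slice();
  const newLedger = ledger.slice();
  let total = kept.reduce((sum, m) => sum + estimateTokens(m.text), 0);
  // Fold the oldest messages into the ledger until the window fits; always keep the latest message.
  while (total > budgetTokens && kept.length > 1) {
    const oldest = kept.shift() as Message;
    total -= estimateTokens(oldest.text);
    newLedger.push({ summary: `${oldest.role}: ${oldest.text.slice(0, 80)}` });
  }
  return { messages: kept, ledger: newLedger };
}
```

Calling fitToBudget before each LLM request keeps the recent-messages tier bounded while the ledger grows slowly.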

Token Compression (TurboQuant-Inspired)

Inspired by Google's TurboQuant — keep the signal, drop the noise:

| Technique | Example | Savings |
| --- | --- | --- |
| Element shorthand | <button role="button" type="submit" aria-label="Sign In">Sign In</button> → btn "Sign In" [submit] | 73% |
| JSON key compression | title, url, pageType → t, u, pt | 15% |
| Deduplication | 120 elements → ~80 unique | 33% |
| Differential encoding | Full list → only new/changed + unchanged: 45 | 50% |

Result: ~70% fewer tokens per step for page-state payloads, which cuts the input-token cost of those payloads by about the same proportion.
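
Three of these techniques are simple enough to sketch in a few lines. The code below is illustrative only: PageElement, toShorthand, dedupe, and diffEncode are hypothetical names, and apart from btn (taken from the table) the tag abbreviations are made up for the example.

```typescript
interface PageElement {
  tag: string;
  text: string;
  type?: string;
}

// Only "btn" comes from the table above; "inp" and "lnk" are illustrative assumptions.
const TAG_SHORTHAND: Record<string, string> = { button: "btn", input: "inp", a: "lnk" };

// Element shorthand: a submit button labelled Sign In becomes: btn "Sign In" [submit]
function toShorthand(el: PageElement): string {
  const tag = TAG_SHORTHAND[el.tag] ?? el.tag;
  return el.type ? `${tag} "${el.text}" [${el.type}]` : `${tag} "${el.text}"`;
}

// Deduplication: identical shorthand lines are kept once, in first-seen order.
function dedupe(lines: string[]): string[] {
  return Array.from(new Set(lines));
}

// Differential encoding: send only lines absent from the previous step, plus a count of the rest.
function diffEncode(
  previous: string[],
  current: string[],
): { added: string[]; unchanged: number } {
  const seen = new Set(previous);
  const added = current.filter((line) => !seen.has(line));
  return { added, unchanged: current.length - added.length };
}
```

This sketch does not report elements that disappeared between steps; a fuller encoding would likely carry a removed list as well.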


Credential Vault

Encrypted local credential storage for auto-login:

  • AES-256-GCM encryption at rest
  • Auto-fill when login page detected — matches by domain
  • Multi-step login support (email-first, then password)
  • Never sent to LLM — credentials go directly to browser
  • Settings panel in UI — add, edit, delete saved logins
  • Save on login — checkbox to save credentials when entering them

Settings (gear icon) → + Add
  Domain: linkedin.com
  Username: your-email
  Password: ••••••••
  [Save Credentials]

Next time the agent hits LinkedIn login → auto-fills from vault → continues task.
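
For readers unfamiliar with AES-256-GCM, the sketch below shows an encrypt/decrypt round trip using Node's built-in crypto module. It is not the code in backend/vault.js: seal, unseal, and the Sealed shape are hypothetical, and deriving the key from the secret with scrypt and a random salt is an assumption about how a secret like VAULT_SECRET could be turned into a 256-bit key.

```typescript
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from "node:crypto";

// Everything needed to decrypt later, base64-encoded for storage.
interface Sealed {
  salt: string;
  iv: string;
  tag: string;
  data: string;
}

// Assumption: the key is derived from the secret with scrypt (32 bytes = 256 bits).
function deriveKey(secret: string, salt: Buffer): Buffer {
  return scryptSync(secret, salt, 32);
}

function seal(plaintext: string, secret: string): Sealed {
  const salt = randomBytes(16);
  const iv = randomBytes(12); // 96-bit nonce, the recommended size for GCM
  const cipher = createCipheriv("aes-256-gcm", deriveKey(secret, salt), iv);
  const data = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  return {
    salt: salt.toString("base64"),
    iv: iv.toString("base64"),
    tag: cipher.getAuthTag().toString("base64"), // authentication tag, checked on decrypt
    data: data.toString("base64"),
  };
}

// Throws if the secret is wrong or the stored data was tampered with.
function unseal(sealed: Sealed, secret: string): string {
  const key = deriveKey(secret, Buffer.from(sealed.salt, "base64"));
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(sealed.iv, "base64"));
  decipher.setAuthTag(Buffer.from(sealed.tag, "base64"));
  const plain = Buffer.concat([
    decipher.update(Buffer.from(sealed.data, "base64")),
    decipher.final(),
  ]);
  return plain.toString("utf8");
}
```

Because GCM is authenticated, a wrong secret fails loudly at decipher.final() instead of returning garbage, which is why changing VAULT_SECRET after saving credentials would make the existing vault unreadable.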


Architecture

SurfAgent/
├── cli.js                 Root orchestrator for npx usage
├── CLAUDE.md              Project guide + UI/UX governance for AI agents
├── docs/UX_SPEC.md        Canonical UX specification (v2.1)
├── backend/
│   ├── server.js          WebSocket server — routes messages, manages lifecycle
│   ├── agent.js           LLM agent loop — planning, execution, recovery
│   ├── browser.js         Playwright — 17 actions, a11y tree, blocker detection
│   ├── config.js          Global config (paths to ~/.surfagent)
│   ├── logger.js          Session logging (stores in ~/.surfagent/logs)
│   ├── vault.js           AES-256-GCM encrypted vault (stores in ~/.surfagent)
│   └── .env               API keys + vault secret (gitignored)
│
├── frontend/
│   ├── src/app/
│   │   ├── page.tsx       Main layout (consumes AgentContext)
│   │   ├── layout.tsx     Root layout with AgentProvider
│   │   └── globals.css    Animations, ripple effects
│   ├── src/context/
│   │   └── AgentContext.tsx Global state management & WebSocket logic
│   ├── src/components/
│   │   ├── ChatPanel.tsx  Chat, steps, credentials, examples, copy, export
│   │   ├── BrowserView.tsx Interactive browser with click/scroll/type
│   │   └── CommandPalette.tsx ⌘/Ctrl+K command palette
│   ├── src/hooks/
│   │   ├── useWebSocket.ts Auto-reconnecting WebSocket with heartbeat
│   │   ├── useFocusTrap.ts Modal focus trap + restore
│   │   └── useOnlineStatus.ts Network reachability
│   ├── src/lib/
│   │   └── telemetry.ts   Pluggable interaction telemetry
│   └── tailwind.config.ts Calm Intelligence design system

Design System

Calm Intelligence v2 — designed for AI agent interfaces. The full microinteraction, motion, state-machine, and accessibility spec lives in docs/UX_SPEC.md.

| Token | Value | Usage |
| --- | --- | --- |
| bg | #0B0F14 | Primary background |
| surface | #111827 | Panels, cards |
| elevated | #1F2937 | Inputs, hover states |
| accent | #4F46E5 | Primary actions, focus |
| success | #22C55E | Completed, connected |
| warning | #F59E0B | Paused, Take Control |
| error | #EF4444 | Failed, Stop |

Typography: Inter for UI, JetBrains Mono for logs/code. Animations: 150-200ms, functional only.


Tech Stack

| Layer | Technology |
| --- | --- |
| Frontend | Next.js 14, React 18, Tailwind CSS, TypeScript |
| Backend | Node.js (ES Modules) |
| Browser | Playwright (Chromium, headless and streamed to the UI; HEADLESS=false for a window) |
| LLM | OpenAI GPT-4o · Anthropic Claude · DeepSeek (OpenAI-compatible) |
| Communication | WebSocket with 30s heartbeat |
| Streaming | JPEG screenshots at 2 fps |
| Encryption | AES-256-GCM (Node.js crypto) |

Documentation

| Doc | What's inside |
| --- | --- |
| docs/UX_SPEC.md | Canonical UX spec: microinteractions, motion, state machines, accessibility |
| design.txt | "Calm Intelligence" design philosophy |
| GEMINI.md | Strict engineering and UX standards |
| CLAUDE.md | Guide for AI coding agents working in this repo |
| prd.txt | Original product requirements (historical) |

Security

  • Credentials never reach the LLM. Saved logins are AES-256-GCM encrypted at rest (~/.surfagent/vault.enc) and injected directly into the browser — never sent to the model or written to logs.
  • Change the defaults before exposing SurfAgent beyond localhost: set a strong WS_AUTH_TOKEN and VAULT_SECRET in backend/.env.
  • backend/.env (your API keys) is gitignored — never commit it.
  • The browser uses automation-detection mitigations but is not a hardened sandbox; run it on machines and accounts you trust.

Contributing

Contributions are welcome!

  1. Read GEMINI.md (engineering standards) and docs/UX_SPEC.md (UX standards) before making UI changes.
  2. Make sure cd frontend && npm run build passes with zero warnings or errors.
  3. Keep changes surgical and well-scoped, and never commit secrets.

License

MIT © Vikas


Built with Playwright, Next.js, and LLM APIs.
If the agent understands how websites work, it can operate on any site.
