Protocol

WebSocket message envelope, 13-action discriminated union, error codes, and the protocolVersion handshake between extension and backend

Scope

All extension ↔ backend traffic goes over one WebSocket. Every message is validated on both sides against a single Zod schema in @zapvol/common/schemas/browser-bridge.ts. Client and server agree on one integer protocolVersion; mismatches are rejected at handshake time so in-flight messages never use stale shapes.

The schema is the API. Backend, extension, and the agent tool’s inputSchema all import from it.

Handshake

Every connection opens with a hello → ack / reject round trip. The extension does not send any other message until it receives ack.

sequenceDiagram
    participant Ext as Extension
    participant BE as Backend
    Note over Ext,BE: WS connection open
    Ext->>BE: hello (protocolVersion, clientVersion, pairingToken)
    alt version matches and pairing token valid
        BE-->>Ext: ack (protocolVersion, serverVersion)
        Note over Ext,BE: request / response / event flow freely
    else version or token mismatch
        BE-->>Ext: reject (requiredMinProtocolVersion, error)
        BE->>BE: close ws 4001 or 4002
    end

A reject is terminal. The extension marks the client as fatal and stops reconnecting, surfacing the error in the popup so the user can update the extension or re-pair.
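
The extension-side consequences of the handshake can be sketched as a small state reducer. This is a minimal sketch under stated assumptions, not the extension's actual code: ClientState, onHandshakeReply, canSend, and shouldReconnect are hypothetical names, and the reply types are hand-written stand-ins for the shared Zod schema.

```typescript
// Hand-written stand-ins for the two handshake replies described above.
type HandshakeReply =
  | { type: "ack"; protocolVersion: number; serverVersion: string }
  | {
      type: "reject";
      requiredMinProtocolVersion: number;
      error: { code: string; message: string };
    };

// Hypothetical client states; the real extension may model this differently.
type ClientState =
  | { status: "awaiting_ack" }
  | { status: "ready"; serverVersion: string }
  | { status: "fatal"; reason: string };

function onHandshakeReply(reply: HandshakeReply): ClientState {
  if (reply.type === "ack") {
    return { status: "ready", serverVersion: reply.serverVersion };
  }
  // A reject is terminal: keep the message so the popup can show it.
  return {
    status: "fatal",
    reason:
      reply.error.message +
      " (server requires protocolVersion >= " +
      reply.requiredMinProtocolVersion +
      ")",
  };
}

// Only hello may be sent before ack arrives.
function canSend(state: ClientState, messageType: string): boolean {
  if (state.status === "ready") return true;
  return state.status === "awaiting_ack" && messageType === "hello";
}

// A fatal client never reconnects; the user must update or re-pair.
function shouldReconnect(state: ClientState): boolean {
  return state.status !== "fatal";
}
```

Keeping the terminal decision in one pure function makes it easy to check that a reject can never lead back into a reconnect loop.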

Message Envelope

Top-level discriminated union, every variant tagged by type:

| type | Direction | Purpose |
| --- | --- | --- |
| hello | ext → server | Handshake with protocolVersion, clientVersion, pairingToken, caps |
| ack | server → ext | Handshake accepted; carries serverVersion |
| reject | server → ext | Handshake refused; carries requiredMinProtocolVersion + error |
| request | server → ext | Agent action; carries unique id + action discriminated union |
| response | ext → server | Response to a request; carries id and either result or error |
| event | ext → server | Unsolicited: session_started, session_ended, domain_blocked, global_stop, … |

request IDs are UUIDv4 strings generated on the backend; the extension never invents IDs. Every pending request is tracked in the pool with an action-appropriate timeout.
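
A pending-request pool along these lines could look as follows. This is a sketch only: RequestPool and its methods are hypothetical names, the id generator is injected to keep the example deterministic (a backend would pass a UUIDv4 generator), and the per-action timeout values are supplied by the caller because the text does not list them.

```typescript
// Stand-in for the response variant of the envelope.
type ResponseMessage = {
  type: "response";
  id: string;
  result?: unknown;
  error?: { code: string; message: string };
};

interface PendingRequest {
  action: string;
  deadline: number; // epoch ms after which the request counts as timed out
}

class RequestPool {
  private pending = new Map<string, PendingRequest>();
  private newId: () => string;
  private timeoutFor: (action: string) => number;

  // newId: id generator (UUIDv4 in a real backend).
  // timeoutFor: returns the action-appropriate timeout in milliseconds.
  constructor(newId: () => string, timeoutFor: (action: string) => number) {
    this.newId = newId;
    this.timeoutFor = timeoutFor;
  }

  // Registers a request and returns the id to put on the wire.
  track(action: string, now: number): string {
    const id = this.newId();
    this.pending.set(id, { action, deadline: now + this.timeoutFor(action) });
    return id;
  }

  // Returns false for unknown or already-settled ids.
  settle(response: ResponseMessage): boolean {
    return this.pending.delete(response.id);
  }

  // Drops every request whose deadline has passed and returns their ids,
  // so the caller can fail each one with code "timeout".
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.pending.forEach((entry, id) => {
      if (now >= entry.deadline) expired.push(id);
    });
    expired.forEach((id) => this.pending.delete(id));
    return expired;
  }
}
```

Passing the clock in as an argument, rather than arming a timer per request, is a choice made for the sketch so that timeouts can be exercised without waiting.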

Action Schema (aligned with Browser Use)

Single browser tool, one action per call. Each action is a discriminated variant with its own params.

| Action | Params | Returns |
| --- | --- | --- |
| navigate | url, optional tabId | { ok: true } |
| click | exactly one of { selector } or { uid }, optional tabId | { ok: true } |
| type | exactly one of { selector } or { uid }, text, optional tabId | { ok: true } |
| hover | exactly one of { selector } or { uid }, optional tabId | { ok: true } |
| press_key | key (Enter, Tab, Escape, arrows, …) | { ok: true } |
| scroll | direction (up / down), optional amount | { ok: true } |
| screenshot | optional fullPage | { dataUrl } base64 JPEG (q=80) |
| extract | optional selector (omit → full page) | { text, markdown, elements, obstacle? }; see below |
| evaluate | expression (JS body, wrapped in an arrow IIFE; use return), optional tabId | { type, value? } or { type, description } or { type, truncated, preview } |
| wait_for | exactly one of { selector } or { uid }, optional timeoutMs (≤60s) | { ok: true } |
| get_tabs | none | Array<{ tabId, url, title, domain }> (blocklist-filtered) |
| open_tab | url, optional focus (default false) | { tabId, windowId, domain }; auto-creates a session; rejected with domain_blocked if URL is on the blocklist |
| close_tab | tabId | { ok: true } |

Rules:

  • One action per call. Multi-step flows = multiple tool calls. Keeps retry / cancel semantics simple.
  • Sessions auto-create on first action. No Approve step; the extension’s only pre-action check is the domain blocklist. Blocklist hits return domain_blocked (terminal) — see Session Model → How sessions are created.
  • open_tab routes into the BUA window by default — a dedicated minimized, unfocused Chrome window so the user’s main browser isn’t disturbed. Set focus: true only when the user explicitly asked to see the tab. See Session Model — BUA Window.
  • Same domain can host multiple sessions (one on the user’s tab, one on a BUA tab). Sessions are keyed by tabId, so non-navigate actions should pass the tabId returned by open_tab. Omitting tabId lets the extension pick, preferring BUA-owned sessions — see Session Identity.
  • Agent must validate input via the same schema before sending. The backend tool calls browserActionSchema.parse (via zodSchema) inside AI SDK; the extension re-validates incoming messages with the full envelope schema.

Element addressing — uid vs selector

click / type / hover / wait_for accept exactly one of selector (CSS string) or uid (a short cache key like "e7"). The schema’s refinement rejects both-or-neither at validation time.
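
The both-or-neither rule can be illustrated without Zod. The following is a dependency-free sketch of the same check; checkTarget and TargetCheck are hypothetical names, and mapping the failure to invalid_action follows the error table's "schema validation failure" entry.

```typescript
type ElementTarget = { selector?: string; uid?: string };

type TargetCheck =
  | { ok: true; by: "selector" | "uid" }
  | { ok: false; error: { code: "invalid_action"; message: string } };

// Mirrors the "exactly one of selector or uid" refinement.
function checkTarget(target: ElementTarget): TargetCheck {
  const hasSelector = typeof target.selector === "string";
  const hasUid = typeof target.uid === "string";
  if (hasSelector && hasUid) {
    return {
      ok: false,
      error: { code: "invalid_action", message: "pass selector or uid, not both" },
    };
  }
  if (!hasSelector && !hasUid) {
    return {
      ok: false,
      error: { code: "invalid_action", message: "pass one of selector or uid" },
    };
  }
  return { ok: true, by: hasSelector ? "selector" : "uid" };
}
```

For example, checkTarget({ uid: "e7" }) passes and reports by: "uid", while checkTarget({ selector: "#go", uid: "e7" }) and checkTarget({}) both fail.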

extract populates result.elements with one entry per interactive accessibility-tree node:

{ uid: string; role: string; name?: string; value?: string; visible?: boolean }

uids are generated as e0, e1, … in document order and stored in the extension’s per-tab cache. They are the preferred way for the agent to reference elements in the very next action:

  • Short (2–4 chars) and token-cheap compared to a CSS selector.
  • Survive class-name churn — the cache entry resolves to a CDP backendNodeId, not a selector string.
  • Round-tripped through DOM.scrollIntoViewIfNeeded + DOM.getBoxModel (click/hover) or DOM.focus (type) — no page querySelector indirection.

selector remains valid for elements that aren’t in elements (non-interactive anchors, custom roles) or when no extract has been performed yet.

Invalidation: the uid cache is cleared when Page.frameNavigated fires (main-frame navigation) and when navigate is issued explicitly. SPA route changes that don’t reload the document leave the cache intact. After invalidation, stale uids return error.code = "element_stale"; the agent must call extract again.
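
The cache lifecycle described above can be sketched as a per-tab map from uid to backendNodeId. UidCache, its method names, and isMainFrameNavigation are hypothetical; the sketch assumes the caller supplies node ids in document order, and it relies on the CDP convention that the main frame's Frame object carries no parentId.

```typescript
type UidLookup =
  | { ok: true; backendNodeId: number }
  | { ok: false; error: { code: "element_stale"; message: string } };

class UidCache {
  // tabId -> (uid -> CDP backendNodeId)
  private byTab = new Map<number, Map<string, number>>();

  // Called after extract: assigns e0, e1, ... in the order given and
  // replaces whatever the tab had cached before.
  assign(tabId: number, backendNodeIds: number[]): string[] {
    const entries = new Map<string, number>();
    const uids = backendNodeIds.map((nodeId, index) => {
      const uid = "e" + index;
      entries.set(uid, nodeId);
      return uid;
    });
    this.byTab.set(tabId, entries);
    return uids;
  }

  // Unknown uids (no extract yet, or cache cleared) come back as element_stale.
  resolve(tabId: number, uid: string): UidLookup {
    const entries = this.byTab.get(tabId);
    const nodeId = entries ? entries.get(uid) : undefined;
    if (nodeId === undefined) {
      return {
        ok: false,
        error: { code: "element_stale", message: "unknown uid " + uid + "; call extract again" },
      };
    }
    return { ok: true, backendNodeId: nodeId };
  }

  // Call on main-frame navigation and on an explicit navigate action.
  invalidate(tabId: number): void {
    this.byTab.delete(tabId);
  }
}

// Page.frameNavigated also fires for subframes; only the frame without a
// parentId is the main frame.
function isMainFrameNavigation(frame: { id: string; parentId?: string }): boolean {
  return frame.parentId === undefined;
}
```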

extract result shape

{
  text: string;        // raw innerText (truncated at 50KB, for direct quotes)
  markdown: string;    // DOM → Markdown, nav/footer/aside/fixed overlays stripped (≤30KB)
  elements: BrowserAxElement[];   // interactive AX-tree nodes with uids (≤200)
  obstacle?: BrowserObstacle;     // page-level interruption signal, see below
}

markdown is the highest-signal part and should be the default for LLM consumption. text is kept for verbatim quoting. html is not returned (it was in v1) — selector authoring is superseded by the uid scheme.

Obstacle detection

The extension runs detectObstacle (apps/bua/src/lib/obstacle-detect.ts — pure function, no DOM deps) against inputs gathered during extract, and when signals cross a threshold populates obstacle:

{
  type: "auth_wall" | "captcha" | "access_denied";
  confidence: "low" | "high";
  message: string;  // short human-readable reason
}

Inputs the classifier sees (gathered in the same Runtime.evaluate round-trip as markdown):

  • url — lowercased location.href
  • title — lowercased document.title
  • markdown — scanned up to the first 4K chars (obstacle markers are above the fold)
  • elements — the same AX-tree list the agent sees; password-input detection works on role="textbox" / role="searchbox" whose accessible name matches /\b(password|passcode|pwd)\b/i, so modern design systems (Material, Radix, Headless UI) that wrap inputs in custom components are caught — a raw input[type=password] query would miss them.
  • captchaFrame — page-side boolean, true if a known CAPTCHA iframe is present (Cloudflare challenges.cloudflare.com, hCaptcha, reCAPTCHA, Turnstile).

Classification is a five-layer stack, cheapest first; first layer to fire short-circuits:

| Layer | Fires on | Verdict |
| --- | --- | --- |
| L1 | captcha iframe, captcha URL, or CAPTCHA prompt phrase in markdown | captcha, high |
| L2 | 403 / forbidden / rate-limit / blocked in title, URL, or markdown | access_denied, high |
| L3 | password role="textbox" present AND (auth URL OR auth title) | auth_wall, high |
| L4 | ≥ 2 distinct auth signal types (from: password field, sign-in btn, sign-up link, forgot-pwd link, OAuth provider btn, auth URL, auth title) | auth_wall, high |
| L4 (cont.) | … OR exactly 1 such signal on a short page (< 400 markdown chars) | auth_wall, low |
| none | no layer fires | obstacle field absent |

Rule of thumb: bias is toward false negatives rather than false positives — a spurious terminal signal makes the agent abort on legitimate pages (navbar “Log in” link ≠ login wall). Tighten the layer thresholds before loosening.

Agent contract: confidence: "high" is terminal — stop this plan and surface to the caller. confidence: "low" is advisory; one alternative attempt is reasonable before giving up.
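
The layer ordering and the agent contract can be sketched as two pure functions. This is not the real detectObstacle: classifyObstacle, ObstacleSignals, and agentShouldStop are hypothetical names, the message strings are invented for illustration, and each boolean signal is assumed to be computed elsewhere (the regexes and URL patterns behind them are outside this sketch).

```typescript
type Obstacle = {
  type: "auth_wall" | "captcha" | "access_denied";
  confidence: "low" | "high";
  message: string;
};

// Pre-computed signals; how each one is derived is not shown here.
interface ObstacleSignals {
  captcha: boolean; // L1: captcha iframe, URL, or prompt phrase
  accessDenied: boolean; // L2: 403 / forbidden / rate-limit / blocked
  passwordField: boolean;
  signInButton: boolean;
  signUpLink: boolean;
  forgotPasswordLink: boolean;
  oauthButton: boolean;
  authUrl: boolean;
  authTitle: boolean;
  markdownLength: number;
}

// First layer to fire wins; undefined means the obstacle field stays absent.
function classifyObstacle(s: ObstacleSignals): Obstacle | undefined {
  if (s.captcha) {
    return { type: "captcha", confidence: "high", message: "CAPTCHA challenge detected" };
  }
  if (s.accessDenied) {
    return { type: "access_denied", confidence: "high", message: "access denied or rate limited" };
  }
  if (s.passwordField && (s.authUrl || s.authTitle)) {
    return { type: "auth_wall", confidence: "high", message: "login form on an auth page" };
  }
  const authSignals = [
    s.passwordField, s.signInButton, s.signUpLink, s.forgotPasswordLink,
    s.oauthButton, s.authUrl, s.authTitle,
  ].filter((present) => present).length;
  if (authSignals >= 2) {
    return { type: "auth_wall", confidence: "high", message: "multiple auth signals" };
  }
  if (authSignals === 1 && s.markdownLength < 400) {
    return { type: "auth_wall", confidence: "low", message: "single auth signal on a short page" };
  }
  return undefined;
}

// Agent contract: high is terminal; low allows one alternative attempt.
function agentShouldStop(obstacle: Obstacle | undefined, alternativesTried: number): boolean {
  if (!obstacle) return false;
  if (obstacle.confidence === "high") return true;
  return alternativesTried >= 1;
}
```

With these rules, a long page whose only auth signal is a navbar sign-in link produces no obstacle, which is the false-positive case the rule of thumb is meant to avoid.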

evaluate semantics

evaluate runs Runtime.evaluate with returnByValue: true, awaitPromise: true, userGesture: true, and a 10s CDP-side timeout. The expression field is wrapped in (() => { ... })() — the agent must use return to emit a value. Bare expressions (document.title without return) resolve to undefined.
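
The effect of the wrapper is easiest to see on the string that actually gets evaluated. A minimal sketch, with wrapExpression as a hypothetical helper name and the exact whitespace of the real wrapper unknown:

```typescript
// Wraps an agent-supplied JS body as the body of an immediately invoked
// arrow function, as described above.
function wrapExpression(body: string): string {
  return "(() => { " + body + " })()";
}

// With return, the IIFE yields the value of the expression.
const withReturn = wrapExpression("return document.title");

// Without return, the statement runs but the IIFE yields undefined.
const bareExpression = wrapExpression("document.title");
```

Because the body sits inside braces, multi-statement bodies work as long as the last meaningful line uses return.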

Result shapes:

  • { type: "string"|"number"|"boolean"|…, value } — JSON-serialisable result.
  • { type, description } — non-serialisable (DOM node, function, Map). description is CDP’s string form.
  • { type, truncated: true, preview } — serialised value exceeded 8KB; preview ends with …[truncated N chars].
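
One way the three shapes could be derived from a CDP result is sketched below. Several details are assumptions rather than documented behaviour: shapeResult and RemoteObjectLike are hypothetical names, "8KB" is taken to mean 8192 characters of the JSON-serialised value, N in the truncation suffix is taken to be the number of characters dropped, and a result that carries a description but no value is treated as non-serialisable.

```typescript
type EvaluateResult =
  | { type: string; value?: unknown }
  | { type: string; description: string }
  | { type: string; truncated: true; preview: string };

// Subset of the fields a CDP RemoteObject can carry.
interface RemoteObjectLike {
  type: string;
  value?: unknown;
  description?: string;
}

const MAX_SERIALISED_CHARS = 8 * 1024; // assumption: limit counted in characters

function shapeResult(remote: RemoteObjectLike): EvaluateResult {
  if (remote.value === undefined) {
    // No by-value payload: either a plain undefined result, or an object
    // that only has CDP's string description.
    return remote.description === undefined
      ? { type: remote.type }
      : { type: remote.type, description: remote.description };
  }
  const json = JSON.stringify(remote.value);
  if (json.length > MAX_SERIALISED_CHARS) {
    const dropped = json.length - MAX_SERIALISED_CHARS;
    return {
      type: remote.type,
      truncated: true,
      preview: json.slice(0, MAX_SERIALISED_CHARS) + "…[truncated " + dropped + " chars]",
    };
  }
  return { type: remote.type, value: remote.value };
}
```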

If the expression throws, the response is { error: { code: "invalid_action", message } } — this is a caller-side bug (typo, wrong API), not a platform failure. The agent should inspect the message and fix the expression rather than retry blindly.

Dialog auto-dismiss: during an evaluate call, Page.javascriptDialogOpening events on the target tab are auto-handled (accept: true for alert / beforeunload; accept: false for confirm / prompt). This prevents CDP deadlock when an agent-written expression triggers a dialog. Outside evaluate, page-origin dialogs are left alone so the user can see them.
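
The accept/dismiss rule reduces to a small decision function. In this sketch dialogResponse is a hypothetical name; the returned object is meant to be used as the params of CDP's Page.handleJavaScriptDialog, and the evaluateInFlight flag is assumed to be tracked by the caller.

```typescript
// Dialog types reported by Page.javascriptDialogOpening.
type DialogType = "alert" | "confirm" | "prompt" | "beforeunload";

// Returns params for Page.handleJavaScriptDialog, or null when the dialog
// should be left on screen for the user.
function dialogResponse(
  dialogType: DialogType,
  evaluateInFlight: boolean,
): { accept: boolean } | null {
  if (!evaluateInFlight) return null; // page-origin dialog outside evaluate
  // alert and beforeunload are accepted; confirm and prompt are dismissed.
  const accept = dialogType === "alert" || dialogType === "beforeunload";
  return { accept };
}
```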

Anti-automation mitigations

At chrome.debugger.attach time the extension installs a single Page.addScriptToEvaluateOnNewDocument that overrides navigator.webdriver to false on every new document load. This is enough to bypass the common automation check that Cloudflare-fronted, ticketing, and banking sites apply. We deliberately do not spoof other signals (chrome.runtime, plugins, canvas) — each added property is also a new fingerprint surface, and CDP isTrusted input events already give us the highest-signal authenticity. Sites that still block are treated as out-of-scope rather than escalating the arms race.

Errors

response.error and reject.error both carry the same shape:

{ code: BrowserBridgeErrorCode, message: string }

Codes:

| Code | When |
| --- | --- |
| domain_blocked | Target domain is on the user’s blocklist (sessionManager.isDomainBlocked) |
| session_not_found | No active session on the resolved target, or multi-session ambiguity with no explicit tabId |
| tab_not_found | Specified tabId closed or never existed |
| element_not_found | selector or live uid matched 0 elements for click / type / hover / extract |
| element_stale | Supplied uid is not in the extension’s cache (page navigated or no extract yet) |
| timeout | wait_for timeout OR round-trip exceeded the action-specific pool timeout |
| debugger_attach_failed | chrome.debugger.attach rejected (DevTools open on same tab, another debugger active) |
| invalid_action | Schema validation failure OR JS error thrown inside evaluate |
| internal_error | Catch-all: send failure, extension crash, unexpected state |

Every code means: do not resend the exact same request unchanged. Per-code guidance:

  • domain_blocked is terminal. Do not retry on a different tab or with a different action; the block is intentional, so surface it to the user.
  • session_not_found usually means the agent omitted tabId in a multi-candidate situation; pass tabId explicitly.
  • element_not_found means the selector/uid is wrong for the current DOM; re-extract and pick a different target.
  • element_stale means the uid was once valid but its page no longer is; call extract before the next interaction.
  • invalid_action on evaluate carries the JS error message; the agent should inspect it and fix the expression, not retry.
  • timeout is the one code that may be retried, and only if the root cause is transient (e.g., a slow page load; pair the retry with wait_for).
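
The per-code guidance can be captured as a lookup table on the agent side. This is a sketch: the NextStep labels and nextStepFor are hypothetical, and the three codes the text gives no specific advice for (tab_not_found, debugger_attach_failed, internal_error) fall back to the general "do not repeat the same request" rule.

```typescript
type BrowserBridgeErrorCode =
  | "domain_blocked"
  | "session_not_found"
  | "tab_not_found"
  | "element_not_found"
  | "element_stale"
  | "timeout"
  | "debugger_attach_failed"
  | "invalid_action"
  | "internal_error";

// Hypothetical labels for what the agent does next.
type NextStep =
  | "surface_to_user"
  | "pass_explicit_tab_id"
  | "re_extract"
  | "fix_request"
  | "retry_if_transient"
  | "do_not_repeat";

// Record<...> makes the compiler flag any code that is left unmapped.
const NEXT_STEP: Record<BrowserBridgeErrorCode, NextStep> = {
  domain_blocked: "surface_to_user",
  session_not_found: "pass_explicit_tab_id",
  tab_not_found: "do_not_repeat",
  element_not_found: "re_extract",
  element_stale: "re_extract",
  timeout: "retry_if_transient",
  debugger_attach_failed: "do_not_repeat",
  invalid_action: "fix_request",
  internal_error: "do_not_repeat",
};

function nextStepFor(error: { code: BrowserBridgeErrorCode; message: string }): NextStep {
  return NEXT_STEP[error.code];
}
```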

Events

Unsolicited event messages flow extension → backend. Backend logs them today; future iterations may forward them to the agent’s context (e.g., inject a session_ended notification mid-run).

| Event | Fields |
| --- | --- |
| session_started | domain, tabId, startedAt |
| session_ended | domain, tabId, actionCount, reason (one of seven; see Session Model → How sessions end) |
| domain_blocked | domain, attemptedAction, optional tabId; fires when an action targets a blocklisted domain |
| tab_closed | tabId; for auditing (paired with session_ended when relevant) |
| global_stop | endedCount; popup “Stop all” was invoked, and the server should treat in-flight requests as cancelled |

The discriminated union is extensible: new event variants (e.g., user_task_trigger for page-initiated agent runs) slot in without a protocol bump if they’re purely additive and non-load-bearing.

Versioning discipline

BROWSER_BRIDGE_PROTOCOL_VERSION is currently 4. Bump it when a message variant changes shape, a field becomes required, or semantics shift in a way the extension cannot detect at runtime.

Change log:

  • v1 → v2: extract result { text, html } → { text, markdown, elements, obstacle? }; click / type / wait_for accept { uid } as an alternative to { selector }; new actions hover and evaluate; new error code element_stale. The server accepts both shapes for one release.
  • v2 → v3: sessionEndReason enum adds system_idle and screen_locked so audit logs can distinguish user-presence revocations (previously both mapped to expired). A v2 server would Zod-reject a v3 session_ended event carrying either new reason — the handshake bump makes the incompatibility fail fast instead of silently dropping audit records.
  • v3 → v4: UX-first internal refactor. The consent layer is removed: per-domain allowlist → domain blocklist, TTL eliminated (sessions end on tab-close / idle / detach / stop / blocklist). Error codes scope_violation and session_expired are removed in favor of domain_blocked. Event permission_denied is replaced by domain_blocked; new global_stop event fires when the popup’s Stop-all is invoked. session_started drops expiresAt (no TTL) and carries startedAt instead. session_ended adds tabId + actionCount. sessionEndReason drops "expired" and adds "domain_blocked" + "global_stop". Hard cut — internal-only deployment, coordinated server upgrade, no v3 back-compat shim. A v3 server rejects v4 extension handshake; a v4 server rejects v3 extension handshake.

When bumping:

  1. Increment the constant in @zapvol/common/src/schemas/browser-bridge.ts.
  2. For internal-only deployments (like v3 → v4), hard cuts are acceptable: coordinate extension + server deploy, requiredMinProtocolVersion equals the new value, no grace window.
  3. For public deployments, keep the previous version working on the server for at least one release — accept both hello.protocolVersion values and branch internally, or maintain two parallel schemas. Set the server’s requiredMinProtocolVersion to the previous-supported version so older extensions stay paired; after one release, raise it to the new value.

Pure additions — a new action variant, a new event type, a new optional field — do not require a version bump so long as old extensions receiving them can safely ignore.
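
The two rollout styles differ only in the range of versions the server accepts at handshake time. A minimal sketch, assuming the server accepts any hello.protocolVersion between requiredMinProtocolVersion and its own version inclusive; VersionPolicy and checkHelloVersion are hypothetical names, and version 5 is used purely as an illustration of a future bump.

```typescript
const BROWSER_BRIDGE_PROTOCOL_VERSION = 4;

interface VersionPolicy {
  current: number; // the server's own protocolVersion
  requiredMin: number; // oldest extension version still accepted
}

type VersionCheck =
  | { accepted: true }
  | { accepted: false; requiredMinProtocolVersion: number };

// Versions outside [requiredMin, current] are rejected, which covers both
// an outdated extension and an extension newer than the server.
function checkHelloVersion(helloVersion: number, policy: VersionPolicy): VersionCheck {
  const supported =
    helloVersion >= policy.requiredMin && helloVersion <= policy.current;
  return supported
    ? { accepted: true }
    : { accepted: false, requiredMinProtocolVersion: policy.requiredMin };
}

// Hard cut (internal-only, like v3 -> v4): only the new version is accepted.
const hardCut: VersionPolicy = {
  current: BROWSER_BRIDGE_PROTOCOL_VERSION,
  requiredMin: BROWSER_BRIDGE_PROTOCOL_VERSION,
};

// Grace window for a hypothetical public bump to 5: v4 extensions stay
// paired for one release, after which requiredMin is raised to 5.
const graceWindow: VersionPolicy = { current: 5, requiredMin: 4 };
```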
