Protocol

Scope

All extension ↔ backend traffic goes over one WebSocket. Every message is validated on both sides against a single Zod schema in @zapvol/common/schemas/browser-bridge.ts. Client and server agree on one integer protocolVersion; mismatches are rejected at handshake time so in-flight messages never use stale shapes.

The schema is the API. Backend, extension, and the agent tool’s inputSchema all import from it.

Handshake

Every connection opens with a hello → ack / reject round trip. The extension does not send any other message until it receives ack.

sequenceDiagram
    participant Ext as Extension
    participant BE as Backend
    Note over Ext,BE: WS connection open
    Ext->>BE: hello (protocolVersion, clientVersion, pairingToken)
    alt version matches and pairing token valid
        BE-->>Ext: ack (protocolVersion, serverVersion)
        Note over Ext,BE: request / response / event flow freely
    else version or token mismatch
        BE-->>Ext: reject (requiredMinProtocolVersion, error)
        BE->>BE: close ws 4001 or 4002
    end

A reject is terminal. The extension marks the client as fatal and stops reconnecting, surfacing the error in the popup so the user can update the extension or re-pair.
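The client-side rules above (send nothing before ack, treat reject as terminal) can be sketched as a small state gate. This is a minimal illustration, not the extension's actual code: `HandshakeGate`, its method names, and the string type of `serverVersion` are assumptions.

```typescript
// Hypothetical client-side gate; names are illustrative, not the extension's real API.
type HandshakeReply =
  | { type: "ack"; protocolVersion: number; serverVersion: string }
  | {
      type: "reject";
      requiredMinProtocolVersion: number;
      error: { code: string; message: string };
    };

type GateState = "awaiting_ack" | "ready" | "fatal";

class HandshakeGate {
  state: GateState = "awaiting_ack";
  fatalReason: string | undefined = undefined;

  // Feed the first server message received after `hello` was sent.
  onReply(reply: HandshakeReply): void {
    if (this.state !== "awaiting_ack") return; // only the first reply decides the outcome
    if (reply.type === "ack") {
      this.state = "ready";
    } else {
      this.state = "fatal";
      this.fatalReason = reply.error.message; // would be surfaced in the popup
    }
  }

  // Nothing except `hello` may be sent before `ack`.
  canSend(): boolean {
    return this.state === "ready";
  }

  // A reject is terminal: no reconnect attempts afterwards.
  shouldReconnect(): boolean {
    return this.state !== "fatal";
  }

  // Assumption: a new socket after a non-fatal drop restarts the handshake.
  onSocketClosed(): void {
    if (this.state !== "fatal") this.state = "awaiting_ack";
  }
}
```

Keeping the fatal state sticky is what prevents a reconnect loop against a server that will keep rejecting the same version or token.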

Message Envelope

Top-level discriminated union, every variant tagged by type:

`type`	Direction	Purpose
`hello`	ext → server	Handshake with `protocolVersion`, `clientVersion`, `pairingToken`, caps
`ack`	server → ext	Handshake accepted; carries `serverVersion`
`reject`	server → ext	Handshake refused; carries `requiredMinProtocolVersion` + `error`
`request`	server → ext	Agent action; carries unique `id` + `action` discriminated union
`response`	ext → server	Response to a `request`; carries `id` and either `result` or `error`
`event`	ext → server	Unsolicited: `session_started`, `session_ended`, `domain_blocked`, `global_stop`, …

request IDs are UUIDv4 strings generated on the backend; the extension never invents IDs. Every pending request is tracked in the pool with an action-appropriate timeout.
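A minimal sketch of the pending-request pool described above, assuming a hypothetical `PendingPool` class. The timeout values are placeholders (the real per-action values are not given here), and the ID generator is injected to stand in for a UUIDv4 source on the backend.

```typescript
// Hypothetical sketch: class name, method names, and timeout values are assumptions.
type ActionName = "navigate" | "click" | "screenshot" | "wait_for";

// Placeholder per-action timeouts in ms; the real values are not specified in this document.
const TIMEOUT_MS: Record<ActionName, number> = {
  navigate: 30000,
  click: 10000,
  screenshot: 15000,
  wait_for: 60000,
};

interface PendingEntry {
  action: ActionName;
  deadline: number; // absolute time in ms
}

class PendingPool {
  private entries = new Map<string, PendingEntry>();
  private newId: () => string;

  // `newId` stands in for a UUIDv4 generator on the backend.
  constructor(newId: () => string) {
    this.newId = newId;
  }

  // Register a request and return the backend-generated id to put on the wire.
  track(action: ActionName, now: number): string {
    const id = this.newId();
    this.entries.set(id, { action, deadline: now + TIMEOUT_MS[action] });
    return id;
  }

  // Called when a `response` arrives; returns false for unknown or already-removed ids.
  settle(id: string): boolean {
    return this.entries.delete(id);
  }

  // Remove and return ids whose deadline has passed; callers would fail them with `timeout`.
  sweep(now: number): string[] {
    const expired: string[] = [];
    this.entries.forEach((entry, id) => {
      if (now >= entry.deadline) expired.push(id);
    });
    expired.forEach((id) => this.entries.delete(id));
    return expired;
  }
}
```

Passing the clock in as `now` keeps the pool deterministic in tests; a real implementation would more likely arm a timer per request.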

Action Schema (aligned with Browser Use)

Single browser tool, one action per call. Each action is a discriminated variant with its own params.

Action	Params	Returns
`navigate`	`url`, optional `tabId`	`{ ok: true }`
`click`	exactly one of `{ selector }` or `{ uid }`, optional `tabId`	`{ ok: true }`
`type`	exactly one of `{ selector }` or `{ uid }`, `text`, optional `tabId`	`{ ok: true }`
`hover`	exactly one of `{ selector }` or `{ uid }`, optional `tabId`	`{ ok: true }`
`press_key`	`key` (`Enter`, `Tab`, `Escape`, arrows, …)	`{ ok: true }`
`scroll`	`direction` (`up` / `down`), optional `amount`	`{ ok: true }`
`screenshot`	optional `fullPage`	`{ dataUrl }` base64 JPEG (q=80)
`extract`	optional `selector` (omit → full page)	`{ text, markdown, elements, obstacle? }` — see below
`evaluate`	`expression` (JS body, wrapped in arrow-IIFE — use `return`), optional `tabId`	`{ type, value? }` \| `{ type, description }` \| `{ type, truncated, preview }`
`wait_for`	exactly one of `{ selector }` or `{ uid }`, optional `timeoutMs` (≤60s)	`{ ok: true }`
`get_tabs`	none	`Array<{ tabId, url, title, domain }>` — blocklist-filtered
`open_tab`	`url`, optional `focus` (default `false`)	`{ tabId, windowId, domain }` — auto-creates a session; rejected with `domain_blocked` if URL is on the blocklist
`close_tab`	`tabId`	`{ ok: true }`

Rules:

One action per call. Multi-step flows = multiple tool calls. Keeps retry / cancel semantics simple.
Sessions auto-create on first action. No Approve step; the extension’s only pre-action check is the domain blocklist. Blocklist hits return domain_blocked (terminal) — see Session Model → How sessions are created.
open_tab routes into the BUA window by default — a dedicated minimized, unfocused Chrome window so the user’s main browser isn’t disturbed. Set focus: true only when the user explicitly asked to see the tab. See Session Model — BUA Window.
Same domain can host multiple sessions (one on the user’s tab, one on a BUA tab). Sessions are keyed by tabId, so non-navigate actions should pass the tabId returned by open_tab. Omitting tabId lets the extension pick, preferring BUA-owned sessions — see Session Identity.
The agent must validate input against the same schema before sending. The backend tool calls browserActionSchema.parse (via zodSchema) inside the AI SDK; the extension re-validates incoming messages with the full envelope schema.

Element addressing — `uid` vs `selector`

click / type / hover / wait_for accept exactly one of selector (CSS string) or uid (a short cache key like "e7"). The schema’s refinement rejects both-or-neither at validation time.
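The both-or-neither rule can be illustrated with a plain function. This is a dependency-free stand-in for the schema refinement, not the refinement itself; `parseTarget` and the `Target` type are hypothetical names.

```typescript
// Hypothetical stand-in for the schema refinement; the real check lives in the Zod schema.
interface TargetParams {
  selector?: string;
  uid?: string;
}

type Target =
  | { kind: "selector"; selector: string }
  | { kind: "uid"; uid: string };

// Returns the single target, or throws when both or neither are supplied.
function parseTarget(params: TargetParams): Target {
  const hasSelector = params.selector !== undefined;
  const hasUid = params.uid !== undefined;
  if (hasSelector === hasUid) {
    throw new Error("exactly one of `selector` or `uid` is required");
  }
  return hasSelector
    ? { kind: "selector", selector: params.selector as string }
    : { kind: "uid", uid: params.uid as string };
}
```

For example, `parseTarget({ uid: "e7" })` yields a `uid` target, while `parseTarget({})` and `parseTarget({ selector: "#go", uid: "e7" })` both throw.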

extract populates result.elements with one entry per interactive accessibility-tree node:

{ uid: string; role: string; name?: string; value?: string; visible?: boolean }

uids are generated as e0, e1, … in document order and stored in the extension’s per-tab cache. They are the preferred way for the agent to reference elements in the very next action:

Short (2–4 chars) and token-cheap compared to a CSS selector.
Survive class-name churn — the cache entry resolves to a CDP backendNodeId, not a selector string.
Round-tripped through DOM.scrollIntoViewIfNeeded + DOM.getBoxModel (click/hover) or DOM.focus (type) — no page querySelector indirection.

selector remains valid for elements that aren’t in elements (non-interactive anchors, custom roles) or when no extract has been performed yet.

Invalidation: the uid cache is cleared when Page.frameNavigated fires (main-frame navigation) and when navigate is issued explicitly. SPA route changes that don’t reload the document leave the cache intact. After invalidation, stale uids return error.code = "element_stale"; the agent must call extract again.
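The uid lifecycle above (assign in document order, resolve to a backendNodeId, clear on navigation) might look roughly like the following. `UidCache` and its methods are hypothetical, and replacing the whole per-tab map on each `extract` is an assumption.

```typescript
// Hypothetical per-tab cache; class and method names are assumptions, not the extension's code.
type ResolvedUid =
  | { ok: true; backendNodeId: number }
  | { ok: false; error: { code: "element_stale"; message: string } };

class UidCache {
  private byTab = new Map<number, Map<string, number>>();

  // After `extract`: assign e0, e1, … in document order.
  // Assumption: a fresh extract replaces the tab's previous entries.
  populate(tabId: number, backendNodeIds: number[]): string[] {
    const entries = new Map<string, number>();
    const uids = backendNodeIds.map((nodeId, index) => {
      const uid = `e${index}`;
      entries.set(uid, nodeId);
      return uid;
    });
    this.byTab.set(tabId, entries);
    return uids;
  }

  // Called on main-frame Page.frameNavigated and on an explicit `navigate`.
  invalidate(tabId: number): void {
    this.byTab.delete(tabId);
  }

  // Unknown uids map to `element_stale`, telling the agent to extract again.
  resolve(tabId: number, uid: string): ResolvedUid {
    const nodeId = this.byTab.get(tabId)?.get(uid);
    if (nodeId === undefined) {
      return {
        ok: false,
        error: { code: "element_stale", message: `uid ${uid} is not cached; call extract again` },
      };
    }
    return { ok: true, backendNodeId: nodeId };
  }
}
```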

`extract` result shape

{
  text: string;        // raw innerText (truncated at 50KB, for direct quotes)
  markdown: string;    // DOM → Markdown, nav/footer/aside/fixed overlays stripped (≤30KB)
  elements: BrowserAxElement[];   // interactive AX-tree nodes with uids (≤200)
  obstacle?: BrowserObstacle;     // page-level interruption signal, see below
}

markdown is the highest-signal part and should be the default for LLM consumption. text is kept for verbatim quoting. html is not returned (it was in v1) — selector authoring is superseded by the uid scheme.

Obstacle detection

The extension runs detectObstacle (apps/bua/src/lib/obstacle-detect.ts — pure function, no DOM deps) against inputs gathered during extract, and when signals cross a threshold populates obstacle:

{
  type: "auth_wall" | "captcha" | "access_denied";
  confidence: "low" | "high";
  message: string; // short human-readable reason
}

Inputs the classifier sees (gathered in the same Runtime.evaluate round-trip as markdown):

url — lowercased location.href
title — lowercased document.title
markdown — scanned up to the first 4K chars (obstacle markers are above the fold)
elements — the same AX-tree list the agent sees; password-input detection works on role="textbox" / role="searchbox" whose accessible name matches /\b(password|passcode|pwd)\b/i, so modern design systems (Material, Radix, Headless UI) that wrap inputs in custom components are caught — a raw input[type=password] query would miss them.
captchaFrame — page-side boolean, true if a known CAPTCHA iframe is present (Cloudflare challenges.cloudflare.com, hCaptcha, reCAPTCHA, Turnstile).

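Classification is a layered stack, cheapest first; the first layer to fire short-circuits. The table lists four layers (L1 to L4), with L4 split into a high-confidence and a low-confidence branch: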

Layer	Fires on	Verdict
L1	captcha iframe, captcha URL, or CAPTCHA prompt phrase in markdown	`captcha`, `high`
L2	403 / forbidden / rate-limit / blocked in title, URL, or markdown	`access_denied`, `high`
L3	password `role="textbox"` present AND (auth URL OR auth title)	`auth_wall`, `high`
L4	≥ 2 distinct auth signal types (from: password field, sign-in btn, sign-up link, forgot-pwd link, OAuth provider btn, auth URL, auth title)	`auth_wall`, `high`
L4	exactly 1 such signal on a short page (< 400 markdown chars)	`auth_wall`, `low`
none	no layer fires	`obstacle` field absent

Rule of thumb: bias is toward false negatives rather than false positives — a spurious terminal signal makes the agent abort on legitimate pages (navbar “Log in” link ≠ login wall). Tighten the layer thresholds before loosening.

When no layer fires, the obstacle field is omitted from the extract result.

Agent contract: confidence: "high" is terminal — stop this plan and surface to the caller. confidence: "low" is advisory; one alternative attempt is reasonable before giving up.
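The layer table and the agent contract can be condensed into a short sketch. It is a simplification, not the real detectObstacle: it assumes the signals (regex and URL matching) have already been extracted into booleans, and `classify`, `agentShouldStop`, the signal names, and the messages are all illustrative.

```typescript
// Simplified sketch, not the real detectObstacle: signal extraction is assumed done,
// and the messages are illustrative.
type AuthSignal =
  | "password_field" | "sign_in_button" | "sign_up_link" | "forgot_password_link"
  | "oauth_button" | "auth_url" | "auth_title";

interface ObstacleSignals {
  captcha: boolean;       // iframe, URL, or prompt phrase
  accessDenied: boolean;  // 403 / forbidden / rate-limit / blocked
  auth: AuthSignal[];     // auth signal types seen on the page
  markdownLength: number;
}

interface Obstacle {
  type: "auth_wall" | "captcha" | "access_denied";
  confidence: "low" | "high";
  message: string;
}

function classify(s: ObstacleSignals): Obstacle | undefined {
  const auth = new Set(s.auth); // count distinct signal types only
  // L1
  if (s.captcha) return { type: "captcha", confidence: "high", message: "CAPTCHA challenge detected" };
  // L2
  if (s.accessDenied) return { type: "access_denied", confidence: "high", message: "access denied or rate limited" };
  // L3
  if (auth.has("password_field") && (auth.has("auth_url") || auth.has("auth_title"))) {
    return { type: "auth_wall", confidence: "high", message: "password field on an auth page" };
  }
  // L4, high branch
  if (auth.size >= 2) return { type: "auth_wall", confidence: "high", message: "multiple auth signals" };
  // L4, low branch
  if (auth.size === 1 && s.markdownLength < 400) {
    return { type: "auth_wall", confidence: "low", message: "single auth signal on a short page" };
  }
  return undefined; // no layer fired: `obstacle` stays absent
}

// Agent contract: "high" is terminal, "low" allows one alternative attempt.
function agentShouldStop(obstacle: Obstacle | undefined, alternativesTried: number): boolean {
  if (!obstacle) return false;
  return obstacle.confidence === "high" || alternativesTried >= 1;
}
```

A long page whose only signal is a navbar sign-in button produces no obstacle here, which matches the stated bias toward false negatives.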

`evaluate` semantics

evaluate runs Runtime.evaluate with returnByValue: true, awaitPromise: true, userGesture: true, and a 10s CDP-side timeout. The expression field is wrapped in (() => { ... })() — the agent must use return to emit a value. Bare expressions (document.title without return) resolve to undefined.

Result shapes:

{ type: "string"|"number"|"boolean"|…, value } — JSON-serialisable result.
{ type, description } — non-serialisable (DOM node, function, Map). description is CDP’s string form.
{ type, truncated: true, preview } — serialised value exceeded 8KB; preview ends with …[truncated N chars].
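To make the wrapping and truncation rules concrete, here is a small sketch. `wrapExpression` and `shapeValue` are hypothetical helpers; reading "8KB" as 8192 characters and using the first 8192 characters as the preview are both assumptions, and the non-serialisable `{ type, description }` shape is left out.

```typescript
// Hypothetical helpers; the 8KB limit is read as 8192 characters and the preview
// length is an assumption, since neither is pinned down in this document.
const MAX_SERIALISED_CHARS = 8 * 1024;

// Wrap the agent's JS body in an arrow-IIFE so that `return` emits the value.
function wrapExpression(body: string): string {
  return `(() => { ${body} })()`;
}

type EvaluateResult =
  | { type: string; value?: unknown }
  | { type: string; truncated: true; preview: string };

// Shape a JSON-serialisable value into the normal or the truncated result form.
function shapeValue(value: unknown): EvaluateResult {
  const type = typeof value;
  const serialised = JSON.stringify(value);
  // JSON.stringify(undefined) yields undefined at runtime; treat it as "nothing to truncate".
  if (serialised === undefined || serialised.length <= MAX_SERIALISED_CHARS) {
    return { type, value };
  }
  const dropped = serialised.length - MAX_SERIALISED_CHARS;
  return {
    type,
    truncated: true,
    preview: `${serialised.slice(0, MAX_SERIALISED_CHARS)}…[truncated ${dropped} chars]`,
  };
}
```

With this wrapper, the body `return 1 + 1` evaluates to 2, while the bare body `1 + 1` evaluates to undefined because the arrow function's block has no return statement.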

If the expression throws, the response is { error: { code: "invalid_action", message } } — this is a caller-side bug (typo, wrong API), not a platform failure. The agent should inspect the message and fix the expression rather than retry blindly.

Dialog auto-dismiss: during an evaluate call, Page.javascriptDialogOpening events on the target tab are auto-handled (accept: true for alert / beforeunload; accept: false for confirm / prompt). This prevents CDP deadlock when an agent-written expression triggers a dialog. Outside evaluate, page-origin dialogs are left alone so the user can see them.
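The dialog policy reduces to one decision function. `dialogDecision` is a hypothetical helper; the dialog type names follow CDP's Page.javascriptDialogOpening event, and the returned boolean is the value that would be passed as `accept` to Page.handleJavaScriptDialog.

```typescript
// Hypothetical helper mirroring the policy above.
type DialogType = "alert" | "confirm" | "prompt" | "beforeunload";

// Returns the `accept` value to use, or undefined when the dialog should be left alone.
function dialogDecision(evaluateInFlight: boolean, dialogType: DialogType): boolean | undefined {
  if (!evaluateInFlight) return undefined; // page-origin dialog: leave it visible to the user
  return dialogType === "alert" || dialogType === "beforeunload";
}
```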

Anti-automation mitigations

At chrome.debugger.attach time the extension installs a single Page.addScriptToEvaluateOnNewDocument that overrides navigator.webdriver to false on every new document load. This is enough to bypass the common automation check that Cloudflare-fronted, ticketing, and banking sites apply. We deliberately do not spoof other signals (chrome.runtime, plugins, canvas) — each added property is also a new fingerprint surface, and CDP isTrusted input events already give us the highest-signal authenticity. Sites that still block are treated as out-of-scope rather than escalating the arms race.
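A sketch of what the install step could look like. The script body is an assumption (the text above only states that navigator.webdriver is overridden to false), and `sendCommand` is an injected stand-in for a wrapper around chrome.debugger.sendCommand on the attached tab.

```typescript
// Assumed script body; only the effect (navigator.webdriver reads as false) comes from the document.
const WEBDRIVER_OVERRIDE_SOURCE =
  "Object.defineProperty(navigator, 'webdriver', { get: () => false, configurable: true });";

// Injected so the sketch stays testable outside a browser extension.
type SendCommand = (method: string, params: Record<string, unknown>) => Promise<unknown>;

// Registers the script so it runs on every new document before page scripts.
async function installWebdriverOverride(sendCommand: SendCommand): Promise<void> {
  await sendCommand("Page.addScriptToEvaluateOnNewDocument", {
    source: WEBDRIVER_OVERRIDE_SOURCE,
  });
}
```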

Errors

response.error and reject.error both carry the same shape:

{ code: BrowserBridgeErrorCode, message: string }

Codes:

Code	When
`domain_blocked`	Target domain is on the user’s blocklist (`sessionManager.isDomainBlocked`)
`session_not_found`	No active session on the resolved target, or multi-session ambiguity with no explicit `tabId`
`tab_not_found`	Specified `tabId` closed or never existed
`element_not_found`	`selector` or live `uid` matched 0 elements for `click` / `type` / `hover` / `extract`
`element_stale`	Supplied `uid` is not in the extension’s cache — page navigated or no `extract` yet
`timeout`	`wait_for` timeout OR round-trip exceeded the action-specific pool timeout
`debugger_attach_failed`	`chrome.debugger.attach` rejected (DevTools open on same tab, another debugger active)
`invalid_action`	Schema validation failure OR JS error thrown inside `evaluate`
`internal_error`	Catch-all — send failure, extension crash, unexpected state

Every code means do not retry the exact same request. Per-code guidance:

domain_blocked is terminal: do not retry on a different tab or action (the block is intentional; surface it to the user).
session_not_found usually means the agent omitted tabId in a multi-candidate situation; pass tabId explicitly.
element_not_found means the selector/uid is wrong for the current DOM; re-extract and pick a different target.
element_stale means the uid was once valid but its page no longer is; call extract before the next interaction.
invalid_action on evaluate carries the JS error message; the agent should inspect and fix the expression, not retry.
timeout can be retried only if the root cause is transient (e.g., slow page load; pair the retry with wait_for).
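An agent loop could encode the guidance above as a lookup. The `Recovery` labels and `recoveryFor` are hypothetical; codes without specific guidance fall back to the general rule of not repeating the same request.

```typescript
type BrowserBridgeErrorCode =
  | "domain_blocked" | "session_not_found" | "tab_not_found" | "element_not_found"
  | "element_stale" | "timeout" | "debugger_attach_failed" | "invalid_action" | "internal_error";

// Hypothetical recovery labels; only the mapping follows the guidance above.
type Recovery =
  | "surface_to_user"       // terminal
  | "pass_explicit_tab_id"
  | "re_extract"
  | "fix_request"
  | "retry_if_transient"
  | "do_not_retry_same";    // default for codes without specific guidance

function recoveryFor(code: BrowserBridgeErrorCode): Recovery {
  switch (code) {
    case "domain_blocked":
      return "surface_to_user";
    case "session_not_found":
      return "pass_explicit_tab_id";
    case "element_not_found":
    case "element_stale":
      return "re_extract";
    case "invalid_action":
      return "fix_request";
    case "timeout":
      return "retry_if_transient";
    default:
      return "do_not_retry_same";
  }
}
```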

Events

Unsolicited event messages flow extension → backend. The backend currently only logs them; future iterations may forward them to the agent’s context (e.g., injecting a session_ended notification mid-run).

Event	Fields
`session_started`	`domain`, `tabId`, `startedAt`
`session_ended`	`domain`, `tabId`, `actionCount`, `reason` — one of seven (see Session Model → How sessions end)
`domain_blocked`	`domain`, `attemptedAction`, optional `tabId` — fires when an action targets a blocklisted domain
`tab_closed`	`tabId` — for auditing (paired with `session_ended` when relevant)
`global_stop`	`endedCount` — popup “Stop all” was invoked; server should treat in-flight requests as cancelled

The discriminated union is extensible: new event variants (e.g., user_task_trigger for page-initiated agent runs) slot in without a protocol bump if they’re purely additive and non-load-bearing.
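One way a backend could stay tolerant of additive event variants is sketched below. It assumes unknown event types are dropped rather than treated as errors, which this document does not confirm; `handleBridgeEvent` and the outcome labels are hypothetical.

```typescript
// Event names come from the table above; everything else is an assumption.
const KNOWN_EVENTS = ["session_started", "session_ended", "domain_blocked", "tab_closed", "global_stop"];

type EventOutcome = "log" | "log_and_cancel_in_flight" | "ignore";

function handleBridgeEvent(evt: { type: string }): EventOutcome {
  // Assumption: an additive variant from a newer extension is dropped, not rejected.
  if (KNOWN_EVENTS.indexOf(evt.type) === -1) return "ignore";
  // global_stop: the server should treat in-flight requests as cancelled.
  return evt.type === "global_stop" ? "log_and_cancel_in_flight" : "log";
}
```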

Versioning discipline

BROWSER_BRIDGE_PROTOCOL_VERSION is currently 4. Bump it when a message variant changes shape, a field becomes required, or semantics shift in a way the extension cannot detect at runtime.

Change log:

v1 → v2: extract result { text, html } → { text, markdown, elements, obstacle? }; click / type / wait_for accept { uid } as an alternative to { selector }; new actions hover and evaluate; new error code element_stale. Server-side accepts both shapes for one release.
v2 → v3: sessionEndReason enum adds system_idle and screen_locked so audit logs can distinguish user-presence revocations (previously both mapped to expired). A v2 server would Zod-reject a v3 session_ended event carrying either new reason — the handshake bump makes the incompatibility fail fast instead of silently dropping audit records.
v3 → v4: UX-first internal refactor. The consent layer is removed: per-domain allowlist → domain blocklist, TTL eliminated (sessions end on tab-close / idle / detach / stop / blocklist). Error codes scope_violation and session_expired are removed in favor of domain_blocked. Event permission_denied is replaced by domain_blocked; new global_stop event fires when the popup’s Stop-all is invoked. session_started drops expiresAt (no TTL) and carries startedAt instead. session_ended adds tabId + actionCount. sessionEndReason drops "expired" and adds "domain_blocked" + "global_stop". Hard cut — internal-only deployment, coordinated server upgrade, no v3 back-compat shim. A v3 server rejects v4 extension handshake; a v4 server rejects v3 extension handshake.

When bumping:

Increment the constant in @zapvol/common/src/schemas/browser-bridge.ts.
For internal-only deployments (like v3 → v4), hard cuts are acceptable: coordinate extension + server deploy, requiredMinProtocolVersion equals the new value, no grace window.
For public deployments, keep the previous version working on the server for at least one release — accept both hello.protocolVersion values and branch internally, or maintain two parallel schemas. Set the server’s requiredMinProtocolVersion to the previous-supported version so older extensions stay paired; after one release, raise it to the new value.
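The two rollout modes differ only in where requiredMinProtocolVersion sits relative to the current version. A minimal sketch, assuming a hypothetical `checkVersion` helper and an illustrative future bump to 5:

```typescript
// The constant name comes from the document; the policy type and helper are assumed.
const BROWSER_BRIDGE_PROTOCOL_VERSION = 4;

interface VersionPolicy {
  current: number;
  requiredMin: number; // equals `current` for a hard cut, the previous version during a grace release
}

type VersionVerdict =
  | { accepted: true; protocolVersion: number }
  | { accepted: false; requiredMinProtocolVersion: number };

// Accept any hello version inside [requiredMin, current]; reject everything else.
function checkVersion(helloVersion: number, policy: VersionPolicy): VersionVerdict {
  const supported = helloVersion >= policy.requiredMin && helloVersion <= policy.current;
  return supported
    ? { accepted: true, protocolVersion: helloVersion }
    : { accepted: false, requiredMinProtocolVersion: policy.requiredMin };
}

// Hard cut (as in v3 → v4): only the new version is accepted.
const hardCut: VersionPolicy = {
  current: BROWSER_BRIDGE_PROTOCOL_VERSION,
  requiredMin: BROWSER_BRIDGE_PROTOCOL_VERSION,
};

// Grace release for a hypothetical future bump: the previous version stays paired for one release.
const graceRelease: VersionPolicy = { current: 5, requiredMin: 4 };
```

On an accepted hello, the server would then branch internally on the returned protocolVersion when more than one version is in range.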

Pure additions — a new action variant, a new event type, a new optional field — do not require a version bump so long as old extensions receiving them can safely ignore.