Protocol
WebSocket message envelope, 13-action discriminated union, error codes, and the protocolVersion handshake between extension and backend
Scope
All extension ↔ backend traffic goes over one WebSocket. Every message is validated on both sides against a single Zod
schema in @zapvol/common/schemas/browser-bridge.ts. Client and server agree on one integer protocolVersion;
mismatches are rejected at handshake time so in-flight messages never use stale shapes.
The schema is the API. Backend, extension, and the agent tool’s inputSchema all import from it.
Handshake
Every connection opens with a hello → ack / reject round trip. The extension does not send any other message until
it receives ack.
A reject is terminal. The extension marks the client as fatal and stops reconnecting, surfacing the error in the
popup so the user can update the extension or re-pair.
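The round trip can be pictured with a minimal, dependency-free sketch. Field names follow the envelope table in this document. The caps type, the constant values, the serverVersion string, and the error code used for a version mismatch are assumptions made for illustration, and pairing-token validation is left out.

```typescript
// Hypothetical message shapes; only the field names come from this document.
type Hello = {
  type: "hello";
  protocolVersion: number;
  clientVersion: string;
  pairingToken: string;
  caps: string[]; // assumed to be a list of capability names
};
type Ack = { type: "ack"; serverVersion: string };
type Reject = {
  type: "reject";
  requiredMinProtocolVersion: number;
  error: { code: string; message: string };
};

// Assumed values for the sketch, not read from the real schema module.
const CURRENT_PROTOCOL_VERSION = 4;
const REQUIRED_MIN_PROTOCOL_VERSION = 4;
const SERVER_VERSION = "0.0.0-example";

// Server side: answer a hello with ack or reject.
function handleHello(hello: Hello): Ack | Reject {
  const supported =
    hello.protocolVersion >= REQUIRED_MIN_PROTOCOL_VERSION &&
    hello.protocolVersion <= CURRENT_PROTOCOL_VERSION;
  if (!supported) {
    return {
      type: "reject",
      requiredMinProtocolVersion: REQUIRED_MIN_PROTOCOL_VERSION,
      // This document does not say which code a version mismatch carries;
      // "invalid_action" is a placeholder.
      error: {
        code: "invalid_action",
        message: `unsupported protocolVersion ${hello.protocolVersion}`,
      },
    };
  }
  return { type: "ack", serverVersion: SERVER_VERSION };
}

// Extension side: only ack unlocks further traffic; reject is terminal.
type ClientState = "handshaking" | "ready" | "fatal";
function nextClientState(reply: Ack | Reject): ClientState {
  return reply.type === "ack" ? "ready" : "fatal";
}
```

A client in the fatal state would stop its reconnect loop and show reply.error in the popup.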
Message Envelope
Top-level discriminated union, every variant tagged by type:
| type | Direction | Purpose |
|---|---|---|
| hello | ext → server | Handshake with protocolVersion, clientVersion, pairingToken, caps |
| ack | server → ext | Handshake accepted; carries serverVersion |
| reject | server → ext | Handshake refused; carries requiredMinProtocolVersion + error |
| request | server → ext | Agent action; carries unique id + action discriminated union |
| response | ext → server | Response to a request; carries id and either result or error |
| event | ext → server | Unsolicited: session_started, session_ended, domain_blocked, global_stop, … |
request IDs are UUIDv4 strings generated on the backend; the extension never invents IDs. Every pending request is
tracked in the pool with an action-appropriate timeout.
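A rough sketch of how such a pool might correlate responses with pending requests. The class and method names are hypothetical. To stay dependency-free and deterministic, the id is passed in by the caller (the real backend generates a UUIDv4) and deadlines are checked by an explicit expire call instead of timers.

```typescript
type BridgeError = { code: string; message: string };
type Outcome = { result: unknown } | { error: BridgeError };
type ResponseMessage = { type: "response"; id: string } & Outcome;

// Hypothetical pool: maps request id to its deadline and completion callback.
class PendingPool {
  private pending = new Map<string, { deadline: number; settle: (o: Outcome) => void }>();

  // Register a request the backend has just sent.
  track(id: string, timeoutMs: number, now: number, settle: (o: Outcome) => void): void {
    this.pending.set(id, { deadline: now + timeoutMs, settle });
  }

  // Route a response to its waiting request.
  // Returns false for unknown or already-settled ids.
  resolve(msg: ResponseMessage): boolean {
    const entry = this.pending.get(msg.id);
    if (!entry) return false;
    this.pending.delete(msg.id);
    entry.settle("error" in msg ? { error: msg.error } : { result: msg.result });
    return true;
  }

  // Fail every request whose deadline has passed; returns how many expired.
  expire(now: number): number {
    let expired = 0;
    this.pending.forEach((entry, id) => {
      if (entry.deadline <= now) {
        this.pending.delete(id);
        entry.settle({ error: { code: "timeout", message: `request ${id} timed out` } });
        expired += 1;
      }
    });
    return expired;
  }
}
```

Passing timeoutMs per request is what lets each action carry its own timeout, for example a longer one for wait_for than for click.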
Action Schema (aligned with Browser Use)
Single browser tool, one action per call. Each action is a discriminated variant with its own params.
| Action | Params | Returns |
|---|---|---|
| navigate | url, optional tabId | { ok: true } |
| click | exactly one of { selector } or { uid }, optional tabId | { ok: true } |
| type | exactly one of { selector } or { uid }, text, optional tabId | { ok: true } |
| hover | exactly one of { selector } or { uid }, optional tabId | { ok: true } |
| press_key | key (Enter, Tab, Escape, arrows, …) | { ok: true } |
| scroll | direction (up / down), optional amount | { ok: true } |
| screenshot | optional fullPage | { dataUrl }, base64 JPEG (q=80) |
| extract | optional selector (omit → full page) | { text, markdown, elements, obstacle? }; see below |
| evaluate | expression (JS body, wrapped in an arrow IIFE; use return), optional tabId | { type, value? }, { type, description }, or { type, truncated, preview } |
| wait_for | exactly one of { selector } or { uid }, optional timeoutMs (≤60s) | { ok: true } |
| get_tabs | none | Array<{ tabId, url, title, domain }>, blocklist-filtered |
| open_tab | url, optional focus (default false) | { tabId, windowId, domain }; auto-creates a session; rejected with domain_blocked if URL is on the blocklist |
| close_tab | tabId | { ok: true } |
Rules:
- One action per call. Multi-step flows = multiple tool calls. Keeps retry / cancel semantics simple.
- Sessions auto-create on first action. No Approve step; the extension’s only pre-action check is the domain
blocklist. Blocklist hits return domain_blocked (terminal); see Session Model → How sessions are created.
- open_tab routes into the BUA window by default: a dedicated minimized, unfocused Chrome window, so the user’s main
browser isn’t disturbed. Set focus: true only when the user explicitly asked to see the tab. See Session Model → BUA Window.
- Same domain can host multiple sessions (one on the user’s tab, one on a BUA tab). Sessions are keyed by tabId, so
non-navigate actions should pass the tabId returned by open_tab. Omitting tabId lets the extension pick, preferring
BUA-owned sessions; see Session Identity.
- Agent must validate input via the same schema before sending. The backend tool calls browserActionSchema.parse
(via zodSchema) inside AI SDK; the extension re-validates incoming messages with the full envelope schema.
Element addressing — uid vs selector
click / type / hover / wait_for accept exactly one of selector (CSS string) or uid
(a short cache key like "e7"). The schema’s refinement rejects both-or-neither at validation time.
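The both-or-neither rule can be illustrated without a schema library. The real check is a refinement on the Zod schema; the helper below is a hypothetical stand-in that enforces the same constraint by hand.

```typescript
type ElementTarget = { selector: string } | { uid: string };

// Hypothetical helper: accept exactly one of selector or uid, reject otherwise.
function parseTarget(input: { selector?: string; uid?: string }): ElementTarget {
  const hasSelector = typeof input.selector === "string";
  const hasUid = typeof input.uid === "string";
  // Equal flags mean both were supplied or neither was.
  if (hasSelector === hasUid) {
    throw new Error("exactly one of selector or uid is required");
  }
  return hasSelector
    ? { selector: input.selector as string }
    : { uid: input.uid as string };
}
```

Returning a union with a single key means downstream code can branch on which key is present and never has to choose between two competing addresses.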
extract populates result.elements with one entry per interactive accessibility-tree node:
{ uid: string; role: string; name?: string; value?: string; visible?: boolean }
uids are generated as e0, e1, … in document order and stored in the extension’s per-tab cache.
They are the preferred way for the agent to reference elements in the very next action:
- Short (2–4 chars) and token-cheap compared to a CSS selector.
- Survive class-name churn: the cache entry resolves to a CDP backendNodeId, not a selector string.
- Round-tripped through DOM.scrollIntoViewIfNeeded + DOM.getBoxModel (click / hover) or DOM.focus (type), with no
page querySelector indirection.
selector remains valid for elements that aren’t in elements (non-interactive anchors, custom roles)
or when no extract has been performed yet.
Invalidation: the uid cache is cleared when Page.frameNavigated fires (main-frame navigation) and when
navigate is issued explicitly. SPA route changes that don’t reload the document leave the cache intact.
After invalidation, stale uids return error.code = "element_stale"; the agent must call extract again.
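The cache lifecycle described above might look roughly like this. The class and method names are hypothetical, and the CDP event wiring is reduced to a plain invalidate call.

```typescript
type Lookup =
  | { ok: true; backendNodeId: number }
  | { ok: false; code: "element_stale" };

// Hypothetical per-tab cache: uid -> CDP backendNodeId.
class UidCache {
  private byTab = new Map<number, Map<string, number>>();

  // After extract: assign e0, e1, ... in document order and replace the tab's entries.
  store(tabId: number, backendNodeIds: number[]): string[] {
    const entries = new Map<string, number>();
    const uids = backendNodeIds.map((nodeId, index) => {
      const uid = `e${index}`;
      entries.set(uid, nodeId);
      return uid;
    });
    this.byTab.set(tabId, entries);
    return uids;
  }

  // On a main-frame Page.frameNavigated event or an explicit navigate action.
  invalidate(tabId: number): void {
    this.byTab.delete(tabId);
  }

  // Unknown uids, including those from before an invalidation, are stale.
  resolve(tabId: number, uid: string): Lookup {
    const nodeId = this.byTab.get(tabId)?.get(uid);
    return nodeId === undefined
      ? { ok: false, code: "element_stale" }
      : { ok: true, backendNodeId: nodeId };
  }
}
```

Because an SPA route change never calls invalidate in this sketch, uids from before the route change keep resolving, which matches the behaviour described above.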
extract result shape
{
text: string; // raw innerText (truncated at 50KB, for direct quotes)
markdown: string; // DOM → Markdown, nav/footer/aside/fixed overlays stripped (≤30KB)
elements: BrowserAxElement[]; // interactive AX-tree nodes with uids (≤200)
obstacle?: BrowserObstacle; // page-level interruption signal, see below
}
markdown is the highest-signal part and should be the default for LLM consumption. text is kept
for verbatim quoting. html is not returned (it was in v1) — selector authoring is superseded
by the uid scheme.
Obstacle detection
The extension runs detectObstacle (apps/bua/src/lib/obstacle-detect.ts — pure function, no DOM deps)
against inputs gathered during extract, and when signals cross a threshold populates obstacle:
{
type: "auth_wall" | "captcha" | "access_denied";
confidence: "low" | "high";
message: string; // short human-readable reason
}
Inputs the classifier sees (gathered in the same Runtime.evaluate round-trip as markdown):
- url: lowercased location.href
- title: lowercased document.title
- markdown: scanned up to the first 4K chars (obstacle markers are above the fold)
- elements: the same AX-tree list the agent sees; password-input detection works on role="textbox" / role="searchbox" whose accessible name matches /\b(password|passcode|pwd)\b/i, so modern design systems (Material, Radix, Headless UI) that wrap inputs in custom components are caught. A raw input[type=password] query would miss them.
- captchaFrame: page-side boolean, true if a known CAPTCHA iframe is present (Cloudflare challenges.cloudflare.com, hCaptcha, reCAPTCHA, Turnstile).
Classification is a five-layer stack, cheapest first; first layer to fire short-circuits:
| Layer | Fires on | Verdict |
|---|---|---|
| L1 | captcha iframe, captcha URL, or CAPTCHA prompt phrase in markdown | captcha, high |
| L2 | 403 / forbidden / rate-limit / blocked — title, URL, or markdown | access_denied, high |
| L3 | password role="textbox" present AND (auth URL OR auth title) | auth_wall, high |
| L4 | ≥ 2 distinct auth signal types (from: password field, sign-in btn, sign-up link, forgot-pwd link, OAuth provider btn, auth URL, auth title) | auth_wall, high |
| | … OR exactly 1 such signal on a short page (< 400 markdown chars) | auth_wall, low |
| none | no layer fires | obstacle field absent |
Rule of thumb: bias is toward false negatives rather than false positives — a spurious terminal signal makes the agent abort on legitimate pages (navbar “Log in” link ≠ login wall). Tighten the layer thresholds before loosening.
Agent contract: confidence: "high" is terminal — stop this plan and surface to the caller.
confidence: "low" is advisory; one alternative attempt is reasonable before giving up.
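The layer ordering and the agent contract can be sketched together. This is not the real detectObstacle: it assumes the signals have already been reduced to booleans (the phrase lists and URL patterns that produce them are not shown in this document), and every name below is hypothetical.

```typescript
type Obstacle = {
  type: "auth_wall" | "captcha" | "access_denied";
  confidence: "low" | "high";
  message: string;
};

// Hypothetical pre-computed signals; the real classifier derives them from
// url, title, markdown, elements, and captchaFrame.
type Signals = {
  captcha: boolean; // any L1 marker
  accessDenied: boolean; // any L2 marker
  passwordField: boolean;
  signInButton: boolean;
  signUpLink: boolean;
  forgotPasswordLink: boolean;
  oauthButton: boolean;
  authUrl: boolean;
  authTitle: boolean;
  markdownLength: number;
};

// Cheapest layer first; the first layer that fires decides the verdict.
function classify(s: Signals): Obstacle | undefined {
  if (s.captcha) return { type: "captcha", confidence: "high", message: "captcha challenge" };
  if (s.accessDenied) return { type: "access_denied", confidence: "high", message: "access denied" };
  if (s.passwordField && (s.authUrl || s.authTitle)) {
    return { type: "auth_wall", confidence: "high", message: "password field on an auth page" };
  }
  const authSignals = [
    s.passwordField, s.signInButton, s.signUpLink, s.forgotPasswordLink,
    s.oauthButton, s.authUrl, s.authTitle,
  ].filter(Boolean).length;
  if (authSignals >= 2) return { type: "auth_wall", confidence: "high", message: "multiple auth signals" };
  if (authSignals === 1 && s.markdownLength < 400) {
    return { type: "auth_wall", confidence: "low", message: "single auth signal on a short page" };
  }
  return undefined; // no layer fired: obstacle field stays absent
}

// Agent contract: high is terminal, low allows one alternative attempt.
function agentDecision(o: Obstacle | undefined): "continue" | "stop" | "try_one_alternative" {
  if (!o) return "continue";
  return o.confidence === "high" ? "stop" : "try_one_alternative";
}
```

A long article page with a single navbar "Log in" link yields one signal and a long markdown body, so no layer fires and the agent continues.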
evaluate semantics
evaluate runs Runtime.evaluate with returnByValue: true, awaitPromise: true, userGesture: true,
and a 10s CDP-side timeout. The expression field is wrapped in (() => { ... })() — the agent must
use return to emit a value. Bare expressions (document.title without return) resolve to undefined.
Result shapes:
- { type: "string"|"number"|"boolean"|…, value }: JSON-serialisable result.
- { type, description }: non-serialisable (DOM node, function, Map). description is CDP’s string form.
- { type, truncated: true, preview }: serialised value exceeded 8KB; preview ends with …[truncated N chars].
If the expression throws, the response is { error: { code: "invalid_action", message } } — this is a
caller-side bug (typo, wrong API), not a platform failure. The agent should inspect the message and fix
the expression rather than retry blindly.
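A small sketch of the wrapping and truncation rules, to show why return matters. The function names are hypothetical. Measuring the 8KB limit in characters instead of bytes, the preview length, and the newlines around the body (added so a trailing line comment cannot swallow the closing brace) are assumptions of this sketch.

```typescript
// Wrap the agent's expression as the body of an arrow IIFE.
function wrapExpression(expression: string): string {
  return `(() => {\n${expression}\n})()`;
}

// Assumed limit: 8KB counted in characters of the JSON form.
const MAX_SERIALISED_CHARS = 8 * 1024;

type EvaluateResult =
  | { type: string; value?: unknown }
  | { type: string; truncated: true; preview: string };

// Shape a by-value result, replacing oversized values with a preview.
function shapeResult(value: unknown): EvaluateResult {
  // JSON.stringify returns undefined at runtime for undefined and functions.
  const json = JSON.stringify(value) ?? "";
  if (json.length <= MAX_SERIALISED_CHARS) {
    return { type: typeof value, value };
  }
  const dropped = json.length - MAX_SERIALISED_CHARS;
  return {
    type: typeof value,
    truncated: true,
    preview: `${json.slice(0, MAX_SERIALISED_CHARS)}…[truncated ${dropped} chars]`,
  };
}
```

With this wrapping, the expression return document.title yields the title, while the bare expression document.title is evaluated and discarded, so the call resolves to undefined.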
Dialog auto-dismiss: during an evaluate call, Page.javascriptDialogOpening events on the target
tab are auto-handled (accept: true for alert / beforeunload; accept: false for confirm / prompt).
This prevents CDP deadlock when an agent-written expression triggers a dialog. Outside evaluate,
page-origin dialogs are left alone so the user can see them.
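The accept/dismiss policy reduces to a one-line mapping. The helper name is hypothetical; its return value is the shape one would pass as params to CDP's Page.handleJavaScriptDialog.

```typescript
// Dialog kinds reported by Page.javascriptDialogOpening.
type DialogType = "alert" | "confirm" | "prompt" | "beforeunload";

// Accept dialogs that only block (alert, beforeunload);
// dismiss dialogs that ask for a decision (confirm, prompt).
function dialogResponse(type: DialogType): { accept: boolean } {
  return { accept: type === "alert" || type === "beforeunload" };
}
```

Dismissing confirm and prompt keeps an agent-written expression from implicitly approving something on the user's behalf.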
Anti-automation mitigations
At chrome.debugger.attach time the extension installs a single
Page.addScriptToEvaluateOnNewDocument that overrides navigator.webdriver to false on every
new document load. This is enough to bypass the common automation check that Cloudflare-fronted,
ticketing, and banking sites apply. We deliberately do not spoof other signals (chrome.runtime,
plugins, canvas) — each added property is also a new fingerprint surface, and CDP isTrusted input
events already give us the highest-signal authenticity. Sites that still block are treated as
out-of-scope rather than escalating the arms race.
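One plausible form of the injected script is shown below. The exact source the extension installs is not given in this document, so the property definition and the helper name are assumptions; only the CDP method name comes from the text above.

```typescript
// Assumed script source: redefine navigator.webdriver as a getter returning false.
const WEBDRIVER_OVERRIDE_SOURCE =
  "Object.defineProperty(navigator, 'webdriver', { get: () => false, configurable: true });";

// Hypothetical helper: build the CDP command sent once after attach.
function buildInstallCommand(): { method: string; params: { source: string } } {
  return {
    method: "Page.addScriptToEvaluateOnNewDocument",
    params: { source: WEBDRIVER_OVERRIDE_SOURCE },
  };
}

// In the extension this would be dispatched roughly as:
//   const { method, params } = buildInstallCommand();
//   chrome.debugger.sendCommand({ tabId }, method, params);
```

Registering the script once at attach time is enough, because CDP re-runs it in every new document before the page's own scripts execute.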
Errors
response.error and reject.error both carry the same shape:
{ code: BrowserBridgeErrorCode, message: string }
Codes:
| Code | When |
|---|---|
| domain_blocked | Target domain is on the user’s blocklist (sessionManager.isDomainBlocked) |
| session_not_found | No active session on the resolved target, or multi-session ambiguity with no explicit tabId |
| tab_not_found | Specified tabId closed or never existed |
| element_not_found | selector or live uid matched 0 elements for click / type / hover / extract |
| element_stale | Supplied uid is not in the extension’s cache (page navigated or no extract yet) |
| timeout | wait_for timeout OR round-trip exceeded the action-specific pool timeout |
| debugger_attach_failed | chrome.debugger.attach rejected (DevTools open on same tab, another debugger active) |
| invalid_action | Schema validation failure OR JS error thrown inside evaluate |
| internal_error | Catch-all: send failure, extension crash, unexpected state |
Apart from a transient timeout, every code means do not retry the exact same request. domain_blocked is terminal: do not retry on a different
tab or action (the block is intentional, surface to user). session_not_found usually means the agent omitted tabId
in a multi-candidate situation; pass tabId explicitly. element_not_found means the selector/uid is wrong for the
current DOM — re-extract and pick a different target. element_stale means the uid was once valid but its page no
longer is — call extract before the next interaction. invalid_action on evaluate carries the JS error message;
the agent should inspect and fix the expression, not retry. timeout can be retried only if the root cause is
transient (e.g., slow page load — pair with wait_for).
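The recovery guidance above can be condensed into a lookup an agent loop might consult. The NextStep labels and the function name are hypothetical. The text gives no specific guidance for tab_not_found, debugger_attach_failed, or internal_error, so mapping them to abort is an assumption.

```typescript
type BrowserBridgeErrorCode =
  | "domain_blocked" | "session_not_found" | "tab_not_found"
  | "element_not_found" | "element_stale" | "timeout"
  | "debugger_attach_failed" | "invalid_action" | "internal_error";

// Hypothetical labels for what the agent should do next.
type NextStep =
  | "surface_to_user"
  | "pass_explicit_tab_id"
  | "re_extract"
  | "fix_request"
  | "retry_if_transient"
  | "abort";

function nextStep(code: BrowserBridgeErrorCode): NextStep {
  switch (code) {
    case "domain_blocked":
      return "surface_to_user"; // terminal: the block is intentional
    case "session_not_found":
      return "pass_explicit_tab_id";
    case "element_not_found":
    case "element_stale":
      return "re_extract"; // refresh elements, then pick a target again
    case "invalid_action":
      return "fix_request"; // read the message, fix the input or expression
    case "timeout":
      return "retry_if_transient"; // e.g. pair with wait_for on slow loads
    case "tab_not_found":
    case "debugger_attach_failed":
    case "internal_error":
      return "abort"; // assumption: not covered by the guidance above
  }
}
```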
Events
Unsolicited event messages flow extension → backend. Backend logs them today; future iterations may forward them to
the agent’s context (e.g., inject a session_ended notification mid-run).
| Event | Fields |
|---|---|
| session_started | domain, tabId, startedAt |
| session_ended | domain, tabId, actionCount, reason (one of seven; see Session Model → How sessions end) |
| domain_blocked | domain, attemptedAction, optional tabId; fires when an action targets a blocklisted domain |
| tab_closed | tabId; for auditing (paired with session_ended when relevant) |
| global_stop | endedCount; popup “Stop all” was invoked; server should treat in-flight requests as cancelled |
The discriminated union is extensible: new event variants (e.g., user_task_trigger for page-initiated agent runs) slot
in without a protocol bump if they’re purely additive and non-load-bearing.
Versioning discipline
BROWSER_BRIDGE_PROTOCOL_VERSION is currently 4. Bump it when a message variant changes shape, a field becomes
required, or semantics shift in a way the extension cannot detect at runtime.
Change log:
- v1 → v2: extract result { text, html } → { text, markdown, elements, obstacle? }; click / type / wait_for accept
{ uid } as an alternative to { selector }; new actions hover and evaluate; new error code element_stale. Server-side
accepts both shapes for one release.
- v2 → v3: sessionEndReason enum adds system_idle and screen_locked so audit logs can distinguish user-presence
revocations (previously both mapped to expired). A v2 server would Zod-reject a v3 session_ended event carrying either
new reason; the handshake bump makes the incompatibility fail fast instead of silently dropping audit records.
- v3 → v4: UX-first internal refactor. The consent layer is removed: per-domain allowlist → domain blocklist,
TTL eliminated (sessions end on tab-close / idle / detach / stop / blocklist). Error codes scope_violation and
session_expired are removed in favor of domain_blocked. Event permission_denied is replaced by domain_blocked; new
global_stop event fires when the popup’s Stop-all is invoked. session_started drops expiresAt (no TTL) and carries
startedAt instead. session_ended adds tabId + actionCount. sessionEndReason drops "expired" and adds "domain_blocked"
+ "global_stop". Hard cut: internal-only deployment, coordinated server upgrade, no v3 back-compat shim. A v3 server
rejects v4 extension handshake; a v4 server rejects v3 extension handshake.
When bumping:
- Increment the constant in @zapvol/common/src/schemas/browser-bridge.ts.
- For internal-only deployments (like v3 → v4), hard cuts are acceptable: coordinate extension + server deploy,
requiredMinProtocolVersion equals the new value, no grace window.
- For public deployments, keep the previous version working on the server for at least one release: accept both
hello.protocolVersion values and branch internally, or maintain two parallel schemas. Set the server’s
requiredMinProtocolVersion to the previous-supported version so older extensions stay paired; after one release, raise
it to the new value.
Pure additions — a new action variant, a new event type, a new optional field — do not require a version bump so
long as old extensions receiving them can safely ignore.