Architecture

Runtime topology across backend, extension, and CDP target tab; shared pool abstraction; 5-layer extension UI mirroring @zapvol/app; data flow end-to-end

End-to-End Flow

One browser tool call traverses five layers — from the main agent delegating via task, through the schema-validated WebSocket envelope, into the extension’s consent gate, down the CDP branch matching the action, and onto the real Chrome tab. The three load-bearing mechanisms (security invariants, uid cache lifecycle, obstacle detection) are called out at the bottom because they span layers.

Figure: BUA End-to-End Architecture. One browser action, round-tripped through 5 layers, from agent intent to real Chrome page.

1 · Caller (packages/backend). The agent decides what to do; the backend turns it into a schema-validated request.
  • Main Agent: browser tool stripped via INTERNAL_ONLY_TOOLS_SET; delegates via task(subagent_type: "browser").
  • Browser Subagent: toolKeys: [browser, complete]; isolated ZapvolContext; parent.browserBridge; runs the ReAct loop, returns one summary.
  • browser.tool.ts: inputSchema: zodSchema(browserActionSchema); compact() + toClientOutput() shape the result; abortSignal forwarded to bridge.request.
  • BrowserBridge pool: per-user WebSocket; generates msg.id (UUIDv4); 30s pool timeout; abortSignal → reject + cleanup; id-matched response resolves the promise.
  • WS send: type="request" · { id, action } · protocolVersion = 2.

2 · Protocol Envelope (@zapvol/common/schemas/browser-bridge.ts). One Zod union, validated at both ends.
  • hello / ack / reject
  • request { id, action }
  • response { id, result | error }
  • event (session_*, permission_denied)
  • pairing · apex-domain match

3 · Extension Background SW, the gate. Consent and session invariants are enforced here, before any CDP call. Refuses out-of-scope actions.
  • ActionDispatcher: resolveTabAndDomain (explicit tabId wins; fallback: prefer BUA-owned session; refuse ambiguous (multi user-tab)); buildTarget({selector} | {uid}); mapError → protocol error codes.
  • SessionManager: checkScope(domain, tabId) (allowlist match (per-domain TTL); Date.now() < session.expiresAt; chrome.idle + screen-lock guard); incrementActionCount → rolling TTL; chrome.alarms: proactive expiry.
  • BuaWindowManager: dedicated minimized Chrome window (created lazy on first open_tab; focused: false, state: "minimized"; auto-collapse when empty); withOpenTabLock → no orphan windows; windowId persisted across SW restart.
  • DebuggerController.attach: chrome.debugger.attach (CDP 1.3); enable: Page · DOM · Runtime · AX; Page.addScriptToEvaluateOnNewDocument → anti-webdriver override; listener: Page.frameNavigated; listener: Page.javascriptDialogOpening.
  • Scope OK → CDP commands via chrome.debugger.sendCommand({ tabId }, …).

4 · DebuggerController, CDP dispatch with one path per action family (apps/bua/src/debugger-controller.ts). All real browser work happens here.
  • navigate: Page.navigate({ url }); pre-clear uidCaches.delete(tabId); Page.frameNavigated event → re-clears cache (defense-in-depth); SPA route changes (no reload) keep uidCache intact.
  • click / hover: resolveTargetCenter(target); uid: scrollIntoViewIfNeeded + DOM.getBoxModel; selector: querySelector + rect; click: dispatchMouseEvent mousePressed + mouseReleased; hover: single mouseMoved; → event.isTrusted = true; stale uid → element_stale.
  • type: focus the target element (uid: DOM.focus(backendNodeId); selector: el.focus() via evaluate); Input.insertText({ text }); no synthesised keyboard events (accepts Unicode, paste-equivalent); press_key uses a separate Input.dispatchKeyEvent path with a KEY_MAP of named keys.
  • extract: Runtime.evaluate buildExtractScript (clone DOM → Markdown walker; strip nav/footer/aside/overlays; obstacle signals (inline)); Accessibility.getFullAXTree (filter INTERACTIVE_ROLES; assign e0/e1/… (cap 200) → uidCaches.set(tabId, map)); detectObstacle(url, elements, …).
  • evaluate: wrap expression in arrow-IIFE (() => { expression })(); Runtime.evaluate returnByValue + awaitPromise; evaluateInProgress.add(tabId), dialogs auto-handled while in flight; >8KB value → preview + truncated; JS throws → invalid_action code; .delete in finally (never leaks).

5 · Chrome Tab: real page, real events. The site sees genuine user input; the SW listens for feedback to keep caches and dialogs in sync.
  • isTrusted CDP input: mouse + key events look like human; passes event.isTrusted checks on login forms, drag/drop, framework gates.
  • Anti-webdriver shim: navigator.webdriver = false; installed on every new document; no broader fingerprint spoofing.
  • Page.frameNavigated ↑: main-frame nav → SW clears uidCache; next uid-based action throws element_stale, agent must extract.
  • Page.javascriptDialogOpening ↑: if evaluateInProgress.has(tabId): alert/beforeunload → accept; confirm/prompt → dismiss.
  • ↑ response { id, result | error } flows back up the stack; pool resolves the promise; tool returns to agent.

The three load-bearing mechanisms

A · Security invariants (enforced by SessionManager; bypass is impossible by design)
  • Per-domain allowlist: explicit user grant per site.
  • TTL + rolling forward: 15 min default, refreshed on each action.
  • Idle / lock guard: system_idle 30m; screen_locked immediate.
  • BUA window: agent tabs live minimized, never steal focus.
  • Audit log: sessionHistory append-only, 1000 entries.

B · uid cache lifecycle (uidCaches: Map<tabId, Map<uid, backendNodeId>>)
  ① extract populates: AX tree walk → e0/e1/… → Map set.
  ② click / type / hover / wait_for use lookupUid → backendNodeId → CDP by id.
  ③ navigate OR Page.frameNavigated: uidCaches.delete(tabId), tab's map cleared.
  ④ stale uid → element_stale: agent must call extract again before interacting.

C · Obstacle detection (lib/obstacle-detect.ts, pure fn over extract inputs)
  • Inputs (gathered in extract round-trip): url · title · markdown · captchaFrame; elements[]: AX tree matches design-system password inputs.
  • detectObstacle, 5 stacked layers: L1 captcha · L2 denied · L3 password+auth · L4 ≥2 signals; first layer to fire short-circuits; bias toward false negatives.
  • extract.obstacle? = { type, confidence, message }: high = terminal · low = advisory · compact preserves it.

Read it top-to-bottom for “what happens when the agent calls click({ uid: 'e3' })”; read the bottom cards side-to-side for “which invariants make this safe and resilient.” The rest of this page zooms into each layer.
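The evaluate and dialog notes in the diagram can be sketched in a few lines. This is a minimal illustration, not the controller's actual code: `dialogDecision`, `wrapExpression`, and `withEvaluateGuard` are hypothetical helper names, and only the accept/dismiss policy, the arrow-IIFE shape, and the cleanup-in-finally behaviour come from the diagram.

```typescript
// Dialog kinds reported by CDP's Page.javascriptDialogOpening event.
type DialogType = "alert" | "confirm" | "prompt" | "beforeunload";

// Tabs with an evaluate call in flight (mirrors the evaluateInProgress set).
const evaluateInProgress = new Set<number>();

// Hypothetical helper: decide how to answer a dialog. Returns null when no
// evaluate is running on the tab, meaning the dialog is left untouched.
// A non-null result would feed Page.handleJavaScriptDialog({ accept }).
function dialogDecision(tabId: number, type: DialogType): { accept: boolean } | null {
  if (!evaluateInProgress.has(tabId)) return null;
  // alert / beforeunload are accepted; confirm / prompt are dismissed.
  return { accept: type === "alert" || type === "beforeunload" };
}

// Hypothetical helper: wrap a snippet in an arrow IIFE, as the diagram shows.
// With a block body, a value only comes back if the snippet uses `return`.
function wrapExpression(expression: string): string {
  return `(() => { ${expression} })()`;
}

// Hypothetical helper: mark the tab while `work` runs. The finally block
// removes the marker even when `work` throws, so it never leaks.
async function withEvaluateGuard<T>(tabId: number, work: () => Promise<T>): Promise<T> {
  evaluateInProgress.add(tabId);
  try {
    return await work();
  } finally {
    evaluateInProgress.delete(tabId);
  }
}
```

Keeping the decision as a pure function of the tab marker and the dialog type makes the policy easy to test without a live debugger session.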

Runtime Topology

Three zones participate on every browser tool call: the agent backend (server or desktop), the Chrome Extension (MV3 service worker), and the target tab (the user’s logged-in page).

Figure: Browser Use Agent, Runtime Topology (Backend ↔ Extension ↔ CDP target).
  • Backend (server or desktop): browser tool (action discriminated union); context.browserBridge (per-user injection); BrowserBridgePool (userId → ws); WS endpoint (Hono /ws/browser or ws://127.0.0.1:48123, pairing-token authenticated).
  • Chrome Extension (background service worker): ws-client (handshake + reconnect); session-manager (allowlist + TTL gate); action-dispatcher (scope check → CDP call); debugger-controller (Input.dispatchMouseEvent, Page.captureScreenshot...); popup + options = thin UI.
  • Target Tab (user's logged-in page): DOM + cookies; real event.isTrusted; Gmail · intranet · SaaS.
  • Edges: Backend ↔ Extension over WebSocket (hello / ack / reject; request, response, event); Extension → Target Tab via chrome.debugger CDP commands.
  • Summary: agent issues actions; extension enforces scope; CDP executes on real tab.

  • Backend owns the agent loop. The browser tool receives an action, looks up context.browserBridge, and calls pool.request(userId, action). The pool serializes the request over the live WebSocket to the extension.
  • Extension owns the enforcement point. ws-client receives the request; session-manager checks the target tab’s current domain against the domain blocklist and auto-creates a session on the first action; action-dispatcher maps the action to a CDP command and runs it via debugger-controller.
  • Target Tab is the real, user-controlled Chrome tab. CDP Input.dispatchMouseEvent produces events with event.isTrusted = true, which is what Gmail, Google Accounts, and most enterprise SaaS require.
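The extension-side gate described in the second bullet can be sketched as follows. The method names `isDomainBlocked`, `ensureSession`, and `recordAction` appear in the walkthrough later on this page; the `SessionGate` class, the `checkBeforeDispatch` function, and the exact-hostname matching rule are assumptions made for illustration.

```typescript
interface Session {
  tabId: number;
  domain: string;
  actionCount: number;
  lastActionAt: number;
}

type GateResult =
  | { ok: true; session: Session }
  | { ok: false; error: { code: "domain_blocked"; message: string } };

// Hypothetical stand-in for session-manager; matching is by exact hostname (an assumption).
class SessionGate {
  private readonly blocklist = new Set<string>();
  private readonly sessions = new Map<number, Session>();

  blockDomain(domain: string): void {
    this.blocklist.add(domain.toLowerCase());
  }

  isDomainBlocked(domain: string): boolean {
    return this.blocklist.has(domain.toLowerCase());
  }

  // Creates a session on the first action for a tab and reuses it afterwards,
  // updating the recorded domain if the tab has navigated mid-session.
  ensureSession(tabId: number, domain: string): Session {
    const existing = this.sessions.get(tabId);
    if (existing) {
      existing.domain = domain;
      return existing;
    }
    const created: Session = { tabId, domain, actionCount: 0, lastActionAt: 0 };
    this.sessions.set(tabId, created);
    return created;
  }

  // Bumps the audit counters after a successful action.
  recordAction(tabId: number, now: number = Date.now()): void {
    const session = this.sessions.get(tabId);
    if (!session) return;
    session.actionCount += 1;
    session.lastActionAt = now;
  }
}

// Hypothetical dispatcher step: the blocklist check runs before any session is touched.
function checkBeforeDispatch(gate: SessionGate, tabId: number, domain: string): GateResult {
  if (gate.isDomainBlocked(domain)) {
    return { ok: false, error: { code: "domain_blocked", message: `${domain} is on the blocklist` } };
  }
  return { ok: true, session: gate.ensureSession(tabId, domain) };
}
```

Because the session map is keyed by tabId, two tabs on the same domain get independent sessions, which matches the State Ownership table.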

Only the background service worker holds the WebSocket. Popup, options, and content scripts relay through it via chrome.runtime.sendMessage, never directly.

Shared pool — one implementation, two platforms

The backend is cross-platform. The connection pool, per-user bridge factory, and BrowserBridgeSocket abstraction all live in @zapvol/backend/infra/browser-bridge-{pool,bridge}.ts:

export interface BrowserBridgeSocket {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

export interface BrowserBridgePool {
  isConnected(userId: string): boolean;
  request(userId: string, action: BrowserAction): Promise<BrowserBridgeActionResult>;
  attach(userId: string, ws: BrowserBridgeSocket): void;
  detach(userId: string, reason?: string): void;
  handleMessage(userId: string, raw: string): void;
  getStats(): { connections: number; inFlight: number };
}

Both Hono’s WSContext (server) and the ws package’s WebSocket (desktop Electron main) structurally satisfy BrowserBridgeSocket — no adapter needed. Only the handshake routing differs per platform:

| Platform | Endpoint | Auth |
| --- | --- | --- |
| Server | wss://<host>/ws/browser | Pairing token returned by /api/browser-extension/pairing-token |
| Desktop | ws://127.0.0.1:48123 | Pairing token file in Electron’s userData |
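The "no adapter needed" claim rests on TypeScript's structural typing: any object with compatible `send` and `close` methods is accepted wherever a BrowserBridgeSocket is expected. A minimal sketch, where `FakeTransport` and `sendRequest` are hypothetical names standing in for a real transport and for the pool's send path:

```typescript
// Copied from the interface above so the sketch is self-contained.
interface BrowserBridgeSocket {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

// Hypothetical transport with extra members, as a real WebSocket object would have.
// It never declares `implements BrowserBridgeSocket`; its shape is enough.
class FakeTransport {
  readonly sent: string[] = [];
  readyState = 1; // extra member that the interface does not mention
  closedWith: { code: number | undefined; reason: string | undefined } | null = null;

  send(data: string): void {
    this.sent.push(data);
  }

  close(code?: number, reason?: string): void {
    this.closedWith = { code, reason };
    this.readyState = 3;
  }
}

// Hypothetical helper that only depends on the narrow interface.
function sendRequest(ws: BrowserBridgeSocket, id: string, action: unknown): void {
  ws.send(JSON.stringify({ type: "request", id, action }));
}

const transport = new FakeTransport();
sendRequest(transport, "req-1", { type: "click", uid: "e3", tabId: 42 });
transport.close(1000, "done");
```

The same property makes the pool easy to test: a recording object like the one above can replace the network entirely.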

Extension UI Layering (5 layers)

The extension’s popup and options UI strictly mirrors @zapvol/app’s Contract → Service → Context → Hooks → UI pattern. UI components never call chrome.runtime.* or chrome.storage.* directly.

Figure: Extension UI Layering, mirroring @zapvol/app: Contract → Service → Context → Hooks → UI. The UI entrypoints never import chrome.* directly; they always go through hooks.

| Layer | Path | Role |
| --- | --- | --- |
| Contract | src/contracts/bridge-service.ts | BridgeService interface + BridgeState type, the single source of truth |
| Services | src/services/bridge-service-local.ts | Background side: composes SessionManager + WsClient into the service |
| | src/services/bridge-service-runtime.ts | UI side: proxies every contract method over chrome.runtime.sendMessage |
| Context | src/context/bridge-context.tsx | React provider + useBridgeService() injection hook |
| Hooks | src/hooks/use-bridge-state.ts | Live BridgeState snapshot via subscribeState + getState |
| | src/hooks/use-bridge-config.ts | Config read / save / clear with loading + error state |
| | src/hooks/use-blocklist.ts | Blocklist list / add / remove with refresh |
| UI | src/entrypoints/popup/App.tsx | Activity monitor: session stream, per-tab stop, global “Stop all” |
| | src/entrypoints/options/App.tsx | Connection + blocklist + audit log |

Cross-process messaging (popup ↔ background) is wrapped in two files at the protocol boundary:

  • src/runtime-protocol.ts — typed request/broadcast shapes, tagged with kind: "bridge_request" | "bridge_broadcast" to not clash with other extension message buses
  • src/runtime-handler.ts — background-side chrome.runtime.onMessage listener that dispatches into the local service and broadcasts state changes back
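A tagged message shape like the one described for runtime-protocol.ts might look like the sketch below. Only the `kind` discriminator values come from this page; the `method`, `args`, and `state` fields, and the `isBridgeMessage` and `route` helpers, are hypothetical.

```typescript
// Hypothetical payload fields; only the `kind` tags are taken from the text above.
type BridgeRequest = { kind: "bridge_request"; method: string; args: unknown[] };
type BridgeBroadcast = { kind: "bridge_broadcast"; state: unknown };
type BridgeMessage = BridgeRequest | BridgeBroadcast;

// Type guard: anything without one of the two tags belongs to another message bus.
function isBridgeMessage(msg: unknown): msg is BridgeMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const kind = (msg as { kind?: unknown }).kind;
  return kind === "bridge_request" || kind === "bridge_broadcast";
}

// A background-side listener would call something like this first and
// return early for messages it does not own.
function route(msg: unknown): "request" | "broadcast" | "ignored" {
  if (!isBridgeMessage(msg)) return "ignored";
  return msg.kind === "bridge_request" ? "request" : "broadcast";
}
```

Checking the tag before touching any other field is what lets several message buses share one chrome.runtime.onMessage channel without interfering.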

End-to-End Data Flow — one click({ uid }) action

This walkthrough complements the hero diagram above. It shows the types that move across each boundary and follows the v2 uid path; the selector path has the same control flow, with querySelector-based center resolution in place of DOM.getBoxModel.

  1. Browser subagent emits a tool call: browser({ action: { type: "click", uid: "e3", tabId: 42 } }). The subagent learned uid: "e3" from the previous extract’s elements array.
  2. browser tool’s execute forwards to context.browserBridge.request(action, abortSignal).
  3. Per-user bridge calls pool.request(userId, action) — generates a UUID, sends { type: "request", id, action } over the live WS, stores a pending Promise with a per-action pool timeout and an abort-listener cleanup.
  4. Extension’s ws-client receives, validates against browserBridgeMessageSchema, routes to the request handler registered in background.ts.
  5. action-dispatcher runs buildTarget({ uid: "e3" }) → { uid: "e3" }, then resolveTabAndDomain (tabId explicit → short-circuits), then sessionManager.isDomainBlocked(domain) (blocklist check) → sessionManager.ensureSession(tabId, domain) (creates on first action, reuses on subsequent; silently updates session.domain on mid-session navigation).
  6. debuggerController.click(tabId, { uid: "e3" }). The controller looks up uidCaches.get(tabId).get("e3") → backendNodeId: 1829; DOM.scrollIntoViewIfNeeded({ backendNodeId: 1829 }) + DOM.getBoxModel({ backendNodeId: 1829 }) → center (x, y); Input.dispatchMouseEvent mousePressed → mouseReleased at that point. Dispatcher calls sessionManager.recordAction(tabId) to bump actionCount + lastActionAt for audit.
  7. CDP resolves; dispatcher returns { result: { ok: true } }; extension sends { type: "response", id, result: { ok: true } } back over the WS.
  8. Pool resolves the pending Promise; browser tool’s execute pushes a task_milestone event and returns { ok: true, action: "click", result: { ok: true } } to the subagent’s loop.
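Step 6 turns a box model into a click point. The sketch below shows one plausible way to do that, assuming the content quad from DOM.getBoxModel is used (the page does not say which quad the controller picks). A CDP quad is eight numbers, x1, y1 through x4, y4; `quadCenter` and `clickEvents` are hypothetical helper names.

```typescript
// A CDP Quad: four corner points flattened as [x1, y1, x2, y2, x3, y3, x4, y4].
type Quad = [number, number, number, number, number, number, number, number];

// Center of the quad, taken as the mean of its four corners.
function quadCenter(quad: Quad): { x: number; y: number } {
  const x = (quad[0] + quad[2] + quad[4] + quad[6]) / 4;
  const y = (quad[1] + quad[3] + quad[5] + quad[7]) / 4;
  return { x, y };
}

// Parameter objects for the two Input.dispatchMouseEvent calls that make up a click.
function clickEvents(center: { x: number; y: number }) {
  const base = { x: center.x, y: center.y, button: "left" as const, clickCount: 1 };
  return [
    { type: "mousePressed" as const, ...base },
    { type: "mouseReleased" as const, ...base },
  ];
}

// Example: a 100 x 40 box whose top-left corner sits at (10, 20).
const center = quadCenter([10, 20, 110, 20, 110, 60, 10, 60]);
const events = clickEvents(center);
```

Averaging the corners rather than using width and height also works for rotated or skewed elements, where the quad is not axis-aligned.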

Where it breaks. Step 5 can fail with domain_blocked (terminal — target on blocklist) or tab_not_found; step 6 can fail with element_stale (uid not in cache — agent must extract first) or element_not_found (cached backendNodeId resolved to a node no longer in the DOM). Every failure lands on the same response envelope; the agent treats each code per the contract in Session Model → Error-handling contract.
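The element_stale path follows directly from how the uid cache is keyed. A minimal sketch under stated assumptions: `populateFromExtract`, `onNavigate`, and `BridgeActionError` are hypothetical names, while the Map shape, the e0/e1/… naming, and the 200-element cap are taken from the diagram.

```typescript
// Hypothetical error type carrying a protocol error code.
class BridgeActionError extends Error {
  constructor(readonly code: "element_stale", message: string) {
    super(message);
  }
}

// Per-tab cache: uid ("e0", "e1", …) → backendNodeId.
const uidCaches = new Map<number, Map<string, number>>();

// Called at the end of extract: replaces the tab's map wholesale (cap of 200 uids).
function populateFromExtract(tabId: number, backendNodeIds: number[]): string[] {
  const map = new Map<string, number>();
  backendNodeIds.slice(0, 200).forEach((nodeId, i) => map.set(`e${i}`, nodeId));
  uidCaches.set(tabId, map);
  return Array.from(map.keys());
}

// Used by click / type / hover / wait_for before issuing CDP commands.
function lookupUid(tabId: number, uid: string): number {
  const nodeId = uidCaches.get(tabId)?.get(uid);
  if (nodeId === undefined) {
    throw new BridgeActionError("element_stale", `uid ${uid} is not cached; call extract first`);
  }
  return nodeId;
}

// Called on navigate and again on Page.frameNavigated.
function onNavigate(tabId: number): void {
  uidCaches.delete(tabId);
}
```

Deleting the whole per-tab map on navigation, instead of invalidating entries one by one, means a uid can never resolve to a node from a previous document.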

Cancellation

Two independent abort paths reach an in-flight BUA action cleanly:

  • Extension-side (user-initiated): popup per-session Stop now, the global red Stop all, or Chrome’s debugger-bar Cancel → extension detaches the debugger, fires session_ended (and global_stop for Stop all), and any subsequent action on that tab fails with session_not_found. Adding the target domain to the blocklist ends the session the same way plus domain_blocked errors for new attempts.
  • Backend-side (agent-run cancelled): the AI SDK passes an AbortSignal to each tool’s execute. The browser tool forwards it to bridge.request(action, signal) → BrowserBridgePool.request, which attaches an abort listener to its pending-request map. When the signal fires, the pool immediately resolves the pending promise with { error: { code: "internal_error", message: "aborted by caller" } }, clears the timeout, and detaches the listener. Any response that arrives late from the extension is logged and dropped as stray.

Both paths use the same error envelope, so the subagent’s error-handling contract applies uniformly — no new code path in the agent loop.
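The backend-side abort path can be sketched as a small pending-request map. This is an illustration under stated assumptions, not the pool's code: `PendingRequests` is a hypothetical class, ids come from a counter where the real pool generates a UUID, and the error code used for the timeout case is a guess.

```typescript
type BridgeResult =
  | { result: unknown }
  | { error: { code: string; message: string } };

class PendingRequests {
  private readonly pending = new Map<string, (r: BridgeResult) => void>();
  private nextId = 0;

  constructor(
    private readonly send: (raw: string) => void,
    private readonly timeoutMs: number = 30000,
  ) {}

  get inFlight(): number {
    return this.pending.size;
  }

  request(action: unknown, signal?: AbortSignal): Promise<BridgeResult> {
    const id = `req-${++this.nextId}`; // assumption: the real pool uses a UUID here
    return new Promise<BridgeResult>((resolve) => {
      const onAbort = () =>
        settle({ error: { code: "internal_error", message: "aborted by caller" } });
      // Assumption: the timeout reuses internal_error; the page does not name its code.
      const timer = setTimeout(
        () => settle({ error: { code: "internal_error", message: "timed out" } }),
        this.timeoutMs,
      );
      // Single exit point: removes the entry, clears the timer, detaches the listener.
      const settle = (r: BridgeResult) => {
        if (!this.pending.delete(id)) return; // already settled
        clearTimeout(timer);
        signal?.removeEventListener("abort", onAbort);
        resolve(r);
      };
      this.pending.set(id, settle);
      if (signal?.aborted) {
        onAbort(); // aborted before sending: settle without touching the socket
        return;
      }
      signal?.addEventListener("abort", onAbort, { once: true });
      this.send(JSON.stringify({ type: "request", id, action }));
    });
  }

  // Returns false for a stray response whose id is no longer pending.
  handleMessage(raw: string): boolean {
    const msg = JSON.parse(raw) as { id: string; result?: unknown; error?: { code: string; message: string } };
    const settle = this.pending.get(msg.id);
    if (!settle) return false;
    settle(msg.error ? { error: msg.error } : { result: msg.result });
    return true;
  }
}
```

Routing abort, timeout, and response through one `settle` function is what guarantees the timer and listener are released exactly once, whichever event wins.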

State Ownership

| State | Owner | Storage |
| --- | --- | --- |
| Pairing config | Extension | chrome.storage.local key bridge_config |
| Domain blocklist entries | Extension | chrome.storage.local key blocklist |
| Active sessions (keyed by tabId; multiple per domain allowed) | Extension | chrome.storage.local key activeSessions |
| BUA window id | Extension | chrome.storage.local key buaWindowId |
| Session audit log (≤ 1000 entries) | Extension | chrome.storage.local key sessionHistory |
| In-flight WS requests | Pool (backend, in-memory) | Map<userId, Map<requestId, Pending>> |
| Connection per user | Pool (backend, in-memory) | Map<userId, WSContext \| WebSocket> |
| CDP attachment set | debugger-controller | In-memory Set<tabId> |
| uid → backendNodeId cache (per tab) | debugger-controller | In-memory Map<tabId, Map<uid, backendNodeId>>, cleared on Page.frameNavigated or navigate |
| evaluateInProgress set (guards dialog auto-dismiss) | debugger-controller | In-memory Set<tabId> |
| SessionEvent subscribers | session-manager module-scope | In-memory Set<listener> (re-attached on SW wake) |

The background service worker can be terminated by Chrome at any time. When it restarts, the bridge client re-connects and extension-side state is rehydrated from chrome.storage.local. Any agent request in flight at the moment of SW termination is rejected on the backend side with an internal_error — the agent learns of the failure and can choose to retry or stop.

Why this shape

  • One contract, two implementations — keeps the popup and options page honest about their dependency. Swapping the runtime for a mock in tests is a single factory call.
  • Pool in shared backend: @zapvol/server and @zapvol/desktop both get identical enforcement behavior. Fixing a timeout bug in one fixes it everywhere.
  • Scope check in the extension, not the backend — the extension is the only code that sees the live tab. It can enforce “this domain is not blocklisted AND this tab is still open” atomically before dispatching. The backend trusts the extension’s answer and surfaces errors to the agent.
  • One tool, action enum — mirrors Anthropic’s computer-use shape and keeps the prompt tractable (13 actions in a single JSON schema the LLM has to understand). Individual tools per action would explode the prompt.
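The "one tool, action enum" shape corresponds to a discriminated union on `type`. The sketch below uses a hypothetical four-action subset with simplified fields (the real schema is the Zod browserActionSchema, covers 13 actions, and also accepts selector targets); `describeAction` is an illustrative helper.

```typescript
// Hypothetical subset of the action union; field lists are simplified.
type BrowserAction =
  | { type: "navigate"; url: string; tabId?: number }
  | { type: "click"; uid: string; tabId?: number }
  | { type: "type"; uid: string; text: string; tabId?: number }
  | { type: "extract"; tabId?: number };

// Exhaustive switch: adding a new variant to the union makes this fail to
// compile until a matching case is added.
function describeAction(action: BrowserAction): string {
  switch (action.type) {
    case "navigate":
      return `navigate to ${action.url}`;
    case "click":
      return `click ${action.uid}`;
    case "type":
      return `type ${action.text.length} chars into ${action.uid}`;
    case "extract":
      return "extract page content";
    default: {
      const unreachable: never = action;
      return unreachable;
    }
  }
}
```

A single union keeps dispatch in one place on the extension side and gives the model one schema to learn, at the cost of a larger single schema.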