Architecture
Runtime topology across backend, extension, and CDP target tab; shared pool abstraction; 5-layer extension UI mirroring @zapvol/app; data flow end-to-end
End-to-End Flow
One browser tool call traverses five layers — from the main agent delegating via task, through the
schema-validated WebSocket envelope, into the extension’s consent gate, down the CDP branch matching the action,
and onto the real Chrome tab. The three load-bearing mechanisms (security invariants, uid cache lifecycle,
obstacle detection) are called out at the bottom because they span layers.
Read it top-to-bottom for “what happens when the agent calls click({ uid: 'e3' })”; read the bottom cards
side-to-side for “which invariants make this safe and resilient.” The rest of this page zooms into each layer.
Runtime Topology
Three zones participate on every browser tool call: the agent backend (server or desktop), the Chrome
Extension (MV3 service worker), and the target tab (the user’s logged-in page).
- Backend owns the agent loop. The browser tool receives an action, looks up context.browserBridge, and calls pool.request(userId, action). The pool serializes the request over the live WebSocket to the extension.
- Extension owns the enforcement point. ws-client receives the request; session-manager checks the domain blocklist against the target tab’s current domain and auto-creates a session on first action; action-dispatcher maps the action to a CDP command and runs it via debugger-controller.
- Target Tab is the real, user-controlled Chrome tab. CDP Input.dispatchMouseEvent produces events with event.isTrusted = true, which is what Gmail, Google Accounts, and most enterprise SaaS require.
Only the background service worker holds the WebSocket. Popup, options, and content scripts relay through it via
chrome.runtime.sendMessage, never directly.
Shared pool — one implementation, two platforms
The backend is cross-platform. The connection pool, per-user bridge factory, and BrowserBridgeSocket abstraction all
live in @zapvol/backend/infra/browser-bridge-{pool,bridge}.ts:
```typescript
export interface BrowserBridgeSocket {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

export interface BrowserBridgePool {
  isConnected(userId: string): boolean;
  request(userId: string, action: BrowserAction): Promise<BrowserBridgeActionResult>;
  attach(userId: string, ws: BrowserBridgeSocket): void;
  detach(userId: string, reason?: string): void;
  handleMessage(userId: string, raw: string): void;
  getStats(): { connections: number; inFlight: number };
}
```
Both Hono’s WSContext (server) and the ws package’s WebSocket (desktop Electron main) structurally satisfy
BrowserBridgeSocket — no adapter needed. Only the handshake routing differs per platform:
| Platform | Endpoint | Auth |
|---|---|---|
| Server | wss://<host>/ws/browser | Pairing token returned by /api/browser-extension/pairing-token |
| Desktop | ws://127.0.0.1:48123 | Pairing token file in Electron’s userData |
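The "no adapter needed" claim rests on TypeScript's structural typing. A minimal sketch, assuming nothing beyond the interface above: RecordingSocket, connections, attach, and sendTo are hypothetical names invented for this example, and the bookkeeping is far simpler than a real pool.

```typescript
// Copy of the interface from above so the sketch is self-contained.
interface BrowserBridgeSocket {
  send(data: string): void;
  close(code?: number, reason?: string): void;
}

// Hypothetical stand-in for Hono's WSContext or the ws package's WebSocket.
// It never names BrowserBridgeSocket; having matching method shapes is enough.
class RecordingSocket {
  sent: string[] = [];
  closeCalls = 0;

  send(data: string): void {
    this.sent.push(data);
  }

  close(_code?: number, _reason?: string): void {
    this.closeCalls += 1;
  }
}

// Hypothetical, heavily reduced connection table: one socket per user.
const connections = new Map<string, BrowserBridgeSocket>();

function attach(userId: string, ws: BrowserBridgeSocket): void {
  connections.set(userId, ws);
}

// Returns false when the user has no live socket.
function sendTo(userId: string, data: string): boolean {
  const ws = connections.get(userId);
  if (!ws) return false;
  ws.send(data);
  return true;
}

const fake = new RecordingSocket();
attach("user-1", fake); // type-checks with no adapter and no `implements` clause
```

The same property makes the pool easy to test: any object with send and close can stand in for a real WebSocket.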
Extension UI Layering (5 layers)
The extension’s popup and options UI strictly mirrors @zapvol/app’s Contract → Service → Context → Hooks → UI pattern.
UI components never call chrome.runtime.* or chrome.storage.* directly.
| Layer | Path | Role |
|---|---|---|
| Contract | src/contracts/bridge-service.ts | BridgeService interface + BridgeState type (single source of truth) |
| Services | src/services/bridge-service-local.ts | Background side: composes SessionManager + WsClient into the service |
| | src/services/bridge-service-runtime.ts | UI side: proxies every contract method over chrome.runtime.sendMessage |
| Context | src/context/bridge-context.tsx | React provider + useBridgeService() injection hook |
| Hooks | src/hooks/use-bridge-state.ts | Live BridgeState snapshot via subscribeState + getState |
| | src/hooks/use-bridge-config.ts | Config read / save / clear with loading + error state |
| | src/hooks/use-blocklist.ts | Blocklist list / add / remove with refresh |
| UI | src/entrypoints/popup/App.tsx | Activity monitor: session stream, per-tab stop, global “Stop all” |
| | src/entrypoints/options/App.tsx | Connection + blocklist + audit log |
Cross-process messaging (popup ↔ background) is wrapped in two files at the protocol boundary:
- src/runtime-protocol.ts: typed request/broadcast shapes, tagged with kind: "bridge_request" | "bridge_broadcast" to avoid clashing with other extension message buses
- src/runtime-handler.ts: background-side chrome.runtime.onMessage listener that dispatches into the local service and broadcasts state changes back
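The kind tag works as a discriminant, which lets a listener ignore traffic from other message buses before doing anything else. The following is a sketch under stated assumptions: only the two kind values come from the text above, while the method, args, and state fields, along with isBridgeMessage and route, are hypothetical.

```typescript
// Hypothetical message shapes; only the two `kind` tags are taken from the docs.
type BridgeRequest = {
  kind: "bridge_request";
  method: string; // assumed: name of the contract method to invoke
  args: unknown[]; // assumed: positional arguments for that method
};

type BridgeBroadcast = {
  kind: "bridge_broadcast";
  state: unknown; // assumed: latest BridgeState snapshot
};

type BridgeMessage = BridgeRequest | BridgeBroadcast;

// Narrow an arbitrary runtime message. Anything without one of the two tags
// belongs to some other message bus and must be left alone.
function isBridgeMessage(msg: unknown): msg is BridgeMessage {
  if (typeof msg !== "object" || msg === null) return false;
  const kind = (msg as { kind?: unknown }).kind;
  return kind === "bridge_request" || kind === "bridge_broadcast";
}

// Shape of a background-side listener body: skip foreign messages, route ours.
function route(msg: unknown): string {
  if (!isBridgeMessage(msg)) return "ignored";
  switch (msg.kind) {
    case "bridge_request":
      return `dispatch ${msg.method}`;
    case "bridge_broadcast":
      return "broadcast";
  }
}
```

With this shape, route({ kind: "bridge_request", method: "getState", args: [] }) yields "dispatch getState", and an untagged message yields "ignored".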
End-to-End Data Flow — one click({ uid }) action
Concrete walkthrough complementing the hero diagram above. Shows the types that move across each boundary
and the v2 uid path (the selector path follows the same control flow, only with querySelector-based center
resolution instead of DOM.getBoxModel).
1. The browser subagent emits a tool call: browser({ action: { type: "click", uid: "e3", tabId: 42 } }). The subagent learned uid: "e3" from the previous extract’s elements array.
2. The browser tool’s execute forwards to context.browserBridge.request(action, abortSignal).
3. The per-user bridge calls pool.request(userId, action), which generates a UUID, sends { type: "request", id, action } over the live WS, and stores a pending Promise with a per-action pool timeout and an abort-listener cleanup.
4. The extension’s ws-client receives the message, validates it against browserBridgeMessageSchema, and routes it to the request handler registered in background.ts.
5. action-dispatcher runs buildTarget({ uid: "e3" }) → { uid: "e3" }, then resolveTabAndDomain (tabId is explicit, so it short-circuits), then sessionManager.isDomainBlocked(domain) (blocklist check) → sessionManager.ensureSession(tabId, domain) (creates on first action, reuses on subsequent ones; silently updates session.domain on mid-session navigation).
6. debuggerController.click(tabId, { uid: "e3" }). The controller looks up uidCaches.get(tabId).get("e3") → backendNodeId: 1829; DOM.scrollIntoViewIfNeeded({ backendNodeId: 1829 }) + DOM.getBoxModel({ backendNodeId: 1829 }) → center (x, y); Input.dispatchMouseEvent mousePressed → mouseReleased at that point. The dispatcher calls sessionManager.recordAction(tabId) to bump actionCount + lastActionAt for audit.
7. CDP resolves; the dispatcher returns { result: { ok: true } }; the extension sends { type: "response", id, result: { ok: true } } back over the WS.
8. The pool resolves the pending Promise; the browser tool’s execute pushes a task_milestone event and returns { ok: true, action: "click", result: { ok: true } } to the subagent’s loop.
Where it breaks. Step 5 can fail with domain_blocked (terminal — target on blocklist) or tab_not_found; step 6
can fail with element_stale (uid not in cache — agent must extract first) or element_not_found (cached
backendNodeId resolved to a node no longer in the DOM). Every failure lands on the same response envelope; the
agent treats each code per the contract in Session Model → Error-handling contract.
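The uid lookup and center computation in step 6 can be sketched as follows. This is an illustration, not the controller's actual code: resolveUid and quadCenter are hypothetical helpers, and it assumes the click point is derived from a single quad as returned by DOM.getBoxModel (each quad is a flat array of eight numbers, x1, y1 through x4, y4).

```typescript
// Only the error code is taken from the failure notes above; the rest is illustrative.
type ResolveResult =
  | { ok: true; backendNodeId: number }
  | { ok: false; code: "element_stale" };

// Per-tab uid cache, mirroring Map<tabId, Map<uid, backendNodeId>>.
const uidCaches = new Map<number, Map<string, number>>();

// A uid missing from the cache means the agent has to extract again.
function resolveUid(tabId: number, uid: string): ResolveResult {
  const backendNodeId = uidCaches.get(tabId)?.get(uid);
  if (backendNodeId === undefined) return { ok: false, code: "element_stale" };
  return { ok: true, backendNodeId };
}

// Average the four corners of a quad to get the point to click.
function quadCenter(quad: number[]): { x: number; y: number } {
  if (quad.length !== 8) throw new Error("expected a quad of 8 numbers");
  let x = 0;
  let y = 0;
  quad.forEach((value, index) => {
    if (index % 2 === 0) x += value; // even slots hold x coordinates
    else y += value; // odd slots hold y coordinates
  });
  return { x: x / 4, y: y / 4 };
}
```

For a 10 by 20 box anchored at the origin, quadCenter([0, 0, 10, 0, 10, 20, 0, 20]) gives { x: 5, y: 10 }. Averaging corners rather than using width and height keeps the result correct for rotated or skewed elements, where the quad is not axis-aligned.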
Cancellation
Two independent abort paths reach an in-flight BUA action cleanly:
- Extension-side (user-initiated): popup per-session Stop now, the global red Stop all, or Chrome’s debugger-bar Cancel → extension detaches the debugger, fires session_ended (and global_stop for Stop all), and any subsequent action on that tab fails with session_not_found. Adding the target domain to the blocklist ends the session the same way, plus domain_blocked errors for new attempts.
- Backend-side (agent-run cancelled): the AI SDK passes an AbortSignal to each tool’s execute. The browser tool forwards it to bridge.request(action, signal) → BrowserBridgePool.request attaches an abort listener to its pending-request map. When the signal fires, the pool immediately resolves the pending promise with { error: { code: "internal_error", message: "aborted by caller" } }, clears the timeout, and detaches the listener. Any response that arrives late from the extension is logged and dropped as stray.
Both paths use the same error envelope, so the subagent’s error-handling contract applies uniformly — no new code path in the agent loop.
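The backend-side path can be sketched as a pending-request map with a single settle function that every exit (response, timeout, abort) goes through. This is a simplified, single-user illustration rather than the pool's real code: request, handleResponse, and the counter-based ids are hypothetical, and the error code used for timeouts is an assumption.

```typescript
type ActionResult =
  | { result: unknown }
  | { error: { code: string; message: string } };

// One entry per in-flight request; `settle` is the only way out.
const pending = new Map<string, { settle: (r: ActionResult) => void }>();
let nextId = 0; // the docs describe a UUID; a counter keeps this sketch dependency-free

function request(
  send: (frame: string) => void,
  action: unknown,
  timeoutMs: number,
  signal?: AbortSignal,
): { id: string; done: Promise<ActionResult> } {
  const id = `req-${++nextId}`;
  const done = new Promise<ActionResult>((resolve) => {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const settle = (r: ActionResult): void => {
      if (!pending.delete(id)) return; // already settled, ignore
      clearTimeout(timer);
      signal?.removeEventListener("abort", onAbort);
      resolve(r);
    };
    const onAbort = (): void =>
      settle({ error: { code: "internal_error", message: "aborted by caller" } });
    pending.set(id, { settle });
    timer = setTimeout(
      // Assumption: timeouts reuse internal_error; the real code may differ.
      () => settle({ error: { code: "internal_error", message: "timed out" } }),
      timeoutMs,
    );
    if (signal?.aborted) onAbort();
    else signal?.addEventListener("abort", onAbort);
  });
  send(JSON.stringify({ type: "request", id, action }));
  return { id, done };
}

// Returns false for stray responses (unknown or already-settled id).
function handleResponse(raw: string): boolean {
  const msg = JSON.parse(raw) as {
    id: string;
    result?: unknown;
    error?: { code: string; message: string };
  };
  const entry = pending.get(msg.id);
  if (!entry) return false; // late reply after abort or timeout: drop it
  entry.settle(msg.error ? { error: msg.error } : { result: msg.result });
  return true;
}
```

Because settle deletes the map entry first and returns early when the entry is already gone, an abort followed by a late extension response cannot resolve the promise twice; the late response simply finds no entry and is reported as stray.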
State Ownership
| State | Owner | Storage |
|---|---|---|
| Pairing config | Extension | chrome.storage.local key bridge_config |
| Domain blocklist entries | Extension | chrome.storage.local key blocklist |
| Active sessions (keyed by tabId; multiple per domain allowed) | Extension | chrome.storage.local key activeSessions |
| BUA window id | Extension | chrome.storage.local key buaWindowId |
| Session audit log (≤ 1000 entries) | Extension | chrome.storage.local key sessionHistory |
| In-flight WS requests | Pool (backend, in-memory) | Map<userId, Map<requestId, Pending>> |
| Connection per user | Pool (backend, in-memory) | Map<userId, WSContext \| WebSocket> |
| CDP attachment set | debugger-controller | In-memory Set<tabId> |
| uid → backendNodeId cache (per tab) | debugger-controller | In-memory Map<tabId, Map<uid, backendNodeId>> — cleared on Page.frameNavigated or navigate |
| evaluateInProgress set (guards dialog auto-dismiss) | debugger-controller | In-memory Set<tabId> |
| SessionEvent subscribers | session-manager module-scope | In-memory Set<listener> (re-attached on SW wake) |
The background service worker can be terminated by Chrome at any time. When it restarts, the bridge client re-connects
and extension-side state is rehydrated from chrome.storage.local. Any agent request in flight at the moment of SW
termination is rejected on the backend side with an internal_error — the agent learns of the failure and can choose to
retry or stop.
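Rehydration after a service-worker restart amounts to turning whatever chrome.storage.local returns back into in-memory structures, with safe defaults for missing keys. A minimal sketch, assuming hypothetical stored shapes: the storage keys and the session fields domain, actionCount, and lastActionAt appear on this page, but ExtensionState, rehydrate, and the exact serialized layout are illustrative guesses.

```typescript
// Session fields named in the walkthrough; the stored layout itself is assumed.
interface Session {
  domain: string;
  actionCount: number;
  lastActionAt: number;
}

interface ExtensionState {
  blocklist: string[];
  activeSessions: Map<number, Session>;
  buaWindowId: number | null;
}

// `stored` stands for the object resolved by
// chrome.storage.local.get(["blocklist", "activeSessions", "buaWindowId"]).
function rehydrate(stored: Record<string, unknown>): ExtensionState {
  const rawBlocklist = stored["blocklist"];
  const blocklist = Array.isArray(rawBlocklist)
    ? rawBlocklist.filter((d): d is string => typeof d === "string")
    : [];

  const activeSessions = new Map<number, Session>();
  const rawSessions = stored["activeSessions"];
  if (typeof rawSessions === "object" && rawSessions !== null) {
    for (const [tabId, session] of Object.entries(rawSessions as Record<string, Session>)) {
      // Object keys come back as strings; tab ids are numbers.
      activeSessions.set(Number(tabId), session);
    }
  }

  const rawWindowId = stored["buaWindowId"];
  const buaWindowId = typeof rawWindowId === "number" ? rawWindowId : null;

  return { blocklist, activeSessions, buaWindowId };
}
```

In-memory state such as the CDP attachment set and the uid cache has no stored counterpart, so it starts empty after a restart.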
Why this shape
- One contract, two implementations: keeps the popup and options page honest about their dependency. Swapping the runtime for a mock in tests is a single factory call.
- Pool in shared backend: @zapvol/server and @zapvol/desktop both get identical enforcement behavior. Fixing a timeout bug in one fixes it everywhere.
- Scope check in the extension, not the backend: the extension is the only code that sees the live tab. It can enforce “this domain is not blocklisted AND this tab is still open” atomically before dispatching. The backend trusts the extension’s answer and surfaces errors to the agent.
- One tool, action enum: mirrors Anthropic’s computer-use shape and keeps the prompt tractable (13 actions in a single JSON schema the LLM has to understand). Individual tools per action would explode the prompt.
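The "one tool, action enum" shape maps naturally onto a discriminated union with a single dispatch point. The sketch below is illustrative only: it covers three of the 13 actions, the url field on navigate is an assumption, and describeAction is a hypothetical helper.

```typescript
// Illustrative subset of the action schema. The action names appear on this
// page; the `url` field and the optional tabId on every variant are assumed.
type BrowserAction =
  | { type: "click"; uid: string; tabId?: number }
  | { type: "navigate"; url: string; tabId?: number }
  | { type: "extract"; tabId?: number };

// One dispatch point for every action. The `never` assignment turns a
// forgotten variant into a compile error instead of a silent fall-through.
function describeAction(action: BrowserAction): string {
  switch (action.type) {
    case "click":
      return `click ${action.uid}`;
    case "navigate":
      return `navigate to ${action.url}`;
    case "extract":
      return "extract elements";
    default: {
      const unreachable: never = action;
      return unreachable;
    }
  }
}
```

Adding a fourteenth action then means adding one union member and one case, while the tool surface the LLM sees stays a single schema.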