Observability Stack Overview

Why observability is a first-class concern

Agent systems suffer from a particularly sneaky class of bugs: behaviorally correct, economically wrong. The LLM still returns sensible answers, tests still pass, but every step quietly misses cache, every compaction runs more aggressively than it should, every tool call burns 25% more of the token budget than expected. The human eye can’t catch any of this — only a dashboard can.

This chapter describes the stack Zapvol currently runs — pino → Alloy → Loki → Grafana — why these four, when to escalate, and the pitfalls to avoid. It is not a deployment manual; it is the mental model you need to operate this system.

Four-layer pipeline at a glance

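The pipeline runs pino (structured JSON logs) → Alloy (collection and labeling) → Loki (storage and querying) → Grafana (dashboards). Any layer can be swapped, but this combination currently offers the best overall value: fully open source, standard protocols, a free tier hosted by Grafana Cloud, and a clean path to scale up.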

Why this stack — three alternative designs and why we didn’t choose them

Candidate 1: ELK (Elasticsearch + Logstash + Kibana)

Most mature, most feature-complete. Overkill for Zapvol’s scale:

Elasticsearch is a full-text indexer — every field gets an inverted index. In a log workload, 90% of fields are never queried; index cost is pure waste.
Logstash’s JRuby runtime uses 5-10× more memory than Alloy.
Kibana’s permission model is more complex than a single team needs.

Verdict: suited for “grep everything in prod” full-text scenarios. Zapvol’s log queries are all structured (filter by event, taskId, time range) — the ES strengths don’t apply.

Candidate 2: OpenTelemetry Collector + any backend

Most standardized. But Alloy is itself an OTel-Collector-based distribution, and the differences are:

Alloy ships the River config language + a visual debugging UI (:12345).
Alloy has first-class integration with the Grafana ecosystem.
OTel Collector is more generic but has a rougher config surface.

Verdict: If you’re not leaving the Grafana ecosystem, Alloy is a superset of OTel Collector. Switch only if you need a non-Grafana backend (Datadog, Honeycomb).

Candidate 3: Commercial (Datadog / Honeycomb / Logz.io)

Best UI, best support. Price:

Datadog pricing is $1.27/GB ingest + $2.50/M indexed events.
A medium-sized agent task produces 5-10 info + debug logs per step. At the upper bound of 10 logs per step, 20 steps × 100 tasks/day = 20k logs/day ≈ 600k logs/month, which by our estimate comes to a few hundred dollars.
The open-source alternative at the same volume costs < $10 (object storage + small VM).
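
The volume estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: `estimateMonthlyLogs`, the `Workload` shape, and the 30-day month are illustrative assumptions, not part of Zapvol.

```typescript
// Illustrative helper (hypothetical, not part of Zapvol): estimates monthly
// log lines from the per-step figures quoted above.
interface Workload {
  logsPerStep: number;  // 5-10 info + debug lines per step
  stepsPerTask: number; // 20 in the example above
  tasksPerDay: number;  // 100 in the example above
}

function estimateMonthlyLogs(w: Workload, daysPerMonth = 30): number {
  return w.logsPerStep * w.stepsPerTask * w.tasksPerDay * daysPerMonth;
}

const low = estimateMonthlyLogs({ logsPerStep: 5, stepsPerTask: 20, tasksPerDay: 100 });
const high = estimateMonthlyLogs({ logsPerStep: 10, stepsPerTask: 20, tasksPerDay: 100 });

console.log(`${low} to ${high} log lines per month`); // 300000 to 600000 log lines per month
```

The range matters because per-event pricing scales linearly with it: halving debug output per step halves the bill.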

Verdict: pick when money is loose. Not needed for an internal tool.

Zapvol’s existing infrastructure

pino config (`apps/server/src/lib/logger.ts`)

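import pino from "pino";
import pretty from "pino-pretty";

// isDev is assumed to be defined earlier in the file (not shown in this excerpt).
const pinoLogger = pino(
  { level: process.env.LOG_LEVEL || (isDev ? "debug" : "info") },
  isDev ? pretty({ colorize: true, translateTime: "HH:MM:ss" }) : undefined,
);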

Dev mode uses pino-pretty (colored, readable); production defaults to JSONL on stdout — the format Alloy consumes natively, zero transformation.

event-first schema

log.info("task.created", { taskId, userId });
log.error("stream.failed", { taskId, err }, "Stream failed");

event is always the first required parameter. This convention runs through @zapvol/backend, @zapvol/server, and @zapvol/desktop. Consequences:

event="task.created" precisely filters one class of events in Grafana.
All event names form an auditable event catalog.
New contributors are forced to name the thing they’re logging — instead of log.info("something happened").
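
One way to enforce the convention is to make `event` a required positional parameter in the logger's type signature. The sketch below is hypothetical: `createEventLogger`, `Sink`, and the record layout are illustrative names, not Zapvol's actual logger.

```typescript
// Hypothetical event-first logger; names and record shape are assumptions.
type Level = "debug" | "info" | "warn" | "error";
type Fields = Record<string, unknown>;
type Sink = (line: string) => void;

function createEventLogger(sink: Sink) {
  const write = (level: Level) =>
    (event: string, data: Fields = {}, msg?: string): void => {
      // `event` is positional and required, so a call without it does not compile.
      // Spreading `data` first keeps callers from overwriting `level` or `event`.
      const record: Fields = { ...data, level, event };
      if (msg !== undefined) record.msg = msg;
      sink(JSON.stringify(record)); // one JSON object per line (JSONL)
    };
  return {
    debug: write("debug"),
    info: write("info"),
    warn: write("warn"),
    error: write("error"),
  };
}

// Usage: collect lines in memory instead of writing to stdout.
const lines: string[] = [];
const log = createEventLogger((line) => lines.push(line));
log.info("task.created", { taskId: "t-1", userId: "u-1" });
log.error("stream.failed", { taskId: "t-1" }, "Stream failed");
```

Because the event name is a plain string argument, a repository-wide search for `log.info("` and its siblings is enough to build the event catalog.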

AsyncLocalStorage injection

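function mergeContext(event, data) {
  const ctx = RequestContext.get();
  // Start from the caller's fields (assumed shape; the original excerpt omits this line).
  const merged = { ...data, event };
  if (ctx?.traceId) merged.traceId = ctx.traceId;
  if (ctx?.userId) merged.userId = ctx.userId;
  // ...
  return merged;
}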

Every log line carries traceId and userId without an explicit parameter. During debugging, filter by traceId to retrieve all logs for one request — across services, across async boundaries.
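
For readers unfamiliar with the mechanism, here is a minimal sketch of how a request context can be built on Node's `AsyncLocalStorage`. The `Ctx` shape and this `RequestContext` object are assumptions for illustration; Zapvol's real implementation may differ.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical request context; the real RequestContext may carry more fields.
interface Ctx {
  traceId?: string;
  userId?: string;
}

const storage = new AsyncLocalStorage<Ctx>();

const RequestContext = {
  // Runs `fn` with `ctx` visible to everything it calls, sync or async.
  run: <T>(ctx: Ctx, fn: () => T): T => storage.run(ctx, fn),
  // Returns the active context, or undefined outside of run().
  get: (): Ctx | undefined => storage.getStore(),
};

function mergeContext(event: string, data: Record<string, unknown> = {}) {
  const ctx = RequestContext.get();
  const merged: Record<string, unknown> = { ...data, event };
  if (ctx?.traceId) merged.traceId = ctx.traceId;
  if (ctx?.userId) merged.userId = ctx.userId;
  return merged;
}

// Inside run(), ids are attached without being passed as parameters.
const inside = RequestContext.run({ traceId: "tr-1", userId: "u-1" }, () =>
  mergeContext("task.created", { taskId: "t-1" }),
);

// Outside run(), there is no active context and no ids are attached.
const outside = mergeContext("task.created");
```

In a server, the `run` call would typically wrap each incoming request in a middleware, so handlers never touch the context directly.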

Existing key events (excerpt)

Event name	Location	Business meaning
`task.created` / `task.completed`	`apps/server/src/routes/tasks.ts`	Task lifecycle
`stream.messages_prepared`	`agent-round.ts`	prepareInitialMessages output at round start
`stream.step_finished`	`agent-round.ts` `onStepFinish`	Per-step usage + cache details
`cache.breakpoints_placed`	`agent-round.ts` `prepareStep`	Actual Anthropic cache breakpoint positions
`compaction.step_triggered` / `.round_degraded` / `.tools_compacted` / `.llm_summarize_triggered`	`compaction/step-compactor.ts`	Each of the three compaction tiers firing
`agent.created`	`create-agent.ts`	ToolLoopAgent assembled

When building a dashboard, treat these events as primary keys.

Three deployment modes

Mode A: Grafana Cloud free tier (recommended starting point)

Simplest. Create a Grafana Cloud account, run one Alloy container. Cost: $0/month (50 GB logs + 10k metrics + 50 GB traces).

# docker-compose.yml excerpt
alloy:
  image: grafana/alloy:latest
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - ./config.alloy:/etc/alloy/config.alloy
  command: run /etc/alloy/config.alloy --server.http.listen-addr=0.0.0.0:12345
  environment:
    GRAFANA_CLOUD_LOKI_USER: ${GRAFANA_CLOUD_LOKI_USER}
    GRAFANA_CLOUD_LOKI_TOKEN: ${GRAFANA_CLOUD_LOKI_TOKEN}

config.alloy (River language):

discovery.docker "zapvol" {
  host = "unix:///var/run/docker.sock"
}

loki.source.docker "zapvol" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.zapvol.targets
  forward_to = [loki.process.zapvol.receiver]
}

loki.process "zapvol" {
  forward_to = [loki.write.cloud.receiver]
  stage.json {
    expressions = { event = "event", module = "module", taskId = "taskId" }
  }
  stage.labels {
    values = { event = "", module = "" }  // only low-cardinality fields as labels
  }
}

loki.write "cloud" {
  endpoint {
    url = "https://logs-prod-XX.grafana.net/loki/api/v1/push"
    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USER")
      password = env("GRAFANA_CLOUD_LOKI_TOKEN")
    }
  }
}

Mode B: Self-hosted single VM

One VM runs Loki + Grafana + Alloy. Data lives in object storage (S3 / Cloudflare R2).

Cost: VM + object storage ≈ $10-30/month (depends on log volume).
Ops: you manage Loki retention, Grafana upgrades.
Fits: teams with existing VM infrastructure who don’t want logs leaving the network.

Mode C: Kubernetes

Alloy as a DaemonSet, one per node. Loki as a StatefulSet inside the cluster, or managed Loki.

Cost: depends on cluster size.
Ops: standard K8s operations.
Fits: teams already running K8s.

Strongly recommend Mode A as the starting point — zero infrastructure burden, migrate out if it stops fitting. Loki data can be exported via loki-migrate.

The three modes above describe production deployments. Local dev has a lighter path — no Docker, no Alloy container, pino pushes directly to Loki — see Local Development with Grafana Cloud.

Label cardinality pitfalls (required reading)

Loki’s index size, ingester memory, and query performance depend mostly on the number of distinct label combinations (cardinality, one stream per combination), far more than on raw log volume. The rule:

Field	Label?	Reason
`event`	Yes	Finite enum (a few dozen values)
`module`	Yes	Finite enum
`level`	Yes	Five values
`taskId`	No	High cardinality (millions)
`userId`	No	Medium-high cardinality
`traceId`	No	One per request
Numeric fields (tokens, ratios, …)	No	Continuous

One bad label config can slow Loki down 100× and inflate storage 50×. Rules:

Label = “something I will sum by”; field = “something I will filter on for precise queries.”
Never use ID fields as labels.
Monitor loki_ingester_memory_streams for one week before promoting any new label to production.
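
To see why ID fields are dangerous, consider the upper bound on stream count, which is the product of the distinct values of every label. The sketch below uses made-up counts purely for illustration; `maxStreams` is a hypothetical helper.

```typescript
// Upper bound on Loki streams: the product of distinct values per label.
// The counts below are illustrative assumptions, not measurements.
function maxStreams(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((product, n) => product * n, 1);
}

const safe = maxStreams({ event: 40, module: 10, level: 5 });
const unsafe = maxStreams({ event: 40, module: 10, level: 5, taskId: 1_000_000 });

console.log(safe);   // 2000
console.log(unsafe); // 2000000000
```

The real stream count is lower, because not every combination occurs in practice, but promoting `taskId` still guarantees at least one new stream per task.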

At query time, | json parses fields. The difference:

# Label filter (fast)
{event="stream.step_finished"}

# Field filter (slower, but doesn't contribute to cardinality)
{event="stream.step_finished"} | json | taskId="abc-123"

The two compose. The correct pattern: use label filters to cut the candidate set down to the low millions of log lines, then narrow further with field filters.
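
If queries are generated from code (for example, for dashboard links), the same ordering can be encoded in a small builder. This is a sketch; `buildLogQL` and `quote` are hypothetical helpers, not an existing library API.

```typescript
// Hypothetical helper: builds a LogQL query with label matchers first,
// then field filters applied after the `| json` parser stage.
type Matchers = Record<string, string>;

// Wraps a value in double quotes, escaping backslashes and quotes.
const quote = (value: string): string =>
  `"${value.replace(/\\/g, "\\\\").replace(/"/g, '\\"')}"`;

function buildLogQL(labels: Matchers, fields: Matchers = {}): string {
  const selector = Object.entries(labels)
    .map(([name, value]) => `${name}=${quote(value)}`)
    .join(", ");
  const filters = Object.entries(fields)
    .map(([name, value]) => ` | ${name}=${quote(value)}`)
    .join("");
  // Only add the json stage when there is something to filter on.
  const pipeline = filters ? ` | json${filters}` : "";
  return `{${selector}}${pipeline}`;
}

console.log(buildLogQL({ event: "stream.step_finished" }, { taskId: "abc-123" }));
// {event="stream.step_finished"} | json | taskId="abc-123"
```

Keeping labels and fields as separate arguments makes it hard to place a high-cardinality field in the stream selector by accident. Note that LogQL requires at least one label matcher, so callers should not pass an empty `labels` object.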

When to escalate to Prometheus metrics

Logs are best for “why did this specific thing happen” (“why did this task miss cache?”). Metrics are best for long-term trends + alerting (“p95 cache hit ratio over the past 7 days”).

Escalation signals:

Signal	Action
A dashboard query routinely takes > 30s	Emit that metric via `prom-client`; query it from Prometheus
Need declarative alert rules (“ratio < 0.3 for 5 min”)	Prometheus alerting rules, with Alertmanager for notification routing
Log volume approaching Grafana Cloud free-tier limits	Downgrade debug events to metrics; keep info+ in Loki

Escalation path: no new collector is needed. Alloy supports both prometheus.scrape (pull) and prometheus.remote_write (push), so adding prom-client to the app and a scrape block to config.alloy is sufficient.

When to escalate to OpenTelemetry traces

Traces are best for “one request spans many services”. Zapvol’s agent execution crosses:

Server-side task-orchestrator.ts
The 20+ step loop inside the agent engine
Tool calls reaching into sandbox
BUA sessions crossing WebSocket

If debugging “per-step latency across a whole task” becomes a routine task, traces are an order of magnitude more efficient than logs. Escalation signals:

Investigating latency requires sorting 10+ log lines by timestamp to derive a timeline.
Problems like “step N stalled” appear but logs can’t pinpoint where.
Multi-service cooperation exists but traceId alone can’t reconstruct the full path.

Escalation path: attach an OpenTelemetry logs appender to pino + add tracer.startSpan to hot code paths. Switch the backend from Loki-only to Loki + Tempo.

Typically unnecessary for internal tools. The traceId + timestamp + duration fields in logs are usually enough.
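
As a rough illustration of the log-only approach, the sketch below reconstructs a per-step timeline from log records. The `StepLog` shape and its field names (`step`, `time`, `durationMs`) are assumptions; the actual fields on Zapvol's events may differ.

```typescript
// Hypothetical log record shape; field names are assumptions for illustration.
interface StepLog {
  traceId: string;
  step: number;
  time: number;       // epoch milliseconds when the step finished
  durationMs: number; // how long the step ran
}

interface TimelineEntry {
  step: number;
  startedAt: number;
  durationMs: number;
  gapMs: number; // idle time between the previous step's end and this step's start
}

// Orders one trace's step logs by finish time and computes the gap before each step.
function buildTimeline(logs: StepLog[], traceId: string): TimelineEntry[] {
  const steps = logs
    .filter((l) => l.traceId === traceId)
    .sort((a, b) => a.time - b.time);
  let previousEnd: number | undefined;
  return steps.map((l) => {
    const startedAt = l.time - l.durationMs;
    const gapMs = previousEnd === undefined ? 0 : startedAt - previousEnd;
    previousEnd = l.time;
    return { step: l.step, startedAt, durationMs: l.durationMs, gapMs };
  });
}

const timeline = buildTimeline(
  [
    { traceId: "tr-1", step: 2, time: 5_000, durationMs: 1_500 },
    { traceId: "tr-1", step: 1, time: 2_000, durationMs: 2_000 },
    { traceId: "tr-2", step: 1, time: 3_000, durationMs: 500 },
  ],
  "tr-1",
);
console.log(timeline);
```

A large `gapMs` points at time spent outside any logged step, which is the kind of question traces answer directly. If this script becomes a routine part of debugging, that is itself one of the escalation signals listed above.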

Related pages

Observability Dashboards: Current built-in core dashboards and how to read them
Local Development with Grafana Cloud: Opt-in pino-loki direct push for dev machines
Context Compaction: How the compaction boundary shapes cache breakpoint design (origin of the cache.breakpoints_placed event)
prepareStep Semantics: Where stream.step_finished is emitted