Observability Stack Overview
pino → Alloy → Loki → Grafana four-layer pipeline, label cardinality pitfalls, three deployment modes, when to escalate to Prometheus or OpenTelemetry
Why observability is a first-class concern
Agent systems suffer from a particularly sneaky class of bugs: behaviorally correct, economically wrong. The LLM still returns sensible answers, tests still pass, but every step quietly misses cache, every compaction runs more aggressively than it should, every tool call burns 25% more of the token budget than expected. The human eye can’t catch any of this — only a dashboard can.
This chapter describes the stack Zapvol currently runs — pino → Alloy → Loki → Grafana — why these four, when to escalate, and the pitfalls to avoid. It is not a deployment manual; it is the mental model you need to operate this system.
Four-layer pipeline at a glance
Any layer can be swapped, but this combination currently has the best overall value: fully open source, standard protocols, a free tier hosted by Grafana Cloud, and a clean path to scale up.
Why this stack — three alternative designs and why we didn’t choose them
Candidate 1: ELK (Elasticsearch + Logstash + Kibana)
Most mature, most feature-complete. Overkill for Zapvol’s scale:
- Elasticsearch is a full-text indexer — every field gets an inverted index. In a log workload, 90% of fields are never queried; index cost is pure waste.
- Logstash’s JRuby runtime uses 5-10× more memory than Alloy.
- Kibana’s permission model is more complex than a single team needs.
Verdict: suited for “grep everything in prod” full-text scenarios. Zapvol’s log queries are all structured (filter by event, taskId, time range) — the ES strengths don’t apply.
Candidate 2: OpenTelemetry Collector + any backend
Most standardized. But Alloy is itself an OTel-Collector-based distribution, and the differences are:
- Alloy ships the River config language + a visual debugging UI (:12345).
- Alloy has first-class integration with the Grafana ecosystem.
- OTel Collector is more generic but has a rougher config surface.
Verdict: If you’re not leaving the Grafana ecosystem, Alloy is a superset of OTel Collector. Switch only if you need a non-Grafana backend (Datadog, Honeycomb).
Candidate 3: Commercial (Datadog / Honeycomb / Logz.io)
Best UI, best support. Price:
- Datadog pricing is $1.27/GB ingest + $2.50/M indexed events.
- A medium-sized agent task produces 5-10 info + debug logs per step. At the upper end, 10 logs × 20 steps × 100 tasks/day = 20k logs/day = 600k/month, estimated at a few hundred dollars.
- The open-source alternative at the same volume costs < $10 (object storage + small VM).
Verdict: pick when money is loose. Not needed for an internal tool.
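The volume estimate in the Datadog bullet can be reproduced with a few lines of arithmetic. This is a minimal sketch: the per-step, per-task, and per-day figures come from the text above, while the 30-day month and the `Workload` and `monthlyLogLines` names are assumptions made for illustration.

```typescript
// Back-of-the-envelope log volume. Inputs are the figures quoted in the text;
// DAYS_PER_MONTH is an assumption made for this sketch.
const DAYS_PER_MONTH = 30;

interface Workload {
  logsPerStep: number;
  stepsPerTask: number;
  tasksPerDay: number;
}

function monthlyLogLines(w: Workload): number {
  const perDay = w.logsPerStep * w.stepsPerTask * w.tasksPerDay;
  return perDay * DAYS_PER_MONTH;
}

// 5-10 logs per step gives a lower and an upper bound.
const low = monthlyLogLines({ logsPerStep: 5, stepsPerTask: 20, tasksPerDay: 100 });
const high = monthlyLogLines({ logsPerStep: 10, stepsPerTask: 20, tasksPerDay: 100 });
console.log(`${low} to ${high} log lines per month`);
```

Plugging in your own task counts is usually the quickest way to see whether a commercial plan or a free tier fits.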
Zapvol’s existing infrastructure
pino config (apps/server/src/lib/logger.ts)
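import pino from "pino";
import pretty from "pino-pretty";
// isDev is assumed to be defined earlier in this file (not shown in the excerpt).
const pinoLogger = pino(
  { level: process.env.LOG_LEVEL || (isDev ? "debug" : "info") },
  isDev ? pretty({ colorize: true, translateTime: "HH:MM:ss" }) : undefined,
);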
Dev mode uses pino-pretty (colored, readable); production defaults to JSONL on stdout — the format Alloy consumes
natively, zero transformation.
event-first schema
log.info("task.created", { taskId, userId });
log.error("stream.failed", { taskId, err }, "Stream failed");
event is always the first required parameter. This convention runs through @zapvol/backend, @zapvol/server,
and @zapvol/desktop. Consequences:
- event="task.created" precisely filters one class of events in Grafana.
- All event names form an auditable event catalog.
- New contributors are forced to name the thing they’re logging, instead of writing log.info("something happened").
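The event-first convention can be enforced by the logger's own signature. The sketch below is a hypothetical, dependency-free stand-in and not Zapvol's actual wrapper: `createEventLogger`, `Sink`, and `Fields` are names invented here, and a real implementation would delegate to pino instead of an in-memory sink.

```typescript
// Hypothetical sketch of an event-first logger; the real wrapper sits on top of pino.
type Fields = Record<string, unknown>;
type Sink = (line: string) => void;

function createEventLogger(sink: Sink) {
  const write = (level: string, event: string, data: Fields = {}, msg?: string): void => {
    // `event` is a required positional argument, so a line can never omit it.
    // Spreading `data` first ensures a stray `event` key cannot overwrite it.
    const record: Fields = { ...data, level, event };
    if (msg !== undefined) record.msg = msg;
    sink(JSON.stringify(record)); // one JSON object per line (JSONL)
  };
  return {
    info: (event: string, data?: Fields, msg?: string) => write("info", event, data, msg),
    error: (event: string, data?: Fields, msg?: string) => write("error", event, data, msg),
  };
}

// Usage: collect lines in memory instead of writing to stdout.
const lines: string[] = [];
const log = createEventLogger((line) => lines.push(line));
log.info("task.created", { taskId: "t-1", userId: "u-1" });
log.error("stream.failed", { taskId: "t-1" }, "Stream failed");
```

Because the event name is a typed parameter rather than a free-form message, a call without it fails at compile time.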
AsyncLocalStorage injection
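function mergeContext(event, data) {
  const ctx = RequestContext.get();
  // Assumed initialization (elided in the original excerpt): start from the caller's payload.
  const merged = { event, ...data };
  if (ctx?.traceId) merged.traceId = ctx.traceId;
  if (ctx?.userId) merged.userId = ctx.userId;
  // ...
  return merged;
}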
Every log line carries traceId and userId without an explicit parameter. During debugging, filter by traceId to
retrieve all logs for one request — across services, across async boundaries.
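To make the mechanism concrete, here is a minimal sketch of how such a context could be built on Node's `AsyncLocalStorage`. The `RequestContext` object below is a hypothetical stand-in written for this example (only its `get()` call appears in the original); the `run` helper, the `Ctx` type, and `handleRequest` are assumptions.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

interface Ctx {
  traceId?: string;
  userId?: string;
}

// Hypothetical stand-in for RequestContext; only get() is shown in the original.
const storage = new AsyncLocalStorage<Ctx>();
const RequestContext = {
  run<T>(ctx: Ctx, fn: () => T): T {
    return storage.run(ctx, fn);
  },
  get(): Ctx | undefined {
    return storage.getStore();
  },
};

function mergeContext(event: string, data: Record<string, unknown> = {}): Record<string, unknown> {
  const ctx = RequestContext.get();
  const merged: Record<string, unknown> = { event, ...data };
  if (ctx?.traceId) merged.traceId = ctx.traceId;
  if (ctx?.userId) merged.userId = ctx.userId;
  return merged;
}

async function handleRequest(): Promise<Record<string, unknown>> {
  // Cross an async boundary; the store follows the async call chain.
  await new Promise((resolve) => setTimeout(resolve, 1));
  return mergeContext("task.created", { taskId: "t-1" });
}

// Inside run(), logs pick up the IDs; outside, they simply lack them.
const insideRequest = RequestContext.run({ traceId: "trace-1", userId: "u-1" }, handleRequest);
const outsideRequest = mergeContext("task.created");
```

The call site never passes `traceId` or `userId`, which is the point: middleware sets the context once per request and every log line below it inherits the fields.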
Existing key events (excerpt)
| Event name | Location | Business meaning |
|---|---|---|
| task.created / task.completed | apps/server/src/routes/tasks.ts | Task lifecycle |
| stream.messages_prepared | agent-round.ts | prepareInitialMessages output at round start |
| stream.step_finished | agent-round.ts onStepFinish | Per-step usage + cache details |
| cache.breakpoints_placed | agent-round.ts prepareStep | Actual Anthropic cache breakpoint positions |
| compaction.step_triggered / .round_degraded / .tools_compacted / .llm_summarize_triggered | compaction/step-compactor.ts | Each of the three compaction tiers firing |
| agent.created | create-agent.ts | ToolLoopAgent assembled |
When building a dashboard, treat these events as primary keys.
Three deployment modes
Mode A: Grafana Cloud free tier (recommended starting point)
Simplest. Create a Grafana Cloud account, run one Alloy container. Cost: $0/month (50 GB logs + 10k metrics + 50 GB traces).
# docker-compose.yml excerpt
alloy:
  image: grafana/alloy:latest
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock:ro
    - ./config.alloy:/etc/alloy/config.alloy
  command: run /etc/alloy/config.alloy --server.http.listen-addr=0.0.0.0:12345
  environment:
    GRAFANA_CLOUD_LOKI_USER: ${GRAFANA_CLOUD_LOKI_USER}
    GRAFANA_CLOUD_LOKI_TOKEN: ${GRAFANA_CLOUD_LOKI_TOKEN}
config.alloy (River language):
discovery.docker "zapvol" {
  host = "unix:///var/run/docker.sock"
}
loki.source.docker "zapvol" {
  host       = "unix:///var/run/docker.sock"
  targets    = discovery.docker.zapvol.targets
  forward_to = [loki.process.zapvol.receiver]
}
loki.process "zapvol" {
  forward_to = [loki.write.cloud.receiver]
  stage.json {
    expressions = { event = "event", module = "module", taskId = "taskId" }
  }
  stage.labels {
    values = { event = "", module = "" } // only low-cardinality fields as labels
  }
}
loki.write "cloud" {
  endpoint {
    url = "https://logs-prod-XX.grafana.net/loki/api/v1/push"
    basic_auth {
      username = env("GRAFANA_CLOUD_LOKI_USER")
      password = env("GRAFANA_CLOUD_LOKI_TOKEN")
    }
  }
}
Mode B: Self-hosted single VM
One VM runs Loki + Grafana + Alloy. Data lives in object storage (S3 / Cloudflare R2).
- Cost: VM + object storage ≈ $10-30/month (depends on log volume).
- Ops: you manage Loki retention, Grafana upgrades.
- Fits: teams with existing VM infrastructure who don’t want logs leaving the network.
Mode C: Kubernetes
Alloy as a DaemonSet, one per node. Loki as a StatefulSet inside the cluster, or managed Loki.
- Cost: depends on cluster size.
- Ops: standard K8s operations.
- Fits: teams already running K8s.
Strongly recommend Mode A as the starting point — zero infrastructure burden, migrate out if it stops fitting. Loki
data can be exported via loki-migrate.
The three modes above describe production deployments. Local dev has a lighter path — no Docker, no Alloy container, pino pushes directly to Loki — see Local Development with Grafana Cloud.
Label cardinality pitfalls (required reading)
Loki’s index size and query performance are driven mainly by the number of distinct label combinations (cardinality, one stream per combination), not by raw log volume. The rule:
| Field | Label? | Reason |
|---|---|---|
| event | Yes | Finite enum (a few dozen values) |
| module | Yes | Finite enum |
| level | Yes | Five values |
| taskId | No | High cardinality (millions) |
| userId | No | Medium-high cardinality |
| traceId | No | One per request |
| Numeric fields (tokens, ratios, …) | No | Continuous |
One bad label config can slow Loki down 100× and inflate storage 50×. Rules:
- Label = “something I will sum by”; field = “something I will filter on for precise queries.”
- Never use ID fields as labels.
- Monitor loki_ingester_memory_streams for one week before promoting any new label to production.
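The rules above follow from simple multiplication: each distinct combination of label values is its own stream, so the worst-case stream count is the product of the distinct values per label. The sketch below illustrates that upper bound. The specific counts (40 events, 10 modules, one million task IDs) are illustrative assumptions, not measured Zapvol numbers, and real stream counts are usually lower because not every combination occurs.

```typescript
// Worst-case stream count: the product of distinct values across all labels.
function maxStreams(distinctValuesPerLabel: Record<string, number>): number {
  return Object.values(distinctValuesPerLabel).reduce((product, n) => product * n, 1);
}

// Illustrative counts only (assumptions for this sketch).
const safe = maxStreams({ event: 40, module: 10, level: 5 });
const withTaskId = maxStreams({ event: 40, module: 10, level: 5, taskId: 1000000 });

console.log(`low-cardinality labels: up to ${safe} streams`);
console.log(`after adding taskId:    up to ${withTaskId} streams`);
```

A single ID-like label multiplies the bound by its own cardinality, which is why it belongs in the log body and not in the label set.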
At query time, | json parses fields. The difference:
# Label filter (fast)
{event="stream.step_finished"}
# Field filter (slower, but doesn't contribute to cardinality)
{event="stream.step_finished"} | json | taskId="abc-123"
The two compose. The correct pattern is to use label matchers to cut the candidate set to the low millions of lines, then narrow with field filters.
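If queries are generated in code (for example by a debugging script), the label-then-field pattern can be kept explicit by passing the two kinds of filters separately. This is a hypothetical helper, not part of Zapvol; `buildLogQL` and `Matchers` are invented names, and `JSON.stringify` is used as an approximation of LogQL string quoting that is adequate for simple values.

```typescript
type Matchers = Record<string, string>;

// Render key="value" pairs; JSON.stringify supplies the double quotes and escaping.
const render = (m: Matchers): string[] =>
  Object.entries(m).map(([key, value]) => `${key}=${JSON.stringify(value)}`);

// Labels go inside the stream selector; fields are filtered after `| json`.
function buildLogQL(labels: Matchers, fields: Matchers = {}): string {
  const selector = `{${render(labels).join(", ")}}`;
  const fieldFilters = render(fields);
  if (fieldFilters.length === 0) return selector;
  return `${selector} | json | ${fieldFilters.join(" | ")}`;
}

const query = buildLogQL({ event: "stream.step_finished" }, { taskId: "abc-123" });
console.log(query);
```

Keeping the two arguments separate makes it harder to promote an ID to a label by accident.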
When to escalate to Prometheus metrics
Logs are best for “why did this specific thing happen” (“why did this task miss cache?”). Metrics are best for long-term trends + alerting (“p95 cache hit ratio over the past 7 days”).
Escalation signals:
| Signal | Action |
|---|---|
| A dashboard query routinely takes > 30s | Emit that metric via prom-client; query it from Prometheus |
| Need declarative alert rules (“ratio < 0.3 for 5 min”) | Prometheus alerting rules, routed through Alertmanager |
| Log volume approaching Grafana Cloud free-tier limits | Downgrade debug events to metrics; keep info+ in Loki |
Escalation path: no change to Alloy. Alloy supports both prometheus.scrape (pull) and prometheus.remote_write
(push). Adding prom-client to the app is sufficient.
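For orientation, this is roughly what a scrape target exposes once a metric exists. The sketch hand-rolls a counter purely to show the Prometheus text exposition format; in practice prom-client generates this output and the class below would not be written by hand. The metric name `zapvol_steps_total`, its `cache` label, and the `StepCounter` class are hypothetical.

```typescript
// Hand-rolled counter for illustration only; prom-client provides the real thing.
class StepCounter {
  private readonly name: string;
  private readonly help: string;
  private readonly values = new Map<string, number>();

  constructor(name: string, help: string) {
    this.name = name;
    this.help = help;
  }

  // Increment the series identified by this label set.
  inc(labels: Record<string, string> = {}): void {
    const key = Object.keys(labels)
      .sort()
      .map((k) => `${k}=${JSON.stringify(labels[k])}`)
      .join(",");
    this.values.set(key, (this.values.get(key) ?? 0) + 1);
  }

  // Render the text format a /metrics endpoint would serve.
  render(): string {
    const out = [`# HELP ${this.name} ${this.help}`, `# TYPE ${this.name} counter`];
    this.values.forEach((value, key) => {
      out.push(key ? `${this.name}{${key}} ${value}` : `${this.name} ${value}`);
    });
    return out.join("\n") + "\n";
  }
}

const steps = new StepCounter("zapvol_steps_total", "Agent steps finished, by cache outcome.");
steps.inc({ cache: "hit" });
steps.inc({ cache: "hit" });
steps.inc({ cache: "miss" });
const exposition = steps.render();
console.log(exposition);
```

Note that the label here has two possible values, so the same cardinality discipline that applies to Loki labels applies to metric labels.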
When to escalate to OpenTelemetry traces
Traces are best for “one request spans many services”. Zapvol’s agent execution crosses:
- Server-side task-orchestrator.ts
- The 20+ step loop inside the agent engine
- Tool calls reaching into sandbox
- BUA sessions crossing WebSocket
If debugging “per-step latency across a whole task” becomes a routine task, traces are an order of magnitude more efficient than logs. Escalation signals:
- Investigating latency requires sorting 10+ log lines by timestamp to derive a timeline.
- Problems like “step N stalled” appear but logs can’t pinpoint where.
- Multi-service cooperation exists but traceId alone can’t reconstruct the full path.
Escalation path: attach an OpenTelemetry logs appender to pino + add tracer.startSpan to hot code paths. Switch the
backend from Loki-only to Loki + Tempo.
Typically unnecessary for internal tools. The traceId + timestamp + duration fields in logs are usually enough.
Related chapters
- Observability Dashboards: current built-in core dashboards and how to read them
- Local Development with Grafana Cloud: opt-in pino-loki direct push for dev machines
- Context Compaction: how the compaction boundary shapes cache breakpoint design (origin of the cache.breakpoints_placed event)
- prepareStep Semantics: where stream.step_finished is emitted