Observability Dashboards

Design principle for panels

The rule is not “add a panel if it displays cleanly” but “each panel answers one specific question.” The narrower the question, the shorter the path to action when something breaks.

The four panels in this chapter are all built around the cache overhaul, and not by accident. It is the first place in Zapvol where compaction, cache, and prepareStep are orchestrated together, which makes it the most error-prone part of the system and the one most worth monitoring.

Each panel is described with a uniform structure:

Question — what specific doubt this panel resolves
Query — LogQL code (paste directly into Grafana Explore)
Healthy range — what normal looks like
Alert signals — what requires urgent investigation
Related events — which log events this panel reads

Panel 1: Cache Hit Ratio Trend

Question

“Is Anthropic prompt cache actually saving money in this deployment?”

This is the primary validation metric for the cache overhaul. Without it, every “breakpoints are correctly placed, compaction boundary is stable” argument is just theory.

Query

avg_over_time(
  {job="zapvol-server", event="stream.step_finished"}
    | json
    | unwrap cacheHitRatio
  [5m]
)

Healthy range

Phase	Expected ratio
First step (step 0)	< 10% — first request, nothing to cache
Step 2 onward	>= 60% — system prompt + stable prefix should hit
Late steps (step 5+) in multi-step tasks	70-85% — ideal range
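The table can be encoded as a small classifier, which is handy when the same thresholds need to appear in a test or a script. This is a minimal sketch, not code from Zapvol: `classifyHitRatio` and the verdict names are hypothetical, steps are assumed to be zero-based, and ratios are assumed to be fractions (0.6 = 60%). The table gives no range for step 1, so the sketch declines to judge it, and treating a step-0 ratio of 10% or more as "unexpected" is likewise an assumption.

```typescript
// Hypothetical helper: buckets one step's cacheHitRatio against the table above.
// Assumes zero-based step numbers and ratios expressed as fractions.
type HitVerdict =
  | "expected-cold"
  | "unexpected"
  | "unspecified"
  | "below-range"
  | "healthy"
  | "ideal";

function classifyHitRatio(step: number, ratio: number): HitVerdict {
  if (step === 0) {
    // First request: nothing has been cached yet, so a low ratio is normal.
    return ratio < 0.1 ? "expected-cold" : "unexpected";
  }
  if (step === 1) {
    // The table lists no range for this step, so the sketch does not judge it.
    return "unspecified";
  }
  if (ratio < 0.6) return "below-range";
  // The 70-85% band is only called ideal for late steps (step 5+).
  if (step >= 5 && ratio >= 0.7 && ratio <= 0.85) return "ideal";
  return "healthy";
}
```

A single "below-range" step is not an alert by itself; the alert signals that follow are about persistent patterns.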

Alert signals

Symptom	Likely cause	Investigation entry
Persistently < 30%	AI SDK propagation assumption is false, or Anthropic’s 5-minute TTL expires between steps	Check Panel 2’s breakpoint distribution
First step > 30% but drops to < 20%	The system prompt is changing (prompt caching broken), or Anthropic cache is disabled on the account	Confirm `createCachedInstructions` returns the same system across requests
Long-term 40-55% plateau	Effective breakpoint count below 4 (Rule 2 not firing)	Panel 2 locates it
Some tasks have ratio = 0	Non-Anthropic provider (OpenAI / Google); cache fields are passed through from the provider’s raw response	Filter on `provider != "anthropic"` to confirm; this is expected

Related events

stream.step_finished: reads cacheHitRatio
Cross-check: cache.breakpoints_placed (design-side) vs stream.step_finished.cacheHitRatio (observation-side)

Panel 2: Breakpoint Placement Distribution

Question

“Are the three applyCacheControl rules, including extraBreakpointAt, firing correctly?”

Panel 1 tells you “is cache working”; Panel 2 tells you “why it is / isn’t”.

Query

sum by (placed_count) (
  count_over_time(
    {job="zapvol-server", event="cache.breakpoints_placed"}
      | json
      | label_format placed_count="{{len .placedAt}}"
    [1h]
  )
)

Output: per 1-hour window, how many times each “breakpoint count” appears.

Healthy range

Anthropic allows 4 cache_control breakpoints per request (system prompt is separate, handled by createCachedInstructions). applyCacheControl places at most 3 in messages[]:

Rule 1 (last message)
Rule 2 (last user, only when last != user)
Rule 3 extraBreakpointAt (compactedPrefixEnd or length/2 fallback)

Expected placed_count distribution:

placed_count	Scenario	Expected share
3	Compaction fired + last != user (typical agent tool-loop step)	60-80%
2	No compaction (short conversation) + last != user, or compaction + last = user	15-30%
1	First step, last = user and no compaction	5-10%
0	Anomaly — `messages` is empty	Should be 0
4+	Impossible — `marked: Set` deduplicates, upper bound 3	Should be 0
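To make the expected counts easier to reason about, the three rules can be sketched as follows. This is an illustration under assumptions, not the actual `applyCacheControl`: `sketchBreakpoints` and the `Msg` type are hypothetical, and the caller is assumed to pass `extraBreakpointAt` only when a third breakpoint is wanted (how the real code chooses between compactedPrefixEnd and the length/2 fallback is not modeled here).

```typescript
// Hypothetical stand-in for the placement logic described above.
type Role = "user" | "assistant" | "tool";
interface Msg {
  role: Role;
}

function sketchBreakpoints(messages: Msg[], extraBreakpointAt?: number): number[] {
  const marked = new Set<number>(); // the Set deduplicates overlapping rules
  if (messages.length === 0) return [];

  const last = messages.length - 1;
  marked.add(last); // Rule 1: last message

  if (messages[last]?.role !== "user") {
    // Rule 2: last user message, only when the last message is not a user message
    for (let i = last - 1; i >= 0; i--) {
      if (messages[i]?.role === "user") {
        marked.add(i);
        break;
      }
    }
  }

  // Rule 3: optional extra breakpoint, ignored when out of range (e.g. -1)
  if (
    extraBreakpointAt !== undefined &&
    extraBreakpointAt >= 0 &&
    extraBreakpointAt < messages.length
  ) {
    marked.add(extraBreakpointAt);
  }

  return Array.from(marked).sort((a, b) => a - b);
}

// A typical tool-loop step: user, assistant, tool, assistant, tool
const loop: Msg[] = [
  { role: "user" },
  { role: "assistant" },
  { role: "tool" },
  { role: "assistant" },
  { role: "tool" },
];
const withCompaction = sketchBreakpoints(loop, 1); // indices 0, 1, 4
const withoutCompaction = sketchBreakpoints(loop); // indices 0, 4
```

Because every index goes through one Set, two rules landing on the same message yield a single breakpoint, which is why a count above 3 cannot occur in this model.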

Alert signals

Symptom	Likely cause	Investigation entry
`placed_count = 1` exceeds 30%	In Branch B (last=tool_result) scenarios, Rule 2 is fooled by the pseudo-user — `applyCacheControl` is being called after reminder injection	Check `agent-round.ts` prepareStep: `applyCacheControl` must run before reminder injection
`extraBreakpointUsed` is persistently `false` when compaction should have triggered	`compactedPrefixEnd = -1` is wrong: `appliedCrCount` wasn’t incremented in `activateMoreRoundReplacements`	Check whether `compaction.step_triggered` events appear; then verify the `activatedCount` in `compaction.round_degraded`
Distribution skewed low (mostly = 1)	`applyCacheControl` isn’t being called at all, or `isAnthropicModel()` returns false	Inspect provider config; Gemini / OpenAI paths intentionally don’t hit this

Related events

cache.breakpoints_placed: primary data source
compaction.round_degraded: confirms compaction is really consuming fallbacks
agent.created: confirms the model is Anthropic

Panel 3: Cache Read / Write Token Ratio

Question

“Is the content we write into cache actually being reused, or wasted?”

cacheWriteTokens is the cost of writing a new prefix into Anthropic’s cache (priced at 1.25× the base input-token rate). cacheReadTokens is the benefit of reading it back (priced at 0.1× the base input-token rate). The ratio is R = read / write:

R >> 1: high cache hit; write once, read many — ideal
R ≈ 1: every write is barely reused — orphan writes
R < 1: lots of cache writes, few reads — serious waste
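The two multipliers also allow a quick estimate of what caching is worth in billing terms. The sketch below is arithmetic only, using the 1.25× and 0.1× figures quoted above; `cacheSavings` and `BREAK_EVEN_R` are hypothetical names, and the result is in base input-token equivalents, compared against sending the same tokens uncached.

```typescript
// Multipliers as quoted above, relative to the base input-token rate.
const CACHE_WRITE_MULTIPLIER = 1.25;
const CACHE_READ_MULTIPLIER = 0.1;

// Net savings versus no caching, in base input-token equivalents.
// Each token read from cache saves 0.9; each token written costs an extra 0.25.
function cacheSavings(readTokens: number, writeTokens: number): number {
  const saved = readTokens * (1 - CACHE_READ_MULTIPLIER);
  const surcharge = writeTokens * (CACHE_WRITE_MULTIPLIER - 1);
  return saved - surcharge;
}

// R at which savings are exactly zero: 0.25 / 0.9, roughly 0.28.
const BREAK_EVEN_R = (CACHE_WRITE_MULTIPLIER - 1) / (1 - CACHE_READ_MULTIPLIER);

const equalReadWrite = cacheSavings(1000, 1000); // 900 saved minus 250 surcharge
const writeOnly = cacheSavings(0, 1000); // negative: surcharge with no reuse
```

Under these multipliers the billing break-even sits near R ≈ 0.28, so the bands in this panel can be read as a measure of reuse quality rather than as the point where caching starts to cost more than it saves.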

Query

sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheReadTokens [5m]))
  /
sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheWriteTokens [5m]))

Healthy range

R	Verdict
> 3	Healthy — each write is read 3+ times
1.5-3	Acceptable — reads exceed writes, but cache is rewritten frequently
1-1.5	Borderline — each write is read roughly once before the next overwrites it
< 1	Alert — writes exceed reads; cache strategy has failed

Alert signals

Symptom	Likely cause	Investigation entry
R < 1 while Panel 1 shows healthy hit ratio	Grafana Cloud free-tier time-window aggregation issue (numerator/denominator rate windows misaligned); may not be real	Switch `rate` to `sum_over_time` and re-check
R < 1 AND Panel 1 hit ratio is also low	Breakpoint positions drifting (length/2 heuristic / reminder pollution); every step writes a new prefix that’s at a different position next step	Panel 2 locates the drift; revisit reminder injection ordering in code
R ≈ 1 persistently	The 5-minute TTL is expiring between steps (slow task cadence); Anthropic cache is passively invalidated	Monitor inter-step intervals; consider batching steps

Related events

stream.step_finished: cacheReadTokens + cacheWriteTokens

Panel 4: Top-N Anomalous Tasks

Question

“Which tasks in the past hour showed abnormal cache behavior?”

The first three panels are aggregate views; this panel is the per-task view — listing specific taskIds with low hit ratio, ready to drill into.

Query

{job="zapvol-server", event="stream.step_finished"}
  | json
  | cacheHitRatio < 0.3
  | line_format "taskId={{.taskId}} step={{.step}} ratio={{.cacheHitRatio}} inputTokens={{.inputTokens}}"

Healthy range

First-step cache ratio < 0.3 is normal; don’t overreact. What matters:

The same taskId stays < 0.3 at step 3+
Many different taskIds cross below 0.3 simultaneously within a time window (systemic, not single-task)
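The two conditions above can be checked mechanically over exported step records. This is a sketch under assumptions: `StepRecord` mirrors the fields used in the query plus a hypothetical `timestampMs`, steps are zero-based, and what counts as "many" distinct tasks is left to the caller because the text does not fix a number.

```typescript
// Hypothetical record shape for exported stream.step_finished events.
interface StepRecord {
  taskId: string;
  step: number;
  cacheHitRatio: number;
  timestampMs: number;
}

const LOW_RATIO = 0.3;

// Tasks whose every record at step >= minStep is below LOW_RATIO.
function persistentlyLowTasks(records: StepRecord[], minStep = 3): string[] {
  const lateByTask = new Map<string, number[]>();
  for (const r of records) {
    if (r.step < minStep) continue; // early steps are expected to be cold
    const ratios = lateByTask.get(r.taskId) ?? [];
    ratios.push(r.cacheHitRatio);
    lateByTask.set(r.taskId, ratios);
  }
  const flagged: string[] = [];
  lateByTask.forEach((ratios, taskId) => {
    if (ratios.every((x) => x < LOW_RATIO)) flagged.push(taskId);
  });
  return flagged.sort();
}

// Number of distinct tasks with a low ratio after the first step inside
// the half-open window [startMs, endMs).
function distinctLowTasksInWindow(
  records: StepRecord[],
  startMs: number,
  endMs: number,
): number {
  const tasks = new Set<string>();
  for (const r of records) {
    const inWindow = r.timestampMs >= startMs && r.timestampMs < endMs;
    if (inWindow && r.step > 0 && r.cacheHitRatio < LOW_RATIO) tasks.add(r.taskId);
  }
  return tasks.size;
}
```

A task that recovers at any late step drops out of `persistentlyLowTasks`, which matches the advice not to overreact to a single cold step.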

Alert signals

Symptom	Likely cause	Investigation entry
A single task has persistently low ratio	Non-deterministic content in that task’s system prompt or context (timestamps, random IDs)	Copy `traceId`; retrieve the task’s `cache.breakpoints_placed` sequence; verify breakpoint positions are stable
Many tasks drop simultaneously in a time window	Version release / config change / Anthropic-side cache issue	Check release timestamp alignment; check Anthropic status page
Tasks on a specific model ID have low ratio	That provider doesn’t support cache, or cache pricing/behavior differs	Inspect `createCachedInstructions` handling for that provider

Related events

stream.step_finished: primary data source
After drilling into a taskId, switch to: {traceId="<id>"} | json to retrieve all logs for that task

Workflow for adding a new panel

Step 1: Add the event first, then the panel

Never treat Grafana as a code editor. Panels are views of data, not containers of business logic. First, emit a structured event from Zapvol:

log.info("your_event.name", {
  taskId,
  /* low-cardinality fields */ kind: "...",
  /* numeric fields */ someMetric: 42,
});

Run it a few times. In Grafana Explore, confirm {event="your_event.name"} | json returns the expected shape.
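The same shape check can be done offline against a captured log line before the event ships. The sketch below assumes the payload fields from the example above (`taskId`, `kind`, `someMetric`); `YourEvent`, `isYourEvent`, and `parseLogLine` are hypothetical names, and any envelope fields the logger adds are ignored.

```typescript
// Hypothetical payload type for the example event.
interface YourEvent {
  taskId: string;
  kind: string;
  someMetric: number;
}

// Type guard: true only when every expected field has the expected type.
function isYourEvent(value: unknown): value is YourEvent {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const metric = v["someMetric"];
  return (
    typeof v["taskId"] === "string" &&
    typeof v["kind"] === "string" &&
    typeof metric === "number" &&
    Number.isFinite(metric)
  );
}

// Parses one JSON log line; returns null for malformed or mis-shaped lines.
function parseLogLine(line: string): YourEvent | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return isYourEvent(parsed) ? parsed : null;
  } catch {
    return null;
  }
}

const ok = parseLogLine('{"taskId":"t1","kind":"demo","someMetric":42}');
const wrongType = parseLogLine('{"taskId":"t1","kind":"demo","someMetric":"42"}');
```

Catching a metric that was logged as a string at this stage is cheaper than discovering it when the panel renders nothing.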

Step 2: Define the four elements of a panel

Before creating the dashboard, write them down (at minimum in the PR description):

Question: what does this panel resolve
Query: LogQL prototype
Healthy range: written so someone without your context can still read “this number should be X”
Alert signals: at least 2 actionable investigation paths

If you can’t write these down, the question isn’t clear enough — don’t start drawing the panel.
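One way to keep this checklist honest is to write the four elements as data and lint them. The following is a sketch, not an existing Zapvol utility: `PanelSpec`, `AlertSignal`, and `panelSpecProblems` are hypothetical, and "actionable" is approximated as "has a non-empty investigation entry".

```typescript
// Hypothetical structure for the four elements of a panel.
interface AlertSignal {
  symptom: string;
  likelyCause: string;
  investigationEntry: string;
}

interface PanelSpec {
  question: string;
  query: string;
  healthyRange: string;
  alertSignals: AlertSignal[];
}

// Returns a list of problems; an empty list means the spec is complete.
function panelSpecProblems(spec: PanelSpec): string[] {
  const problems: string[] = [];
  if (spec.question.trim() === "") problems.push("question is empty");
  if (spec.query.trim() === "") problems.push("query is empty");
  if (spec.healthyRange.trim() === "") problems.push("healthy range is empty");
  const actionable = spec.alertSignals.filter(
    (s) => s.investigationEntry.trim() !== "",
  );
  if (actionable.length < 2) problems.push("fewer than 2 actionable alert signals");
  return problems;
}

const draft: PanelSpec = {
  question: "Which tasks showed abnormal cache behavior?",
  query: '{event="stream.step_finished"} | json | cacheHitRatio < 0.3',
  healthyRange: "",
  alertSignals: [],
};
const draftProblems = panelSpecProblems(draft);
```

A draft that still returns problems is a sign the question needs more thought before any panel is drawn.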

Step 3: Commit the dashboard JSON to the repo

From Grafana Dashboard → Settings → JSON Model, copy the JSON into:

ops/grafana/dashboards/{domain}-{name}.json

In the README.md in the same directory, add import instructions and guidance on how to read each panel (or link to the corresponding section in this chapter).

Why commit JSON: if the single Grafana instance loses its data, you lose everything. With the JSON in the repo, a new engineer can spin up an equivalent Grafana locally in 15 minutes. It also makes dashboards reviewable: reviewers can diff query changes in a PR.
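Diffs stay readable only if the committed JSON is stable between exports. A minimal sketch of a normalizing step follows, assuming the exported model carries an instance-specific `id` and a `version` counter that changes on each save; `normalizeDashboard` and `dashboardPath` are hypothetical helpers, and the path template is the one given above.

```typescript
type Json = Record<string, unknown>;

// Builds the repo path for a dashboard from the naming template above.
function dashboardPath(domain: string, name: string): string {
  return `ops/grafana/dashboards/${domain}-${name}.json`;
}

// Produces stable, diff-friendly text for a dashboard JSON model.
// Assumption: `id` is instance-specific and `version` bumps on every save,
// so neither carries meaning in the repo.
function normalizeDashboard(model: Json): string {
  const cleaned: Json = { ...model, id: null };
  delete cleaned["version"];
  // Fixed indentation and a trailing newline keep diffs minimal.
  return JSON.stringify(cleaned, null, 2) + "\n";
}

const exported: Json = {
  id: 17,
  uid: "cache-overview",
  title: "Cache Overview",
  version: 42,
  panels: [],
};
const fileText = normalizeDashboard(exported);
const filePath = dashboardPath("cache", "hit-ratio");
```

With this in place, a PR that only changes a query shows only that query in the diff.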

Step 4: If you need alerting

Use Grafana’s native alert rules, written as declarative YAML rather than clicked together in the UI, and commit them to the repo:

ops/grafana/alerts/cache-hit-degradation.yaml

Alert rules should be extremely restrained. In an agent system, many things “could go wrong” but few “must be handled immediately”. Defaults:

for: 15m or longer — brief fluctuations don’t page.
Only alert on outcome metrics (cache hit ratio, error rate, latency tail), never on process metrics (breakpoint count, compaction frequency).
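The effect of a `for:` duration can be illustrated with a small model: the condition must hold continuously up to the latest evaluation before anything fires. This is a sketch of the idea only, not Grafana's evaluator; `Sample` and `breachHeldFor` are hypothetical, and samples are assumed to be sorted by time.

```typescript
interface Sample {
  timestampMs: number;
  value: number;
}

// True when the most recent unbroken run of below-threshold samples
// spans at least forMs. Any recovery resets the run.
function breachHeldFor(samples: Sample[], threshold: number, forMs: number): boolean {
  let runStart: number | null = null;
  let lastTs: number | null = null;
  for (const s of samples) {
    lastTs = s.timestampMs;
    if (s.value < threshold) {
      if (runStart === null) runStart = s.timestampMs; // breach begins
    } else {
      runStart = null; // recovered: start over
    }
  }
  if (runStart === null || lastTs === null) return false;
  return lastTs - runStart >= forMs;
}

const MINUTE = 60_000;
// Hit ratio stays at 0.2 for minutes 0 through 15.
const sustained: Sample[] = Array.from({ length: 16 }, (_, i) => ({
  timestampMs: i * MINUTE,
  value: 0.2,
}));
const pages = breachHeldFor(sustained, 0.3, 15 * MINUTE);
```

A five-minute dip never satisfies the 15-minute condition in this model, which is the behavior the default above is meant to buy.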

Related chapters

Observability Stack Overview: design rationale for the pino → Alloy → Loki → Grafana pipeline
Compaction Boundary as Cache Anchor: Panels 1, 2, and 3 observe behavior defined here
prepareStep Semantics: where cache.breakpoints_placed is emitted
