Observability Dashboards

Four core dashboards bundled with Zapvol — LogQL queries, read criteria, and troubleshooting paths when something looks wrong

Design principle for panels

The rule is not “add a panel if it displays cleanly”; each panel must answer one specific question. The narrower the question, the shorter the action path when something breaks.

The four panels in this chapter are all built around the cache overhaul, and not by accident. It is the first place in Zapvol where compaction, cache, and prepareStep are orchestrated together, which makes it both the most error-prone area and the one most worth monitoring.

Each panel is described with a uniform structure:

  1. Question — what specific doubt this panel resolves
  2. Query — LogQL code (paste directly into Grafana Explore)
  3. Healthy range — what normal looks like
  4. Alert signals — what requires urgent investigation
  5. Related events — which log events this panel reads

Panel 1: Cache Hit Ratio Trend

Question

“Is Anthropic prompt cache actually saving money in this deployment?”

This is the primary validation metric for the cache overhaul. Without it, every “breakpoints are correctly placed, compaction boundary is stable” argument is just theory.

Query

avg_over_time(
  {job="zapvol-server", event="stream.step_finished"}
    | json
    | unwrap cacheHitRatio
  [5m]
)

Healthy range

| Phase | Expected ratio |
| --- | --- |
| First step (step 0) | < 10% (first request, nothing to cache) |
| Step 2 onward | >= 60% (system prompt + stable prefix should hit) |
| Late steps (step 5+) in multi-step tasks | 70-85% (ideal range) |
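
As a rough illustration of the table, the sketch below computes a per-step ratio and compares it with the floor the table gives for that step. The formula (cache reads divided by all prompt tokens) and the names `StepUsage`, `cacheHitRatio`, `expectedMinimum`, and `isBelowExpected` are assumptions made for this example; how Zapvol actually derives cacheHitRatio is not shown in this chapter.

```typescript
// Hypothetical shape of one step's token usage (field names are assumptions).
interface StepUsage {
  step: number; // 0-based step index
  uncachedInputTokens: number; // prompt tokens neither read from nor written to cache
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

// Assumed definition: share of all prompt tokens that were served from cache.
function cacheHitRatio(u: StepUsage): number {
  const total = u.uncachedInputTokens + u.cacheReadTokens + u.cacheWriteTokens;
  return total === 0 ? 0 : u.cacheReadTokens / total;
}

// Lower bound taken from the healthy-range table.
// Step 0 has no floor (nothing to cache yet). The table has no row for
// step 1, so this sketch assumes no floor there either.
function expectedMinimum(step: number): number {
  if (step >= 5) return 0.7;
  if (step >= 2) return 0.6;
  return 0;
}

function isBelowExpected(u: StepUsage): boolean {
  return cacheHitRatio(u) < expectedMinimum(u.step);
}
```

For example, a step 3 with 700 of 1,000 prompt tokens read from cache yields a ratio of 0.7 and passes the 0.6 floor.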

Alert signals

| Symptom | Likely cause | Investigation entry |
| --- | --- | --- |
| Persistently < 30% | AI SDK propagation assumption is false, or Anthropic’s 5-minute TTL expires between steps | Check Panel 2’s breakpoint distribution |
| First step > 30% but drops to < 20% | The system prompt is changing (prompt caching broken), or Anthropic cache is disabled on the account | Confirm createCachedInstructions returns the same system prompt across requests |
| Long-term 40-55% plateau | Effective breakpoint count below 4 (Rule 2 not firing) | Panel 2 locates it |
| Some tasks have ratio = 0 | Non-Anthropic provider (OpenAI / Google); cache fields come from raw provider data | Filter on provider != "anthropic" to confirm; this is expected |

Related events

  • stream.step_finished: reads cacheHitRatio
  • Cross-check: cache.breakpoints_placed (design-side) vs stream.step_finished.cacheHitRatio (observation-side)

Panel 2: Breakpoint Placement Distribution

Question

“Are the three applyCacheControl rules (including extraBreakpointAt) firing correctly?”

Panel 1 tells you “is cache working”; Panel 2 tells you “why it is / isn’t”.

Query

sum by (placed_count) (
  count_over_time(
    {job="zapvol-server", event="cache.breakpoints_placed"}
      | json
      | label_format placed_count="{{len .placedAt}}"
    [1h]
  )
)

Output: per 1-hour window, how many times each “breakpoint count” appears.

Healthy range

Anthropic allows 4 cache_control breakpoints per request (system prompt is separate, handled by createCachedInstructions). applyCacheControl places at most 3 in messages[]:

  • Rule 1 (last message)
  • Rule 2 (last user, only when last != user)
  • Rule 3 extraBreakpointAt (compactedPrefixEnd or length/2 fallback)

Expected placed_count distribution:

| placed_count | Scenario | Expected share |
| --- | --- | --- |
| 3 | Compaction fired + last != user (typical agent tool-loop step) | 60-80% |
| 2 | No compaction (short conversation) + last != user, or compaction + last = user | 15-30% |
| 1 | First step, last = user and no compaction | 5-10% |
| 0 | Anomaly: messages is empty | Should be 0 |
| 4+ | Impossible: the marked Set deduplicates, upper bound 3 | Should be 0 |
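
The mapping from scenario to placed_count can be reproduced with a small sketch of the three rules. This is a hypothetical restatement for illustration, not the actual applyCacheControl source: `placeBreakpoints`, `Message`, and the optional `extraBreakpointAt` argument are assumed names, and how the caller picks the extra position (compactedPrefixEnd versus a fallback) is deliberately left out.

```typescript
type Role = "system" | "user" | "assistant" | "tool";

interface Message {
  role: Role;
}

// Returns the message indices that would receive a cache breakpoint.
function placeBreakpoints(messages: Message[], extraBreakpointAt?: number): number[] {
  if (messages.length === 0) return []; // placed_count = 0 (anomaly row)

  // A Set deduplicates indices, which is why the count can never exceed 3.
  const marked = new Set<number>();
  const last = messages.length - 1;

  // Rule 1: always mark the last message.
  marked.add(last);

  // Rule 2: mark the most recent user message, only when last != user.
  if (messages[last].role !== "user") {
    for (let i = last - 1; i >= 0; i--) {
      if (messages[i].role === "user") {
        marked.add(i);
        break;
      }
    }
  }

  // Rule 3: mark the extra position when the caller supplies a valid one.
  if (extraBreakpointAt !== undefined && extraBreakpointAt >= 0 && extraBreakpointAt <= last) {
    marked.add(extraBreakpointAt);
  }

  return Array.from(marked).sort((a, b) => a - b);
}
```

Under these assumptions, a tool-loop step with a compacted prefix yields three indices, the same step without an extra position yields two, and a lone user message yields one.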

Alert signals

| Symptom | Likely cause | Investigation entry |
| --- | --- | --- |
| placed_count = 1 exceeds 30% | In Branch B (last=tool_result) scenarios, Rule 2 is fooled by the pseudo-user: applyCacheControl is being called after reminder injection | Check prepareStep in agent-round.ts: applyCacheControl must run before reminder injection |
| extraBreakpointUsed is persistently false when compaction should have triggered | compactedPrefixEnd = -1 is wrong: appliedCrCount wasn’t incremented in activateMoreRoundReplacements | Check whether compaction.step_triggered events appear; then verify activatedCount in compaction.round_degraded |
| Distribution skewed low (mostly = 1) | applyCacheControl isn’t being called at all, or isAnthropicModel() returns false | Inspect provider config; Gemini / OpenAI paths intentionally don’t hit this |

Related events

  • cache.breakpoints_placed: primary data source
  • compaction.round_degraded: confirms compaction is really consuming fallbacks
  • agent.created: confirms the model is Anthropic

Panel 3: Cache Read / Write Token Ratio

Question

“Is the content we write into cache actually being reused, or wasted?”

cacheWriteTokens is the cost of writing a new prefix into Anthropic’s cache (priced at 1.25× the base input-token rate). cacheReadTokens is the benefit of reading it back (priced at 0.1× the base input-token rate). Define the ratio R = read / write:

  • R >> 1: high cache hit; write once, read many — ideal
  • R ≈ 1: every write is barely reused — orphan writes
  • R < 1: lots of cache writes, few reads — serious waste

Query

sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheReadTokens [5m]))
  /
sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheWriteTokens [5m]))

Healthy range

| R | Verdict |
| --- | --- |
| > 3 | Healthy: each write is read 3+ times |
| 1.5-3 | Acceptable: reads exceed writes, but cache is rewritten frequently |
| 1-1.5 | Borderline: each write is read roughly once before the next overwrites it |
| < 1 | Alert: writes exceed reads; cache strategy has failed |
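
The bands above can be expressed as a small classifier, which is handy when checking a handful of tasks by hand. This is a sketch under assumptions: `classifyReadWriteRatio` and the `"no-data"` verdict are hypothetical names, and because the table lists 1.5 in two bands, the sketch assigns exactly 1.5 to "acceptable".

```typescript
type Verdict = "healthy" | "acceptable" | "borderline" | "alert" | "no-data";

// Classifies R = read / write using the bands from the healthy-range table.
function classifyReadWriteRatio(cacheReadTokens: number, cacheWriteTokens: number): Verdict {
  if (cacheWriteTokens === 0) {
    // No writes in the window: either nothing happened, or everything was a read.
    return cacheReadTokens === 0 ? "no-data" : "healthy";
  }
  const r = cacheReadTokens / cacheWriteTokens;
  if (r > 3) return "healthy";
  if (r >= 1.5) return "acceptable"; // boundary choice: 1.5 counts as acceptable
  if (r >= 1) return "borderline";
  return "alert";
}
```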

Alert signals

| Symptom | Likely cause | Investigation entry |
| --- | --- | --- |
| R < 1 while Panel 1 shows healthy hit ratio | Grafana Cloud free-tier time-window aggregation issue (numerator/denominator rate windows misaligned); may not be real | Switch rate to sum_over_time and re-check |
| R < 1 AND Panel 1 hit ratio is also low | Breakpoint positions drifting (length/2 heuristic / reminder pollution); every step writes a new prefix that’s at a different position next step | Panel 2 locates the drift; revisit reminder injection ordering in code |
| R ≈ 1 persistently | The 5-minute TTL is expiring between steps (slow task cadence); Anthropic cache is passively invalidated | Monitor inter-step intervals; consider batching steps |

Related events

  • stream.step_finished: reads cacheReadTokens + cacheWriteTokens

Panel 4: Top-N Anomalous Tasks

Question

“Which tasks in the past hour showed abnormal cache behavior?”

The first three panels are aggregate views; this panel is the per-task view — listing specific taskIds with low hit ratio, ready to drill into.

Query

{job="zapvol-server", event="stream.step_finished"}
  | json
  | cacheHitRatio < 0.3
  | line_format "taskId={{.taskId}} step={{.step}} ratio={{.cacheHitRatio}} inputTokens={{.inputTokens}}"

Healthy range

First-step cache ratio < 0.3 is normal; don’t overreact. What matters:

  • The same taskId stays < 0.3 at step 3+
  • Many different taskIds cross below 0.3 simultaneously within a time window (systemic, not single-task)
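
Both criteria can be checked offline against exported events. The sketch below is one possible reading of them, with assumed names throughout (`StepEvent`, `timestampMs`, `persistentlyLowTasks`, `lowTasksInWindow`); what counts as “many” tasks is left to the reader, so the second function only returns a count.

```typescript
// Hypothetical parsed form of a stream.step_finished event.
interface StepEvent {
  taskId: string;
  step: number;
  cacheHitRatio: number;
  timestampMs: number;
}

const LOW_RATIO = 0.3; // same cut-off as the query
const LATE_STEP = 3; // "step 3+" from the first criterion

// Tasks whose every event at step 3 or later is below the cut-off.
function persistentlyLowTasks(events: StepEvent[]): string[] {
  const allLow = new Map<string, boolean>();
  for (const e of events) {
    if (e.step < LATE_STEP) continue;
    const soFar = allLow.get(e.taskId) ?? true;
    allLow.set(e.taskId, soFar && e.cacheHitRatio < LOW_RATIO);
  }
  const result: string[] = [];
  allLow.forEach((low, taskId) => {
    if (low) result.push(taskId);
  });
  return result.sort();
}

// Number of distinct tasks with a low ratio inside [startMs, endMs),
// ignoring step 0, where a low ratio is expected.
function lowTasksInWindow(events: StepEvent[], startMs: number, endMs: number): number {
  const ids = new Set<string>();
  for (const e of events) {
    if (e.step === 0) continue;
    const inWindow = e.timestampMs >= startMs && e.timestampMs < endMs;
    if (inWindow && e.cacheHitRatio < LOW_RATIO) ids.add(e.taskId);
  }
  return ids.size;
}
```

A task that recovers at a later step drops out of the first list, which keeps one-off dips from looking like persistent problems.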

Alert signals

| Symptom | Likely cause | Investigation entry |
| --- | --- | --- |
| A single task has persistently low ratio | Non-deterministic content in that task’s system prompt or context (timestamps, random IDs) | Copy traceId; retrieve the task’s cache.breakpoints_placed sequence; verify breakpoint positions are stable |
| Many tasks drop simultaneously in a time window | Version release / config change / Anthropic-side cache issue | Check release timestamp alignment; check Anthropic status page |
| Tasks on a specific model ID have low ratio | That provider doesn’t support cache, or cache pricing/behavior differs | Inspect createCachedInstructions handling for that provider |

Related events

  • stream.step_finished: primary data source
  • After drilling into a taskId, switch to: {traceId="<id>"} | json to retrieve all logs for that task

Workflow for adding a new panel

Step 1: Add the event first, then the panel

Never treat Grafana as a code editor. Panels are views of data, not containers of business logic. First, emit a structured event from Zapvol:

log.info("your_event.name", {
  taskId,
  kind: "...",     // low-cardinality fields
  someMetric: 42,  // numeric fields
});

Run it a few times. In Grafana Explore, confirm {event="your_event.name"} | json returns the expected shape.
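
For the `| json` stage to return a predictable shape, each event needs to reach Loki as a single JSON object per line. The sketch below shows one way such a line could be produced; `formatEvent` and the `ts` and `level` fields are assumptions for illustration and are not Zapvol's actual logger. Whether `event` is also promoted to a stream label depends on the log-shipping pipeline, which is outside this sketch.

```typescript
type FieldValue = string | number | boolean;

// Serializes one structured event as a single JSON line.
// `event` is written last so a stray "event" key in `fields` cannot override it.
function formatEvent(
  event: string,
  fields: Record<string, FieldValue>,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ts: now.toISOString(),
    level: "info",
    ...fields,
    event,
  });
}

// Usage: write the line to stdout for the log shipper to collect.
console.log(formatEvent("your_event.name", { taskId: "t-1", kind: "demo", someMetric: 42 }));
```

Keeping the payload flat (strings, numbers, booleans) means every field shows up as its own label after parsing, which is what the panel queries in this chapter rely on.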

Step 2: Define the four elements of a panel

Before creating the dashboard, write them down (at minimum in the PR description):

  • Question: what does this panel resolve
  • Query: LogQL prototype
  • Healthy range: written so someone without your context can still read “this number should be X”
  • Alert signals: at least 2 actionable investigation paths

If you can’t write these down, the question isn’t clear enough — don’t start drawing the panel.
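
One way to make this checklist hard to skip is to capture the four elements as a typed record and lint it before the dashboard work starts. The sketch below assumes hypothetical names (`PanelSpec`, `AlertSignal`, `missingElements`); nothing like it is stated to exist in the Zapvol repo.

```typescript
interface AlertSignal {
  symptom: string;
  likelyCause: string;
  investigationEntry: string;
}

interface PanelSpec {
  question: string;
  query: string; // LogQL prototype
  healthyRange: string;
  alertSignals: AlertSignal[];
}

// Returns the names of elements that are still missing; an empty array means
// the spec is complete enough to start building the panel.
function missingElements(spec: PanelSpec): string[] {
  const missing: string[] = [];
  if (spec.question.trim() === "") missing.push("question");
  if (spec.query.trim() === "") missing.push("query");
  if (spec.healthyRange.trim() === "") missing.push("healthyRange");

  // Only count signals that name a concrete investigation path.
  const actionable = spec.alertSignals.filter((s) => s.investigationEntry.trim() !== "");
  if (actionable.length < 2) missing.push("alertSignals (need at least 2 actionable)");
  return missing;
}
```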

Step 3: Commit the dashboard JSON to the repo

From Grafana Dashboard → Settings → JSON Model, copy the JSON into:

ops/grafana/dashboards/{domain}-{name}.json

In the same directory’s README.md, add import instructions and read criteria (or link to the corresponding section in this chapter).

Why commit JSON: if a single Grafana instance loses its data, you lose everything. With the JSON committed to the repo, a new engineer can spin up an equivalent Grafana locally in 15 minutes. The JSON is also a good vehicle for code review, since reviewers can diff query changes in a PR.

Step 4: If you need alerting

Use Grafana’s native Alert rules (declarative YAML — not the clicky UI). Commit the declarative rules:

ops/grafana/alerts/cache-hit-degradation.yaml

Alert rules should be extremely restrained. In an agent system, many things “could go wrong” but few “must be handled immediately”. Defaults:

  • for: 15m or longer — brief fluctuations don’t page.
  • Only alert on outcome metrics (cache hit ratio, error rate, latency tail), never on process metrics (breakpoint count, compaction frequency).
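
To see why a long `for` window suppresses brief fluctuations, the sketch below models it in a simplified way: the condition must hold for every sample in an unbroken run ending at the latest sample, and that run must span at least the configured duration. This is an illustrative model, not Grafana's evaluation engine; `Sample` and `breachedFor` are hypothetical names, and the 0.3 threshold in the usage lines is only an example value.

```typescript
interface Sample {
  timestampMs: number;
  value: number; // e.g. cache hit ratio
}

// True when `value` has stayed below `threshold` continuously for at least
// `forMs`, measured back from the most recent sample.
function breachedFor(samples: Sample[], threshold: number, forMs: number): boolean {
  if (samples.length === 0) return false;
  const sorted = samples.slice().sort((a, b) => a.timestampMs - b.timestampMs);
  const latest = sorted[sorted.length - 1];
  if (latest.value >= threshold) return false;

  // Walk backwards to find where the current unbroken breach began.
  let breachStart = latest.timestampMs;
  for (let i = sorted.length - 2; i >= 0; i--) {
    if (sorted[i].value >= threshold) break;
    breachStart = sorted[i].timestampMs;
  }
  return latest.timestampMs - breachStart >= forMs;
}

// Usage: samples every 5 minutes, paging only after 15 minutes below 0.3.
const MINUTE = 60_000;
const sustained: Sample[] = [0.7, 0.2, 0.2, 0.2, 0.2].map((value, i) => ({
  timestampMs: i * 5 * MINUTE,
  value,
}));
console.log(breachedFor(sustained, 0.3, 15 * MINUTE));
```

A single low sample, or a dip that recovers before the window elapses, returns false and therefore would not page anyone.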