Observability Dashboards
Four core dashboards bundled with Zapvol — LogQL queries, read criteria, and troubleshooting paths when something looks wrong
Design principle for panels
The rule is not “add it if it displays cleanly”; each panel must answer one specific question. The narrower the question, the shorter the action path when something breaks.
All four panels in this chapter are built around the cache overhaul, and deliberately so: it is the first place in Zapvol where compaction, cache, and prepareStep are orchestrated together, which makes it the most error-prone area and the one most worth monitoring.
Each panel is described with a uniform structure:
- Question — what specific doubt this panel resolves
- Query — LogQL code (paste directly into Grafana Explore)
- Healthy range — what normal looks like
- Alert signals — what requires urgent investigation
- Related events — which log events this panel reads
Panel 1: Cache Hit Ratio Trend
Question
“Is Anthropic prompt cache actually saving money in this deployment?”
This is the primary validation metric for the cache overhaul. Without it, every “breakpoints are correctly placed, compaction boundary is stable” argument is just theory.
Query
avg_over_time(
{job="zapvol-server", event="stream.step_finished"}
| json
| unwrap cacheHitRatio
[5m]
)
Healthy range
| Phase | Expected ratio |
|---|---|
| First step (step 0) | < 10% — first request, nothing to cache |
| Step 2 onward | >= 60% — system prompt + stable prefix should hit |
| Late steps (step 5+) in multi-step tasks | 70-85% — ideal range |
Alert signals
| Symptom | Likely cause | Investigation entry |
|---|---|---|
| Persistently < 30% | AI SDK propagation assumption is false, or Anthropic’s 5-minute TTL expires between steps | Check Panel 2’s breakpoint distribution |
| First step > 30% but drops to < 20% | The system prompt is changing (prompt caching broken), or Anthropic cache is disabled on the account | Confirm createCachedInstructions returns the same system across requests |
| Long-term 40-55% plateau | Effective breakpoint count below 4 (Rule 2 not firing) | Panel 2 locates it |
| Some tasks have ratio = 0 | Non-Anthropic provider (OpenAI / Google); cache fields come from the raw provider response | Filter on provider != "anthropic"; this is expected |
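The alert table above can be encoded as a small triage helper. This is a minimal sketch, not Zapvol code: the names `StepSample` and `triageHitRatio` are hypothetical, "persistently" is read as "every sample from step 2 onward in the window", and ratios are assumed to be fractions in [0, 1] (matching the 0.3 filter used in Panel 4). The thresholds themselves come from the table.

```typescript
// Hypothetical sample shape: one stream.step_finished event reduced to the fields used here.
interface StepSample {
  step: number;          // step index within the task, 0 = first step
  cacheHitRatio: number; // fraction in [0, 1]
  provider: string;      // e.g. "anthropic", "openai", "google"
}

type Triage =
  | "expected-non-anthropic"           // ratio = 0 on a provider without these cache fields
  | "check-system-prompt-stability"    // first step > 30%, later steps < 20%
  | "check-breakpoint-distribution"    // persistently < 30%
  | "check-effective-breakpoint-count" // 40-55% plateau
  | "no-alert";

function triageHitRatio(samples: StepSample[]): Triage {
  if (samples.length === 0) return "no-alert";

  // A zero ratio on non-Anthropic providers is expected, not an alert.
  if (samples.every((s) => s.provider !== "anthropic" && s.cacheHitRatio === 0)) {
    return "expected-non-anthropic";
  }

  const first = samples.find((s) => s.step === 0);
  const later = samples.filter((s) => s.step >= 2); // "Step 2 onward" in the healthy-range table
  if (later.length === 0) return "no-alert";

  if (first !== undefined && first.cacheHitRatio > 0.3 && later.every((s) => s.cacheHitRatio < 0.2)) {
    return "check-system-prompt-stability";
  }
  if (later.every((s) => s.cacheHitRatio < 0.3)) {
    return "check-breakpoint-distribution";
  }
  if (later.every((s) => s.cacheHitRatio >= 0.4 && s.cacheHitRatio <= 0.55)) {
    return "check-effective-breakpoint-count";
  }
  return "no-alert";
}
```

The order of the checks matters: the system-prompt case is tested before the generic "persistently low" case because it is the more specific symptom.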
Related events
- stream.step_finished: reads cacheHitRatio
- Cross-check: cache.breakpoints_placed (design-side) vs stream.step_finished.cacheHitRatio (observation-side)
Panel 2: Breakpoint Placement Distribution
Question
“Are the three applyCacheControl rules + extraBreakpointAt firing correctly?”
Panel 1 tells you “is cache working”; Panel 2 tells you “why it is / isn’t”.
Query
sum by (placed_count) (
count_over_time(
{job="zapvol-server", event="cache.breakpoints_placed"}
| json
| label_format placed_count="{{len .placedAt}}"
[1h]
)
)
Output: per 1-hour window, how many times each “breakpoint count” appears.
Healthy range
Anthropic allows 4 cache_control breakpoints per request (the system prompt is handled separately by createCachedInstructions). applyCacheControl places at most 3 in messages[]:
- Rule 1 (last message)
- Rule 2 (last user, only when last != user)
- Rule 3: extraBreakpointAt (compactedPrefixEnd, or length/2 fallback)
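The three rules above can be sketched as a pure function over message roles. This is an illustration under stated assumptions, not the real applyCacheControl: `Msg` and `placeBreakpoints` are hypothetical names, the function returns indices instead of mutating messages, and the choice between compactedPrefixEnd and the length/2 fallback is left to the caller, who passes the result as `extraBreakpointAt`.

```typescript
// Hypothetical message shape; only the role matters for placement.
interface Msg {
  role: "user" | "assistant" | "tool";
}

// Returns the sorted message indices that would receive a cache_control breakpoint.
// Pass undefined (or a negative value such as -1) when there is no extra breakpoint.
function placeBreakpoints(messages: Msg[], extraBreakpointAt?: number): number[] {
  if (messages.length === 0) return [];

  // A Set deduplicates indices, so the result can never exceed 3 entries.
  const marked = new Set<number>();
  const last = messages.length - 1;

  // Rule 1: the last message.
  marked.add(last);

  // Rule 2: the last user message, only when the last message is not a user message.
  if (messages[last].role !== "user") {
    for (let i = last - 1; i >= 0; i--) {
      if (messages[i].role === "user") {
        marked.add(i);
        break;
      }
    }
  }

  // Rule 3: the extra breakpoint, when it points at a valid index.
  if (extraBreakpointAt !== undefined && extraBreakpointAt >= 0 && extraBreakpointAt <= last) {
    marked.add(extraBreakpointAt);
  }

  return Array.from(marked).sort((a, b) => a - b);
}
```

With a typical tool-loop history (user, assistant, tool, user, assistant, tool) and an extra breakpoint at index 1, all three rules fire on distinct indices; when the extra breakpoint coincides with another rule's index, the count drops by one.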
Expected placed_count distribution:
| placed_count | Scenario | Expected share |
|---|---|---|
| 3 | Compaction fired + last != user (typical agent tool-loop step) | 60-80% |
| 2 | No compaction (short conversation) + last != user, or compaction + last = user | 15-30% |
| 1 | First step, last = user and no compaction | 5-10% |
| 0 | Anomaly — messages is empty | Should be 0 |
| 4+ | Impossible: the marked Set deduplicates, so the upper bound is 3 | Should be 0 |
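To compare a captured sample against the expected shares, the distribution can be computed offline from exported events. A minimal sketch, assuming each cache.breakpoints_placed event carries a `placedAt` array of message indices; the type and function names are hypothetical.

```typescript
// Hypothetical reduced form of a cache.breakpoints_placed event.
interface BreakpointsPlaced {
  placedAt: number[]; // message indices that received a breakpoint
}

// Share of events per placed_count, e.g. { 2: 0.25, 3: 0.75 }.
function placedCountShares(events: BreakpointsPlaced[]): Record<number, number> {
  const counts = new Map<number, number>();
  for (const e of events) {
    const n = e.placedAt.length;
    counts.set(n, (counts.get(n) ?? 0) + 1);
  }
  const shares: Record<number, number> = {};
  counts.forEach((count, n) => {
    shares[n] = count / events.length;
  });
  return shares;
}

// Number of events in the rows the table marks as "Should be 0":
// nothing placed, or more than 3 placed.
function anomalousPlacementCount(events: BreakpointsPlaced[]): number {
  return events.filter((e) => e.placedAt.length === 0 || e.placedAt.length > 3).length;
}
```

Any non-zero result from `anomalousPlacementCount` is worth investigating before looking at the shares themselves.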
Alert signals
| Symptom | Likely cause | Investigation entry |
|---|---|---|
| placed_count = 1 exceeds 30% | In Branch B (last = tool_result) scenarios, Rule 2 is fooled by the pseudo-user: applyCacheControl is being called after reminder injection | Check prepareStep in agent-round.ts: applyCacheControl must run before reminder injection |
| extraBreakpointUsed stays false when compaction should have triggered | compactedPrefixEnd = -1 is wrong: appliedCrCount wasn’t incremented in activateMoreRoundReplacements | Check whether compaction.step_triggered events appear, then verify activatedCount in compaction.round_degraded |
| Distribution skewed low (mostly = 1) | applyCacheControl isn’t being called at all, or isAnthropicModel() returns false | Inspect provider config; Gemini / OpenAI paths intentionally don’t hit this |
Related events
- cache.breakpoints_placed: primary data source
- compaction.round_degraded: confirms compaction is really consuming fallbacks
- agent.created: confirms the model is Anthropic
Panel 3: Cache Read / Write Token Ratio
Question
“Is the content we write into cache actually being reused, or wasted?”
cacheWriteTokens is the cost of writing a new prefix into Anthropic’s cache (priced at 1.25× input token).
cacheReadTokens is the benefit of reading it back (priced at 0.1× input token). Ratio R = read / write:
- R >> 1: high cache hit; write once, read many — ideal
- R ≈ 1: every write is barely reused — orphan writes
- R < 1: lots of cache writes, few reads — serious waste
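The two price multipliers can be turned into a rough saving estimate. This is a back-of-the-envelope sketch, assuming the 1.25× and 0.1× multipliers stated above and that the same tokens would be billed at 1× without caching; `netCacheSaving` is a hypothetical helper, not part of Zapvol.

```typescript
// Price multipliers relative to a plain input token, as stated above.
const WRITE_MULTIPLIER = 1.25;
const READ_MULTIPLIER = 0.1;

// Net saving in input-token equivalents, compared with sending the same tokens uncached.
// Positive means caching was cheaper; negative means it cost more.
function netCacheSaving(cacheReadTokens: number, cacheWriteTokens: number): number {
  const uncachedCost = cacheReadTokens + cacheWriteTokens;
  const cachedCost = READ_MULTIPLIER * cacheReadTokens + WRITE_MULTIPLIER * cacheWriteTokens;
  return uncachedCost - cachedCost;
}
```

Under these assumptions, each token read from cache saves 0.9 of an input token and each token written costs an extra 0.25, so a prefix that is written but never read is a pure loss.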
Query
sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheReadTokens [5m]))
/
sum(rate({job="zapvol-server", event="stream.step_finished"} | json | unwrap cacheWriteTokens [5m]))
Healthy range
| R | Verdict |
|---|---|
| > 3 | Healthy — each write is read 3+ times |
| 1.5-3 | Acceptable — reads exceed writes, but cache is rewritten frequently |
| 1-1.5 | Borderline — each write is read roughly once before the next overwrites it |
| < 1 | Alert — writes exceed reads; cache strategy has failed |
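The verdict table can be applied to exported events with a few lines of code. A minimal sketch: `StepUsage`, `readWriteRatio`, and `verdictFor` are hypothetical names, the ratio is computed over summed tokens in the window (as the query does), and a window with no writes is reported separately rather than dividing by zero.

```typescript
// Hypothetical reduced form of a stream.step_finished event.
interface StepUsage {
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

type Verdict = "healthy" | "acceptable" | "borderline" | "alert" | "no-writes";

// R = total read tokens / total write tokens over the window; null when nothing was written.
function readWriteRatio(steps: StepUsage[]): number | null {
  let read = 0;
  let write = 0;
  for (const s of steps) {
    read += s.cacheReadTokens;
    write += s.cacheWriteTokens;
  }
  return write === 0 ? null : read / write;
}

// Thresholds follow the table: > 3, 1.5-3, 1-1.5, < 1.
function verdictFor(r: number | null): Verdict {
  if (r === null) return "no-writes";
  if (r > 3) return "healthy";
  if (r >= 1.5) return "acceptable";
  if (r >= 1) return "borderline";
  return "alert";
}
```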
Alert signals
| Symptom | Likely cause | Investigation entry |
|---|---|---|
| R < 1 while Panel 1 shows healthy hit ratio | Grafana Cloud free-tier time-window aggregation issue (numerator/denominator rate windows misaligned); may not be real | Switch rate to sum_over_time and re-check |
| R < 1 AND Panel 1 hit ratio is also low | Breakpoint positions drifting (length/2 heuristic / reminder pollution); every step writes a new prefix that’s at a different position next step | Panel 2 locates the drift; revisit reminder injection ordering in code |
| R ≈ 1 persistently | The 5-minute TTL is expiring between steps (slow task cadence); Anthropic cache is passively invalidated | Monitor inter-step intervals; consider batching steps |
Related events
- stream.step_finished: cacheReadTokens + cacheWriteTokens
Panel 4: Top-N Anomalous Tasks
Question
“Which tasks in the past hour showed abnormal cache behavior?”
The first three panels are aggregate views; this panel is the per-task view, listing specific taskIds with a low hit ratio, ready to drill into.
Query
{job="zapvol-server", event="stream.step_finished"}
| json
| cacheHitRatio < 0.3
| line_format "taskId={{.taskId}} step={{.step}} ratio={{.cacheHitRatio}} inputTokens={{.inputTokens}}"
Healthy range
First-step cache ratio < 0.3 is normal; don’t overreact. What matters:
- The same taskId stays < 0.3 at step 3+
- Many different taskIds cross below 0.3 simultaneously within a time window (systemic, not single-task)
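Both criteria can be checked mechanically on the rows the query returns. A minimal sketch with hypothetical names; the 0.3 threshold and the step 3+ cutoff come from the text, while `minTasks = 5` is an arbitrary illustrative cutoff for "many different taskIds", not a Zapvol default.

```typescript
// Hypothetical reduced form of a stream.step_finished event.
interface StepFinished {
  taskId: string;
  step: number;
  cacheHitRatio: number; // fraction in [0, 1]
}

// Task IDs whose ratio is still below the threshold at step >= minStep.
function lowRatioTasks(events: StepFinished[], threshold = 0.3, minStep = 3): string[] {
  const ids = new Set<string>();
  for (const e of events) {
    if (e.step >= minStep && e.cacheHitRatio < threshold) {
      ids.add(e.taskId);
    }
  }
  return Array.from(ids).sort();
}

// Treats the window as systemic when at least minTasks distinct tasks are affected.
function isSystemic(events: StepFinished[], minTasks = 5): boolean {
  return lowRatioTasks(events).length >= minTasks;
}
```

First-step rows are ignored by construction, which matches the guidance not to overreact to a low ratio on the first step.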
Alert signals
| Symptom | Likely cause | Investigation entry |
|---|---|---|
| A single task has persistently low ratio | Non-deterministic content in that task’s system prompt or context (timestamps, random IDs) | Copy traceId; retrieve the task’s cache.breakpoints_placed sequence; verify breakpoint positions are stable |
| Many tasks drop simultaneously in a time window | Version release / config change / Anthropic-side cache issue | Check release timestamp alignment; check Anthropic status page |
| Tasks on a specific model ID have low ratio | That provider doesn’t support cache, or cache pricing/behavior differs | Inspect createCachedInstructions handling for that provider |
Related events
- stream.step_finished: primary data source
- After drilling into a taskId, switch to {traceId="<id>"} | json to retrieve all logs for that task
Workflow for adding a new panel
Step 1: Add the event first, then the panel
Never treat Grafana as a code editor. Panels are views of data, not containers of business logic. First, emit a structured event from Zapvol:
log.info("your_event.name", {
taskId,
/* low-cardinality fields */ kind: "...",
/* numeric fields */ someMetric: 42,
});
Run it a few times. In Grafana Explore, confirm {event="your_event.name"} | json returns the expected shape.
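For local experiments, the shape of such an event can be previewed without the real logger. The sketch below is a hypothetical stand-in, not Zapvol's logger: it assumes one JSON object per line with the event name in an `event` field. How that field becomes a Loki label is pipeline configuration and is outside this sketch.

```typescript
// Field values kept to primitives so that `| json` yields flat, queryable keys.
type EventFields = Record<string, string | number | boolean>;

// Serializes one structured event as a single JSON line.
// The event name is written last so a stray "event" key in fields cannot override it.
function formatEvent(eventName: string, fields: EventFields): string {
  return JSON.stringify({ level: "info", ...fields, event: eventName });
}

// Hypothetical stand-in for the logger used in the snippet above.
const sketchLog = {
  info(eventName: string, fields: EventFields): void {
    console.log(formatEvent(eventName, fields));
  },
};

sketchLog.info("your_event.name", { taskId: "task-123", kind: "example", someMetric: 42 });
```

Keeping label-like fields low-cardinality and metrics numeric, as in the original snippet, is what makes the later `unwrap` and `sum by` queries practical.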
Step 2: Define the four elements of a panel
Before creating the dashboard, write them down (at minimum in the PR description):
- Question: what does this panel resolve
- Query: LogQL prototype
- Healthy range: written so someone without your context can still read “this number should be X”
- Alert signals: at least 2 actionable investigation paths
If you can’t write these down, the question isn’t clear enough — don’t start drawing the panel.
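One way to make the four elements hard to skip is to write them as data and check them before the panel exists. A minimal sketch; `PanelSpec` and `missingElements` are hypothetical names, and the only rule taken from the text is the minimum of 2 alert signals.

```typescript
interface AlertSignal {
  symptom: string;
  likelyCause: string;
  investigationEntry: string;
}

interface PanelSpec {
  question: string;
  query: string;        // LogQL prototype
  healthyRange: string; // readable without the author's context
  alertSignals: AlertSignal[];
}

// Returns the names of elements that are missing or too thin; empty means ready to draw.
function missingElements(spec: PanelSpec): string[] {
  const problems: string[] = [];
  if (spec.question.trim() === "") problems.push("question");
  if (spec.query.trim() === "") problems.push("query");
  if (spec.healthyRange.trim() === "") problems.push("healthyRange");
  if (spec.alertSignals.length < 2) problems.push("alertSignals (need at least 2)");
  return problems;
}
```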
Step 3: Commit the dashboard JSON to the repo
From Grafana Dashboard → Settings → JSON Model, copy the JSON into:
ops/grafana/dashboards/{domain}-{name}.json
In the same directory’s README.md, add import instructions and read criteria (or link to the corresponding section in this chapter).
Why commit JSON: if a single Grafana instance loses its data, you lose everything. With the JSON committed to the repo, a new engineer can spin up an equivalent Grafana locally in 15 minutes. It also makes dashboards reviewable: reviewers can diff query changes in a PR.
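Raw dashboard JSON diffs can be noisy, so a reviewer may prefer to compare only the queries. The sketch below pulls them out of a dashboard object. It assumes the common JSON model shape of `panels[].targets[].expr`, with collapsed rows nesting their panels under `panels`; other data sources may store the query under a different key, and all type and function names here are hypothetical.

```typescript
interface DashTarget {
  expr?: string;
  refId?: string;
}

interface DashPanel {
  title?: string;
  targets?: DashTarget[];
  panels?: DashPanel[]; // collapsed rows nest their panels here
}

interface DashboardModel {
  title?: string;
  panels?: DashPanel[];
}

// Returns one "panel title: query" line per target, in document order.
function extractQueries(dashboard: DashboardModel): string[] {
  const lines: string[] = [];
  const visit = (panels: DashPanel[] | undefined): void => {
    for (const p of panels ?? []) {
      for (const t of p.targets ?? []) {
        if (t.expr) lines.push(`${p.title ?? "(untitled)"}: ${t.expr}`);
      }
      visit(p.panels);
    }
  };
  visit(dashboard.panels);
  return lines;
}
```

Running this on the old and new versions of a committed dashboard file and diffing the two outputs shows query changes without layout noise.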
Step 4: If you need alerting
Use Grafana’s native Alert rules (declarative YAML — not the clicky UI). Commit the declarative rules:
ops/grafana/alerts/cache-hit-degradation.yaml
Alert rules should be extremely restrained. In an agent system, many things “could go wrong” but few “must be handled immediately”. Defaults:
- for: 15m or longer, so brief fluctuations don’t page.
- Only alert on outcome metrics (cache hit ratio, error rate, latency tail), never on process metrics (breakpoint count, compaction frequency).
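These two defaults are simple enough to lint in CI before a rule is merged. A minimal sketch under stated assumptions: `AlertRuleSummary` is a hypothetical reduced view of a rule (not Grafana's provisioning schema), the metric names in `OUTCOME_METRICS` are illustrative, and the duration parser only handles single-unit values such as "15m" or "1h".

```typescript
// Illustrative names for the outcome metrics listed above.
const OUTCOME_METRICS = new Set<string>(["cacheHitRatio", "errorRate", "latencyTail"]);
const MIN_FOR_SECONDS = 15 * 60;

// Parses single-unit durations ("90s", "15m", "1h") into seconds; null when unparseable.
function parseDurationSeconds(duration: string): number | null {
  const match = /^(\d+)(s|m|h)$/.exec(duration);
  if (match === null) return null;
  const unitSeconds: Record<string, number> = { s: 1, m: 60, h: 3600 };
  return Number(match[1]) * unitSeconds[match[2]];
}

// Hypothetical reduced view of one alert rule.
interface AlertRuleSummary {
  metric: string;
  for: string;
}

// Returns a list of violations of the two defaults; empty means the rule passes.
function lintAlertRule(rule: AlertRuleSummary): string[] {
  const problems: string[] = [];
  const seconds = parseDurationSeconds(rule.for);
  if (seconds === null) {
    problems.push(`unparseable for: ${rule.for}`);
  } else if (seconds < MIN_FOR_SECONDS) {
    problems.push("for is shorter than 15m");
  }
  if (!OUTCOME_METRICS.has(rule.metric)) {
    problems.push(`${rule.metric} is not an outcome metric`);
  }
  return problems;
}
```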
Related chapters
- Observability Stack Overview: design rationale for the pino → Alloy → Loki → Grafana pipeline
- Compaction Boundary as Cache Anchor: Panels 1, 2, and 3 observe behavior defined here
- prepareStep Semantics: where cache.breakpoints_placed is emitted