Tokens as Cost: Core Concepts
Why LLM product economics diverges from traditional SaaS — a token simultaneously constitutes money, latency, and context capacity. Three governing principles plus a reference bill for one task.
Cost structure compared to traditional SaaS
Traditional SaaS economics consists of fixed infrastructure costs (servers, bandwidth, storage) and near-zero marginal cost per additional user. A linear “user count × monthly fee” model is sufficient for budgeting.
LLM-powered agent products break this linearity. The billing unit shifts from “user count” to “inference workload” — every tool call, every history replay, every retry, every context compaction directly incurs API cost. Consequences:
- Among 100 users, approximately 5 heavy users typically drive 80% of the bill (a power-law distribution, not linear aggregation)
- For the same task, one wrong-path attempt may cost more than ten successful completions combined
- The same product running for 1 month versus 3 months produces materially different cost curves — cache hit rate, history length, and model version all drift over time
Non-engineering roles (product, sales, procurement, support, internal users) that plan with traditional SaaS templates will systematically underestimate cost. Token cost must therefore be a shared baseline across all stakeholders, not a concept confined to the engineering team.
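The heavy-user concentration described above can be checked against real usage data in a few lines. The following is a minimal sketch: `top_share` is a hypothetical helper, and the spend figures are invented so that 5 of 100 users account for 80% of the total.

```python
def top_share(spend_per_user: list[float], top_fraction: float) -> float:
    """Return the share of total spend attributable to the top `top_fraction` of users."""
    if not spend_per_user:
        raise ValueError("spend_per_user must not be empty")
    ranked = sorted(spend_per_user, reverse=True)
    # Always count at least one user, even for very small fractions.
    top_n = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:top_n]) / sum(ranked)


# Invented monthly spend in USD: 5 heavy users and 95 light users.
monthly_spend = [304.0] * 5 + [4.0] * 95
heavy_share = top_share(monthly_spend, 0.05)
```

A budget built on average spend per user hides this concentration, which is why a per-user percentile view is usually the more useful planning input.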
Three resource dimensions of a token
Each token simultaneously occupies three resource categories:
| Resource type | Quantitative relationship | Business impact |
|---|---|---|
| Money | API unit price × token count | Monthly bill |
| Latency | Wall time is approximately proportional to token count | User-facing response time |
| Capacity | Token count = context occupied | Task execution depth and memory retention |
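The money and capacity rows of the table reduce to simple arithmetic. The sketch below assumes a three-tier price structure (fresh input, cached input, output); the `Pricing` class, the helper names, and the example prices are placeholders chosen for illustration, not actual rates for any model. Latency is left out because it depends on measured throughput.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical USD prices per million tokens (placeholders, not real rates)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def call_cost(p: Pricing, fresh_input: int, cached_input: int, output: int) -> float:
    """Money: unit price times token count, summed over the three token classes."""
    total = (
        fresh_input * p.input_per_mtok
        + cached_input * p.cached_input_per_mtok
        + output * p.output_per_mtok
    )
    return total / 1_000_000


def context_fraction(tokens_in_context: int, context_window: int) -> float:
    """Capacity: share of the context window already occupied."""
    return tokens_in_context / context_window


# Assumed example values, for illustration only.
example_pricing = Pricing(input_per_mtok=3.0, cached_input_per_mtok=0.3, output_per_mtok=15.0)
example_cost = call_cost(example_pricing, fresh_input=1_000, cached_input=9_000, output=500)
```

With these placeholder prices, the 500 output tokens cost more than the 10,000 input tokens combined, which illustrates why the token class matters as much as the token count.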
Any token optimization must first specify the target dimension. The three dimensions typically trade off rather than scale together:
- Compacting history frees capacity and reduces future cost, but compaction is itself an LLM call, so cost in the current period rises
- Switching to a stronger model raises per-step cost but reduces total step count: cost may increase while latency decreases
- Switching to a cheaper model and adding fallback steps lowers cost but raises latency and failure rate
No optimization improves all three dimensions at once. The only option is to choose a target dimension explicitly and accept the corresponding trade-off.
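The compaction trade-off can be framed as a break-even question: how many further turns are needed before the one-time compaction call pays for itself? The sketch below is a simplification under stated assumptions: the summary fully replaces the history, caching is ignored, and prices are hypothetical per-token values. The function name is illustrative.

```python
import math
from typing import Optional


def compaction_break_even_turns(
    history_tokens: int,
    summary_tokens: int,
    input_price: float,   # assumed USD per input token
    output_price: float,  # assumed USD per output token
) -> Optional[int]:
    """Turns after which compaction has paid for itself, or None if it never does."""
    # One-time cost: read the full history as input, write the summary as output.
    one_time_cost = history_tokens * input_price + summary_tokens * output_price
    # Saving per later turn: the summary is re-sent instead of the full history.
    saving_per_turn = (history_tokens - summary_tokens) * input_price
    if saving_per_turn <= 0:
        return None
    return math.ceil(one_time_cost / saving_per_turn)


# Assumed example: 40k-token history compacted to a 4k-token summary.
turns_needed = compaction_break_even_turns(40_000, 4_000, 3e-6, 15e-6)
```

If the task is expected to end before the break-even turn, compaction raises total cost, although it may still be justified when capacity is the target dimension.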
Three governing principles
- Input is inexpensive, cached input is cheaper, output is expensive, and history is the most expensive: history consists of prior output that is re-sent as input on every subsequent turn
- Tokens not consumed are worth more than tokens saved: keeping the prompt prefix static so that it hits the cache typically saves more than manually shortening the prompt
- Failure cost ≥ success cost — when an agent retries down the wrong path, tokens consumed in prior steps are already billed and unrecoverable
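The first and third principles follow from the same arithmetic: each step re-sends everything accumulated so far, so input tokens grow with the square of the step count, and an aborted run has already paid for every step it completed. The sketch below assumes a fixed prefix and a constant amount of history added per step; both numbers and the function names are illustrative, not measurements.

```python
def input_tokens_per_step(steps: int, prefix_tokens: int, growth_per_step: int) -> list[int]:
    """Input tokens sent at each step: static prefix plus the history accumulated so far."""
    return [prefix_tokens + k * growth_per_step for k in range(steps)]


def cumulative_input_tokens(steps: int, prefix_tokens: int, growth_per_step: int) -> int:
    """Total input tokens billed across all steps of a run."""
    return sum(input_tokens_per_step(steps, prefix_tokens, growth_per_step))


# Assumed example: 2,000-token prefix, 800 tokens of new history per step.
full_run = cumulative_input_tokens(12, 2_000, 800)      # successful 12-step run
aborted_run = cumulative_input_tokens(8, 2_000, 800)    # wrong path abandoned after step 8
retry_total = aborted_run + full_run                    # failure plus the eventual success
```

Under these assumptions the last step alone sends more than five times the tokens of the first, and a failed 8-step attempt adds half the input cost of a full successful run on top of it.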
Reference bill: 12-step task
A concrete figure helps build intuition; the cost-model section itemizes it line by line.
Task scenario: a user asks the agent to triage a Gmail inbox and classify the 20 most recent emails by project. The agent completes the task in 12 steps on Claude Sonnet 4.6. The total bill is approximately $0.19, broken down by source:
- 41% Conversation history — historical dialogue re-transmitted on every turn
- 23% Model output — model responses
- 22% Tool results: tool return values, which then enter the history
- 13% System prompt + tool descriptions: static prefix; the share stays this low only when the prefix hits the cache
- < 1% Sandbox + storage
Medium-length tasks (approximately 10-15 steps) tend to follow a roughly 40 / 25 / 20 / 15 split across conversation history, model output, tool results, and static prefix. Conversation history is consistently the largest cost item.
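Converting the percentages above into dollar amounts makes the scale of each line item concrete. The sketch below uses only the $0.19 total and the rounded shares stated for the reference task; the sandbox and storage item is omitted because only an upper bound is given, and `itemize` is an illustrative helper name.

```python
TOTAL_USD = 0.19

# Rounded shares of the reference bill, as listed for the 12-step task.
SHARES = {
    "Conversation history": 0.41,
    "Model output": 0.23,
    "Tool results": 0.22,
    "System prompt + tool descriptions": 0.13,
}


def itemize(total_usd: float, shares: dict[str, float]) -> dict[str, float]:
    """Convert percentage shares into dollar amounts, rounded to 4 decimal places."""
    return {name: round(total_usd * share, 4) for name, share in shares.items()}


line_items = itemize(TOTAL_USD, SHARES)
largest_item = max(line_items, key=line_items.get)
```

Because tool results also enter the history, the history and tool-result lines together account for well over half of the bill.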
Section guide
| Section | Content | Audience |
|---|---|---|
| cost-model | Line-by-line derivation of the $0.19 figure; cost comparison across models on the same task | Engineering, product, finance |
| controls-and-roi | Cost cap configuration and monitoring; methodology for converting human time into ROI | Operations, sales, procurement |
If you read only one section, read cost-model: itemized billing data is more persuasive than abstract argument.