Tokens as Cost: Core Concepts

Cost structure compared to traditional SaaS

Traditional SaaS economics combine fixed infrastructure costs (servers, bandwidth, storage) with near-zero marginal cost per additional user, so a linear “user count × monthly fee” model is sufficient for budgeting.

LLM-powered agent products break this linearity. The billing unit shifts from “user count” to “inference workload” — every tool call, every history replay, every retry, every context compaction directly incurs API cost. Consequences:

Among 100 users, approximately 5 heavy users typically drive 80% of the bill (a power-law distribution, not linear aggregation)
For the same task, one wrong-path attempt may cost more than ten successful completions combined
The same product running for 1 month versus 3 months produces materially different cost curves — cache hit rate, history length, and model version all drift over time
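
One way to check whether a bill shows this concentration is to compute the share of total spend attributable to the heaviest users. The sketch below is illustrative only: `top_share` is a hypothetical helper, and the per-user figures are invented to reproduce the 5-users-drive-80% shape described above, not taken from real billing data.

```python
def top_share(spend_per_user: list[int], top_n: int) -> float:
    """Return the fraction of total spend driven by the top_n highest spenders."""
    total = sum(spend_per_user)
    if total == 0:
        return 0.0
    heaviest = sorted(spend_per_user, reverse=True)[:top_n]
    return sum(heaviest) / total


# Hypothetical monthly spend in cents: 5 heavy users and 95 light users.
spend_cents = [1520] * 5 + [20] * 95

share = top_share(spend_cents, top_n=5)
print(f"Top 5 of {len(spend_cents)} users drive {share:.0%} of the bill")
```

Running the same function on a real per-user billing export shows how far the actual distribution departs from a linear "average user × head count" estimate.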

Non-engineering roles (product, sales, procurement, support, internal users) who plan with traditional SaaS templates will systematically underestimate cost. The token cost concept must therefore become a shared baseline across all stakeholders rather than stay confined to the engineering team.

Three resource dimensions of a token

Each token simultaneously occupies three resource categories:

Resource type	Quantitative relationship	Business impact
Money	API unit price × token count	Monthly bill
Latency	Wall time is approximately proportional to token count	User-facing response time
Capacity	Token count = context occupied	Task execution depth and memory retention
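
The three rows of the table can be sketched as one small calculation. This is a simplified illustration, not a billing formula from any provider: the function name, the price, the throughput, and the context window size are all assumed placeholder values, and real APIs price and process input and output tokens differently.

```python
from dataclasses import dataclass


@dataclass
class Footprint:
    cost_usd: float       # money: unit price x token count
    latency_s: float      # latency: roughly proportional to token count
    context_used: float   # capacity: fraction of the context window occupied


def footprint(tokens: int, price_per_mtok: float,
              tokens_per_second: float, context_window: int) -> Footprint:
    """Estimate what a batch of tokens costs on each of the three dimensions."""
    return Footprint(
        cost_usd=tokens * price_per_mtok / 1_000_000,
        latency_s=tokens / tokens_per_second,
        context_used=tokens / context_window,
    )


# All four inputs below are assumed placeholder values.
fp = footprint(tokens=2_000, price_per_mtok=15.0,
               tokens_per_second=50.0, context_window=200_000)
print(f"cost ${fp.cost_usd:.3f}, latency {fp.latency_s:.0f}s, "
      f"context {fp.context_used:.1%}")
```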

Any token optimization must first specify the target dimension. The three dimensions typically trade off rather than scale together:

Compacting history releases capacity and reduces future cost, but compaction is itself an LLM call, so current-period cost rises
Switching to a stronger model raises per-step cost but reduces total step count: cost may increase while latency decreases
Switching to a cheaper model and adding fallback steps lowers cost but increases latency and failure rate

No optimization improves all three dimensions at once. The only option is to choose a target dimension explicitly and accept the corresponding trade-off.
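
The compaction trade-off can be made concrete with a break-even estimate: compaction costs money now and saves input tokens on each later turn. The sketch below is a minimal model under stated assumptions; the function name and every number in it are hypothetical, and it ignores caching effects and any change in output quality.

```python
import math


def compaction_break_even_turns(compaction_cost_usd: float,
                                tokens_removed: int,
                                input_price_per_mtok: float) -> float:
    """Number of later turns needed before compaction pays for itself."""
    saving_per_turn = tokens_removed * input_price_per_mtok / 1_000_000
    if saving_per_turn <= 0:
        return math.inf  # nothing is saved, so compaction never pays back
    return math.ceil(compaction_cost_usd / saving_per_turn)


# Assumed values: a $0.05 compaction call that removes 5,000 history tokens,
# with input priced at $3.00 per million tokens.
turns = compaction_break_even_turns(0.05, 5_000, 3.0)
print(f"Compaction pays off after {turns} further turns")
```

If the task is expected to end before the break-even turn, compaction only raises the bill, even though it still frees capacity.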

Three governing principles

Input is inexpensive, cached input is cheaper, output is expensive, and history is the most expensive, because history consists of prior output that is re-sent as input on every subsequent turn
Tokens not consumed are worth more than tokens saved: keeping the prompt prefix static so that it triggers cache hits typically yields more savings than manually shortening prompt content
Failure cost ≥ success cost: when an agent retries down the wrong path, the tokens consumed in prior steps are already billed and cannot be recovered
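
The first principle follows from simple arithmetic: if every step appends tokens to the history and the whole history is re-sent on the next step, total input grows roughly with the square of the step count. The sketch below assumes a fixed prefix and a fixed number of tokens added per step; both figures and the function name are hypothetical, and caching is ignored.

```python
def total_input_tokens(steps: int, prefix_tokens: int,
                       tokens_added_per_step: int) -> int:
    """Sum the input tokens sent across all steps of one task."""
    total = 0
    history = 0
    for _ in range(steps):
        # Each step re-sends the static prefix plus all accumulated history.
        total += prefix_tokens + history
        # Output and tool results from this step join the history.
        history += tokens_added_per_step
    return total


# Assumed values: 2,000-token prefix, 500 tokens added to history per step.
for steps in (12, 24):
    print(steps, "steps ->", total_input_tokens(steps, 2_000, 500), "input tokens")
```

Under these assumptions, doubling the task from 12 to 24 steps more than triples total input tokens (57,000 to 186,000), which is why a long wrong-path attempt can outweigh several short successful runs.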

Reference bill: 12-step task

A concrete figure facilitates intuition; detailed itemization follows in the next section.

Task scenario: a user asks the agent to triage a Gmail inbox and classify the 20 most recent emails by project. The agent completes the task in 12 steps on Claude Sonnet 4.6. The total bill is approximately $0.19, broken down by source:

41% Conversation history: prior dialogue re-transmitted on every turn
23% Model output: the model's responses
22% Tool results: tool return values, which then enter the history
13% System prompt + tool descriptions: static prefix; the share stays this low only with cache hits
< 1% Sandbox + storage

Medium-length tasks (approximately 10-15 steps) tend to follow a roughly 40 / 25 / 20 / 15 cost distribution across conversation history, model output, tool results, and the static prefix. Conversation history is consistently the largest cost item.
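
To turn the reference percentages into dollar amounts, multiply each share by the $0.19 total. The snippet below uses only the figures quoted above; the variable names are illustrative, and the remainder line is derived by subtraction from rounded shares, so it is an upper bound for the sandbox and storage item rather than a measured value.

```python
TOTAL_USD = 0.19

# Shares as quoted in the reference bill above.
SHARES = {
    "Conversation history": 0.41,
    "Model output": 0.23,
    "Tool results": 0.22,
    "System prompt + tool descriptions": 0.13,
}

breakdown_usd = {item: TOTAL_USD * share for item, share in SHARES.items()}
# Whatever the four rounded shares leave over falls to sandbox + storage.
remainder_usd = TOTAL_USD - sum(breakdown_usd.values())

for item, usd in breakdown_usd.items():
    print(f"{item:<34} ${usd:.4f}")
print(f"{'Sandbox + storage (remainder)':<34} ${remainder_usd:.4f}")
```

Conversation history alone comes to about $0.078 of the $0.19, more than model output and the static prefix combined.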

Section guide

Section	Content	Audience
cost-model	Line-by-line derivation of the $0.19 figure; cost comparison across models on the same task	Engineering, product, finance
controls-and-roi	Cost cap configuration and monitoring; methodology for converting human time into ROI	Operations, sales, procurement

If you read only one section, read cost-model: itemized billing data is more persuasive than abstract argument.