Tokens as Cost: Core Concepts

Why the economics of LLM products diverge from traditional SaaS: a token is simultaneously money, latency, and context capacity. This page covers three governing principles and a reference bill for a single task.

Cost structure compared to traditional SaaS

Traditional SaaS economics consists of fixed infrastructure costs (servers, bandwidth, storage) and near-zero marginal cost per additional user. A linear “user count × monthly fee” model is sufficient for budgeting.

LLM-powered agent products break this linearity. The billing unit shifts from “user count” to “inference workload” — every tool call, every history replay, every retry, every context compaction directly incurs API cost. Consequences:

  • Among 100 users, approximately 5 heavy users typically drive 80% of the bill (a power-law distribution, not linear aggregation)
  • For the same task, one wrong-path attempt may cost more than ten successful completions combined
  • The same product running for 1 month versus 3 months produces materially different cost curves — cache hit rate, history length, and model version all drift over time
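The concentration effect in the first bullet can be made concrete with a small sketch. The per-user figures below are hypothetical, chosen only so that 5 of 100 users account for 80% of spend; `top_share` is an illustrative helper, not part of any product API.

```python
def top_share(costs, top_fraction):
    """Fraction of total spend attributable to the top `top_fraction` of users."""
    ranked = sorted(costs, reverse=True)
    k = max(1, round(len(ranked) * top_fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical monthly spend per user (USD): 5 heavy users, 95 light users.
monthly_costs = [160.0] * 5 + [200.0 / 95] * 95

share = top_share(monthly_costs, 0.05)                        # share of the top 5%
mean_cost = sum(monthly_costs) / len(monthly_costs)           # what a per-seat budget assumes
median_cost = sorted(monthly_costs)[len(monthly_costs) // 2]  # what a typical user costs

print(f"top 5% share: {share:.0%}, mean: ${mean_cost:.2f}, median: ${median_cost:.2f}")
```

Under these assumed numbers the mean sits well above the median, so a budget built on "average cost per user" describes almost no actual user. Tracking the distribution rather than the mean is the safer planning input.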

Non-engineering roles (product, sales, procurement, support, internal users) who plan with traditional SaaS templates will systematically underestimate cost. Token cost must therefore be a shared baseline across all stakeholders, not a concept confined to the engineering team.

Three resource dimensions of a token

Each token simultaneously occupies three resource categories:

| Resource type | Quantitative relationship | Business impact |
| --- | --- | --- |
| Money | API unit price × token count | Monthly bill |
| Latency | Token count is approximately proportional to wall time | User-facing response time |
| Capacity | Token count = context occupied | Task execution depth and memory retention |
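As a rough sketch, the three relationships in the table can be computed for a single call. Every number below (prices per million tokens, throughput, context window size) is a hypothetical placeholder rather than the rate of any particular model, and the latency model (separate prefill and decode rates) is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass
class CallFootprint:
    cost_usd: float       # money
    latency_s: float      # wall time
    context_used: float   # fraction of the context window occupied


def footprint(input_tokens, output_tokens, *, input_price_per_mtok,
              output_price_per_mtok, prefill_tokens_per_s,
              decode_tokens_per_s, context_window):
    """Estimate the money, latency, and capacity consumed by one call."""
    cost = (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
    latency = (input_tokens / prefill_tokens_per_s
               + output_tokens / decode_tokens_per_s)
    context = (input_tokens + output_tokens) / context_window
    return CallFootprint(cost, latency, context)


# Hypothetical call: 20k input tokens, 500 output tokens.
fp = footprint(
    20_000, 500,
    input_price_per_mtok=3.0,     # assumed USD per million input tokens
    output_price_per_mtok=15.0,   # assumed USD per million output tokens
    prefill_tokens_per_s=5_000,   # assumed input processing rate
    decode_tokens_per_s=50,       # assumed generation rate
    context_window=200_000,       # assumed window size in tokens
)
print(fp)
```

The same token counts feed all three results, which is why changing them for one dimension always moves the other two.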

Any token optimization must first specify the target dimension. The three dimensions typically trade off rather than scale together:

  • Compacting history releases capacity and reduces future cost; compaction is itself an LLM call, so current-period cost rises
  • Switching to a stronger model raises per-step cost but reduces total step count — cost may increase while latency decreases
  • Switching to a cheaper model with additional fallback steps — cost decreases, latency and failure rate increase

There is no “all-dimensional optimization” — only “explicitly chosen target dimension with acceptance of the corresponding trade-off”.
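The compaction trade-off in the first bullet above can be framed as a break-even calculation. This is a minimal sketch under stated assumptions: the summary replaces the whole history, prices are hypothetical per-token rates, and caching is ignored (in practice, rewriting the history may also invalidate a cached prefix).

```python
import math


def compaction_break_even_turns(history_tokens, summary_tokens,
                                input_price, output_price):
    """Number of later turns needed before a compaction call pays for itself.

    Upfront cost: read the full history as input, write the summary as output.
    Saving per later turn: the tokens no longer re-sent as input.
    Returns None when the summary is not shorter than the history.
    """
    upfront = history_tokens * input_price + summary_tokens * output_price
    saved_per_turn = (history_tokens - summary_tokens) * input_price
    if saved_per_turn <= 0:
        return None
    return math.ceil(upfront / saved_per_turn)


# Hypothetical: 30k-token history compacted to 2k tokens,
# at assumed prices of $3 and $15 per million input/output tokens.
turns = compaction_break_even_turns(30_000, 2_000, 3e-6, 15e-6)
print(turns)
```

If the task is expected to end before the break-even turn, compaction raises total cost even though it frees capacity, which is the sense in which the target dimension has to be chosen explicitly.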

Three governing principles

  1. Input is inexpensive, cached input is cheaper, output is expensive, and history is the most expensive: history starts as prior output and is then re-sent as input on every subsequent turn
  2. Tokens not consumed are worth more than tokens saved: keeping the prompt prefix static so that it triggers cache hits typically saves more than manually shortening the prompt
  3. Failure cost ≥ success cost — when an agent retries down the wrong path, tokens consumed in prior steps are already billed and unrecoverable
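Principles 1 and 2 can be illustrated with a short sketch. The token counts and the cache multipliers (`write_mult`, `read_mult`) are assumptions made for illustration; actual cache pricing depends on the provider and model.

```python
def input_tokens_billed(static_prefix: int, added_per_step: int, steps: int) -> int:
    """Input tokens billed over a task when the whole history is re-sent each step."""
    total = 0
    history = 0
    for _ in range(steps):
        total += static_prefix + history   # this step's request
        history += added_per_step          # output and tool results join the history
    return total


def prefix_cost_shortened(prefix: int, steps: int, keep_ratio: float) -> float:
    """Prefix cost, in full-price input-token equivalents, after trimming the prompt."""
    return prefix * keep_ratio * steps


def prefix_cost_cached(prefix: int, steps: int,
                       write_mult: float, read_mult: float) -> float:
    """Prefix cost when the first call writes the cache and later calls read it."""
    return prefix * (write_mult + read_mult * (steps - 1))


# Principle 1: history makes input cost grow faster than the step count.
twelve = input_tokens_billed(4_000, 1_500, 12)
twenty_four = input_tokens_billed(4_000, 1_500, 24)

# Principle 2: trimming 20% of a 4k-token prefix vs. caching it unchanged.
shortened = prefix_cost_shortened(4_000, 12, keep_ratio=0.8)
cached = prefix_cost_cached(4_000, 12, write_mult=1.25, read_mult=0.1)

print(twelve, twenty_four, shortened, cached)
```

Under these assumptions, doubling the step count from 12 to 24 more than triples the billed input tokens (147,000 versus 510,000), and the cached prefix costs roughly a quarter of the shortened one. Principle 3 follows from the same loop: every step already executed has been billed, whether or not the path it belonged to was the right one.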

Reference bill: 12-step task

A concrete figure facilitates intuition; detailed itemization follows in the next section.

Task scenario: a user asks the agent to triage a Gmail inbox and classify the 20 most recent emails by project. The agent completes the task in 12 steps on Claude Sonnet 4.6. The total bill is approximately $0.19, broken down by source:

  • 41% Conversation history: prior dialogue re-sent on every turn
  • 23% Model output: the model's responses
  • 22% Tool results: tool return values, which then enter the history
  • 13% System prompt + tool descriptions: the static prefix; the share is this low only because of cache hits
  • < 1% Sandbox + storage

Medium-length tasks (approximately 10-15 steps) tend to follow a 40 / 25 / 20 / 15 cost distribution (conversation history / model output / tool results / static prefix). Conversation history is consistently the largest cost item.
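For a sense of scale, the percentages of the reference bill can be converted to dollar amounts. The sketch below uses only the figures quoted above; because the shares are rounded, the per-item amounts are approximate, and the sandbox and storage line is treated as the remainder.

```python
TOTAL_USD = 0.19  # approximate total from the reference bill

# Shares as quoted in the reference bill (rounded percentages).
shares = {
    "Conversation history": 0.41,
    "Model output": 0.23,
    "Tool results": 0.22,
    "System prompt + tool descriptions": 0.13,
}

amounts = {name: TOTAL_USD * share for name, share in shares.items()}
remainder = TOTAL_USD - sum(amounts.values())  # sandbox + storage, roughly

for name, usd in amounts.items():
    print(f"{name:<36} ${usd:.4f}")
print(f"{'Sandbox + storage (remainder)':<36} ${remainder:.4f}")
```

At this scale the largest item is under eight cents, which is why per-task figures only become meaningful once multiplied by task volume and by the share of heavy users.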

Section guide

| Section | Content | Audience |
| --- | --- | --- |
| cost-model | Line-by-line derivation of the $0.19 figure; cost comparison across models on the same task | Engineering, product, finance |
| controls-and-roi | Cost cap configuration and monitoring; methodology for converting human time into ROI | Operations, sales, procurement |

When only one section is read, cost-model is recommended — itemized billing data is more persuasive than abstract argument.
