The Token Economy
4 April 2026
Why Cheaper Tokens Still Cost You More
If you’ve ever burned through a 5-hour AI usage cap in under twenty minutes, you already know something is wrong with how we think about LLM costs. The price per token keeps dropping. The models keep getting smarter. And somehow, the bill, or the rate limit, never gets easier to live with.
This post is about why. Not the marketing version. The first-principles version.
The Old World vs. The New World
For twenty years, “performance optimization” meant one thing to software engineers: reduce memory allocation, minimize CPU cycles, cache aggressively. We optimized for RAM.
That era is over. The new bottleneck is the context window: the amount of text a language model can hold in its working memory during a single conversation. And unlike RAM, which you buy once and use indefinitely, context costs you every single time it gets read.
This is the Token Economy. And most developers are losing money in it without understanding why.
The Context Tax: Why Re-Reading Is the Real Cost
The mechanic is simple. It’s just not talked about much.
LLMs are stateless. They don’t remember your last message. Every time you send a follow-up in a conversation, the entire history (system instructions, every code block you’ve pasted, every previous response) gets re-sent as the prompt. The model computes over all of it from scratch.
Under the hood, providers use a KV Cache (Key-Value Cache) to store the attention keys and values already computed for earlier tokens, and techniques like prompt caching can reuse that prior computation across requests at reduced cost. But the default behavior, and what most developers experience, is that every token in the accumulated conversation gets billed as input on every turn.
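To make the statelessness concrete, here is a minimal sketch of what a chat client does on each turn. The `Conversation` type and its methods are hypothetical names for illustration, not any provider’s SDK, and nothing is sent over the network.

```go
package main

import "fmt"

// Message is one entry in a chat transcript.
type Message struct {
	Role    string // "system", "user", or "assistant"
	Content string
}

// Conversation keeps the transcript on the client side.
// The model itself retains nothing between requests.
type Conversation struct {
	History []Message
}

// NextRequest appends the user's message and returns the full payload
// that would be sent: every prior message plus the new one.
func (c *Conversation) NextRequest(userText string) []Message {
	c.History = append(c.History, Message{Role: "user", Content: userText})
	payload := make([]Message, len(c.History))
	copy(payload, c.History)
	return payload
}

// RecordReply stores the model's answer so it is re-sent on later turns.
func (c *Conversation) RecordReply(text string) {
	c.History = append(c.History, Message{Role: "assistant", Content: text})
}

func main() {
	conv := &Conversation{History: []Message{{Role: "system", Content: "You are a code reviewer."}}}
	for turn := 1; turn <= 3; turn++ {
		payload := conv.NextRequest(fmt.Sprintf("question %d", turn))
		fmt.Printf("turn %d sends %d messages\n", turn, len(payload))
		conv.RecordReply(fmt.Sprintf("answer %d", turn))
	}
}
```

Each turn adds two messages to the transcript, so the payload grows by two entries per exchange even when the new question is a single line.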
By turn 10, 93% of tokens processed are re-reads of previous conversation — only 7% is your new question.
Watch how this compounds. Say you paste 200 lines of a TypeScript orchestration layer. The AI responds with a 300-line analysis. You ask a follow-up. The API now sends your original 200 lines, the AI’s 300-line response, and your new question. That’s 500+ lines of context before the model starts thinking about your follow-up.
By turn ten, you might have 15,000 tokens of accumulated history. Without prompt caching, you’re billed for all of it on every turn. In a modeled example, roughly 93% of the tokens processed at turn ten are re-reads of prior conversation. Only 7% is your new question. The exact ratio depends on how long your turns are, but the pattern is consistent: old context dominates new input, and it grows with every exchange.
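One way to see where a ratio like that comes from is to model it. The sketch below assumes every user message is 1,000 tokens and every reply is 500 tokens. Those sizes are my own illustrative choice, not the figures behind the modeled example above, and real conversations vary turn to turn.

```go
package main

import "fmt"

// TurnLoad describes the input tokens billed on a single turn
// when nothing is cached.
type TurnLoad struct {
	History int // tokens re-sent from earlier turns
	New     int // tokens in the new user message
}

// Input is the total billed as input on this turn.
func (t TurnLoad) Input() int { return t.History + t.New }

// ReReadShare is the fraction of this turn's input that is old context.
func (t TurnLoad) ReReadShare() float64 {
	if t.Input() == 0 {
		return 0
	}
	return float64(t.History) / float64(t.Input())
}

// LoadAtTurn models a conversation in which every user message has
// userTokens tokens and every reply has replyTokens tokens
// (a simplifying assumption).
func LoadAtTurn(turn, userTokens, replyTokens int) TurnLoad {
	if turn < 1 {
		turn = 1
	}
	return TurnLoad{
		History: (turn - 1) * (userTokens + replyTokens),
		New:     userTokens,
	}
}

func main() {
	cumulative := 0
	for turn := 1; turn <= 10; turn++ {
		load := LoadAtTurn(turn, 1000, 500)
		cumulative += load.Input()
		fmt.Printf("turn %2d: input=%6d  re-read=%5.1f%%  cumulative=%6d\n",
			turn, load.Input(), 100*load.ReReadShare(), cumulative)
	}
}
```

With these assumed sizes, turn ten re-sends 13,500 tokens of history alongside a 1,000-token question, which is about 93% re-reads, and the ten turns together bill 77,500 input tokens.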
On a frontier model like Claude Opus 4.6 at $5 per million input tokens, this is the single largest driver of cost and usage-cap exhaustion. We’ll look at how prompt caching changes this math later, but first it’s worth understanding another tax you’re paying without realizing it.
The Language Tax
Consider this Go struct, the kind you’d find in any enterprise API service:
type OrderFulfillmentState struct {
    CustomerAccountID   string
    ShippingDestination string
    CurrentProgress     float64
    IsRetryEligible     bool
}
Four fields. Forty-five tokens just for the type definition. The information content (“an order has an ID, a destination, a progress percentage, and a retry flag”) could be expressed in maybe fifteen tokens of natural language. The remaining thirty tokens are structural overhead: capitalization conventions, type annotations, field alignment.
Same 4-field data structure across representations. Tokens spent on syntax, type annotations, and naming conventions are "free" to the compiler but expensive to the LLM.
This is the Language Tax. Verbose languages and deeply nested data formats (looking at you, enterprise JSON) cost more in the Token Economy than terse ones, not because they’re worse engineering, but because LLMs charge by the syllable.
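A rough way to feel the difference is to serialize the same four values under long and short keys and compare sizes. This sketch uses byte length as a crude stand-in for token count, which is an assumption: real tokenizers do not map bytes to tokens one-to-one. The short key names (`id`, `dest`, `prog`, `retry`) are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderFulfillmentState mirrors the struct above; with no tags,
// encoding/json uses the full field names as keys.
type OrderFulfillmentState struct {
	CustomerAccountID   string
	ShippingDestination string
	CurrentProgress     float64
	IsRetryEligible     bool
}

// compactOrder carries the same four values under short, hypothetical keys.
type compactOrder struct {
	ID    string  `json:"id"`
	Dest  string  `json:"dest"`
	Prog  float64 `json:"prog"`
	Retry bool    `json:"retry"`
}

// EncodedSizes returns the JSON byte length of the verbose and compact
// encodings of the same data. Bytes are only a proxy for tokens.
func EncodedSizes(o OrderFulfillmentState) (verbose, compact int, err error) {
	v, err := json.Marshal(o)
	if err != nil {
		return 0, 0, err
	}
	c, err := json.Marshal(compactOrder{
		ID:    o.CustomerAccountID,
		Dest:  o.ShippingDestination,
		Prog:  o.CurrentProgress,
		Retry: o.IsRetryEligible,
	})
	if err != nil {
		return 0, 0, err
	}
	return len(v), len(c), nil
}

func main() {
	order := OrderFulfillmentState{
		CustomerAccountID:   "A-1",
		ShippingDestination: "Oslo",
		CurrentProgress:     0.5,
		IsRetryEligible:     true,
	}
	v, c, err := EncodedSizes(order)
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Printf("verbose keys: %d bytes, compact keys: %d bytes\n", v, c)
}
```

The trade-off is readability: short keys save context on every request, but the model (and the next engineer) loses the self-describing names.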
In the Old World, we called this “Clean Code.” In the Token Economy, we might have to start calling it “Financial Negligence.”
The DeepSeek Disruption
In January 2025, a Chinese AI lab called DeepSeek dropped R1, a reasoning model that performed competitively with OpenAI’s o1, at roughly 95% less cost. R1’s breakthrough was its use of large-scale reinforcement learning to produce strong reasoning without expensive supervised fine-tuning. But the cost story started earlier: DeepSeek’s V2 model (mid-2024) had already introduced Multi-head Latent Attention (MLA), a technique that compresses KV Cache into a much smaller latent space, reducing it by up to 93%. V3 inherited MLA, and R1 was built on V3. The architectural efficiency was already baked in before the reasoning leap happened.
The combined effect on the industry was immediate and lasting. DeepSeek made a strong case that the cost of intelligence had been higher than it needed to be: that architectural innovation and lean training pipelines could close much of the gap with frontier models at a fraction of the price. Prices cratered. OpenAI responded with GPT-5.4 Nano at $0.20 per million input tokens. Anthropic’s Claude Sonnet 4.6 came in at $3 per million input tokens. DeepSeek’s V3.2 lists input at $0.28 per million tokens (cache miss) and output at $0.42, with cache hits dropping to $0.028.
Tokens became a commodity. The price of a single inference dropped to fractions of a cent.
And then something counterintuitive happened.
The Jevons Paradox of Intelligence
Price dropped 97%. The bill went up 150%. That's the Jevons Paradox — efficiency doesn't reduce consumption, it enables more of it.
In 1865, economist William Stanley Jevons observed that as coal-burning engines became more efficient, total coal consumption increased, not decreased. Cheaper energy didn’t lead to conservation; it led to proliferation.
We’re living the Jevons Paradox of tokens right now.
GPT-5.4 Nano costs $0.20 per million input tokens. Claude Opus 4.6 offers a million-token context window. So what do developers do? They stuff more context. Longer system prompts. Entire codebases in a single request. Multi-turn agent loops that run for fifty iterations.
The result: your per-token cost drops by 90%, but your total token consumption increases by 10x. The bill stays the same, or goes up. And on consumer-facing products with usage caps, you hit the wall faster, because cheaper models encourage the kind of sprawling, context-heavy workflows that eat through allocations in minutes.
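The arithmetic is just price times volume. A small sketch, assuming a hypothetical baseline of $2.00 per million tokens at 10 million tokens a month, against the $0.20 price at ten times the volume:

```go
package main

import "fmt"

// Bill returns the input cost in dollars for a token volume
// at a given price per million tokens.
func Bill(pricePerMTok float64, tokens int64) float64 {
	return pricePerMTok * float64(tokens) / 1_000_000
}

func main() {
	before := Bill(2.00, 10_000_000) // hypothetical baseline price and volume
	after := Bill(0.20, 100_000_000) // 90% cheaper per token, 10x the tokens
	fmt.Printf("before: $%.2f  after: $%.2f\n", before, after)
}
```

Any growth in volume beyond 10x pushes the second bill above the first, despite the lower unit price.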
This is why a five-hour cap can vanish in nineteen minutes. The model is so capable that you throw everything at it, and the Context Tax compounds with every turn.
Surviving the Token Economy: Prompt Caching
The industry’s answer, at least for now, is Prompt Caching.
Instead of reprocessing your entire prompt from scratch on every request, the API stores the computed KV Cache for the stable parts (system instructions, large documents, conversation history) and only processes the new tokens.
Anthropic’s implementation lets you place explicit cache breakpoints on content blocks. The first request pays a small write premium (1.25x the base input price for the default 5-minute TTL, or 2x for a 1-hour TTL), but every subsequent cache hit costs just 10% of standard input pricing. OpenAI’s GPT-5.4 family now offers similar 90% discounts on cached tokens, though their caching is automatic rather than explicit.
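As a sketch of what an explicit breakpoint looks like, the Go code below builds a request body with a `cache_control` marker on the system prompt. The field names follow my reading of Anthropic’s prompt caching documentation and should be checked against the current API reference. `MODEL_ID` is a placeholder, the Go type names are my own, and the code only builds JSON; it does not call the API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// CacheControl marks the end of a cacheable prefix.
type CacheControl struct {
	Type string `json:"type"`          // "ephemeral"
	TTL  string `json:"ttl,omitempty"` // e.g. "1h"; omit for the 5-minute default
}

// TextBlock is one content block in the system prompt.
type TextBlock struct {
	Type         string        `json:"type"`
	Text         string        `json:"text"`
	CacheControl *CacheControl `json:"cache_control,omitempty"`
}

// Message is one conversation turn.
type Message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// Request is a pared-down request body with only the fields used here.
type Request struct {
	Model     string      `json:"model"`
	MaxTokens int         `json:"max_tokens"`
	System    []TextBlock `json:"system"`
	Messages  []Message   `json:"messages"`
}

// BuildRequest places a cache breakpoint after the (stable) system prompt,
// so only the user's question is new work on later requests.
func BuildRequest(model, systemPrompt, question string) Request {
	return Request{
		Model:     model,
		MaxTokens: 1024,
		System: []TextBlock{{
			Type:         "text",
			Text:         systemPrompt,
			CacheControl: &CacheControl{Type: "ephemeral"},
		}},
		Messages: []Message{{Role: "user", Content: question}},
	}
}

func main() {
	req := BuildRequest("MODEL_ID", "You are a code reviewer. <long instructions>", "Review this diff.")
	body, err := json.MarshalIndent(req, "", "  ")
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(body))
}
```

The design point is ordering: the breakpoint only helps if everything before it is identical between requests, so volatile content belongs after it.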
The math is compelling. Take a 10,000-token system prompt on Claude Opus 4.6 using the 5-minute cache. Without caching, ten requests cost $0.50 in input tokens alone. With caching, the first request costs $0.0625, and each subsequent hit costs $0.005. Ten requests total: $0.1075 instead of $0.50, a reduction of about 78%. For long-running agentic sessions, the 1-hour TTL at 2x write cost makes even more sense: it breaks even after just two cache reads, and you stop worrying about the cache expiring mid-task.
10K-token system prompt on Claude Opus 4.6 ($5/MTok input). Cache breaks even after just 1 read.
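The numbers above can be reproduced in a few lines. This sketch hard-codes the $5/MTok input price and the 10% read multiplier quoted in this post. The function names are my own, and it assumes every request after the first is a cache hit, which only holds if requests arrive within the TTL.

```go
package main

import "fmt"

const (
	basePerMTok = 5.00 // Opus 4.6 input price quoted above, $ per million tokens
	readMult    = 0.10 // cache hits cost 10% of base input price
)

// perRequest is the uncached input cost of sending the prompt once.
func perRequest(promptTokens int) float64 {
	return basePerMTok * float64(promptTokens) / 1_000_000
}

// UncachedCost is the input cost of sending the prompt n times with no cache.
func UncachedCost(promptTokens, requests int) float64 {
	return perRequest(promptTokens) * float64(requests)
}

// CachedCost charges one cache write, then a cache read per later request.
func CachedCost(promptTokens, requests int, writeMult float64) float64 {
	if requests <= 0 {
		return 0
	}
	p := perRequest(promptTokens)
	return p*writeMult + p*readMult*float64(requests-1)
}

// BreakEvenReads returns the smallest number of cache reads after which
// write-plus-reads costs no more than sending everything uncached.
// It returns -1 if that never happens within 1000 reads.
func BreakEvenReads(writeMult float64) int {
	for k := 1; k <= 1000; k++ {
		if writeMult+readMult*float64(k) <= 1+float64(k) {
			return k
		}
	}
	return -1
}

func main() {
	fmt.Printf("uncached, 10 requests: $%.4f\n", UncachedCost(10_000, 10))
	fmt.Printf("cached (5-min TTL):    $%.4f\n", CachedCost(10_000, 10, 1.25))
	fmt.Printf("break-even reads: 5-min=%d, 1-hour=%d\n",
		BreakEvenReads(1.25), BreakEvenReads(2.0))
}
```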
For developers building agentic systems (where the same system prompt, tool definitions, and accumulated context get sent dozens or hundreds of times), the economics without caching don’t work at any meaningful scale.
The Hidden Line Item: Reasoning Tokens
There’s one more cost that most developers don’t account for: inference-time compute.
Modern reasoning models (OpenAI’s o-series, DeepSeek’s R1, Claude’s extended thinking mode) don’t just read your prompt and respond. They generate an internal chain of thought (a “hidden monologue”) before producing the visible answer. These reasoning tokens are billed as output tokens but never shown to the user.
On OpenAI’s o-series models, this internal monologue can run 2–5x longer than the visible response. You’re not just paying for the answer. You’re paying for the AI to think about the answer. And on models where output tokens cost $15–25 per million, that thinking isn’t cheap.
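To put a number on it, here is a small sketch that treats the hidden reasoning as a multiple of the visible answer. The ratio is an input you would have to measure for your own workload; the 3x used in `main` simply sits inside the 2–5x range mentioned above, and the function names are hypothetical.

```go
package main

import (
	"fmt"
	"math"
)

// BilledOutputTokens adds hidden reasoning tokens to the visible answer.
// reasoningRatio is reasoning tokens per visible token (an assumed input).
func BilledOutputTokens(visible int, reasoningRatio float64) int {
	hidden := int(math.Round(float64(visible) * reasoningRatio))
	return visible + hidden
}

// OutputCost is the dollar cost of the billed output at a price
// per million output tokens.
func OutputCost(visible int, reasoningRatio, pricePerMTok float64) float64 {
	return pricePerMTok * float64(BilledOutputTokens(visible, reasoningRatio)) / 1_000_000
}

func main() {
	// A 1,000-token visible answer with 3x as many reasoning tokens,
	// priced at $25 per million output tokens.
	fmt.Printf("billed tokens: %d\n", BilledOutputTokens(1000, 3))
	fmt.Printf("cost: $%.4f (vs $%.4f for the visible answer alone)\n",
		OutputCost(1000, 3, 25), OutputCost(1000, 0, 25))
}
```

Under those assumptions, a 1,000-token answer is billed as 4,000 output tokens, so the per-answer cost is four times what the visible text suggests.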
This creates a new optimization problem: when do you need the model to reason deeply, and when is a fast, shallow response sufficient? The answer is model routing: using a cheap model (Nano, Haiku) for classification and simple tasks, and reserving the expensive reasoning models for problems that actually require them.
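A router can start out as a plain function. The sketch below is one possible shape, not a recommended policy: the task kinds, the tier names, and the 50,000-token threshold are all hypothetical and would need tuning against real traffic.

```go
package main

import "fmt"

// TaskKind is a coarse label for what the request needs.
type TaskKind int

const (
	Classify TaskKind = iota
	Extract
	MultiStepReasoning
)

// Tier names a class of model rather than a specific product.
type Tier string

const (
	TierSmall     Tier = "small"     // e.g. a Nano- or Haiku-class model
	TierMid       Tier = "mid"       // a general-purpose model
	TierReasoning Tier = "reasoning" // an expensive reasoning model
)

// Task is the routing input.
type Task struct {
	Kind        TaskKind
	InputTokens int
}

// Route picks the cheapest tier that is likely to handle the task.
// The rules here are illustrative assumptions only.
func Route(t Task) Tier {
	switch {
	case t.Kind == MultiStepReasoning:
		return TierReasoning
	case t.InputTokens > 50_000:
		// Very long inputs go to a mid-tier model even for simple tasks.
		return TierMid
	default:
		return TierSmall
	}
}

func main() {
	tasks := []Task{
		{Kind: Classify, InputTokens: 200},
		{Kind: Extract, InputTokens: 80_000},
		{Kind: MultiStepReasoning, InputTokens: 4_000},
	}
	for _, t := range tasks {
		fmt.Printf("kind=%d tokens=%d -> %s\n", t.Kind, t.InputTokens, Route(t))
	}
}
```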
The New Efficiency
In the old world, efficiency meant fewer lines of code, less memory, faster execution. In the Token Economy, efficiency means token density: maximizing the information content per token sent to the model.
This has concrete implications for how we write code, structure prompts, and design AI-powered systems:
- System prompts bloat quietly. Every token that isn’t doing real work is a recurring cost you stop noticing until it’s significant.
- Raw conversation history is dead weight. Summarize prior context instead of dragging the full transcript into every request.
- A well-commented 20-line function teaches more per token than a 200-line class with verbose field names. Information density matters at the example level too.
- Not every task needs a frontier model. Routing a string classification to Opus is the token economy equivalent of taking a cab to check the mail.
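The second point in the list above, summarizing instead of resending the transcript, might look like the sketch below. `CompactHistory` and `Summarizer` are hypothetical names. In practice the summarizer would probably be a call to a cheap model; here it is a stand-in function so the example runs on its own.

```go
package main

import "fmt"

// Message is one entry in a chat transcript.
type Message struct {
	Role    string
	Content string
}

// Summarizer condenses older messages into a short text.
// A real one would likely call a small model; this type is an assumption.
type Summarizer func(old []Message) string

// CompactHistory keeps the last keepLast messages verbatim and replaces
// everything older with a single summary message.
func CompactHistory(history []Message, keepLast int, summarize Summarizer) []Message {
	if keepLast < 0 {
		keepLast = 0
	}
	if len(history) <= keepLast {
		return history
	}
	cut := len(history) - keepLast
	out := make([]Message, 0, keepLast+1)
	// Some APIs require alternating roles; adjust the summary's role as needed.
	out = append(out, Message{
		Role:    "user",
		Content: "Summary of earlier conversation: " + summarize(history[:cut]),
	})
	out = append(out, history[cut:]...)
	return out
}

func main() {
	var history []Message
	for i := 1; i <= 10; i++ {
		history = append(history, Message{Role: "user", Content: fmt.Sprintf("message %d", i)})
	}
	// Stand-in summarizer: reports only how many messages were folded away.
	countOnly := func(old []Message) string {
		return fmt.Sprintf("%d earlier messages omitted", len(old))
	}
	compact := CompactHistory(history, 4, countOnly)
	fmt.Printf("%d messages -> %d messages\n", len(history), len(compact))
	fmt.Println(compact[0].Content)
}
```

The cost is fidelity: whatever the summary drops is gone for the rest of the session, so `keepLast` trades token spend against how much recent detail the model can still see.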
When people hit the wall, the blame usually splits two ways: the architecture — too much raw history, no caching, wrong model for the task — or the provider, quietly adjusting limits without much explanation. Both are real. Anthropic and OpenAI don’t publish a clear formula for how usage caps are calculated or when they change, which makes it genuinely hard to know which problem you’re actually solving. The conversation design is doing more work than anyone accounted for, and the limit you’re optimizing against today may not be the same one you hit tomorrow.
References
- KV Caching in Transformers — Hugging Face Blog, “KV Caching Explained: Optimizing Transformer Inference Efficiency” (Sep 2025)
- KV Cache from Scratch — Sebastian Raschka, “Understanding and Coding the KV Cache in LLMs from Scratch” (Jun 2025)
- DeepSeek-V2 Paper (MLA Origin) — DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” (2024). Reports 93.3% KV cache reduction via MLA.
- MLA Deep Dive — Towards Data Science, “DeepSeek-V3 Explained 1: Multi-Head Latent Attention” (Feb 2025)
- MLA Technical Analysis — Sebastian Raschka, “Multi-Head Latent Attention (MLA)” (Mar 2026)
- DeepSeek Inference Cost Analysis — IntuitionLabs, “DeepSeek’s Low Inference Cost Explained” (Mar 2026). Confirms DeepSeek R1 runs 20–50× cheaper than OpenAI’s comparable model.
- DeepSeek API Pricing — NxCode, “DeepSeek API Pricing 2026” (Mar 2026). V3.2 at $0.28/$0.42 input/output per MTok (cache miss), R1 at $0.55/$2.19 input/output.
- GPT-5.4 Nano Launch — OpenAI, “Introducing GPT-5.4 mini and nano” (Mar 17, 2026). Nano priced at $0.20/$1.25 per MTok.
- Claude API Pricing — Anthropic, “Pricing”. Opus 4.6 at $5/$25, Sonnet 4.6 at $3/$15, cache hits at 10% of base input.
- Prompt Caching — Anthropic, “Prompt Caching Documentation”. 1.25x write cost, 10% read cost, 5-minute default TTL.
- OpenAI vs Anthropic Caching Comparison — Finout, “OpenAI vs Anthropic API Pricing Comparison (2026)”. Confirms both providers offer 90% cached input discounts.
- Reasoning Tokens — OpenAI, “Reasoning Models”. Documents internal reasoning tokens billed as output, not shown to user.
- Reasoning Token Cost Multiplier — Finout (ibid). Reports real-world output costs on o-series models “often run 2–5x higher than the headline rate.” Industry estimate, not an official OpenAI figure.
- Jevons Paradox — Wikipedia, “Jevons paradox”. W.S. Jevons, The Coal Question (1865).
- Jevons Paradox Primary Source — Yale Energy History, “W. Stanley Jevons, ‘The Coal Question,’ 1865”