AISHAH SOFEA

The Token Economy

4 April 2026

ai · infrastructure · cost · llm

Why Cheaper Tokens Still Cost You More

If you’ve ever burned through a 5-hour AI usage cap in under twenty minutes, you already know something is wrong with how we think about LLM costs. The price per token keeps dropping. The models keep getting smarter. And somehow, the bill — or the rate limit — never gets easier to live with.

This post is about why. Not the marketing version. The first-principles version.

The Old World vs. The New World

For twenty years, “performance optimization” meant one thing to software engineers: reduce memory allocation, minimize CPU cycles, cache aggressively. We optimized for RAM.

That era is over. The new bottleneck is the context window — the amount of text a language model can hold in its working memory during a single conversation. And unlike RAM, which you buy once and use indefinitely, context costs you every single time it gets read.

This is the Token Economy. And most developers are losing money in it without understanding why.

The Context Tax: Why Re-Reading Is the Real Cost

Here’s the mechanic that nobody explains clearly enough.

When you send a message to an LLM via an API, the entire conversation history — system instructions, every prior code block, every previous response — is re-sent as part of the prompt. The model then computes attention over all of it. Internally, providers use the KV Cache (Key-Value Cache) to store the model’s computed attention state, and some providers reuse cached prefixes automatically to reduce redundant computation. But from a billing perspective, all those input tokens still count toward your usage — whether the provider’s infrastructure optimizes the compute behind the scenes or not.

[Chart: tokens processed per turn, broken into re-read context (the tax), new input from you, and the AI response added to context.]

By turn 10, 93% of tokens processed are re-reads of previous conversation — only 7% is your new question.

Think about what happens in a real coding session. You paste 200 lines of a TypeScript orchestration layer. The AI responds with an analysis and a refactored version. You ask a follow-up question. Now the model is processing: your original 200 lines + the AI’s 300-line response + your follow-up. That’s 500+ lines of context before the model even begins thinking about your new question.

By turn ten of a complex refactoring session, you might have 15,000 tokens of accumulated context. In a naive implementation without prompt caching, the model re-reads all of it on every turn. That’s your “Context Tax.” In our modeled example, roughly 93% of tokens processed by turn ten are re-reads of prior conversation — only 7% is your new question. The exact ratio varies by session, but the pattern is consistent: accumulated context dominates new input. And on a frontier model like Claude Opus 4.6 at $5 per million input tokens, those re-reads add up fast.
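The arithmetic behind that 93% figure is easy to reproduce. Here is a minimal sketch of the naive (uncached) case, assuming each turn adds 500 tokens of new user input and a 1,000-token response; those turn sizes are illustrative assumptions, not measurements:

```go
package main

import "fmt"

// A minimal model of the Context Tax: each turn the user adds newTok
// tokens and the model replies with outTok tokens. Without prompt
// caching, every turn re-sends the full accumulated history as input.
func rereadFraction(turns, newTok, outTok int) (reread, fresh int) {
	history := 0
	for t := 0; t < turns; t++ {
		reread += history          // prior context re-sent as input
		fresh += newTok            // genuinely new input this turn
		history += newTok + outTok // conversation grows every turn
	}
	return reread, fresh
}

func main() {
	reread, fresh := rereadFraction(10, 500, 1000)
	total := reread + fresh
	fmt.Printf("re-read: %d tokens (%.0f%%), new: %d tokens (%.0f%%)\n",
		reread, 100*float64(reread)/float64(total),
		fresh, 100*float64(fresh)/float64(total))
}
```

With those assumed turn sizes, the history reaches 15,000 tokens after turn ten, and re-reads account for roughly 93% of all input tokens processed, which is where the modeled example above comes from.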

Now scale that to the real world: a deeply nested Go struct with verbose field names, a JSON schema with repeated boilerplate, or a state machine configuration where the structure of the code consumes more tokens than the logic it encodes. You’re paying a tax on verbosity itself.

The Language Tax

Consider this Go struct, the kind you’d find in any enterprise API service:

type OrderFulfillmentState struct {
    CustomerAccountID   string
    ShippingDestination string
    CurrentProgress     float64
    IsRetryEligible     bool
}

Four fields. Forty-five tokens just for the type definition. The information content — “an order has an ID, a destination, a progress percentage, and a retry flag” — could be expressed in maybe fifteen tokens of natural language. The remaining thirty tokens are structural overhead: capitalization conventions, type annotations, field alignment.

[Chart: the same 4-field data structure across representations, split into structural overhead (the tax) vs. actual information content.]

Tokens spent on syntax, type annotations, and naming conventions are "free" to the compiler but expensive to the LLM.

This is the Language Tax. Verbose languages and deeply nested data formats (looking at you, enterprise JSON) cost more in the Token Economy than terse ones — not because they’re worse engineering, but because LLMs charge by the syllable.
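A crude way to see the tax is the common rule of thumb that English-like text averages about four characters per token. Real BPE tokenizers vary, so treat the numbers below as rough shapes rather than exact counts:

```go
package main

import "fmt"

// roughTokens estimates token count with the ~4 characters/token rule
// of thumb. Real tokenizers differ; the results are approximate.
func roughTokens(s string) int {
	return (len(s) + 3) / 4
}

func main() {
	goStruct := `type OrderFulfillmentState struct {
	CustomerAccountID   string
	ShippingDestination string
	CurrentProgress     float64
	IsRetryEligible     bool
}`
	terse := "order: id, destination, progress%, retryable"

	// The struct costs several times more tokens than the terse
	// natural-language form, despite carrying the same information.
	fmt.Println("struct:", roughTokens(goStruct), "tokens (rough)")
	fmt.Println("terse: ", roughTokens(terse), "tokens (rough)")
}
```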

In the Old World, we called this “Clean Code.” In the Token Economy, we might have to start calling it “Financial Negligence.”

The DeepSeek Moment

In January 2025, a Chinese AI lab called DeepSeek dropped R1, a reasoning model that performed competitively with OpenAI’s o1 — at roughly 95% less cost. R1’s breakthrough was its use of large-scale reinforcement learning to produce strong reasoning without expensive supervised fine-tuning. But the cost story started earlier: DeepSeek’s V2 model (mid-2024) had already introduced Multi-head Latent Attention (MLA), a technique that compresses KV Cache into a much smaller latent space, reducing it by up to 93%. V3 inherited MLA, and R1 was built on V3. The architectural efficiency was already baked in before the reasoning leap happened.

The combined effect on the industry was immediate and lasting. DeepSeek made a strong case that the cost of intelligence had been higher than it needed to be — that architectural innovation and lean training pipelines could close much of the gap with frontier models at a fraction of the price. Prices cratered. OpenAI responded with GPT-5.4 Nano at $0.20 per million input tokens. Anthropic’s Claude 4.6 Sonnet came in at $3 per million. DeepSeek’s V3.2 lists input at $0.28 per million tokens (cache miss) and output $0.42, with cache hits dropping to $0.028.

Tokens became a commodity. The price of a single inference dropped to fractions of a cent.

And then something counterintuitive happened.

The Jevons Paradox of Intelligence

[Chart: price per MTok vs. monthly token volume, 2023 vs. 2026.]

  • 2023 (GPT-4 era): $60/MTok, used sparingly, for roughly a $30/mo bill.
  • 2026 (post-DeepSeek era): $1.50/MTok, used for everything, for roughly a $75/mo bill.

Price dropped 97%. The bill went up 150%. That's the Jevons Paradox — efficiency doesn't reduce consumption, it enables more of it.

In 1865, economist William Stanley Jevons observed that as coal-burning engines became more efficient, total coal consumption increased, not decreased. Cheaper energy didn’t lead to conservation, it led to proliferation.

We’re living the Jevons Paradox of tokens right now.

GPT-5.4 Nano costs $0.20 per million input tokens. Claude Opus 4.6 offers a million-token context window. So what do developers do? They stuff more context. Longer system prompts. Entire codebases in a single request. Multi-turn agent loops that run for fifty iterations.

The result: your per-token cost drops by 90%, but your total token consumption increases by 10x. The bill stays the same — or goes up. And on consumer-facing products with usage caps, you hit the wall faster, because cheaper models encourage the kind of sprawling, context-heavy workflows that eat through allocations in minutes.

This is why a five-hour allocation can disappear in under twenty minutes. The model is so capable that you throw everything at it, and the Context Tax compounds with every turn.

Surviving the Token Economy: Prompt Caching

The industry’s answer — at least for now — is Prompt Caching.

Here’s the idea: instead of reprocessing your entire prompt from scratch on every request, the API stores the computed KV Cache for the stable parts (system instructions, large documents, conversation history) and only processes the new tokens.

Anthropic’s implementation lets you place explicit cache breakpoints on content blocks. The first request pays a small write premium (1.25x the base input price for the default 5-minute TTL, or 2x for a 1-hour TTL), but every subsequent cache hit costs just 10% of standard input pricing. OpenAI’s GPT-5.4 family now offers similar 90% discounts on cached tokens, though their caching is automatic rather than explicit.

The math is compelling. Take a 10,000-token system prompt on Claude Opus 4.6 using the 5-minute cache. Without caching, ten requests cost $0.50 in input tokens alone. With caching, the first request costs $0.0625, and each subsequent hit costs $0.005. Ten requests total: $0.1075 instead of $0.50. That’s a 78% reduction. For long-running agentic sessions, the 1-hour TTL at 2x write cost makes even more sense — it breaks even after just two cache reads, and you stop worrying about the cache expiring mid-task.
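That arithmetic generalizes. Here is a small sketch of the cache economics using the multipliers quoted above (1.25x or 2x write, 0.10x read); the function names are mine, not any SDK's:

```go
package main

import "fmt"

// uncachedCost: the prompt prefix is billed at full input price on
// every request.
func uncachedCost(promptTok, requests int, pricePerMTok float64) float64 {
	return float64(promptTok) / 1e6 * pricePerMTok * float64(requests)
}

// cachedCost: one cache write at writeMult times base input price
// (1.25x for the 5-minute TTL, 2x for the 1-hour TTL), then every
// subsequent request reads the cache at 10% of base price.
func cachedCost(promptTok, requests int, pricePerMTok, writeMult float64) float64 {
	base := float64(promptTok) / 1e6 * pricePerMTok
	return base*writeMult + base*0.10*float64(requests-1)
}

func main() {
	// 10,000-token system prompt, 10 requests, $5/MTok (Opus 4.6).
	u := uncachedCost(10000, 10, 5)
	c := cachedCost(10000, 10, 5, 1.25)
	fmt.Printf("uncached $%.4f vs cached $%.4f (%.0f%% saved)\n",
		u, c, 100*(1-c/u))
}
```

Plugging in the 1-hour TTL (writeMult = 2) shows the break-even claim: the extra $0.05 write premium is recovered by the second $0.045-per-read saving.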

[Chart: 10K-token system prompt on Claude Opus 4.6 ($5/MTok input).]

  • Without caching: $0.50 ($0.05 per request × 10 requests).
  • With caching: $0.11 ($0.0625 cache write + $0.005 per hit).
  • Savings: 78%. The cache breaks even after just one read.

For developers building agentic systems — where the same system prompt, tool definitions, and accumulated context get sent dozens or hundreds of times — caching is the difference between a viable product and a bankruptcy filing.

The Hidden Line Item: Reasoning Tokens

There’s one more cost that most developers don’t account for: inference-time compute.

Modern reasoning models (OpenAI’s o-series, DeepSeek’s R1, Claude’s extended thinking mode) don’t just read your prompt and respond. They generate an internal chain of thought — a “hidden monologue” — before producing the visible answer. These reasoning tokens are billed as output tokens but never shown to the user.

On OpenAI’s o-series models, this internal monologue can run 2–5x longer than the visible response. You’re not just paying for the answer. You’re paying for the AI to think about the answer. And on models where output tokens cost $15–25 per million, that thinking isn’t cheap.

This creates a new optimization problem: when do you need the model to reason deeply, and when is a fast, shallow response sufficient? The answer is model routing — using a cheap model (Nano, Haiku) for classification and simple tasks, and reserving the expensive reasoning models for problems that actually require them.
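A routing layer doesn't need to be clever to pay for itself. Here is a minimal sketch, with hypothetical tier names and a deliberately naive task-kind mapping (a real router might classify the task with the cheap model itself):

```go
package main

import "fmt"

// Task is a unit of work headed for an LLM call.
type Task struct {
	Kind string // e.g. "classify", "extract", "refactor", "debug"
}

// route picks a model tier per task. The tier names and the mapping
// are illustrative assumptions, not a prescribed taxonomy.
func route(t Task) string {
	switch t.Kind {
	case "classify", "extract":
		// Cheap tier: fast, no hidden reasoning tokens.
		return "nano"
	case "refactor":
		// Mid tier: capable, moderate output price.
		return "sonnet"
	default:
		// Reasoning tier: the output bill includes the hidden
		// chain of thought, often 2-5x the visible response.
		return "opus-thinking"
	}
}

func main() {
	for _, t := range []Task{{"classify"}, {"refactor"}, {"debug"}} {
		fmt.Printf("%-9s -> %s\n", t.Kind, route(t))
	}
}
```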

The New Efficiency

Here’s the shift that developers need to internalize.

In the old world, efficiency meant fewer lines of code, less memory, faster execution. In the Token Economy, efficiency means token density — maximizing the information content per token sent to the model.

This has concrete implications for how we write code, structure prompts, and design AI-powered systems:

  • Your system prompts should be surgically concise. Every token that isn’t earning its keep is a recurring tax.
  • Your conversation architecture should prune aggressively. Summarize prior context instead of carrying raw history.
  • Your code examples should favor information-dense representations. A well-commented 20-line function teaches more per token than a 200-line class with verbose field names.
  • Your model selection should be tiered. Don’t send a $25/MTok Opus request to classify a string.
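Aggressive pruning can be as simple as capping verbatim history. A sketch with a stubbed summary; in practice the summary line would come from a cheap model call:

```go
package main

import "fmt"

// pruneHistory keeps the last keepLast turns verbatim and collapses
// everything older into a one-line placeholder. The placeholder stands
// in for a real summarization call (e.g. to a cheap model tier).
func pruneHistory(turns []string, keepLast int) []string {
	if len(turns) <= keepLast {
		return turns
	}
	dropped := len(turns) - keepLast
	summary := fmt.Sprintf("[summary of %d earlier turns]", dropped)
	return append([]string{summary}, turns[dropped:]...)
}

func main() {
	history := []string{"turn 1", "turn 2", "turn 3", "turn 4", "turn 5"}
	for _, line := range pruneHistory(history, 2) {
		fmt.Println(line)
	}
}
```

Note the tension with prompt caching: rewriting the prefix invalidates cached segments, so prune at coarse intervals rather than on every turn.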

The engineers who thrive in the Token Economy won’t be the ones who write the cleverest prompts. They’ll be the ones who understand that every token is a unit of cost, and that the architecture of a conversation is as important as the architecture of the code inside it.

The 20-minute wall isn’t a bug. It’s the price signal telling you that intelligence has a marginal cost — and the market is just starting to figure out what it’s worth.


If you’re building with LLMs in production, the most important metric you’re not tracking is tokens-per-task. Start there.
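What tracking that metric can look like, as a minimal sketch (the record shape and the task labels are assumptions about your own pipeline):

```go
package main

import "fmt"

// TaskUsage records one model call attributed to a named task.
type TaskUsage struct {
	Task      string
	InputTok  int
	OutputTok int // include hidden reasoning tokens if billed as output
}

// tokensPerTask totals tokens by task name: the metric to watch.
func tokensPerTask(log []TaskUsage) map[string]int {
	totals := map[string]int{}
	for _, u := range log {
		totals[u.Task] += u.InputTok + u.OutputTok
	}
	return totals
}

func main() {
	calls := []TaskUsage{
		{"refactor", 12000, 3000},
		{"refactor", 14000, 2500},
		{"classify", 80, 5},
	}
	for task, tok := range tokensPerTask(calls) {
		fmt.Printf("%s: %d tokens\n", task, tok)
	}
}
```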


References

  1. KV Caching in Transformers — Hugging Face Blog, “KV Caching Explained: Optimizing Transformer Inference Efficiency” (Sep 2025)
  2. KV Cache from Scratch — Sebastian Raschka, “Understanding and Coding the KV Cache in LLMs from Scratch” (Jun 2025)
  3. DeepSeek-V2 Paper (MLA Origin) — DeepSeek-AI, “DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model” (2024). Reports 93.3% KV cache reduction via MLA.
  4. MLA Deep Dive — Towards Data Science, “DeepSeek-V3 Explained 1: Multi-Head Latent Attention” (Feb 2025)
  5. MLA Technical Analysis — Sebastian Raschka, “Multi-Head Latent Attention (MLA)” (Mar 2026)
  6. DeepSeek Inference Cost Analysis — IntuitionLabs, “DeepSeek’s Low Inference Cost Explained” (Mar 2026). Confirms DeepSeek R1 runs 20–50× cheaper than OpenAI’s comparable model.
  7. DeepSeek API Pricing — NxCode, “DeepSeek API Pricing 2026” (Mar 2026). V3.2 at $0.28/$0.42 per MTok input (cache miss)/output, R1 at $0.55/$2.19 input/output.
  8. GPT-5.4 Nano Launch — OpenAI, “Introducing GPT-5.4 mini and nano” (Mar 17, 2026). Nano priced at $0.20/$1.25 per MTok.
  9. Claude API Pricing — Anthropic, “Pricing”. Opus 4.6 at $5/$25, Sonnet 4.6 at $3/$15, cache hits at 10% of base input.
  10. Prompt Caching — Anthropic, “Prompt Caching Documentation”. 1.25x write cost, 10% read cost, 5-minute default TTL.
  11. OpenAI vs Anthropic Caching Comparison — Finout, “OpenAI vs Anthropic API Pricing Comparison (2026)”. Confirms both providers offer 90% cached input discounts.
  12. Reasoning Tokens — OpenAI, “Reasoning Models”. Documents internal reasoning tokens billed as output, not shown to user.
  13. Reasoning Token Cost Multiplier — Finout (ibid). Reports real-world output costs on o-series models “often run 2–5x higher than the headline rate.” Industry estimate, not an official OpenAI figure.
  14. Jevons Paradox — Wikipedia, “Jevons paradox”. W.S. Jevons, The Coal Question (1865).
  15. Jevons Paradox Primary Source — Yale Energy History, “W. Stanley Jevons, ‘The Coal Question,’ 1865”