AI Systems Architecture — Mastery6 / 9

Cost Engineering — Token Budgets That Hold

An AI feature that delights at 100 users can bankrupt you at 100,000. Cost is an architectural constraint, designed in — not discovered on the invoice.

Published May 15, 20261 min readHaythem Rehouma · Claude Mastery

Traditional software gets cheaper per user as you scale. AI software gets more expensive — every request costs tokens. If unit economics aren't designed in, growth is the thing that kills you.

Budget per request

Decide, per feature, a token budget the way you'd cap DB queries. Know the input + output token cost of a typical request and the worst case. "Cost per request × requests/month" is a spreadsheet you can fix before it's an invoice you can't.

Model tiering

Not every step needs your best model. Use a cheap, fast model for routing, classification, query rewriting, and faithfulness checks; reserve the expensive model for the step where quality is the product. This is often a 2–5x cost cut at equal quality.

Cache everything cacheable

Prompt/response cache for stable, repeated requests.
Prompt caching (provider-side) for the large, unchanging prefix of a prompt.
Retrieval cache so popular queries don't re-search.

A cache hit is a near-free request.

Trade quality for cost deliberately

Costs controlled. Next: making it fast — latency and throughput at scale.

Budget per request

Model tiering

Cache everything cacheable

Trade quality for cost deliberately

Related Claude skills you can install

Share this article

Series — AI Systems Architecture — Mastery

Keep learning

The Claude Mastery course