Cost, Latency & Context Management · AI Engineer

8Cost, Latency & Context Management

Tokens cost money. A lot of money.

You build a chatbot. It works beautifully in testing. You deploy it. A hundred users show up. Your monthly API bill is 3,000 dollars.

Wait, what?

Here's the math. A single GPT-4-class API call with a long conversation might use 4,000 input tokens and 1,000 output tokens. At typical pricing, that's around 5-7 cents per conversation turn. Doesn't sound like much. But multiply by thousands of users, several turns each, and it adds up fast.

The dirty secret of production LLM apps: the engineering isn't in making them work. It's in making them affordable.

Understanding token pricing

Every LLM API charges by the token. A token is roughly 3/4 of a word in English. "Hello, how are you?" is about 6 tokens.

Most providers charge differently for input tokens (what you send) and output tokens (what the model generates). Output tokens are usually 2-4x more expensive because generating text is computationally harder than reading it.

Model Class	Input (per 1M tokens)	Output (per 1M tokens)
Small (Haiku, GPT-4o mini)	0.25 - 0.50 dollars	1 - 2 dollars
Medium (Sonnet, GPT-4o)	3 - 5 dollars	10 - 15 dollars
Large (Opus, GPT-4)	10 - 15 dollars	30 - 75 dollars

The difference between model tiers is enormous. A request that costs 0.1 cents on a small model costs 5 cents on a large one. Same question, 50x the cost.

This is why model selection per task is one of your biggest levers. More on that later.

Token budgets: planning ahead

A token budget is exactly what it sounds like — a cap on how many tokens a request can use. You set it before the request, and you enforce it.

Why bother? Because without a budget, a single runaway conversation can eat thousands of tokens. A user who pastes in a 50-page document as context? That's 15,000+ tokens just for the input.

Here's a practical approach:

System prompt budget: Keep it short. Every token in your system prompt is charged on every single request. A 500-token system prompt across 100,000 daily requests costs as much as 50 dollars per day on a medium model. Trim it to 200 tokens and you save 60%.

Context budget: Limit how many previous messages you include in a conversation. Don't send the entire conversation history every time — use the last 5-10 messages, or better yet, summarize older messages.

Output budget: Set max_tokens on every API call. If you need a one-sentence answer, don't allow 2,000 tokens of output. The model will fill whatever space you give it.

Caching: stop paying for the same answer twice

If 50 users ask "What's your return policy?" in the same hour, do you really need to call the LLM 50 times?

Exact-match caching is the simplest approach. Hash the full prompt. If you've seen this exact prompt before and the cache hasn't expired, return the cached response. Works great for repetitive queries like FAQ-type questions.

But users rarely ask the exact same thing. "What's your return policy?" and "How do I return something?" are different strings with the same intent.

Semantic caching fixes this. Instead of hashing the exact text, you embed the query into a vector and search for similar previous queries. If there's a match above a similarity threshold (say 0.95), return the cached response.

# Pseudocode for semantic caching
query_embedding = embed(user_query)
cached = vector_store.search(query_embedding, threshold=0.95)
 
if cached:
    return cached.response  # Free!
else:
    response = llm.generate(user_query)
    vector_store.store(query_embedding, response)
    return response

The savings can be dramatic. If even 30% of your queries hit the cache, you've cut your API costs by nearly a third. And latency drops to near-zero for cached responses.

Gotcha: caching works best for factual, static answers. If your responses depend on real-time data or user-specific context, the cache hit rate will be low. Know when caching makes sense for your use case.

Streaming: perceived speed matters

Here's an interesting UX insight: a response that takes 5 seconds to generate feels fast if it starts appearing after 200 milliseconds.

That's streaming. Instead of waiting for the entire response to be generated and then sending it all at once, you send each token as it's generated.

Every major LLM API supports streaming. Instead of a single JSON response, you get a stream of server-sent events (SSE), each containing a few tokens. Your frontend renders them as they arrive.

The total time to complete the response is the same. But the perceived latency drops dramatically because the user sees progress immediately. It's the same reason progress bars exist — waiting is more bearable when you can see something happening.

For chatbots, streaming is basically mandatory. A 3-second blank screen followed by a wall of text feels broken. The same text appearing word by word feels responsive.

Implementation tip: when streaming, you can't cache the full response until the stream completes. Buffer the tokens, and once the stream ends, cache the complete response for future exact or semantic matches.

Context window management

Modern LLMs have large context windows. Claude supports 200K tokens. GPT-4 goes up to 128K. Gemini reaches over a million.

Tempting to just dump everything in, right?

Don't. Bigger context = more tokens = more cost = more latency. And research shows that LLMs pay less attention to information in the middle of long contexts (the "lost in the middle" problem). More context isn't always better context.

Here are the main strategies for managing context:

Truncation: The simplest approach. Keep the most recent N messages. Old messages get dropped. Works for casual chatbots where older context isn't critical.

Summarization: Before dropping old messages, summarize them. Include the summary as a "memory" in the prompt. The model gets the gist of the conversation without the full token cost.

Sliding window: Keep the first message (usually the system prompt), the last N messages, and optionally a summary of everything in between. This preserves the original instructions and recent context while staying within budget.

RAG instead of stuffing: Instead of putting everything in the context window, store documents in a vector database and retrieve only the relevant chunks. A RAG pipeline that retrieves 3 relevant paragraphs is cheaper than stuffing 50 pages into the prompt.

Prompt optimization: shorter is cheaper

Every word in your prompt costs money. This isn't a metaphor — it's literally true with per-token pricing.

A verbose prompt:

You are a helpful customer service assistant for Acme Corp. 
Your job is to help customers with their questions about our 
products and services. Please be polite, professional, and 
thorough in your responses. If you don't know the answer, 
please let the customer know that you'll need to check with 
a specialist. Always end your response by asking if there's 
anything else you can help with.

That's about 80 tokens. Repeated on every single API call.

A trimmed version:

Acme Corp support assistant. Be helpful and concise. If 
unsure, say you'll check with a specialist.

About 25 tokens. Same behavior from the model in most cases. 70% token savings on the system prompt.

This sounds trivial, but at scale it matters. A 55-token savings per request across a million daily requests on a medium model saves hundreds of dollars per month.

Review your prompts regularly. Remove filler. Test whether each instruction actually changes the model's behavior. If removing a sentence doesn't change the output, remove it.

Model selection by task: the biggest lever

Not every task needs the smartest model. This is probably the single most impactful cost optimization.

Classification, extraction, formatting: Use the smallest model that works. A small model can categorize support tickets, extract dates from emails, or format JSON just fine. No need for the expensive model.

Summarization, translation, simple Q and A: A medium model handles these well. Good enough quality at a fraction of the cost.

Complex reasoning, nuanced writing, multi-step planning: This is where the large models earn their price. If the task requires genuine intelligence, pay for it.

A common pattern: use a small model as a router. It reads the user's message and classifies the complexity. Simple questions go to the cheap model. Complex ones go to the expensive model. You pay top dollar only when you need to.

User: "What are your store hours?"  →  Small model  (0.01 cents)
User: "Compare your premium plan vs enterprise for a 500-person company 
       with HIPAA requirements"  →  Large model  (3 cents)

This routing approach can cut costs by 60-80% with minimal quality impact.

Batching: amortize the overhead

If you need to process a large volume of requests that aren't time-sensitive, batch them.

Most LLM APIs offer batch endpoints at 50% discounts. You submit a file of requests, and the API processes them within a few hours. No streaming, no real-time responses. But half the cost.

Good candidates for batching:

Nightly content moderation runs
Bulk document classification
Generating embeddings for a large dataset
Running evaluation suites

If it doesn't need to be real-time, batch it.

Monitoring costs in production

You can't optimize what you don't measure. Track these numbers:

Cost per request: Break it down by endpoint. Your chatbot might cost 3 cents per turn, your summarizer 0.5 cents, your classifier 0.01 cents. Know which features are expensive.

Token usage over time: Look for trends. Is usage growing linearly with users? Or is there a specific feature that's burning through tokens disproportionately?

Cache hit rate: If you've set up caching, track how often it saves you an API call. A low hit rate means your cache strategy needs tuning.

Cost per user: The metric that matters most to the business. If you're charging users 20 dollars per month and each active user costs you 15 dollars in API fees, you have a problem.

Set up alerts. If daily spending exceeds 2x the average, you want to know before the bill arrives.

The cost optimization playbook

If your LLM costs are too high, work through this list in order. Each step has diminishing returns, so start at the top:

Route by complexity — Use small models for simple tasks. Biggest single lever.
Trim your prompts — Remove unnecessary instructions and verbose system prompts.
Implement caching — Exact-match first, then semantic caching.
Manage context — Summarize old messages instead of sending full history.
Set token budgets — Cap input and output tokens per request type.
Batch non-urgent work — Use batch APIs at half the cost.
Monitor and iterate — Track cost per request, find the expensive outliers, optimize them specifically.

Most teams can cut their LLM costs by 50-70% just by implementing the first three items. No quality loss. Just smarter engineering.

The finish line

That wraps up the Production AI Engineering series. We went from writing our first HuggingFace app, through fine-tuning, frameworks, model serving, cloud platforms, MLOps, evaluation, and now cost management.

If you followed along, you now know how to take a model from "works in a notebook" to running in production with proper monitoring and budget controls. That's the whole pipeline.

The field moves fast — new models, new tools, new pricing every month. But the fundamentals here don't expire. Serving infrastructure still needs load management. Evaluations still need real user queries. And cost optimization will only get more important as usage scales.

Go build something. The best way to learn the rest is by shipping.