Shipping 100k AI interactions a day without the bill eating you alive

When we launched the Neosmart LLM Platform, our API bill was a rounding error. The product was still finding its footing, daily requests were in the low thousands, and we were preoccupied with reliability and latency rather than cost. Six months later, we were processing 100,000 AI interactions every day and the infrastructure bill had grown from something we barely noticed to something that could make or break the unit economics of the entire product.

This is the story of how we brought that cost under control — not by shipping worse AI, but by being smarter about what we actually needed to compute. The techniques that ended up mattering most were prompt caching, Redis-backed response caching, model tiering, and batching. Each one saved a meaningful chunk. Together they prevented a cost crisis.

How 100k/day sneaks up on you

At launch the math looked fine. GPT-4o at roughly $0.005 per 1k input tokens and $0.015 per 1k output tokens, average interaction ~1,200 tokens in and ~400 tokens out: call it $0.012 per request. At 5,000 requests a day that's $60/day, or about $1,800/month. Manageable.

Then usage grew. It always grows faster than the spreadsheet said it would. By the time we hit 100k interactions daily, that same per-request rate becomes $1,200/day, $36,000/month. Not catastrophic for a healthy SaaS business, but we were still in a growth phase — and more importantly, the 100k baseline was a floor, not a ceiling. We needed to get cost per interaction down before the next inflection point arrived.

The other thing that snuck up on us: not all interactions are equal. A quick classification task might cost $0.002. A complex summarisation with a large retrieved context could run $0.08. When you aggregate everything into a single daily number, the expensive tail requests dominate the bill but are invisible in your dashboards until you deliberately instrument for them.

The four levers

1. Prompt caching

Most production LLM prompts have a large static component — the system prompt, tool definitions, retrieved knowledge base content — followed by a small dynamic component (the user's actual message). The Anthropic Claude API's prompt caching feature and OpenAI's equivalent let you mark the static prefix as cacheable. Subsequent requests that share that prefix hit a cache at the provider level; you pay a fraction of the input token cost for cache hits.

For our knowledge-base-grounded workflows, the system prompt plus retrieved context averaged ~3,000 tokens. That prefix was identical across thousands of requests per hour. After enabling caching, cache hits on that prefix ran at ~80–90% over any given hour. The effective input cost for those tokens dropped by roughly 60–70%. The impact on the total bill was significant: input tokens were our largest cost line, and prompt caching attacked the biggest, most repetitive slice of them.

The key discipline: structure your prompts so the static content comes first. Cache invalidation happens at the token level — any modification to a prefix character invalidates everything downstream in the cached section. That means no dynamic timestamps, user IDs, or A/B test variants inside the cached prefix.

2. Redis response caching

For deterministic prompts — classification, tagging, structured extraction from fixed templates — the same input reliably produces the same output. There is no reason to call the model twice for an identical request. We implemented a two-tier Redis cache: exact-match keying for identical prompt strings, and semantic similarity matching using cosine distance on embeddings for near-identical prompts above a 0.97 threshold.

Exact-match cache hits are free; you pay only for the embedding call on cache lookup, which costs orders of magnitude less than a full completion. Semantic matching required more care — at 0.95 threshold we were occasionally serving stale answers for subtly different questions, so we tuned upward to 0.97 and added a TTL of 24 hours for semantic hits versus 7 days for exact matches.

The classification and tagging workflows that benefited most from this saw ~35–45% cache hit rates, reducing their net model calls by roughly the same proportion.

3. Model tiering

Not every task needs GPT-4o or Claude Sonnet. We categorised every workflow by complexity and routed accordingly. Short classification tasks, intent detection, and simple slot-filling went to gpt-4o-mini or claude-haiku-3. Multi-step reasoning, synthesis, and anything customer-facing went to the full models. The cost delta between tiers is substantial — mini/haiku models are roughly 10–20x cheaper per token than their full counterparts.

Getting the routing logic right was the hardest part. We evaluated each task type against a golden evaluation set of 200 examples before reassigning it to a cheaper model. Accuracy on that set had to stay within 2 percentage points of the full-model baseline. Tasks that failed that bar stayed on the expensive model. About 40–50% of our request volume passed the evaluation and moved to cheaper models, cutting the blended cost per interaction by roughly 30–40%.

4. Batching

Some workflows don't need real-time responses. Document processing, nightly report generation, bulk tagging jobs — these can tolerate a 15-minute window. The OpenAI Batch API and Anthropic's equivalent offer significant discounts (around 50%) in exchange for asynchronous processing. We shifted every non-interactive workflow to batch mode. Batch requests now account for roughly 25% of our total token volume, at half the cost of synchronous calls.

Prompt engineering for cost

Once you've applied the architectural levers above, the next biggest gain comes from reducing tokens without reducing quality.

Token budgets. We added explicit instructions to most system prompts: "Respond in under 150 words unless the task requires more." LLMs are verbose by default. A token budget instruction cuts average output length by 20–35% with negligible quality impact on structured tasks.

System prompt compression. Original system prompts were written by engineers who were optimising for clarity and were not thinking about tokens. We audited every prompt and removed redundant instructions, merged overlapping guidance, and eliminated examples that didn't change output quality. The average system prompt shrank from ~800 tokens to ~350 tokens after compression. On high-volume workflows, those 450 tokens per request add up quickly.

Context trimming. RAG pipelines often stuff as much retrieved context as possible into the prompt on the theory that more context equals better answers. At scale that's expensive. We implemented a relevance scorer on retrieved chunks and enforced a hard cap of the top 5 chunks by relevance score, regardless of how many chunks were retrieved. For most queries, chunks 6–20 had near-zero marginal contribution to answer quality but real token cost. The trimming cut average input token counts by roughly 25% on RAG-heavy workflows.

Observability: what to instrument

You can't optimise what you can't see. We found that aggregate cost dashboards were almost useless for decision-making — they told us the bill was high but not why. The metrics that actually drove action were:

  • Cost per request, by workflow type. Reveals which features are expensive and whether cost spikes are isolated to a single flow.
  • Cost per user. Identifies power users who are consuming disproportionate resources. Useful for pricing model decisions and abuse detection.
  • Cost per feature. The single most actionable metric — it tells you whether to optimise a feature or raise its price tier.
  • Token distribution histograms. The 95th percentile request is often 3–5x the median. Understanding the tail is essential for capping runaway costs.
  • Cache hit rate, by cache type. A prompt cache hit rate below 70% on a stable workflow means your prompt structure is too dynamic.

We built a LangChain cost callback that tagged every LLM call with workflow name, user ID, and feature flag context, then streamed the token counts and model IDs to a lightweight aggregation layer. Here's the core of it:

# LangChain cost callback — tags every LLM call with workflow context

from langchain.callbacks.base import BaseCallbackHandler
from langchain_core.outputs import LLMResult
import time

# Approximate cost per 1k tokens (USD), update as pricing changes
MODEL_COST_PER_1K = {
    "gpt-4o":            {"input": 0.005,  "output": 0.015},
    "gpt-4o-mini":       {"input": 0.00015, "output": 0.0006},
    "claude-3-5-sonnet": {"input": 0.003,  "output": 0.015},
    "claude-3-haiku":    {"input": 0.00025, "output": 0.00125},
}

class CostTrackingCallback(BaseCallbackHandler):
    def __init__(self, workflow: str, user_id: str, metrics_client):
        self.workflow = workflow
        self.user_id = user_id
        self.metrics = metrics_client
        self._start_time = None

    def on_llm_start(self, serialized, prompts, **kwargs):
        self._start_time = time.monotonic()

    def on_llm_end(self, response: LLMResult, **kwargs):
        usage = response.llm_output.get("token_usage", {})
        model  = response.llm_output.get("model_name", "unknown")
        rates  = MODEL_COST_PER_1K.get(model, {"input": 0, "output": 0})

        input_tokens  = usage.get("prompt_tokens", 0)
        output_tokens = usage.get("completion_tokens", 0)
        cost_usd = (
            input_tokens  / 1000 * rates["input"] +
            output_tokens / 1000 * rates["output"]
        )
        latency_ms = (time.monotonic() - self._start_time) * 1000

        self.metrics.record({
            "workflow":      self.workflow,
            "user_id":       self.user_id,
            "model":         model,
            "input_tokens":  input_tokens,
            "output_tokens": output_tokens,
            "cost_usd":      cost_usd,
            "latency_ms":    latency_ms,
            "cache_hit":     usage.get("cached_tokens", 0) > 0,
        })

Redis caching strategy for deterministic prompts

The Redis decorator below wraps any LangChain chain call. It computes an exact-match key from the serialised prompt, checks the cache, and falls back to the model only on a miss. The embedding-based semantic path is triggered separately for workflows where near-identical prompts are common.

import hashlib, json, redis
from functools import wraps
from typing import Callable, Any

r = redis.Redis.from_url("redis://localhost:6379", decode_responses=True)

def llm_cache(ttl: int = 86400, prefix: str = "llm"):
    """
    Exact-match response cache for deterministic LLM calls.
    ttl: seconds. Set lower for time-sensitive content.
    """
    def decorator(fn: Callable) -> Callable:
        @wraps(fn)
        async def wrapper(*args, **kwargs) -> Any:
            # Build a stable cache key from all inputs
            key_data = json.dumps(
                {"args": args, "kwargs": kwargs},
                sort_keys=True, default=str
            )
            cache_key = f"{prefix}:" + hashlib.sha256(
                key_data.encode()
            ).hexdigest()

            cached = r.get(cache_key)
            if cached:
                # Deserialise and return without hitting the model
                return json.loads(cached)

            result = await fn(*args, **kwargs)

            # Only cache if result looks valid (no error sentinel)
            if result and not getattr(result, "error", None):
                r.setex(cache_key, ttl, json.dumps(result, default=str))

            return result
        return wrapper
    return decorator


# Usage: wrap any chain invoke
@llm_cache(ttl=604800, prefix="classify")  # 7-day TTL for classification
async def classify_document(text: str, categories: list[str]) -> dict:
    return await classification_chain.ainvoke({
        "text": text, "categories": categories
    })

When NOT to cache

Caching is not always appropriate. Three categories where we deliberately skip it:

Compliance-sensitive workflows. In regulated contexts — financial advice, medical information, legal summaries — serving a cached response from a month ago carries real liability. The prompt may have been identical but the model weights or the regulatory context may have changed. For these workflows we enforce a no-cache policy at the decorator level and accept the higher cost as a business requirement.

User-personalised responses. If the output incorporates user history, preferences, or profile data that can change between requests, exact-match caching will serve the wrong personalisation to the wrong state. The hit rate on these prompts tends to be low anyway — the dynamic suffix defeats deduplication.

Real-time and event-driven workflows. Anything where freshness is the feature — live data analysis, real-time summaries, current-state reasoning — should not be cached beyond a very short TTL, if at all. We use a 60-second TTL for these and treat it as coalescing rather than true caching.

Model routing logic

The routing function below runs before every LLM call. It checks the task type against a configuration table, verifies cache status, and selects the cheapest model that meets the quality threshold for that task type.

from enum import Enum
from dataclasses import dataclass

class ModelTier(str, Enum):
    FAST   = "fast"    # gpt-4o-mini / claude-3-haiku — cheap, low latency
    FULL   = "full"    # gpt-4o / claude-3-5-sonnet — expensive, high quality
    BATCH  = "batch"   # async batch endpoint — 50% discount, 15 min SLA

@dataclass
class RoutingRule:
    tier:          ModelTier
    fast_model:    str
    full_model:    str
    max_tokens:    int
    cacheable:     bool

# Per-workflow routing config, maintained as data not code
ROUTING_TABLE: dict[str, RoutingRule] = {
    "classify":   RoutingRule(ModelTier.FAST,  "gpt-4o-mini",       "gpt-4o",          150,  True),
    "tag":        RoutingRule(ModelTier.FAST,  "claude-3-haiku",    "claude-3-5-sonnet", 100,  True),
    "summarise":  RoutingRule(ModelTier.FULL,  "gpt-4o-mini",       "gpt-4o",          500,  False),
    "extract":    RoutingRule(ModelTier.FAST,  "gpt-4o-mini",       "gpt-4o",          300,  True),
    "generate":   RoutingRule(ModelTier.FULL,  "gpt-4o-mini",       "gpt-4o",          800,  False),
    "report":     RoutingRule(ModelTier.BATCH, "gpt-4o-mini",       "gpt-4o",         2000,  False),
}

def resolve_model(task: str, force_full: bool = False) -> tuple[str, RoutingRule]:
    rule = ROUTING_TABLE.get(task)
    if not rule:
        raise ValueError(f"Unknown task type: {task}")

    if force_full or rule.tier == ModelTier.FULL:
        return rule.full_model, rule
    return rule.fast_model, rule

Where we landed. After implementing all four levers plus the prompt engineering changes, our blended cost per interaction dropped by roughly 55–65% from peak. We went from a trajectory that was going to be untenable at 500k/day to one where the unit economics improve with scale — because the cache hit rates and batch ratios both increase as volume grows. The work took about six weeks of engineering time spread across the team, and the ROI was immediate.

The biggest mistake we made early on was treating the AI bill as a single number to optimise. It's not. It's a portfolio of workflows, each with its own cost profile, caching potential, and quality-cost tradeoff. Instrument first. Route and cache second. Compress prompts third. In that order, the gains compound.