Pedro Bertoluchi

The AI bill you didn't budget for: tracking and cost guardrails in .NET

The PoC was cheap. Production isn't. Four invisible cost vectors that quietly turn a clean Azure OpenAI bill into a board-level conversation.

7 min read
Back to blog

Every PoC is cheap. Production is where the AI bill stops being a rounding error and starts showing up in finance reviews. The gap is rarely the headline model price. It is a handful of architectural defaults that nobody flagged because the demo ran on a hundred prompts and the prod system runs on a hundred thousand.

Retry tokens are the first leak. A naive Polly policy with three exponential retries and no circuit breaker turns a flaky upstream into a triple bill on every spike. The fix is boring: a typed HttpClient with a circuit breaker, a retry budget per minute, and a fallback path that returns a cached or smaller-model answer instead of hammering the same endpoint. Retries belong on transient failures, not on rate limits and not on content filter rejections.

Embedding re-index is the second. Teams pick text-embedding-3-large on day one, ship two million documents, and then realise that any frontier change forces a full re-index at full price. The architectural choice that pays back is a content-addressed embedding cache keyed by document hash plus model name plus chunking parameters. Re-index becomes incremental, and rollback to a previous model is a config flip instead of a weekend.

Fine-tunes that did not scale are the third. Someone fine-tuned GPT-4o on five hundred examples to fix a tone problem, the per-token cost jumped, and the quality gain evaporated under real traffic. Smaller models on hot paths almost always win. Route classification and extraction to a Mini variant, reserve the large model for the long-tail reasoning step, and measure the routing decision instead of guessing. The cost curve flattens immediately.

AI observability is the fourth, and it is the one engineers resist because it feels like plumbing. Per-user, per-feature and per-prompt-version tracking from day one, emitted through OpenTelemetry and stored next to the application traces. Without it, the monthly bill is one big number and nobody can attribute it. With it, the conversation moves from cutting features to renegotiating a single noisy customer or retiring a single bad prompt.

Treat cost as a hard constraint with the same rigour as latency. A budget per tenant enforced in middleware, a daily anomaly alert wired to the same channel as production alarms, and a dashboard that any product manager can read without a SQL course. None of this is glamorous. All of it is the difference between a margin-positive AI feature and the quiet renegotiation that follows the third invoice.

Tags

  • #applied-ai
  • #azure-openai
  • #cost

Let's talk about your next project.

Share the challenge in a few lines. Within one business day I respond with a technical assessment and the next steps.