Why Your AI Agent Works in Staging and Breaks in Production

Why does my AI agent break in production?

Because staging is a lie. Not a malicious one, but a lie all the same: staging is built from a small, curated set of test cases that exercise the happy path. Production isn't the happy path. Production is users typing emoji into form fields, sending 10,000 requests per minute, asking your agent questions your test set never anticipated, and discovering edge cases your team never thought of. The agent itself didn't change between staging and production. The environment did.

Closing the gap doesn't mean making the agent smarter (it usually can't be made significantly smarter on the timeline you have). It means building the 5 layers around the agent that catch the failures staging missed: input validation, output validation, observability, human-in-the-loop review, and a continuous eval harness. Each layer catches a different class of failure. Skip any one and production will surface that exact class of failure within days.

The 6 most common failure modes

We've productionized 40+ AI agents across customer support, document processing, data extraction, and code generation. Every single one had at least three of these failures in week one of production: hallucination cascades (the agent confidently produces wrong information that triggers downstream actions), tool-call mismatches (the agent calls an API with the wrong argument shape and silently fails), latency spikes under load (the agent works in single-request tests but degrades at concurrency 100+), cost blow-ups (tokens consumed scale non-linearly with input complexity, leading to 10× expected spend), edge-case crashes (empty input, unusual unicode, very long input), and prompt drift after a model upgrade (Sonnet 3.5 → 4.6 changes behavior on 15–20% of test cases). None of these was the agent's "intelligence" failing. In every case, what was missing was the system around the agent.
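To make the tool-call mismatch concrete, here is a minimal sketch of checking an agent's tool call against an expected argument shape before dispatching it, so a malformed call fails loudly instead of silently. The `TOOL_SPECS` registry, the `refund_order` tool, and its arguments are hypothetical names chosen for illustration, not part of any particular framework.

```python
from typing import Any

# Hypothetical tool registry: tool name -> {argument name: expected type}.
TOOL_SPECS: dict[str, dict[str, type]] = {
    "refund_order": {"order_id": str, "amount_cents": int},
}


def check_tool_call(name: str, args: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the call looks well-formed."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = []
    for key, expected in spec.items():
        if key not in args:
            problems.append(f"missing argument: {key}")
        elif type(args[key]) is not expected:  # exact type match, so True is not an int
            actual = type(args[key]).__name__
            problems.append(f"{key}: expected {expected.__name__}, got {actual}")
    for key in args:
        if key not in spec:
            problems.append(f"unexpected argument: {key}")
    return problems
```

A caller would dispatch the tool only when the returned list is empty, and otherwise log the problems and retry or escalate.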

The 5 layers that fix it

Layer 1: Input validation. Run schema, range, and content checks before the LLM sees the input. Strip injection attempts. Reject inputs that don't match the expected shape. This is the cheapest layer to build and catches the most failures.

Layer 2: Output validation. Validate every LLM response against a schema (JSON validators, regex format checks, fact-checking against a known set of allowed values). If the output fails validation, retry or fail gracefully. Don't let unvalidated output reach downstream systems.

Layer 3: Observability. Structured logs, traces, and alerts on the metrics that matter (error rate, latency p50/p99, cost per task). You can't fix what you can't see. We build this into every agent we ship at AI Agents Development.

Layer 4: Human-in-the-loop. For low-confidence outputs or high-consequence decisions, route to a human reviewer. Reviewing the 5–10% of cases that need it spares you the full consequences of acting on them unchecked.

Layer 5: Eval harness. A continuous regression test suite of 100–500 inputs that runs on every prompt change, every model upgrade, and every release. It catches regressions before users do.
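As a rough illustration of how Layers 1, 2, and 4 fit together, the sketch below wraps a model call with an input check, an output check with retry, and a confidence-based route to human review. Everything named here is an assumption for the example: the `call_model` callable, the JSON reply shape with `category` and `confidence` fields, the allowed categories, and the thresholds. Stripping injection attempts is not shown.

```python
import json
from typing import Callable

MAX_INPUT_CHARS = 4000                                  # assumed limit
ALLOWED_CATEGORIES = {"billing", "shipping", "other"}   # assumed label set
REVIEW_THRESHOLD = 0.8                                  # assumed confidence cut-off


def validate_input(text: str) -> str:
    """Layer 1: reject empty or oversized input before the model sees it."""
    cleaned = text.strip()
    if not cleaned:
        raise ValueError("empty input")
    if len(cleaned) > MAX_INPUT_CHARS:
        raise ValueError("input too long")
    return cleaned


def validate_output(raw: str) -> dict:
    """Layer 2: parse the reply as JSON and check it against the expected shape."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    category = data.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        raise ValueError("category not in allowed set")
    confidence = data.get("confidence")
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("confidence must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence out of range")
    return data


def run_agent(text: str, call_model: Callable[[str], str], max_attempts: int = 2) -> dict:
    """Validate input, call the model, validate output, retry, then fall back to review."""
    cleaned = validate_input(text)
    for _ in range(max_attempts):
        try:
            result = validate_output(call_model(cleaned))
        except ValueError:
            continue  # invalid output: try again
        # Layer 4: low-confidence answers go to a human, not to downstream systems.
        route = "auto" if result["confidence"] >= REVIEW_THRESHOLD else "human_review"
        return {"route": route, "result": result}
    return {"route": "human_review", "result": None}  # fail gracefully after retries
```

The design choice worth noting is that unvalidated output never leaves `run_agent`: the function returns either a validated result with a route, or an explicit hand-off to review.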

How do I roll out an AI agent safely?

Same way you roll out any high-stakes system: gradually, with gates between stages.

Stage 1 (Shadow): The agent runs on production traffic but its output is logged, not used. Compare to baseline for 5–7 days.

Stage 2 (5% Canary): Route 5% of traffic to the agent. Monitor error rate, latency, and cost. If the gates hold for 5–7 days, advance.

Stage 3 (50% Canary): Half of traffic. Watch carefully.

Stage 4 (100%): Full rollout. Continue monitoring; the eval harness still runs nightly.

Roll back at any stage if any gate metric breaches. This 4-stage path is paranoid by design, and it's the only path that's let us roll out 40+ agents without a production incident worth writing home about. For workflows where the consequences of agent failure are critical, also consider running an AI automation discovery first to scope what's worth automating at all.
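The gate logic between stages can be sketched as a small function. This is one possible reading, not a prescribed implementation: the stage names, the `Gates` and `Metrics` fields, the choice to roll back exactly one stage on a breach, and the reuse of a 5-day minimum window for every stage are all assumptions made for the example.

```python
from dataclasses import dataclass

STAGES = ["shadow", "canary_5", "canary_50", "full"]  # assumed stage names


@dataclass
class Gates:
    max_error_rate: float
    max_p99_latency_ms: float
    max_cost_per_task_usd: float


@dataclass
class Metrics:
    error_rate: float
    p99_latency_ms: float
    cost_per_task_usd: float
    days_observed: int


def next_stage(current: str, metrics: Metrics, gates: Gates, min_days: int = 5) -> str:
    """Roll back on a breach, hold until the window has elapsed, otherwise advance."""
    breached = (
        metrics.error_rate > gates.max_error_rate
        or metrics.p99_latency_ms > gates.max_p99_latency_ms
        or metrics.cost_per_task_usd > gates.max_cost_per_task_usd
    )
    index = STAGES.index(current)
    if breached:
        return STAGES[max(index - 1, 0)]  # assumption: roll back one stage
    if metrics.days_observed < min_days or current == "full":
        return current  # keep observing, or already fully rolled out
    return STAGES[index + 1]
```

For instance, with gates of 2% error rate, 3000 ms p99 latency, and $0.05 per task (illustrative numbers only), a shadow run that stays under all three for six days would advance to the 5% canary.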

Written by

Partha Sarathi Ghosh

Founder & Engineering Lead, DevOrbital

Partha leads DevOrbital, where his team has elevated 50+ businesses across MVP development, AI agents, custom software, and growth. He writes about the hidden mechanics of getting AI-generated code into production, MVP scope discipline, and the architecture decisions founders make too late.
