2026-05-15·4 min read

Taking an AI agent from pilot to production: the reliability work nobody scopes

Most AI agents work in a Jupyter notebook and fail in production. The gap isn't the model — it's the engineering. Here's what production-grade agentic systems actually require.

Most AI agents fail in production not because the model isn't capable enough, but because the engineering around the model isn't production-grade. The demo works. The agent calls a tool, gets a result, produces an output. It looks finished. Then it hits a real request, the tool call fails, and the agent silently returns a hallucinated result instead of retrying or escalating.

This is the pilot-to-production gap. Closing it is unglamorous, specific engineering work — and it's almost never scoped.

What "production-grade" actually means for an agentic system

A production-grade agentic system has four properties that most pilots lack:

1. Structured error recovery. Every tool call can fail. A production agent handles failure explicitly: it retries with exponential backoff, escalates to a human if retries exhaust, and never silently proceeds on a failed tool call. The failure path is as designed as the success path.

2. Observable internals. Every agent step is traced. You can look at a log and reconstruct exactly what the agent did, what it saw, and what it decided — without re-running the agent. Without this, debugging production failures is guesswork.

3. Cost discipline. Agentic systems have running costs that compound. A ten-step agent that processes ten thousand requests a day at S$0.03 per run costs S$109,500 a year. Prompt caching can reduce this by 60–80%. Nobody scopes this in the pilot.

4. Provider abstraction. The agent logic should be decoupled from the specific model and API you're using today. Models improve, pricing changes, providers go down. A well-designed agent can switch providers by changing a config line, not by rewriting the orchestration layer.

The reliability work, item by item

Tool-call failure handling

The simplest reliable pattern is a tool wrapper that:

Catches network and API errors
Retries with exponential backoff (start at 1s, cap at 30s, max 3 attempts)
Logs the failure with the tool name, input, error code, and attempt count
Returns a structured failure object the agent can reason about — never raises an unhandled exception

Most pilots have none of this. The agent receives an error and either crashes or hallucinates past it.

Observability

The minimum viable observability stack for a production agent:

A trace ID that follows the entire request lifecycle
A step log: for each agent iteration, record the prompt (or a hash of it), the tool calls made, the tool results, and the model's response
Latency and token spend per step — aggregated daily so you can spot drift
Error rates per tool — so you know when an external dependency degrades

This doesn't require a commercial observability platform. A Supabase table and a structured logger get you most of the way there for a small agent fleet.

Prompt caching

Anthropic's Claude supports prompt caching on large context prefixes. For an agent with a long system prompt and a static tool schema, caching the prefix reduces input token costs by 90% on cache hits. The engineering work is:

Structure the prompt so the cacheable prefix (system prompt, tool definitions) is separated from the per-request context
Instrument cache hit rates — a dashboard metric, not a launch-week check
Set a cache TTL budget: longer caches save more money but reduce freshness

On a system processing 50,000 requests per day with a 4,000-token system prompt, caching can save S$3,000–S$8,000 per month at current pricing.

MCP-ready architecture

The Model Context Protocol (MCP) is becoming the standard interface for connecting agents to external tools and data sources. An MCP-ready agent architecture:

Exposes tool definitions through an MCP server rather than hardcoding them in the prompt
Uses a provider abstraction layer that can route requests to different models
Treats tool schemas as configuration, not code

Building this from the start costs a few extra days. Retrofitting it into a tangled pilot costs weeks.

What the handoff should look like

A pilot is not production-ready just because it's deployed. The production handoff should include:

Runbook: what to do when the agent fails, when tool X goes down, when costs spike
Monitoring: dashboards showing agent health, cost, and error rates
On-call coverage: who gets paged, what the escalation path is
Cost projections: modelled at current volume, with sensitivity to growth

Most vendors won't give you this. They'll deploy the agent and call it done. The reliability engineering is what separates a demo that survived its launch week from a system your business can depend on.

This article reflects our approach to agentic systems at Sygan. Our own internal agentic builds are held to the same standard described here — reliability work first, features second.

Ready to scope your project?

The scoping call is free. We'll assess grant eligibility and tell you honestly whether EDG/PSG applies to your build.

Start a project Grant calculator

← All insights