Continuously Improving Agents

Blog Post by Ilian Herzi and Gabriele Sorrento · May 4, 2026 · 19 min read
Continuous Improvement, RAG, RL, Agents, ML, Machine Learning, AGI, AI, Artificial Intelligence, Deployment.

TL;DR

Agents do not fail like normal software. Their failures are behavioral and data dependent. A single bad behavior (silently choosing the wrong tool, retrieving stale context, hallucinating a field, looping, skipping approval steps, etc.) can be caused by the prompt, the LLM, the context and memory, tools, or deployment errors. Continuous improvement requires a flywheel that quantifies good behavior, observes behavior at scale (think millions of traces per day), curates traces into failure patterns, diagnoses the root cause, proposes an intervention, evaluates the candidates that apply the fix, and then deploys safely.

Continuous Improvement flywheel overview

  1. Quantify “good” agentic behavior: pick metrics that adequately capture the desired agentic behavior and decide explicitly how to trade them off when they conflict.
  2. Observability: Implement an observability stack that captures structured traces.
  3. Telemetry to curated data: turn raw traces into versioned datasets for training and evaluation.
  4. Diagnosis: Produce intervention recommendations through failure clustering & triaging with root cause analysis (RCA).
  5. The improvement ladder: Climb to the cheapest rung that fixes the diagnosed failure (each successive rung is more expensive).
    1. Prompt optimization.
    2. Tool issues.
    3. Context issues.
    4. LLM Finetuning / post-training.
    5. Agentic finetuning / post-training.
  6. Evolving evaluation: Two parts, offline & online. Online evals catch drift, and online failures graduate into offline evals.
  7. Redeploy stack: Ramp up (prompt, tooling, etc.) when online evals, guardrail metrics, and business KPIs improve. Roll back otherwise.

Agents fail differently from software

Traditional software fails loudly. Exceptions get thrown, tests turn red, alerts fire. You get a stack trace and you go fix it.

Agents can fail quietly: they pick the wrong tool, retrieve a stale doc, hallucinate a field, skip approvals, or get stuck in a loop and burn through context. Without the proper setup, these issues may never surface, quietly costing money, time, and performance. With a naive setup, the only signal might be a noisy thumbs up or thumbs down from the user of your agent.

That's why the traditional toolkit (logs, tests, CI/CD, bug reports) is necessary but not sufficient. Agent failures are behavioral, multi-step, and distribution dependent. A single bad output can originate in the prompt, the retrieved context, the tool schema, the underlying model, the memory layer, or the infrastructure the agent is interacting with (sometimes websites go down, APIs are inaccessible). You can't fix that with a simple unit test.

The fix is designing and building a flywheel that observes behavior at production scale, converts traces into versioned curated datasets for training and eval, diagnoses root failure causes, picks the cheapest effective intervention, validates candidates against evals and guardrails, and ships them with rollback. The seven sections below walk through this essential loop.

1. Quantifying agentic behavior: "better" and "worse"

Before any of the rest matters, you have to define the agent’s contract.

For each agent, you should be able to answer: what does the agent own, what is it allowed to do, when must it escalate, what does success mean for the user, and what does success mean for the business? These don't all line up. Improving safety usually hurts helpfulness. Cutting latency often hurts task success. Adding guardrails raises escalation rate.

The metrics themselves are not exotic: task success, trajectory quality, safety, latency, cost, escalation rate, plus one or two business outcomes (resolution rate, time saved, conversion). The hard part is not picking them. The hard part is the weighting policy when they conflict, and that contract is the actual definition of "better" for your agent.

This is the executive point: the metric weighting is a product decision, not an engineering one, and it has to live somewhere durable enough that the eval set, the rollout gates, and the on-call escalation policy all reference the same source of truth. Naturally, this weighting can change over time as agents change, but it's the north star for the entire agent deployment.

Let's unpack this principle in the example below.

B2B SaaS example

Take a B2B SaaS support agent that owns billing questions, account configuration, and basic troubleshooting; it must escalate refunds over $500, security incidents, and any churn risk signal. The dashboard tracks task success (68%), escalation rate (18%), customer satisfaction score (CSAT) on agent-only sessions (3.3/5), p50 latency (14s), cost per session ($5.18 mean, $10.12 standard deviation), and 30-day churn after agent-only resolution (2.4%). A new retrieval policy, backed by a newly deployed RAG index, only fetches docs when retrieval confidence exceeds 0.85; the change moves CSAT from 3.3 to 4.1 and p50 latency from 14s to 20s, but task success drops from 68% to 57% and escalation rate rises from 18% to 25% because more questions hit "I don't have enough information for a confident answer." Does it ship? On one hand customers are more satisfied; on the other, the change is costing the company more.

Without a weighting policy, this is a meeting. Product wants the CSAT win, engineering wants to defend task success, support ops worries about the escalation spike. Everyone is right; nobody can decide. With a weighting policy like "CSAT and trust outrank deflection, but task success has a hard floor at 50%, and any change that breaches the floor does not ship regardless of other gains," it becomes a decision rule. This candidate clears the floor, so it ships as is. Had the hard floor been at 60%, however, this candidate would not have been deployed and new candidates would need to be proposed.
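
To make the decision rule concrete, here is a minimal sketch of encoding the weighting policy in code; the metric names, floor, and comparison logic are illustrative assumptions, not a prescribed framework:

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    task_success: float     # fraction of sessions resolved correctly
    csat: float             # customer satisfaction, 1-5
    escalation_rate: float  # fraction of sessions escalated to humans

# Hypothetical policy: CSAT and trust outrank deflection,
# but task success has a hard floor no other gain can buy back.
TASK_SUCCESS_FLOOR = 0.50

def ship(baseline: Metrics, candidate: Metrics) -> bool:
    if candidate.task_success < TASK_SUCCESS_FLOOR:
        return False  # breaching the floor blocks the release outright
    if candidate.csat < baseline.csat:
        return False  # no CSAT regression allowed
    return True

baseline = Metrics(task_success=0.68, csat=3.3, escalation_rate=0.18)
candidate = Metrics(task_success=0.57, csat=4.1, escalation_rate=0.25)
print(ship(baseline, candidate))  # True: clears the 50% floor and CSAT improves
```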

2. Observability: structured traces, not dashboards

Observability for agents is the instrumentation layer that turns every run into a structured trace. Dashboards are the UI on top but they're not the underlying goal.

A useful trace captures: session and user identifiers, agent deployment version, prompt version, model version, every model call with its tokens and cost, every tool call with arguments and results, every retrieval with the IDs of the chunks returned, memory reads and writes, guardrail events, handoff events, latency at each hop, thinking traces, the final output, downstream user feedback, and the eventual business outcome like customer satisfaction if it’s captured.

Logs tell you what happened. Traces let you reconstruct why.
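
A minimal sketch of what one such structured trace record could look like, assuming a plain dict-based schema (the field names are illustrative, not any particular vendor's format):

```python
trace = {
    "session_id": "sess_8f2a",
    "user_id": "user_4411",
    "plan_tier": "enterprise",
    "agent_version": "v2.3",
    "prompt_version": "billingv18",
    "model_version": "claude-sonnet-4-5",
    "model_calls": [
        {"step": 1, "input_tokens": 1840, "output_tokens": 212, "cost_usd": 0.011, "latency_ms": 900},
    ],
    "tool_calls": [
        {"name": "lookup_account", "args": {"account_id": "acct_991"}, "status": "ok", "latency_ms": 340},
    ],
    "retrievals": [
        {"query": "october invoice", "chunk_ids": ["billing_FAQ_v12", "invoice_email_template"]},
    ],
    "memory_events": [{"op": "read", "key": "preferred_contact"}],
    "guardrail_events": [{"name": "refund_threshold", "result": "passed"}],
    "handoff_events": [],
    "latency_ms_total": 14200,
    "cost_usd_total": 0.21,
    "final_output": "Your October invoice was emailed on Oct 1; I can resend it.",
    "user_feedback": "thumbs_up",
    "business_outcome": {"recontacted_within_7d": False},
}
```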

The vendors here have largely converged on OTel/OpenInference as the wire format, which means the choice between Galileo, Arize Phoenix, Langfuse, Braintrust, and LangSmith is mostly about workflow opinion (eval first vs. observability first vs. open source first) rather than data lock-in. Pick the one whose downstream story matches yours.

B2B SaaS example (cont'd)

Back to the support agent. A single "where's my October invoice?" session emits a trace with session_id, plan_tier=enterprise, agent_version=v2.3, prompt_version=billingv18, model=claude-sonnet-4-5, the intent classification (billing_lookup, confidence 0.94), tool calls (lookup_account, then fetch_invoice with the returned invoice ID), the retrieval call with three knowledge base chunk IDs (billing_FAQ_v12, invoice_email_template, refund_policy), guardrail events (refund-threshold check passed, no escalation triggered), latency per hop, total cost ($0.21), the final answer, the user's thumbs up, and a flag set seven days later confirming the user did not re-contact about the same issue.

Two weeks later, CSAT drops 0.3 points across the enterprise tier. Without trace structure, this is "the model got worse," and the team's options are to roll back the last model upgrade, retrain, or wait it out. With trace structure, the team filters to enterprise tier sessions in the affected window, groups by retrieval chunk IDs, and finds that billing_FAQ_v12 was reindexed last Wednesday and now misranks against a newer pricing doc on most queries. The fix is a thirty minute change to the retrieval ranker. Same data, same drop, two completely different responses, and only one of them actually solves the problem.
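
A rough sketch of that diagnosis step, assuming traces shaped like the record above are available in memory (the grouping logic is illustrative; at production scale this would run in your warehouse or observability backend):

```python
from collections import Counter

def chunk_usage_in_unhappy_sessions(traces, tier="enterprise"):
    """Count retrieval chunks appearing in low-CSAT sessions for one plan tier."""
    counts = Counter()
    for t in traces:
        if t.get("plan_tier") != tier:
            continue
        if t.get("csat") is not None and t["csat"] > 3:
            continue  # keep only the unhappy sessions
        for r in t.get("retrievals", []):
            counts.update(r["chunk_ids"])
    return counts

# counts.most_common(5) pointing overwhelmingly at billing_FAQ_v12 is what turns
# "the model got worse" into "last Wednesday's reindex broke ranking."
```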

3. Telemetry is not learning until it becomes a dataset

Raw traces are too noisy to learn from. They contain successes that aren't interesting, failures that are duplicates, PII you can't put in a training set, traffic from a single power user that would skew everything, and edge cases that should become eval cases instead of training data.

The data layer that sits between observability and improvement does the unglamorous work: sampling (head, tail, drift-triggered), deduplication, clustering, PII redaction, label assignment, human review routing, augmentation for rare cases, train/eval/test splitting, decontamination so eval cases don't leak into training, dataset versioning, and lineage tracking so you can answer "where did this training example come from?"

The unit of improvement is not the trace. It's the curated, versioned example derived from the trace. The artifact lineage looks like:

production trace -> failure cluster -> labeled example -> eval case (or training example) -> candidate agent -> release decision

If you can't draw this lineage for any improvement you've shipped, the flywheel is leaking. You can't reproduce wins, you can't audit regressions, and you can't decontaminate.

B2B SaaS example (cont'd)

The B2B SaaS agent emits roughly 200k traces a day. Nobody hand-labels 200k of anything, so the data layer samples: 1% random for baseline distribution, 100% of thumbs-down sessions, 100% of cases where the human-escalation policy fired, and 100% of any session containing a refund greater than $500. Semantic duplicates are clustered by predicted intent and plan tier so the team isn't reviewing the same "where's my invoice?" question 4,000 times. PII (account numbers, email addresses, support ticket attachments) is redacted before any human reviewer sees a trace. The eval set holds 500 golden cases (hand picked, labeled, etc.) plus 50 adversarial cases where a refund over $500 is disguised to keep the escalation guardrail from firing; these never enter training. The eval dataset combined with the metrics becomes the benchmark for developing candidate models.
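
A minimal sketch of that sampling policy, assuming the trace dict from section 2 plus a few extra fields (thumbs-down feedback, an escalation flag, a refund amount); the rates mirror the example above:

```python
import random

def should_sample(trace: dict, baseline_rate: float = 0.01) -> bool:
    """Decide whether a raw trace enters the curation queue."""
    if trace.get("user_feedback") == "thumbs_down":
        return True                          # 100% of thumbs-down sessions
    if trace.get("escalated_to_human"):
        return True                          # 100% of human-escalation cases
    if trace.get("refund_amount_usd", 0) > 500:
        return True                          # 100% of large-refund sessions
    return random.random() < baseline_rate   # 1% random baseline

# Downstream of this gate: dedup by (intent, plan_tier), PII redaction, labeling,
# and the train/eval split with decontamination before anything is versioned.
```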

4. Diagnosis: intervention recommendations, not scores

Informed diagnosis dictates how to fix issues. The cheap version of this layer is "an LLM judge labels each trace good or bad." That's not enough. A score is not actionable; an intervention recommendation is.

Useful diagnosis explains a general failure pattern and then decomposes where that pattern occurs across the layers of the agent: infra, human agent turns, LLM reasoning, or a combination of tool calls. The production taxonomy matters and can evolve; without one, many instances of the same failure get labeled or categorized as different things. For instance, suppose a user wants to purchase a dress with a max price of $200 and the LLM hallucinates that the dress costs less than $200 when it does not. Suppose the same hallucination happens for another user buying shoes. The two failures might be labeled in different ways even though they are semantically the same: "agent hallucinated dress price" and "agent didn't know shoe price." Notice that from the failure descriptions alone it is not obvious these are really the same thing. This gets even harder when escalations are sent to human reviewers who are asked to diagnose why something failed without a shared rubric of definitions; without a starting set of definitions, everyone can describe a failure a different way!

Here is a broad starting set of agentic failure genres that you can specialize into a hierarchy if need be (a sketch of turning it into a shared rubric follows the list):

  • Infra failures: did the agent fail because the environment it was working in went down?
  • Context failures: did the agent not have enough information?
  • Tool selection failures: did the agent fail to understand which tool was correct?
  • Retrieval failures: was the retrieved context missing, stale, irrelevant, or unauthorized?
  • Memory failures: did the agent forget what it was talking about or what it is supposed to do?
  • Planning failures: did the agent fail to form a plan that solves the original goal?
  • Policy or escalation failures: did the agent take an action it shouldn't have, or fail to escalate when it should have?
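
One way to pin down that shared rubric, as a minimal sketch (the enum mirrors the list above; `judge` is a hypothetical stand-in for whatever LLM judge or human review step does the labeling):

```python
from enum import Enum

class FailureGenre(str, Enum):
    INFRA = "infra"                          # environment or API the agent relies on went down
    CONTEXT = "context"                      # agent lacked the information it needed
    TOOL_SELECTION = "tool_selection"        # wrong or ambiguous tool choice
    RETRIEVAL = "retrieval"                  # missing, stale, irrelevant, unauthorized context
    MEMORY = "memory"                        # forgot prior turns or its own task
    PLANNING = "planning"                    # plan does not solve the original goal
    POLICY_ESCALATION = "policy_escalation"  # forbidden action or missed escalation

def label_failure(trace: dict, judge) -> FailureGenre:
    """Force the judge to pick exactly one genre from the closed vocabulary.

    A closed vocabulary is what keeps "hallucinated dress price" and
    "didn't know shoe price" in the same cluster instead of two ad hoc labels.
    """
    return FailureGenre(judge(trace, options=[g.value for g in FailureGenre]))
```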

B2B SaaS example (cont'd)

A judge flags 13% of customer support sessions in the last 24 hours as "user repeatedly asked for the same thing." Clustering shows they share a pattern: the agent calls lookup_account before verify_identity finishes, gets a permission error, retries, eventually times out and apologizes. That is not "the model is bad." It's a tool execution plus orchestration failure. The intervention is not fine tuning; it's adding a precondition to the lookup_account schema and a wait-for-verification step in the planner. Cost: a few hours. Compare that to "the agent is bad at customer support, we need to retrain it." Same data, different intervention, 100x the cost. The whole point of this layer is to distinguish the two, and in this case fine tuning is overkill.
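
A sketch of what that schema-level fix might look like, assuming a JSON-schema-style tool definition plus an orchestrator-enforced precondition field (the precondition mechanism is an illustration, not a specific framework's feature):

```python
lookup_account_tool = {
    "name": "lookup_account",
    "description": "Fetch account details for a verified user.",
    "parameters": {
        "type": "object",
        "properties": {
            "account_id": {"type": "string", "description": "Account to look up."},
        },
        "required": ["account_id"],
    },
    # Hypothetical orchestrator hook: the planner may not schedule this tool
    # until verify_identity has completed successfully in the same session.
    "preconditions": ["verify_identity.completed"],
}

def can_dispatch(tool: dict, session_state: set) -> bool:
    """Orchestrator-side check that all preconditions hold before the call."""
    return all(p in session_state for p in tool.get("preconditions", []))
```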

Failure clustering (LangSmith Insights, Latitude lifecycle) is becoming commoditized as most observability vendors will ship something similar within a year. Root cause analysis, where models actually diagnose how the failure occurred in the trace, is still emerging, especially in multimodal contexts. This is where work like InterpretAI's Multimodal RCA lives. Human escalation and labeling still produces the highest fidelity signal you can get; it just doesn't scale to 1M traces/day on its own.

The output of this layer should not be "this trace was bad." It should be "this cluster of failures is a retrieval freshness issue," or "this is a tool schema ambiguity," or "this is a missing approval gate." That output is what feeds the next layer and provides the clarity needed to wrangle millions of traces into a cohesive diagnostic and intervention plan.

5. The improvement ladder: cheapest effective intervention first

Once a failure is diagnosed, you have a menu of interventions. They are not equally expensive and they are not equally capable. The discipline is to climb only as far as the failure mode requires. The rungs below are listed in ascending order of commitment and cost, each with a "use this when" trigger; the diagnosis from section 4 should pick the rung, not vibes. (As discussed in our previous coverage of Continuous Learning.)

  1. Prompt optimization: DSPy, GEPA, MIPRO
  2. Tool issues: Adding the right tooling depends on the failures.
  3. Context issues: adding some form of RAG. Letta, Mem0, Zep
  4. Finetuning / post-training on the LLM: DPO, KTO, ORPO, with managed APIs like Interpret's Finetuning API, or self-managed on Together, Modal, or Anyscale. DMPO is the multi-turn variant.
  5. Finetuning / post-training on the agent: ART + RULER, Agent RM, Verifiers, NeMo Gym (essentially running GRPO or PPO). Reward modeling is super important as this is where most RL finetuning fails.

1. Prompt and instruction patches. Use this when failures cluster around specific instructions, formatting, edge cases, or phrasings the agent didn't expect. Hours to a few days. Tools like DSPy, GEPA, and MIPRO move this from hand tuning to compiled optimization.
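
As one hedged illustration of what "compiled" prompt optimization looks like, a minimal DSPy-style sketch (assuming DSPy 2.5+ and its MIPROv2 optimizer; the signature, metric, and examples are made up for the support-agent case):

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # any supported model

class BillingAnswer(dspy.Signature):
    """Answer a billing question using the provided knowledge-base context."""
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.ChainOfThought(BillingAnswer)

# In practice: hundreds of curated examples from the data layer in section 3.
trainset = [
    dspy.Example(
        question="Where's my October invoice?",
        context="Invoices are emailed to the billing contact on the 1st of each month.",
        answer="It was emailed on Oct 1; I can resend it to your billing contact.",
    ).with_inputs("question", "context"),
]

def metric(example, pred, trace=None):
    # Stand-in for the real task-success rubric from section 1.
    return example.answer.lower()[:20] in pred.answer.lower()

optimizer = dspy.MIPROv2(metric=metric, auto="light")
optimized_program = optimizer.compile(program, trainset=trainset)
```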

2. Tool fixes & missing tools. Use this when the diagnosis points at action space failures: wrong tool picked, schema ambiguous, two tools overlap, descriptions misleading. Hours to days. Many "model failures" are actually action space failures, and the right move is to change the tool layer, not the model. There are also cases where a tool that would significantly improve agentic performance simply doesn't exist yet: for instance Cursor's "apply" tool. Early Cursor versions used the model to generate full file rewrites, which was slow and error prone; after adding a dedicated apply tool, applying edits became fast and specialized. In fact, companies like MorphLLM built a specialized tool for exactly this use case. Knowing when to add a tool is an art, but it can significantly improve workflows.

3. Context and memory fixes. Use this when failures correlate with stale, missing, or wrongly-scoped retrieved content. Days. Better chunking, better query rewrites, freshness policies, permission filtering, citation. Memory belongs adjacent to this but evaluates differently. RAG asks "what should the agent retrieve," memory asks "what should the agent remember about this user or workflow." Letta, Mem0, and Zep operate in the memory lane.

4. LLM level fine tuning. Use this when prompts have plateaued, the model lacks domain behavior or format compliance, and you have enough labeled correctness or preference data. Weeks. DPO, KTO, ORPO, with DMPO as the multi turn variant. Managed APIs from Interpret AI and the major labs, or self managed on Together, Modal, Anyscale.

5. Agent level RL. Use this when you have a verifiable reward signal and the lower rungs are exhausted. Weeks to months. ART + RULER, AgentRM, Verifiers, NeMo Gym, running GRPO or PPO under the hood. The 2026 inflection: RL on agents is operationally accessible for the first time.

Two things bear repeating. First, most production wins live on the lower rungs. Teams who jump to fine tuning before exhausting prompts, tools, and orchestration are usually paying 100x more for the same outcome. Second, agent RL is only as good as the reward, which can be sparse, hackable, or misaligned; poor reward models produce agents that game the evaluator instead of helping the user. Reward modeling is where most agent RL projects fail, and it's where the most interesting infrastructure work is happening right now.
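
To make the reward-modeling point concrete, a minimal sketch of a verifiable trajectory reward for the support agent (the checks and weights are hypothetical; getting these right is exactly where agent RL succeeds or fails):

```python
def trajectory_reward(trace: dict) -> float:
    """Score a full trajectory with verifiable checks rather than a vibe score."""
    reward = 0.0
    # Verifiable outcome: the user did not re-contact about the same issue.
    if trace.get("business_outcome", {}).get("recontacted_within_7d") is False:
        reward += 1.0
    # Hard policy check: an unescalated large refund is penalized heavily,
    # otherwise "never escalate" becomes an easy way to game the evaluator.
    if trace.get("refund_amount_usd", 0) > 500 and not trace.get("escalated_to_human"):
        reward -= 5.0
    # Mild efficiency shaping: discourage tool-call loops without drowning the outcome term.
    reward -= 0.01 * len(trace.get("tool_calls", []))
    return reward
```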

6. Evolving evaluation: release gates and learning engines

Evals do two jobs: they prevent regressions before release and they evaluate the next round of candidates. The flywheel works because online failures graduate to offline evals, offline evals become release gates, and release gates determine what ships.

Offline evals are the gate. They’re versioned, slice-aware, drift managed artifacts that include frozen regression suites, golden traces, adversarial cases, replayed production traces, and synthetic edge cases. They evolve as the agent and the world evolve, with explicit decontamination so the agent isn't being graded on what it was trained on.

Online evals are the canary in the coal mine. Sampled judge scores, user feedback, business KPI deltas, escalation outcomes, guardrail triggers, cost and latency anomalies, drift detection on inputs and outputs. At 1M traces/day, online judge cost is real money; even 1% sampling with a frontier judge runs into hundreds of dollars per rubric per day. Small calibrated judges (Galileo's Luna 2, Arize's online eval at scale) exist as a product category specifically because this cost matters.

B2B SaaS Example

Online drift triggers offline action. Two weeks after a model upgrade, the online judge flags a 4 point drop in escalation appropriateness on enterprise tier billing queries. The team pulls the 60 flagged sessions, finds a consistent failure (the agent now under escalates ambiguous refund language), labels them, and adds them to the offline eval set as a new eval slice: enterprise billing and ambiguous refund language. The next candidate agent has to clear that slice to ship. The online failure became an offline gate, which is the only mechanism that prevents the same regression from reappearing six months later.
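
A sketch of the resulting release gate, assuming per-slice eval scores are computed elsewhere (slice names and floors are hypothetical, following the example):

```python
# Minimum score a candidate must clear on each gated eval slice.
EVAL_GATES = {
    "golden_cases": 0.90,
    "adversarial_refunds_over_500": 0.95,
    "enterprise_billing_ambiguous_refunds": 0.85,  # slice added after the online drift
}

def passes_gates(slice_scores: dict) -> bool:
    """A candidate ships only if every gated slice meets its floor."""
    return all(slice_scores.get(name, 0.0) >= floor for name, floor in EVAL_GATES.items())

print(passes_gates({
    "golden_cases": 0.93,
    "adversarial_refunds_over_500": 0.97,
    "enterprise_billing_ambiguous_refunds": 0.81,
}))  # False: the new slice blocks the release
```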

7. Redeploy as a staircase, with full-stack rollback

Shipping a new candidate agent should look like a staircase, not a switch. Given a candidate agent, you should:

  1. Replay on historical traces (open loop) for evals.
  2. Shadow deployment: run next to the existing baseline model for A/B comparisons.
  3. Canary release: run the candidate on a small percentage of users.
  4. Progressive rollout: increase the percentage of traffic.
  5. Full promotion.

Each stage answers a different question. Replay asks "does the candidate handle our known cases?". Shadow asks "does it produce reasonable outputs on real traffic?". Canary asks "do the metrics hold on a real population?". Progressive rollout asks "do they hold at scale?".

The crucial point: in agent systems, rollback is not just weights. You may need to roll back a prompt, a tool schema, a retrieval index, a memory policy, a guardrail, or a router config. Every one of those is a deployable artifact, every one can break in production, and every one needs to be versioned and revertible independently.
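
One way to make that concrete is to treat the full agent configuration as a single versioned artifact whose fields can be reverted independently; a minimal sketch (fields and version strings are illustrative):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class AgentRelease:
    """Everything that can break in production, pinned and independently revertible."""
    model_version: str
    prompt_version: str
    tool_schema_version: str
    retrieval_index_version: str
    memory_policy_version: str
    guardrail_config_version: str
    router_config_version: str

current = AgentRelease(
    model_version="claude-sonnet-4-5",
    prompt_version="billingv18",
    tool_schema_version="tools_v7",
    retrieval_index_version="kb_index_2026_04_28",
    memory_policy_version="mem_v3",
    guardrail_config_version="guardrails_v12",
    router_config_version="router_v5",
)

# Rolling back the reindex means changing one field, not reverting the model:
rolled_back = replace(current, retrieval_index_version="kb_index_2026_04_21")
```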

B2B SaaS examples (cont’d)

  • You ship a tighter search_internal_docs tool schema that requires a team_id parameter; a downstream agent that wasn't updated breaks silently: the fix is a schema rollback, not a model rollback.
  • You reindex the knowledge base with new chunking and recall drops on a slice of queries that worked before. The fix is repointing retrieval to the previous index, not retraining anything.
  • You tighten a guardrail and watch escalation rate spike past your tolerable rate: the fix is reverting the guardrail config in isolation, leaving the rest of the agent in place.

None of these is solved by reverting the model. Each requires that the artifact has versions and that your release stack can target it independently. If your release tooling only knows how to roll back model versions, half the surface is unprotected.

LaunchDarkly and Statsig cover the experimentation and feature flag side; both are now shipping AI-specific guardrails and predictive rollout features, but the coverage is not yet complete. Whatever you pick, make sure the rollback target is the full agent configuration, not just the LLM.

Conclusion

In production, the agent itself is one artifact among many. The durable system is everything around it: the traces, the curated datasets, the failure taxonomies, the eval sets, the optimization pipeline, the deployment ladder, and the policies that govern all of it.

The companies that win the agent decade will not be the ones with the cleverest prompt or the largest fine tune; they will be the ones whose flywheel turns the fastest: observation to evidence to diagnosis to intervention to release, every week, on every agent in production.

Better learning infrastructure beats a better agent.

7-step continuous improvement flywheel overview

If you’re tired of debugging silent failures and ready to implement a true continuous improvement flywheel, we can help. Interpret AI provides the infrastructure to turn your raw traces into actionable datasets, cutting your diagnosis time in half and getting your interventions shipped safely.