RAG, RL, and the Judge You Need Before Either

Blog post by Ilian Herzi · Apr 24, 2026 · 7 min read
Tags: RAG, RL, Retrieval-Augmented Generation, Reinforcement Learning, Agents

TL;DR

After you have observability, evals, and agentic traces showing failures, what should you do to fix those issues? A framework that's served me well:

  1. Judges can be used to find failures and trigger analysis that quantifies the scale of the problem. Judges identify failures but don't solve them; false negatives and false positives are inevitable, and they can harm reinforcement learning reward functions (i.e., RLAIF).
  2. Forgetfulness (did the agent have what it needed to make a decision?): RAG and other tools can improve context. This is especially useful for dynamic data like financials or news. It will not improve thinking.
  3. Decision making (did the agent fail to reason properly?): Reinforcement Learning (RL) can improve thinking. It requires corrections and reward-function shaping. Be careful with SFT; in practice, I've generally seen it destroy generality, so I would use it only for the simplest reasoning engines.

When a model fails in production, the instinct is often to reach for the biggest tool in the toolbox, usually RL. Sometimes that's the wrong tool and overkill. The first question should be "what kind of failure is this?", and only then "what tools do I need to solve it?".

Here's a framework that's served me well.

Prerequisites:

Observability and evals: Make sure you have some way of analyzing agent traces at scale, and create evaluations that let you compare different candidates for your agentic workflow.

Prompting: Make sure you're not falling into the context-window paradox: egregiously long prompts harm reasoning, because tokens that should be dedicated to reasoning are spent processing the prompt instead.
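One cheap guardrail is a prompt-budget check before a candidate prompt ships. A minimal sketch, assuming a rough 4-characters-per-token heuristic and an illustrative 8k context limit; both numbers are stand-ins, not tuned values:

```python
def approx_tokens(text: str) -> int:
    """Crude token estimate (~4 characters per token for English text)."""
    return len(text) // 4

def check_prompt_budget(prompt: str, context_limit: int = 8192,
                        reasoning_reserve: float = 0.5) -> dict:
    """Warn when the static prompt leaves too little room for reasoning.

    `reasoning_reserve` is the fraction of the window we insist on
    keeping free for the model's own thinking and output.
    """
    used = approx_tokens(prompt)
    budget = int(context_limit * (1 - reasoning_reserve))
    return {
        "prompt_tokens": used,
        "prompt_budget": budget,
        "over_budget": used > budget,
    }
```

In a real pipeline you would swap `approx_tokens` for your model's actual tokenizer; the point is to make "is this prompt crowding out reasoning?" a mechanical check rather than a vibe.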

Two failure modes

Failure mode 1: Forgetfulness. The model doesn't have the information it needs. It's hallucinating a policy, citing a deprecated API, or missing context about a specific customer. When you inspect the trace, it's clear the model simply didn't know: the relevant fact wasn't in the prompt, wasn't retrieved, wasn't anywhere in the context window.

This is a context problem. Training won't fix it durably, because facts change. The answer is better retrieval: RAG, better tool calling, richer context assembly, or whatever gets the right bytes in front of the model at decision time. However, adding more context can itself impact decision making, so this is a trade-off.

Note: Volatile facts usually belong in RAG. Anything that changes frequently (prices, policies, inventory, personnel, org structure) should typically not be baked into weights. That's how you get models confidently stating yesterday's truth.
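The pattern is simple: fetch the volatile fact at decision time and inject it into the prompt, rather than hoping the weights remember it. A minimal sketch; `CATALOG` and `fetch_price` are hypothetical stand-ins for a real retrieval layer (database, API, vector store):

```python
# Hypothetical stand-in for live data; in reality this changes daily.
CATALOG = {"widget-a": 19.99, "widget-b": 4.50}

def fetch_price(sku: str) -> float:
    """Stand-in for a live lookup against a database or API."""
    return CATALOG[sku]

def build_context(user_query: str, sku: str) -> str:
    """Assemble the prompt at decision time with fresh facts injected."""
    return (
        f"Current price of {sku}: ${fetch_price(sku):.2f}\n"
        f"User question: {user_query}"
    )
```

If the price changes, the next call picks it up automatically; no retraining, no stale weights.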

Failure mode 2: Poor reasoning. The model has everything it needs. The relevant docs are retrieved, the user's request is clear, the history is available, and yet the model still does the wrong thing: it purchases the wrong item, hallucinates a price, drops a database, attempts to circumvent guardrails, or picks the wrong answer.

This is a behavior problem. More retrieval won't help, because the model already has the context. What it hasn't learned is how to act on that context. This is where RL training on a reward signal tailored to correcting the failure is useful. Note that reward shaping is difficult, and a bad reward signal can make the model worse; that's why this is the more complex tool.

The diagnostic question

When you're looking at a failure trace, ask:

Given exactly what was in the context window at decision time, could a competent version of this model have gotten this right?

  • If no, the agent didn't have what it needed, so fix retrieval.
  • If yes, it had what it needed and reasoned incorrectly, so fix behavior.
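The diagnostic above can be turned into a routing function. A minimal sketch, where `judge` is a hypothetical callable wrapping an LLM-as-judge (which, as discussed below, you would need to validate against human labels before trusting):

```python
from typing import Callable

DIAGNOSTIC = (
    "Given exactly what was in the context window at decision time, "
    "could a competent model have gotten this right? Answer yes or no."
)

def triage(trace: str, judge: Callable[[str], str]) -> str:
    """Route a failure trace to the right fix based on the diagnostic."""
    verdict = judge(f"{DIAGNOSTIC}\n\nTrace:\n{trace}").strip().lower()
    # "no"  -> the context was missing something: fix retrieval.
    # "yes" -> the context was sufficient: fix behavior (RL, reward shaping).
    return "fix_retrieval" if verdict.startswith("no") else "fix_behavior"
```

Running this over a sample of failure traces gives you a first-cut split of retrieval problems versus behavior problems, which is exactly the scale problem the next section addresses.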

That single question resolves most "should we train or should we retrieve?" debates before they start. The hard part is answering it at scale, which is where judges can sometimes help.

Judges: the layer that makes this actionable

You cannot run the diagnostic question on ten thousand traces by hand. This is where LLM-as-judge becomes the connective tissue between observability, RAG, and RL. There are catches, though: judges are still LLMs, and they can suffer the same forgetfulness and reasoning issues.

A well-built judge does two jobs:

  1. Detects failures in production. Sample your traffic and let a judge score outputs for faithfulness, correctness, and task completion. Escalate flagged cases to human review. This is how you find the problems worth fixing in the first place, and how you correct judges when a human finds a false positive or false negative.
  2. Provides the reward signal for RL. Modern RL pipelines (RLAIF, Constitutional AI, self-rewarding setups) use LLM judges as the reward model. This alleviates, but doesn't solve, the problem of finding a learning signal at scale on your specific workflows, which is normally addressed by manual annotation.
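Wiring a judge into an RL loop usually means mapping its verdict onto a bounded scalar reward. A minimal sketch, assuming a 1-5 rubric scale (an illustrative choice); note the defensive parsing, since a judge that crashes or emits garbage mid-training is itself a reward-shaping hazard:

```python
from typing import Callable

def judge_reward(prompt: str, response: str,
                 judge: Callable[[str], str]) -> float:
    """Map a judge's 1-5 rubric score onto a reward in [0, 1]."""
    raw = judge(
        "Rate the response 1-5 for correctness and task completion.\n"
        f"Prompt: {prompt}\nResponse: {response}\nScore:"
    )
    try:
        score = float(raw.strip().split()[0])
    except (ValueError, IndexError):
        # Unparseable verdicts earn zero reward rather than crashing training.
        return 0.0
    # Clamp, then rescale 1..5 -> 0..1.
    return max(0.0, min(1.0, (score - 1.0) / 4.0))
```

This is deliberately conservative: a false-positive verdict becomes a wrong reward, so anything ambiguous defaults to zero and gets logged for human review.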

Three catches:

  1. An unaligned judge is a compass pointing the wrong way. Before you trust a judge's verdicts, validate them against a set of human-labeled examples. Measure agreement, and iterate on the rubric until the judge reliably agrees with your experts on the cases that matter. Common judge failure modes include position bias, where the judge changes its preference based on answer order; length bias, where it prefers longer outputs; and self-preference, where it favors outputs from the same model family. If these biases are not corrected, judges can distort RL training signals.
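Position bias in particular is cheap to probe: present the same pair of answers in both orders and count how often the verdict depends on order. A minimal sketch, where `judge_prefer` is a hypothetical callable returning "A" or "B" for whichever answer it prefers:

```python
from typing import Callable, List, Tuple

def position_flip_rate(pairs: List[Tuple[str, str]],
                       judge_prefer: Callable[[str, str], str]) -> float:
    """Fraction of pairs where the judge's verdict depends on answer order."""
    order_dependent = 0
    for a, b in pairs:
        first = judge_prefer(a, b)   # a shown in position A
        second = judge_prefer(b, a)  # same pair, order swapped
        # A consistent judge prefers the same *answer*, so its position
        # label flips when the order flips. Identical labels across both
        # orders means the judge is picking a position, not an answer.
        if first == second:
            order_dependent += 1
    return order_dependent / len(pairs)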
  2. At-scale diagnostics are hard. At Interpret AI, we noticed that judges face the same problems all agents face: forgetfulness and reasoning issues. To ameliorate this, use a separate multimodal model, trained on a different data distribution, with structured breakdowns. That is the role of our Interpret Foundation Model: a model trained specifically to detect failures.
  3. Today, judge improvement scales with human correction speed: judges improve by finding errors and asking humans to review them, which means judge improvement is constrained by the speed of manual labeling.

The practical order for addressing agentic failures:

  1. Check observability, evals, and prompts.
  2. Build a judge for the task, create a rubric, and align the judge to human labels.
  3. Use it to classify failures; you now know whether you have a RAG problem or a behavior problem.
  4. If it is a RAG problem: improve retrieval, then re-evaluate the candidate model with the new context.
  5. If it is a behavior problem: use the judge, the Interpret Foundation Model, or another foundation model as the training signal for RL, and re-evaluate your candidate.

Lingering thoughts:

Sometimes the problem is both retrieval and behavior. Retrieval brings in the right docs, but the model doesn't know how to prioritize them, or it weighs a stale cached document over a fresh one. Fix both layers, in that order: context first, then behavior. Otherwise you're teaching the model to make the best of bad inputs.

Training can cause regressions elsewhere. SFT and RL are not free. Optimizing for one behavior can quietly degrade others, so your eval suite (and your judge) needs to cover more than the specific failure you're targeting. There’s no free lunch.

Conclusion

Being able to identify where in your agentic pipeline failures are happening, and then quickly resolving them, can be the difference between delivering value with your agents and failing miserably. Once the team is set up to introspect failures, rapidly identifying and fixing the issues is the next piece. At Interpret AI, our bread and butter is identifying agentic failures at scale with our multimodal foundation model. If you have questions, feel free to reach out at ilian@interpretai.tech.