The cost of skipping evals on an agent platform compounds. The first month, the system feels precise. By month three, three changes to prompts and one model upgrade have introduced quiet drift, and the only signal comes from a customer email. By month six, half of the traces in production are doing things the original design did not intend, and nobody can tell which prompt change caused which regression.
The teams shipping agents that survive that arc, in 2026, share a discipline. They invest in evals before they invest in scale. This handbook is the practical version of that discipline: how to build the eval surface, how to keep it useful, and which failure modes only show up in the wild.
Why eval-first beats prompt-first
Through 2024 and most of 2025, the dominant pattern was prompt-first development. Engineers iterated on prompts, models, and tool definitions inside a notebook or playground, shipped changes by feel, and treated production as the validation surface. That worked at low scale and low stakes. It does not work for agent platforms in production B2B.
The reason is structural. An agent platform makes decisions across multiple LLM calls, tool invocations, and conditional branches. A prompt change in one node propagates probabilistically to the rest of the graph. Without a stable test suite, the engineer changing the prompt has no view into how that change shifts the distribution of outcomes downstream. They are debugging blind.
Eval-first reverses the order. Before a prompt or a model change ships, the change runs against a stable golden dataset and a deterministic harness. The output of the eval is a delta on a fixed metric set: pass rates, citation accuracy, tool-call correctness, latency, cost. The engineer sees the impact before the user sees it. This is not a culture change. It is a workflow change. Once the harness exists, eval-first becomes the path of least resistance.
Building the golden dataset
The golden dataset is the foundation. Every other piece of eval infrastructure depends on its quality. Three rules for a useful one.
Rule 1: real traces, not synthetic prompts. The dataset should be drawn from real production traffic, manually curated, with the expected outcome labeled by a domain expert. Synthetic prompts are fine for smoke tests but they consistently miss the failure modes that actually matter. A common antipattern is generating "edge cases" with an LLM and treating them as ground truth. The cases are usually internally consistent with whatever model generated them and miss the specific quirks of the production data distribution.
Rule 2: cover the failure modes, not just the happy path. The dataset should over-index on the patterns where the agent has historically failed. If the production logs show that 4 percent of queries about contract clauses get the wrong section cited, those cases should be 15 to 25 percent of the eval set. The point of evals is regression catching, and regressions live where the model is weakest.
Rule 3: keep the dataset stable, version it, and grade it manually at least once a quarter. A golden dataset that drifts is worse than no dataset, because it gives false confidence. Pin it to a version, store it in the same repository as the agent code, and re-grade it manually on a cadence. New examples enter through a separate "candidates" file and graduate to the dataset only after a manual review.
A reasonable starting size for an agent eval set is 100 to 200 examples. Below 100, the signal is noisy. Above 500, the maintenance burden becomes the bottleneck. Most teams I work with settle around 200 by month six.
LLM-as-judge calibration
LLM-as-judge is the standard pattern for grading outputs that are too open-ended for exact-match comparison: summaries, recommendations, multi-step reasoning. The pattern works, but it requires calibration. A judge that has not been calibrated against a human grader is, in practice, scoring a different problem than the one the team thinks it is scoring.
Calibration is straightforward. Take a sample of 30 to 50 outputs from the eval set. Have a domain expert grade them against the rubric. Have the LLM judge grade the same set against the same rubric. Compute agreement. If agreement is below roughly 80 percent, the rubric needs work. Common rubric problems: too many criteria fused into a single score, ambiguous language, examples in the rubric that pull the judge toward a default answer.
The other calibration question is which model serves as judge. The robust pattern in 2026 is to use a different model family for the judge than the one being graded. Self-grading with the same model family produces inflated scores in roughly the way you would expect. The simplest split that works in practice is to grade with Claude when the system under test runs on GPT, and vice versa.
Deterministic harnesses for tool calls and outputs
Not every part of an agent needs LLM-as-judge. Anything with a structured output should be graded deterministically. Three places where this is consistently underused.
Tool-call correctness. Did the agent invoke the expected tool? Did it pass the expected arguments? Did the tool return the expected shape? These are exact-match questions. A test harness with 50 graded tool-call examples catches a class of regressions that LLM-as-judge silently hides.
Citation correctness. If the agent is doing RAG, did the response actually cite the source documents the retrieval layer surfaced? This is checkable with string matching against the retrieval results. When citation drift starts, deterministic checks catch it long before users notice.
Output schema validation. If the agent returns structured JSON, did it conform to the schema? Did it pass a downstream parser? Schema drift is one of the quietest agent failures: the LLM starts returning slightly different field names or nested shapes, the downstream parser silently coerces, and the user-facing behavior degrades over a week before anyone investigates.
The general rule: anything that can be graded deterministically should be graded deterministically. Reserve LLM-as-judge for the genuinely subjective questions.
Production traces and drift detection
Eval suites catch regressions before deploy. Production traces catch drift after deploy. Both surfaces are necessary. The teams shipping confidently in 2026 ingest production traces into a tracing platform — Braintrust, LangSmith, Helicone, or PostHog — and watch a small number of headline metrics over time.
The metrics worth watching are narrow on purpose. Three categories cover most needs.
Distribution drift on tool calls. Which tools are being called, and at what rates? A sudden change in tool-call distribution is almost always the symptom of a model upgrade or a prompt change shifting the agent's planning. It rarely shows up in eval scores because the eval set is fixed. It shows up in production immediately.
Latency and cost percentiles. Median latency and p95 cost per session. A sudden p95 spike, especially if median is stable, is almost always a single problematic prompt or a tool that started returning a much larger response. The p95 view catches what the median hides.
Failure rate by class. Categorize traces by an LLM classifier into a small set of failure classes — "wrong tool," "incorrect citation," "incomplete output," "user dissatisfied." Watch the rates per class over time. The shape of the rates moving is more informative than any single rate.
The discipline is to look at these dashboards weekly, before there is a problem. A team that opens the tracing platform only when something is broken will keep getting surprised.
The seven failure modes you will hit
Every agent platform shipped to production hits a recognizable set of failure modes. The list below is from real engagements, not theory.
1. Prompt injection through retrieved documents. The agent retrieves a document, and the document contains text that hijacks the agent's behavior. Mitigation: never put untrusted retrieved text directly into the system prompt. Treat all retrieval output as untrusted user input, with explicit boundary markers.
2. Tool surface that grew too fast. Six months into the platform, the tool count is 30, the agent's tool-selection accuracy has degraded, and nobody can name what each tool does. Mitigation: prune. A tool surface of 8 to 12 well-named tools nearly always outperforms a surface of 30 partially overlapping tools.
3. Silent schema drift on outputs. The model upgraded, the JSON schema the downstream consumer expects subtly changed, the consumer is silently coercing. Mitigation: schema validation on every output, in CI and in production, with explicit alerts on drift.
4. Eval set that aged out. The product changed, the eval set did not, and the eval scores are now graded against an outdated rubric. Mitigation: schedule a quarterly review of the rubric, not just the dataset.
5. Cost spiral on a long-context model. A workflow worked beautifully on the small-context model and quietly broke when an upstream change pushed it onto the long-context tier. Mitigation: explicit cost guardrails per session and per tenant, with hard limits that fail fast rather than degrading silently.
6. Hallucinated citations. The agent generates a response with citations that look correct but reference content the retrieval layer did not actually return. Mitigation: enforce citation-from-retrieval programmatically. The post-processing layer rejects any citation that is not present in the retrieval results.
7. Authentication leakage through the agent context. The agent inherits the user's identity but, somewhere in the graph, retrieves data the user is not authorized to read. Mitigation: enforce identity at the retrieval layer, not at the agent layer. Never rely on the agent to honor access boundaries it could be jailbroken out of.
Tooling, in 2026
A short note on tooling, because the question comes up in every engagement. The combinations I see most often in production right now are these.
Agent orchestration: LangGraph for graph-based control flow, the Vercel AI SDK for streaming UX with Next.js, OpenAI's tool calling and Anthropic's tool use as the underlying execution surface. Most production systems mix these depending on the route.
MCP: native Anthropic clients (Claude Desktop, Cursor, Zed) for end-user-facing surfaces; custom MCP servers for internal tool exposure. The MCP ecosystem matured rapidly through 2025 and is now the default integration pattern for any team that wants their tools to be portable across AI clients.
Eval and tracing: Braintrust and LangSmith for managed eval and tracing; Helicone for cost-focused observability; for teams already on PostHog, the same platform handles both LLM tracing and product analytics with reasonable depth.
Inference: vLLM for self-hosted production traffic at scale; Ollama for development and lower-throughput deployments. OpenAI, Anthropic, Google, and the OpenRouter aggregator for managed access; Anthropic's Claude family disproportionately winning the agentic-tool-use benchmarks.
The specific stack will keep evolving. The discipline below it — eval-first development, deterministic checks where possible, calibrated LLM-as-judge where not, watched production traces — does not.
If you are scoping an agent platform now and want help designing the eval surface, the AI Engineering service page is the closest match to this work. If you would rather read more first, the Private AI Deployment Checklist covers the architecture-level checks that an agent platform inherits.