Gates That Catch Regressions

Observability tells you what happened. Evals tell you whether a change made things worse. Grade the final state — not the path — and know exactly when an LLM judge is trustworthy.

Why this, for you: all the tracing in the world is useless if a "fix" silently regresses something else. Evals are the gate that turns a trace into a verdict. This lesson is about grading outcomes the right way and, crucially, about when a judge is reliable enough to gate on — because a bad judge is worse than no gate.

Observability is descriptive — it shows what the agent did. An eval is a verdict: given a known input, is the system in the correct state? Wired into CI, evals become regression gates that block a change before it ships.
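As a rough illustration of an eval acting as a CI gate, the sketch below runs a set of checks and turns the result into a process exit code. The names `run_gate` and `cases` are hypothetical, and the single case is a stand-in for a real check against your system's final state.

```python
from typing import Callable, Dict


def run_gate(cases: Dict[str, Callable[[], bool]]) -> int:
    """Run every eval case; return an exit code (0 = ship, 1 = block)."""
    failures = []
    for name, check in cases.items():
        try:
            ok = check()
        except Exception:
            ok = False  # an eval that crashes counts as a failure
        if not ok:
            failures.append(name)
    for name in failures:
        print(f"REGRESSION: {name}")
    return 1 if failures else 0


# Hypothetical case: returns True when the known input yields the correct state.
cases = {
    "dedupe_sorts_output": lambda: sorted(set([3, 1, 1, 2])) == [1, 2, 3],
}
exit_code = run_gate(cases)
# In a CI job you would finish with: sys.exit(exit_code)
```

A non-zero exit code is what most CI systems treat as a failed step, which is what lets the eval block the change rather than just report on it.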

1 Grade the outcome, not the path

Path-based evals check that the agent called tool X before tool Y. They penalize valid alternative solutions — a different API order, a smarter refactor — and frontier models routinely find solutions the eval author never anticipated. Grade the final state instead.

For coding agents, the most reliable outcome graders are deterministic tests: objective, fast, and path-agnostic. A passing suite is a correct outcome whether the agent took two steps or twenty.

# fragile: fails valid solutions
assert trace.tools == ["read", "edit", "test"]  # any other order is marked "fail"

# robust: accepts any correct solution
assert unique_sorted([3, 1, 1, 2]) == [1, 2, 3]  # path-agnostic
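To make the robust check above concrete, here is a minimal sketch of a final-state grader. The two `unique_sorted_*` implementations are hypothetical stand-ins for solutions an agent might reach by different routes; the grader only looks at behavior.

```python
from typing import Callable, List


def grade_outcome(candidate: Callable[[List[int]], List[int]]) -> bool:
    """Pass if the candidate behaves correctly, however it was produced."""
    cases = [
        ([3, 1, 1, 2], [1, 2, 3]),
        ([], []),
        ([5, 5, 5], [5]),
    ]
    return all(candidate(list(inp)) == expected for inp, expected in cases)


# Hypothetical solution A: set-based.
def unique_sorted_a(xs: List[int]) -> List[int]:
    return sorted(set(xs))


# Hypothetical solution B: sort, then drop adjacent duplicates.
def unique_sorted_b(xs: List[int]) -> List[int]:
    out: List[int] = []
    for x in sorted(xs):
        if not out or out[-1] != x:
            out.append(x)
    return out


assert grade_outcome(unique_sorted_a)
assert grade_outcome(unique_sorted_b)
assert not grade_outcome(lambda xs: xs)  # returns input unchanged: wrong outcome
```

Both correct solutions pass even though they share no steps, while an incorrect one fails. A path-based assertion could not draw that distinction.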

For subjective outcomes — readability, style, "follows team conventions" — combine the deterministic check with an LLM rubric grader: specify the criteria, have a model judge the result, never prescribe the steps.
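One way to combine the two graders is sketched below. It assumes the judge is any callable that takes a prompt string and answers "PASS" or "FAIL"; the prompt wording, `Verdict`, and `grade` are illustrative names, not a specific library's API. The rubric states criteria only and never mentions steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: a judge is a callable that maps a prompt to "PASS" or "FAIL".
Judge = Callable[[str], str]


@dataclass
class Verdict:
    tests_passed: bool
    rubric: Dict[str, bool]

    @property
    def passed(self) -> bool:
        # Deterministic tests are required; every rubric criterion must also pass.
        return self.tests_passed and all(self.rubric.values())


def grade(code: str, tests_passed: bool, criteria: List[str], judge: Judge) -> Verdict:
    rubric: Dict[str, bool] = {}
    if tests_passed:  # skip the judge when the objective check already failed
        for criterion in criteria:
            prompt = f"Criterion: {criterion}\n\nCode:\n{code}\n\nAnswer PASS or FAIL."
            rubric[criterion] = judge(prompt).strip().upper() == "PASS"
    return Verdict(tests_passed, rubric)


# Usage with a stub judge standing in for a real model call.
verdict = grade("def f(): ...", True, ["readable", "follows team conventions"],
                judge=lambda prompt: "PASS")
assert verdict.passed
```

Running the deterministic check first keeps the noisy instrument out of the loop whenever an objective answer is already available.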

2 Know when the judge is trustworthy

An LLM judge is a measurement instrument, and a noisy one. Before you gate on it, know its precision.

Judges are weaker than you think

AgentRewardBench measured 12 LLM judges on 1,302 web-agent trajectories — none cleared human inter-annotator agreement, with errors clustering around grounding mismatch and misunderstood actions. TRAIL found long-context models score only 11% on trace-debugging tasks. And when the same model both generates and grades, self-preference bias marks its own output as passing up to 50% more often than a neutral evaluator. Cross-check fixes with a different model family.
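The cross-check rule can be enforced mechanically. The sketch below picks a judge whose family differs from the generator's; the model names and the `family_of` mapping are made up for illustration.

```python
from typing import Dict, List


def pick_cross_family_judge(generator: str, candidates: List[str],
                            family_of: Dict[str, str]) -> str:
    """Return the first candidate judge whose family differs from the generator's."""
    gen_family = family_of[generator]
    for judge in candidates:
        if family_of[judge] != gen_family:
            return judge
    raise ValueError(f"no judge outside family {gen_family!r}")


# Hypothetical model names and families, for illustration only.
family_of = {"alpha-large": "alpha", "alpha-small": "alpha", "beta-large": "beta"}
judge = pick_cross_family_judge("alpha-large", ["alpha-small", "beta-large"], family_of)
assert judge == "beta-large"
```

Raising when no cross-family judge exists makes the self-grading case fail loudly instead of passing quietly.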

A practical floor: macro/aggregate analysis only earns trust above ~70% judge precision. Below it, aggregation amplifies the judge's mistakes — your "behavior patterns" become recurring judge errors that look like system behavior.
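A minimal sketch of checking the floor, assuming you have a small sample of traces labelled by both the judge and a human, and treating "flagged as a failure" as the positive class (an assumption; use whichever class your macro analysis aggregates). The labels below are invented.

```python
from typing import List, Optional

PRECISION_FLOOR = 0.70  # the ~70% floor from the text


def judge_precision(judge_flags: List[bool], human_flags: List[bool]) -> Optional[float]:
    """Of the traces the judge flagged, the fraction a human also flagged."""
    if len(judge_flags) != len(human_flags):
        raise ValueError("label lists must be the same length")
    tp = sum(1 for j, h in zip(judge_flags, human_flags) if j and h)
    fp = sum(1 for j, h in zip(judge_flags, human_flags) if j and not h)
    if tp + fp == 0:
        return None  # the judge flagged nothing, so precision is undefined
    return tp / (tp + fp)


# Invented labels: the judge flags four traces, the human agrees on three.
judge_flags = [True, True, True, False, True]
human_flags = [True, False, True, False, True]

precision = judge_precision(judge_flags, human_flags)
assert precision == 0.75
trust_macro = precision is not None and precision >= PRECISION_FLOOR
```

With only a handful of labelled traces the estimate itself is noisy, so a result just above the floor deserves a larger sample before you gate on it.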

3 Three eval tiers; don't skip to the top

Tier        Question it answers
Per-call    Was this single tool selection / response correct?
Per-trace   Did this whole run reach the right outcome?
Macro       Which problems recur across the corpus, and where?

Macro evals catch failures no single trace exposes — a constraint dropped in step 2, a drift when two conditions interact. But they're a heavy pipeline with three preconditions: thousands of traces, judge precision above the floor, and genuine cross-trace structure. Below ~1,000 traces, a frequency table of (case_type, error_code) carries the same signal at zero pipeline cost.
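The frequency table mentioned above takes only a few lines with the standard library. The trace records here are invented, and the field names `case_type` and `error_code` are assumed to exist on your failed traces.

```python
from collections import Counter

# Invented failed-trace records, for illustration only.
failed_traces = [
    {"case_type": "refund", "error_code": "E_TIMEOUT"},
    {"case_type": "refund", "error_code": "E_TIMEOUT"},
    {"case_type": "refund", "error_code": "E_SCHEMA"},
    {"case_type": "upgrade", "error_code": "E_SCHEMA"},
]

# Count how often each (case_type, error_code) pair occurs.
table = Counter((t["case_type"], t["error_code"]) for t in failed_traces)

# Print the pairs, most frequent first.
for (case_type, error_code), count in table.most_common():
    print(f"{case_type:10} {error_code:10} {count}")
```

Sorting by count gives a triage order without any judge or clustering step involved.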

Clusters are hypotheses, not verdicts

A macro cluster labelled "pricing-incentive-omission" is a place to look, not a fix to ship. The analysis pool is selection-biased — it describes the pathology of flagged traces, not how the system behaves overall. Read clusters as a triage queue; confirm the root cause before changing code, and re-run a held-out set after, so a fix that overfits the eval surface gets caught.
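A sketch of the held-out check, under the assumption that you keep per-case pass/fail results for both the eval set you tuned against and a held-out set, before and after the fix. The function names and result lists are hypothetical.

```python
from typing import List


def pass_rate(results: List[bool]) -> float:
    """Fraction of cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return sum(results) / len(results)


def overfit_suspected(dev_before: List[bool], dev_after: List[bool],
                      held_before: List[bool], held_after: List[bool]) -> bool:
    """True when the tuned set improved but the held-out set got worse."""
    dev_improved = pass_rate(dev_after) > pass_rate(dev_before)
    held_regressed = pass_rate(held_after) < pass_rate(held_before)
    return dev_improved and held_regressed


# Invented results: the fix helps the tuned set but breaks a held-out case.
suspect = overfit_suspected(
    dev_before=[True, False, False, True],
    dev_after=[True, True, True, True],
    held_before=[True, True, True, False],
    held_after=[True, False, True, False],
)
assert suspect
```

The held-out cases must stay out of the analysis pool and out of any prompt or rubric tuning, otherwise they stop measuring generalization.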

↪ Your win: a gate you can actually trust

Grade outcomes, not paths — deterministic tests first; rubric graders for subjective quality.
Measure judge precision before gating — judges trail human agreement and self-prefer.
Cross-check with a different model family when generator and grader would otherwise be the same.
Climb per-call → per-trace → macro; macro needs thousands of traces and a precise judge.
Treat clusters as hypotheses, and re-run a held-out set so fixes don't overfit the eval.

Retrieval practice — recall, don't peek

Question 1: Path-based evals tend to fail because they…

Question 2: The most reliable outcome grader for coding agents is…

Question 3: When the same model generates and grades, you risk…

Question 4: Below the judge-precision floor, macro aggregation…

Question 5 · spaced recall from Lesson 05: Of the loop layers, the one that stops rather than nudges is…

Ask me anything. Want the over-specification grading bugs that cause false negatives, or how to feed eval transcripts back to an agent to surface tool-description issues at scale? Next, the Capstone: a decision table that ties the whole course together.