Observability · ~7 min
Observability tells you what happened. Evals tell you whether a change made things worse. Grade the final state — not the path — and know exactly when an LLM judge is trustworthy.
Observability is descriptive — it shows what the agent did. An eval is a verdict: given a known input, is the system in the correct state? Wired into CI, evals become regression gates that block a change before it ships.
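One way to picture a regression gate is a small function that turns eval results into a process exit code. This is a minimal sketch, not a prescribed setup: `gate`, `baseline_pass_rate`, and `tolerance` are hypothetical names, and the result list is illustrative.

```python
def gate(results: list[bool], baseline_pass_rate: float, tolerance: float = 0.0) -> int:
    """Return an exit code: 0 if the pass rate holds, 1 if it regressed."""
    if not results:
        return 1  # assumption: an empty eval run is treated as a failure
    pass_rate = sum(results) / len(results)
    print(f"pass_rate={pass_rate:.2%} baseline={baseline_pass_rate:.2%}")
    return 0 if pass_rate >= baseline_pass_rate - tolerance else 1


# Illustrative results; a real run would collect these from the eval suite.
exit_code = gate([True, True, True, False], baseline_pass_rate=0.75)
print("exit code:", exit_code)
```

In a CI job, the script would end with `sys.exit(exit_code)`, so that a non-zero code fails the build.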
Path-based evals check that the agent called tool X before tool Y. They penalize valid alternative solutions — a different API order, a smarter refactor — and frontier models routinely find solutions the eval author never anticipated. Grade the final state instead.
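The difference can be sketched with two graders applied to the same runs. Everything here is an assumption for illustration: the `RunResult` shape, the tool names, and the file contents. The path grader is a deliberately strict variant that demands an exact call sequence.

```python
from dataclasses import dataclass, field


@dataclass
class RunResult:
    """What one agent run left behind (hypothetical shape)."""
    tool_calls: list[str]  # the path the agent took
    final_state: dict[str, str] = field(default_factory=dict)  # e.g. file -> contents


def grade_path(run: RunResult, expected_order: list[str]) -> bool:
    """Path-based: pass only if the tools were called in exactly this order."""
    return run.tool_calls == expected_order


def grade_final_state(run: RunResult, expected_state: dict[str, str]) -> bool:
    """State-based: pass if every expected key ends with the expected value."""
    return all(run.final_state.get(k) == v for k, v in expected_state.items())


expected_order = ["read_file", "write_file"]
expected_state = {"config.yaml": "retries: 3"}

# Two runs that reach the same correct state by different routes.
run_a = RunResult(["read_file", "write_file"], {"config.yaml": "retries: 3"})
run_b = RunResult(["search", "write_file", "read_file"], {"config.yaml": "retries: 3"})

for name, run in [("a", run_a), ("b", run_b)]:
    print(name, grade_path(run, expected_order), grade_final_state(run, expected_state))
```

Run `b` fails the path grader and passes the state grader, even though both runs leave the system in the same correct state.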
For subjective outcomes (readability, style, "follows team conventions"), pair a deterministic check with an LLM rubric grader: specify the criteria, have a model judge the result, and never prescribe the steps.
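A rough sketch of that combination follows. The judge is injected as a plain callable so the example runs without a model; in practice it would wrap a model call. The rubric text, `compiles`, and `stub_judge` are hypothetical stand-ins, and the stub's logic is a placeholder rather than a real quality signal.

```python
from typing import Callable

# A judge takes (criterion, artifact) and returns True if the criterion is met.
Judge = Callable[[str, str], bool]

RUBRIC = [
    "Function and variable names describe their purpose",
    "Each function has one clear responsibility",
]


def grade(artifact: str, deterministic_check: Callable[[str], bool], judge: Judge) -> dict:
    """Run the deterministic check first; consult the judge only if it passes."""
    if not deterministic_check(artifact):
        return {"passed": False, "reason": "deterministic check failed", "rubric": {}}
    rubric_results = {criterion: judge(criterion, artifact) for criterion in RUBRIC}
    return {
        "passed": all(rubric_results.values()),
        "reason": "rubric",
        "rubric": rubric_results,
    }


def compiles(src: str) -> bool:
    """Stand-in deterministic check: does the artifact parse as Python?"""
    try:
        compile(src, "<artifact>", "exec")
        return True
    except SyntaxError:
        return False


def stub_judge(criterion: str, artifact: str) -> bool:
    """Placeholder for a model call; flags one obviously vague function name."""
    return "def x(" not in artifact


good = grade("def total_price(items):\n    return sum(items)\n", compiles, stub_judge)
bad = grade("def broken(:\n", compiles, stub_judge)
print(good["passed"], bad["passed"], bad["reason"])
```

The rubric lists criteria about the result only; nothing in it says which tools or steps the agent should have used.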
An LLM judge is a measurement instrument, and a noisy one. Before you gate on it, know its precision.
AgentRewardBench measured 12 LLM judges on 1,302 web-agent trajectories: none cleared human inter-annotator agreement, with errors clustering around grounding mismatch and misunderstood actions. TRAIL found long-context models score only 11% on trace-debugging tasks. And when the same model both generates and grades, self-preference bias marks its own output as passing up to 50% more often than a neutral evaluator. Cross-check fixes with a judge from a different model family.
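Knowing a judge's precision means comparing its verdicts against human labels on the same runs. The sketch below assumes boolean pass/fail labels; the label lists are illustrative rather than measured data, and `PRECISION_FLOOR` is a hypothetical threshold to be set from your own calibration.

```python
def judge_precision(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Of the runs the judge marked as passing, the fraction humans also passed."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must be the same length")
    judged_pass = [h for j, h in zip(judge_labels, human_labels) if j]
    if not judged_pass:
        return 0.0  # judge passed nothing; precision is undefined, report 0.0 here
    return sum(judged_pass) / len(judged_pass)


# Illustrative labels only, not measured data.
judge = [True, True, True, False, True, False]
human = [True, False, True, False, True, True]

precision = judge_precision(judge, human)
PRECISION_FLOOR = 0.9  # hypothetical threshold
print(f"precision={precision:.2f}, gate_on_judge={precision >= PRECISION_FLOOR}")
```

Here the judge passed four runs and humans agreed on three, so precision is 0.75 and this judge would not be trusted as a gate under the assumed floor.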
| Tier | Question it answers |
|---|---|
| Per-call | Was this single tool selection / response correct? |
| Per-trace | Did this whole run reach the right outcome? |
| Macro | Which problems recur across the corpus, and where? |
Macro evals catch failures no single trace exposes: a constraint dropped in step 2, a drift when two conditions interact. But they're a heavy pipeline with three preconditions: thousands of traces, judge precision above the floor, and genuine cross-trace structure. Below ~1,000 traces, a frequency table of (case_type, error_code) carries the same signal at zero pipeline cost.
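The frequency table takes only a few lines with the standard library. The trace records below are hypothetical; only the `case_type` and `error_code` field names come from the text.

```python
from collections import Counter

# Hypothetical trace records, reduced to the two fields the table needs.
traces = [
    {"case_type": "refund", "error_code": "E_TIMEOUT"},
    {"case_type": "refund", "error_code": "E_TIMEOUT"},
    {"case_type": "pricing", "error_code": "E_MISSING_FIELD"},
    {"case_type": "refund", "error_code": "E_MISSING_FIELD"},
    {"case_type": "pricing", "error_code": "E_MISSING_FIELD"},
    {"case_type": "refund", "error_code": "E_TIMEOUT"},
]

# Count each (case_type, error_code) pair across the corpus.
counts = Counter((t["case_type"], t["error_code"]) for t in traces)

# Most frequent pairs first: the top rows are where to look.
for (case_type, error_code), n in counts.most_common():
    print(f"{case_type:<10} {error_code:<18} {n}")
```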
A macro cluster labelled "pricing-incentive-omission" is a place to look, not a fix to ship. The analysis pool is selection-biased — it describes the pathology of flagged traces, not how the system behaves overall. Read clusters as a triage queue; confirm the root cause before changing code, and re-run a held-out set after, so a fix that overfits the eval surface gets caught.
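A held-out set is only useful if cases never leak between splits. One possible approach, sketched here under assumptions, is to assign each case by hashing its ID so the split stays stable across runs. The `case-NNNN` IDs and the 20% fraction are hypothetical.

```python
import hashlib


def is_held_out(case_id: str, held_out_fraction: float = 0.2) -> bool:
    """Stable assignment: the same case_id always lands in the same split."""
    digest = hashlib.sha256(case_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < held_out_fraction


case_ids = [f"case-{i:04d}" for i in range(1000)]
dev = [c for c in case_ids if not is_held_out(c)]
held_out = [c for c in case_ids if is_held_out(c)]
print(len(dev), len(held_out))
```

Iterate on fixes against `dev` only, then re-run `held_out` once the fix is in; a fix that helps `dev` but not `held_out` is a sign it overfit the eval surface.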
Retrieval practice — recall, don't peek
Question 1: Path-based evals tend to fail because they…
Question 2: The most reliable outcome grader for coding agents is…
Question 3: When the same model generates and grades, you risk…
Question 4: Below the judge-precision floor, macro aggregation…
Question 5 · spaced recall from Lesson 05: Of the loop layers, the one that stops rather than nudges is…