Part 3 · Evaluating Behavior

Verifying Agent Work · ~8 min

Evals at Scale

Some failures live in no single run. The agent answers every turn plausibly, yet a pattern across a thousand traces is broken. And the thing you edit most — the skill — usually ships with no eval at all.

Why this, for you: the per-trace evals of the last three lessons miss two things — failures that are statistical (only visible across a corpus) and the quality of the skills you keep editing. This lesson adds the population layer and the unit layer, and — importantly — the thresholds below which each is just theatre.

Macro evaluation is the population-level layer above per-trace evals: which problems repeat, where they concentrate, which part of the workflow to inspect first. Skill evals are the unit layer below: is this particular skill still earning its cost? Both are powerful, and both have a floor below which a frequency table or a smoke check does the same job for free.

1 Macro evals — the pattern across the corpus

An agent that drops a constraint in step 2, or drifts when two conditions interact, produces individually plausible traces. The failure is the concentration of similar suboptimal decisions across runs — not the badness of any one. Shift the unit of analysis to the corpus.

The pipeline: per-call rubrics → embed traces → UMAP dimensionality reduction → HDBSCAN density clustering → c-TF-IDF labels → rank by impact_score = prevalence × severity. In the reference run, 992 traces of an EV-order workflow scored fine per-call, but the macro layer surfaced a cluster — pricing systematically ignored a substitution incentive whenever stockout pressure compounded with it. No single trace looked broken.
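
The last step of the pipeline, ranking clusters by impact_score, can be sketched in a few lines. This is a minimal sketch, assuming the earlier stages have already produced cluster labels and one judge severity per trace. The `Cluster` type, the 0-to-1 severity scale, and the choice of mean severity are illustrative assumptions, not the reference run's implementation.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    label: str               # e.g. a c-TF-IDF label (hypothetical type)
    severities: list[float]  # one judge severity per trace, assumed to be in 0..1


def impact_score(cluster: Cluster, corpus_size: int) -> float:
    """prevalence x severity, with severity taken as the cluster mean (assumption)."""
    if corpus_size <= 0 or not cluster.severities:
        return 0.0
    prevalence = len(cluster.severities) / corpus_size
    mean_severity = sum(cluster.severities) / len(cluster.severities)
    return prevalence * mean_severity


def rank_clusters(clusters: list[Cluster], corpus_size: int) -> list[tuple[str, float]]:
    """Return (label, impact_score) pairs, highest impact first."""
    scored = [(c.label, impact_score(c, corpus_size)) for c in clusters]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A large, mildly severe cluster can outrank a small, very severe one under this product, which is the intended behavior when the goal is deciding what to inspect first.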

Three pre-conditions must hold, or macro evals are theatre: thousands of traces (below ~1,000, clustering reports noise or merges unrelated cases), judge precision above ~70% (aggregation concentrates judge bias rather than averaging it out), and cross-trace structure worth aggregating. When any of them fails, a sorted (case_type, error_code) table carries the same signal at zero pipeline cost. And clusters are hypotheses, not diagnoses: the analysis pool is selection-biased toward flagged traces, so a cluster describes flagged-trace pathology rather than the system as a whole.
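
The two numeric pre-conditions and the fallback table are simple enough to sketch. This is a minimal sketch, assuming each trace is a dict with `case_type` and `error_code` keys and that a human-labeled sample exists for checking the judge. The function names and default thresholds are illustrative, and the third pre-condition (real cross-trace structure) still needs human judgment.

```python
from collections import Counter


def judge_precision(judge_flags: list[bool], human_flags: list[bool]) -> float:
    """Of the traces the judge flagged, the fraction a human also flagged."""
    flagged = [human for judge, human in zip(judge_flags, human_flags) if judge]
    return sum(flagged) / len(flagged) if flagged else 0.0


def clears_macro_floor(n_traces: int, precision: float,
                       min_traces: int = 1000, min_precision: float = 0.70) -> bool:
    """Rough gate for the two numeric pre-conditions (thresholds are approximate)."""
    return n_traces >= min_traces and precision > min_precision


def frequency_table(traces: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """The zero-cost fallback: (case_type, error_code) pairs, most frequent first."""
    counts = Counter((t["case_type"], t["error_code"]) for t in traces)
    return counts.most_common()
```

If `clears_macro_floor` returns False, reading the top rows of `frequency_table` is usually the cheaper first move.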

2 Skill evals — the unit you edit most

Skills are edited far more often than the harness, yet most teams have no signal that a skill still works after an edit or a model upgrade. Evaluate each skill on two axes: output quality (right result when loaded) and trigger precision (the description fires on the prompts it should, stays dormant on those it shouldn't). Output-only evals leave trigger failures invisible; trigger-only evals leave silent output regressions unreported.
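
The trigger axis can be measured separately from output quality. This is a minimal sketch, assuming you have a set of prompts hand-labeled for whether the skill should fire and an observation of whether it did. The `TriggerCase` type is hypothetical, and splitting the lesson's "trigger precision" into precision and recall is an interpretation, not something the lesson prescribes.

```python
from dataclasses import dataclass


@dataclass
class TriggerCase:
    prompt: str
    should_fire: bool  # human label: is this prompt in the skill's scope?
    did_fire: bool     # observed: did the agent load the skill?


def trigger_metrics(cases: list[TriggerCase]) -> dict[str, float]:
    """Precision: of the fires, how many were wanted. Recall: of the wanted, how many fired."""
    true_pos = sum(c.should_fire and c.did_fire for c in cases)
    fired = sum(c.did_fire for c in cases)
    wanted = sum(c.should_fire for c in cases)
    return {
        "precision": true_pos / fired if fired else 0.0,
        "recall": true_pos / wanted if wanted else 0.0,
    }
```

Low precision means the description fires on prompts it should ignore; low recall means it stays dormant on prompts it should handle.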

    # run each case twice in ISOLATED contexts, no cross-run bleed
    with_skill: pass rate 0.83 | without_skill: 0.33
    delta: +50 points at +13s / +1,700 tokens
    # the delta makes the skill's cost-benefit explicit before shipping
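
The delta itself is plain arithmetic over the paired runs. This is a minimal sketch, assuming each `RunResult` was produced in a fresh, isolated context by a runner not shown here. The type and function names are hypothetical, and the code only aggregates results, it does not execute the agent.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    passed: bool
    seconds: float
    tokens: int


def summarize(runs: list[RunResult]) -> dict[str, float]:
    """Mean pass rate, wall time, and token count over a set of runs."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    return {
        "pass_rate": sum(r.passed for r in runs) / n,
        "seconds": sum(r.seconds for r in runs) / n,
        "tokens": sum(r.tokens for r in runs) / n,
    }


def paired_delta(with_skill: list[RunResult],
                 without_skill: list[RunResult]) -> dict[str, float]:
    """With-skill mean minus without-skill mean; pass rate in percentage points."""
    w, wo = summarize(with_skill), summarize(without_skill)
    return {
        "pass_points": round((w["pass_rate"] - wo["pass_rate"]) * 100, 1),
        "seconds": round(w["seconds"] - wo["seconds"], 1),
        "tokens": round(w["tokens"] - wo["tokens"]),
    }
```

Reporting all three numbers together keeps the trade visible: a pass-rate gain is only worth shipping if the added time and tokens are acceptable for the workflow.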

Store evals/evals.json next to SKILL.md; start with 2–3 cases and add assertions only after the first run, because defining "good" before seeing any output yields weak checks. To plan for model upgrades, sort skills into two kinds: capability uplift (retire the skill if the raw model catches up) and encoded preference (durable; check workflow fidelity, not raw quality).
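
A starting file might look like the sketch below. The lesson names the file but not its fields, so every key here (`skill`, `cases`, `should_trigger`, `assertions`) is a hypothetical schema chosen for illustration, and the prompts are placeholders.

```python
import json

# Hypothetical schema: the lesson specifies the file location, not its fields.
scaffold = {
    "skill": "example-skill",  # placeholder name
    "cases": [
        {
            "id": "case-1",
            "prompt": "A prompt the skill should handle",
            "should_trigger": True,
            "assertions": [],  # left empty on purpose: fill in after the first run
        },
        {
            "id": "case-2",
            "prompt": "A nearby prompt the skill should ignore",
            "should_trigger": False,
            "assertions": [],
        },
    ],
}


def render_scaffold(data: dict) -> str:
    """Serialize to the text you would save as evals/evals.json."""
    return json.dumps(data, indent=2)
```

Including one case that should not trigger from the start keeps the trigger axis covered even while the assertion lists are still empty.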

Keep the judge off mechanical checks — again

The same LLM-judge caution from Lesson 10 bites hard at scale. The assertion patterns to watch:

  • Passes in both configurations: the assertion isn't discriminating (remove it).
  • Fails in both: it is broken or impossible (fix before re-running).
  • Pass-with / fail-without: this is where the skill earns its cost.
  • High variance across runs: the instructions are ambiguous (add examples).

Skip skill evals entirely for single-user, highly subjective, or mid-rewrite skills, where the harness cost exceeds the signal.
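
The assertion patterns map cleanly onto a small classifier. This is a minimal sketch, assuming each assertion has been run several times in both configurations. Treating any disagreement between repeated runs as "high variance" is a deliberately strict assumption, and the fail-with / pass-without branch is not covered by the lesson; it is included only so the function is total.

```python
def _verdict(results: list[bool]) -> str:
    """Collapse repeated runs of one assertion into pass, fail, or mixed."""
    if not results:
        raise ValueError("need at least one run per configuration")
    if all(results):
        return "pass"
    if not any(results):
        return "fail"
    return "mixed"


def classify_assertion(with_skill: list[bool], without_skill: list[bool]) -> str:
    """Map one assertion's results onto the patterns to watch."""
    w, wo = _verdict(with_skill), _verdict(without_skill)
    if "mixed" in (w, wo):
        return "high_variance"          # ambiguous instructions: add examples
    if w == "pass" and wo == "pass":
        return "non_discriminating"     # remove it
    if w == "fail" and wo == "fail":
        return "broken_or_impossible"   # fix before re-running
    if w == "pass" and wo == "fail":
        return "earns_its_cost"
    return "skill_hurts"                # fail-with / pass-without (not in the lesson)
```

Running this over every assertion after each eval pass gives a quick cleanup list before anyone reads individual transcripts.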

↪ Your win: the population layer and the unit layer

  • Macro evals find statistical failures — the concentration across runs, not any one trace.
  • Respect the floor — thousands of traces, ~70% judge precision, real cross-trace structure.
  • Clusters are hypotheses — a selection-biased pool describes flagged-trace pathology, not the system.
  • Evaluate skills on two axes — output quality and trigger precision, in isolated paired runs.
  • The delta is the point — pass-rate, time, tokens with vs without; ship only if it justifies cost.

Retrieval practice — recall, don't peek

Question 1: Macro evals surface failures that are…

Question 2: Below ~1,000 traces, macro clustering tends to…

Question 3: The two axes a skill eval must cover are output quality and…

Question 4: A skill-eval assertion that passes in both configurations should be…

Question 5 · spaced recall from Lesson 10: Grading the outcome instead of the path matters because path graders…

Ask me anything. Want an evals/evals.json scaffold with a paired with/without runner for one of your skills, or a check on whether your trace volume and judge precision actually clear the macro-eval floor? Next, the Capstone — the whole course as a symptom→move table, plus a mixed review.