Verifying Agent Work · ~8 min
Some failures live in no single run. The agent answers every turn plausibly, yet a pattern across a thousand traces is broken. And the thing you edit most — the skill — usually ships with no eval at all.
Macro evaluation is the population-level layer above per-trace evals: which problems repeat, where they concentrate, which part of the workflow to inspect first. Skill evals are the unit layer below: is this particular skill still earning its cost? Both are powerful, and both have a floor below which a frequency table or a smoke check does the same job for free.
The pipeline: per-call rubrics → embed traces → UMAP dimensionality reduction → HDBSCAN density clustering → c-TF-IDF labels → rank by impact_score = prevalence × severity. In the reference run, 992 traces of an EV-order workflow scored fine per call, but the macro layer surfaced a cluster: pricing systematically ignored a substitution incentive whenever stockout pressure compounded with it. No single trace looked broken.
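The final ranking step of that pipeline can be sketched in a few lines. This is a minimal illustration, assuming prevalence is a cluster's share of all traces and severity is a mean rubric score in [0, 1]; the `Cluster` type, the field names, and the toy numbers are hypothetical and not taken from the reference run.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    label: str       # e.g. a c-TF-IDF label for the cluster
    size: int        # number of traces assigned to the cluster
    severity: float  # assumed: mean rubric severity in [0, 1]


def rank_by_impact(clusters, total_traces):
    """Return (label, impact_score) pairs, highest impact first."""
    scored = []
    for c in clusters:
        prevalence = c.size / total_traces  # share of all traces
        scored.append((c.label, prevalence * c.severity))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data for illustration only.
clusters = [
    Cluster("slow response", size=200, severity=0.1),
    Cluster("pricing ignores substitution incentive", size=40, severity=0.9),
    Cluster("formatting drift", size=100, severity=0.05),
]
ranked = rank_by_impact(clusters, total_traces=1000)
for label, score in ranked:
    print(f"{score:.3f}  {label}")
```

In this toy data a small but severe cluster outranks a large but mild one, which is the point of multiplying the two factors instead of sorting by cluster size alone.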
Below ~1,000 traces, a plain (case_type, error_code) frequency table carries the same signal at zero pipeline cost. And clusters are hypotheses, not diagnoses: the analysis pool is selection-biased toward flagged traces.
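The (case_type, error_code) table needs nothing more than a counter. The sketch below assumes each trace is a dict carrying those two keys, with `error_code` set to `None` for clean runs; that record layout is an assumption for illustration, not something the lesson specifies.

```python
from collections import Counter


def failure_table(traces):
    """Count failing traces by (case_type, error_code), most frequent first."""
    counts = Counter(
        (t["case_type"], t["error_code"])
        for t in traces
        if t.get("error_code") is not None  # skip clean runs
    )
    return counts.most_common()


# Toy traces for illustration only.
traces = [
    {"case_type": "substitution", "error_code": "PRICE_MISMATCH"},
    {"case_type": "substitution", "error_code": "PRICE_MISMATCH"},
    {"case_type": "standard", "error_code": None},
    {"case_type": "stockout", "error_code": "TIMEOUT"},
]
table = failure_table(traces)
for (case_type, error_code), n in table:
    print(f"{n:>4}  {case_type:<14} {error_code}")
```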
Skills are edited far more often than the harness, yet most teams have no signal that a skill still works after an edit or a model upgrade. Evaluate each skill on two axes: output quality (right result when loaded) and trigger precision (the description fires on the prompts it should, stays dormant on those it shouldn't). Output-only evals leave trigger failures invisible; trigger-only evals leave silent output regressions unreported.
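The trigger axis can be scored from a small labeled prompt set. The sketch below is one reading of "trigger precision": it reports precision (of the times the skill fired, how often it should have) alongside recall (of the times it should have fired, how often it did), so both failure directions are visible. The `(should_fire, did_fire)` input shape is an assumption.

```python
def trigger_metrics(cases):
    """cases: iterable of (should_fire, did_fire) boolean pairs."""
    cases = list(cases)
    tp = sum(1 for should, did in cases if should and did)
    fp = sum(1 for should, did in cases if not should and did)  # fired on the wrong prompt
    fn = sum(1 for should, did in cases if should and not did)  # stayed dormant when needed
    return {
        "precision": tp / (tp + fp) if tp + fp else None,
        "recall": tp / (tp + fn) if tp + fn else None,
        "false_fires": fp,
        "missed_fires": fn,
    }


# Toy labels for illustration only.
metrics = trigger_metrics([
    (True, True),
    (True, True),
    (True, False),   # missed fire
    (False, True),   # false fire
    (False, False),
])
print(metrics)
```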
Store evals/evals.json next to SKILL.md; start with 2–3 cases and add assertions after the first run (defining "good" before seeing output yields weak checks). To plan for model upgrades, sort skills into two kinds: capability uplift (retire the skill if the raw model catches up) versus encoded preference (durable; check workflow fidelity, not raw quality).
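A starter file might be generated as below. The JSON layout (`cases`, `id`, `prompt`, `assertions`) is a hypothetical schema invented for this sketch, not a documented format; only the evals/evals.json location and the advice to start with a few cases and empty assertions come from the text above.

```python
import json
import tempfile
from pathlib import Path


def scaffold_evals(skill_dir, prompts):
    """Write evals/evals.json inside skill_dir (the folder holding SKILL.md)."""
    evals_path = Path(skill_dir) / "evals" / "evals.json"
    evals_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "cases": [
            # Assertions start empty: add them after inspecting the first run.
            {"id": i, "prompt": prompt, "assertions": []}
            for i, prompt in enumerate(prompts, start=1)
        ]
    }
    evals_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return evals_path


# Demo in a throwaway directory; the prompts are placeholders.
with tempfile.TemporaryDirectory() as tmp:
    path = scaffold_evals(tmp, ["first test prompt", "second test prompt"])
    relative = path.relative_to(tmp)
    loaded = json.loads(path.read_text(encoding="utf-8"))
print(relative, len(loaded["cases"]))
```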
The same LLM-judge caution from Lesson 10 bites hard at scale. The assertion patterns to watch: an assertion that passes in both configurations isn't discriminating (remove it); one that fails in both is broken or impossible (fix before re-running); pass-with / fail-without is where the skill earns its cost; high variance across runs means ambiguous instructions (add examples). Skip skill evals entirely for single-user, highly subjective, or mid-rewrite skills — the harness cost exceeds the signal.
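The four assertion patterns translate into a small triage function over paired runs. This is a sketch under stated assumptions: the 0.8 and 0.2 pass-rate thresholds are arbitrary choices, and the final "unclassified" branch (passes without the skill, fails with it) is an addition here that the lesson does not name.

```python
def classify_assertion(with_skill, without_skill, pass_at=0.8, fail_at=0.2):
    """with_skill / without_skill: lists of booleans, one per repeated run."""
    def rate(runs):
        return sum(runs) / len(runs)

    w, wo = rate(with_skill), rate(without_skill)
    w_pass, w_fail = w >= pass_at, w <= fail_at
    wo_pass, wo_fail = wo >= pass_at, wo <= fail_at

    # A pass rate stuck between the thresholds means the runs disagree.
    if not (w_pass or w_fail) or not (wo_pass or wo_fail):
        return "high-variance"          # ambiguous instructions: add examples
    if w_pass and wo_pass:
        return "non-discriminating"     # remove the assertion
    if w_fail and wo_fail:
        return "broken-or-impossible"   # fix before re-running
    if w_pass and wo_fail:
        return "earns-its-cost"         # the skill makes the difference
    return "unclassified"               # not one of the four patterns: investigate


print(classify_assertion([True] * 5, [False] * 5))
```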
Retrieval practice — recall, don't peek
Question 1: Macro evals surface failures that are…
Question 2: Below ~1,000 traces, macro clustering tends to…
Question 3: The two axes a skill eval must cover are output quality and…
Question 4: A skill-eval assertion that passes in both configurations should be…
Question 5 · spaced recall from Lesson 10: Grading the outcome instead of the path matters because path graders…
Want an evals/evals.json scaffold with a paired with/without runner for one of your skills, or a check on whether your trace volume and judge precision actually clear the macro-eval floor? Next, the Capstone: the whole course as a symptom→move table, plus a mixed review.