Define "Done" First

Write the evaluation before the feature. A low pass rate on a new eval isn't a problem — it's the improvement surface made visible before a single line of code exists.

Why this, for you: "done" for an agent feature is slippery — without a definition written first, you reverse-engineer success from whatever the agent already does, bugs included. Evals make "done" objective, and they're what lets you adopt a new model in days instead of weeks.

Teams that write evals after the fact embed the agent's current behavior — including its bugs — into the definition of correct. The suite then grades what the agent does, not what it should do. Writing evals first forces the clarity: you decide what "done" means before building toward it.

1 What to write before any feature code

Three artifacts, in order:

Define the tasks — 20–50 representative inputs, sourced from real failures, anticipated edge cases, and the behaviors that motivated the feature. Precision matters more than volume; a small set still shows signal.
Define success criteria — what a correct output looks like per task. This is the hardest part. If two domain experts can't agree on the pass/fail verdict, the task isn't eval-ready yet.
Choose a grader — automated checks for deterministic outcomes (tests, schema validation, state comparison); an LLM rubric with explicit criteria for subjective ones; both for complex tasks.
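The three artifacts above can be captured in a few lines of code. The following is a minimal sketch, not a prescribed format: the `EvalTask` fields, the `valid_issue_json` grader, and the sample task are all hypothetical names chosen for illustration, and the grader is the deterministic kind (schema validation).

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalTask:
    """One eval task: an input, its success criteria, and a grader (hypothetical shape)."""
    task_id: str
    prompt: str                      # representative input
    criteria: str                    # plain-language definition of a correct output
    grader: Callable[[str], bool]    # automated check that applies the criteria


def valid_issue_json(output: str) -> bool:
    """Deterministic grader: JSON object with a non-empty 'title' and a 'labels' list."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return (
        isinstance(data, dict)
        and isinstance(data.get("title"), str)
        and data["title"].strip() != ""
        and isinstance(data.get("labels"), list)
    )


# Illustrative task, written before any feature code exists.
TASKS = [
    EvalTask(
        task_id="file-bug-001",
        prompt="File a bug: login page returns 500 after password reset.",
        criteria="Output is a JSON object with a non-empty title and a list of labels.",
        grader=valid_issue_json,
    ),
]
```

Keeping the criteria as text next to the grader makes it easy for two reviewers to check that the automated check matches the written definition.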

Run the suite against a baseline before development. A low pass rate on a new capability eval is a feature, not a problem — it identifies the gap and makes progress visible as you implement. You likely already have inputs: manual checks, production incidents, and exploratory prompts all convert into eval tasks.
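A baseline run can be as small as the sketch below. It assumes an agent is any callable from prompt to output text and that each task carries its own grader; `run_suite`, `pass_rate`, `baseline_agent`, and the two greeting tasks are hypothetical and exist only to show the shape of the loop.

```python
from typing import Callable, Dict, List, Tuple

# (task_id, prompt, grader) is an assumed minimal task shape for this sketch.
Task = Tuple[str, str, Callable[[str], bool]]


def run_suite(agent: Callable[[str], str], tasks: List[Task]) -> Dict[str, bool]:
    """Run every task through the agent and record pass/fail per task id."""
    results: Dict[str, bool] = {}
    for task_id, prompt, grader in tasks:
        try:
            results[task_id] = bool(grader(agent(prompt)))
        except Exception:
            results[task_id] = False  # a crash counts as a failure
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    """Fraction of tasks that passed; 0.0 for an empty suite."""
    return sum(results.values()) / len(results) if results else 0.0


def baseline_agent(prompt: str) -> str:
    """Stand-in for the agent before the feature exists: it produces nothing useful."""
    return ""


tasks: List[Task] = [
    ("greet-001", "Say hello to Ada.", lambda out: "Ada" in out),
    ("greet-002", "Say hello to Grace.", lambda out: "Grace" in out),
]

baseline = run_suite(baseline_agent, tasks)
print(f"baseline pass rate: {pass_rate(baseline):.0%}")
```

Storing the per-task results, not just the rate, lets later runs show which tasks moved from fail to pass.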

2 Evals as executable specifications

A well-defined eval task answers "does this feature work?" with a reproducible check, not a manual judgment call. That has a compounding payoff at model-upgrade time.

Evals turn model upgrades from weeks to days

Teams with evals in place adopt new model releases in days; teams without them face weeks of manual regression testing per upgrade. The same loop applies to building tools: prototype → write eval tasks → run → track metrics → analyze transcripts → iterate. In one worked example, adding state and label filters to a search_issues tool lifted accuracy from 53% to 80% and dropped average tool calls from 9.4 to 4.1 — one targeted change, measured.
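The "track metrics" step of that loop can be sketched as a comparison between two runs of the same task set. Everything here is an assumption for illustration: the `RunRecord` shape, the function names, and the toy records, which are made up and are not the figures from the worked example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class RunRecord:
    """One task's outcome in one run (hypothetical shape)."""
    task_id: str
    passed: bool
    tool_calls: int


def summarize(records: List[RunRecord]) -> Dict[str, float]:
    """Accuracy and average tool calls for a single run."""
    if not records:
        raise ValueError("cannot summarize an empty run")
    return {
        "accuracy": mean(1.0 if r.passed else 0.0 for r in records),
        "avg_tool_calls": mean(r.tool_calls for r in records),
    }


def compare(before: List[RunRecord], after: List[RunRecord]) -> Dict[str, float]:
    """Per-metric change from the earlier run to the later one."""
    b, a = summarize(before), summarize(after)
    return {metric: a[metric] - b[metric] for metric in b}


# Toy records, invented for this sketch.
before = [RunRecord("t1", False, 9), RunRecord("t2", True, 11)]
after = [RunRecord("t1", True, 4), RunRecord("t2", True, 4)]
delta = compare(before, after)
```

A positive accuracy delta alongside a negative tool-call delta is the pattern the worked example describes: the change made the agent both more correct and more direct.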

3 The pitfalls that make evals lie

Four ways an eval suite misleads: overfitting to the implementation (write tasks from expected behavior, not observed behavior); ambiguous pass/fail (if two experts disagree, fix the task before committing it); over-strict graders (exact match rejects valid alternative solutions, so use outcome-based or semantic graders); and too few tasks (five is enough to get started, but not enough to catch regressions reliably).
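The over-strict grader pitfall is easy to see in code. In this sketch, which assumes the agent's output is JSON, an exact-match grader rejects an answer that differs only in key order, while an outcome-based grader compares the parsed values. Both function names and the sample strings are hypothetical.

```python
import json


def exact_match_grader(output: str, expected: str) -> bool:
    """Too strict: any difference in formatting or key order fails."""
    return output == expected


def outcome_grader(output: str, expected: str) -> bool:
    """Outcome-based: compare the parsed data, not the raw text."""
    try:
        return json.loads(output) == json.loads(expected)
    except json.JSONDecodeError:
        return False


expected = '{"state": "open", "labels": ["bug", "ui"]}'
candidate = '{"labels": ["bug", "ui"], "state": "open"}'  # same content, different key order

print(exact_match_grader(candidate, expected))  # False
print(outcome_grader(candidate, expected))      # True
```

The outcome grader still fails an answer whose content is wrong, so loosening the comparison does not loosen the criteria.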

When to skip evals-first

Defer it for early exploration of a novel space (committing to pass/fail anchors you to metrics that may prove irrelevant), short-lived prototypes (the harness outweighs the artifact), highly subjective outputs with shifting preferences (tasks pass while users dislike the result), and unstable upstream dependencies (the set breaks faster than it yields signal). The heuristic: if you can't get two reviewers to agree on 20 tasks, the problem isn't eval-ready — iterate manually first, then convert observations into evals.
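The two-reviewer heuristic can be checked mechanically once both reviewers have labeled the same tasks. This sketch reads the heuristic as "at least 20 tasks, with no disagreements", which is one interpretation; the function names and the `min_tasks` parameter are assumptions.

```python
from typing import Dict, List


def disagreements(reviewer_a: Dict[str, bool], reviewer_b: Dict[str, bool]) -> List[str]:
    """Task ids where the two reviewers gave different pass/fail verdicts."""
    if reviewer_a.keys() != reviewer_b.keys():
        raise ValueError("both reviewers must label the same task ids")
    return sorted(tid for tid in reviewer_a if reviewer_a[tid] != reviewer_b[tid])


def eval_ready(
    reviewer_a: Dict[str, bool],
    reviewer_b: Dict[str, bool],
    min_tasks: int = 20,  # taken from the heuristic above; adjust as needed
) -> bool:
    """True when there are enough tasks and the reviewers agree on all of them."""
    return len(reviewer_a) >= min_tasks and not disagreements(reviewer_a, reviewer_b)
```

The list returned by `disagreements` doubles as a to-do list: those are the tasks whose success criteria need tightening before they join the suite.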

↪ Your win: an objective "done," and faster upgrades

Write tasks, criteria, and a grader before the feature — "done" gets a definition up front.
Baseline first — a low pass rate is the improvement surface, not a failure.
Convert what you already have — manual checks, incidents, exploratory prompts become tasks.
Bank the upgrade speed — evals turn model adoption from weeks of regression into days.
Watch the four pitfalls — overfitting, ambiguity, over-strict graders, too few tasks.

Retrieval practice — recall, don't peek

Question 1: Writing evals after the feature tends to…

Question 2: A low pass rate on a new capability eval is…

Question 3: If two domain experts disagree on a task's pass/fail verdict, you should…

Question 4: Teams with eval suites adopt new model releases in…

Question 5 · spaced recall from Lesson 10: Entropy reduction agents differ from CI mainly because they are…

Ask me anything. Want the eval-first tasks.yaml + LLM-judge runner skeleton, or how the eval loop closes into the agentic flywheel — agents analyzing their own transcripts to propose harness fixes? Next: the Capstone — symptom → move, across the whole course.