Evaluator-Optimizer

A generator produces, a critic judges, feedback recycles until a bar is met. The loop only works when the bar is machine-checkable — and only helps when the generator was weak to begin with.

Why this, for you: Lesson 6 made "done" a deterministic gate. This lesson turns that gate into a loop: the same machine-checkable signal that decides done can drive iterative refinement. But there's a sharp trap — point this loop at output the generator already nails and it makes things worse. Knowing when not to run it is half the skill.

The evaluator-optimizer pattern pairs two roles in a loop: a generator produces output and revisions, and an evaluator applies criteria and returns structured feedback. A termination condition ends the loop when the evaluator returns PASS. For coding it maps cleanly: generate code → run tests → failures feed back → repeat.

1 The loop and its termination

The evaluator must return structured output — a verdict plus specific issues — so the generator acts on precise feedback, not parsed prose. And every loop needs a hard stop.

Anthropic's reference implementation ships an unbounded while True: loop that only exits on PASS, so production callers must impose their own round cap. A starting limit of 3 is common. Without one, conflicting assumptions between the two roles run the loop to budget exhaustion. Hitting the cap isn't a failure; it's a signal that the criteria or the generator need adjustment.

# generator → evaluator → PASS or FAIL+feedback → repeat
gen:  sort_by_key(items, key)   # misses missing-key case
eval: {"verdict":"FAIL","issues":["KeyError on items without the field"]}
gen:  + key=lambda x: x.get(field, "")
eval: {"verdict":"PASS","issues":[]}   # terminate after 2 rounds
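The trace above can be sketched as a bounded loop. This is a minimal illustration, not the reference implementation: `run_loop`, `Evaluation`, and the two stub functions are hypothetical names, and the stubs stand in for real model calls. The default cap of 3 follows the starting limit mentioned above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Evaluation:
    verdict: str                                   # "PASS" or "FAIL"
    issues: List[str] = field(default_factory=list)


def run_loop(
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Evaluation],
    task: str,
    max_rounds: int = 3,                           # hard stop; 3 is only a starting point
) -> Tuple[Optional[str], int, List[str]]:
    """Return (last_output, rounds_used, unresolved_issues)."""
    feedback: List[str] = []
    output: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        output = generate(task, feedback)          # revise using the last round's issues
        result = evaluate(output)
        if result.verdict == "PASS":
            return output, round_no, []
        feedback = result.issues
    # Cap reached: hand back the unresolved issues so the caller can inspect them.
    return output, max_rounds, feedback


# Hypothetical stubs standing in for model calls.
def stub_generate(task: str, feedback: List[str]) -> str:
    return "v2: uses x.get(field, '')" if feedback else "v1: uses x[field]"


def stub_evaluate(output: str) -> Evaluation:
    if ".get(" in output:
        return Evaluation("PASS")
    return Evaluation("FAIL", ["KeyError on items without the field"])


output, rounds, unresolved = run_loop(stub_generate, stub_evaluate, "sort items by key")
print(rounds, unresolved)  # 2 []
```

Returning the unresolved issues when the cap is hit keeps that outcome visible to the caller, which fits the idea that a capped run is a signal to adjust the criteria or the generator rather than a silent failure.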

2 Designing the evaluator

The evaluator can be the same model with a different system prompt — lower cost, but it may inherit the generator's blind spots — or a different model, for a fully independent perspective. Either way, criteria must be explicit and machine-checkable: tests pass, lint is clean, the spec is satisfied. Vague prose ("is it high quality?") makes PASS/FAIL noisy and the loop runs past the point of improvement, burning tokens without converging. When the bar can be a deterministic checker, prefer it over an LLM judge.
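As one way to picture a deterministic checker, the sketch below runs a candidate function against fixed test cases and returns the same verdict-plus-issues shape used in the trace earlier. The names `evaluate_with_tests`, `sort_v1`, `sort_v2`, and the test cases are all hypothetical, chosen to mirror the missing-key example; a real setup would more likely call an existing test runner or linter.

```python
from typing import Any, Callable, Dict, List, Tuple

Case = Tuple[tuple, Any]  # (positional args, expected result)


def evaluate_with_tests(candidate: Callable[..., Any], cases: List[Case]) -> Dict[str, Any]:
    """Deterministic evaluator: run every case, collect specific issues."""
    issues: List[str] = []
    for args, expected in cases:
        try:
            actual = candidate(*args)
        except Exception as exc:                   # report the failure, keep checking
            issues.append(f"{type(exc).__name__} on input {args!r}")
            continue
        if actual != expected:
            issues.append(f"input {args!r}: expected {expected!r}, got {actual!r}")
    return {"verdict": "PASS" if not issues else "FAIL", "issues": issues}


def sort_v1(items, field):
    return sorted(items, key=lambda x: x[field])            # breaks on a missing key


def sort_v2(items, field):
    return sorted(items, key=lambda x: x.get(field, ""))    # tolerates a missing key


cases: List[Case] = [
    (([{"n": "b"}, {"n": "a"}], "n"), [{"n": "a"}, {"n": "b"}]),
    (([{"n": "b"}, {}], "n"), [{}, {"n": "b"}]),
]

print(evaluate_with_tests(sort_v1, cases)["verdict"])  # FAIL
print(evaluate_with_tests(sort_v2, cases)["verdict"])  # PASS
```

Because the checker is plain code, the same input always yields the same verdict, and each issue names the failing input so the generator has something concrete to act on.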

3 The self-critique paradox

This is the counterintuitive part, and it changes when you should reach for the loop at all.

Snorkel AI's "Self-Critique Paradox" study found that on tasks where the generator already scored ~98%, adding a self-critique loop dropped accuracy to ~57% — the critic hallucinates flaws to justify its existence. The loop pays off when the generator is weak on the task (below ~35% baseline); on tasks it already solves reliably, skip the loop and return the first output.

So the decision rule inverts the intuition: don't add a critic because you want higher quality in general — add it only where the generator is demonstrably struggling and the feedback is actionable. On a strong baseline, the loop is not neutral overhead; it's actively harmful.
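The decision rule can be expressed as a small gate that runs before any loop is wired up. This is a sketch under stated assumptions: the function names are hypothetical, the 0.35 default comes from the figure quoted above, and the 0.90 "strong" threshold is an illustrative assumption, since the text only gives the ~98% data point. The middle band returns "compare" because the text does not say what happens there, so measuring with and without the loop seems the safer default.

```python
from typing import Sequence


def baseline_pass_rate(results: Sequence[bool]) -> float:
    """Fraction of single-shot generator outputs that passed the checker."""
    if not results:
        raise ValueError("need at least one baseline result")
    return sum(results) / len(results)


def loop_decision(
    pass_rate: float,
    weak_below: float = 0.35,     # from the text: the loop pays off below ~35%
    strong_above: float = 0.90,   # assumption: illustrative cutoff for "already reliable"
) -> str:
    if pass_rate < weak_below:
        return "loop"             # generator is struggling; feedback can help
    if pass_rate > strong_above:
        return "skip"             # return the first output, do not add a critic
    return "compare"              # unclear zone: measure with and without the loop


weak = baseline_pass_rate([True] * 2 + [False] * 8)      # 0.2
strong = baseline_pass_rate([True] * 49 + [False] * 1)   # 0.98
print(loop_decision(weak), loop_decision(strong))        # loop skip
```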

The five ways it degrades

Shared blind spots: the same model in both roles misses the same error class. Use a different model, a committee, or a deterministic checker.
Vague criteria: noisy PASS/FAIL, no convergence.
Non-actionable feedback: the generator gets no surface to act on and only produces cosmetic variation.
Single-correct-answer tasks: a lookup or pure computation gains nothing from iteration.
Already-high baseline: the self-critique paradox.

Each round costs roughly 2× a single generation, so an N-round loop costs ~2N×. If progress per round is marginal, redesign the feedback format rather than raising the cap.
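The cost arithmetic above can be made concrete with a tiny helper. This is a rough sketch: `loop_cost_multiplier` and its `eval_cost_ratio` parameter are hypothetical, and the default assumes an evaluator call costs about as much as a generation, which is what yields the 2× per round figure.

```python
def loop_cost_multiplier(rounds: int, eval_cost_ratio: float = 1.0) -> float:
    """Cost of the loop relative to one single-shot generation.

    eval_cost_ratio is the evaluator's cost as a fraction of one generation:
    1.0 models an LLM judge of similar size, 0.0 a near-free deterministic checker.
    """
    if rounds < 0 or eval_cost_ratio < 0:
        raise ValueError("rounds and eval_cost_ratio must be non-negative")
    return rounds * (1.0 + eval_cost_ratio)


print(loop_cost_multiplier(3))        # 6.0
print(loop_cost_multiplier(3, 0.0))   # 3.0
```

Under these assumptions, a 3-round loop with an LLM judge costs about six single generations, while a near-free deterministic checker roughly halves that, which is one more reason to prefer a checker when the bar allows it.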

↪ Your win: loop on a checkable bar, only when it helps

Make criteria machine-checkable — tests, lint, spec satisfaction; prefer a deterministic checker over a judge.
Return structured feedback — a verdict plus specific issues the generator can act on.
Always cap the rounds — the reference loop is unbounded; hitting your cap signals bad criteria.
Use a different model when blind spots are shared — or a committee, or a deterministic check.
Skip the loop on a strong baseline — self-critique can drop a 98% generator to 57%.

Retrieval practice — recall, don't peek

Question 1: An evaluator-optimizer loop should terminate on…

Question 2: Evaluation criteria must be…

Question 3: The self-critique paradox shows a critic loop can…

Question 4: Shared blind spots are best fixed by…

Question 5 · spaced recall from Lesson 12: In orchestrator-worker, synthesis is…

Ask me anything. Want a generator-critic loop wired to your test suite, or help deciding whether your task's baseline is low enough to benefit? Next, Part 7 opens with Reversibility & Idempotency — designing every agent action to be undone or safely re-run.