Harness Engineering · ~7 min
A generator produces, a critic judges, feedback recycles until a bar is met. The loop only works when the bar is machine-checkable — and only helps when the generator was weak to begin with.
The evaluator-optimizer pattern loops two roles: a generator produces output and revisions, an evaluator applies criteria and returns structured feedback, and a termination condition ends the loop when the evaluator returns PASS. For coding it maps cleanly: generate code → run tests → failures feed back → repeat.
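The generate → run tests → feed failures back cycle can be sketched as follows. This is a minimal illustration, not a reference implementation: `Evaluation`, `run_loop`, `toy_generate`, and `toy_evaluate` are hypothetical names, and the two toy functions stand in for a real model call and a real test runner.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Evaluation:
    """Structured verdict: a PASS/FAIL flag plus specific issues."""
    passed: bool
    issues: List[str] = field(default_factory=list)


def run_loop(
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Evaluation],
    task: str,
    max_rounds: int = 3,
) -> Tuple[str, Evaluation, int]:
    """Generate, evaluate, and feed issues back until PASS or the round cap."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    feedback: List[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = generate(task, feedback)
        result = evaluate(candidate)
        if result.passed:
            break
        feedback = result.issues  # only the specific issues go back
    return candidate, result, round_no


def toy_generate(task: str, feedback: List[str]) -> str:
    # Stand-in for a model call: it repairs the bug once the evaluator names it.
    if any("add(2, 3)" in issue for issue in feedback):
        return "def add(a, b): return a + b"
    return "def add(a, b): return a - b"


def toy_evaluate(candidate: str) -> Evaluation:
    # Stand-in for "run tests": executes the candidate and checks one case.
    # A real harness would run untrusted code in a sandbox.
    namespace: dict = {}
    exec(candidate, namespace)
    got = namespace["add"](2, 3)
    if got == 5:
        return Evaluation(passed=True)
    return Evaluation(passed=False, issues=[f"add(2, 3) returned {got}, expected 5"])


code, result, rounds = run_loop(toy_generate, toy_evaluate, "write add(a, b)")
print(result.passed, rounds)
```

The loop returns the last candidate together with its verdict, so a caller can tell a PASS apart from a run that simply hit the cap.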
The evaluator must return structured output — a verdict plus specific issues — so the generator acts on precise feedback, not parsed prose. And every loop needs a hard stop.
A loop written as "while True:" that exits only on PASS never stops on its own, so production callers must impose their own round cap. A starting limit of 3 is common. Without one, conflicting assumptions between the two roles run the loop to budget exhaustion. Hitting the cap isn't a failure; it's a signal that the criteria or the generator need adjustment.
The evaluator can be the same model with a different system prompt (lower cost, but it may inherit the generator's blind spots) or a different model, for a fully independent perspective. Either way, criteria must be explicit and machine-checkable: tests pass, lint is clean, the spec is satisfied. Vague criteria ("is it high quality?") make PASS/FAIL noisy, and the loop runs past the point of improvement, burning tokens without converging. When the bar can be a deterministic checker, prefer it over an LLM judge.
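As an illustration of a deterministic checker used as the evaluator, the sketch below applies a fixed list of named, machine-checkable criteria and returns a verdict plus the names of the failed checks. Everything here is an assumption made for the example: the `slugify` task, the two checks, and the dictionary shape of the verdict. `ast.parse` is only a crude stand-in for a lint step.

```python
import ast
from typing import Callable, Dict, List, Tuple

Check = Tuple[str, Callable[[str], bool]]


def parses(source: str) -> bool:
    """Crude stand-in for a lint step: the source must be valid Python."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


def passes_cases(source: str) -> bool:
    """Run fixed test cases against the candidate's slugify function."""
    namespace: Dict[str, object] = {}
    try:
        # A real harness would execute untrusted code in a sandbox.
        exec(source, namespace)
        slugify = namespace["slugify"]
        return (
            slugify("Hello World") == "hello-world"
            and slugify("  A  b ") == "a-b"
        )
    except Exception:
        return False


CHECKS: List[Check] = [
    ("parses as Python", parses),
    ("slugify test cases pass", passes_cases),
]


def evaluate(source: str) -> dict:
    """Return a structured verdict: PASS/FAIL plus the failed criteria."""
    failed = [name for name, check in CHECKS if not check(source)]
    return {"verdict": "PASS" if not failed else "FAIL", "issues": failed}


good = 'def slugify(s):\n    return "-".join(s.lower().split())\n'
print(evaluate(good))
```

Because every criterion is a plain function returning a boolean, the same candidate always gets the same verdict, which is what keeps PASS/FAIL from being noisy.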
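The self-critique paradox is the counterintuitive part: a critic loop helps a weak generator but can degrade output from an already strong one, and that changes when you should reach for the loop at all.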
So the decision rule inverts the intuition: don't add a critic because you want higher quality in general — add it only where the generator is demonstrably struggling and the feedback is actionable. On a strong baseline, the loop is not neutral overhead; it's actively harmful.
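One way to make this decision rule concrete is to measure the single-shot baseline first and gate the loop on it. The sketch below is only an illustration: `pass_rate`, `should_add_critic`, and especially the `struggle_threshold` default of 0.8 are hypothetical and not taken from the lesson; the cutoff would need tuning per task.

```python
from typing import List


def pass_rate(outcomes: List[bool]) -> float:
    """Fraction of single-shot generations that met the bar."""
    if not outcomes:
        raise ValueError("need at least one measured outcome")
    return sum(outcomes) / len(outcomes)


def should_add_critic(
    baseline_outcomes: List[bool],
    feedback_is_actionable: bool,
    struggle_threshold: float = 0.8,  # hypothetical cutoff; tune per task
) -> bool:
    """Add the loop only when the generator struggles AND feedback is usable."""
    struggling = pass_rate(baseline_outcomes) < struggle_threshold
    return struggling and feedback_is_actionable


# A generator that passes 2 of 4 baseline tasks, with actionable feedback.
print(should_add_critic([True, False, False, True], feedback_is_actionable=True))
```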
Shared blind spots: the same model in both roles misses the same error class. Use a different model, a committee, or a deterministic checker.
Vague criteria: noisy PASS/FAIL, no convergence.
Non-actionable feedback: the generator gets no surface to act on and only produces cosmetic variation.
Single-correct-answer tasks: a lookup or pure computation gains nothing from iteration.
Already-high baseline: the self-critique paradox.
Each round costs roughly 2× a single generation, so an N-round loop costs ~2N×. Marginal progress per round means redesign the feedback format, not raise the cap.
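The cost arithmetic and the "marginal progress" signal can be sketched as two small helpers. These are illustrative assumptions, not part of the lesson: `loop_cost` treats an evaluation as costing about the same as a generation (which is where the 2× comes from), and `is_stalling` uses the number of open issues per round as a hypothetical proxy for progress.

```python
from typing import List


def loop_cost(rounds: int, generation_cost: float) -> float:
    """Approximate cost of an N-round loop: one generation plus one
    evaluation per round, assumed to cost about the same (~2N x)."""
    return 2 * rounds * generation_cost


def is_stalling(issue_counts: List[int], window: int = 2) -> bool:
    """True when the issue count has not dropped over the last `window` rounds."""
    if len(issue_counts) <= window:
        return False  # not enough history to judge
    recent = issue_counts[-(window + 1):]
    return min(recent[1:]) >= recent[0]


# Three rounds at 1,000 tokens per generation.
print(loop_cost(3, 1000))
# Issues per round: progress in round 2, then none for two rounds.
print(is_stalling([5, 3, 3, 3]))
```

A stall detected this way is a cue to rework the feedback format rather than to raise the round cap.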
Retrieval practice — recall, don't peek
Question 1: An evaluator-optimizer loop should terminate on…
Question 2: Evaluation criteria must be…
Question 3: The self-critique paradox shows a critic loop can…
Question 4: Shared blind spots are best fixed by…
Question 5 · spaced recall from Lesson 12: In orchestrator-worker, synthesis is…