Verifying Agent Work · ~8 min
When no test or type checker reaches a claim, the model can verify itself, but only one variant works, and naive self-correction overturns correct code roughly one time in four.
Chain-of-Verification (CoVe) is a four-step self-correction loop — draft, plan verification questions, answer each independently, revise. For coding agents it is conditional: it pays off only in the factored variant, and only over claims no external oracle covers.
| Variant | What it does | Result |
|---|---|---|
| Joint | One prompt drafts and verifies | Verifier attends to the draft's hallucinations and repeats them |
| Two-step | Questions planned with draft visible, answered together | Better than joint, still anchored |
| Factored | Each question answered in its own prompt, no draft | Best across all tasks evaluated |
| Factor+revise | Factored plus a separate revision step | Highest precision on longform |
The mechanism is anti-anchoring. If the draft contains a fabricated `df.write_to_csv()` call, a verifier that sees the draft attends to that token sequence and re-emits it. Answering each question in a separate prompt that excludes the draft forces independent recall.
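The four steps and the anti-anchoring constraint can be sketched as follows. This is a minimal illustration under stated assumptions, not a reference implementation: `LLM` is a hypothetical callable that maps a prompt string to a completion, and the prompt wording is invented for the example. The property that matters is that the step-3 prompts never contain the draft.

```python
from typing import Callable, List

# Hypothetical model interface (an assumption): prompt in, completion text out.
LLM = Callable[[str], str]


def factored_cove(task: str, llm: LLM) -> str:
    """Draft, plan questions, answer each one without the draft, then revise."""
    # Step 1: draft a solution.
    draft = llm(f"Task: {task}\nWrite a solution.")

    # Step 2: plan verification questions. The planner sees the draft.
    plan = llm(
        "List one verification question per line for the claims in this draft.\n"
        f"Draft:\n{draft}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: answer each question in its own prompt.
    # The draft is deliberately excluded so the answer cannot anchor on it.
    answers = [llm(f"Answer from the documentation or your own knowledge:\n{q}")
               for q in questions]

    # Step 4: revise the draft against the independently recalled answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return llm(
        f"Task: {task}\nDraft:\n{draft}\nVerification results:\n{evidence}\n"
        "Rewrite the draft so it is consistent with the verification results."
    )
```

One caveat with this sketch: a planned question can itself quote the fabricated call, so in practice it may help to phrase questions open-endedly ("Which method writes a DataFrame to CSV?") rather than as yes/no checks on the draft's wording.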
| Claim in draft | Check | CoVe? |
|---|---|---|
| Imports, symbol existence | LSP, type checker, phantom-symbol detection | Skip |
| Signatures, types, behavior | Type checker, compile, test suite | Skip |
| API surface for unfamiliar library | Factored CoVe against docs | Use |
| Citation, version, fact in commentary | Factored CoVe | Use |
| Cross-file refactor consistency | Factored CoVe over the change set | Use |
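The routing table above can be read as a small dispatch function. The sketch below is illustrative only: the claim-kind labels and the `route` helper are hypothetical names chosen for this example, and a real agent would need its own way of classifying claims.

```python
from enum import Enum


class Check(Enum):
    ORACLE = "external oracle (LSP, type checker, compiler, tests)"
    FACTORED_COVE = "factored CoVe"


# Claim kinds mirror the rows of the routing table; the labels are hypothetical.
ORACLE_COVERED = {"import", "symbol_existence", "signature", "type", "behavior"}
COVE_CANDIDATES = {
    "unfamiliar_api_surface",
    "citation",
    "version",
    "commentary_fact",
    "cross_file_consistency",
}


def route(claim_kind: str) -> Check:
    """Send a claim to an external oracle when one covers it, else to factored CoVe."""
    if claim_kind in ORACLE_COVERED:
        return Check.ORACLE  # CoVe is skipped: a stronger check already exists
    if claim_kind in COVE_CANDIDATES:
        return Check.FACTORED_COVE
    raise ValueError(f"unclassified claim kind: {claim_kind!r}")
```

Raising on unknown claim kinds, rather than defaulting to CoVe, is a design choice of this sketch: it keeps self-verification from silently spreading to claims nobody has classified.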
This routing is consistent with ConVerTest, which integrates factored CoVe with external test execution and reports +39% test validity, +28% line coverage, and +18% mutation score over baselines on BigCodeBench and LBPP. The gains come from CoVe paired with an oracle, not from CoVe alone.
Liu et al. (2024) report that 21.9% of correct GPT-4o solutions and 28.3% of correct GPT-3.5 solutions are flipped to wrong under intrinsic self-correction prompts; the failure modes they name are answer wavering, prompt bias, and human-like cognitive bias. Huang et al. (2023) found that LLMs cannot reliably self-correct reasoning errors without external feedback.
Three rules follow. Never use the joint or two-step variant on the same draft, because the verifier repeats the hallucination. Never add CoVe where strong oracles already cover the claim. And beware strong-hypothesis debugging: once the agent commits to "it's a null-pointer bug," verification questions posed inside that hypothesis tend to re-confirm it.
Retrieval practice — recall, don't peek
Question 1: The only CoVe variant that reliably helps coding agents is…
Question 2: The factored variant works because answering without the draft…
Question 3: A claim about whether an import symbol exists should be routed to…
Question 4: Naive intrinsic self-correction on code was measured to…
Question 5 · spaced recall from Lesson 4: Incremental verification helps only when the verifier is…