Part 2 · Verifying As You Build

Verifying Agent Work · ~8 min

Chain-of-Verification

When no test or type checker reaches a claim, the model can verify itself, but only one variant works, and used naively, self-correction overturns correct code roughly one time in four.

Why this, for you: some claims have no oracle — an unfamiliar library's API surface, a citation, a cross-file refactor's consistency. This is the technique for those, and the discipline for knowing when not to reach for it. Get the variant wrong and you make the agent worse.

Chain-of-Verification (CoVe) is a four-step self-correction loop — draft, plan verification questions, answer each independently, revise. For coding agents it is conditional: it pays off only in the factored variant, and only over claims no external oracle covers.

1 Four variants — only one matters

| Variant | What it does | Result |
|---|---|---|
| Joint | One prompt drafts and verifies | Verifier attends to the draft's hallucinations and repeats them |
| Two-step | Questions planned with draft visible, answered together | Better than joint, still anchored |
| Factored | Each question answered in its own prompt, no draft | Best across all tasks evaluated |
| Factor+revise | Factored plus a separate revision step | Highest precision on longform |

The mechanism is anti-anchoring. If the draft contains a fabricated df.write_to_csv() call, a verifier that sees the draft attends to that token sequence and re-emits it. Answering each question in a separate prompt that excludes the draft forces independent recall.

# factored CoVe: ask without the draft in context
Q: "What is the method to write a polars DataFrame to CSV?"
A: write_csv   # retrieved from API knowledge, not continued from the draft
# draft said df.write_to_csv() → mismatch surfaces → agent revises
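The four steps can be sketched in Python as below. This is a minimal illustration, not a reference implementation: `ask` is a hypothetical callable standing in for whatever model client you use, and the prompt wording is an assumption. The point it shows is that the step-3 prompts contain only the question, never the draft.

```python
from typing import Callable, List

# Hypothetical model interface: takes a prompt string, returns the reply text.
Ask = Callable[[str], str]


def factored_cove(task: str, ask: Ask) -> str:
    """Draft, plan questions, answer each without the draft, then revise."""
    # Step 1: draft.
    draft = ask(f"Task: {task}\nWrite a solution.")

    # Step 2: plan verification questions (the draft is visible here).
    plan = ask(
        "List one verification question per line for the factual claims "
        f"in this draft:\n{draft}"
    )
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: answer each question in its own prompt. The draft is
    # deliberately left out so the answer cannot be continued from it.
    answers = [ask(f"Answer concisely: {q}") for q in questions]

    # Step 4: revise the draft against the independent answers.
    findings = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(questions, answers))
    return ask(
        f"Draft:\n{draft}\n\nIndependent findings:\n{findings}\n\n"
        "Rewrite the draft so it agrees with the findings."
    )
```

Passing a stub for `ask` that records its prompts is a cheap way to confirm the draft never leaks into step 3 before wiring in a real model.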

2 Route claims by class

Skip CoVe wherever a deterministic check is stronger; use it only where none exists. The discipline is to classify each draft claim and send it to the cheapest reliable check.

| Claim in draft | Check | CoVe? |
|---|---|---|
| Imports, symbol existence | LSP, type checker, phantom-symbol detection | Skip |
| Signatures, types, behavior | Type checker, compile, test suite | Skip |
| API surface for unfamiliar library | Factored CoVe against docs | Use |
| Citation, version, fact in commentary | Factored CoVe | Use |
| Cross-file refactor consistency | Factored CoVe over the change set | Use |
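The routing table can be encoded directly as a lookup. The sketch below is one possible shape: the `ClaimClass` enum, `ROUTES`, and `route` are hypothetical names, and it assumes each claim has already been assigned a class upstream (by a classifier prompt or by hand).

```python
from enum import Enum
from typing import Dict, Tuple


class ClaimClass(Enum):
    SYMBOL = "imports, symbol existence"
    SIGNATURE = "signatures, types, behavior"
    UNFAMILIAR_API = "API surface for unfamiliar library"
    COMMENTARY_FACT = "citation, version, fact in commentary"
    CROSS_FILE = "cross-file refactor consistency"


# Mirrors the table: claim class -> (check to run, is it a CoVe check?).
ROUTES: Dict[ClaimClass, Tuple[str, bool]] = {
    ClaimClass.SYMBOL: ("LSP, type checker, phantom-symbol detection", False),
    ClaimClass.SIGNATURE: ("type checker, compile, test suite", False),
    ClaimClass.UNFAMILIAR_API: ("factored CoVe against docs", True),
    ClaimClass.COMMENTARY_FACT: ("factored CoVe", True),
    ClaimClass.CROSS_FILE: ("factored CoVe over the change set", True),
}


def route(claim_class: ClaimClass) -> Tuple[str, bool]:
    """Return the check for a claim class and whether that check is CoVe."""
    return ROUTES[claim_class]


if __name__ == "__main__":
    for cls in ClaimClass:
        check, use_cove = route(cls)
        print(f"{cls.value}: {check} (CoVe: {'use' if use_cove else 'skip'})")
```

Keeping the table as data rather than branching logic makes it easy to audit that no class with a deterministic oracle is ever sent to CoVe.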

This matches ConVerTest, which integrates factored CoVe with external test execution and reports +39% test validity, +28% line coverage, +18% mutation scores over baselines on BigCodeBench and LBPP. The gains come from CoVe paired with an oracle — not CoVe alone.

3 The naive version makes code worse

Self-correction overturns correct code 22–28% of the time

Liu et al. (2024) report that 21.9% of correct GPT-4o solutions and 28.3% of correct GPT-3.5 solutions are flipped to wrong under intrinsic self-correction prompts, citing answer wavering, prompt bias, and human-like cognitive bias as causes. Huang et al. (2023) found that LLMs cannot reliably self-correct reasoning errors without external feedback. Three rules follow. Never use joint or two-step on the same draft (the verifier repeats the hallucination). Never add CoVe where strong oracles already cover the claim. And beware strong-hypothesis debugging: once the agent commits to "it's a null-pointer bug," verification posed inside that frame re-confirms it.
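A back-of-the-envelope calculation shows why a flip rate in this range is so costly. The sketch below is illustrative arithmetic only: the baseline accuracy and the fix rate are hypothetical inputs, and only the 21.9% flip rate comes from the figures quoted above.

```python
def accuracy_after_self_correction(
    base_acc: float, flip_rate: float, fix_rate: float
) -> float:
    """Correct answers that survive the pass, plus wrong answers it fixes."""
    return base_acc * (1 - flip_rate) + (1 - base_acc) * fix_rate


def break_even_fix_rate(base_acc: float, flip_rate: float) -> float:
    """Share of wrong answers the pass must fix just to keep accuracy flat.

    Derived by setting accuracy_after_self_correction equal to base_acc.
    """
    if not 0 <= base_acc < 1:
        raise ValueError("base_acc must be in [0, 1)")
    return base_acc * flip_rate / (1 - base_acc)


if __name__ == "__main__":
    # Hypothetical 80% baseline; 21.9% is the GPT-4o flip rate quoted above.
    needed = break_even_fix_rate(base_acc=0.8, flip_rate=0.219)
    print(f"break-even fix rate: {needed:.3f}")
```

With a hypothetical 80% baseline, the pass would have to repair roughly 88% of the wrong solutions just to break even, and the stronger the baseline, the higher that bar gets.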

↪ Your win: factored, and only where no oracle reaches

Retrieval practice — recall, don't peek

Question 1: The only CoVe variant that reliably helps coding agents is…

Question 2: The factored variant works because answering without the draft…

Question 3: A claim about whether an import symbol exists should be routed to…

Question 4: Naive intrinsic self-correction on code was measured to…

Question 5 · spaced recall from Lesson 4: Incremental verification helps only when the verifier is…

Ask me anything. Want a claim-classifier prompt that routes each draft assertion to its cheapest check, or a factored-CoVe step you can drop into a research or refactor pipeline? Next, Part 3 opens with Testing What It Decides — behavioral testing for non-deterministic agents.