Chain-of-Verification

When no test or type checker reaches a claim, the model can verify itself, but only one variant works, and used naively it overturns correct code roughly one time in four.

Why this, for you: some claims have no oracle — an unfamiliar library's API surface, a citation, a cross-file refactor's consistency. This is the technique for those, and the discipline for knowing when not to reach for it. Get the variant wrong and you make the agent worse.

Chain-of-Verification (CoVe) is a four-step self-correction loop — draft, plan verification questions, answer each independently, revise. For coding agents it is conditional: it pays off only in the factored variant, and only over claims no external oracle covers.

1 Four variants — only one matters

Variant	What it does	Result
Joint	One prompt drafts and verifies	Verifier attends to the draft's hallucinations and repeats them
Two-step	Questions planned with draft visible, answered together	Better than joint, still anchored
Factored	Each question answered in its own prompt, no draft	Best across all tasks evaluated
Factor+revise	Factored plus a separate revision step	Highest precision on longform

The mechanism is anti-anchoring. If the draft contains a fabricated df.write_to_csv() call, a verifier that sees the draft attends to that token sequence and re-emits it. Answering each question in a separate prompt that excludes the draft forces independent recall.

# factored CoVe: ask without the draft in context
Q: "What is the method to write a polars DataFrame to CSV?"
A: write_csv   # retrieved from API knowledge, not continued from the draft
# draft said df.write_to_csv() → mismatch surfaces → agent revises
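The four steps (draft, plan, answer independently, revise) can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: `ask` is a hypothetical callable standing in for one isolated model call, the `TASK:`/`PLAN:`/`ANSWER:`/`REVISE:` prompt prefixes are invented for this sketch, and `fake_model` is a canned stub so the example runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical interface: one isolated model call, prompt in, text out.
Ask = Callable[[str], str]


def factored_cove(task: str, ask: Ask) -> Dict[str, object]:
    """Run draft -> plan -> independent answers -> revise."""
    # Step 1: draft an answer to the task.
    draft = ask(f"TASK: {task}")

    # Step 2: plan verification questions. The planner reads the draft
    # to know which claims need checking.
    plan = ask(f"PLAN: list one verification question per line.\nDRAFT: {draft}")
    questions: List[str] = [q.strip() for q in plan.splitlines() if q.strip()]

    # Step 3: answer each question in its own call, with no draft in the prompt.
    answers = {q: ask(f"ANSWER: {q}") for q in questions}

    # Step 4: revise the draft against the independently recalled answers.
    evidence = "\n".join(f"{q} -> {a}" for q, a in answers.items())
    revised = ask(
        "REVISE: fix the draft where it conflicts with the evidence.\n"
        f"DRAFT: {draft}\nEVIDENCE:\n{evidence}"
    )
    return {"draft": draft, "questions": questions,
            "answers": answers, "revised": revised}


def fake_model(prompt: str) -> str:
    """Canned stand-in for a real model (assumption for this sketch)."""
    if prompt.startswith("TASK:"):
        return "df.write_to_csv('out.csv')"  # fabricated method in the draft
    if prompt.startswith("PLAN:"):
        return "What is the method to write a polars DataFrame to CSV?"
    if prompt.startswith("ANSWER:"):
        # The factored property: the draft never reaches this prompt.
        assert "write_to_csv" not in prompt
        return "write_csv"
    if prompt.startswith("REVISE:"):
        return "df.write_csv('out.csv')"
    return ""


result = factored_cove("Write a polars DataFrame to CSV", fake_model)
print(result["revised"])
```

The design point is in step 3: the answering prompt is built from the question alone, so a fabricated token sequence in the draft has no path into the verifier's context.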

2 Route claims by class

Skip CoVe wherever a deterministic check is stronger; use it only where none exists. The discipline is to classify each draft claim and send it to the cheapest reliable check.

Claim in draft	Check	CoVe?
Imports, symbol existence	LSP, type checker, phantom-symbol detection	Skip
Signatures, types, behavior	Type checker, compile, test suite	Skip
API surface for unfamiliar library	Factored CoVe against docs	Use
Citation, version, fact in commentary	Factored CoVe	Use
Cross-file refactor consistency	Factored CoVe over the change set	Use
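The routing table can be encoded as a lookup, sketched below. The `ClaimClass` enum and `route` function are hypothetical names introduced for illustration; classifying a draft claim into one of these classes is the hard part and is assumed to happen upstream.

```python
from enum import Enum
from typing import Dict, Tuple


class ClaimClass(Enum):
    """Claim classes from the routing table (names are illustrative)."""
    IMPORT_OR_SYMBOL = "imports, symbol existence"
    SIGNATURE_TYPE_BEHAVIOR = "signatures, types, behavior"
    UNFAMILIAR_API = "API surface for unfamiliar library"
    CITATION_OR_FACT = "citation, version, fact in commentary"
    CROSS_FILE_CONSISTENCY = "cross-file refactor consistency"


# Maps each class to (check to run, whether CoVe is used), copied from the table.
ROUTES: Dict[ClaimClass, Tuple[str, bool]] = {
    ClaimClass.IMPORT_OR_SYMBOL:
        ("LSP, type checker, phantom-symbol detection", False),
    ClaimClass.SIGNATURE_TYPE_BEHAVIOR:
        ("type checker, compile, test suite", False),
    ClaimClass.UNFAMILIAR_API:
        ("factored CoVe against docs", True),
    ClaimClass.CITATION_OR_FACT:
        ("factored CoVe", True),
    ClaimClass.CROSS_FILE_CONSISTENCY:
        ("factored CoVe over the change set", True),
}


def route(claim_class: ClaimClass) -> Tuple[str, bool]:
    """Return the check for a claim class and whether it involves CoVe."""
    return ROUTES[claim_class]


check, use_cove = route(ClaimClass.IMPORT_OR_SYMBOL)
print(check, use_cove)
```

Keeping the table as data rather than branching logic makes it easy to audit that no class with a deterministic check is ever sent to CoVe.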

This matches ConVerTest, which integrates factored CoVe with external test execution and reports +39% test validity, +28% line coverage, +18% mutation scores over baselines on BigCodeBench and LBPP. The gains come from CoVe paired with an oracle — not CoVe alone.

3 The naive version makes code worse

Self-correction overturns correct code 22–28% of the time

Liu et al. (2024) report that 21.9% of correct GPT-4o solutions and 28.3% of correct GPT-3.5 solutions are flipped to wrong under intrinsic self-correction prompts, with answer wavering, prompt bias, and human-like cognitive bias as the observed failure modes. Huang et al. (2023) found that LLMs cannot reliably self-correct reasoning errors without external feedback. Three rules follow. Never use joint or two-step on the same draft, because the verifier repeats the hallucination. Never add CoVe where strong oracles already cover the claim. And beware strong-hypothesis debugging: once the agent commits to "it's a null-pointer bug," verification posed inside that hypothesis re-confirms it.
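One way to act on this is to gate any self-revision behind an external check, sketched below. This is an assumption for illustration, not a procedure taken from the cited papers: `accept_revision` and `passes_tests` are hypothetical names, and the oracle here is a toy test that executes the code and calls a function named `add`.

```python
from typing import Callable, Optional

Oracle = Callable[[str], bool]  # hypothetical: True when the code passes the check


def accept_revision(draft: str, revised: str, oracle: Optional[Oracle]) -> str:
    """Pick the draft or the revision without overturning a passing draft."""
    if oracle is None:
        # No external check reaches this claim: the revision is all we have.
        return revised
    if oracle(draft):
        # A strong check already passes the draft: do not self-correct it.
        return draft
    # The draft fails: take the revision only if the oracle confirms it.
    return revised if oracle(revised) else draft


def passes_tests(source: str) -> bool:
    """Toy oracle: run the source and check add(2, 3) == 5."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


correct = "def add(a, b):\n    return a + b\n"
flipped = "def add(a, b):\n    return a - b\n"  # a naive "correction" that breaks it

kept = accept_revision(correct, flipped, passes_tests)
print(kept == correct)
```

With the gate in place, a revision that flips correct code is discarded because the draft already satisfies the oracle. When both versions fail, the sketch keeps the draft, since nothing shows the revision is better.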

↪ Your win: factored, and only where no oracle reaches

Factored variant only — answer each question without the draft in context; that's the win.
Route by claim class — deterministic checks for symbols/types/behavior; CoVe for API recall, citations, cross-file consistency.
Pair it with an oracle — the measured gains come from CoVe plus external tests, not alone.
Never joint or two-step on code — the verifier re-emits the draft's hallucination.
Naive self-correction flips correct code 22–28% — one layer in a stack, not the strategy.

Retrieval practice — recall, don't peek

Question 1: The only CoVe variant that reliably helps coding agents is…

Question 2: The factored variant works because answering without the draft…

Question 3: A claim about whether an import symbol exists should be routed to…

Question 4: Naive intrinsic self-correction on code was measured to…

Question 5 · spaced recall from Lesson 4: Incremental verification helps only when the verifier is…

Next, Part 3 opens with Testing What It Decides: behavioral testing for non-deterministic agents.