The Yes-Man Agent

It does exactly what you asked. Every time. That is not a feature: it is a failure mode that ships errors at machine speed.

Why this, for you: Part 1 was about context; Part 2 is about behavior. A yes-man agent never tells you what's wrong because it was never told to look. The fix is three lines of instruction you can paste into any agent definition today.

The yes-man agent does what it's told. Each response looks correct at a glance, but subtle problems accumulate — broken conventions, violated constraints, introduced vulnerabilities. It never flags them, because flagging was never in scope.

1 What it looks like

A content-writing agent's whole prompt is: "research the topic, write a markdown page, open a PR." Asked to "write a page on rate limiting," it produces one, commits, and opens the PR — even though docs/techniques/rate-limiting.md already exists. No pre-task check was specified, so the agent never looked.

An agent without verification and pushback instructions executes every request without flagging problems. The happy-path prompt specifies nothing about what to check or when to stop.

2 Why it happens

Agents are trained to be helpful, and helpfulness correlates with compliance. Human raters favour responses that agree with them, and reinforcement learning from human feedback (RLHF) amplifies that preference into a structural bias toward compliance over correction. Task-oriented prompts ("research, write, open a PR") describe the happy path and say nothing about validation, pauses, or stop conditions.

3 The fix

Add three categories of instruction to the agent definition — pre-task checks, in-task validation, and explicit stop conditions:

Pre-task: check whether a page on this topic already exists under docs/. If one exists and is complete, comment on the issue and stop; do not create a duplicate.
In-task: before committing, verify that the file path is unique, the frontmatter has title/description/tags, and no heading levels are skipped.
Stop: if you cannot tell whether a duplicate exists, stop and report the ambiguity rather than guessing.
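
The same three gates can also be enforced in the harness around the agent rather than only in its prompt. The following is a minimal sketch under stated assumptions: StopTask, pre_task_check and in_task_validation are hypothetical names, the page is assumed to be named after a topic slug, and frontmatter is assumed to be a leading block between '---' lines. It does not skip headings inside fenced code blocks, so treat it as an illustration, not a complete linter.

```python
import re
from pathlib import Path


class StopTask(Exception):
    """Hypothetical signal: the agent should halt and report, not proceed."""


def pre_task_check(docs_root: Path, slug: str) -> None:
    """Pre-task gate: stop if a page for this topic already exists."""
    if not docs_root.is_dir():
        # Stop condition: we cannot tell whether a duplicate exists.
        raise StopTask(f"cannot inspect {docs_root}; duplicate status unknown")
    matches = sorted(docs_root.rglob(f"{slug}.md"))
    if matches:
        raise StopTask(f"page already exists: {matches[0].as_posix()}")


def in_task_validation(markdown: str) -> list[str]:
    """In-task gate: return a list of problems; empty means safe to commit."""
    problems = []
    # Assumption: frontmatter is a leading block delimited by '---' lines.
    match = re.match(r"\A---\n(.*?)\n---\n", markdown, re.DOTALL)
    frontmatter = match.group(1) if match else ""
    keys = {
        line.split(":", 1)[0].strip()
        for line in frontmatter.splitlines()
        if ":" in line
    }
    for required in ("title", "description", "tags"):
        if required not in keys:
            problems.append(f"frontmatter missing '{required}'")
    # Heading levels must not jump by more than one (e.g. '#' then '###').
    body = markdown[match.end():] if match else markdown
    previous = 0
    for line in body.splitlines():
        heading = re.match(r"(#{1,6})\s", line)
        if heading:
            level = len(heading.group(1))
            if previous and level > previous + 1:
                problems.append(f"heading level skipped before: {line.strip()}")
            previous = level
    return problems
```

In this sketch a raised StopTask maps to "comment on the issue and stop", and a non-empty problem list blocks the commit until the page is fixed.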

Two structural multipliers: spawn a separate reviewer agent (an agent reviewing its own work shares the blind spots that produced it), and add a mandatory concerns or risks field to structured output. An agent that must populate that field will evaluate; one without it will not.
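
One way to make the concerns field genuinely mandatory is to reject any output that omits it, rather than treating a missing field as "no concerns". This is a sketch only: AgentResult and parse_result are hypothetical names, and the payload shape (a dict with summary and concerns keys) is an assumption, not a schema from any particular framework.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    """Hypothetical structured output; field names are illustrative."""
    summary: str
    concerns: list[str]  # no default, so it cannot be silently omitted


def parse_result(payload: dict) -> AgentResult:
    """Reject output that omits 'concerns' instead of assuming there are none."""
    if "concerns" not in payload:
        raise ValueError("output rejected: 'concerns' field is required")
    concerns = payload["concerns"]
    if not isinstance(concerns, list) or not all(
        isinstance(c, str) for c in concerns
    ):
        raise ValueError("output rejected: 'concerns' must be a list of strings")
    return AgentResult(summary=str(payload.get("summary", "")), concerns=concerns)
```

An explicit empty list still passes here. The design choice is that "nothing to report" has to be stated by the agent, not inferred from silence.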

The opposite failure: the cry-wolf agent

Over-specify stop conditions and you get an agent that flags every minor issue and theoretical risk, producing output that reviewers learn to ignore. Yes-man and cry-wolf are opposite poles; calibrate stop conditions to genuine blockers, not every deviation. Note the ceiling, too: verification prompts reduce sycophancy but do not eliminate it, because the bias comes from training, not scaffolding. Treat prompts as a floor-raiser, not a fix.
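
Calibration can be made explicit by giving each finding a severity and halting only at a chosen threshold. The sketch below assumes a three-level scale; Severity, Finding and triage are hypothetical names, and where the threshold belongs is a judgment call for each project.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    NOTE = 1     # style nits, theoretical risks
    WARNING = 2  # worth mentioning in the PR, not worth halting
    BLOCKER = 3  # proceeding would produce wrong or harmful output


@dataclass
class Finding:
    message: str
    severity: Severity


def triage(
    findings: list[Finding], stop_at: Severity = Severity.BLOCKER
) -> tuple[list[Finding], list[Finding]]:
    """Split findings into those that halt the task and those only reported."""
    stops = [f for f in findings if f.severity >= stop_at]
    reports = [f for f in findings if f.severity < stop_at]
    return stops, reports
```

Lowering stop_at to Severity.NOTE reproduces the cry-wolf agent: every finding halts the task. Dropping the check entirely reproduces the yes-man.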

↪ Your win: build in the pushback

Three gate points: pre-task checks, in-task validation, explicit stop conditions.
Separate reviewer agent — self-review shares the implementer's blind spots.
A mandatory concerns field forces the evaluation you want.
Calibrate to real blockers — over-flagging produces the ignored cry-wolf agent.
Prompts raise the floor, not the ceiling — sycophancy is rooted in RLHF.

Retrieval practice — recall, don't peek

Question 1: A yes-man agent fails to flag problems because…

Question 2: The bias toward compliance is amplified by…

Question 3: The three gate points to add are…

Question 4: Over-specified stop conditions produce…

Question 5 · spaced recall from Lesson 03: Distractor interference means a related rule…

Ask me anything. Want the reviewer/implementer split spelled out, or how a required concerns field changes structured output? Next in Part 2: Objective Drift — when the agent keeps working, productively, on the wrong goal.