Write and Hope

An agent that only reads code and test output is flying blind. Wire in signals it can see, and it stops guessing whether the fix worked.

Why this, for you: the single biggest reliability upgrade to a coding agent is not a smarter model — it's giving the model a way to observe the system it just changed. This lesson is the frame for the whole course: every later technique is about making something the agent did, or something the system does, legible.

An agent writes code, runs tests, reads the output. That is the entire loop for most setups. Three things it cannot do without help: confirm a UI renders, query whether a fix actually dropped the error rate, or search logs for the pattern a user reported. Missing those, it operates in "write and hope" mode.

1 Three signal categories

Wiring observability into an agent's context turns "write and hope" into "write, observe, verify." The signals fall into three buckets, each closing a different blind spot.

Signal	Source	What it answers
Visual	Browser automation	Did the UI render and behave?
Log	Structured, filterable log entries	What error pattern fired, and how often?
Metric	Counters and latencies	Did error rate or p99 actually move after the change?

Accessibility snapshots beat screenshots for functional checks. A snapshot returns structured text — roles, names, states — that the model reasons about directly, with no vision model and a fraction of the tokens. Reserve screenshots for layout bugs.
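To illustrate why a structured snapshot is cheap to check, here is a minimal sketch that searches a snapshot tree for a node by role and name. The nested-dict shape (`role`, `name`, `children`) and the helpers `find_nodes` and `heading_rendered` are assumptions made for this example; real browser-automation tools return their own snapshot formats.

```python
from typing import Iterator, Optional

# Hypothetical snapshot shape: nested dicts with "role", "name", "children".
Snapshot = dict


def find_nodes(node: Snapshot, role: str, name: Optional[str] = None) -> Iterator[Snapshot]:
    """Yield every node in the tree matching the role (and the name, if given)."""
    if node.get("role") == role and (name is None or node.get("name") == name):
        yield node
    for child in node.get("children", []):
        yield from find_nodes(child, role, name)


def heading_rendered(snapshot: Snapshot, text: str) -> bool:
    """Functional check: did a heading with this accessible name render?"""
    return next(find_nodes(snapshot, "heading", text), None) is not None


# Illustrative snapshot, not output from a real tool.
example_snapshot = {
    "role": "document",
    "name": "App",
    "children": [
        {"role": "heading", "name": "Dashboard"},
        {"role": "button", "name": "Log out", "children": []},
    ],
}

print(heading_rendered(example_snapshot, "Dashboard"))  # True
print(heading_rendered(example_snapshot, "Sign in"))    # False
```

The check is a plain tree walk over text, which is why it needs no vision model: the agent can assert on roles and names directly.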

2 Climb the verification ladder

Not every check costs the same. Start with the cheapest signal that can answer the question and only climb when it can't.

    # cheapest signal first, climb only when it can't answer
    python -c     # unit-level assertion on behavior
    curl          # API-level endpoint check
    playwright    # browser-level interaction + a11y snapshot
    screenshot    # vision check: slowest, last resort

An agent fixing a login bug works upward through its signals, cheapest first: logs name the failing call, tests confirm the code fix, a browser snapshot shows the dashboard heading appeared, and a metric query shows errors fell from 312 to 3. The loop closed because each layer was legible.
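The cheapest-first rule can be sketched as a small runner that stops at the first rung able to answer. This is one possible shape, not a prescribed implementation: the `Rung` type, the `climb` function, and the placeholder lambdas are all hypothetical stand-ins for real unit, API, and browser checks.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Rung:
    name: str
    # Returns True/False when the check can answer, None when it cannot.
    check: Callable[[], Optional[bool]]


def climb(rungs: Sequence[Rung]) -> Tuple[Optional[str], Optional[bool]]:
    """Run rungs cheapest-first; stop at the first one that gives an answer."""
    for rung in rungs:
        verdict = rung.check()
        if verdict is not None:
            return rung.name, verdict
    return None, None  # no rung could answer: report unknown, not success


# Placeholder checks standing in for real commands (assumed for illustration).
ladder = [
    Rung("unit", lambda: None),        # a unit assertion cannot see the UI
    Rung("api", lambda: None),         # neither can an endpoint check
    Rung("browser", lambda: True),     # the a11y snapshot finds the heading
    Rung("screenshot", lambda: True),  # never reached
]

print(climb(ladder))  # ('browser', True)
```

Returning `None` for "cannot answer" keeps an inconclusive cheap check from being mistaken for a pass, and the expensive rungs run only when the cheaper ones have nothing to say.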

3 Keep the payloads lean

Observability data is high-volume: a careless log query returns thousands of entries and floods the context window. The discipline is just-in-time (JIT) references: store a query string and time range, and load the payload only when you need to verify.

    # store a reference, not the data
    log_query = "service:auth level:error @timestamp:[now-1h TO now]"
    # run it only at verification time
    datadog_log_search(log_query)  # → 3 results (was 47 before the fix)
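The reference-then-resolve pattern might look like the following in Python. It is a sketch under stated assumptions: `LogQueryRef`, `resolve`, and `fake_search` are hypothetical names, the time range is held in separate fields rather than inside the query string, and the injected `search` callable stands in for whatever log-search tool the agent actually has.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class LogQueryRef:
    """A small, storable pointer to log data rather than the data itself."""
    query: str
    time_from: str
    time_to: str


def resolve(ref: LogQueryRef,
            search: Callable[[str, str, str], List[dict]],
            limit: int = 20) -> dict:
    """Run the query on demand and return a compact summary, not the full payload."""
    entries = search(ref.query, ref.time_from, ref.time_to)
    return {"count": len(entries), "sample": entries[:limit]}


# Hypothetical stand-in for a real log-search tool; returns made-up entries.
def fake_search(query: str, time_from: str, time_to: str) -> List[dict]:
    return [{"message": f"auth error {i}"} for i in range(3)]


ref = LogQueryRef("service:auth level:error", "now-1h", "now")
summary = resolve(ref, fake_search, limit=2)
print(summary["count"], len(summary["sample"]))  # 3 2
```

Only the three short strings in `ref` live in context between steps; the entries are fetched at verification time and capped by `limit`, so even a noisy query cannot flood the window.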

The blind-agent trap

If the log or metric MCP server is down, the agent loses that signal entirely, and may silently proceed, reading an empty result as "no errors." Stale or sampled metrics mislead the same way: an error-rate query run 30 seconds after a deploy can return pre-deploy data, leading the agent to conclude wrongly that the fix worked.
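One way to guard against this trap is to make "no data" and "too early" explicit outcomes instead of letting them collapse into success. The sketch below is illustrative only: `MetricReading`, `judge`, the 300-second settle window, and the error threshold of 5 are assumptions chosen for the example, not values from any particular monitoring tool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    VERIFIED = "verified"
    FAILED = "failed"
    UNKNOWN = "unknown"


@dataclass
class MetricReading:
    ok: bool                    # did the query itself succeed?
    error_count: Optional[int]  # None when no data came back
    observed_at: float          # timestamp of the newest data point, in seconds


def judge(reading: MetricReading, deployed_at: float,
          settle_seconds: float = 300.0, threshold: int = 5) -> Verdict:
    """Treat outages, empty results, and pre-settle data as UNKNOWN, never as success."""
    if not reading.ok or reading.error_count is None:
        return Verdict.UNKNOWN  # server down or empty result: no evidence either way
    if reading.observed_at < deployed_at + settle_seconds:
        return Verdict.UNKNOWN  # data may still reflect the pre-deploy system
    return Verdict.VERIFIED if reading.error_count <= threshold else Verdict.FAILED


deployed_at = 1000.0
print(judge(MetricReading(False, None, 2000.0), deployed_at))  # Verdict.UNKNOWN
print(judge(MetricReading(True, 3, 1030.0), deployed_at))      # Verdict.UNKNOWN
print(judge(MetricReading(True, 3, 1400.0), deployed_at))      # Verdict.VERIFIED
```

The three-valued verdict forces the agent to either wait and re-query or report that it could not verify, rather than treating silence as an all-clear.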

↪ Your win: write, observe, verify

Wire three signal categories — visual, log, metric — into the agent's context.
Prefer accessibility snapshots over screenshots for functional checks; reserve screenshots for layout bugs.
Climb the verification ladder cheapest-first: python -c → curl → browser → vision.
Use just-in-time (JIT) references: store query + time range, and load the payload on demand to spare the context window.
Plan for the blind spot — an MCP outage or stale metric reads as "all clear" if you let it.

Retrieval practice — recall, don't peek

Question 1: An agent that only reads code and test output operates in…

Question 2: For a functional UI check, the cheaper, model-readable signal is…

Question 3: The verification ladder says you should start with…

Question 4: A JIT reference means the agent stores the…

Question 5: If the log MCP server goes down mid-task, the danger is the agent…

Next in Part 1: Leaving a Trail, trajectory logging that survives a context reset.