Verifying Agent Work · ~7 min
Unit tests pass. Integration tests pass. The agent left a corrupt cache and a stuck worker behind. One rule catches it: no feature is complete if the system can't restart cleanly afterward.
A Golden Journey is a named, repeatable path through the running system with an explicit failure signal at each step. The governing rule from the Walking Labs reliability framework: "No feature is complete if the system cannot restart cleanly afterward."
The restart-clean rule turns the journey into a completion gate. A feature that passes its unit and integration tests but leaves a corrupt cache, a half-applied migration, or a stuck background worker is — under this rule — not done.
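The definition and the restart-clean gate above can be sketched in a few lines of Python. This is a minimal illustration, not the framework's implementation: the `Step`, `Journey`, and `gate` names, the `JOURNEY_FAIL` / `RESTART_FAIL` line formats, and the `restart_clean` callback are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True when the step succeeds
    failure_signal: str        # exact, grep-able text emitted on failure


@dataclass
class Journey:
    name: str
    steps: List[Step]

    def run(self) -> Optional[str]:
        """Return None if every step passes, else one line for the first failure."""
        for step in self.steps:
            if not step.check():
                return (f"JOURNEY_FAIL journey={self.name} "
                        f"step={step.name} signal={step.failure_signal}")
        return None


def gate(journey: Journey, restart_clean: Callable[[], bool]) -> List[str]:
    """Completion gate: the journey must pass AND the system must restart cleanly.

    An empty list means the feature may be marked done.
    """
    failures = []
    result = journey.run()
    if result is not None:
        failures.append(result)
    # Hypothetical hook: restart the system and report whether it came back healthy.
    if not restart_clean():
        failures.append(f"RESTART_FAIL journey={journey.name} "
                        "signal=system did not restart cleanly")
    return failures


if __name__ == "__main__":
    demo = Journey("index", [
        Step("post_index", lambda: True,
             "/index returns 500 with body containing chunk size <= 0"),
    ])
    # The journey passes, but a simulated dirty restart still blocks completion.
    for line in gate(demo, restart_clean=lambda: False):
        print(line)
```

Keeping each failure on a single line with fixed keys means an agent, or a human, can find the relevant failure with one grep over the run output.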
A usable failure signal looks like "/index returns 500 with body containing chunk size <= 0". The signal must be specific enough to grep for.
Specificity defeats agent rubber-stamping. An agent that reads exit code 137 and looks it up will diagnose an OOM kill; an agent that sees "tests failed" just retries the same change. This mirrors observability practice: semantic exit codes and grep-friendly log lines make a failure diagnosable from repo-local signals, without spelunking through a trace UI.
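As a rough sketch of what a semantic exit code buys you, the helper below turns a raw exit code into one grep-friendly line. It relies on the common shell convention that a process killed by signal N reports exit code 128 + N, so 137 corresponds to signal 9 (SIGKILL), which is what the Linux OOM killer sends. The function name, the output format, and the small signal table (Linux numbering) are assumptions for illustration.

```python
# Conventional Linux signal numbers; a deliberately small, illustrative table.
SIGNAL_NAMES = {2: "SIGINT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Map a process exit code to a single grep-able diagnostic line."""
    if code == 0:
        return "EXIT code=0 status=ok"
    if code > 128:
        # Shell convention: 128 + N means the process was killed by signal N.
        signum = code - 128
        name = SIGNAL_NAMES.get(signum, f"signal_{signum}")
        line = f"EXIT code={code} status=killed signal={name}"
        if name == "SIGKILL":
            # SIGKILL has several possible senders; an OOM kill is a common one.
            line += " hint=possible_oom_kill"
        return line
    return f"EXIT code={code} status=error"


if __name__ == "__main__":
    print(describe_exit_code(137))
    print(describe_exit_code(1))
```

The hint is worded as a possibility on purpose: SIGKILL can also come from an operator or a supervisor, so the line points the agent at what to check next rather than asserting a cause.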
Maintain 3 to 7 journeys per app surface. The journey list is the smallest set that, if all pass
and the system restarts clean, you would ship. Golden Journeys plug into the same completion-gate stack as feature
list files and pre-completion checklists: a feature's verification field cites the journey it exercises;
the checklist runs the journey and the restart; the feature flips to passing only when both pass.
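The wiring described above might look like the following sketch. Only the `verification` field comes from the text; the rest of the `feature_list.json` shape (`id`, `status`, a top-level list) and the `update_feature_status` helper are hypothetical, chosen to show the "both must pass" rule.

```python
import json
from typing import Dict


def update_feature_status(features_json: str,
                          journey_passed: Dict[str, bool],
                          restart_clean: bool) -> str:
    """Flip each feature to "passing" only if its cited journey passed
    AND the system restarted cleanly afterward."""
    features = json.loads(features_json)
    for feature in features:
        journey = feature["verification"]  # name of the journey this feature cites
        # A journey with no recorded result counts as not passed.
        ok = journey_passed.get(journey, False) and restart_clean
        feature["status"] = "passing" if ok else "failing"
    return json.dumps(features, indent=2)


if __name__ == "__main__":
    # Assumed file shape, for illustration only.
    source = json.dumps([
        {"id": "chunked-indexing", "verification": "index-journey"},
        {"id": "search-filters", "verification": "search-journey"},
    ])
    results = {"index-journey": True, "search-journey": False}
    print(update_feature_status(source, results, restart_clean=True))
```

Treating a missing journey result as a failure keeps the gate conservative: a feature cannot pass by citing a journey that never ran.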
Note what they are not: "happy path" testing covers the nominal flow with no errors, and "golden path" in platform engineering means a paved developer workflow. Golden Journeys add two things neither enforces — a grep-able per-step failure signal and a hard restart-clean criterion.
Golden Journeys are a poor fit in four situations. Long-startup systems (multi-minute warm-up, large index loads, migration-heavy DBs) pay the restart-clean cost on every PR cycle, so either CI time balloons or the gate becomes nominal, enforced in name only. Stateless or trivial systems have nothing meaningful to restart from; the ceremony adds no signal. Pre-PMF prototypes change faster than the journeys can be maintained, so the list goes stale. And where mature observability already encodes the critical paths, journeys just re-encode them.
feature_list.json, run in the checklist.
Retrieval practice: recall, don't peek
Question 1: The load-bearing rule of a Golden Journey is that a feature isn't done unless…
Question 2: A proper per-step failure signal is something like…
Question 3: The recommended number of journeys per app surface is…
Question 4: Golden Journeys would add ceremony without signal on a…
Question 5 (spaced recall from Lesson 8): Behavioral testing evaluates an agent's…
RELIABILITY.md with two or three Golden Journeys drafted for your app, plus a feature_list.json that cites them?
Next, Part 3 closes with Grade the Outcome: scoring final state and keeping an LLM judge honest.