Golden Journeys

Unit tests pass. Integration tests pass. The agent left a corrupt cache and a stuck worker behind. One rule catches it: no feature is complete if the system can't restart cleanly afterward.

Why this, for you: end-to-end verification of a running system is where agents quietly leave damage that lower-level tests never see. A Golden Journey is a named path with a per-step failure signal and a restart-clean gate — a completion artefact, not a smoke-test afterthought.

A Golden Journey is a named, repeatable path through the running system with an explicit failure signal at each step. The governing rule from the Walking Labs reliability framework: "No feature is complete if the system cannot restart cleanly afterward."

1 Four fields per journey

Start command — the exact invocation that boots the surface from cold.
User-visible steps — what the operator or test driver does, in order.
Observable end state — what the system shows when the journey completes.
Failure signal per step — the specific log line, screen state, or exit code that means "this step did not work."
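The four fields map naturally onto a small record type. The following is a minimal sketch, not a prescribed schema: the class names `Step` and `Journey`, the helper `missing_fields`, and the example end state are all hypothetical, chosen here only to show that a journey with an empty failure signal can be rejected mechanically.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    action: str          # what the operator or test driver does
    failure_signal: str  # specific log line, screen state, or exit code


@dataclass(frozen=True)
class Journey:
    name: str
    start_command: str       # exact invocation that boots the surface from cold
    steps: Tuple[Step, ...]  # user-visible steps, in order
    end_state: str           # what the system shows when the journey completes


def missing_fields(journey: Journey) -> List[str]:
    """List every empty field; an empty list means the journey is fully specified."""
    problems = []
    if not journey.start_command.strip():
        problems.append("start_command")
    if not journey.steps:
        problems.append("steps")
    if not journey.end_state.strip():
        problems.append("end_state")
    for number, step in enumerate(journey.steps, start=1):
        if not step.failure_signal.strip():
            problems.append(f"steps[{number}].failure_signal")
    return problems


index_search = Journey(
    name="index-search",
    start_command="make run",
    steps=(
        Step("Operator visits /index",
             "status != 200, or body contains `chunk size <= 0`"),
        Step('Submit query "agent harness"',
             "log `ERROR retrieval timeout` within 5s, or empty results"),
    ),
    end_state="results page lists matches",  # hypothetical end state for illustration
)
print(missing_fields(index_search))  # prints []
```

Keeping the failure signal on the step, rather than on the journey as a whole, is what forces one signal per step.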

The restart-clean rule turns the journey into a completion gate. A feature that passes its unit and integration tests but leaves a corrupt cache, a half-applied migration, or a stuck background worker is — under this rule — not done.
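One way to picture the restart-clean gate is as a function that stops the system, looks for leftover state, starts it again, and waits for it to report healthy. This is a sketch under stated assumptions: `restart_gate` and its four callables are hypothetical hooks you would wire to your own stop command, start command, health probe, and leftover scan (for example, a lockfile check). The 30-second default is only an illustrative budget.

```python
import time
from typing import Callable, List, Optional


def restart_gate(
    stop: Callable[[], None],
    start: Callable[[], None],
    probe: Callable[[], bool],
    find_leftovers: Callable[[], List[str]],
    timeout_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Return None if the restart is clean, else a grep-able failure signal."""
    stop()
    # Anything still on disk after a stop (lockfile, temp cache) is damage.
    leftovers = find_leftovers()
    if leftovers:
        return "stale state after stop: " + ", ".join(leftovers)
    start()
    deadline = time.monotonic() + timeout_s
    while not probe():
        if time.monotonic() >= deadline:
            return f"not healthy within {timeout_s:g}s of restart"
        sleep(0.5)
    return None
```

Returning a string instead of raising keeps the result in the same shape as any other per-step failure signal, so the caller can log it verbatim.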

2 "Test fails" is not a failure signal

"Test fails" is not a failure signal. "Request to /index returns 500 with body containing `chunk size <= 0`" is. The signal must be specific enough to grep for.

Specificity defeats agent rubber-stamping. An agent that reads exit code 137 and looks it up finds 128 + 9, a SIGKILL, which commonly indicates an OOM kill; an agent that sees only "tests failed" just retries the same change. This mirrors observability practice: semantic exit codes and grep-friendly log lines make a failure diagnosable from repo-local signals, without spelunking through a trace UI.
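To make the exit-code point concrete, here is a small sketch that turns a raw exit status into a phrase an agent can grep for. The function name `describe_exit` is hypothetical. It relies on two conventions: shells report a signal-terminated process as 128 plus the signal number, and Python's `subprocess` reports it as the negative signal number on POSIX.

```python
import signal


def describe_exit(code: int) -> str:
    """Translate an exit status into a specific, grep-able phrase."""
    if code == 0:
        return "exit 0: success"
    if code < 0:
        signum = -code        # subprocess convention on POSIX: -N means signal N
    elif code > 128:
        signum = code - 128   # shell convention: 128 + N means signal N
    else:
        return f"exit {code}: application error"
    try:
        name = signal.Signals(signum).name
    except ValueError:
        # The number is not a known signal on this platform.
        name = f"signal {signum}"
    return f"exit {code}: killed by {name}"


# On Linux, 137 decodes to SIGKILL. That is consistent with an OOM kill
# but does not prove one; the kernel log is the place to confirm it.
print(describe_exit(137))
```

The decoded name narrows the search; it is a starting point for diagnosis, not a verdict.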

# Golden Journey: index-search
1. Operator visits /index
   Failure signal: status != 200, or body contains `chunk size <= 0`
2. Submit query "agent harness"
   Failure signal: log `ERROR retrieval timeout` within 5s, or empty results
3. Restart: make stop && make run
   Failure signal: /healthz != 200 within 30s, or stale lockfile
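A journey like this can be driven by a very small runner that executes steps in order and reports the first failure signal it sees. The sketch below is illustrative only: `run_journey` and `index_check` are hypothetical names, and the checks take canned values instead of making real HTTP calls so the example stays self-contained.

```python
from typing import Callable, List, Optional, Tuple

# A check returns a failure signal string, or None when the step worked.
Check = Callable[[], Optional[str]]


def run_journey(steps: List[Tuple[str, Check]]) -> Optional[str]:
    """Run steps in order; return the first failure signal, or None if all pass."""
    for number, (label, check) in enumerate(steps, start=1):
        failure = check()
        if failure is not None:
            return f"step {number} ({label}): {failure}"
    return None


def index_check(status: int, body: str) -> Optional[str]:
    """Step 1 of index-search: the two failure signals named in the journey."""
    if status != 200:
        return f"status {status} != 200"
    if "chunk size <= 0" in body:
        return "body contains `chunk size <= 0`"
    return None


result = run_journey([
    ("visit /index", lambda: index_check(500, "chunk size <= 0")),
    ("submit query", lambda: None),
])
print(result)  # prints: step 1 (visit /index): status 500 != 200
```

Stopping at the first failure keeps the report tied to a single step, which is the line an agent should grep for next.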

3 Representative, not exhaustive

Maintain 3 to 7 journeys per app surface. The journey list is the smallest set that, if all pass and the system restarts clean, you would ship. Golden Journeys plug into the same completion-gate stack as feature list files and pre-completion checklists: a feature's verification field cites the journey it exercises; the checklist runs the journey and the restart; the feature flips to passing only when both pass.
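The two-part gate can be sketched as a single function. Note the assumptions: the source does not specify the layout of a feature list entry, so the `verification.journey` and `passes` keys below are hypothetical, as is the function name `gate_feature`.

```python
from typing import Dict


def gate_feature(feature: Dict, journey_passed: Dict[str, bool],
                 restart_clean: bool) -> Dict:
    """Return a copy of the feature with `passes` set by the two-part gate."""
    journey = feature["verification"]["journey"]
    # A journey that was never run counts as a failure, never as a pass.
    passes = journey_passed.get(journey, False) and restart_clean
    return {**feature, "passes": passes}


feature = {
    "id": "search-ranking",                       # hypothetical feature entry
    "verification": {"journey": "index-search"},  # cites the journey it exercises
    "passes": False,
}
print(gate_feature(feature, {"index-search": True}, restart_clean=False)["passes"])
# prints False: the journey passed but the restart did not
```

Returning a copy rather than mutating the entry leaves the decision about writing the file back to the checklist that runs the gate.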

Note what they are not: "happy path" testing covers the nominal flow with no errors, and "golden path" in platform engineering means a paved developer workflow. Golden Journeys add two things neither enforces — a grep-able per-step failure signal and a hard restart-clean criterion.

When Golden Journeys don't earn their keep

Long-startup systems (multi-minute warm-up, large index loads, migration-heavy DBs) pay the restart-clean cost on every PR cycle: either CI time balloons or the gate decays into a nominal check that nobody enforces. Stateless or trivial systems have no meaningful state to restart from, so the ceremony adds no signal. Pre-product-market-fit prototypes change faster than the journeys can be maintained, so the list goes stale. And where mature observability already encodes the critical paths, journeys just re-encode them.

↪ Your win: gate on a clean restart

Restart-clean is the load-bearing rule — not done if the system can't restart cleanly after.
Four fields — start command, steps, observable end state, per-step failure signal.
Grep-able signals — a specific log line or exit code, never "test fails."
3–7 journeys per surface — representative coverage, the smallest set you'd ship on.
Wire into the gate stack — cite from feature_list.json, run in the checklist.

Retrieval practice — recall, don't peek

Question 1: The load-bearing rule of a Golden Journey is that a feature isn't done unless…

Question 2: A proper per-step failure signal is something like…

Question 3: The recommended number of journeys per app surface is…

Question 4: Golden Journeys would add ceremony without signal on a…

Question 5 · spaced recall from Lesson 8: Behavioral testing evaluates an agent's…

Next, Part 3 closes with Grade the Outcome: scoring final state and keeping an LLM judge honest.