Harness Engineering · ~8 min
When a run outlives a context window, state has to live outside the model. The harness becomes a way to resume a task that no single session can hold.
A long-running agent is one whose run survives session boundaries, sandbox crashes, and human pauses. The defining move is to relocate state from the context window into durable artifacts from which a fresh session can resume the run.
The same three failure modes recur across every published write-up, and together they rule out "just keep one big session going."
| Wall | Why it bites |
|---|---|
| Finite context | Even a 1M-token window fills, and quality rots well before the hard cap. No window on the roadmap holds a 24-hour run. |
| No persistent state | A new session starts blank — Anthropic likens it to "engineers working in shifts," each arriving with no memory of the last. |
| Unreliable self-verification | Models skew positive on their own work; without a separate evaluator the agent ships half-done with full confidence. |
Anthropic, Cursor, Google, and open-source practitioners converge on the same five pieces:
The reference structure is two agents and three files. An initializer runs once: it sets up the environment, expands the prompt into a feature-list.json (every feature initially marked failing), and writes an init.sh. A coding agent is then woken repeatedly, and each session reads the orientation artifacts before touching code.
Git commits double as cross-session handoff notes; git log becomes a readable audit trail. And a test ratchet in the prompt ("it is unacceptable to remove or edit tests") blocks the classic failure of an agent deleting failing tests to make the suite green.
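As a rough illustration of the initializer's feature list and the coding agent's orientation step, here is a minimal Python sketch. The JSON schema (the name and status fields), the function names, and the example features are assumptions made for illustration; the write-ups referenced above only establish that a feature-list.json exists and that every feature starts out failing.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def write_feature_list(path: Path, features: list[str]) -> None:
    """Initializer step: record every feature as failing before any code exists."""
    entries = [{"name": name, "status": "failing"} for name in features]
    path.write_text(json.dumps(entries, indent=2))


def next_failing_feature(path: Path) -> Optional[str]:
    """Orientation step: a fresh session reads the file to find its next task."""
    for entry in json.loads(path.read_text()):
        if entry["status"] == "failing":
            return entry["name"]
    return None  # nothing left to do


def mark_passing(path: Path, name: str) -> None:
    """Record progress on disk so the next session does not repeat the work."""
    entries = json.loads(path.read_text())
    for entry in entries:
        if entry["name"] == name:
            entry["status"] = "passing"
    path.write_text(json.dumps(entries, indent=2))


# Hypothetical features; a real initializer would derive these from the prompt.
workdir = Path(tempfile.mkdtemp())
feature_file = workdir / "feature-list.json"
write_feature_list(feature_file, ["login form", "password reset"])

first_task = next_failing_feature(feature_file)   # what session 1 would pick up
mark_passing(feature_file, first_task)
second_task = next_failing_feature(feature_file)  # what session 2 would pick up
```

Because the only shared state is the file, the second lookup behaves the same whether it runs in the same process or in a brand-new session after a crash.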
At day-plus durations, summarizing old turns isn't enough — fidelity erodes each time goals are re-summarized. Anthropic resorts to full context resets: tear the session down and rebuild from the structured handoff file. The bash form is the Ralph loop — every iteration starts fresh and reads the filesystem before acting.
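The loop shape can be sketched in a few lines. The text describes the Ralph loop in its bash form; what follows is a Python rendering of the same idea, not the original script. The handoff file name, its fields (goal, notes, done), and the fake_session stand-in for a real agent call are all assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def ralph_loop(handoff: Path, run_session: Callable[[str], None], max_iterations: int) -> int:
    """Start fresh sessions until the handoff file says done; return sessions used."""
    for used in range(max_iterations):
        # Each iteration rebuilds its view of the task from disk, never from memory.
        state = json.loads(handoff.read_text())
        if state["done"]:
            return used
        prompt = f"Goal: {state['goal']}\nNotes so far: {state['notes']}"
        run_session(prompt)  # the session is expected to update the handoff file
    return max_iterations


# Hypothetical handoff file; the field names are illustrative.
workdir = Path(tempfile.mkdtemp())
handoff_file = workdir / "handoff.json"
handoff_file.write_text(json.dumps({"goal": "ship feature", "notes": [], "done": False}))


def fake_session(prompt: str) -> None:
    """Stand-in for a real agent session: adds one note, finishes on the third."""
    state = json.loads(handoff_file.read_text())
    state["notes"].append(f"session {len(state['notes']) + 1} finished a step")
    state["done"] = len(state["notes"]) >= 3
    handoff_file.write_text(json.dumps(state))


sessions_used = ralph_loop(handoff_file, fake_session, max_iterations=10)
```

Nothing is passed between iterations except what is on disk, which is what makes each session disposable. The max_iterations cap is a design choice of this sketch so the loop cannot run unbounded.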
When you're watching a run live and it drifts, you have three moves: steer (a mid-run message that redirects without discarding context — for a recoverable wrong turn), restart (a fresh context with a better prompt — cheaper than salvaging a fundamentally wrong run through repeated steers), or let it finish. The handoff file is what makes restart cheap.
The primitives pay only when work genuinely exceeds one session. For short-horizon interactive tasks, checkpoint/resume adds latency with no reliability gain. And the open problems are real: without budgets and circuit breakers an agent can burn a week's API spend in an afternoon; credentials plus shell access is a far larger attack surface than a chat; and auditing 24 hours of autonomous activity is a human-time problem that only structured artifacts (PRs, commits, test runs) make tractable.
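One way to picture the budget and circuit-breaker idea is a small guard that the harness consults after every agent step. This is a sketch under stated assumptions: the RunGuard and RunHalted names, the dollar figures, and the choice to trip on consecutive failures are all hypothetical, and a real harness would take its cost figures from its provider's usage reporting.

```python
from dataclasses import dataclass


class RunHalted(RuntimeError):
    """Raised when the harness should stop the agent instead of continuing."""


@dataclass
class RunGuard:
    max_spend_usd: float
    max_consecutive_failures: int
    spent_usd: float = 0.0
    consecutive_failures: int = 0

    def record(self, cost_usd: float, succeeded: bool) -> None:
        """Account for one agent step, then halt if a limit has been crossed."""
        self.spent_usd += cost_usd
        self.consecutive_failures = 0 if succeeded else self.consecutive_failures + 1
        if self.spent_usd > self.max_spend_usd:
            raise RunHalted(f"budget exceeded: ${self.spent_usd:.2f}")
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise RunHalted(f"{self.consecutive_failures} failures in a row")


# Illustrative limits only.
guard = RunGuard(max_spend_usd=5.00, max_consecutive_failures=3)
guard.record(cost_usd=2.00, succeeded=True)
guard.record(cost_usd=2.00, succeeded=False)
try:
    guard.record(cost_usd=2.00, succeeded=False)  # total 6.00 crosses the 5.00 cap
    halted_reason = None
except RunHalted as exc:
    halted_reason = str(exc)
```

Raising an exception rather than returning a flag means a harness loop that forgets to check the guard still stops, which matters when nobody is watching the run.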
Retrieval practice — recall, don't peek
Question 1: The defining move for a long-running agent is to…
Question 2: Writing the done-condition to disk before the run prevents the agent from…
Question 3: A stateless harness with a disposable sandbox makes crash recovery…
Question 4: The "test ratchet" in the prompt exists to stop the agent from…
Question 5 · spaced recall from Lesson 06: A reliable completion gate anchors "done" to…
Next, Part 4 opens with Skills & Progressive Disclosure: loading an agent's knowledge on demand instead of all at once.