Part 3 · Safety & State

Harness Engineering · ~8 min

Long-Running Agents

When a run outlives a context window, state has to live outside the model. The harness becomes a way to resume a task that no single session can hold.

Why this, for you: the moment a task can't finish in one session — a multi-hour migration, an overnight build — every harness lever from this course composes into one operational shape. Get the state model right and a crashed sandbox is a resume, not a restart from zero.

A long-running agent is one whose run survives session boundaries, sandbox crashes, and human pauses. The defining move is to shift state out of the context window and into durable artifacts from which any fresh session can resume the run.

1 Three walls force the design

The same three failure modes recur across every published write-up, and together they rule out "just keep one big session going."

Wall | Why it bites
Finite context | Even a 1M-token window fills, and quality rots well before the hard cap. No window on the roadmap holds a 24-hour run.
No persistent state | A new session starts blank; Anthropic likens it to "engineers working in shifts," each arriving with no memory of the last.
Unreliable self-verification | Models skew positive on their own work; without a separate evaluator the agent ships half-done with full confidence.

2 Five primitives that recur everywhere

Anthropic, Cursor, Google, and open-source practitioners converge on the same five pieces:

1. External done-condition: completion criteria written to disk before the run, so the agent can't quietly redefine "done."
2. Durable session log: an append-only stream of every thought and tool call, outside the harness process, so any instance can resume from it.
3. Stateless harness, disposable sandbox: the harness holds no run state; the sandbox is cattle, not pets, so crash recovery is architectural.
4. Separate evaluator: generation and verification run as different roles.
5. Deliberate checkpoint cadence: save state every N units, not every step (waste) and not only at the end (catastrophic on failure).
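Two of these primitives, the durable session log and the checkpoint cadence, can be sketched in a few lines of Python. This is a minimal sketch under assumptions, not any vendor's implementation: the JSON Lines file, the event fields `kind` and `step`, and the function names `append_event`, `replay`, and `run` are all hypothetical, and the "work" done per step is a placeholder.

```python
import json
import os
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a JSON line; earlier lines are never rewritten."""
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the write to disk before continuing


def replay(log_path: Path) -> list:
    """Rebuild the run history from disk; any fresh harness instance can call this."""
    if not log_path.exists():
        return []
    with log_path.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def run(log_path: Path, total_steps: int, checkpoint_every: int) -> None:
    """Do placeholder work, resuming just after the last checkpoint in the log."""
    events = replay(log_path)
    checkpoints = [e["step"] for e in events if e["kind"] == "checkpoint"]
    start = checkpoints[-1] + 1 if checkpoints else 0
    for step in range(start, total_steps):
        # Hypothetical stand-in for a real tool call.
        append_event(log_path, {"kind": "tool_call", "step": step})
        # Checkpoint every N units: not every step, and not only at the end.
        if (step + 1) % checkpoint_every == 0:
            append_event(log_path, {"kind": "checkpoint", "step": step})
```

A second call to `run` on the same log, from any process, picks up after the last checkpoint. Steps logged after that checkpoint are redone, which is the cost traded against checkpointing on every step.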

The reference structure is two agents and three files. An initializer runs once: it sets up the environment, expands the prompt into a feature-list.json (every feature initially marked failing), and writes an init.sh. A coding agent is then woken repeatedly; each session reads the orientation artifacts before touching code:

# every session starts by reading the handoff, not by guessing
git log --oneline -10      # what's been committed since baseline?
cat claude-progress.txt    # current status, next priority
cat feature-list.json      # which features pass, which remain
# then: pick ONE feature, implement, test, update progress, commit
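The feature list's role in that handoff can be sketched as follows. The lesson does not specify the schema of feature-list.json, so the list-of-objects layout with `name` and `passes` fields is an assumption, as are the helper names `init_feature_list`, `next_feature`, and `mark_passing`.

```python
import json
from pathlib import Path
from typing import Optional


def init_feature_list(path: Path, feature_names: list) -> None:
    """Initializer step: record every feature as failing before any code exists."""
    features = [{"name": name, "passes": False} for name in feature_names]
    path.write_text(json.dumps(features, indent=2), encoding="utf-8")


def next_feature(path: Path) -> Optional[str]:
    """Coding-agent step: return the first feature still failing, or None if all pass."""
    features = json.loads(path.read_text(encoding="utf-8"))
    for feature in features:
        if not feature["passes"]:
            return feature["name"]
    return None


def mark_passing(path: Path, name: str) -> None:
    """Flip one feature to passing after its tests succeed, then persist the list."""
    features = json.loads(path.read_text(encoding="utf-8"))
    for feature in features:
        if feature["name"] == name:
            feature["passes"] = True
    path.write_text(json.dumps(features, indent=2), encoding="utf-8")
```

Because `next_feature` reads only the file, a session that starts with no memory still knows which single feature to pick up.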

Git commits double as cross-session handoff notes; git log becomes a readable audit trail. And a test ratchet in the prompt — "it is unacceptable to remove or edit tests" — blocks the classic failure of an agent deleting failing tests to make the suite green.
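The lesson places the ratchet in the prompt. A harness could also check it mechanically after each session; the sketch below is one assumed way to do that, not something the lesson describes. The `test_*.py` naming pattern and the function names `snapshot_tests` and `ratchet_violations` are hypothetical.

```python
import hashlib
from pathlib import Path


def snapshot_tests(test_dir: Path) -> dict:
    """Map each test file's relative path to a SHA-256 digest of its contents."""
    return {
        str(p.relative_to(test_dir)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(test_dir.rglob("test_*.py"))
    }


def ratchet_violations(baseline: dict, current: dict) -> list:
    """List baseline tests that were removed or edited; newly added tests are allowed."""
    violations = []
    for name, digest in baseline.items():
        if name not in current:
            violations.append(f"removed: {name}")
        elif current[name] != digest:
            violations.append(f"edited: {name}")
    return violations
```

Taking the snapshot once at baseline and comparing after each session would let the harness refuse a commit that turns the suite green by weakening it.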

3 Recovery: reset, steer, or restart

At day-plus durations, summarizing old turns isn't enough — fidelity erodes each time goals are re-summarized. Anthropic resorts to full context resets: tear the session down and rebuild from the structured handoff file. The bash form is the Ralph loop — every iteration starts fresh and reads the filesystem before acting.
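The same loop shape can be rendered in Python rather than bash. This is a sketch of the pattern only: `run_session` and `is_done` are hypothetical callables standing in for a real agent invocation and a real done-condition check, and `build_context` and `ralph_loop` are illustrative names.

```python
from pathlib import Path
from typing import Callable


def build_context(workdir: Path) -> str:
    """Assemble a fresh prompt from files on disk only; nothing carries over in memory."""
    parts = []
    for name in ("claude-progress.txt", "feature-list.json"):
        f = workdir / name
        if f.exists():
            parts.append(f"## {name}\n{f.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)


def ralph_loop(
    workdir: Path,
    run_session: Callable[[str, Path], None],
    is_done: Callable[[Path], bool],
    max_iterations: int,
) -> int:
    """Run fresh sessions until done; return how many sessions were started."""
    for i in range(max_iterations):
        if is_done(workdir):
            return i
        context = build_context(workdir)  # full reset: rebuilt from the filesystem
        run_session(context, workdir)     # the session writes its results back to disk
    return max_iterations
```

The `max_iterations` cap is a deliberate choice in this sketch: a loop that restarts forever needs some bound outside the model.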

When you're watching a run live and it drifts, you have three moves: steer (a mid-run message that redirects without discarding context — for a recoverable wrong turn), restart (a fresh context with a better prompt — cheaper than salvaging a fundamentally wrong run through repeated steers), or let it finish. The handoff file is what makes restart cheap.

When this machinery is just overhead

The primitives pay only when work genuinely exceeds one session. For short-horizon interactive tasks, checkpoint/resume adds latency with no reliability gain. And the open problems are real: without budgets and circuit breakers an agent can burn a week's API spend in an afternoon; credentials plus shell access is a far larger attack surface than a chat; and auditing 24 hours of autonomous activity is a human-time problem that only structured artifacts (PRs, commits, test runs) make tractable.
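A budget and circuit breaker of the kind mentioned above might look like the following minimal sketch. The thresholds, the per-call cost accounting, and the names `CircuitBreaker` and `RunHalted` are assumptions for illustration; a real harness would take costs from its provider's usage data.

```python
from dataclasses import dataclass


class RunHalted(RuntimeError):
    """Raised when the run must stop and wait for a human."""


@dataclass
class CircuitBreaker:
    max_spend_usd: float
    max_consecutive_failures: int
    spent_usd: float = 0.0
    consecutive_failures: int = 0

    def record(self, cost_usd: float, succeeded: bool) -> None:
        """Account for one model or tool call; raise if a limit is crossed."""
        self.spent_usd += cost_usd
        if succeeded:
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.spent_usd > self.max_spend_usd:
            raise RunHalted(f"budget exceeded: ${self.spent_usd:.2f} spent")
        if self.consecutive_failures >= self.max_consecutive_failures:
            raise RunHalted(f"{self.consecutive_failures} failures in a row")
```

Calling `record` after every step means a runaway loop stops at the first call that crosses a limit rather than at the end of the afternoon.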

↪ Your win: state outside the model, resume over restart

Retrieval practice — recall, don't peek

Question 1: The defining move for a long-running agent is to…

Question 2: Writing the done-condition to disk before the run prevents the agent from…

Question 3: A stateless harness with a disposable sandbox makes crash recovery…

Question 4: The "test ratchet" in the prompt exists to stop the agent from…

Question 5 · spaced recall from Lesson 06: A reliable completion gate anchors "done" to…

Ask me anything. Want the initializer + coding-agent split sketched for a real overnight task, or how full context resets differ from /compact? Next, Part 4 opens with Skills & Progressive Disclosure — loading an agent's knowledge on demand instead of all at once.