Long-Running Agents

When a run outlives a context window, state has to live outside the model. The harness becomes a way to resume a task that no single session can hold.

Why this, for you: the moment a task can't finish in one session — a multi-hour migration, an overnight build — every harness lever from this course composes into one operational shape. Get the state model right and a crashed sandbox is a resume, not a restart from zero.

A long-running agent is one whose run survives session boundaries, sandbox crashes, and human pauses. The defining move: move state out of the context window into durable artifacts that can resume it.

1 Three walls force the design

The same three failure modes recur across every published write-up, and together they rule out "just keep one big session going."

Wall	Why it bites
Finite context	Even a 1M-token window fills, and quality rots well before the hard cap. No window on the roadmap holds a 24-hour run.
No persistent state	A new session starts blank — Anthropic likens it to "engineers working in shifts," each arriving with no memory of the last.
Unreliable self-verification	Models skew positive on their own work; without a separate evaluator the agent ships half-done with full confidence.

2 Five primitives that recur everywhere

Anthropic, Cursor, Google, and open-source practitioners converge on the same five pieces:

1. External done-condition — completion criteria written to disk before the run, so the agent can't quietly redefine "done." 2. Durable session log — an append-only stream of every thought and tool call, outside the harness process, so any instance can resume from it. 3. Stateless harness, disposable sandbox — the harness holds no run state; the sandbox is cattle, not pets, so crash recovery is architectural. 4. Separate evaluator — generation and verification run as different roles. 5. Deliberate checkpoint cadence — save state every N units, not every step (waste) and not only at the end (catastrophic on failure).

The reference structure is two agents and three files. An initializer runs once — sets up the environment, expands the prompt into a feature-list.json (every feature marked failing initially), writes an init.sh. A coding agent is woken repeatedly; each session reads orientation artifacts before touching code:

# every session starts by reading the handoff, not by guessing git log --oneline -10 # what's been committed since baseline? cat claude-progress.txt # current status, next priority cat feature-list.json # which features pass, which remain # then: pick ONE feature, implement, test, update progress, commit

Git commits double as cross-session handoff notes; git log becomes a readable audit trail. And a test ratchet in the prompt — "it is unacceptable to remove or edit tests" — blocks the classic failure of an agent deleting failing tests to make the suite green.

3 Recovery: reset, steer, or restart

At day-plus durations, summarizing old turns isn't enough — fidelity erodes each time goals are re-summarized. Anthropic resorts to full context resets: tear the session down and rebuild from the structured handoff file. The bash form is the Ralph loop — every iteration starts fresh and reads the filesystem before acting.

When you're watching a run live and it drifts, you have three moves: steer (a mid-run message that redirects without discarding context — for a recoverable wrong turn), restart (a fresh context with a better prompt — cheaper than salvaging a fundamentally wrong run through repeated steers), or let it finish. The handoff file is what makes restart cheap.

When this machinery is just overhead

The primitives pay only when work genuinely exceeds one session. For short-horizon interactive tasks, checkpoint/resume adds latency with no reliability gain. And the open problems are real: without budgets and circuit breakers an agent can burn a week's API spend in an afternoon; credentials plus shell access is a far larger attack surface than a chat; and auditing 24 hours of autonomous activity is a human-time problem that only structured artifacts (PRs, commits, test runs) make tractable.

↪ Your win: state outside the model, resume over restart

Write the done-condition to disk before the run — the agent can't move the goalposts mid-flight.
Keep an append-only session log + progress file — so any instance resumes from the event stream.
Make the harness stateless and the sandbox disposable — then a crash is a resume, not a loss.
One feature per session, read artifacts first, commit as handoff — and ratchet tests so they can't be deleted.
Reset on long runs, steer on recoverable drift, restart when it's fundamentally wrong — bounded by budgets and circuit breakers.

Retrieval practice — recall, don't peek

Question 1The defining move for a long-running agent is to…

Question 2Writing the done-condition to disk before the run prevents the agent from…

Question 3A stateless harness with a disposable sandbox makes crash recovery…

Question 4The "test ratchet" in the prompt exists to stop the agent from…

Question 5 · spaced recall from Lesson 06A reliable completion gate anchors "done" to…

Ask me anything. Want the initializer + coding-agent split sketched for a real overnight task, or how full context resets differ from /compact? Next, Part 4 opens with Skills & Progressive Disclosure — loading an agent's knowledge on demand instead of all at once.