Snapshot and Roll Back

Repository setup is where general agents quietly fail — and the fix is to make every state-changing command reversible, then verify with two roles instead of one exit code.

Why this, for you: "get this unfamiliar repo running" is a task agents are measurably bad at, and the failure is invisible — the install succeeds, the feature doesn't run. This lesson is the composition that closes that gap, plus the precondition test for when it's worth the machinery.

Setup is a documented weak point: on SetupBench (93 tasks across seven language ecosystems), one agent hit only 38.9–57.4% on repo setup and 20.0–53.3% on local database config, with 38–89% of actions unnecessary versus optimal human behavior. The experiential pipeline — SetupX — targets that gap with three composed mechanisms.

1 Make irreversible state reversible

Before each state-modifying command, the agent issues docker commit to snapshot the container. If the command exits non-zero — or later verification fails — it reverts. This converts irreversible system mutations into reversible ones, eliminating the "environment pollution" failure mode (a botched pip install that half-mutates the environment).

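# Snapshot only before state-changing commands.
docker commit setup-ctn snap-3                          # checkpoint: save the container's filesystem as image snap-3
docker exec setup-ctn pip install tensorflow-gpu        # fails, pollutes env
docker rm -f setup-ctn                                  # revert: discard the polluted container
docker run -d --name setup-ctn snap-3 sleep infinity    # recreate it from the checkpoint (keep-alive command is illustrative)
# Read-only commands (ls, cat, grep) are NOT snapshotted.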

Repo2Run independently validated this primitive for Python-repo Dockerfile generation: atomic docker commit/rollback hit 86.0% versus 6.0% / 9.0% / 22.1% for pipreqs, SWE-agent, and a README-only LLM baseline. Snapshot only state-modifying commands — snapshotting reads makes per-command overhead dominate wall-clock time.
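The snapshot-then-revert loop can be sketched in a few lines. This is a minimal illustration, not SetupX's or Repo2Run's actual code: the `READ_ONLY` allow-list, the function names, and the three injected callables (`execute`, `snapshot`, `restore`) are all hypothetical. In a real run, `snapshot` would wrap `docker commit` and `restore` would recreate the container from the committed image.

```python
import shlex
from typing import Callable, List, Tuple

# Hypothetical allow-list of read-only programs. A real agent would need a
# richer classifier (e.g. `cat x > y` writes despite starting with `cat`).
READ_ONLY = {"ls", "cat", "grep", "pwd", "head", "tail", "which"}


def is_state_changing(command: str) -> bool:
    """Treat a command as state-changing unless its program is allow-listed."""
    tokens = shlex.split(command)
    return bool(tokens) and tokens[0] not in READ_ONLY


def run_with_rollback(
    commands: List[str],
    execute: Callable[[str], int],    # runs a command, returns its exit code
    snapshot: Callable[[], str],      # takes a checkpoint, returns its id
    restore: Callable[[str], None],   # reverts the environment to that id
) -> List[Tuple[str, str]]:
    """Run commands in order, reverting any state-changing command that fails."""
    log: List[Tuple[str, str]] = []
    for command in commands:
        if not is_state_changing(command):
            execute(command)              # reads are never snapshotted
            log.append((command, "read-only"))
            continue
        snap_id = snapshot()              # checkpoint before the mutation
        if execute(command) == 0:
            log.append((command, "ok"))
        else:
            restore(snap_id)              # undo the partial mutation
            log.append((command, "reverted"))
    return log
```

Injecting the three callables keeps the loop testable without a Docker daemon. Snapshot cost is paid only on the state-changing branch.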

2 Verify with prosecutor and judge, not an exit code

Surface success and "the documented feature actually runs" are different properties — SetupBench names the gap "verification-strategy mismatches." So two roles split the check: the prosecutor gathers evidence (runs the documented tests, exercises README entry points, checks health endpoints); the judge reads the evidence and renders the binary verdict. On rejection, the agent reverts to the last good snapshot and pulls the next candidate fix.
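One way to picture the two-role split is as two functions with different jobs: one only collects, the other only decides. The sketch below is an assumption-laden illustration: `Evidence`, `prosecutor`, and `judge` are hypothetical names, and the judge here is a simple rule standing in for whatever reasoning the real judge applies.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Evidence:
    check: str      # what was exercised, e.g. "documented tests"
    passed: bool    # whether the check completed without raising
    detail: str     # captured output or error message


def prosecutor(checks: Dict[str, Callable[[], str]]) -> List[Evidence]:
    """Run every check and record what happened; never decide the outcome."""
    evidence: List[Evidence] = []
    for name, check in checks.items():
        try:
            evidence.append(Evidence(name, True, check()))
        except Exception as exc:  # a failing check is evidence, not a crash
            evidence.append(Evidence(name, False, repr(exc)))
    return evidence


def judge(evidence: List[Evidence]) -> bool:
    """Binary verdict: accept only if there is evidence and all of it passed."""
    return bool(evidence) and all(item.passed for item in evidence)
```

An empty evidence list is rejected, so "nothing was checked" can never pass as success. An exit-code-only check allows exactly that.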

The loop closes by promoting the verified sequence — symptom plus executable action plus evidence — back to an experience store (a dual-modality XPU record). The next repo hitting the same symptom retrieves the fix instead of re-deriving it. SetupX reports a 92% pass rate, +19% over its strongest baseline.
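A toy version of the promote-and-retrieve step might look like the following. The record fields mirror the three parts named above (symptom, executable action, evidence), but the class names, the word-overlap matching, and the `min_overlap` threshold are illustrative assumptions, not the XPU format or SetupX's retrieval method.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExperienceRecord:
    symptom: str          # natural-language description of the failure
    action: List[str]     # executable commands that fixed it
    evidence: str         # what verification observed after the fix


class ExperienceStore:
    def __init__(self) -> None:
        self._records: List[ExperienceRecord] = []

    def promote(self, record: ExperienceRecord) -> None:
        """Store a fix; callers should promote only verified sequences."""
        self._records.append(record)

    def retrieve(self, symptom: str, min_overlap: float = 0.5) -> Optional[ExperienceRecord]:
        """Return the record whose symptom shares the most words with the query."""
        query = set(symptom.lower().split())
        best, best_score = None, 0.0
        for record in self._records:
            known = set(record.symptom.lower().split())
            # Jaccard similarity between the two word sets
            score = len(query & known) / max(len(query | known), 1)
            if score > best_score:
                best, best_score = record, score
        return best if best_score >= min_overlap else None
```

The threshold matters: returning a weakly matching fix is how a low-abstraction memory gets applied to the wrong repo.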

Three preconditions, or reach for something simpler

This is not the default for setup. Use it only when all three hold:

No usable dev-environment artifact exists upstream. A devcontainer, Nix flake, or pinned Dockerfile gets you a working environment in one declarative pull and beats any trial loop.
Repos are heterogeneous but share substrate: the same package-manager family, so executable fixes transfer.
Verification is ambiguous. A single clean make test doesn't need a two-role protocol.
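The three preconditions can be encoded as a small gate that runs before the pipeline does. This is a sketch only: `RepoAudit` and `pipeline_warranted` are hypothetical names, and filling in the three flags for a real repo is left to the caller.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class RepoAudit:
    has_upstream_env_artifact: bool   # devcontainer, Nix flake, or pinned Dockerfile
    shares_substrate: bool            # same package-manager family as past repos
    verification_is_ambiguous: bool   # no single clean test command settles it


def pipeline_warranted(audit: RepoAudit) -> Tuple[bool, List[str]]:
    """Return (True, []) if all preconditions hold, else (False, reasons to go simpler)."""
    reasons: List[str] = []
    if audit.has_upstream_env_artifact:
        reasons.append("use the upstream devcontainer, Nix flake, or pinned Dockerfile")
    if not audit.shares_substrate:
        reasons.append("executable fixes will not transfer across toolchains")
    if not audit.verification_is_ambiguous:
        reasons.append("a single clean test command is enough verification")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict makes it clear which simpler option applies when the gate says no.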

Where it backfires

On a CI hot path with hundreds of matrix entries, per-command docker commit compounds — pre-bake the image instead. For a single-shot setup, the experience store never amortizes; the first run dominates. And across truly heterogeneous toolchains (a pnpm monorepo vs. a cargo workspace), executable fixes share almost nothing, and low-abstraction memories cause negative transfer.

↪ Your win: setup that's reversible and actually verified

Snapshot before state changes — docker commit turns environment pollution into a revert.
Exclude read-only commands — snapshotting reads makes overhead dominate.
Split verification — prosecutor gathers evidence, judge renders the verdict; exit codes lie.
Promote verified fixes — write the dual-modality record back so the next repo reuses it.
Check the three preconditions first — a devcontainer or one clean make test beats the whole pipeline.

Retrieval practice — recall, don't peek

Question 1: The snapshot is taken before…

Question 2: The prosecutor-judge split exists because…

Question 3: You should skip this pipeline entirely when the repo ships…

Question 4: Promoting a verified fix back to the experience store lets the next repo…

Question 5 · spaced recall from Lesson 7: A background agent hands work to a human best at roughly…
