Context Engineering · ~6 min
The model writes against the API it learned, not the one you've installed. Tell it which version is real, and the failures stop.
package-lock.json says otherwise. Close that gap and you get correct code on the first try in
daily work (#1) — no more debugging calls that were renamed two minor versions ago.A model that aces an isolated coding puzzle can still fail badly the moment that code has to run under a specific library version. The capability is there. The missing piece is context about your environment.
On standard benchmarks (HumanEval+, MBPP) that test isolated functions with no version constraints, models score 80%+ Pass@1. Add the requirement that the code run under specific library versions and the same models collapse.
On the VersiBCB benchmark — code that must target specific library versions — Pass@1 drops to just 13–28%. The capability didn't change between benchmarks; the only thing added was version constraints. The deficit is context, not capability.
An independent benchmark, GitChameleon, confirms the pattern: enterprise models hit only 48–51% on version-conditioned Python tasks across 26 libraries. And in reproducibility studies, 31.7% of AI-generated code fails at runtime purely from environment mismatches.
Models train on web-scale code corpora that contain far more examples of older API surfaces than current ones. So the model develops a systematic preference for deprecated patterns — and without environment context, it has no signal to deviate from the most common (often stale) version it saw.
torch,
transformers, tensorflow — show the steepest accuracy drops, because their API surfaces
change across minor versions. Stable standard-library modules rarely trigger a mismatch.One renamed parameter. Trivial to fix once you see it — expensive to debug when nothing told the agent which version is live.
The fix is to make the environment explicit context. Three moves, cheapest first:
requirements.txt, pyproject.toml, or
package-lock.json in context as a version manifest. Workspace-indexing tools can surface these
automatically; otherwise paste the relevant dep lines. Excerpt the declarations — a full resolved lockfile can run
thousands of tokens and crowd out what matters more.Across the adaptation strategies tested, models are 2–3× better at adapting existing code to a target environment than generating version-correct code from scratch. When you can, hand the agent working code to migrate rather than asking it to invent the right calls.
And keep the loop honest: a runtime error trace like AttributeError: module 'torch' has no attribute
'compile' carries a version-specific signal. Feed it back in — it's a corrective the model can act on.
Retrieval practice — recall, don't peek
Question 1Adding version constraints drops Pass@1 from 80%+ down to…
Question 2Models default to deprecated APIs mainly because…
Question 3The steepest version-gap accuracy drops show up in…
Question 4When the installed version postdates the training cutoff, you should…
Question 5 · spaced recall from Lesson 14A repository map gives the agent…