Reference · Canonical Language

Observability — Glossary

The working vocabulary for this course. Once a term is defined here, every lesson uses that word for it. The list grows as the course goes on.

Making Agents Legible

Write-and-hope mode: An agent that only reads code and test output, with no visual, log, or metric signal — it changes the system and cannot confirm the change had its intended effect. The default state the rest of this course is about escaping.; Avoid: "untested" — the issue is unobserved effect, not absent tests.; Source: observability-legible-to-agents.md
Accessibility snapshot · a11y snapshot: A structured-text view of a rendered page — roles, names, states — that a model reasons about directly, with no vision model and a fraction of the tokens. Beats screenshots for functional checks; reserve screenshots for layout bugs.; Avoid: defaulting to screenshots — they need a vision model and cost more tokens per check.; Source: observability-legible-to-agents.md
Verification ladder: The cheapest-first ordering of verification signals: python -c → curl → browser driver → vision screenshot. Climb only when a cheaper signal cannot answer the question.; Source: observability-legible-to-agents.md · Simon Willison — agentic manual testing
JIT reference · just-in-time reference: Storing a lightweight identifier — a query string, metric name, time range — and loading the high-volume payload only at verification time, so observability data does not flood the context window.; Avoid: pulling full log/metric payloads upfront — untrimmed queries shred the window.; Source: observability-legible-to-agents.md
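
As a rough illustration of the JIT reference entry above, the sketch below keeps only a small identifier in context and loads a trimmed payload at verification time. The `JitRef` class, its field names, `resolve`, and the `fake_fetch` backend are all hypothetical stand-ins, not part of any real observability API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class JitRef:
    """Lightweight pointer to observability data (hypothetical field names)."""
    query: str       # e.g. a log query string
    metric: str      # metric name
    time_range: str  # e.g. "15m"


def resolve(ref: JitRef, fetch: Callable[[JitRef], List[str]], limit: int = 20) -> List[str]:
    """Load the payload only when it is needed, trimmed to `limit` rows."""
    return fetch(ref)[:limit]


def fake_fetch(ref: JitRef) -> List[str]:
    """Stand-in backend; a real one would query a log or metrics store."""
    return [f"{ref.metric} sample {i}" for i in range(1000)]


# Only this small object needs to live in the agent's context.
ref = JitRef(query='status="500"', metric="http_errors", time_range="15m")

# At verification time, pull a bounded slice instead of the full payload.
rows = resolve(ref, fake_fetch, limit=5)
print(len(rows), rows[0])
```

The `limit` argument is the point of the design: the reference is cheap to carry around, and the expensive fetch is both deferred and bounded.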

Trajectory log · progress file pattern: A replayable audit trail of agent decisions across sessions, built from a progress file, git commits, a feature-state JSON, and an init.sh — no observability backend required.; Source: trajectory-logging-progress-files.md
Feature-state JSON: A machine-readable snapshot of features with passes/fails status, toggled to passes only after verification. Survives context resets as an independent record — its job is to block premature completion.; Avoid: flipping a feature to passes on the model's say-so — anchor it to a deterministic check.; Source: trajectory-logging-progress-files.md
prompt.id: The Claude Code OTel correlation key shared by every event in one prompt cycle. A single prompt fans out to dozens of API calls; prompt.id is what makes tracing a cost spike feasible after the fact. Excluded from metrics (unbounded cardinality).; Avoid: using per-request IDs as metric labels — they create unbounded time series.; Source: agent-observability-otel.md
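
The feature-state JSON entry above could look something like the following minimal sketch, assuming a flat mapping from feature name to status. The feature names and the `mark_verified` helper are hypothetical, and the lambdas stand in for real deterministic checks such as a test run.

```python
import json
from typing import Callable, Dict


def mark_verified(state: Dict[str, str], feature: str, check: Callable[[], bool]) -> Dict[str, str]:
    """Set `feature` to "passes" only if the deterministic check succeeds."""
    updated = dict(state)  # copy, so the caller's record is not mutated
    updated[feature] = "passes" if check() else "fails"
    return updated


# Hypothetical features; everything starts as "fails".
state = {"login": "fails", "export-csv": "fails"}

state = mark_verified(state, "login", lambda: 2 + 2 == 4)   # stand-in for a passing test run
state = mark_verified(state, "export-csv", lambda: False)   # stand-in for a failing test run

# The serialized form is what would survive a context reset.
serialized = json.dumps(state, sort_keys=True)
print(serialized)
```

Note that the status never changes on the model's own claim: the only path to "passes" runs through `check`.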

Event sourcing for agents · ESAA: A pattern where agents emit structured JSON intentions and a deterministic orchestrator validates, persists, and applies them — separating the cognitive layer from state mutation, and enabling replay verification.; Avoid: letting agents write files directly — that is the non-deterministic mutation ESAA exists to prevent.; Source: event-sourcing-for-agents.md · arXiv:2602.23193
Append-only log · activity.jsonl: The immutable, never-modified event record that is the source of truth. Validation is fail-closed: malformed or out-of-contract events return an error and never enter the log, which is what makes replay trustworthy.; Source: event-sourcing-for-agents.md
Replay verification: Re-deriving project state from scratch by replaying the event log, then checking it matches the filesystem. Confirms execution integrity and gives forensic traceability, immutability, and reproducibility at once.; Source: event-sourcing-for-agents.md · verification-ledger.md
Materialized view · roadmap.json: A compact, continuously-rebuilt projection of current task status given to agents as context instead of growing conversation history — directly countering context degradation in long-horizon work.; Source: event-sourcing-for-agents.md
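
The four entries above fit together roughly as in this sketch: intentions arrive as JSON, validation is fail-closed, valid events are appended, and the current view is rebuilt by replay. This is an illustration under stated assumptions, not the ESAA reference implementation. The event contract (`task`, `status`), the allowed statuses, and the in-memory list standing in for activity.jsonl are all hypothetical.

```python
import json
from typing import Dict, List

ALLOWED_STATUS = {"todo", "in_progress", "done"}  # hypothetical contract


def validate(event: object) -> None:
    """Fail closed: raise on anything outside the contract."""
    if not isinstance(event, dict) or set(event) != {"task", "status"}:
        raise ValueError(f"malformed event: {event!r}")
    if not isinstance(event["task"], str) or event["status"] not in ALLOWED_STATUS:
        raise ValueError(f"out-of-contract event: {event!r}")


def append(log: List[str], raw: str) -> None:
    """Parse and validate first; only valid events ever reach the log."""
    event = json.loads(raw)  # invalid JSON raises json.JSONDecodeError (a ValueError)
    validate(event)
    log.append(json.dumps(event, sort_keys=True))


def replay(log: List[str]) -> Dict[str, str]:
    """Re-derive current task status from scratch by replaying every event."""
    view: Dict[str, str] = {}
    for line in log:
        event = json.loads(line)
        view[event["task"]] = event["status"]
    return view


log: List[str] = []  # stand-in for activity.jsonl
append(log, '{"task": "parser", "status": "in_progress"}')
append(log, '{"task": "parser", "status": "done"}')

rejected = 0
for bad in ['{"task": "parser", "status": "shipped"}', "not json"]:
    try:
        append(log, bad)
    except ValueError:
        rejected += 1  # rejected events never enter the log

view = replay(log)  # stand-in for the roadmap.json projection
print(view, len(log), rejected)
```

Replay verification would then compare `view` against the state actually on disk; that comparison is omitted here.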

Four failure modes: The classification every agent failure falls into: missing context, conflicting instructions, missing/blocked tools, capability ceiling. Determine which applies before changing anything.; Avoid: rewording the prompt or swapping the model before classifying the layer that actually failed.; Source: agent-debugging.md
Reproduction · fresh-session test: Re-running the same inputs in a fresh session to separate structural bugs from session-specific drift. Recurs → structural (instructions, skills, tools). Doesn't recur → session-specific (context drift or overflow).; Source: agent-debugging.md
Capability ceiling: The failure mode where the task genuinely exceeds the model tier, so no context, instruction, or tool change fixes it. Reached only after the see/told/do checks (context, instructions, tools) all pass; escalate the tier rather than tune further.; Avoid: blaming the tier first. Capability is the last step, not the reflex.; Source: agent-debugging.md
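
One way to read the three debugging entries above is as two small decision functions, sketched below. The function names and boolean inputs are hypothetical; in practice each input is the result of a manual investigation, not a flag that is readily available.

```python
def classify_failure(context_ok: bool, instructions_ok: bool, tools_ok: bool) -> str:
    """Classify a failure into one of the four modes, with capability checked last."""
    if not context_ok:
        return "missing context"
    if not instructions_ok:
        return "conflicting instructions"
    if not tools_ok:
        return "missing/blocked tools"
    # Reached only when every cheaper explanation has been ruled out.
    return "capability ceiling"


def classify_reproduction(recurs_in_fresh_session: bool) -> str:
    """Separate structural bugs from session-specific drift."""
    if recurs_in_fresh_session:
        return "structural"
    return "session-specific"


print(classify_failure(context_ok=True, instructions_ok=False, tools_ok=True))
print(classify_failure(context_ok=True, instructions_ok=True, tools_ok=True))
print(classify_reproduction(recurs_in_fresh_session=False))
```

The ordering inside `classify_failure` encodes the glossary's advice: the tier is the last thing to blame.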

Micro-loop: An intra-session cycle — edit, test, same failure, edit again — where each pass looks like progress from the inside while the context window quietly burns.; Source: loop-detection.md
Doom-loop detection: Catching identical tool-call / error pairs and stopping iteration rather than nudging — because identical failures will not self-resolve. Distinct from edit-count tracking, which counts edits per file.; Avoid: nudging on a doom loop — identical errors need a stop, not a reminder.; Source: loop-detection.md
Circuit breaker: A stopping mechanism that halts an agent when progress stalls — iteration limit, repeated failure, repetition, context budget, or cost threshold. Runtime-enforced maxTurns cannot be overridden; instruction-level stops can.; Avoid: instruction-only stops for safety-critical halts — prefer maxTurns or hooks.; Source: circuit-breakers.md
Graceful degradation: The required behavior when a breaker trips: stop new actions, return the partial results already completed, and explain what triggered the stop and what remains. Partial work beats a discarded session.; Source: circuit-breakers.md
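
The doom-loop, circuit-breaker, and graceful-degradation entries above can be combined in a small sketch like the one below. It replays a pre-recorded list of (tool call, error) outcomes; the `Breaker` class, its thresholds, and the result keys are hypothetical, and `max_turns` here is an ordinary Python field, not the runtime-enforced maxTurns setting the glossary mentions.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Step = Tuple[str, Optional[str]]  # (tool call, error message or None)


@dataclass
class Breaker:
    max_turns: int = 10   # hypothetical iteration limit
    max_repeats: int = 2  # identical (tool call, error) pairs tolerated
    turns: int = 0
    seen: Dict[Step, int] = field(default_factory=dict)

    def record(self, tool_call: str, error: Optional[str]) -> Optional[str]:
        """Record one outcome; return a stop reason, or None to continue."""
        self.turns += 1
        if error is not None:
            key = (tool_call, error)
            self.seen[key] = self.seen.get(key, 0) + 1
            if self.seen[key] > self.max_repeats:
                return f"doom loop: {tool_call!r} failed identically {self.seen[key]} times"
        if self.turns >= self.max_turns:
            return "iteration limit reached"
        return None


def run(steps: List[Step], breaker: Breaker) -> dict:
    """Process outcomes until the breaker trips, then return partial results."""
    completed: List[str] = []
    for i, (tool_call, error) in enumerate(steps):
        if error is None:
            completed.append(tool_call)
        reason = breaker.record(tool_call, error)
        if reason is not None:
            # Graceful degradation: keep finished work, say why, say what is left.
            return {
                "completed": completed,
                "stopped_because": reason,
                "not_attempted": [name for name, _ in steps[i + 1:]],
            }
    return {"completed": completed, "stopped_because": None, "not_attempted": []}


steps: List[Step] = [
    ("edit a.py", None),
    ("pytest", "AssertionError"),
    ("pytest", "AssertionError"),
    ("pytest", "AssertionError"),  # third identical failure trips the breaker
    ("edit b.py", None),
]
result = run(steps, Breaker())
print(result)
```

The breaker stops outright on the repeated pair instead of issuing a reminder, and the returned dictionary preserves the completed work alongside the trigger.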

Outcome grading: Evaluating an agent by the final state it produces, not the execution path it took. Path-based grading penalizes valid alternative solutions; deterministic tests are the most reliable, path-agnostic outcome grader.; Avoid: asserting a specific tool order or function name — that embeds the eval author's assumptions as "correct".; Source: grade-agent-outcomes.md
Judge precision floor: The reliability threshold (~70%) below which an LLM judge's verdicts should not be aggregated. Beneath it, macro analysis amplifies the judge's mistakes into "behavior patterns" that are really recurring judge errors.; Source: macro-evals-agentic-systems.md · AgentRewardBench
Self-preference bias: The tendency of a model grading its own output to mark it as passing a rubric up to 50% more often than a neutral evaluator. The reason to cross-check fixes with a different model family.; Avoid: using the same model to generate and grade without a cross-check.; Source: agent-transcript-analysis.md
Macro eval: The population-level layer above per-call and per-trace evals, surfacing recurring patterns that are properties of the corpus, not any single run. Earns its keep only with thousands of traces, a judge above the precision floor, and real cross-trace structure — clusters are hypotheses, not verdicts.; Avoid: reading a macro cluster as a diagnosis — the analysis pool is selection-biased toward flagged traces.; Source: macro-evals-agentic-systems.md
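
To make the outcome grading entry above concrete, here is a minimal sketch in which two runs take different paths but are graded only on the state they leave behind. The run records, their keys, and the `grade_outcome` helper are hypothetical.

```python
from typing import Any, Dict


def grade_outcome(final_state: Dict[str, Any], expected: Dict[str, Any]) -> bool:
    """Pass if every expected key has the expected value; the path is never inspected."""
    return all(final_state.get(key) == value for key, value in expected.items())


# Two hypothetical runs: different tool orders, same final state.
run_a = {
    "tool_calls": ["read", "edit", "test"],
    "final_state": {"tests_pass": True, "config_valid": True},
}
run_b = {
    "tool_calls": ["grep", "test", "edit", "test"],
    "final_state": {"tests_pass": True, "config_valid": True},
}
run_c = {
    "tool_calls": ["read", "edit", "test"],  # same path as run_a, worse outcome
    "final_state": {"tests_pass": False, "config_valid": True},
}

expected = {"tests_pass": True, "config_valid": True}

grades = [grade_outcome(run["final_state"], expected) for run in (run_a, run_b, run_c)]
print(grades)
```

A path-based grader that asserted the tool order of `run_a` would fail `run_b` and pass `run_c`, which is the opposite of what the final states justify.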