Reference · Canonical Language

Observability — Glossary

The working vocabulary for this course. Once a term lives here, every lesson uses this word for it. Grows as we go.

Making Agents Legible

Write-and-hope mode
An agent that only reads code and test output, with no visual, log, or metric signal — it changes the system and cannot confirm the change had its intended effect. The default state the rest of this course is about escaping.
Avoid: "untested" — the issue is unobserved effect, not absent tests.
Source: observability-legible-to-agents.md
Accessibility snapshot · a11y snapshot
A structured-text view of a rendered page — roles, names, states — that a model reasons about directly, with no vision model and a fraction of the tokens. Beats screenshots for functional checks; reserve screenshots for layout bugs.
Avoid: defaulting to screenshots — they need a vision model and cost more tokens per check.
Source: observability-legible-to-agents.md
Verification ladder
The cheapest-first ordering of verification signals: python -c → curl → browser driver → vision screenshot. Climb only when a cheaper signal cannot answer the question.
Source: observability-legible-to-agents.md · Simon Willison — agentic manual testing
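As a rough illustration of the cheapest-first climb, the sketch below tries each rung in order and stops at the first one that can answer. The rung functions are hypothetical stand-ins (they do not actually run python -c, curl, or a browser), and returning None to mean "this rung cannot answer" is an assumption made for the example.

```python
from typing import Callable, Optional

# Each rung returns an answer, or None when it cannot answer the question.
Rung = tuple[str, Callable[[str], Optional[bool]]]

def climb(ladder: list[Rung], question: str) -> tuple[Optional[str], Optional[bool]]:
    """Try rungs cheapest-first and stop at the first one that answers."""
    for name, check in ladder:
        answer = check(question)
        if answer is not None:
            return name, answer
    return None, None

# Hypothetical stand-ins for the real signals.
def http_check(question: str) -> Optional[bool]:
    # Pretend an HTTP-level check can only answer status-code questions.
    return True if "status" in question else None

def browser_check(question: str) -> Optional[bool]:
    # Pretend a browser driver can answer questions about rendered buttons.
    return True if "button" in question else None

ladder = [("http", http_check), ("browser", browser_check)]
print(climb(ladder, "does /health return status 200?"))
print(climb(ladder, "is the submit button enabled?"))
```

A question that no rung can answer falls off the top of the ladder and returns (None, None), which is the cue to add a more expensive signal.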
JIT reference · just-in-time reference
Storing a lightweight identifier — a query string, metric name, time range — and loading the high-volume payload only at verification time, so observability data does not flood the context window.
Avoid: pulling full log/metric payloads upfront — untrimmed queries shred the window.
Source: observability-legible-to-agents.md
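One way to picture a JIT reference is a small record that holds only the identifier and resolves to data on demand. This is a minimal sketch: the class name, its fields, the max_lines trim, and the fake_fetch backend are all illustrative assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class JitReference:
    """Lightweight pointer to observability data (field names are illustrative)."""
    query: str
    metric_name: str
    time_range: str

    def load(self, fetch: Callable[["JitReference"], list[str]], max_lines: int = 20) -> list[str]:
        """Resolve the reference at verification time, trimmed to max_lines."""
        return fetch(self)[:max_lines]

def fake_fetch(ref: JitReference) -> list[str]:
    # Stand-in for a real log or metric backend returning a large payload.
    return [f"{ref.metric_name} line {i}" for i in range(1000)]

# Only this small record needs to sit in context between steps.
ref = JitReference(query='status="500"', metric_name="http_errors", time_range="last_15m")

# The payload is pulled, and trimmed, only when a check needs it.
lines = ref.load(fake_fetch, max_lines=5)
print(lines)
```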

The Trail

Trajectory log · progress file pattern
A replayable audit trail of agent decisions across sessions, built from a progress file, git commits, a feature-state JSON, and an init.sh — no observability backend required.
Source: trajectory-logging-progress-files.md
Feature-state JSON
A machine-readable snapshot of features with passes/fails status, toggled to passes only after verification. Survives context resets as an independent record — its job is to block premature completion.
Avoid: flipping a feature to passes on the model's say-so — anchor it to a deterministic check.
Source: trajectory-logging-progress-files.md
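A minimal sketch of anchoring the toggle to a deterministic check, assuming a flat feature-name-to-status mapping (the real file's schema is not specified here). The verify_and_mark helper and the two inline check commands are hypothetical; in practice the command would be a real test invocation.

```python
import json
import subprocess
import sys

def verify_and_mark(state: dict, feature: str, check_cmd: list[str]) -> dict:
    """Flip a feature to 'passes' only if the deterministic check exits 0."""
    result = subprocess.run(check_cmd, capture_output=True)
    state[feature] = "passes" if result.returncode == 0 else "fails"
    return state

# Every feature starts as 'fails' until a check says otherwise.
state = {"login": "fails", "export": "fails"}

# Hypothetical checks: one that succeeds, one that exits non-zero.
verify_and_mark(state, "login", [sys.executable, "-c", "assert 1 + 1 == 2"])
verify_and_mark(state, "export", [sys.executable, "-c", "raise SystemExit(1)"])

print(json.dumps(state, indent=2))
```

The status is derived from the exit code alone, so the model's own claim of completion never reaches the file.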
prompt.id
The Claude Code OTel correlation key shared by every event in one prompt cycle. A single prompt fans out to dozens of API calls; prompt.id is what makes tracing a cost spike feasible after the fact. Excluded from metrics (unbounded cardinality).
Avoid: using per-request IDs as metric labels — they create unbounded time series.
Source: agent-observability-otel.md
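To show why a shared correlation key helps after the fact, the sketch below groups exported events by prompt.id and totals their cost. The event dictionaries, the cost_usd field, and the sample values are assumptions for illustration; check the actual exported event schema before relying on any field name.

```python
from collections import defaultdict

# Illustrative events: one prompt fans out to several API calls.
events = [
    {"prompt.id": "p-1", "event": "api_request", "cost_usd": 0.02},
    {"prompt.id": "p-1", "event": "api_request", "cost_usd": 0.03},
    {"prompt.id": "p-2", "event": "api_request", "cost_usd": 0.01},
    {"prompt.id": "p-1", "event": "tool_result"},  # no cost attached
]

def cost_by_prompt(events: list[dict]) -> dict[str, float]:
    """Sum cost per prompt cycle using prompt.id as the join key."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["prompt.id"]] += event.get("cost_usd", 0.0)
    return dict(totals)

totals = cost_by_prompt(events)
most_expensive = max(totals, key=totals.get)
print(most_expensive, totals)
```

This grouping happens over events, not metrics: the key stays out of metric labels, where every distinct value would create a new time series.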

The Source of Truth

Event sourcing for agents · ESAA
A pattern where agents emit structured JSON intentions and a deterministic orchestrator validates, persists, and applies them — separating the cognitive layer from state mutation, and enabling replay verification.
Avoid: letting agents write files directly — that is the non-deterministic mutation ESAA exists to prevent.
Source: event-sourcing-for-agents.md · arXiv:2602.23193
Append-only log · activity.jsonl
The immutable, never-modified event record that is the source of truth. Validation is fail-closed: malformed or out-of-contract events return an error and never enter the log, which is what makes replay trustworthy.
Source: event-sourcing-for-agents.md
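Fail-closed validation can be sketched as "validate fully, then append, and write nothing on any error". The required fields and allowed event types below are a hypothetical contract invented for the example, not the contract from the source.

```python
import json
import os
import tempfile

REQUIRED_FIELDS = {"type", "task_id"}               # hypothetical contract
ALLOWED_TYPES = {"task_started", "task_completed"}  # hypothetical contract

def append_event(log_path: str, raw: str) -> None:
    """Validate, then append. Raises ValueError and writes nothing on bad input."""
    try:
        event = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"malformed event: {exc}") from exc
    if not isinstance(event, dict) or not REQUIRED_FIELDS.issubset(event):
        raise ValueError("out-of-contract event: missing required fields")
    if event["type"] not in ALLOWED_TYPES:
        raise ValueError(f"out-of-contract event type: {event['type']!r}")
    # Append mode only: existing lines are never rewritten.
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "activity.jsonl")
append_event(log_path, '{"type": "task_started", "task_id": "T1"}')

# A truncated JSON string and an unknown event type are both rejected.
for bad in ('{"type": "task_started"', '{"type": "rm_rf", "task_id": "T1"}'):
    try:
        append_event(log_path, bad)
    except ValueError as err:
        print("rejected:", err)

with open(log_path, encoding="utf-8") as f:
    logged = f.read().splitlines()
print(logged)
```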
Replay verification
Re-deriving project state from scratch by replaying the event log, then checking it matches the filesystem. Confirms execution integrity and gives forensic traceability, immutability, and reproducibility at once.
Source: event-sourcing-for-agents.md · verification-ledger.md
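A minimal sketch of the replay check, assuming each event records a file path and the SHA-256 of the content written. The event shapes (file_written, file_deleted) are hypothetical and chosen only to keep the example small.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def replay(events: list[dict]) -> dict[str, str]:
    """Re-derive the expected {path: content hash} state from the log alone."""
    expected: dict[str, str] = {}
    for event in events:
        if event["type"] == "file_written":
            expected[event["path"]] = event["sha256"]
        elif event["type"] == "file_deleted":
            expected.pop(event["path"], None)
    return expected

def snapshot(root: Path) -> dict[str, str]:
    """Hash every file that actually exists under root."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p.read_bytes())
        for p in root.rglob("*") if p.is_file()
    }

def verify(events: list[dict], root: Path) -> bool:
    return replay(events) == snapshot(root)

root = Path(tempfile.mkdtemp())
(root / "app.py").write_bytes(b"print('hi')\n")
events = [
    {"type": "file_written", "path": "notes.txt", "sha256": sha256_of(b"draft")},
    {"type": "file_written", "path": "app.py", "sha256": sha256_of(b"print('hi')\n")},
    {"type": "file_deleted", "path": "notes.txt"},
]
ok_before = verify(events, root)

# An edit that bypasses the log makes the filesystem diverge from the replay.
(root / "app.py").write_bytes(b"print('tampered')\n")
ok_after = verify(events, root)
print(ok_before, ok_after)
```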
Materialized view · roadmap.json
A compact, continuously-rebuilt projection of current task status given to agents as context instead of growing conversation history — directly countering context degradation in long-horizon work.
Source: event-sourcing-for-agents.md
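The projection can be pictured as a fold over the event log that keeps only the latest status per task. The event types and status names here are assumptions for illustration; the actual roadmap.json layout is not given in this glossary.

```python
import json

# Hypothetical mapping from event type to the status it implies.
STATUS_BY_EVENT = {
    "task_created": "todo",
    "task_started": "in_progress",
    "task_completed": "done",
}

def project_roadmap(events: list[dict]) -> dict[str, str]:
    """Rebuild the compact current-status view from the full event history."""
    roadmap: dict[str, str] = {}
    for event in events:
        status = STATUS_BY_EVENT.get(event["type"])
        if status is not None:
            roadmap[event["task_id"]] = status
    return roadmap

events = [
    {"type": "task_created", "task_id": "T1"},
    {"type": "task_created", "task_id": "T2"},
    {"type": "task_started", "task_id": "T1"},
    {"type": "task_completed", "task_id": "T1"},
    {"type": "task_started", "task_id": "T2"},
]
roadmap = project_roadmap(events)
print(json.dumps(roadmap))
```

Five events collapse to two entries, and the view stays the same size however long the history grows for a fixed set of tasks.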

Diagnosing Failure

Four failure modes
The classification every agent failure falls into: missing context, conflicting instructions, missing/blocked tools, capability ceiling. Determine which applies before changing anything.
Avoid: rewording the prompt or swapping the model before classifying the layer that actually failed.
Source: agent-debugging.md
Reproduction · fresh-session test
Re-running the same inputs in a fresh session to separate structural bugs from session-specific drift. Recurs → structural (instructions, skills, tools). Doesn't recur → session-specific (context drift or overflow).
Source: agent-debugging.md
Capability ceiling
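The failure mode where the task genuinely exceeds the model tier, so that no context, instruction, or tool change fixes it. Reached only after the other three modes are ruled out: what the agent can see (context), what it was told (instructions), and what it can do (tools). Escalate the tier rather than tune further.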
Avoid: blaming the tier first — capability is the last step, not the reflex.
Source: agent-debugging.md
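The ordering can be sketched as a small triage function that reaches "capability ceiling" only when every earlier check passes. The three boolean inputs and the reading of see/told/do as context, instructions, and tools are assumptions made for this example.

```python
def classify_failure(can_see: bool, told_consistently: bool, can_do: bool) -> str:
    """Classify a failure, checking the cheaper-to-fix layers first."""
    if not can_see:
        return "missing context"
    if not told_consistently:
        return "conflicting instructions"
    if not can_do:
        return "missing/blocked tools"
    # Only reached when context, instructions, and tools all check out.
    return "capability ceiling"

print(classify_failure(can_see=False, told_consistently=True, can_do=True))
print(classify_failure(can_see=True, told_consistently=True, can_do=True))
```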

Stopping the Bleeding

Micro-loop
An intra-session cycle — edit, test, same failure, edit again — where each pass looks like progress from the inside while the context window quietly burns.
Source: loop-detection.md
Doom-loop detection
Catching identical tool-call / error pairs and stopping iteration rather than nudging — because identical failures will not self-resolve. Distinct from edit-count tracking, which counts edits per file.
Avoid: nudging on a doom loop — identical errors need a stop, not a reminder.
Source: loop-detection.md
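A minimal sketch of the detection step: count identical (tool call, error) pairs and signal a stop once a pair repeats too often. The class name and the threshold of three are illustrative assumptions; the source does not fix a number.

```python
from collections import Counter

class DoomLoopDetector:
    """Signals a stop when the same tool call keeps producing the same error."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats  # assumed threshold, tune per project
        self.seen: Counter = Counter()

    def record(self, tool_call: str, error: str) -> bool:
        """Record one failure. Returns True when iteration should stop."""
        self.seen[(tool_call, error)] += 1
        return self.seen[(tool_call, error)] >= self.max_repeats

detector = DoomLoopDetector(max_repeats=3)
signals = [
    detector.record("pytest tests/", "AssertionError: expected 2"),
    detector.record("pytest tests/", "AssertionError: expected 2"),
    detector.record("pytest tests/", "ImportError: no module named x"),
    detector.record("pytest tests/", "AssertionError: expected 2"),
]
print(signals)
```

A different error from the same command counts as a new pair, so genuine movement between failures does not trip the detector.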
Circuit breaker
A stopping mechanism that halts an agent when progress stalls. Typical triggers: iteration limit, repeated failure, repetition, context budget, or cost threshold. A runtime-enforced maxTurns cannot be overridden by the agent; instruction-level stops can.
Avoid: instruction-only stops for safety-critical halts — prefer maxTurns or hooks.
Source: circuit-breakers.md
Graceful degradation
The required behavior when a breaker trips: stop new actions, return the partial results already completed, and explain what triggered the stop and what remains. Partial work beats a discarded session.
Source: circuit-breakers.md
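A breaker with graceful degradation might look like the sketch below: two triggers (an iteration limit and repeated failure), and a result that always carries the completed work, the remaining work, and the stop reason. The Outcome type, the run_with_breaker helper, and both limits are hypothetical; this is a plain loop, not the runtime's own maxTurns setting.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    completed: list
    remaining: list
    stop_reason: Optional[str]  # None means the run finished normally

def run_with_breaker(tasks: list, step: Callable[[str], bool],
                     max_turns: int = 5, max_failures: int = 2) -> Outcome:
    """Run tasks in order; on a trip, return partial results instead of raising."""
    completed: list = []
    pending = list(tasks)
    turns = failures = 0
    while pending:
        if turns >= max_turns:
            return Outcome(completed, pending, "iteration limit reached")
        turns += 1
        task = pending[0]
        if step(task):
            completed.append(pending.pop(0))
            failures = 0
        else:
            failures += 1
            if failures >= max_failures:
                return Outcome(completed, pending, f"repeated failure on {task!r}")
    return Outcome(completed, [], None)

# Hypothetical step function: task "c" always fails.
stalled = run_with_breaker(["a", "b", "c", "d"], lambda t: t != "c", max_turns=10)
capped = run_with_breaker(["a", "b", "c"], lambda t: True, max_turns=2)
print(stalled)
print(capped)
```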

Gates & Evals

Outcome grading
Evaluating an agent by the final state it produces, not the execution path it took. Path-based grading penalizes valid alternative solutions; deterministic tests are the most reliable, path-agnostic outcome grader.
Avoid: asserting a specific tool order or function name — that embeds the eval author's assumptions as "correct".
Source: grade-agent-outcomes.md
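To make the path-agnostic point concrete, the sketch below grades two runs that took different tool sequences but ended in the same state. The run records, tool names, and state keys are invented for illustration.

```python
def grade_outcome(final_state: dict, expected: dict) -> bool:
    """Pass when every expected key has the expected value; the path is ignored."""
    return all(final_state.get(key) == value for key, value in expected.items())

expected = {"tests_pass": True, "config_version": 2}

# Two hypothetical runs with different tool orders and the same end state.
run_a = {
    "tool_calls": ["read_file", "edit_file", "run_tests"],
    "final_state": {"tests_pass": True, "config_version": 2},
}
run_b = {
    "tool_calls": ["run_tests", "grep", "edit_file", "run_tests"],
    "final_state": {"tests_pass": True, "config_version": 2},
}
run_c = {
    "tool_calls": ["read_file", "edit_file", "run_tests"],
    "final_state": {"tests_pass": False, "config_version": 2},
}

grades = [grade_outcome(run["final_state"], expected) for run in (run_a, run_b, run_c)]
print(grades)
```

run_c follows the same path as run_a and still fails, which a tool-order assertion would have missed.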
Judge precision floor
The reliability threshold (~70%) below which an LLM judge's verdicts should not be aggregated. Beneath it, macro analysis amplifies the judge's mistakes into "behavior patterns" that are really recurring judge errors.
Source: macro-evals-agentic-systems.md · AgentRewardBench
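One way to apply the floor is to measure the judge against a human-labeled sample before aggregating anything. The sketch assumes precision in the usual sense (of the traces the judge flagged, the fraction a human also flagged); the sample labels are invented.

```python
PRECISION_FLOOR = 0.70  # the ~70% floor from the definition above

def judge_precision(judge_flags: list[bool], human_flags: list[bool]) -> float:
    """Precision = traces flagged by both / all traces the judge flagged."""
    true_pos = sum(1 for j, h in zip(judge_flags, human_flags) if j and h)
    flagged = sum(1 for j in judge_flags if j)
    return true_pos / flagged if flagged else 0.0

def safe_to_aggregate(judge_flags: list[bool], human_flags: list[bool],
                      floor: float = PRECISION_FLOOR) -> bool:
    return judge_precision(judge_flags, human_flags) >= floor

# Invented calibration samples: judge verdicts next to human labels.
good_judge = safe_to_aggregate([True, True, True, True, False],
                               [True, True, False, True, False])
weak_judge = safe_to_aggregate([True, True, True, False],
                               [True, False, False, False])
print(good_judge, weak_judge)
```

The first judge is right on 3 of its 4 flags (0.75) and clears the floor; the second is right on 1 of 3 and does not.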
Self-preference bias
The tendency of a model grading its own output to mark it as passing a rubric up to 50% more often than a neutral evaluator. The reason to cross-check fixes with a different model family.
Avoid: using the same model to generate and grade without a cross-check.
Source: agent-transcript-analysis.md
Macro eval
The population-level layer above per-call and per-trace evals, surfacing recurring patterns that are properties of the corpus, not any single run. Earns its keep only with thousands of traces, a judge above the precision floor, and real cross-trace structure — clusters are hypotheses, not verdicts.
Avoid: reading a macro cluster as a diagnosis — the analysis pool is selection-biased toward flagged traces.
Source: macro-evals-agentic-systems.md