The Bill Is the Attack

Every threat so far targeted your data. This one targets your wallet. The service stays up, latency is fine, error rates are flat — and the bill drains anyway. Resource exhaustion is a threat in its own right.

Why this, for you: a stolen key or a crafted prompt can run an agent at full tilt and your DoS detection sees nothing — availability metrics register healthy. OWASP names this LLM10 Unbounded Consumption, and it binds two owners to one control surface: availability (DoS) and finance (denial-of-wallet). Real incidents reached $46K/day and $82K in 48 hours before any per-application alarm fired.

An LLM call's cost is variable and attacker-influenceable — input length, output length, tool-chain depth — and priced linearly. Requests-per-second doesn't bind dollars-per-second when one request costs $0.001 and the next $0.50. The same retry loop that drains the wallet can exhaust a rate-shared backend.
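To make the gap between requests and dollars concrete, here is a minimal sketch. The per-token prices and the `Call` type are illustrative assumptions, not any provider's actual pricing or API.

```python
from dataclasses import dataclass

# Hypothetical prices in dollars per token; real prices vary by model and provider.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000


@dataclass
class Call:
    input_tokens: int
    output_tokens: int


def call_cost(call: Call) -> float:
    """Cost of one LLM call: linear in input and output tokens."""
    return (call.input_tokens * PRICE_PER_INPUT_TOKEN
            + call.output_tokens * PRICE_PER_OUTPUT_TOKEN)


def request_cost(calls: list[Call]) -> float:
    """Cost of one user request, which may fan out into a chain of calls."""
    return sum(call_cost(c) for c in calls)


# Both of these count as exactly one request to a requests-per-second limiter.
cheap = [Call(input_tokens=200, output_tokens=50)]
expensive = [Call(input_tokens=20_000, output_tokens=4_000) for _ in range(5)]
```

Under these assumed prices the cheap request costs about a tenth of a cent and the expensive one about sixty cents, yet a request counter sees one request each.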

1 One surface, two owners, four sub-classes

OWASP LLM10:2025 Unbounded Consumption names four sub-classes the same harness can produce. Three of them share the structural feature above: cost is variable, attacker-influenceable, and linear.

Sub-class	Mechanism	Owner
Variable-length input	Oversized input drives CPU/memory until the service degrades	Availability
Denial of wallet	Token consumption drains a pay-per-use bill; service stays up	Finance
Resource amplification	Crafted input triggers the model's most expensive paths	Both
Model replication	API access mints synthetic data for a derivative model	Product/legal

Bounding the runaway retry loop is a security control, not a finance preference — the same loop that drains the wallet exhausts a shared backend. DoS and denial-of-wallet are the same surface.
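One way to bound a retry loop is to cap both the attempt count and the dollars the loop may spend. The sketch below is a hedged illustration: `call_with_bounded_retries`, `BudgetExceeded`, and the `estimate_cost` callback are hypothetical names, and the limits are placeholder values.

```python
class BudgetExceeded(Exception):
    """Raised when the retry loop hits its attempt or dollar bound."""


def call_with_bounded_retries(call, estimate_cost, max_attempts=3, max_dollars=0.50):
    """Run `call` with retries, stopping at whichever bound is reached first.

    `estimate_cost` is an assumed callback returning the dollar cost of one attempt.
    Returns (result, dollars_spent) on success.
    """
    spent = 0.0
    last_error = None
    for _ in range(max_attempts):
        cost = estimate_cost()
        if spent + cost > max_dollars:
            # Refuse the attempt before spending, not after.
            raise BudgetExceeded(f"dollar bound reached after ${spent:.2f}") from last_error
        spent += cost
        try:
            return call(), spent
        except RuntimeError as exc:  # treat RuntimeError as the retryable failure
            last_error = exc
    raise BudgetExceeded(f"gave up after {max_attempts} attempts") from last_error
```

The dollar check runs before each attempt, so a loop that keeps failing stops on spend even when the attempt cap is set generously.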

2 Five bounds, because no single unit covers cost

Each bound keys on a different unit of cost; their union covers what no single unit captures. A familiar iteration cap (LangChain ships max_iterations=15) is blind to per-step cost: a fast agent can burn ten iterations in eight seconds and stay under the cap the whole time.

Bound	What it caps	What it misses alone
Per-call token cap	One call's output size	Multi-call tool chains; expensive inputs
Per-task iteration cap	Agent loop depth	Cost variance per iteration
Fan-out concurrency cap	Parallel sub-agent breadth	Sequential, long-running expense
Cost-velocity breaker	Rolling $/min per principal	Pre-existing baseline; first-time spikes
Per-day dollar budget	Absolute spend ceiling	Within-day burst windows

# Tuple-keyed on (user, repo, model) so one runaway repo
# does not block the user's other work
limits:
  per_call_max_tokens: 8192
  per_task_max_iterations: 15
  fan_out_concurrency: 4
  cost_velocity: { window_minutes: 5, multiplier_over_rolling_avg: 8, action: pause }
  per_day_dollar_budget: { claude_opus: 200.00, on_exhaust: block_until_window }

Remove any one bound and a documented amplification path stays open. They are complementary by design.
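A deterministic check over all five bounds might look like the sketch below. The default values mirror the limits config above; the `Usage` fields and the `violated_bounds` function are hypothetical names chosen for illustration, and a real platform would populate them from its own metering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    per_call_max_tokens: int = 8192
    per_task_max_iterations: int = 15
    fan_out_concurrency: int = 4
    velocity_multiplier: float = 8.0
    per_day_dollar_budget: float = 200.00


@dataclass
class Usage:
    """Assumed snapshot of one (user, repo, model) tuple before the next call."""
    requested_output_tokens: int
    task_iterations: int
    running_subagents: int
    dollars_last_window: float
    rolling_avg_dollars_per_window: float
    dollars_today: float


def violated_bounds(u: Usage, lim: Limits) -> list[str]:
    """Return the name of every bound the next call would violate."""
    violations = []
    if u.requested_output_tokens > lim.per_call_max_tokens:
        violations.append("per_call_token_cap")
    if u.task_iterations >= lim.per_task_max_iterations:
        violations.append("per_task_iteration_cap")
    if u.running_subagents >= lim.fan_out_concurrency:
        violations.append("fan_out_concurrency_cap")
    avg = u.rolling_avg_dollars_per_window
    if avg > 0 and u.dollars_last_window > lim.velocity_multiplier * avg:
        violations.append("cost_velocity")
    if u.dollars_today >= lim.per_day_dollar_budget:
        violations.append("per_day_dollar_budget")
    return violations
```

An agent at iteration 10 of 15 that is spending nine times its rolling average passes the iteration cap and trips only the velocity bound, which is the case a single cap would miss.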

3 Why fixed limits fail

A flat "100 calls/min" limit fails in both directions. It misses the low-and-slow denial-of-wallet pattern, which is hard to distinguish from legitimate traffic, and it over-triggers on bursty real work: a summarisation task with retrieval, chunking, three LLM calls, and storage trips a tight bucket. The result is that "one rogue script blocks all the user's legitimate work, including the work they need to debug the rogue script." Tuple-keyed limits on (user, repo, model) plus rolling-average velocity are the working shape.
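A rolling-average velocity breaker keyed on the (user, repo, model) tuple can be sketched as follows. The class name, its parameters, and the fixed-window bookkeeping are assumptions made for illustration. Idle windows are not back-filled as zero spend, which a production version would likely want to handle.

```python
from collections import defaultdict, deque


class CostVelocityBreaker:
    """Pause a key when its current-window spend exceeds a multiple of its rolling average."""

    def __init__(self, window_seconds=300, multiplier=8.0, history_windows=12):
        self.window_seconds = window_seconds
        self.multiplier = multiplier
        # key -> spend totals of recently closed windows
        self.history = defaultdict(lambda: deque(maxlen=history_windows))
        # key -> (window index, spend so far in that window)
        self.current = {}

    def record(self, key, dollars, now):
        """Record spend for `key` at time `now` (seconds); return True if it should pause."""
        idx = int(now // self.window_seconds)
        cur_idx, spend = self.current.get(key, (idx, 0.0))
        if idx != cur_idx:
            # The previous window closed: move its total into the rolling history.
            self.history[key].append(spend)
            spend = 0.0
        spend += dollars
        self.current[key] = (idx, spend)

        past = self.history[key]
        if not past:
            return False  # no baseline yet; first-time spikes need the other bounds
        baseline = sum(past) / len(past)
        return baseline > 0 and spend > self.multiplier * baseline
```

Because state is held per tuple, a runaway script in one repo pauses only that repo's key; the same user's work in another repo keeps its own baseline and keeps running.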

Where the bounds backfire — and which belong in the path

The bounds add config surface and false-positive risk, and some conditions invert the trade: a single-shot summariser has no loop to bound, and trusted internal-only callers collapse the denial-of-wallet vector. Two pitfalls matter most. First, per-call max_tokens is blind to chains: one study showed 658× cost amplification by coercing verbose multi-turn chains past a 4K per-call cap, so the chain-level bounds (iteration, velocity) are the control. Second, don't put a brittle classifier in the enforcement path: a 30-character adversarial suffix caused one LLM-based guard to block over 97% of legitimate requests, so the safeguard itself becomes the DoS. Deterministic counters enforce; semantic checks detect.
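The split between enforcing and detecting can be sketched as a gate where only a counter may block and a classifier score may only raise a flag. The `gate` function, the `Decision` type, and the 0.8 threshold are hypothetical; the score is assumed to come from whatever semantic check the platform already runs.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    allowed: bool
    reason: str
    flags: list[str] = field(default_factory=list)


def gate(iterations_used: int, max_iterations: int,
         suspicious_score: float, flag_threshold: float = 0.8) -> Decision:
    """Block on the deterministic counter only; the semantic score just raises a flag."""
    flags = []
    if suspicious_score >= flag_threshold:
        # Detect: surfaced for alerting and review, never used to deny the request.
        flags.append("semantic_check_flagged")
    if iterations_used >= max_iterations:
        # Enforce: a plain counter an adversarial suffix cannot talk its way around.
        return Decision(False, "iteration_cap", flags)
    return Decision(True, "within_bounds", flags)
```

With this shape, an input that fools the classifier into a high score still gets served while it is under the cap, so a manipulated guard cannot turn into a denial of service for legitimate callers.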

↪ Your win: bound consumption as a security control

Treat the bill as an attack surface — denial-of-wallet leaves availability metrics healthy.
Bind DoS and wallet together — one bound surface serves both owners; don't duplicate.
Stack all five bounds — each keys on a different cost unit; the union is the coverage.
Tuple-key the limits — (user, repo, model) plus rolling velocity beats fixed RPS.
Keep classifiers out of the path — deterministic counters enforce; semantic checks only detect.

Retrieval practice — recall, don't peek

Question 1: Denial-of-wallet is dangerous specifically because…

Question 2: OWASP LLM10 binds DoS and denial-of-wallet because they…

Question 3: Five separate bounds are needed because each one…

Question 4: A per-call max_tokens cap is blind to…

Question 5 · spaced recall from Lesson 17: The multitenant RAG fix moves authorization to…

Ask me anything. Want to size the five bounds for an agent platform you run, or wire a cost-velocity breaker keyed on (user, repo, model)? You've finished the content lessons — next is the Capstone: the full symptom → mitigation table and a mixed review across all eighteen.