Security · ~7 min
Every threat so far targeted your data. This one targets your wallet. The service stays up, latency is fine, error rates are flat — and the bill drains anyway. Resource exhaustion is a threat in its own right.
An LLM call's cost is variable and attacker-influenceable — input length, output length, tool-chain depth — and priced linearly. Requests-per-second doesn't bind dollars-per-second when one request costs $0.001 and the next $0.50. The same retry loop that drains the wallet can exhaust a rate-shared backend.
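To make the gap between request rate and spend concrete, here is a minimal arithmetic sketch. The per-token prices and token counts are placeholder assumptions chosen for illustration, not any provider's real rates.

```python
# Hypothetical prices, chosen only for illustration.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000    # $ per input token (assumed)
PRICE_PER_OUTPUT_TOKEN = 15.00 / 1_000_000  # $ per output token (assumed)


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: linear in both input and output length."""
    return (input_tokens * PRICE_PER_INPUT_TOKEN
            + output_tokens * PRICE_PER_OUTPUT_TOKEN)


cheap = call_cost(input_tokens=200, output_tokens=50)
expensive = call_cost(input_tokens=100_000, output_tokens=8_000)

# Both calls count as exactly one request against a requests-per-second limit.
print(f"cheap: ${cheap:.4f}  expensive: ${expensive:.4f}  "
      f"ratio: {expensive / cheap:.0f}x")
```

Under these assumed prices the two calls differ in cost by more than two orders of magnitude, yet a request counter sees them as identical.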
OWASP LLM10:2025 Unbounded Consumption names four sub-classes the same harness can produce. Three of them share the structural feature above: cost is variable, attacker-influenceable, and linear.
| Sub-class | Mechanism | Owner |
|---|---|---|
| Variable-length input | Oversized input drives CPU/memory until the service degrades | Availability |
| Denial of wallet | Token consumption drains a pay-per-use bill; service stays up | Finance |
| Resource amplification | Crafted input triggers the model's most expensive paths | Both |
| Model replication | API access mints synthetic data for a derivative model | Product/legal |
Each bound in the table below keys on a different unit of cost, and their union covers what no single
unit captures. A familiar iteration cap (LangChain ships max_iterations=15) is blind to per-step cost:
a fast agent can burn ten iterations in eight seconds and still sit under the cap.
| Bound | What it caps | What it misses alone |
|---|---|---|
| Per-call token cap | One call's output size | Multi-call tool chains; expensive inputs |
| Per-task iteration cap | Agent loop depth | Cost variance per iteration |
| Fan-out concurrency cap | Parallel sub-agent breadth | Sequential, long-running expense |
| Cost-velocity breaker | Rolling $/min per principal | Pre-existing baseline; first-time spikes |
| Per-day dollar budget | Absolute spend ceiling | Within-day burst windows |
Remove any one bound and a documented amplification path stays open. They are complementary by design.
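As a rough sketch of how several of these bounds might be checked together, the guard below combines a per-call token cap, a per-task iteration cap, and a per-day dollar budget. The class name `TaskGuard`, the exception `BoundExceeded`, and every limit value are hypothetical; a real deployment would also need the fan-out and velocity bounds, plus persistent storage for the daily total.

```python
from dataclasses import dataclass


class BoundExceeded(Exception):
    """Raised when any deterministic bound is hit."""


@dataclass
class TaskGuard:
    # All limits are illustrative assumptions, not recommended values.
    max_iterations: int = 15
    max_output_tokens_per_call: int = 4_096
    daily_budget_usd: float = 5.00
    iterations: int = 0
    spent_today_usd: float = 0.0

    def check_call(self, requested_output_tokens: int) -> None:
        """Run before each model call in the agent loop."""
        if requested_output_tokens > self.max_output_tokens_per_call:
            raise BoundExceeded("per-call token cap")
        if self.iterations >= self.max_iterations:
            raise BoundExceeded("per-task iteration cap")
        if self.spent_today_usd >= self.daily_budget_usd:
            raise BoundExceeded("per-day dollar budget")
        self.iterations += 1

    def record_cost(self, cost_usd: float) -> None:
        """Run after each call, with the cost actually billed."""
        self.spent_today_usd += cost_usd


# Each step stays under the token cap but is expensive (assumed $0.30 per step).
guard = TaskGuard(daily_budget_usd=1.00)
tripped = None
try:
    for _ in range(100):
        guard.check_call(requested_output_tokens=4_000)
        guard.record_cost(0.30)
except BoundExceeded as exc:
    tripped = str(exc)

print(tripped, guard.iterations)
```

In this run the dollar budget trips after four iterations, long before the iteration cap of 15 would have fired, which is the sense in which an iteration cap alone is blind to per-step cost.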
A flat "100 calls/min" limit fails in both directions. It misses the low-and-slow denial-of-wallet
pattern, which is hard to distinguish from legitimate traffic, and it over-triggers on bursty real work:
a summarisation task with retrieval, chunking, three LLM calls, and storage trips a tight bucket, so
"one rogue script blocks all the user's legitimate work, including the work they need to debug the rogue
script." Tuple-keyed limits on (user, repo, model) plus rolling-average velocity are the working shape.
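One way to picture a tuple-keyed velocity limit is the sketch below. It tracks spend in a rolling window separately for each (user, repo, model) key, so a runaway script on one key does not block the same user's work on another. The class name `VelocityBreaker`, the example keys, and the dollar limits are all hypothetical, and a rolling sum over a fixed window stands in here for whatever rolling-average scheme a real system uses.

```python
from collections import defaultdict, deque

Key = tuple[str, str, str]  # (user, repo, model)


class VelocityBreaker:
    """Rolling dollars-per-window limit, tracked separately per key."""

    def __init__(self, max_usd_per_window: float, window_s: float = 60.0):
        self.max_usd = max_usd_per_window
        self.window_s = window_s
        # Per-key queue of (timestamp, cost) pairs for accepted spend.
        self._events: dict[Key, deque] = defaultdict(deque)

    def allow(self, key: Key, cost_usd: float, now: float) -> bool:
        """Record the spend and return True if it fits in the rolling window."""
        events = self._events[key]
        # Drop spend that has aged out of the window.
        while events and now - events[0][0] >= self.window_s:
            events.popleft()
        spent = sum(cost for _, cost in events)
        if spent + cost_usd > self.max_usd:
            return False
        events.append((now, cost_usd))
        return True


breaker = VelocityBreaker(max_usd_per_window=1.00)
rogue = ("alice", "repo-a", "big-model")   # hypothetical key
other = ("alice", "repo-b", "big-model")   # same user, different repo

first = breaker.allow(rogue, 0.60, now=0.0)     # fits in the window
second = breaker.allow(rogue, 0.60, now=10.0)   # would exceed $1.00 in 60 s
sibling = breaker.allow(other, 0.60, now=10.0)  # separate key, unaffected
later = breaker.allow(rogue, 0.60, now=61.0)    # earlier spend has aged out
```

Passing `now` explicitly keeps the sketch deterministic; a production version would read a monotonic clock instead.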
The bounds add config surface and false-positive risk, and some conditions invert the trade-off. A
single-shot summariser has no loop to bound; trusted internal-only callers collapse the denial-of-wallet
vector. Two pitfalls matter most. First, per-call max_tokens is blind to chains: one study showed
658× cost amplification by coercing verbose multi-turn chains past a 4K per-call cap, so the
chain-level bounds (iteration, velocity) are the control. Second, don't put a brittle classifier in the
enforcement path: a 30-character adversarial suffix caused one LLM-based guard to block over 97% of
legitimate requests, so the safeguard itself becomes the DoS. Deterministic counters enforce; semantic checks detect.
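The two pitfalls can be illustrated with a small sketch. Every call below respects the per-call cap, yet only the chain-level counter stops the total from growing, and the stand-in semantic check (`flag`) can only record turns for review, never block them. The function `run_chain`, both cap values, and the flagging rule are hypothetical assumptions and do not reproduce the study's setup.

```python
from typing import Callable, Iterable

PER_CALL_CAP = 4_096   # max output tokens per call (assumed value)
CHAIN_CAP = 20_000     # max output tokens per whole task (assumed value)


def run_chain(
    turn_token_counts: Iterable[int],
    flag: Callable[[int], bool] = lambda tokens: False,
) -> tuple[int, int, list[int]]:
    """Return (turns_run, total_tokens, flagged_turns).

    Counters decide whether the chain continues; `flag` stands in for a
    semantic check and can only record a turn for review, never block it.
    """
    total = 0
    turns = 0
    flagged: list[int] = []
    for tokens in turn_token_counts:
        tokens = min(tokens, PER_CALL_CAP)  # per-call cap clips each call
        if total + tokens > CHAIN_CAP:      # chain-level cap ends the task
            break
        total += tokens
        turns += 1
        if flag(tokens):
            flagged.append(turns)
    return turns, total, flagged


# 50 coerced verbose turns, each individually within the per-call cap.
turns, total, flagged = run_chain([4_000] * 50, flag=lambda t: t > 3_500)
unbounded_total = 4_000 * 50  # what the per-call cap alone would have allowed
```

With these numbers the chain stops after five turns at 20,000 tokens, where the per-call cap alone would have let all fifty turns run.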
Limits keyed on (user, repo, model) plus rolling velocity beat fixed RPS.

Retrieval practice: recall, don't peek
Question 1: Denial-of-wallet is dangerous specifically because…
Question 2: OWASP LLM10 binds DoS and denial-of-wallet because they…
Question 3: Five separate bounds are needed because each one…
Question 4: A per-call max_tokens cap is blind to…
Question 5 · spaced recall from Lesson 17: The multitenant RAG fix moves authorization to…
You've finished the content lessons. Next is the Capstone:
the full symptom → mitigation table and a mixed review across all eighteen.