Machine-Readable Corpora

Three machine-facing files — llms.txt, JSON-LD schema, robots.txt — each do real work. The trap is expecting the wrong one to lift your citations.

Why this, for you: these are the conventions everyone hears about and most people mis-deploy. Knowing what each file actually does (navigation, citation, access) stops you from publishing an llms.txt and expecting rankings, or from blocking a retrieval bot and losing the citations it powers.

Content techniques decide whether a chunk is citable. The technical layer decides whether an engine can reach and parse it at all. Three files, three distinct jobs — don't conflate them.

1 llms.txt — navigation, not citation

/llms.txt is a curated Markdown index at your site root that gives an AI agent a pre-filtered entry point: it fetches the file, picks the relevant section, then fetches only the linked pages it needs — instead of undirected crawling that burns context on irrelevant pages. The spec requires exactly one element: an H1.

    # Acme Docs

    > Developer docs for the Acme platform: API, SDKs, tutorials.

    ## Core Documentation

    - [Quick Start](/docs/quickstart): First app in 5 minutes
    - [API Reference](/docs/api): Full endpoint reference

    ## Optional

    - [Changelog](/changelog): Release notes
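To make the "fetch the file, pick a section, fetch only those pages" flow concrete, here is a minimal Python sketch of the first two steps. It assumes the file follows the H1 / blockquote / H2-with-link-list shape of the Acme example; `parse_llms_txt` and `pick_links` are hypothetical helper names, not part of any spec or library.

```python
import re

# Matches "- [Title](url): optional note" list items.
LINK_RE = re.compile(
    r"^-\s*\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)(?::\s*(?P<note>.*))?$"
)

def parse_llms_txt(text: str) -> dict:
    """Parse an llms.txt body into a title, a summary, and sections of links."""
    doc = {"title": None, "summary": None, "sections": {}}
    current = None  # name of the H2 section currently being filled
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("# ") and doc["title"] is None:
            doc["title"] = line[2:].strip()
        elif line.startswith("> ") and doc["summary"] is None:
            doc["summary"] = line[2:].strip()
        elif line.startswith("## "):
            current = line[3:].strip()
            doc["sections"][current] = []
        elif current is not None:
            match = LINK_RE.match(line)
            if match:
                doc["sections"][current].append(match.groupdict())
    return doc

def pick_links(doc: dict, section: str) -> list:
    """Return only the URLs in one section, i.e. the pages worth fetching."""
    return [link["url"] for link in doc["sections"].get(section, [])]

# Illustrative input mirroring the Acme example.
SAMPLE = """\
# Acme Docs

> Developer docs for the Acme platform: API, SDKs, tutorials.

## Core Documentation

- [Quick Start](/docs/quickstart): First app in 5 minutes
- [API Reference](/docs/api): Full endpoint reference

## Optional

- [Changelog](/changelog): Release notes
"""

doc = parse_llms_txt(SAMPLE)
core_urls = pick_links(doc, "Core Documentation")
```

An agent working this way would fetch only `core_urls` and leave the "Optional" section alone unless the task called for it.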

llms.txt is not a citation signal

No major provider (Anthropic, OpenAI, Google) has published documentation confirming they read llms.txt at inference time. A 300k-domain study found no statistically significant correlation between publishing llms.txt and being cited. Its value is agent comprehension, not ranking. Publish llms-full.txt too: the whole corpus concatenated for a one-fetch load. And keep both current: stale links are worse than no file.
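One way to keep llms-full.txt and the link list honest is to build the full file from the same links and flag anything that no longer resolves. The sketch below is an assumption-heavy illustration: `build_llms_full` is a hypothetical helper, `pages` stands in for wherever your Markdown sources live, and the `<!-- source: ... -->` separator is my own convention, not something llms-full.txt requires.

```python
def build_llms_full(title: str, links: list, pages: dict) -> tuple:
    """Concatenate linked pages into one document and report stale links.

    links: URLs listed in llms.txt, in order.
    pages: mapping of URL -> Markdown body (hypothetical content store).
    Returns (full_text, stale_urls).
    """
    parts = [f"# {title}"]
    stale = []
    for url in links:
        body = pages.get(url)
        if body is None:
            # Listed in llms.txt but no content behind it: a stale link.
            stale.append(url)
            continue
        parts.append(f"<!-- source: {url} -->\n{body.strip()}")
    return "\n\n".join(parts) + "\n", stale

# Illustrative data only.
pages = {"/docs/quickstart": "## Quick Start\nInstall the SDK.\n"}
full_text, stale = build_llms_full(
    "Acme Docs", ["/docs/quickstart", "/docs/api"], pages
)
```

Running something like this in CI and failing the build when `stale` is non-empty is one plausible way to enforce "keep it current".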

2 Schema — the one that does lift citation

Structured data (JSON-LD) pre-packages content in the Q&A and step formats engines reuse, which reduces extraction effort. The benefit lands at indexing time, when the page is crawled and parsed; chatbots don't read JSON-LD on a live fetch.

Independent studies report FAQPage citation lifts of 2.7×–3.2× in AI responses. Match the type to the content shape: FAQPage for Q&A, HowTo for step lists, DefinedTerm for named concepts.
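For the Q&A shape, the markup can be generated from the same question/answer pairs that render the page, which helps keep the two aligned. A minimal sketch, assuming you already have the pairs as plain strings; `faq_jsonld` and `as_script_tag` are hypothetical helper names, while the `FAQPage` / `Question` / `acceptedAnswer` / `Answer` structure is the schema.org vocabulary.

```python
import json

def faq_jsonld(pairs: list) -> dict:
    """Build a schema.org FAQPage object from (question, answer) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in pairs
        ],
    }

def as_script_tag(data: dict) -> str:
    """Serialize for embedding in HTML; escape "</" so text cannot close the tag."""
    payload = json.dumps(data, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"

# Illustrative content only.
data = faq_jsonld([
    ("What is llms.txt?", "A curated Markdown index at the site root."),
])
tag = as_script_tag(data)
```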

The catch: stale schema hurts. If body text drifts from the markup, engines see contradictory signals and may deprioritize the page. Wrong type for the shape (HowTo on prose) gets flagged by validators too.
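Drift between markup and body is checkable. The sketch below flags FAQ entries whose answer text no longer appears in the visible page text. Treating "drift" as "the answer is not a whitespace-insensitive substring of the body" is my simplifying assumption; `find_drift` is a hypothetical helper, and a real check might tolerate minor rewording.

```python
import re

def _norm(text: str) -> str:
    """Collapse whitespace and lowercase so formatting changes don't count as drift."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_drift(jsonld: dict, body_text: str) -> list:
    """Return the FAQ questions whose answer text is missing from the page body."""
    body = _norm(body_text)
    drifted = []
    for item in jsonld.get("mainEntity", []):
        answer = item.get("acceptedAnswer", {}).get("text", "")
        if _norm(answer) not in body:
            drifted.append(item.get("name", ""))
    return drifted

# Illustrative data: the second answer contradicts the body.
schema = {
    "mainEntity": [
        {"name": "What is llms.txt?",
         "acceptedAnswer": {"text": "A curated Markdown index."}},
        {"name": "Is it a ranking signal?",
         "acceptedAnswer": {"text": "Yes, always."}},
    ]
}
body = "llms.txt:  A curated\nMarkdown index. It is not a ranking signal."
drifted = find_drift(schema, body)
```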

3 Crawler policy — allow the right tier

AI crawlers split into three tiers, each needing a different robots.txt response. The default for docs sites: allow retrieval, disallow training, WAF-block the non-compliant.

| Tier | Examples | Action |
| --- | --- | --- |
| Retrieval (powers citations) | `OAI-SearchBot`, `Claude-SearchBot`, `PerplexityBot` | Allow |
| Training scrapers | `GPTBot`, `ClaudeBot`, `Google-Extended` | Disallow |
| Non-compliant | `Bytespider` | CDN/WAF block |
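The first two tiers translate directly into robots.txt groups. As a rough sketch, the Python below renders them from a policy dict and checks the result with the standard library's `urllib.robotparser`. `POLICY`, `render_robots`, and `can_fetch` are hypothetical names, and the `/docs/api` path is only a test URL. The non-compliant tier is deliberately absent: it is handled at the CDN/WAF, not here.

```python
from urllib.robotparser import RobotFileParser

# Tiers taken from the table; the dict layout is an illustrative choice.
POLICY = {
    "allow": ["OAI-SearchBot", "Claude-SearchBot", "PerplexityBot"],
    "disallow": ["GPTBot", "ClaudeBot", "Google-Extended"],
}

def render_robots(policy: dict) -> str:
    """Render one robots.txt group per user agent."""
    blocks = []
    for agent in policy["allow"]:
        blocks.append(f"User-agent: {agent}\nAllow: /")
    for agent in policy["disallow"]:
        blocks.append(f"User-agent: {agent}\nDisallow: /")
    return "\n\n".join(blocks) + "\n"

def can_fetch(robots_txt: str, agent: str, url: str = "/docs/api") -> bool:
    """Ask the stdlib parser how a compliant crawler would read the file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)

robots = render_robots(POLICY)
```

Checking the rendered file with a parser before deploying is a cheap guard against accidentally disallowing a retrieval bot that shares part of its name with a training bot.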

But robots.txt is advisory, not enforceable. As of OpenAI's Dec 2025 policy update, ChatGPT-User no longer respects robots.txt; Cloudflare documented Perplexity rotating user-agents to evade blocks. For a hard block, you need WAF rules — not a robots.txt line.
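To show the difference between advising and enforcing, here is a small application-level stand-in for a WAF rule: a WSGI middleware that returns 403 before the app runs. This is an illustration, not a WAF configuration; `BLOCKED_TOKENS`, `is_blocked`, and `block_crawlers` are hypothetical names. It matches on the user-agent string only, which is exactly what user-agent rotation defeats, so a production rule would presumably need additional signals.

```python
# Lowercase substrings of user agents to refuse (from the non-compliant tier).
BLOCKED_TOKENS = ("bytespider",)

def is_blocked(user_agent: str) -> bool:
    """True if the user-agent string contains any blocked token."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_TOKENS)

def block_crawlers(app):
    """WSGI middleware: answer 403 for blocked agents, otherwise pass through."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT", "")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        return app(environ, start_response)
    return middleware

def demo_app(environ, start_response):
    """Trivial app used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]

wrapped = block_crawlers(demo_app)
```

Unlike a robots.txt line, the request never reaches the content, whether or not the crawler chooses to cooperate.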

↪ Your win: right file, right job

llms.txt = agent navigation, not a citation or ranking signal; ship llms-full.txt too, keep both current.
Schema = the citation lifter: studies report a 2.7×–3.2× lift for FAQPage, and the markup does its work at indexing time; match type to shape, keep it in sync with the body.
Allow retrieval bots, disallow training, WAF-block non-compliant — three tiers, three responses.
robots.txt is advisory — ChatGPT-User is now exempt, Perplexity evades; hard blocks need WAF.

Retrieval practice — recall, don't peek

Question 1: The primary value of llms.txt is…

Question 2: FAQPage schema reportedly lifts AI citation by roughly…

Question 3: To stay citation-eligible while opting out of training, you…

Question 4: For a hard block of a non-compliant crawler, you need…

Question 5 · spaced recall from Lesson 04: If a claim has no real source, the right move is to…

Next, the strategic frame these techniques sit inside: Topical Authority and entity coverage.