llms.txt: Spec, 100-Domain Adoption Audit, and Setup
We audited 100 top developer-tools and SaaS sites for an llms.txt file. Only 37 of them serve one at the apex — and the gap is concentrated in the places you might expect it not to be. The full spec, the audit, and a 10-minute setup.
llms.txt is a proposal by Jeremy Howard (Answer.AI) for a small text file at the root of your site that helps LLMs read it. It is not a standard. It is not enforced by any crawler today. And yet — a meaningful slice of the AI-native sites we audited have one, and the ones that don't are leaving money on the table for an investment of about 10 minutes.
This is the full spec, the 100-domain audit, and a setup tutorial you can copy.
What is llms.txt
llms.txt is a curated, machine-readable index of your site, structured as Markdown, served at https://yourdomain.com/llms.txt. It tells an LLM crawler:
- The name of your publication / project.
- A one-line description.
- A curated list of canonical URLs, grouped by topic, with one-line descriptions.
- Optional sections like "About", "FAQ", "API reference".
There is a companion file, llms-full.txt, which contains the full text of the most important pages, concatenated.
The point: instead of a crawler walking your sitemap and ranking every page by guesswork, you hand-curate what matters and serve it as a single file.
Why it matters (even though no crawler enforces it yet)
- Voluntary signal. Engineers at OpenAI, Anthropic, and Perplexity have all referenced reading llms.txt files during product development. It's used today as a research aid, even if it's not crawled at scale.
- It's an editorial statement. A site that publishes a curated index is signalling something about how it wants to be read. That's a useful signal for humans too, in particular journalists and competitors.
- Cost: zero. You're already maintaining a sitemap. This is one more file.
- Forward compatibility. If a major crawler does adopt the format, you're already ready.
The spec, in 90 seconds
A minimal valid llms.txt:
# My Project Name
> One-sentence description of what this site is.
A short paragraph explaining the project, audience, and scope.
## Docs
- [Quickstart](https://example.com/quickstart): Get started in 5 minutes.
- [API reference](https://example.com/api): Full API documentation.
## Optional
- [Changelog](https://example.com/changelog): Recent releases.
Rules:
- File is Markdown (not plain text, despite the .txt extension).
- First line is # Project Name (H1).
- Second line is > Short description (Markdown blockquote).
- The next paragraph is a longer description.
- After that, ## Section Name headers separate logical groups.
- Inside each section: one - [Title](URL): description. line per resource.
- "Optional" section is a convention for secondary links.
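The rules above are simple enough to check mechanically. The sketch below is one way to parse the layout into a title, a summary, and per-section link lists. `parseLlmsTxt` and its types are illustrative names, not part of the proposal, and the parser accepts link lines with or without a trailing description.

```typescript
// Minimal sketch of a parser for the llms.txt layout described above.
// All type and function names are illustrative, not defined by the proposal.

interface LlmsLink {
  title: string;
  url: string;
  description: string; // empty string when the line has no ": description" tail
}

interface LlmsTxt {
  title: string | null; // text of the first "# " line
  summary: string | null; // text of the first "> " line
  sections: Record<string, LlmsLink[]>; // keyed by "## " heading text
}

// Matches "- [Title](URL)" with an optional ": description" tail.
const LINK_LINE = /^-\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/;

function parseLlmsTxt(source: string): LlmsTxt {
  const result: LlmsTxt = { title: null, summary: null, sections: {} };
  let current: LlmsLink[] | null = null;

  for (const raw of source.split(/\r?\n/)) {
    const line = raw.trim();
    if (line.startsWith("## ")) {
      // A new H2 opens a new section; later link lines attach to it.
      current = [];
      result.sections[line.slice(3).trim()] = current;
    } else if (line.startsWith("# ") && result.title === null) {
      result.title = line.slice(2).trim();
    } else if (line.startsWith("> ") && result.summary === null) {
      result.summary = line.slice(2).trim();
    } else if (current !== null) {
      const match = LINK_LINE.exec(line);
      if (match) {
        current.push({
          title: match[1] ?? "",
          url: match[2] ?? "",
          description: (match[3] ?? "").trim(),
        });
      }
    }
  }
  return result;
}
```

Free-text paragraphs and any links that appear before the first H2 are ignored here; a stricter checker would likely want to report them.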
The companion llms-full.txt:
- Same format on top.
- Each H1 is the title of an article.
- Body of the article in raw Markdown follows.
- --- separates articles.
Audit: 100 domains
We crawled 100 domains on 2026-05-19 across eight categories: AI/ML companies (15), top SaaS (20), dev tools and frameworks (15), documentation platforms (10), tech media (10), top global brands (10), news and journalism (10), and GEO/SEO tools (10). The full domain list and the crawler that ran the audit are open: see the methodology dataset and scripts/llms-txt-audit.ts in our repository.
The audit probes the apex domain only — https://<brand>.com/llms.txt. A site that ships the file under a subdomain (docs.example.com/llms.txt) or a deeper path is recorded as "not found." That apex-only frame matters: it's the location the spec proposes, and the one an LLM crawler would try first without prior knowledge.
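To make the apex-only frame concrete, here is a minimal sketch of such a probe. It is not the audit crawler itself: `apexProbeUrl`, `probeApex`, and the tiny `FetchLike` type are hypothetical names, and treating anything other than a final 200 as "not found" is an assumption based on the description above.

```typescript
// Sketch of an apex-only probe. FetchLike is a deliberately small stand-in
// for a real HTTP client so the logic can be exercised without a network.
type FetchLike = (url: string) => Promise<{ status: number }>;

type ProbeResult = "found" | "not found";

// Only https://<domain>/llms.txt is probed, never a subdomain or deeper path.
function apexProbeUrl(domain: string): string {
  const host = domain
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, "") // tolerate a pasted scheme
    .replace(/\/.*$/, ""); // drop any path the caller included
  return `https://${host}/llms.txt`;
}

// Assumption: only a final 200 counts as "found". A 404 after a redirect,
// a 403 from a CDN, or a network error are all recorded as "not found".
async function probeApex(domain: string, fetchImpl: FetchLike): Promise<ProbeResult> {
  try {
    const response = await fetchImpl(apexProbeUrl(domain));
    return response.status === 200 ? "found" : "not found";
  } catch {
    return "not found";
  }
}
```

Injecting the fetch function keeps the classification rule testable; a real run would pass a client that follows redirects and sets the crawler's user-agent.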
Adoption rate
| Category | Audited | With llms.txt | Adoption rate |
|---|---|---|---|
| AI / ML companies | 15 | 5 | 33% |
| Top SaaS | 20 | 14 | 70% |
| Dev tools and frameworks | 15 | 7 | 47% |
| Documentation platforms | 10 | 3 | 30% |
| GEO / SEO tools | 10 | 3 | 30% |
| Tech media | 10 | 1 | 10% |
| Top global brands | 10 | 0 | 0% |
| News and journalism | 10 | 0 | 0% |
| Total | 100 | 37 | 37% |
The headline is the SaaS row: 70% of the top SaaS sample serves an llms.txt at the apex, a higher rate than AI/ML or dev tools and a much higher rate than we expected before running the crawl. The cleanest gap is at the other end: news, global brands, and the bulk of tech media ship nothing.
Among the 37 adopters, 8 also serve the companion llms-full.txt (22% of adopters; 8% of the full sample). The companion file is concentrated in dev tools and docs platforms — Cloudflare, Supabase, Bun, Drizzle, tRPC, Vite, Turborepo, Speakeasy.
A few well-known names are missing from the apex adoption list because they redirect their apex to www.<brand>.com, which 404s on /llms.txt. The most visible of these are Anthropic and Mintlify — both ship documentation surfaces that publish llms.txt, but not at the apex this audit probes. That's a finding about where the file lives, not whether it exists; we treat it as a separate question for a follow-up audit.
What's good (exemplary files)
A short list of llms.txt files that get the format right, ranked by our compliance score (sum of seven dimensions / 7) then by file size.
- Stripe (stripe.com/llms.txt, ~64 KB body, score 1.00): the largest well-formed file we found. The crawler truncated the body at 64 KB; the live file is longer still. Sections cover API products and SDKs; every link in the file is a curated entry, not a generic sitemap dump.
- Together AI (together.ai/llms.txt, ~39 KB, score 1.00): comprehensive coverage of model docs, inference endpoints, and platform features; clean H2 structure; .md alternates throughout.
- Webflow (webflow.com/llms.txt, ~13 KB, score 1.00): a SaaS that ships llms.txt as an editorial signal, not just a docs concession. Concise H2 sections, descriptive one-liners per link.
- Athena HQ (athenahq.ai/llms.txt, ~12 KB, score 1.00): one of three GEO-tool vendors that publish at the apex. Categorized sections with descriptions written for LLM consumers, not humans.
- Turborepo (turbo.build/llms.txt, ~10 KB, score 1.00): a Vercel-stack project that uses the file as a curated index over the docs site. Clean H1 + summary + Docs/Examples sections; no marketing copy.
What these have in common: an H1 with the brand name, a single-sentence blockquote summary, 3-5 H2 sections (Docs, API Reference, Optional, sometimes Examples), .md alternates linked rather than HTML pages, and no marketing copy.
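The ordering used for the list above (compliance score first, then file size) can be written as a two-key comparator. This is a sketch only: the `AuditedFile` shape is hypothetical, the sizes are the rounded figures quoted above, and "larger file first" is inferred from the order of the list rather than stated outright.

```typescript
interface AuditedFile {
  domain: string;
  score: number; // compliance score, 0 to 1
  sizeKb: number; // approximate body size in KB, as quoted in the list
}

// Higher score first; ties are broken by larger file first (inferred ordering).
function byScoreThenSize(a: AuditedFile, b: AuditedFile): number {
  return b.score - a.score || b.sizeKb - a.sizeKb;
}

const exemplary: AuditedFile[] = [
  { domain: "webflow.com", score: 1.0, sizeKb: 13 },
  { domain: "turbo.build", score: 1.0, sizeKb: 10 },
  { domain: "stripe.com", score: 1.0, sizeKb: 64 },
  { domain: "athenahq.ai", score: 1.0, sizeKb: 12 },
  { domain: "together.ai", score: 1.0, sizeKb: 39 },
];

// Copy before sorting so the input array is left untouched.
const rankedDomains = [...exemplary].sort(byScoreThenSize).map((f) => f.domain);
```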
Common mistakes
The crawler's compliance check scores each found file on seven dimensions
(see the crawler spec for
detail). The most common dimensional failure across the 37 found files
was "no typed links" — the file ships an H1, a summary, and at least
one ## section, but no [text](url) line under a ## Docs or
## Optional heading. Sites in this group treat llms.txt as a
human-readable about-page instead of a curated link index, which defeats
the file's purpose for LLM consumers.
Beyond the dimensional check, the same five editorial mistakes recur whenever we hand-inspect the files:
- Linking to HTML pages instead of .md alternates. The whole point of llms.txt is to surface raw Markdown. Linking to https://example.com/docs/quickstart makes the LLM crawler walk back through HTML parsing, the very work the file was meant to remove. Link to https://example.com/docs/quickstart.md (or whatever your raw-Markdown route is) instead.
- Missing one-line descriptions on link items. The spec's link format is - [Title](URL): description. The spec treats the colon and trailing description as optional notes, but files that omit them give the LLM no hint about relevance, defeating the curation step.
- Stale links (404). Curated indexes drift faster than sitemaps because they aren't usually regenerated by the build system. We expect to find dead links in 10-20% of audited files.
- Marketing copy in the description. "The best CRM platform for modern teams" is a tagline, not a summary. The summary should describe what the linked page contains, not why you should buy from the vendor.
- Mixing first-party docs with marketing pages. An llms.txt that links to /pricing, /about, and /blog/announcement-x is using the file as a sitemap. The spec is for an LLM-grounding index; pricing belongs on a sitemap, not here.
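Two of the mistakes above (links that do not point at a .md alternate, and link items without a description) can be caught mechanically. The lint below is a minimal sketch under that assumption; `lintLinkLine` and the issue names are made up for illustration, and stale links or marketing tone would need a network check or a human reader.

```typescript
type LintIssue = "missing-description" | "not-markdown-link";

// Matches "- [Title](URL)" with an optional ": description" tail.
const LINK_ITEM = /^-\s*\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/;

function lintLinkLine(line: string): LintIssue[] {
  const match = LINK_ITEM.exec(line.trim());
  if (!match) return []; // not a link item, so there is nothing to check

  const url = match[2] ?? "";
  const description = (match[3] ?? "").trim();
  const issues: LintIssue[] = [];

  if (description === "") issues.push("missing-description");

  // Ignore any query string or fragment when testing the suffix.
  const path = url.split(/[?#]/)[0] ?? "";
  if (!path.endsWith(".md")) issues.push("not-markdown-link");

  return issues;
}
```

Running it over every line of a file and collecting the non-empty results gives a quick pre-publish report.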
Practical setup (10 minutes)
Step 1 — Inventory
Decide what belongs in llms.txt. Not every page. Just the pages an LLM should ground on when answering questions about your site. For a typical SaaS:
- Quickstart / getting-started
- API reference
- Pricing
- Top 5–10 most-read blog posts (cornerstones)
- Methodology / how-it-works pages
- Changelog
Step 2 — Write the index
In your site repo, create the file at the path your framework serves as the root. For Next.js app router, a complete production-quality llms.txt route looks like this:
// app/llms.txt/route.ts
import { articles } from '#site/content'; // Velite-typed export
export const dynamic = 'force-static';
export const revalidate = 3600;
const SITE = {
name: 'The Cited',
description: 'Primary research on Generative Engine Optimization.',
url: 'https://geosalience.com',
};
export async function GET() {
const cornerstones = articles
.filter((a) => a.cornerstone && a.state === 'live')
.sort((a, b) => +new Date(b.publishedAt) - +new Date(a.publishedAt));
const docs = cornerstones
.map((a) => `- [${a.title}](${SITE.url}${a.permalink}.md): ${a.dek}`)
.join('\n');
const optional = [
`- [About](${SITE.url}/about.md): Who runs this publication.`,
`- [Methodology](${SITE.url}/methodology.md): How we run our tests.`,
`- [Editorial policy](${SITE.url}/editorial-policy.md): Corrections, disclosure, right of reply.`,
].join('\n');
const body = `# ${SITE.name}
> ${SITE.description}
The Cited is an editorial publication producing primary research on how
LLMs decide which sources to cite. Every article is built on a dataset
we publish alongside it.
## Cornerstones
${docs}
## Optional
${optional}
`;
return new Response(body, {
headers: {
'Content-Type': 'text/markdown; charset=utf-8',
'Cache-Control': 'public, max-age=3600, s-maxage=86400',
},
});
}
Three things to notice in the above:
- The route reads from the content layer (Velite, in our case) rather than hardcoding URLs. The file stays in sync with what's actually published without manual maintenance.
- Every URL ends in .md: those are the raw-Markdown alternates from Step 3.
- Content-Type is text/markdown, not text/plain. The spec accepts either, but text/markdown is the more accurate label for what the file contains.
The Cited's actual llms.txt route is open-source; see the repo.
Step 3 — Add a .md alternate for every linked URL
This is the step everyone skips. An LLM crawler does not want your hero animation and your cookie banner. Serve a raw-markdown version of every URL you link in llms.txt, at the same path with a .md suffix.
In Next.js you can do this with a middleware rewrite — see our raw markdown route for a working example.
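The core of such a rewrite is a small path mapping, sketched below as a pure function so it can be tested without a framework. The `/raw` prefix and the function name are hypothetical; the actual route in the linked example may be laid out differently.

```typescript
// Hypothetical internal route that renders a page as raw Markdown.
const RAW_ROUTE_PREFIX = "/raw";

// Returns the internal path to rewrite to, or null when the request is not
// for a .md alternate and should pass through untouched.
function markdownRewriteTarget(pathname: string): string | null {
  if (!pathname.endsWith(".md")) return null;
  const pagePath = pathname.slice(0, -".md".length);
  // Reject "/.md" and "/docs/.md", which have no page name before the suffix.
  if (pagePath === "" || pagePath.endsWith("/")) return null;
  return `${RAW_ROUTE_PREFIX}${pagePath}`;
}

// In a Next.js middleware this would be wired up roughly as follows
// (shown as a comment because it needs the framework to run):
//   const target = markdownRewriteTarget(request.nextUrl.pathname);
//   if (target) return NextResponse.rewrite(new URL(target, request.url));
```

Keeping the mapping separate from the middleware makes it easy to confirm that ordinary HTML routes are never intercepted.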
Step 4 — llms-full.txt (optional, recommended)
Concatenate the full Markdown body of each cornerstone article into one file. This gives an LLM with a generous context window everything it needs to ground on your content in a single fetch.
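A minimal sketch of that concatenation, following the llms-full.txt layout described earlier (same header on top, an H1 per article, --- between articles). The type and function names are illustrative, and placing a --- between the header and the first article is an assumption.

```typescript
interface SiteHeader {
  name: string;
  description: string;
}

interface FullArticle {
  title: string;
  markdown: string; // raw Markdown body, without its own title line
}

function buildLlmsFull(site: SiteHeader, articles: FullArticle[]): string {
  // Same H1 + blockquote header as llms.txt.
  const header = `# ${site.name}\n\n> ${site.description}`;
  // Each article becomes an H1 title followed by its raw Markdown body.
  const bodies = articles.map((a) => `# ${a.title}\n\n${a.markdown.trim()}`);
  // A horizontal rule separates the header and every article.
  return [header, ...bodies].join("\n\n---\n\n") + "\n";
}
```

In the route from Step 2, the input would come from the same filtered cornerstone list, so both files stay in sync with what is published.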
Step 5 — Test
Fetch your llms.txt with curl and confirm:
- Content-Type: text/markdown (or text/plain; both are spec-compliant, but text/markdown is preferred).
- File is valid Markdown: paste it into any Markdown viewer to confirm headings render.
- All linked URLs return 200.
- The .md alternate URLs also return 200, with Content-Type: text/markdown.
A 30-second sanity check:
# Fetch llms.txt and verify content-type
curl -sI https://yourdomain.com/llms.txt | grep -i 'content-type'
# Extract every URL from the file and check status codes
curl -s https://yourdomain.com/llms.txt \
| grep -oE 'https?://[^)[:space:]]+' \
| sort -u \
| xargs -I{} sh -c 'echo "$(curl -sI -o /dev/null -w "%{http_code}" {}) {}"'
If any URL returns 404 or 5xx, fix it before publishing. If a URL redirects (3xx), decide whether the redirect destination is what you actually want LLMs to see; if yes, link the destination directly.
Step 6 — Publish and announce
This is the step we wish more sites took: when you publish an llms.txt, tell people. Mention it in your changelog. Tweet the URL. Add a footer link labeled "LLMs index" so humans can discover it too. The file itself is invisible to humans by design, but the cultural signal is worth surfacing.
What llms.txt is not
- It is not a robots.txt for AI. Use robots.txt for crawl directives. llms.txt is descriptive, not restrictive.
- It is not a sitemap. A sitemap lists all your URLs. llms.txt is the curated, hand-picked index.
- It is not a standard. Treat it as a strong proposal that some teams find useful.
FAQ
Will my llms.txt actually be fetched by ChatGPT / Claude / Gemini?
There is no public claim from any of the major LLM vendors that they fetch llms.txt as a routine part of crawl. Anecdotally, Perplexity has referenced reading them, and Anthropic engineers have referenced consulting them during product development. Treat adoption as forward-looking, not a current channel. If a major crawler does adopt the format, a site that already ships the file has nothing left to do.
Does writing one help SEO?
No. llms.txt is a separate, additive signal for AI consumers. Your sitemap is what helps SEO. The two files complement each other — sitemap is exhaustive, llms.txt is curated.
How often should I update it?
Whenever the cornerstone list changes. The file is small enough that automation isn't necessary, but if your content layer can emit it (as ours does from Velite), it stays in sync for free. We rebuild ours on every deploy.
Should I include marketing pages?
No. Pricing, about, and announcement pages belong in a sitemap, not an llms.txt. The spec is for an LLM-grounding index — the pages an LLM should ground on when answering a question about your site. Marketing pages tell humans why to buy; they don't help an LLM answer "how does X work."
What's the difference between llms.txt and llms-full.txt?
llms.txt is the index — short, curated, mostly links. llms-full.txt is the index plus the full Markdown body of each cornerstone, concatenated. A model with a generous context window can ingest the entire llms-full.txt in a single request and ground on it. We recommend shipping both; llms.txt is the spec primitive, llms-full.txt is the optimisation.
Does the file need to be exactly at the root?
The spec says /llms.txt at the apex. A minority of implementations put it under /.well-known/llms.txt instead. We treat the apex location as canonical; .well-known is a deviation that may be tolerated by tooling but isn't documented in the spec.
Can I block specific LLMs with this file?
No. llms.txt is descriptive, not restrictive. For blocking crawlers, use robots.txt with the LLM crawler's user-agent (e.g., GPTBot, ClaudeBot, Google-Extended, PerplexityBot). The two files have different jobs.
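For completeness, the robots.txt side can be generated from a list of user-agent tokens. The helper below is a small sketch: `robotsBlockFor` is a made-up name, the agent list is the one given in the answer above, and blocking the whole site with `/` is just the default chosen for the example.

```typescript
// User-agent tokens named in the answer above.
const AI_CRAWLER_AGENTS = ["GPTBot", "ClaudeBot", "Google-Extended", "PerplexityBot"];

// Emits one "User-agent / Disallow" group per agent, separated by blank lines.
function robotsBlockFor(agents: string[], disallowPath = "/"): string {
  return (
    agents
      .map((agent) => `User-agent: ${agent}\nDisallow: ${disallowPath}`)
      .join("\n\n") + "\n"
  );
}

// Text to append to an existing robots.txt.
const aiCrawlerRules = robotsBlockFor(AI_CRAWLER_AGENTS);
```

The output belongs in robots.txt, not in llms.txt; compliance with it is up to each crawler.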
Is the format going to change?
Possibly. The proposal is on version 0.1 as of 2025. The community has discussed adding YAML frontmatter, a JSON-LD variant, and a stricter schema. For now, the Markdown form we describe above is what every existing implementation uses. We'll publish a follow-up if the spec changes materially.
Should the file be cached?
Yes. We serve ours with Cache-Control: public, max-age=3600, s-maxage=86400 — fresh for browsers for an hour, fresh on the CDN for a day. A daily cache invalidation is plenty unless you ship cornerstones more often than that.
Dataset
Full 100-domain audit: CSV download.
The CSV contains one row per audited domain with the following columns:
rank, category, brand_name, domain, probe_timestamp_utc,
homepage_status, homepage_response_ms, homepage_title,
llms_txt_found, llms_txt_status, llms_txt_size_bytes,
llms_txt_content_type, llms_txt_compliance_score,
llms_txt_compliance_notes, llms_full_txt_found,
llms_full_txt_status, llms_full_txt_size_bytes,
well_known_llms_txt_found, robots_txt_found, robots_txt_size_bytes,
error.
llms_txt_compliance_score is a 0-1 decimal scored across seven
dimensions: has H1, has summary blockquote, has at least one structured
section, has typed links in Docs or Optional, valid UTF-8, body size
between 200 bytes and 100 KB, no HTML leakage in the body. The
compliance_notes column lists which dimensions failed for each
non-compliant file.
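As a rough illustration of how a seven-dimension score like this could be computed, here is a simplified sketch. It is not the code in scripts/llms-txt-audit.ts: the function name, the dimension labels, the 1 KB = 1024 bytes reading of the size limit, and the heuristics for UTF-8 validity and HTML leakage are all assumptions.

```typescript
interface ComplianceResult {
  score: number; // passed dimensions / 7, rounded to two decimals
  failed: string[]; // labels of the dimensions that failed
}

// sizeBytes is passed in because the byte length is known at fetch time.
function scoreLlmsTxt(body: string, sizeBytes: number): ComplianceResult {
  const lines = body.split(/\r?\n/).map((l) => l.trim());

  // Look for a Markdown link item under a "## Docs" or "## Optional" heading.
  let heading = "";
  let typedLinks = false;
  for (const line of lines) {
    if (line.startsWith("## ")) {
      heading = line.slice(3).trim().toLowerCase();
    } else if (
      (heading === "docs" || heading === "optional") &&
      /^-\s*\[[^\]]+\]\([^)\s]+\)/.test(line)
    ) {
      typedLinks = true;
    }
  }

  const checks: Array<[string, boolean]> = [
    ["h1", lines.some((l) => l.startsWith("# "))],
    ["summary_blockquote", lines.some((l) => l.startsWith("> "))],
    ["structured_section", lines.some((l) => l.startsWith("## "))],
    ["typed_links", typedLinks],
    // Assumption: a U+FFFD replacement character means the decode failed.
    ["valid_utf8", !body.includes("\uFFFD")],
    ["size_200b_to_100kb", sizeBytes >= 200 && sizeBytes <= 100 * 1024],
    // Assumption: HTML leakage is approximated by a few common tag names.
    ["no_html", !/<\/?(html|head|body|div|script|span)\b/i.test(body)],
  ];

  const failed = checks.filter(([, ok]) => !ok).map(([name]) => name);
  const passed = checks.length - failed.length;
  const score = Math.round((passed / checks.length) * 100) / 100;
  return { score, failed };
}
```

A file with an H1, a summary, and an About section but no link items under Docs or Optional would fail the typed-links dimension here, which matches the most common failure described under Common mistakes.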
The crawler that produced this dataset is open source:
scripts/llms-txt-audit.ts.
You can re-run it against your own list of domains, or against ours to
check our work.
How we wrote this
This article combines spec reading, a 100-domain automated crawl, and about ten hours of hand-inspecting the files we found.
- Spec source: llmstxt.org, accessed 2026-05-19. The proposal was published by Jeremy Howard / Answer.AI in 2024.
- Audit methodology: automated crawl of 100 domains on 2026-05-19, scored against the seven compliance dimensions described in the Dataset section. The crawler used undici.fetch with a GeosalienceBot/1.0 user-agent, followed up to five redirects, and capped each response body at 64 KB. Concurrency was 10; total wall time was 31 seconds. Full crawler source and per-domain raw output are published.
- Limitations: A single point-in-time crawl. Sites that ship llms.txt behind authentication, on a CDN that 403s our user agent, or under a non-apex domain are recorded as "not found" even if they exist. We rerun the audit quarterly; year-over-year adoption is the more interesting metric.
- Conflicts of interest: None. The Cited is independent. We have no commercial relationship with Answer.AI, llmstxt.org, or any of the audited brands.
For our general methodology, see Methodology. For how we handle right-of-reply and corrections, see Editorial Policy.
See also
- Technical pillar: every article we publish on the technical side of GEO, from llms.txt to schema markup.
- What is GEO: the broader concept this audit sits inside.
- Citation rate: the metric we use to measure how often an LLM links back to a source.
- Share of voice: how often a brand appears across a set of LLM answers.
Cite this article
Reference this work in one of the formats below. The same strings are embedded in this page's Schema.org JSON-LD so LLM crawlers see them too.
Pawlikiewicz, S. (2026, May 25). llms.txt: Spec, 100-Domain Adoption Audit, and Setup. The Cited. https://geosalience.com/technical/llms-txt-spec-adoption-setup
Changelog
- Published — 25 May 2026
- Updated — 19 May 2026
- Last reviewed — 19 May 2026
Founder & Editor, The Cited
Building The Cited — the publication of record for Generative Engine Optimization. Researching how AI search engines retrieve, rank, and cite.