# llms.txt: Spec, 100-Domain Adoption Audit, and Setup

> We audited 100 top developer-tools and SaaS sites for an llms.txt file. Only 37 of them serve one at the apex, and the gap is concentrated where you would least expect it. The full spec, the audit, and a 10-minute setup.

Canonical: https://geosalience.com/technical/llms-txt-spec-adoption-setup
Published: 2026-05-25T00:00:00.000Z
Updated: 2026-05-19T00:00:00.000Z
Pillar: technical
Authors: slav

---
<Callout type="info" title="Key takeaways">
- **37 of 100** audited top sites serve `llms.txt` at the apex, above our 18–25% prior.
- **Top SaaS leads at 14/20 (70%)**; news outlets and global brands ship none.
- **Mean spec compliance** among adopters is **0.84**. Stripe, Together AI, Webflow, Athena HQ, and Turborepo all score 1.00.
- **Dominant failure mode:** no typed links under `## Docs` / `## Optional` — the file ships an H1 and sections but no link list, which defeats its purpose for LLM consumers.
- **Setup is one route handler + raw-markdown alternates**, about 10 minutes for a typical SaaS.
</Callout>

`llms.txt` is a proposal by Jeremy Howard (Answer.AI) for a small text file at the root of your site that helps LLMs read it. It is _not_ a standard. It is _not_ enforced by any crawler today. And yet a meaningful slice of the AI-native sites we audited have one, and the ones that don't are passing up a forward-looking signal that takes about 10 minutes to set up.

This is the full spec, the 100-domain audit, and a setup tutorial you can copy.

## What is llms.txt

`llms.txt` is a curated, machine-readable index of your site, structured as Markdown, served at `https://yourdomain.com/llms.txt`. It tells an LLM crawler:

1. The name of your publication / project.
2. A one-line description.
3. A curated list of canonical URLs, grouped by topic, with one-line descriptions.
4. Optional sections like "About", "FAQ", "API reference".

There is a companion file, `llms-full.txt`, which contains the _full text_ of the most important pages, concatenated.

The point: instead of a crawler walking your sitemap and ranking every page by guesswork, you hand-curate what matters and serve it as a single file.

## Why it matters (even though no crawler enforces it yet)

- **Voluntary signal.** Anecdotally, engineers at OpenAI, Anthropic, and Perplexity have referenced reading `llms.txt` files during product development. The file is used today as a research aid, even if it is not crawled at scale.
- **It's an editorial statement.** A site that publishes a curated index is signalling something about how it wants to be read. That's a useful signal _for humans_ too, in particular journalists and competitors.
- **Cost: zero.** You're already maintaining a sitemap. This is one more file.
- **Forward compatibility.** If a major crawler does adopt the format, you're already ready.

## The spec, in 90 seconds

A minimal valid `llms.txt`:

```markdown
# My Project Name

> One-sentence description of what this site is.

A short paragraph explaining the project, audience, and scope.

## Docs

- [Quickstart](https://example.com/quickstart): Get started in 5 minutes.
- [API reference](https://example.com/api): Full API documentation.

## Optional

- [Changelog](https://example.com/changelog): Recent releases.
```

Rules:

- File is **Markdown** (not plain text, despite the `.txt` extension).
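- First line is `# Project Name` (H1). This is the only element the spec strictly requires.
- Next comes `> Short description` (a Markdown blockquote).
- The next paragraph is a longer description.
- After that, `## Section Name` headers separate logical groups.
- Inside each section: one `- [Title](URL): description.` line per resource. The spec makes the colon and description optional, but you should include them.
- The `## Optional` section has a defined meaning: a consumer that needs a shorter context may skip the links listed there, so use it for secondary links.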
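The rules above are mechanical enough to check in code. The sketch below is a hypothetical `inspectLlmsTxt` helper (not part of the spec or of any library) that reports which of them a file satisfies. The regular expressions are a simplification and assume one link per line.

```ts
// Hypothetical helper: reports how an llms.txt body lines up with the rules above.
interface LlmsTxtLink {
  section: string;
  title: string;
  url: string;
  description: string | null; // null when the optional ": description" part is missing
}

interface LlmsTxtReport {
  hasH1: boolean;
  hasSummary: boolean;
  sections: string[];
  links: LlmsTxtLink[];
}

// Matches "- [Title](URL)" with an optional ": description" tail.
const LINK_LINE = /^-\s+\[([^\]]+)\]\(([^)\s]+)\)(?::\s*(.*))?$/;

function inspectLlmsTxt(body: string): LlmsTxtReport {
  const lines = body.split(/\r?\n/).map((line) => line.trimEnd());
  const nonEmpty = lines.filter((line) => line.trim() !== "");

  const report: LlmsTxtReport = {
    hasH1: /^# \S/.test(nonEmpty[0] ?? ""), // first non-empty line is an H1
    hasSummary: /^> \S/.test(nonEmpty[1] ?? ""), // next non-empty line is a blockquote
    sections: [],
    links: [],
  };

  let current: string | null = null;
  for (const line of lines) {
    const heading = /^## (.+)$/.exec(line);
    if (heading) {
      current = (heading[1] ?? "").trim();
      report.sections.push(current);
      continue;
    }
    const link = LINK_LINE.exec(line);
    if (link && current !== null) {
      report.links.push({
        section: current,
        title: link[1] ?? "",
        url: link[2] ?? "",
        description: link[3]?.trim() || null,
      });
    }
  }
  return report;
}

// Usage: flag links that carry no description.
const sample = [
  "# My Project Name",
  "",
  "> One-sentence description of what this site is.",
  "",
  "## Docs",
  "",
  "- [Quickstart](https://example.com/quickstart): Get started in 5 minutes.",
  "- [API reference](https://example.com/api)",
].join("\n");

const undescribed = inspectLlmsTxt(sample).links.filter((l) => l.description === null);
console.log(undescribed.map((l) => l.title));
```

Links outside any `##` section are ignored on purpose, since the rules place every resource under a section heading.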

The companion `llms-full.txt`:

- Same format on top.
- Each H1 is the title of an article.
- Body of the article in raw Markdown follows.
- `---` separates articles.

## Audit: 100 domains

We crawled 100 domains on 2026-05-19 across eight categories: AI/ML companies (15), top SaaS (20), dev tools and frameworks (15), documentation platforms (10), tech media (10), top global brands (10), news and journalism (10), and GEO/SEO tools (10). The full domain list and the crawler that ran the audit are open: see [the methodology dataset](/datasets/llms-txt-audit.csv) and `scripts/llms-txt-audit.ts` in [our repository](https://github.com/geosalience/geosalience).

The audit probes the **apex domain only** — `https://<brand>.com/llms.txt`. A site that ships the file under a subdomain (`docs.example.com/llms.txt`) or a deeper path is recorded as "not found." That apex-only frame matters: it's the location the spec proposes, and the one an LLM crawler would try first without prior knowledge.

### Adoption rate

| Category | Audited | With `llms.txt` | Adoption rate |
|---|---|---|---|
| AI / ML companies | 15 | 5 | 33% |
| Top SaaS | 20 | 14 | 70% |
| Dev tools and frameworks | 15 | 7 | 47% |
| Documentation platforms | 10 | 3 | 30% |
| GEO / SEO tools | 10 | 3 | 30% |
| Tech media | 10 | 1 | 10% |
| Top global brands | 10 | 0 | 0% |
| News and journalism | 10 | 0 | 0% |
| **Total** | **100** | **37** | **37%** |

The headline is the SaaS row: **70% of the top SaaS sample serves an `llms.txt` at the apex**, a higher rate than AI/ML or dev tools and a much higher rate than we expected before running the crawl. The cleanest gap is at the other end: news, global brands, and the bulk of tech media ship nothing.

Among the 37 adopters, 8 also serve the companion `llms-full.txt` (22% of adopters; 8% of the full sample). The companion file is concentrated in dev tools and docs platforms — Cloudflare, Supabase, Bun, Drizzle, tRPC, Vite, Turborepo, Speakeasy.

A few well-known names are missing from the apex adoption list because they redirect their apex to `www.<brand>.com`, which 404s on `/llms.txt`. The most visible of these are Anthropic and Mintlify — both ship documentation surfaces that publish `llms.txt`, but not at the apex this audit probes. That's a finding about where the file lives, not whether it exists; we treat it as a separate question for a follow-up audit.

### What's good (exemplary files)

A short list of `llms.txt` files that get the format right, ranked by our compliance score (sum of seven dimensions / 7) then by file size.

- **Stripe** (`stripe.com/llms.txt`, ~64 KB body, score 1.00) — the largest well-formed file we found. The crawler truncated the body at 64 KB; the live file is longer still. Sections cover API products and SDKs; every link in the file is a curated entry, not a generic sitemap dump.
- **Together AI** (`together.ai/llms.txt`, ~39 KB, score 1.00) — comprehensive coverage of model docs, inference endpoints, and platform features; clean H2 structure; `.md` alternates throughout.
- **Webflow** (`webflow.com/llms.txt`, ~13 KB, score 1.00) — a SaaS that ships `llms.txt` as an editorial signal, not just a docs concession. Concise H2 sections, descriptive one-liners per link.
- **Athena HQ** (`athenahq.ai/llms.txt`, ~12 KB, score 1.00) — one of three GEO-tool vendors that publish at the apex. Categorized sections with descriptions written for LLM consumers, not humans.
- **Turborepo** (`turbo.build/llms.txt`, ~10 KB, score 1.00) — a Vercel-stack project that uses the file as a curated index over the docs site. Clean H1 + summary + Docs/Examples sections; no marketing copy.

What these have in common: an H1 with the brand name, a single-sentence blockquote summary, 3-5 H2 sections (Docs, API Reference, Optional, sometimes Examples), `.md` alternates linked rather than HTML pages, and no marketing copy.

### Common mistakes

The crawler's compliance check scores each found file on seven dimensions
(see the [crawler spec](/raw/technical/llms-txt-spec-adoption-setup.md) for
detail). The **most common dimensional failure across the 37 found files
was "no typed links"** — the file ships an H1, a summary, and at least
one `##` section, but no `[text](url)` line under a `## Docs` or
`## Optional` heading. Sites in this group treat `llms.txt` as a
human-readable about-page instead of a curated link index, which defeats
the file's purpose for LLM consumers.

Beyond the dimensional check, the same five editorial mistakes recur
whenever we hand-inspect the files:

1. **Linking to HTML pages instead of `.md` alternates.** The whole point of `llms.txt` is to surface raw Markdown. Linking to `https://example.com/docs/quickstart` makes the LLM crawler walk back through HTML parsing — the very work the file was meant to remove. Link to `https://example.com/docs/quickstart.md` (or whatever your raw-Markdown route is) instead.
2. **Missing one-line descriptions on link items.** The link format is `- [Title](URL): description.` The spec makes the colon and trailing description optional, but files that omit them give the LLM no hint about relevance, defeating the curation step.
3. **Stale links (404).** Curated indexes drift faster than sitemaps because they aren't usually regenerated by the build system. We expect to find dead links in 10-20% of audited files.
4. **Marketing copy in the description.** "The best CRM platform for modern teams" is a tagline, not a summary. The summary should describe what the linked page contains, not why you should buy from the vendor.
5. **Mixing first-party docs with marketing pages.** An `llms.txt` that links to `/pricing`, `/about`, and `/blog/announcement-x` is using the file as a sitemap. The spec is for an LLM-grounding index — pricing belongs on a sitemap, not here.

## Practical setup (10 minutes)

### Step 1 — Inventory

Decide what belongs in `llms.txt`. Not every page. Just the pages an LLM should ground on when answering questions about your site. For a typical SaaS:

- Quickstart / getting-started
- API reference
- Top 5–10 most-read blog posts (cornerstones)
- Methodology / how-it-works pages
- Changelog

### Step 2 — Write the index

In your site repo, create the file at the path your framework serves as the root. For Next.js app router, a complete production-quality `llms.txt` route looks like this:

```ts
// app/llms.txt/route.ts
import { articles } from '#site/content'; // Velite-typed export

export const dynamic = 'force-static';
export const revalidate = 3600;

const SITE = {
  name: 'The Cited',
  description: 'Primary research on Generative Engine Optimization.',
  url: 'https://geosalience.com',
};

export async function GET() {
  const cornerstones = articles
    .filter((a) => a.cornerstone && a.state === 'live')
    .sort((a, b) => +new Date(b.publishedAt) - +new Date(a.publishedAt));

  const docs = cornerstones
    .map((a) => `- [${a.title}](${SITE.url}${a.permalink}.md): ${a.dek}`)
    .join('\n');

  const optional = [
    `- [About](${SITE.url}/about.md): Who runs this publication.`,
    `- [Methodology](${SITE.url}/methodology.md): How we run our tests.`,
    `- [Editorial policy](${SITE.url}/editorial-policy.md): Corrections, disclosure, right of reply.`,
  ].join('\n');

  const body = `# ${SITE.name}

> ${SITE.description}

The Cited is an editorial publication producing primary research on how
LLMs decide which sources to cite. Every article is built on a dataset
we publish alongside it.

## Cornerstones

${docs}

## Optional

${optional}
`;

  return new Response(body, {
    headers: {
      'Content-Type': 'text/markdown; charset=utf-8',
      'Cache-Control': 'public, max-age=3600, s-maxage=86400',
    },
  });
}
```

Three things to notice in the above:

1. The route reads from the content layer (Velite, in our case) rather than hardcoding URLs. The file stays in sync with what's actually published without manual maintenance.
2. Every URL ends in `.md` — those are the raw-Markdown alternates from Step 3.
3. `Content-Type` is `text/markdown`, not `text/plain`. Either works, but `text/markdown` describes the payload more accurately.

The Cited's actual `llms.txt` route is open-source; see [the repo](https://github.com/geosalience/geosalience).

### Step 3 — Add a `.md` alternate for every linked URL

This is the step everyone skips. An LLM crawler does not want your hero animation and your cookie banner. Serve a raw-markdown version of every URL you link in `llms.txt`, at the same path with a `.md` suffix.

In Next.js you can do this with a middleware rewrite — see [our raw markdown route](/foundations/what-is-geo.md) for a working example.
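The core of such a rewrite is a path mapping, which can be sketched without any framework code. The helper below is hypothetical: `toRawMarkdownPath` and the `/raw` prefix are assumptions chosen for illustration, not the route this site necessarily uses.

```ts
// Assumption: an internal route under this prefix renders the raw Markdown for a page.
const RAW_PREFIX = "/raw";

// Maps a public ".md" URL path to the internal raw-Markdown route,
// or returns null when the request should be left untouched.
function toRawMarkdownPath(pathname: string): string | null {
  if (!pathname.endsWith(".md")) return null;

  const withoutSuffix = pathname.slice(0, -".md".length);
  // Reject paths such as "/.md" or "/docs/.md" that have no page name.
  if (withoutSuffix === "" || withoutSuffix.endsWith("/")) return null;

  return `${RAW_PREFIX}${withoutSuffix}`;
}

// Usage
console.log(toRawMarkdownPath("/foundations/what-is-geo.md"));
console.log(toRawMarkdownPath("/foundations/what-is-geo"));
```

Inside a Next.js middleware, a non-null result would typically be passed to `NextResponse.rewrite(new URL(target, request.url))`, and a null result would fall through to the normal HTML route.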

### Step 4 — `llms-full.txt` (optional, recommended)

Concatenate the full Markdown body of each cornerstone article into one file. This gives an LLM with a generous context window everything it needs to ground on your content in a single fetch.
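A minimal sketch of that concatenation, following the `llms-full.txt` layout described earlier (index header on top, one H1 per article, `---` between articles). The `Article` shape and the `buildLlmsFullTxt` name are assumptions for illustration; in practice the articles would come from your content layer.

```ts
// Hypothetical article shape: a title plus the raw Markdown body.
interface Article {
  title: string;
  body: string;
}

// Builds the llms-full.txt body: site header first, then each article
// as "# Title" + raw Markdown, with "---" separating the parts.
function buildLlmsFullTxt(siteName: string, description: string, articles: Article[]): string {
  const header = `# ${siteName}\n\n> ${description}`;
  const sections = articles.map((a) => `# ${a.title}\n\n${a.body.trim()}`);
  return [header, ...sections].join("\n\n---\n\n") + "\n";
}

// Usage with made-up content
const full = buildLlmsFullTxt("My Project Name", "One-sentence description.", [
  { title: "Quickstart", body: "Install the package, then run the CLI.\n" },
  { title: "API reference", body: "Every endpoint accepts JSON." },
]);
console.log(full);
```

Placing a `---` between the header and the first article is a choice made here, not something the format requires.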

### Step 5 — Test

Fetch your `llms.txt` with `curl` and confirm:

- `Content-Type: text/markdown` (or `text/plain` — both are spec-compliant; `text/markdown` is preferred).
- File is valid Markdown — paste into any Markdown viewer to confirm headings render.
- All linked URLs return 200.
- The `.md` alternate URLs also return 200, with `Content-Type: text/markdown`.

A 30-second sanity check:

```bash
# Fetch llms.txt and verify content-type
curl -sI https://yourdomain.com/llms.txt | grep -i 'content-type'

# Extract every URL from the file and print "<status> <url>" for each
curl -s https://yourdomain.com/llms.txt \
  | grep -oE 'https?://[^) ]+' \
  | sort -u \
  | xargs -n 1 curl -sI -o /dev/null -w '%{http_code} %{url_effective}\n'
```

If any URL returns 404 or 5xx, fix it before publishing. If a URL redirects (3xx), decide whether the redirect destination is what you actually want LLMs to see — if yes, link the destination directly.

### Step 6 — Publish and announce

We wish more sites took this step: when you publish an `llms.txt`, tell people. Mention it in your changelog. Tweet the URL. Add an `LLMs index` footer link so humans can discover it too. The file itself is invisible to humans by design, but the cultural signal is worth surfacing.

## What llms.txt is not

- It is **not** a robots.txt for AI. Use `robots.txt` for crawl directives. `llms.txt` is descriptive, not restrictive.
- It is **not** a sitemap. A sitemap lists all your URLs. `llms.txt` is the curated, hand-picked index.
- It is **not** a standard. Treat it as a strong proposal that some teams find useful.

## FAQ

### Will my llms.txt actually be fetched by ChatGPT / Claude / Gemini?

There is no public claim from any of the major LLM vendors that they fetch `llms.txt` as a routine part of crawling. Anecdotally, Perplexity has referenced reading them, and Anthropic engineers have referenced consulting them during product development. Treat adoption as forward-looking, not a current channel. If a major crawler does adopt the format, a site that already ships the file needs no further work.

### Does writing one help SEO?

No. `llms.txt` is a separate, additive signal for AI consumers. Your sitemap is what helps SEO. The two files complement each other — sitemap is exhaustive, `llms.txt` is curated.

### How often should I update it?

Whenever the cornerstone list changes. The file is small enough that automation isn't necessary, but if your content layer can emit it (as ours does from Velite), it stays in sync for free. We rebuild ours on every deploy.

### Should I include marketing pages?

No. Pricing, about, and announcement pages belong in a sitemap, not an `llms.txt`. The spec is for an LLM-grounding index — the pages an LLM should ground on when answering a question about your site. Marketing pages tell humans why to buy; they don't help an LLM answer "how does X work."

### What's the difference between `llms.txt` and `llms-full.txt`?

`llms.txt` is the index — short, curated, mostly links. `llms-full.txt` is the index plus the full Markdown body of each cornerstone, concatenated. A model with a generous context window can ingest the entire `llms-full.txt` in a single request and ground on it. We recommend shipping both; `llms.txt` is the spec primitive, `llms-full.txt` is the optimisation.

### Does the file need to be exactly at the root?

The spec places the file at the root path, `/llms.txt`, and optionally allows a subpath. A minority of implementations put it under `/.well-known/llms.txt` instead. We treat the root of the apex domain as canonical; `.well-known` is a deviation that may be tolerated by tooling but isn't documented in the spec.

### Can I block specific LLMs with this file?

No. `llms.txt` is descriptive, not restrictive. For blocking crawlers, use `robots.txt` with the LLM crawler's user-agent (e.g., `GPTBot`, `ClaudeBot`, `Google-Extended`, `PerplexityBot`). The two files have different jobs.
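For completeness, here is a small sketch of generating the `robots.txt` side of that split. The `buildRobotsTxt` helper is hypothetical. It emits a blanket `Disallow: /` for each named agent and leaves everyone else allowed, which is an assumption about the policy you want rather than a recommendation.

```ts
// Hypothetical helper: one "Disallow: /" group per blocked agent,
// followed by a catch-all group that allows every other crawler.
function buildRobotsTxt(blockedAgents: string[]): string {
  const groups = blockedAgents.map((agent) => `User-agent: ${agent}\nDisallow: /`);
  groups.push("User-agent: *\nAllow: /");
  return groups.join("\n\n") + "\n";
}

// Usage with two of the user-agents named above
console.log(buildRobotsTxt(["GPTBot", "ClaudeBot"]));
```

Whether a given crawler honours these directives is up to that crawler; `robots.txt` is advisory.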

### Is the format going to change?

Possibly. The proposal is on version 0.1 as of 2025. The community has discussed adding YAML frontmatter, a JSON-LD variant, and a stricter schema. For now, the Markdown form we describe above is what every existing implementation uses. We'll publish a follow-up if the spec changes materially.

### Should the file be cached?

Yes. We serve ours with `Cache-Control: public, max-age=3600, s-maxage=86400` — fresh for browsers for an hour, fresh on the CDN for a day. A daily cache invalidation is plenty unless you ship cornerstones more often than that.

## Dataset

Full 100-domain audit: <a href="/datasets/llms-txt-audit.csv">CSV download</a>.

The CSV contains one row per audited domain with the following columns:
`rank`, `category`, `brand_name`, `domain`, `probe_timestamp_utc`,
`homepage_status`, `homepage_response_ms`, `homepage_title`,
`llms_txt_found`, `llms_txt_status`, `llms_txt_size_bytes`,
`llms_txt_content_type`, `llms_txt_compliance_score`,
`llms_txt_compliance_notes`, `llms_full_txt_found`,
`llms_full_txt_status`, `llms_full_txt_size_bytes`,
`well_known_llms_txt_found`, `robots_txt_found`, `robots_txt_size_bytes`,
`error`.
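As a rough sketch of how per-category adoption can be recomputed from these columns, the function below aggregates rows by `category` and `llms_txt_found`. It assumes the CSV has already been parsed and that `llms_txt_found` has been converted to a boolean; the `AuditRow` type and the sample rows are made up for illustration and are not taken from the dataset.

```ts
// Assumed row shape after CSV parsing; only the two columns used here are typed.
interface AuditRow {
  category: string;
  llms_txt_found: boolean;
}

interface CategoryAdoption {
  category: string;
  audited: number;
  found: number;
  rate: number; // found / audited, between 0 and 1
}

// Tallies audited and found counts per category, highest adoption rate first.
function adoptionByCategory(rows: AuditRow[]): CategoryAdoption[] {
  const tally = new Map<string, { audited: number; found: number }>();
  for (const row of rows) {
    const entry = tally.get(row.category) ?? { audited: 0, found: 0 };
    entry.audited += 1;
    if (row.llms_txt_found) entry.found += 1;
    tally.set(row.category, entry);
  }
  return Array.from(tally.entries())
    .map(([category, t]) => ({ category, audited: t.audited, found: t.found, rate: t.found / t.audited }))
    .sort((a, b) => b.rate - a.rate);
}

// Usage with illustrative rows (not real audit data)
const rows: AuditRow[] = [
  { category: "Example category A", llms_txt_found: true },
  { category: "Example category A", llms_txt_found: false },
  { category: "Example category B", llms_txt_found: true },
];
console.log(adoptionByCategory(rows));
```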

`llms_txt_compliance_score` is a 0-1 decimal scored across seven
dimensions: has H1, has summary blockquote, has at least one structured
section, has typed links in Docs or Optional, valid UTF-8, body size
between 200 bytes and 100 KB, no HTML leakage in the body. The
`llms_txt_compliance_notes` column lists which dimensions failed for each
non-compliant file.
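To make the scoring concrete, here is a minimal sketch of a seven-dimension check. It is not the audit crawler's code: the dimension names, the HTML heuristic, and the reading of "100 KB" as 100 × 1024 bytes are all assumptions, and the real script may differ on each.

```ts
interface ComplianceResult {
  score: number; // passed dimensions / 7
  failed: string[]; // names of the dimensions that failed
}

// Hypothetical scorer over the seven dimensions described above.
function scoreLlmsTxt(bytes: Uint8Array): ComplianceResult {
  // Dimension: valid UTF-8. A fatal decoder throws on malformed input.
  let validUtf8 = true;
  let text: string;
  try {
    text = new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    validUtf8 = false;
    text = new TextDecoder("utf-8").decode(bytes); // lossy decode so the other checks still run
  }

  // Dimension: at least one "[text](url)" list item under "## Docs" or "## Optional".
  let section = "";
  let typedLinks = false;
  for (const line of text.split(/\r?\n/)) {
    const h2 = /^## (.+)$/.exec(line);
    if (h2) {
      section = (h2[1] ?? "").trim().toLowerCase();
      continue;
    }
    const inTypedSection = section === "docs" || section === "optional";
    if (inTypedSection && /^\s*[-*]\s+\[[^\]]+\]\([^)\s]+\)/.test(line)) {
      typedLinks = true;
    }
  }

  const checks: [string, boolean][] = [
    ["h1", /^# \S/m.test(text)],
    ["summary_blockquote", /^> \S/m.test(text)],
    ["structured_section", /^## \S/m.test(text)],
    ["typed_links", typedLinks],
    ["valid_utf8", validUtf8],
    ["body_size", bytes.length >= 200 && bytes.length <= 100 * 1024],
    // Heuristic only: flags a handful of common HTML tags.
    ["no_html_leakage", !/<\s*(html|head|body|div|script)\b/i.test(text)],
  ];

  const failed = checks.filter(([, ok]) => !ok).map(([name]) => name);
  return { score: (checks.length - failed.length) / checks.length, failed };
}

// Usage: a file with an H1 and sections but no link list fails "typed_links".
const body = "# Example\n\n> A summary.\n\n" + "About us. ".repeat(30) + "\n\n## Docs\n\nSee our site.\n";
console.log(scoreLlmsTxt(new TextEncoder().encode(body)));
```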

The crawler that produced this dataset is open source:
[`scripts/llms-txt-audit.ts`](https://github.com/geosalience/geosalience).
You can re-run it against your own list of domains, or against ours to
check our work.

## How we wrote this

This article combines spec reading, a 100-domain automated crawl, and
about ten hours of hand-inspecting the files we found.

- **Spec source:** [llmstxt.org](https://llmstxt.org/), accessed
  2026-05-19. The proposal was published by Jeremy Howard / Answer.AI
  in 2024.
- **Audit methodology:** automated crawl of 100 domains on 2026-05-19,
  scored against the seven compliance dimensions described in the
  Dataset section. The crawler used `undici.fetch` with a
  `GeosalienceBot/1.0` user-agent, followed up to five redirects, and
  capped each response body at 64 KB. Concurrency was 10; total wall
  time was 31 seconds. The full crawler source and per-domain raw output
  are published.
- **Limitations:** A single point-in-time crawl. Sites that ship
  `llms.txt` behind authentication, on a CDN that 403s our user agent,
  or under a non-apex domain are recorded as "not found" even if they
  exist. We rerun the audit quarterly; year-over-year adoption is the
  more interesting metric.
- **Conflicts of interest:** None. The Cited is independent. We have no
  commercial relationship with Answer.AI, llmstxt.org, or any of the
  audited brands.

For our general methodology, see [Methodology](/methodology). For how
we handle right-of-reply and corrections, see
[Editorial Policy](/editorial-policy).

## See also

- [Technical pillar](/technical) — every article we publish on the technical side of GEO, from `llms.txt` to schema markup.
- [What is GEO](/foundations/what-is-geo) — the broader concept this audit sits inside.
- [Citation rate](/glossary/citation-rate) — the metric we use to measure how often an LLM links back to a source.
- [Share of voice](/glossary/share-of-voice) — how often a brand appears across a set of LLM answers.