# Knowledge Cutoff, Web Access, and Why It Matters

> An LLM that browses the web in 2026 still answers from a training anchor months — sometimes a year or more — in the past. Knowing which knowledge comes from where, and how the two interact, is the prerequisite for any GEO strategy.

Canonical: https://geosalience.com/foundations/knowledge-cutoff-and-web-access
Published: 2026-05-31T00:00:00.000Z
Updated: 2026-05-31T00:00:00.000Z
Pillar: foundations
Authors: geosalience

---
<Callout type="key" title="The one-line version">
Even with browsing turned on, an LLM blends two pools: frozen training knowledge (bounded by a cutoff date — January 2025 to January 2026 across today's frontier models) and pages it retrieves for your query. GEO means earning a place in both.
</Callout>

Modern AI search engines mix two sources of information: what the model learned during training, and what it can fetch from the live web. Most users don't notice the seam, and neither do most GEO writers. But the seam is exactly where [citation behaviour](/glossary/citation-rate) becomes inconsistent, which is why any serious [GEO strategy](/foundations/what-is-geo) has to account for it.

## The two layers

An LLM-based search engine — ChatGPT with browsing, Claude with web search, Perplexity, AI Overviews — answers from two distinct pools:

1. **Trained knowledge** — facts encoded in the model's weights during pre-training. This pool is large (trillions of tokens) but **frozen** at the model's knowledge cutoff date.
2. **Retrieved knowledge** — pages the model fetched in response to the current query, parsed, and used as ground truth.

When you ask "what's the population of France?", the model can answer from either pool. When you ask "what was announced at OpenAI DevDay 2026?", only the retrieved pool can answer correctly.
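The distinction can be sketched as a simple date check. This is a toy illustration, not how any engine actually routes queries: `pool_for` and its parameters are hypothetical names, and the fact dates in the usage note are made up for the example.

```python
from datetime import date


def pool_for(fact_date: date, cutoff: date, browsing: bool) -> str:
    """Return which pool can answer a question about a fact dated `fact_date`.

    Hypothetical helper: real engines do not expose this decision.
    """
    if fact_date <= cutoff:
        # The fact predates the cutoff, so trained knowledge may cover it;
        # retrieval is an additional option when browsing is on.
        return "either" if browsing else "trained"
    # The fact is newer than the cutoff: only retrieval can know it.
    return "retrieved" if browsing else "none"
```

With a cutoff of `date(2025, 8, 31)`, a fact dated `date(2020, 1, 1)` maps to `"either"` when browsing is on, while a fact dated `date(2026, 3, 1)` maps to `"retrieved"`, or to `"none"` when browsing is off.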

## Knowledge cutoffs by model (as of 2026-05-31)

{/* TODO(review): re-verify cutoffs + lineup each quarter; values drift. Last verified 2026-05-31. */}

| Model | Knowledge cutoff | Live web access |
|---|---|---|
| Claude Opus 4.8 (Anthropic) | January 2026 | Yes — via web-search tool |
| Claude Sonnet 4.6 (Anthropic) | August 2025 | Yes — via web-search tool |
| GPT-5.2 (OpenAI) | August 31, 2025 | Yes — via browsing/search |
| Gemini 3 Pro (Google) | January 2025 | Yes — via Search Grounding |

*Knowledge cutoffs as of 2026-05-31, from vendor model cards. The frontier lineup changes often (OpenAI already ships 5.3/5.4/5.5; Google ships Gemini 3.5 Flash) — treat this as a dated snapshot, not a standing list.*
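To make the gap concrete, the arithmetic below turns the table into "months behind" on the snapshot date. It is a minimal sketch: `months_stale` is a hypothetical helper, and month-only cutoffs are pinned to the first of the month, which is an assumption made purely for the calculation.

```python
from datetime import date

# Cutoffs copied from the table above. Month-only cutoffs are pinned to the
# first of the month (an assumption; the vendors give no day).
CUTOFFS = {
    "Claude Opus 4.8": date(2026, 1, 1),
    "Claude Sonnet 4.6": date(2025, 8, 1),
    "GPT-5.2": date(2025, 8, 31),
    "Gemini 3 Pro": date(2025, 1, 1),
}


def months_stale(cutoff: date, today: date) -> int:
    """Whole calendar months elapsed between the cutoff and `today`."""
    months = (today.year - cutoff.year) * 12 + (today.month - cutoff.month)
    # Do not count the current month until its day-of-month has been reached.
    if today.day < cutoff.day:
        months -= 1
    return months


snapshot = date(2026, 5, 31)
for model, cutoff in CUTOFFS.items():
    print(f"{model}: {months_stale(cutoff, snapshot)} months behind")
```

On the snapshot date this gives 4, 9, 9, and 16 months respectively, which is the "months, sometimes a year or more" gap described at the top of the article.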

A meaningful detail: **continuously refreshed retrieval** vs. **discrete training cutoffs**. Perplexity and Gemini's grounding layer sit on search indexes that are designed to ingest fresh content constantly, but that freshness lives in the index, not in the model's weights. The models themselves, including those behind ChatGPT and Claude, come from discrete training runs, so their pre-training knowledge is bounded by a specific date. Even when browsing is on, anything a model "knows" without retrieving was true only up to that cutoff.

## Where the seam shows up

### Disagreements between memory and web

If the model "remembers" something from training and the web disagrees, behaviour varies:

- **ChatGPT** prefers retrieved content for time-sensitive facts; defers to memory for general explanations.
- **Claude** is more conservative — it will often cite the retrieved source even if memory would have given a more polished phrasing.
- **Perplexity** treats memory as effectively zero — everything must come from a retrieved source.
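The three tendencies above can be written down as toy policies. This sketch only encodes the descriptions in this article; it is not vendor logic, and `resolve`, the policy names, and the parameters are all hypothetical.

```python
from typing import Optional


def resolve(policy: str, memory: Optional[str], retrieved: Optional[str],
            time_sensitive: bool) -> Optional[str]:
    """Pick an answer source under one of three toy policies."""
    if policy == "retrieval_only":
        # Memory is ignored entirely; no retrieved source means no answer.
        return retrieved
    if policy == "retrieval_first":
        # Prefer the retrieved source whenever one exists.
        return retrieved if retrieved is not None else memory
    if policy == "mixed":
        # Retrieved content wins for time-sensitive facts; memory otherwise.
        if time_sensitive and retrieved is not None:
            return retrieved
        return memory if memory is not None else retrieved
    raise ValueError(f"unknown policy: {policy}")
```

Under `"mixed"`, a general explanation (`time_sensitive=False`) comes from memory even when a retrieved page exists, which is the case where a fresher article loses to an older remembered one.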

### Stale training data citing itself

This is the failure mode that matters most for GEO. If the model's training corpus included a now-outdated guide written by your competitor, the model may *prefer* to "remember" their take rather than retrieve your fresher one, unless the query itself triggers retrieval (for example, by asking about recent news) or your new article is clearly stronger.

### Recency cliff

For queries that don't trigger retrieval, the freshest answer the model can give dates from its training cutoff. The drop-off is a cliff, not a slope: claims more recent than the cutoff aren't represented at all.

## Practical implications for GEO

1. **Aim for both pools.** Your goal is to be cited *and* to be remembered. Cited helps now; remembered helps in 2 years when the next model is trained.
2. **Date everything.** A clear `datePublished` tells the engine how fresh the page is, which can bias it toward the retrieved page over an older remembered answer.
3. **Be the canonical source for an evergreen claim.** If your article is the cited source for "GEO" in the corpus that GPT-6 trains on, GPT-6 answers about GEO can reflect your framing even when browsing is off.
4. **Avoid outdated specifics in evergreen articles.** Don't say "ChatGPT 4o is the latest". Say "as of [date], the most capable model is X". A model that recalls the article then recalls a claim anchored to its date, which stays true, instead of an undated "latest" that has gone stale.
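For the "date everything" point, one common way to expose `datePublished` is schema.org `Article` markup serialised as JSON-LD. The sketch below is an assumption about how a site might generate it, with `article_jsonld` as a hypothetical helper. How heavily any given engine weighs this markup is not established here.

```python
import json
from datetime import datetime, timezone


def article_jsonld(headline: str, url: str,
                   published: datetime, modified: datetime) -> str:
    """Serialise minimal schema.org Article metadata as JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "url": url,
        # ISO 8601 timestamps for the publish and last-modified dates.
        "datePublished": published.isoformat(),
        "dateModified": modified.isoformat(),
    }
    return json.dumps(data, indent=2)


# Values taken from this article's own metadata.
stamp = datetime(2026, 5, 31, tzinfo=timezone.utc)
print(article_jsonld(
    "Knowledge Cutoff, Web Access, and Why It Matters",
    "https://geosalience.com/foundations/knowledge-cutoff-and-web-access",
    stamp,
    stamp,
))
```

The resulting string would typically go inside a `<script type="application/ld+json">` tag in the page head. Keeping `dateModified` accurate matters as much as `datePublished` when an evergreen article is revised.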

## What's still unknown

- The exact contribution of any given URL to a model's training corpus.
- The decay function — how long after a citation drops out of the live web does it stop being "remembered"?
- Whether engines with RAG-style architectures (Perplexity) show fundamentally different citation behaviour over time.

## See also

- [What is GEO?](/foundations/what-is-geo) — the discipline this article is a prerequisite for.
- [GEO vs AEO vs LLMO vs SGE](/foundations/geo-vs-aeo-vs-llmo-vs-sge) — why the no-browsing case (answering from the training anchor alone) is where LLMO and GEO pull apart.
- [llms.txt: spec, 100-domain audit, and setup](/technical/llms-txt-spec-adoption-setup) — one concrete infrastructure signal that helps the retrieval layer find you.
- More in the [Foundations](/foundations) pillar — the theory and mechanics of AI search.
- How ChatGPT Decides Which Source to Cite {/* TODO(relink): how-chatgpt-decides-to-cite when it goes live */}