Knowledge Cutoff, Web Access, and Why It Matters
An LLM that browses the web in 2026 still answers from a training anchor months — sometimes a year or more — in the past. Knowing which knowledge comes from where, and how the two interact, is the prerequisite for any GEO strategy.
Modern AI search engines mix two sources of information: what the model learned during training, and what it can fetch from the live web. Most users don't notice the seam. Most GEO writers don't notice it either. But the seam is exactly where citation behaviour gets weird, and understanding it is the prerequisite for any serious GEO strategy.
The two layers
An LLM-based search engine — ChatGPT with browsing, Claude with web search, Perplexity, AI Overviews — answers from two distinct pools:
- Trained knowledge — facts encoded in the model's weights during pre-training. This pool is large (trillions of tokens) but frozen at the model's knowledge cutoff date.
- Retrieved knowledge — pages the model fetched in response to the current query, parsed, and used as ground truth.
When you ask "what's the population of France?", the model can answer from either pool. When you ask "what was announced at OpenAI DevDay 2026?", only the retrieved pool can answer correctly.
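The split between the two pools can be sketched as a date comparison. This is a minimal illustration, not how any engine is actually implemented: the function name `pools_that_can_answer` and its parameters are hypothetical, the cutoff is the GPT-5.2 value from the table in the next section, and January 1, 2026 stands in as the earliest possible date for any 2026 event.

```python
from __future__ import annotations

from datetime import date
from typing import Optional


def pools_that_can_answer(
    fact_date: Optional[date],
    cutoff: date,
    retrieval_available: bool,
) -> list[str]:
    """Return which knowledge pools could hold a fact (hypothetical helper).

    fact_date is None for an evergreen fact with no meaningful date.
    """
    pools = []
    # Trained knowledge only covers facts that existed at or before the cutoff.
    if fact_date is None or fact_date <= cutoff:
        pools.append("trained")
    # Retrieved knowledge covers anything the live web can supply.
    if retrieval_available:
        pools.append("retrieved")
    return pools


CUTOFF = date(2025, 8, 31)  # GPT-5.2 cutoff as listed in this article

# An evergreen fact: either pool can answer.
evergreen = pools_that_can_answer(None, CUTOFF, retrieval_available=True)

# Any 2026 event happened on or after 2026-01-01, so only retrieval can answer.
post_cutoff = pools_that_can_answer(date(2026, 1, 1), CUTOFF, retrieval_available=True)

# The same event with browsing off: no pool holds the answer.
no_browsing = pools_that_can_answer(date(2026, 1, 1), CUTOFF, retrieval_available=False)
```

The last case is the one to watch: with retrieval off, a post-cutoff question has no correct source at all.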
Knowledge cutoffs by model (as of 2026-05-31)
| Model | Knowledge cutoff | Live web access |
|---|---|---|
| Claude Opus 4.8 (Anthropic) | January 2026 | Yes — via web-search tool |
| Claude Sonnet 4.6 (Anthropic) | August 2025 | Yes — via web-search tool |
| GPT-5.2 (OpenAI) | August 31, 2025 | Yes — via browsing/search |
| Gemini 3 Pro (Google) | January 2025 | Yes — via Search Grounding |
Knowledge cutoffs as of 2026-05-31, from vendor model cards. The frontier lineup changes often (OpenAI already ships 5.3/5.4/5.5; Google ships Gemini 3.5 Flash) — treat this as a dated snapshot, not a standing list.
A meaningful detail: continuous ingestion vs. discrete cutoffs. Perplexity and Gemini's grounding layer are designed to ingest fresh content constantly, but that freshness lives in the retrieval layer, not in the model's weights. ChatGPT and Claude have discrete training runs, so their pre-training knowledge is bounded by a specific date. Even when browsing is on, anything they "know" without retrieving was true only as of that cutoff.
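To make the gap concrete, the cutoffs in the table above can be turned into a staleness figure in whole months, measured against the snapshot date. This is a rough sketch under one assumption: cutoffs given only as a month are pinned to the first day of that month. The names `CUTOFFS` and `months_between` are illustrative.

```python
from datetime import date

SNAPSHOT = date(2026, 5, 31)  # the "as of" date of the table

# Cutoffs from the table. Month-only values are pinned to day 1 (an assumption).
CUTOFFS = {
    "Claude Opus 4.8": date(2026, 1, 1),
    "Claude Sonnet 4.6": date(2025, 8, 1),
    "GPT-5.2": date(2025, 8, 31),
    "Gemini 3 Pro": date(2025, 1, 1),
}


def months_between(start: date, end: date) -> int:
    """Whole calendar months from start to end, ignoring the day of month."""
    return (end.year - start.year) * 12 + (end.month - start.month)


gaps = {model: months_between(cutoff, SNAPSHOT) for model, cutoff in CUTOFFS.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap} months behind the snapshot date")
```

On these numbers the gap runs from 4 months (Claude Opus 4.8) to 16 months (Gemini 3 Pro), which is the "months, sometimes a year or more" range described at the top of the article.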
Where the seam shows up
Disagreements between memory and web
If the model "remembers" something from training and the web disagrees, behaviour varies:
- ChatGPT prefers retrieved content for time-sensitive facts; defers to memory for general explanations.
- Claude is more conservative — it will often cite the retrieved source even if memory would have given a more polished phrasing.
- Perplexity treats memory as effectively zero — everything must come from a retrieved source.
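The three tendencies above can be written down as a toy decision table. This only encodes the behaviour as described in this article. It is not the engines' actual logic, which is not public, and "often" is flattened to "always" for simplicity. The function `preferred_pool` and its parameters are hypothetical.

```python
ENGINES = {"chatgpt", "claude", "perplexity"}


def preferred_pool(engine: str, time_sensitive: bool, retrieved: bool) -> str:
    """Toy model: which pool an engine leans on, per this article's description.

    retrieved is True when the engine fetched at least one page for the query.
    Returns "memory", "retrieved", or "none".
    """
    if engine not in ENGINES:
        raise ValueError(f"unknown engine: {engine}")
    if engine == "perplexity":
        # Memory is treated as effectively zero: no source, no answer.
        return "retrieved" if retrieved else "none"
    if not retrieved:
        # Nothing was fetched, so memory is the only option.
        return "memory"
    if engine == "claude":
        # Described as conservative: cites the retrieved source.
        return "retrieved"
    # ChatGPT: retrieved content for time-sensitive facts, memory otherwise.
    return "retrieved" if time_sensitive else "memory"
```

For example, `preferred_pool("chatgpt", time_sensitive=False, retrieved=True)` returns `"memory"`, while the same call for `"claude"` returns `"retrieved"`.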
Stale training data citing itself
This is the failure mode that matters most for GEO. If the model's training corpus included a now-outdated guide written by your competitor, the model will prefer to "remember" their take rather than retrieve your fresher one — unless you explicitly trigger retrieval (with a fresh-news query) or your new article is clearly stronger.
Recency cliff
For queries that don't trigger retrieval, the model gives you the freshest answer it has, and that answer is only as fresh as the training cutoff. There's a noticeable cliff: claims more recent than the cutoff aren't represented at all.
Practical implications for GEO
- Aim for both pools. Your goal is to be cited and to be remembered. Being cited helps now; being remembered helps in two years, when the next model is trained.
- Date everything. A clear `datePublished` tells the engine "this is fresh" and biases it toward retrieval.
- Be the canonical source for an evergreen claim. If your article is the cited source for "GEO" in the corpus that GPT-6 trains on, every GPT-6 answer about GEO references you, even when browsing is off.
- Avoid outdated specifics in evergreen articles. Don't say "ChatGPT 4o is the latest". Say "as of [date], the most capable model is X". Models that remember the article will remember the dated claim, not the now-stale specifics.
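One common way to expose `datePublished` is schema.org `Article` markup embedded as JSON-LD. The sketch below generates such a block. The helper name `article_jsonld` is hypothetical, the dates come from this article's changelog, and whether any particular engine reads this markup is an assumption rather than something established here.

```python
import json
from datetime import date


def article_jsonld(headline: str, published: date, modified: date) -> str:
    """Build schema.org Article JSON-LD with explicit ISO 8601 dates."""
    return json.dumps(
        {
            "@context": "https://schema.org",
            "@type": "Article",
            "headline": headline,
            "datePublished": published.isoformat(),
            "dateModified": modified.isoformat(),
        },
        indent=2,
    )


markup = article_jsonld(
    "Knowledge Cutoff, Web Access, and Why It Matters",
    published=date(2026, 5, 31),
    modified=date(2026, 5, 31),
)

# Wrap the JSON in the script tag that goes in the page's HTML.
print(f'<script type="application/ld+json">\n{markup}\n</script>')
```

Keeping `dateModified` alongside `datePublished` means a later revision can be dated without rewriting the original publication date.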
What's still unknown
- The exact contribution of any given URL to a model's training corpus.
- The decay function — how long after a citation drops out of the live web does it stop being "remembered"?
- Whether retrieval-first, RAG-style architectures (Perplexity) show fundamentally different citation behaviour over time.
See also
- What is GEO? — the discipline this article is a prerequisite for.
- GEO vs AEO vs LLMO vs SGE — why the no-browsing case (answering from the training anchor alone) is where LLMO and GEO pull apart.
- llms.txt: spec, 100-domain audit, and setup — one concrete infrastructure signal that helps the retrieval layer find you.
- More in the Foundations pillar — the theory and mechanics of AI search.
- How ChatGPT Decides Which Source to Cite
Changelog
- Published — 31 May 2026
- Updated — 31 May 2026
- Last reviewed — 31 May 2026
Editorial
Independent publication on Generative Engine Optimization. Primary research on how AI search engines retrieve, rank, and cite.