Photorealistic visualization of a neural network pipeline: data flowing from web crawlers through vector embedding layers into a ranked results interface, rendered in deep blue and electric teal on dark background

How AI Search Engines Work: The Retrieval and Ranking Pipeline Explained

2026-07-01 By Tim Francis 14 min read

How do AI search engines like ChatGPT, Perplexity, and Google AI Overviews actually find and rank content?

AI answer engines run a five-stage pipeline: crawl and index public web content, retrieve candidate passages using hybrid keyword plus vector search, rerank by relevance and authority, select sources for citation, then synthesize a final answer. Each stage filters what gets cited — and most ranking weights are not publicly disclosed.

Most people optimizing for AI search are working backward from outputs — they see a Perplexity answer cite a competitor and wonder what they did wrong. The question they should be asking is not 'what did that page say?' but 'how did this engine find, evaluate, and select that page in the first place?' The pipeline is what matters. If you don't understand the mechanics, you are optimizing by guesswork.

I run an AEO agency that tracks citation patterns across ChatGPT Search, Perplexity, Claude, and Google AI Overviews. What I can tell you is that these engines are not all the same — Google AI Overviews sits inside the traditional search stack and inherits Googlebot's index; ChatGPT Search and Perplexity run their own crawlers and retrieval pipelines; Claude blends live web fetch with internal knowledge. The mechanics differ in meaningful ways, and conflating them is one of the most common mistakes I see from site owners trying to rank.

This post breaks down every stage of the pipeline — crawling and indexing, retrieval, reranking, citation selection, and synthesis — using what is publicly documented, what we can infer from observed behavior, and where the honest answer is that the engines simply have not told us. I will flag each distinction clearly. There is no point pretending we have more visibility into these systems than we do.

What AI-Specific Crawlers Are Visiting Your Site Right Now?

At minimum, seven distinct AI crawler tokens may apply to your site: GPTBot and OAI-SearchBot from OpenAI, PerplexityBot and Perplexity-User from Perplexity, ClaudeBot and Claude-SearchBot from Anthropic, and Google-Extended from Google. Each has a separate robots.txt token, a separate purpose, and separate controls. Most site owners are being crawled by several of these without realizing it.

OpenAI publishes clear documentation on their crawler roster. GPTBot (user-agent: GPTBot/1.3) is the training crawler — it feeds OpenAI's foundation models. OAI-SearchBot (user-agent: OAI-SearchBot/1.3) is the retrieval crawler that powers ChatGPT Search citations specifically. They are independent: blocking GPTBot opts you out of training but has no effect on ChatGPT Search visibility; blocking OAI-SearchBot removes you from ChatGPT Search answers while leaving training unaffected. OpenAI confirms each setting works independently, and the published IP ranges are available at openai.com/searchbot.json and openai.com/gptbot.json for server-level verification.
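Server-level verification against published IP ranges can be sketched as follows. This is a minimal illustration, not OpenAI's tooling: the JSON layout (a "prefixes" list with "ipv4Prefix" or "ipv6Prefix" keys) is an assumption about the published files, and the sample ranges are reserved documentation addresses, not real crawler IPs. The helper names are hypothetical.

```python
import ipaddress
import json

# Assumed shape of a published ranges file; the CIDR blocks below are
# reserved documentation ranges used purely as placeholders.
SAMPLE_RANGES_JSON = """
{
  "prefixes": [
    {"ipv4Prefix": "192.0.2.0/24"},
    {"ipv4Prefix": "198.51.100.0/28"}
  ]
}
"""


def load_networks(ranges_json):
    """Parse the ranges file into ip_network objects."""
    data = json.loads(ranges_json)
    networks = []
    for entry in data.get("prefixes", []):
        cidr = entry.get("ipv4Prefix") or entry.get("ipv6Prefix")
        if cidr:
            networks.append(ipaddress.ip_network(cidr))
    return networks


def is_published_crawler_ip(ip, networks):
    """True if the request IP falls inside any published range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)


networks = load_networks(SAMPLE_RANGES_JSON)
inside = is_published_crawler_ip("192.0.2.77", networks)
outside = is_published_crawler_ip("203.0.113.5", networks)
```

A request that claims a crawler user agent but arrives from an address outside the published ranges is a candidate for spoofing, which is why the user-agent string alone is a weak signal.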

Perplexity's official crawler documentation lists PerplexityBot as the indexing crawler — it is 'designed to surface and link websites in search results on Perplexity' and is explicitly not used for foundation model training. Perplexity-User is the real-time fetcher that fires when a user asks a question; because a human initiated that request, it generally ignores robots.txt rules, meaning robots.txt alone cannot fully suppress real-time Perplexity access. IP ranges are published at perplexity.com/perplexitybot.json.

Anthropic runs ClaudeBot for training data collection, Claude-User for user-initiated fetches, and Claude-SearchBot for improving search result quality. Anthropic updated their crawler documentation in February 2026 to clarify that disabling Claude-SearchBot 'prevents our system from indexing your content for search optimization, which may reduce your site's visibility and accuracy in user search results.'

Google-Extended is different in kind: it is a robots.txt control token rather than a crawler with its own user-agent string. According to Google's crawler documentation, it governs whether content Google has already crawled may be used for Gemini model training and grounding, and it does not affect inclusion in Google Search. For Google AI Overviews specifically, pages do not need any special access; Googlebot handles it through the standard search index. Any page eligible for a snippet in Google Search is eligible to appear in AI Overviews, according to Google Search Central documentation.

How Does robots.txt Actually Control Your AI Search Visibility?

Robots.txt remains the primary crawl-control lever, but it only works at the origin server layer. CDN-level and WAF-level bot-blocking rules — especially Cloudflare's 'Block AI Bots' toggle — will override robots.txt entirely. A site that has allowed every AI crawler in robots.txt but has a WAF rule blocking bot user agents is still invisible to those engines.

The practical robots.txt decision breaks into two separate choices per engine: training crawler versus search/citation crawler. For OpenAI, most publishers land on allowing OAI-SearchBot (for ChatGPT citation eligibility) while disallowing GPTBot (to stay out of model training). For Perplexity, allowing PerplexityBot is straightforward — blocking it removes you from Perplexity's indexed corpus, and Perplexity has clarified that PerplexityBot only crawls in compliance with robots.txt. For Anthropic, allowing Claude-SearchBot keeps you eligible for Claude's search-surface results.
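The training-versus-search split described above can be checked before deploying a robots.txt change. The sketch below uses Python's standard-library parser on an example policy; the policy itself and the example.com URL are illustrative assumptions, not a recommendation for every site, and real crawlers may interpret edge cases differently from this parser.

```python
from urllib.robotparser import RobotFileParser

# Example policy: opt out of OpenAI training, stay eligible for search citation.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: *
Allow: /
"""


def crawler_access(robots_txt, agents, url):
    """Report which user-agent tokens may fetch the URL under this policy."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


access = crawler_access(
    ROBOTS_TXT,
    ["GPTBot", "OAI-SearchBot", "PerplexityBot", "Claude-SearchBot"],
    "https://example.com/pricing",
)
```

Here `access` maps GPTBot to False and the three search crawlers to True. This only validates the robots.txt layer; it says nothing about rules applied at a CDN or WAF.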

The gap most sites miss is the CDN layer. Cloudflare added an 'AI Scrapers and Crawlers' managed rule set, and many hosts have toggled it on by default. That rule blocks GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, and others at the network edge — before your robots.txt is ever read. If you updated your robots.txt to allow these crawlers but are still not getting indexed, audit your Cloudflare WAF custom rules and Security Events log filtered by user agent. Add explicit Allow rules for the specific crawlers you want through. Robots.txt is advisory at the origin; it is invisible to the CDN.
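One way to spot an edge-level block is to tally response codes per AI crawler from your own request logs. The sketch below assumes you have already extracted (user agent, status code) pairs; the sample records and user-agent strings are made up for illustration, and treating 401, 403, and 429 as "blocked" is an assumption about how a WAF would respond.

```python
from collections import Counter

# Tokens taken from the crawlers named in this article.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "PerplexityBot", "Perplexity-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
]

# Hypothetical log extract: (user agent, HTTP status).
SAMPLE_REQUESTS = [
    ("Mozilla/5.0 (compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot)", 403),
    ("Mozilla/5.0 (compatible; OAI-SearchBot/1.3; +https://openai.com/searchbot)", 403),
    ("Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)", 200),
    ("Mozilla/5.0 (compatible; GPTBot/1.3; +https://openai.com/gptbot)", 403),
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/126.0 Safari/537.36", 200),
]


def crawler_token(user_agent):
    """Return the first known AI crawler token found in the user agent."""
    ua = user_agent.lower()
    for token in AI_CRAWLER_TOKENS:
        if token.lower() in ua:
            return token
    return None


def summarize(requests):
    """Count served versus blocked responses for each AI crawler."""
    summary = {}
    for user_agent, status in requests:
        token = crawler_token(user_agent)
        if token is None:
            continue  # ordinary browser or unrelated bot
        bucket = "blocked" if status in (401, 403, 429) else "served"
        summary.setdefault(token, Counter())[bucket] += 1
    return summary


summary = summarize(SAMPLE_REQUESTS)
```

A crawler that is allowed in robots.txt but shows only blocked responses in a tally like this is the pattern described above: the block is happening before the origin is reached.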

One caveat worth stating plainly: even after you allow all the right crawlers, getting indexed is not the same as getting cited. PerplexityBot may crawl a page and decide it is not relevant enough to surface. OAI-SearchBot may index a domain and never pull from it in a given query session. Crawler access is the floor, not the ceiling. The ranking and retrieval layers on top of the index are what determine citation — and those layers have their own logic that crawl access alone cannot influence.

How Do AI Answer Engines Build and Maintain Their Index?

AI answer engines build indexes differently depending on their architecture. Perplexity and ChatGPT Search maintain live, continuously refreshed indexes from their own crawlers. Google AI Overviews draws from the same Google Search index that Googlebot builds, with no separate indexing pipeline. Claude blends static training knowledge, fixed at its knowledge cutoff, with live web retrieval at query time via Claude-User fetches.

For Perplexity and ChatGPT Search, the indexing pipeline has recognizable components borrowed from traditional search: URL discovery via sitemaps and link graphs, crawl scheduling prioritized by freshness signals and query demand, content parsing and extraction, and chunk-level storage for retrieval. What makes AI search indexes different is that they store not just URLs and keyword co-occurrence data, but also vector embeddings of content passages — numerical representations of meaning that allow semantic matching at query time. This dual-index architecture (keyword plus vector) is central to how retrieval works and will be covered in the next section.
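Chunk-level storage with a keyword index can be sketched in a few lines. This is a toy under stated assumptions: the chunk size and overlap are arbitrary (no engine publishes theirs), tokenization is plain whitespace splitting, and the vector side is omitted because it needs an embedding model. In a real dual index, each chunk id would also map to an embedding.

```python
from collections import defaultdict


def chunk_words(text, size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_keyword_index(chunks):
    """Inverted index: term -> set of chunk ids containing that term."""
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for term in chunk.lower().split():
            index[term].add(chunk_id)
    return index


text = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
chunks = chunk_words(text, size=4, overlap=1)
keyword_index = build_keyword_index(chunks)
```

With these toy settings the ten words become three chunks, and the word at each chunk boundary appears in two of them, which is the point of overlapping windows: a sentence that straddles a boundary is still retrievable as a unit.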

Freshness matters differently across engines. Perplexity's documentation indicates PerplexityBot crawls on a schedule driven by query demand — pages with high search relevance get crawled more frequently. ChatGPT Search processes robots.txt changes within approximately 24 hours according to OpenAI's documentation. Google's index freshness for AI Overviews is tied to Googlebot's standard recrawl schedule, which means pages with strong crawl budgets and frequent updates get reflected more quickly in AI Overviews responses. There is no dedicated 'AI Overviews fast lane' — it uses the same infrastructure.

The honest limitation here: none of these engines publish the full specification of their indexing systems. What I know about vector embedding dimensions, chunk sizes, or deduplication logic at Perplexity or OpenAI comes from inference — watching which content gets cited, testing structured versus unstructured pages, and reading what engineers share publicly. The RAG architecture described in academic and industry literature (Lewis et al., 2020; IBM Research; Microsoft Azure AI documentation) is the best public model we have for how these systems work under the hood, and it is a reasonable approximation — but not a confirmed blueprint for any specific commercial engine.

What Is Hybrid Retrieval and Why Does It Determine What Gets Cited?

Hybrid retrieval combines two distinct search methods: sparse keyword retrieval (BM25) that matches exact terms, and dense vector retrieval that matches semantic meaning through embeddings. Running both in parallel and fusing the scores — typically via Reciprocal Rank Fusion — means a passage can be retrieved because it uses the right terminology, because it means the right thing, or both. Content that excels at only one often loses to content strong at both.

BM25 is a statistical relevance model that scores documents based on term frequency and inverse document frequency — how often a query term appears in a passage, weighted by how rare that term is across the whole corpus. It is excellent at exact-match precision but blind to meaning. If your page says 'cost per click' and the query says 'how much does paid search cost per visit,' BM25 may miss the connection. Dense vector retrieval encodes both the query and each indexed passage as high-dimensional vectors using embedding models; relevance is measured by cosine similarity or dot product in that embedding space, capturing semantic relationships that keyword overlap misses.
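The contrast between the two methods is easier to see with numbers. The sketch below is a minimal Okapi BM25 scorer (with the common defaults k1 = 1.5 and b = 0.75 and a non-negative IDF variant) plus a cosine similarity helper. The two-dimensional vectors are hand-made stand-ins, not output from any real embedding model, and the documents and queries are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / n
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact match, so no contribution
            idf = math.log((n - df[term] + 0.5) / (df[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


docs = ["cost per click for search ads", "weather forecast for the weekend"]
scores_exact = bm25_scores("cost per click", docs)
scores_paraphrase = bm25_scores("ppc pricing", docs)  # same intent, no shared terms

# Hand-made vectors standing in for embeddings of the paraphrased query and both docs.
query_vec, ad_doc_vec, weather_doc_vec = [0.9, 0.1], [0.8, 0.2], [0.1, 0.9]
sim_ad = cosine(query_vec, ad_doc_vec)
sim_weather = cosine(query_vec, weather_doc_vec)
```

The exact-term query scores the ads document above zero, while the paraphrase scores zero on both documents because BM25 has nothing to match. The vector comparison still prefers the ads document, which is the gap dense retrieval is meant to cover.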

Reciprocal Rank Fusion (RRF) is the standard fusion mechanism: each passage earns a score of 1/(k + rank) from every list it appears in, where k is a smoothing constant (60 is a common default), and those per-list scores are summed. A passage that ranks at position 3 on BM25 and position 4 on vector search will outscore a passage that ranks at position 1 on BM25 but is absent from the top vector results. RRF does not subtract anything for a weak showing; a passage missing from one list simply earns nothing from it, which is enough to drop it behind passages that place reasonably well in both. That is why content that is both terminologically precise (uses the right vocabulary) and semantically rich (covers the concept thoroughly) has a structural advantage in AI retrieval. The Microsoft Azure AI documentation on RAG techniques and the Redis blog on hybrid search both describe this dual-index, parallel-query approach as a standard production pattern.
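The position 3 and position 4 example can be worked through directly. This is a minimal sketch of RRF with k = 60, a commonly used default; the passage ids are placeholders, and nothing here reflects the fusion settings of any specific commercial engine.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each appearance adds 1 / (k + rank) to a passage's score."""
    scores = {}
    for ranking in rankings:
        for rank, passage_id in enumerate(ranking, start=1):
            scores[passage_id] = scores.get(passage_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# "b" is 3rd on BM25 and 4th on vector search; "a" is 1st on BM25 only.
bm25_ranking = ["a", "x", "b", "y"]
vector_ranking = ["p", "q", "r", "b"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
fused_scores = dict(fused)
```

Passage "b" scores 1/63 + 1/64, roughly 0.0315, while "a" scores 1/61, roughly 0.0164, so the passage that places modestly in both lists ends up first.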

The practical implication for content is this: thin pages that contain a target keyword but do not substantively explain the topic will rank on BM25 but fall out of the vector results. Dense conceptual content that never uses the specific query terms will rank on vector search but miss on BM25. The pages that get cited consistently are the ones that use precise vocabulary and cover the topic in depth — which, not coincidentally, describes what expert-level content has always looked like. The retrieval mechanics validate the content strategy, not the other way around.

How Do AI Engines Rerank Candidate Passages Before Generating an Answer?

After hybrid retrieval produces a pool of candidate passages — typically 10 to 50 — a reranker scores each passage against the specific query with a dedicated cross-encoder model. The reranker is more expensive to run than the initial retrieval but more accurate; it can read both the query and the passage together and assign a relevance score that accounts for nuance, specificity, and query intent. Passages that survive reranking are the ones that reach the synthesis stage.

The reranker is architecturally separate from the retrieval step. Retrieval is optimized for speed — it must scan millions of indexed passages in milliseconds, so it uses approximate nearest-neighbor search and lightweight scoring functions. The reranker is optimized for accuracy — it takes the short list from retrieval and uses a heavier model (typically a cross-encoder trained on relevance judgments) to produce a more precise score. This two-stage architecture is well-documented in the academic and industry RAG literature. Pinecone's RAG documentation and Wikipedia's RAG entry both describe the reranking step as a standard component of production retrieval pipelines.
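The two-stage shape can be sketched without any model at all. In the toy below, the first stage uses cheap term overlap and the second stage calls a pluggable pair scorer; `toy_pair_score` is a hypothetical stand-in for a cross-encoder (it simply rewards passages whose tokens are mostly query terms), and the passages are invented. A production system would use approximate nearest-neighbor search and a trained model instead.

```python
def retrieve(query, passages, top_k):
    """Stage 1: cheap scoring (shared-term count) over the whole corpus."""
    q_terms = set(query.lower().split())
    scored = [(len(q_terms & set(p.lower().split())), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [p for _, p in scored[:top_k]]


def rerank(query, candidates, score_pair, top_n):
    """Stage 2: expensive scoring, applied only to the short list."""
    ordered = sorted(candidates, key=lambda p: score_pair(query, p), reverse=True)
    return ordered[:top_n]


def toy_pair_score(query, passage):
    """Stand-in for a cross-encoder: fraction of passage tokens that are query terms."""
    q_terms = set(query.lower().split())
    tokens = passage.lower().split()
    return sum(t in q_terms for t in tokens) / len(tokens)


diffuse = ("our long company history page mentions robots.txt and crawler "
           "access somewhere among many unrelated topics")
focused = "robots.txt controls crawler access"
passages = [diffuse, focused, "pricing plans"]

query = "robots.txt crawler access"
candidates = retrieve(query, passages, top_k=2)
best = rerank(query, candidates, toy_pair_score, top_n=1)
```

Both the diffuse and the focused passage survive retrieval because they share the same three query terms, but the second stage promotes the focused one. That is the division of labor: the first stage decides what is eligible, the second decides what is best.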

What signals influence reranking is where public documentation runs out. For open-source cross-encoder rerankers, the signals are primarily query-passage relevance and semantic similarity. For commercial engines like Perplexity or ChatGPT Search, the reranking model is proprietary — we do not know the exact feature set. From observed behavior in our portfolio, I can say that pages with clear structural answers (H2-organized, lead answers before supporting detail), explicit query-relevant terminology, and clean HTML tend to get cited over pages where the relevant information is buried in long paragraphs or rendered in JavaScript. Whether that is a reranking signal or a parsing preference upstream in the pipeline, I cannot say with certainty.

Authority signals are likely incorporated somewhere in the ranking pipeline — either at the index level (domain authority, link graph data), the retrieval level (passage scores weighted by domain reputation), or at reranking. Perplexity has stated publicly that its system is not a pure LLM — it uses real-time retrieval and cites sources, and the sources it selects tend to skew toward established publishers and domains with demonstrable topical authority. But the exact weighting of domain authority versus passage relevance versus freshness in the final reranking score is not disclosed. I am being honest with you: we infer this from output patterns, not from a published ranking specification.

How Does Citation Selection Work — Why Does One Page Get Cited Instead of Another?

Citation selection happens after reranking: the top passages are evaluated for whether they contain information the synthesis model will use in the answer. A page can rank highly on retrieval and reranking but still not be cited if its passage overlaps with a higher-authority source saying the same thing, or if the synthesis model simply does not use that passage in the final answer. Citation is a downstream event of synthesis, not a separate scoring step.

This is the part of the pipeline that most site owners misunderstand. They optimize for 'getting indexed' or 'ranking well' without realizing that citation happens at the generation stage. The LLM synthesizing the answer attaches inline citations to whichever passages it draws from, typically through claim-level grounding, where the model references the passage it used for a specific claim. If the model paraphrases two sources but only quotes one, only the quoted source may receive a citation. The relationship between 'retrieved and reranked' and 'actually cited' is not one-to-one.
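The gap between "retrieved" and "cited" can be measured if you have the answer text and the ordered source list. The sketch below assumes bracketed numeric markers such as [1]; engines differ in how they render citations, so treat that format, the sample answer, and the example.com URLs as assumptions.

```python
import re


def split_cited(answer, sources):
    """Separate sources referenced by [n] markers from those never cited."""
    cited_numbers = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    cited, uncited = [], []
    for number, url in enumerate(sources, start=1):
        (cited if number in cited_numbers else uncited).append(url)
    return cited, uncited


sources = [
    "https://example.com/short-definition",
    "https://example.com/long-guide",
    "https://example.com/stats-page",
]
answer = (
    "Hybrid retrieval combines keyword and vector search [1]. "
    "Most engines fuse the two result lists before reranking [1][3]."
)
cited, uncited = split_cited(answer, sources)
```

In this made-up answer the long guide was in the grounding set but received no marker, which is the "used for context, not cited" outcome described above.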

From a practical optimization standpoint, the content formats most likely to survive into citation are: direct, standalone answers to specific questions; lists and steps that can be quoted with attribution; specific statistics or claims with clear sourcing; and definitions or explanations of technical concepts that are harder to paraphrase without direct reference. Long-form explanatory content often gets retrieved and used for context without being cited — the LLM uses it to understand the topic but cites the shorter, more quotable passage from a different source for the final answer. This is a structural tension in how we think about AEO content.

Google AI Overviews works differently from LLM chat engines on this point. In AI Overviews, the supporting links shown alongside the answer are selected through Google's standard relevance systems — they are not necessarily the sources the AI text was 'generated from' in a RAG sense. Google uses a 'query fan-out' technique, issuing multiple related searches to develop a response, and the links shown are the most relevant results to those sub-queries. This is closer to traditional search result selection than to RAG-style passage grounding. It means the path to appearing as an AI Overviews source is, at its core, ranking well in Google Search on the subtopics the query fans out to — which is different from the passage-level grounding logic in Perplexity or ChatGPT Search. Understanding that distinction changes your optimization strategy significantly. See our posts on what is answer engine optimization and how to rank in Google AI Overviews for the tactical implications.

How Does Answer Synthesis Work and How Does Attribution Get Assigned?

The synthesis stage takes the top reranked passages and constructs an answer using an instruction-tuned LLM. The model writes the response while grounding each claim in a retrieved passage, then inserts citations. Attribution quality varies: some engines cite precisely at the sentence level, others group multiple claims under a single source, and some hallucinate citations to passages that technically support an adjacent but not identical claim.

In a standard RAG pipeline, the retrieved passages are prepended to the user query in the LLM's context window as 'grounding documents.' The LLM is then instructed to generate an answer based on those documents and to cite its sources. The quality of attribution depends on how the citation instruction is implemented — whether the model is trained to cite at the claim level or the paragraph level, whether it verifies that cited text actually supports the claim, and whether it defaults to confident generation when the retrieved passages are ambiguous. Commercial engines like Perplexity invest significantly in improving citation accuracy, but no engine gets it right 100 percent of the time. I have seen Perplexity cite a source that contains information adjacent to but not identical to the cited claim, and I have seen ChatGPT Search cite a page that has since been updated and no longer supports the original claim.
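The grounding step can be sketched as plain prompt assembly. The wording of the instruction, the numbering scheme, and the passage fields ("url", "text") are all assumptions for illustration; commercial engines do not publish their prompts, and this is not any engine's actual template.

```python
def build_grounded_prompt(question, passages):
    """Place numbered source passages ahead of the question, with a citation instruction."""
    blocks = [
        f"[{number}] {passage['url']}\n{passage['text']}"
        for number, passage in enumerate(passages, start=1)
    ]
    return (
        "Answer the question using only the numbered sources below. "
        "Cite the source number in brackets after each claim. "
        "If the sources do not contain the answer, say so.\n\n"
        "Sources:\n" + "\n\n".join(blocks) + "\n\n"
        f"Question: {question}\nAnswer:"
    )


passages = [
    {"url": "https://example.com/a", "text": "OAI-SearchBot powers ChatGPT Search citations."},
    {"url": "https://example.com/b", "text": "GPTBot collects data for model training."},
]
prompt = build_grounded_prompt("Which crawler affects ChatGPT Search visibility?", passages)
```

Everything the model can cite has to fit in this context, so each passage needs to make sense on its own. A passage that depends on surrounding page content loses that content at this step.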

The synthesis model also determines answer length and scope. For a narrow factual query, the top one or two passages may be sufficient and only those sources get cited. For a broad research query, the engine fans out across more passages and citations multiply. Perplexity tends toward citation-heavy answers with 5 to 8 sources; ChatGPT Search is more selective; Google AI Overviews typically shows 3 to 5 supporting links. These are observable patterns from running queries across our portfolio, not disclosed parameters. The actual mechanisms for how many sources to cite and at what threshold are proprietary.

One structural point worth understanding: synthesis is lossy. The retrieved passages contain more information than the synthesized answer reflects. The model compresses, paraphrases, and selects. This means that even the pages the engine draws from most heavily may not show up as the top-cited sources — the engine may rely on their content for accuracy while attributing the final wording to a more quotable passage from a different source. For a content creator, the implication is that driving citation requires being the most quotable, cleanest, most direct source — not just the most comprehensive. Our content strategy for AEO operates explicitly on this principle.

How Is Google AI Overviews Different from LLM Answer Engines Like ChatGPT and Perplexity?

Google AI Overviews sit inside Google Search infrastructure and use Googlebot's existing index — no separate AI crawler needed. ChatGPT Search and Perplexity run independent crawlers, independent indexes, and independent retrieval pipelines. Optimizing for one does not automatically optimize for the others, though strong fundamentals — authoritative content, crawlable structure, clear direct answers — transfer across all of them.

The Google AI Overviews system is documented in Google Search Central. Google's own documentation states that to be shown as a supporting link in AI Overviews, a page must be indexed, eligible to be shown in Google Search with a snippet, and fulfilling Search's technical requirements. There are no additional technical requirements beyond standard search eligibility. This means the path into AI Overviews runs through traditional SEO fundamentals: crawlability, indexation, snippet eligibility, and relevance for the query. Google also uses the query fan-out technique — issuing multiple related sub-queries during response generation — which means appearing in AI Overviews for a complex topic requires ranking across the subtopics that topic fans out to, not just the primary keyword.

ChatGPT Search, by contrast, is powered by OAI-SearchBot's separate index and OpenAI's own retrieval and synthesis stack. There is no dependency on Google's infrastructure. A page that ranks well in Google Search may or may not appear in ChatGPT Search depending on whether OAI-SearchBot has crawled it, how it scores in OpenAI's retrieval pipeline, and whether its content is structured in a way that the ChatGPT synthesis model quotes directly. I have seen pages that rank on page one of Google Search consistently fail to appear in ChatGPT Search results — and the root cause is almost always either a CDN-level OAI-SearchBot block or content that is too diffuse for the LLM to extract a quotable answer from.

Perplexity's architecture is similar to ChatGPT Search (independent crawler, independent index, independent synthesis) but with heavier emphasis on real-time web retrieval and more citations per answer. Claude operates differently again: it draws on a large body of training knowledge that stops at a fixed cutoff, and for current-events queries it uses Claude-User to fetch live pages in real time. That means optimizing for Claude citation involves both the static authority signals from training data (links, brand mentions, topical coverage accumulated over time) and the live-page quality signals evaluated at query time. For a deep dive into ranking factors across all these engines, see our AEO ranking factors post.

What Are the 5 Stages Every AI Answer Engine Runs Before Producing a Citation?

Every major AI answer engine — ChatGPT Search, Perplexity, Claude, and Google AI Overviews — runs some version of this five-stage pipeline. The specific implementation differs, but the logical sequence is consistent across all of them. Understanding each stage tells you where your content can win and where it can get filtered out.

  1. Stage 1: Crawl and Index. AI-specific crawlers (OAI-SearchBot, PerplexityBot, Claude-SearchBot, Googlebot for AI Overviews) discover pages via sitemaps, link graphs, and demand signals. Content is parsed, chunked, and stored in a dual index: a keyword (BM25) index for exact-match retrieval and a vector embedding index for semantic retrieval. Crawler access is controlled via robots.txt and CDN/WAF rules; if either layer blocks the crawler, the page is invisible to that engine regardless of content quality.
  2. Stage 2: Query Processing. When a user submits a query, the engine encodes it in the same vector space used for document embeddings, and identifies the key terms for keyword retrieval. For complex queries, some engines run query fan-out, decomposing the query into sub-queries to cover multiple subtopics in parallel. Google AI Overviews explicitly uses this technique.
  3. Stage 3: Hybrid Retrieval. The engine runs the query against both the keyword index (BM25) and the vector index (semantic search) simultaneously. Results from both are fused using Reciprocal Rank Fusion or a similar method to produce a unified ranked list of candidate passages, typically 10 to 50 candidates. Content that performs well on both retrieval methods scores highest.
  4. Stage 4: Reranking. A cross-encoder reranking model scores each candidate passage against the full query in a joint pass. This is more computationally expensive than retrieval but more accurate: the model evaluates query-passage relevance directly rather than relying on pre-computed embeddings. The top 3 to 10 passages from reranking advance to synthesis. Proprietary authority and freshness signals may be applied here; the exact weights are not disclosed.
  5. Stage 5: Synthesis and Citation. An instruction-tuned LLM generates the answer using the top reranked passages as grounding context. As it writes, it attributes claims to specific passages, inserting citations. Citation is not guaranteed for every retrieved passage; the synthesis model only cites passages it actually draws from in the final answer. Pages with direct, quotable answers have higher citation rates than pages with information buried in long prose.

Questions

Frequently asked questions

Does blocking GPTBot prevent my site from appearing in ChatGPT Search?

No. GPTBot and OAI-SearchBot are independent crawlers controlled separately. Blocking GPTBot opts you out of OpenAI's foundation model training only. Your ChatGPT Search visibility depends on OAI-SearchBot, which can be allowed or blocked independently. OpenAI's own documentation confirms each setting works independently of the others.

Does my site need to rank in Google Search to appear in AI Overviews?

It needs to be indexed and snippet-eligible, and in practice it needs to rank. Google's documentation states that to appear as a supporting link in AI Overviews, a page must be indexed and eligible to be shown in Google Search with a snippet. There are no additional technical requirements. Eligibility is only the floor, though: the supporting links are chosen by Google's standard relevance systems, so a page that does not rank for the query or the subtopics it fans out to is unlikely to be shown. Core SEO fundamentals are the prerequisite.

What is retrieval-augmented generation (RAG) and how does it relate to AI search?

RAG is the architecture underlying most AI answer engines. Instead of relying solely on a model's training data, RAG systems retrieve relevant external passages at query time and feed them to the LLM as context before generating an answer. This allows the engine to cite current, specific sources rather than hallucinating from memory. ChatGPT Search, Perplexity, and Claude all use some version of RAG at query time.

Why does my page get retrieved by AI crawlers but never cited in answers?

Getting crawled and indexed is necessary but not sufficient for citation. After retrieval and reranking, the synthesis model only cites passages it actively draws from in the final answer. Pages with diffuse, unfocused content often provide background context to the LLM without earning a citation. Restructuring key answers to be direct, standalone, and quotable — with precise vocabulary and clear evidence — significantly improves citation rates in our experience.

Are the exact ranking weights used by Perplexity or ChatGPT Search public?

No. Neither Perplexity nor OpenAI discloses the specific ranking weights used in retrieval, reranking, or synthesis. We know the general pipeline architecture from published RAG research and infer engine-specific behavior from observed citation patterns. Any claim about specific weights — 'authority accounts for 40% of the reranking score' — is fabricated. We work from documented signals and observable outputs, not confirmed specifications.

How often should I update content to stay fresh in AI search indexes?

Freshness signals vary by engine, and none of them publishes a content recrawl guarantee. OpenAI says ChatGPT Search picks up robots.txt changes within approximately 24 hours, which tells you how quickly crawl controls take effect rather than how quickly updated content is re-read. PerplexityBot crawls on a schedule driven by query demand; established domains may be crawled several times per week. Google's recrawl schedule for AI Overviews matches Googlebot's standard cadence. For time-sensitive topics, updating content and submitting updated sitemaps is the most reliable freshness lever across all engines.

Tim Francis

Founder, SCALZ.AI

Tim Francis is the founder and CEO of SCALZ.AI, an AI search optimization agency headquartered in St. Augustine, Florida. He leads AEO, GEO, and LLM SEO strategy across a 50-state local-SEO site portfolio and is the architect of the SCALZ publishing platform. His work is grounded in live ranking data, not theory. Read more about Tim Francis or see our AI SEO services.
