
We Blocked 12 AI Crawlers. Citations Went Up.
Most sites treat AI bots as a single category. Block them all or let them all through. That is the wrong approach.
AI bots come in three distinct types, each with different purposes, different economics, and different value to your business. When you understand the difference, the strategy becomes obvious: block the ones that take without giving, and welcome the ones that cite you.
We implemented this exact segmentation on Pixelmojo in early 2026. We blocked 12 training crawlers while keeping search and user bots fully open. The result was immediate: our server load dropped, our content stayed protected, and our AI search citations kept growing.
This post walks through the exact strategy, the specific bots in each category, and why this matters more in 2026 than ever before.
The Three Types of AI Bots
Not all AI crawlers are equal. The industry has split into three distinct bot categories, each with its own user agent, purpose, and economics.
- Training bots: Collect content for model weights. No attribution, no traffic back.
- Search bots: Power AI search results. Cite with links. Drive referral traffic.
- User bots: Real-time fetches when a user asks a question. Always cite the source.
Training Bots: Take Everything, Give Nothing
Training bots crawl your entire site to collect content that gets compressed into model weights. Once your content is absorbed into a language model, there is no attribution, no link back, and no way to update or correct what the model learned.
These bots run continuous, automated crawls independent of any user action. They operate like a vacuum cleaner across the web, and the content they collect becomes anonymous training data.
The major training crawlers as of 2026:
- Google-Extended (Gemini and Vertex AI training)
- CCBot (Common Crawl, used by multiple AI companies)
- anthropic-ai (Anthropic model training)
- cohere-ai (Cohere model training)
- Meta-ExternalAgent (Llama model training)
- Applebot-Extended (Apple Intelligence training)
- Bytespider (ByteDance/TikTok)
- Diffbot (web data extraction for AI)
- FacebookBot (Meta content gathering)
- Omgili / Omgilibot (discussion forum scraping)
- img2dataset (image dataset collection)
Search Bots: Cite You With Links
Search bots power the AI search engines that millions of people use daily. When someone asks ChatGPT, Perplexity, or Claude a question, these bots fetch relevant pages in real time and cite them in the answer.
The critical difference: search bots send traffic back to you. Every citation includes a source URL that users can click.
- OAI-SearchBot (powers ChatGPT search results)
- Claude-SearchBot (powers Claude search results)
- PerplexityBot (powers Perplexity AI answers)
- GoogleOther (Google AI Overviews)
User Bots: Real-Time Fetches Per Question
User bots are triggered by individual user actions. When someone pastes a URL into ChatGPT or asks Claude to analyze a specific page, these bots fetch that page on demand.
- ChatGPT-User (user-initiated page fetches in ChatGPT)
- Claude-User (user-initiated page fetches in Claude)
These are the highest-value bots because they represent a human actively seeking your content.
The Economics: Why Training Bots Are a Bad Deal
The numbers tell the whole story. Cloudflare published crawl-to-click ratio data in 2025, showing how many times each provider crawls your site compared to how many referral visits they send back.
Crawl-to-click ratio by provider, measured in crawls per referral visit (Cloudflare, July 2025): training bots consume the most resources while returning the least traffic.
Googlebot, which handles both Google Search and Google's AI features, has a reasonable 14:1 ratio. For every 14 crawls, you get one visit.
But the AI training bots are in a completely different league. OpenAI's training crawlers make approximately 1,700 crawls per referral visit. Anthropic's ratio was 38,000:1 as of July 2025 (down from 286,000:1 in January 2025).
These bots consume significant server resources while returning almost no traffic. They are, from a pure business perspective, a terrible deal.
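To see what those ratios mean in practice, here is a back-of-the-envelope calculation. The ratios are the Cloudflare figures cited above; the daily crawl volume is a made-up illustrative number, not our traffic data:

```python
# Expected referral visits implied by crawl-to-click ratios.
# Ratios: Cloudflare, July 2025 (cited above). daily_crawls is hypothetical.
ratios = {
    "Googlebot": 14,
    "OpenAI training": 1_700,
    "Anthropic training": 38_000,
}
daily_crawls = 10_000  # illustrative crawl volume hitting your site per day

for bot, ratio in ratios.items():
    visits = daily_crawls / ratio
    print(f"{bot:>18}: ~{visits:,.1f} visits per {daily_crawls:,} crawls")
```

At 10,000 crawls, Googlebot's ratio implies roughly 714 visits; Anthropic's training crawlers imply less than one.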
Why Blocking Training Bots Does NOT Hurt Search Citations
This is the most common misconception. Many site owners believe that blocking GPTBot means ChatGPT cannot cite them. That is incorrect.
Each Bot Type Is Independent
OpenAI, Anthropic, and Google have explicitly separated their training and search bots into independent systems:
| Provider | Training Bot | Search Bot | User Bot |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended | GoogleOther | N/A |
| Perplexity | N/A | PerplexityBot | N/A |
Blocking GPTBot has zero effect on OAI-SearchBot or ChatGPT-User. They use separate user agents, separate crawling infrastructure, and separate robots.txt directives. Anthropic has explicitly confirmed that blocking ClaudeBot does not affect Claude-SearchBot or Claude-User.
Similarly, Google has confirmed that blocking Google-Extended has zero impact on Googlebot or search rankings. It only controls whether your content trains Gemini and Vertex AI.
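You can verify this per-agent independence with Python's standard-library robots.txt parser. The rules below are a minimal hypothetical file, not our full configuration:

```python
from urllib.robotparser import RobotFileParser

# Minimal hypothetical robots.txt: block only the training bot.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Each user agent matches only its own group. Blocking GPTBot leaves
# OAI-SearchBot covered by the permissive default (*) group.
print(rp.can_fetch("GPTBot", "https://example.com/blogs/post"))         # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/blogs/post"))  # True
```

The same group-matching logic is why a `Disallow` aimed at one user agent never leaks onto another.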
The Prior Training Argument
Even if you block training bots today, your content may already exist in model weights from prior crawls. This actually works in your favor: the model already "knows" about your content, and when users ask related questions, the search bots can fetch your current pages for real-time citation.
You get the benefit of historical training (the model recognizes your brand and expertise) while protecting future content from further training.
The Exact Robots.txt Strategy
Here is the segmentation we implement. Three categories, clearly separated.
How Bot Segmentation Works
1. An AI bot hits your site and checks robots.txt first.
2. robots.txt identifies the bot type: training, search, or user.
3. Training bots get blocked: Disallow: / (except llms.txt).
4. Search bots get full access: crawl, cite, link back to you.
Section 1: Browsing Bots (Selective Access)
```
# AI browsing / search engines - ALLOW full content
# These bots cite with links and drive referral traffic
User-agent: GPTBot
Allow: /blogs/
Allow: /services/
Allow: /projects/
Allow: /about/
Allow: /pricing/
Allow: /tools/
Allow: /vector
Allow: /hive
Allow: /capabilities
Allow: /contact-us
Allow: /llms.txt
Disallow: /api/
Disallow: /admin/
```
Note: We allow GPTBot selective access (not full site) because OpenAI states that if both GPTBot and OAI-SearchBot can access a page, they may combine crawls to avoid duplication. This gives us the benefit of search indexing while limiting training data exposure.
We apply the same Allow/Disallow pattern for ChatGPT-User, ClaudeBot, Claude-Web, GoogleOther, and PerplexityBot.
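A quick sanity check of the selective pattern, again with the standard-library parser (the rules are abbreviated to a few paths; this is a sketch, not our production file):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the GPTBot group above: allow content paths,
# block API and admin paths.
rules = """\
User-agent: GPTBot
Allow: /blogs/
Allow: /llms.txt
Disallow: /api/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/blogs/post"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/api/data"))    # False
```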
Section 2: Training Bots (Blocked)
```
# AI training crawlers - BLOCK entire site
# These collect for model weights with no attribution
User-agent: Google-Extended
Allow: /llms.txt
Disallow: /

User-agent: CCBot
Allow: /llms.txt
Disallow: /

User-agent: anthropic-ai
Allow: /llms.txt
Disallow: /

User-agent: cohere-ai
Allow: /llms.txt
Disallow: /

User-agent: Applebot-Extended
Allow: /llms.txt
Disallow: /

User-agent: Meta-ExternalAgent
Allow: /llms.txt
Disallow: /

User-agent: Bytespider
Allow: /llms.txt
Disallow: /

User-agent: Diffbot
Allow: /llms.txt
Disallow: /

User-agent: FacebookBot
Allow: /llms.txt
Disallow: /
```
Every training bot gets a blanket Disallow: / with one exception: we keep /llms.txt accessible so bots can at least read our AI policy declaration.
Section 3: Standard Crawlers (Normal Rules)
```
# Default rules for all other bots (Google, Bing, etc.)
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /_next/data/
Crawl-delay: 1
```
Regular search engines get standard access with protected admin and API paths. (Note that Crawl-delay is a non-standard directive: Bing honors it, but Google ignores it.)
Beyond Robots.txt: The Policy Layer
Robots.txt is an access control mechanism, but it does not communicate intent. We layer two additional signals on top.
ai-policy.json
A machine-readable policy file at /.well-known/ai-policy.json that explicitly declares our stance:
```json
{
  "browsing": true,
  "indexing": true,
  "training": false,
  "fine_tuning": false,
  "embedding": true,
  "attribution_required": true,
  "rate_limit_rps": 1
}
```
This tells AI systems exactly what we allow: browse and index, yes. Train models on our content, no. While this is not yet a universally enforced standard, it signals clear intent to systems that respect policy files.
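For illustration, here is how a cooperative crawler might read such a policy before deciding whether a page may be used for training. The schema mirrors our file above, and `may_train` is a hypothetical helper, since no parser for this format is standardized yet:

```python
import json

# The policy document from above (the schema is our own, not a ratified standard).
policy = json.loads("""
{
  "browsing": true,
  "indexing": true,
  "training": false,
  "fine_tuning": false,
  "embedding": true,
  "attribution_required": true,
  "rate_limit_rps": 1
}
""")

def may_train(p: dict) -> bool:
    # An absent key means no declared permission, so default to False.
    return bool(p.get("training", False))

print(may_train(policy))         # False
print(policy["rate_limit_rps"])  # 1
```

Defaulting to "no training" when the key is absent keeps the fail-safe on the publisher's side.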
We covered the full implementation of ai-plugin.json and the AI discoverability stack in Part 6 of this series.
llms.txt: Structured Context for Search Bots
Our dynamic llms.txt gives search bots a structured overview of the site, including entity relationships, expertise areas, and key articles. Instead of crawling 46 blog posts to understand what Pixelmojo does, an AI system can read one file and get the full picture.
This is the carrot to robots.txt's stick. We block training bots from scraping, but we give search bots exactly the context they need to cite us accurately. Read more in our llms.txt implementation guide.
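For reference, a minimal llms.txt follows the structure of the llms.txt proposal: an H1 title, a blockquote summary, and H2 sections of annotated links. The entries below are illustrative placeholders, not our actual file:

```
# Example Site

> One-sentence summary of what the site does and who it serves.
> Key topics: AI search optimization, bot segmentation, llms.txt.

## Key Articles

- [AI Bot Segmentation](https://example.com/blogs/bot-segmentation): How to
  block training bots while staying visible in AI search.
```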
Training Bots vs. Search Bots: What You Get
What happens to your content
Absorbed into model weights, no trace
Cited with link back to your site
Traffic you receive
Zero referrals
Direct referral traffic from AI answers
Attribution
None, content is anonymous training data
Source URL in every AI search result
Control over your content
None after crawl, baked into weights
Update content, citations update too
Server resource cost
High (aggressive bulk crawling)
Low (on-demand fetches per query)
What the Industry Gets Wrong
According to Search Engine Journal, 79% of top news sites block at least one AI training bot. Good. But 71% also block retrieval bots. That is a mistake.
Blocking retrieval bots removes you from AI search results. Anthropic states that blocking Claude-SearchBot "may reduce visibility in Claude's search results." OpenAI's documentation makes the same distinction for OAI-SearchBot.
The sites that block everything are choosing invisibility over protection. The correct strategy is selective: block training, allow search.
The Compliance Reality
Robots.txt is voluntary. According to data from multiple sources, 13.26% of AI bot requests ignored robots.txt directives in Q2 2025, up from 3.3% in Q4 2024. For stricter enforcement, you can pair robots.txt with:
- IP/reverse-DNS verification of bot identity
- WAF-level bot controls (Cloudflare AI Crawl Control, Vercel Firewall)
- Rate limiting at the edge
For most sites, robots.txt is sufficient because the major AI companies (OpenAI, Anthropic, Google, Perplexity) honor it. The non-compliant 13% tends to be smaller, less reputable crawlers.
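IP/reverse-DNS verification can be sketched in a few lines of Python using forward-confirmed reverse DNS (FCrDNS). The domain suffixes below are examples of what a provider might publish; check each vendor's documentation for the real values:

```python
import socket

def verify_bot_ip(ip: str, allowed_suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS: the IP's hostname must sit under a
    suffix the bot operator publishes, and must resolve back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]               # reverse lookup
        if not host.endswith(allowed_suffixes):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward confirmation
        return ip in forward_ips
    except OSError:
        return False

# A loopback address obviously fails the suffix check (example suffixes).
print(verify_bot_ip("127.0.0.1", (".googlebot.com", ".google.com")))  # False
```

This catches the common spoofing case where a scraper merely copies a well-known bot's user-agent string.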
The Emerging Standards Landscape
Several proposals aim to go beyond robots.txt for the AI era:
- TDMRep (Text and Data Mining Reservation Protocol): Links policies to content fingerprints with granular purpose controls (search, ai-use, train-genai)
- RSL (Really Simple Licensing): Open content licensing standard with collective bargaining options, embedded in robots.txt
- ai.robots.txt: Community-maintained list of known AI crawlers with copy-paste blocking configs
- Known Agents (formerly Dark Visitors): Comprehensive database of AI agents and their behavior
As of early 2026, none of these have achieved universal adoption. But the direction is clear: the web is moving toward granular, purpose-based access control for AI systems. Our current robots.txt segmentation is fully compatible with all of these emerging standards.
How This Fits Into the Full Stack
Bot segmentation is not a standalone tactic. It is the defensive layer in a broader AI discoverability strategy:
- Structured data and knowledge graphs make your content understandable to AI systems
- llms.txt gives search bots a structured site overview
- GEO-optimized content follows the patterns that get cited
- Brand authority makes AI systems trust you as a source
- Machine-readable APIs let AI systems query your knowledge directly
- Bot segmentation (this post) protects your IP while keeping the citation pipeline open
Each layer reinforces the others. Blocking training bots protects the content that your knowledge graph, llms.txt, and structured data make discoverable. You want AI systems to cite your content, not absorb it.
You can audit your own setup with our free robots.txt analyzer and AI crawl checker.
The Bottom Line
The AI bot landscape in 2026 rewards precision, not blunt force. Block everything and you disappear from AI search. Allow everything and your content becomes free training data.
The winning strategy is segmentation: block training bots that take without giving, welcome search bots that cite with links. Layer in ai-policy.json for intent signaling and llms.txt for structured context. Then build the content and infrastructure that makes you worth citing.
Your robots.txt is not just a technical file. It is an AI content strategy.
Ready to optimize your AI bot strategy?
- AI Crawl Checker: See which AI bots can access your site
- Robots.txt Analyzer: Audit your bot segmentation
- Contact Us: Get help with your AI search strategy
