
We Blocked 12 AI Crawlers. Citations Went Up.
Most sites treat AI bots as a single category. Block them all or let them all through. That is the wrong approach.
AI bots come in three distinct types, each with different purposes, different economics, and different value to your business. When you understand the difference, the strategy becomes obvious: block the ones that take without giving, and welcome the ones that cite you.
We implemented this exact segmentation on Pixelmojo in early 2026. We blocked 12 training crawlers while keeping search and user bots fully open. The result was immediate: our server load dropped, our content stayed protected, and our AI search citations kept growing.
This post walks through the exact strategy, the specific bots in each category, and why this matters more in 2026 than ever before.
The Three Types of AI Bots
Not all AI crawlers are equal. The industry has split into three distinct bot categories, each with its own user agent, purpose, and economics.
- Training bots: Collect content for model weights. No attribution, no traffic back.
- Search bots: Power AI search results. Cite with links. Drive referral traffic.
- User bots: Real-time fetches when a user asks a question. Always cite the source.
Training Bots: Take Everything, Give Nothing
Training bots crawl your entire site to collect content that gets compressed into model weights. Once your content is absorbed into a language model, there is no attribution, no link back, and no way to update or correct what the model learned.
These bots run continuous, automated crawls independent of any user action. They operate like a vacuum cleaner across the web, and the content they collect becomes anonymous training data.
The major training crawlers as of 2026:
- Google-Extended (Gemini and Vertex AI training)
- CCBot (Common Crawl, used by multiple AI companies)
- anthropic-ai (Anthropic model training)
- cohere-ai (Cohere model training)
- Meta-ExternalAgent (Llama model training)
- Applebot-Extended (Apple Intelligence training)
- Bytespider (ByteDance/TikTok)
- Diffbot (web data extraction for AI)
- FacebookBot (Meta content gathering)
- Omgili / Omgilibot (discussion forum scraping)
- img2dataset (image dataset collection)
Search Bots: Cite You With Links
Search bots power the AI search engines that millions of people use daily. When someone asks ChatGPT, Perplexity, or Claude a question, these bots fetch relevant pages in real time and cite them in the answer.
The critical difference: search bots send traffic back to you. Every citation includes a source URL that users can click.
- OAI-SearchBot (powers ChatGPT search results)
- Claude-SearchBot (powers Claude search results)
- PerplexityBot (powers Perplexity AI answers)
- GoogleOther (Google AI Overviews)
User Bots: Real-Time Fetches Per Question
User bots are triggered by individual user actions. When someone pastes a URL into ChatGPT or asks Claude to analyze a specific page, these bots fetch that page on demand.
- ChatGPT-User (user-initiated page fetches in ChatGPT)
- Claude-User (user-initiated page fetches in Claude)
These are the highest-value bots because they represent a human actively seeking your content.
The Economics: Why Training Bots Are a Bad Deal
The numbers tell the whole story. Cloudflare published crawl-to-click ratio data in 2025, showing how many times each provider crawls your site compared to how many referral visits they send back.
Crawl-to-click ratio by provider, measured in crawls per referral visit (Cloudflare, July 2025): training bots consume the most resources while returning the least traffic.
Googlebot, which handles both Google Search and Google's AI features, has a reasonable 14:1 ratio. For every 14 crawls, you get one visit.
But the AI training bots are in a completely different league. OpenAI's training crawlers make approximately 1,700 crawls per referral visit. Anthropic's ratio was 38,000:1 as of July 2025 (down from 286,000:1 in January 2025).
These bots consume significant server resources while returning almost no traffic. They are, from a pure business perspective, a terrible deal.
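To see what those ratios mean in practice, here is a back-of-the-envelope calculation. The ratios are the Cloudflare figures cited above; the daily crawl volume is a made-up illustrative number, not our traffic data:

```python
# Expected referral visits implied by crawl-to-click ratios.
# Ratios: Cloudflare, July 2025 (cited above). daily_crawls is hypothetical.
ratios = {
    "Googlebot": 14,
    "OpenAI training": 1_700,
    "Anthropic training": 38_000,
}
daily_crawls = 10_000  # illustrative crawl volume hitting your site per day

for bot, ratio in ratios.items():
    visits = daily_crawls / ratio
    print(f"{bot:>18}: ~{visits:,.1f} visits per {daily_crawls:,} crawls")
```

At 10,000 crawls, Googlebot's ratio implies roughly 714 visits; Anthropic's training crawlers imply less than one.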
Why Blocking Training Bots Does NOT Hurt Search Citations
This is the most common misconception. Many site owners believe that blocking GPTBot means ChatGPT cannot cite them. That is incorrect.
Each Bot Type Is Independent
OpenAI, Anthropic, and Google have explicitly separated their training and search bots into independent systems:
| Provider | Training Bot | Search Bot | User Bot |
|---|---|---|---|
| OpenAI | GPTBot | OAI-SearchBot | ChatGPT-User |
| Anthropic | ClaudeBot | Claude-SearchBot | Claude-User |
| Google | Google-Extended | GoogleOther | N/A |
| Perplexity | N/A | PerplexityBot | N/A |
Blocking GPTBot has zero effect on OAI-SearchBot or ChatGPT-User. They use separate user agents, separate crawling infrastructure, and separate robots.txt directives. Anthropic has explicitly confirmed that blocking ClaudeBot does not affect Claude-SearchBot or Claude-User.
Similarly, Google has confirmed that blocking Google-Extended has zero impact on Googlebot or search rankings. It only controls whether your content trains Gemini and Vertex AI.
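You can verify this per-agent independence with Python's standard-library robots.txt parser. The rules below are a minimal hypothetical file, not our full configuration:

```python
from urllib.robotparser import RobotFileParser

# Minimal hypothetical robots.txt: block only the training bot.
rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# Each user agent matches only its own group. Blocking GPTBot leaves
# OAI-SearchBot covered by the permissive default (*) group.
print(rp.can_fetch("GPTBot", "https://example.com/blogs/post"))         # False
print(rp.can_fetch("OAI-SearchBot", "https://example.com/blogs/post"))  # True
```

The same group-matching logic is why a `Disallow` aimed at one user agent never leaks onto another.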
The Prior Training Argument
Even if you block training bots today, your content may already exist in model weights from prior crawls. This actually works in your favor: the model already "knows" about your content, and when users ask related questions, the search bots can fetch your current pages for real-time citation.
You get the benefit of historical training (the model recognizes your brand and expertise) while protecting future content from further training.
The Exact Robots.txt Strategy
Here is the segmentation we implement. Three categories, clearly separated.
How Bot Segmentation Works
1. An AI bot hits your site and checks robots.txt first.
2. robots.txt identifies the bot type: training, search, or user.
3. Training bots get blocked: Disallow: / (except llms.txt).
4. Search bots get full access: crawl, cite, link back to you.
Section 1: Browsing Bots (Selective Access)
```
# AI browsing / search engines - ALLOW full content
# These bots cite with links and drive referral traffic
User-agent: GPTBot
Allow: /blogs/
Allow: /services/
Allow: /projects/
Allow: /about/
Allow: /pricing/
Allow: /tools/
Allow: /vector
Allow: /hive
Allow: /capabilities
Allow: /contact-us
Allow: /llms.txt
Disallow: /api/
Disallow: /admin/
```
Note: We allow GPTBot selective access (not full site) because OpenAI states that if both GPTBot and OAI-SearchBot can access a page, they may combine crawls to avoid duplication. This gives us the benefit of search indexing while limiting training data exposure.
We apply the same Allow/Disallow pattern for ChatGPT-User, ClaudeBot, Claude-Web, GoogleOther, and PerplexityBot.
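A quick sanity check of the selective pattern, again with the standard-library parser (the rules are abbreviated to a few paths; this is a sketch, not our production file):

```python
from urllib.robotparser import RobotFileParser

# Abbreviated version of the GPTBot group above: allow content paths,
# block API and admin paths.
rules = """\
User-agent: GPTBot
Allow: /blogs/
Allow: /llms.txt
Disallow: /api/
Disallow: /admin/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/blogs/post"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/api/data"))    # False
```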
Section 2: Training Bots (Blocked)
```
# AI training crawlers - BLOCK entire site
# These collect for model weights with no attribution
User-agent: Google-Extended
Allow: /llms.txt
Disallow: /

User-agent: CCBot
Allow: /llms.txt
Disallow: /

User-agent: anthropic-ai
Allow: /llms.txt
Disallow: /

User-agent: cohere-ai
Allow: /llms.txt
Disallow: /

User-agent: Applebot-Extended
Allow: /llms.txt
Disallow: /

User-agent: Meta-ExternalAgent
Allow: /llms.txt
Disallow: /

User-agent: Bytespider
Allow: /llms.txt
Disallow: /

User-agent: Diffbot
Allow: /llms.txt
Disallow: /

User-agent: FacebookBot
Allow: /llms.txt
Disallow: /
```
Every training bot gets a blanket Disallow: / with one exception: we keep /llms.txt accessible so bots can at least read our AI policy declaration.
Section 3: Standard Crawlers (Normal Rules)
```
# Default rules for all other bots (Google, Bing, etc.)
User-agent: *
Allow: /
Disallow: /api/
Disallow: /admin/
Disallow: /_next/data/
Crawl-delay: 1
```
Regular search engines get standard access with protected admin and API paths. (Note that Crawl-delay is a non-standard directive: Bing honors it, but Google ignores it.)
Beyond Robots.txt: The Policy Layer
Robots.txt is an access control mechanism, but it does not communicate intent. We layer two additional signals on top.
ai-policy.json
A machine-readable policy file at /.well-known/ai-policy.json that explicitly declares our stance:
```json
{
  "browsing": true,
  "indexing": true,
  "training": false,
  "fine_tuning": false,
  "embedding": true,
  "attribution_required": true,
  "rate_limit_rps": 1
}
```
This tells AI systems exactly what we allow: browse and index, yes. Train models on our content, no. While this is not yet a universally enforced standard, it signals clear intent to systems that respect policy files.
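For illustration, here is how a cooperative crawler might read such a policy before deciding whether a page may be used for training. The schema mirrors our file above, and `may_train` is a hypothetical helper, since no parser for this format is standardized yet:

```python
import json

# The policy document from above (the schema is our own, not a ratified standard).
policy = json.loads("""
{
  "browsing": true,
  "indexing": true,
  "training": false,
  "fine_tuning": false,
  "embedding": true,
  "attribution_required": true,
  "rate_limit_rps": 1
}
""")

def may_train(p: dict) -> bool:
    # An absent key means no declared permission, so default to False.
    return bool(p.get("training", False))

print(may_train(policy))         # False
print(policy["rate_limit_rps"])  # 1
```

Defaulting to "no training" when the key is absent keeps the fail-safe on the publisher's side.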
We covered the full implementation of ai-plugin.json and the AI discoverability stack in Part 6 of this series.
llms.txt: Structured Context for Search Bots
Our dynamic llms.txt gives search bots a structured overview of the site, including entity relationships, expertise areas, and key articles. Instead of crawling 46 blog posts to understand what Pixelmojo does, an AI system can read one file and get the full picture.
This is the carrot to robots.txt's stick. We block training bots from scraping, but we give search bots exactly the context they need to cite us accurately. Read more in our llms.txt implementation guide.
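For reference, a minimal llms.txt follows the structure of the llms.txt proposal: an H1 title, a blockquote summary, and H2 sections of annotated links. The entries below are illustrative placeholders, not our actual file:

```
# Example Site

> One-sentence summary of what the site does and who it serves.
> Key topics: AI search optimization, bot segmentation, llms.txt.

## Key Articles

- [AI Bot Segmentation](https://example.com/blogs/bot-segmentation): How to
  block training bots while staying visible in AI search.
```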
Training Bots vs. Search Bots: What You Get
What happens to your content
Absorbed into model weights, no trace
Cited with link back to your site
Traffic you receive
Zero referrals
Direct referral traffic from AI answers
Attribution
None, content is anonymous training data
Source URL in every AI search result
Control over your content
None after crawl, baked into weights
Update content, citations update too
Server resource cost
High (aggressive bulk crawling)
Low (on-demand fetches per query)
What the Industry Gets Wrong
According to Search Engine Journal, 79% of top news sites block at least one AI training bot. Good. But 71% also block retrieval bots. That is a mistake.
Blocking retrieval bots removes you from AI search results. Anthropic states that blocking Claude-SearchBot "may reduce visibility in Claude's search results." OpenAI's documentation makes the same distinction for OAI-SearchBot.
The sites that block everything are choosing invisibility over protection. The correct strategy is selective: block training, allow search.
The Compliance Reality
Robots.txt is voluntary. According to data from multiple sources, 13.26% of AI bot requests ignored robots.txt directives in Q2 2025, up from 3.3% in Q4 2024. For stricter enforcement, you can pair robots.txt with:
- IP/reverse-DNS verification of bot identity
- WAF-level bot controls (Cloudflare AI Crawl Control, Vercel Firewall)
- Rate limiting at the edge
For most sites, robots.txt is sufficient because the major AI companies (OpenAI, Anthropic, Google, Perplexity) honor it. The non-compliant 13% tends to be smaller, less reputable crawlers.
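IP/reverse-DNS verification can be sketched in a few lines of Python using forward-confirmed reverse DNS (FCrDNS). The domain suffixes below are examples of what a provider might publish; check each vendor's documentation for the real values:

```python
import socket

def verify_bot_ip(ip: str, allowed_suffixes: tuple) -> bool:
    """Forward-confirmed reverse DNS: the IP's hostname must sit under a
    suffix the bot operator publishes, and must resolve back to the same IP."""
    try:
        host = socket.gethostbyaddr(ip)[0]               # reverse lookup
        if not host.endswith(allowed_suffixes):
            return False
        forward_ips = socket.gethostbyname_ex(host)[2]   # forward confirmation
        return ip in forward_ips
    except OSError:
        return False

# A loopback address obviously fails the suffix check (example suffixes).
print(verify_bot_ip("127.0.0.1", (".googlebot.com", ".google.com")))  # False
```

This catches the common spoofing case where a scraper merely copies a well-known bot's user-agent string.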
The Emerging Standards Landscape
Several proposals aim to go beyond robots.txt for the AI era:
- TDMRep (Text and Data Mining Reservation Protocol): Links policies to content fingerprints with granular purpose controls (search, ai-use, train-genai)
- RSL (Really Simple Licensing): Open content licensing standard with collective bargaining options, embedded in robots.txt
- ai.robots.txt: Community-maintained list of known AI crawlers with copy-paste blocking configs
- Known Agents (formerly Dark Visitors): Comprehensive database of AI agents and their behavior
As of early 2026, none of these have achieved universal adoption. But the direction is clear: the web is moving toward granular, purpose-based access control for AI systems. Our current robots.txt segmentation is fully compatible with all of these emerging standards.
How This Fits Into the Full Stack
Bot segmentation is not a standalone tactic. It is the defensive layer in a broader AI discoverability strategy:
- Structured data and knowledge graphs make your content understandable to AI systems
- llms.txt gives search bots a structured site overview
- GEO-optimized content follows the patterns that get cited
- Brand authority makes AI systems trust you as a source
- Machine-readable APIs let AI systems query your knowledge directly
- Bot segmentation (this post) protects your IP while keeping the citation pipeline open
Each layer reinforces the others. Blocking training bots protects the content that your knowledge graph, llms.txt, and structured data make discoverable. You want AI systems to cite your content, not absorb it.
You can audit your own setup with our free robots.txt analyzer and AI crawl checker.
The Bottom Line
The AI bot landscape in 2026 rewards precision, not blunt force. Block everything and you disappear from AI search. Allow everything and your content becomes free training data.
The winning strategy is segmentation: block training bots that take without giving, welcome search bots that cite with links. Layer in ai-policy.json for intent signaling and llms.txt for structured context. Then build the content and infrastructure that makes you worth citing.
Your robots.txt is not just a technical file. It is an AI content strategy.
Ready to optimize your AI bot strategy?
- AI Crawl Checker: See which AI bots can access your site
- Robots.txt Analyzer: Audit your bot segmentation
- Contact Us: Get help with your AI search strategy
