Your robots.txt file is a few kilobytes of plain text sitting on your server. For 30 years, it did one thing: tell Googlebot which pages to crawl. Set it and forget it.
That era is over.
In 2026, dozens of AI crawlers are hitting your site daily. GPTBot, ClaudeBot, PerplexityBot, Google-Extended, Meta-ExternalAgent. Each one decides whether AI platforms know about your business, your products, your expertise. And every one of them checks your robots.txt first.
If your robots.txt still looks like it did in 2023, you’re either invisible to AI search or giving away access you haven’t thought about. Either way, it’s costing you.
This article explains how robots.txt affects AI search visibility, which crawlers matter, and how to configure your file for maximum GEO (Generative Engine Optimization) impact. If you need help with the basics first, our companion guide on how to check and test your robots.txt file covers syntax, directives, and common mistakes.
The Shift: Why Robots.txt Matters More Now
Before AI search, robots.txt was simple. One crawler mattered (Googlebot), and blocking it meant one thing: no Google rankings. The relationship between crawling and visibility was straightforward.
Now that relationship is complicated. There are dozens of meaningful crawlers, each serving different purposes, each controlled by different companies, and each with different consequences for blocking.
According to Cloudflare’s 2025 crawler report, crawler traffic rose 18% year-over-year. GPTBot’s crawl volume grew 305%. AI crawlers combined now represent over half of all crawler traffic, surpassing traditional search engine bots.
Yet only 14% of top domains have added AI-specific rules to their robots.txt. That means 86% of websites are either invisible to AI by accident or open to AI by default, without ever making a conscious decision.
Your robots.txt is now your AI visibility policy. It deserves the same strategic thought as your SEO strategy.
Training Bots vs. Retrieval Bots: The Critical Distinction
Not all AI crawlers do the same thing. Understanding the difference between training bots and retrieval bots is the single most important concept for robots.txt strategy in 2026.
Training Bots
Purpose: Crawl content to include in future model training data
Impact timeline: 3-12 months (when new models are released)
If you block them: Future AI models won’t know about your business
Examples: GPTBot, ClaudeBot, Google-Extended, Applebot-Extended
Retrieval Bots
Purpose: Fetch content in real time to answer user queries
Impact timeline: Immediate (affects AI answers right now)
If you block them: You disappear from AI search results today
Examples: ChatGPT-User, OAI-SearchBot, PerplexityBot, Perplexity-User
This distinction is crucial. You might have legitimate reasons to block training bots (intellectual property, paywalled content). But blocking retrieval bots is almost never the right move. That’s the equivalent of blocking Googlebot for traditional search: you voluntarily disappear.
The nuance most guides miss: Some bots do both. Googlebot handles traditional search indexing and feeds data into Google AI Overviews. PerplexityBot both indexes content for future use and retrieves it for real-time answers. Blocking these hybrid bots has compounding consequences.
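One way to keep this straight is to encode each bot's role and derive the blocking consequence from it. A minimal sketch using the classifications in this article (`BOT_ROLES` and `blocking_impact` are hypothetical names, not a real library):

```python
# Classify crawlers by role so blocking decisions account for hybrid bots.
# Bot names and roles are taken from this article; extend as new bots appear.
BOT_ROLES = {
    "GPTBot": {"training"},
    "ClaudeBot": {"training"},
    "Google-Extended": {"training"},
    "Applebot-Extended": {"training"},
    "ChatGPT-User": {"retrieval"},
    "OAI-SearchBot": {"retrieval"},
    "Perplexity-User": {"retrieval"},
    "Googlebot": {"training", "retrieval"},      # hybrid: Search + AI Overviews
    "PerplexityBot": {"training", "retrieval"},  # hybrid: index + live answers
}

def blocking_impact(bot: str) -> str:
    """Describe what you lose by blocking a given crawler."""
    roles = BOT_ROLES.get(bot, set())
    if "retrieval" in roles and "training" in roles:
        return "hybrid: lose current AI answers AND future model knowledge"
    if "retrieval" in roles:
        return "lose AI search visibility immediately"
    if "training" in roles:
        return "future models won't know your business"
    return "unknown bot: check its documentation"
```

The point of the structure: a bot's *role*, not its name, is what determines the cost of blocking it.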
The Complete AI Crawler Reference
Here’s every AI crawler that matters in 2026, organized by company, purpose, and what happens if you block them:
Tier 1: Block These and You Lose Visibility
| User-Agent | Company | Type | What It Powers |
|---|---|---|---|
| Googlebot | Google | Hybrid | Google Search + AI Overviews |
| ChatGPT-User | OpenAI | Retrieval | ChatGPT live browsing |
| OAI-SearchBot | OpenAI | Retrieval | ChatGPT Search features |
| PerplexityBot | Perplexity | Hybrid | Perplexity AI search |
| GPTBot | OpenAI | Training | Future GPT model knowledge |
| ClaudeBot | Anthropic | Training | Future Claude model knowledge |
| Google-Extended | Google | Training | Gemini AI training |
| Applebot-Extended | Apple | Training | Apple Intelligence, Siri |
Tier 2: Consider Based on Your Needs
| User-Agent | Company | Purpose | Recommendation |
|---|---|---|---|
| Meta-ExternalAgent | Meta | Meta AI features | Allow (growing platform) |
| Amazonbot | Amazon | Alexa + AI shopping | Allow (commerce visibility) |
| cohere-ai | Cohere | Enterprise AI training | Optional (B2B relevance) |
| Bytespider | ByteDance | TikTok AI features | Optional (aggressive crawling) |
Tier 3: Usually Block
| User-Agent | Company | Why Block |
|---|---|---|
| CCBot | Common Crawl | Bulk data aggregator, feeds many AI projects |
| DataForSeoBot | DataForSEO | Commercial data scraper |
| DeepSeekBot | DeepSeek | Limited Western visibility benefit |
The Crawl-to-Refer Reality: What AI Bots Actually Give Back
Here’s the data most articles won’t show you. Not all AI crawlers return equal value. SEOmator analyzed Cloudflare Radar data from Q1 2026 and found massive disparities in how much AI bots take versus what they give back:
| Platform | Crawl-to-Refer Ratio | What This Means |
|---|---|---|
| DuckDuckGo | 1.5 : 1 | Near-parity: crawls 1.5 pages per referral |
| Google | 5 : 1 | Strong return: 5 pages crawled per referral |
| Microsoft (Copilot) | 33 : 1 | Moderate: 33 pages per referral |
| Perplexity | 111 : 1 | Growing platform, referrals improving |
| OpenAI (GPTBot) | 1,276 : 1 | Heavy crawling, limited direct referrals |
| Anthropic (ClaudeBot) | 23,951 : 1 | Crawls ~24,000 pages per referral sent back |
What to make of this data:
The ratios look alarming, but context matters. ClaudeBot’s ratio is extreme because Anthropic doesn’t operate a search engine that sends referral traffic. The value of allowing ClaudeBot isn’t in referrals. It’s in Claude knowing about your business when millions of people ask it questions.
The same logic applies to GPTBot. The direct referral ratio is poor, but ChatGPT has over 300 million weekly users. When ChatGPT recommends your business to someone, that recommendation carries weight that analytics can’t easily measure.
The strategic takeaway: Don’t make robots.txt decisions based solely on crawl-to-refer ratios. The value of AI visibility extends beyond trackable referral clicks.
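A crawl-to-refer ratio is simply pages crawled divided by referral visits received. A sketch of computing it from your own traffic data (the `(kind, platform)` event format is a simplifying assumption for illustration):

```python
from collections import Counter

def crawl_to_refer(crawl_hits: int, referrals: int) -> str:
    """Express a crawl-to-refer ratio like '1276 : 1'."""
    if referrals == 0:
        return f"{crawl_hits} : 0 (no referral traffic)"
    return f"{round(crawl_hits / referrals)} : 1"

def tally(events):
    """events: iterable of (kind, platform), kind is 'crawl' or 'referral'."""
    crawls, refs = Counter(), Counter()
    for kind, platform in events:
        (crawls if kind == "crawl" else refs)[platform] += 1
    return {p: crawl_to_refer(crawls[p], refs[p]) for p in crawls}
```

For example, 10 crawls and 2 referrals from the same platform yields `"5 : 1"`. As the article notes, a lopsided ratio is a data point, not a verdict.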
The Decision Framework: Block or Allow?
Here’s the practical decision process for your robots.txt AI strategy:
For the vast majority of businesses, the answer is simple: allow everything. If you’re a local plumber, a web design agency, a restaurant, a SaaS company, an e-commerce store, the downside of AI invisibility is far greater than any theoretical risk of being crawled. You want ChatGPT recommending you. You want Perplexity citing you. You want Google AI Overviews mentioning you.
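The "block or allow" question is easy to audit mechanically. A minimal sketch using Python's standard `urllib.robotparser` to check which AI user-agents a given robots.txt allows (the bot names come from this article; the sample file is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# AI user-agents worth auditing, per this article's tiers
AI_BOTS = ["GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
           "PerplexityBot", "Google-Extended", "Applebot-Extended"]

def audit_robots(robots_txt: str, url: str) -> dict:
    """Return {user-agent: allowed} for each AI bot against one URL."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {bot: rp.can_fetch(bot, url) for bot in AI_BOTS}

# Example: a file that blocks GPTBot but allows everything else
sample = """User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""
report = audit_robots(sample, "https://example.com/pricing")
```

Here `report` would show `GPTBot` blocked and the other bots allowed, so future GPT models lose access while retrieval bots keep working.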
What a GEO-Optimized Robots.txt Looks Like
Here’s a robots.txt template designed for maximum AI search visibility while maintaining common-sense security:
```
# Standard: allow all legitimate bots
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /private/

# OpenAI (ChatGPT + training)
User-agent: GPTBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Google (Search + AI Overviews + Gemini)
User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Allow: /

# Anthropic (Claude)
User-agent: ClaudeBot
Allow: /

User-agent: Claude-Web
Allow: /

# Perplexity
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Apple Intelligence
User-agent: Applebot-Extended
Allow: /

# Meta AI
User-agent: Meta-ExternalAgent
Allow: /

# Block data scrapers with no search value
User-agent: CCBot
Disallow: /

User-agent: DataForSeoBot
Disallow: /

# Sitemap
Sitemap: https://yoursite.com/sitemap.xml
```

Key principles in this template:
- Explicitly allow every major AI crawler (don't rely on the wildcard `*` rule alone, since some bots only check their own user-agent section)
- Block data scrapers that offer no search or AI visibility benefit
- Include your sitemap URL so all crawlers can discover your content structure
- Keep admin and API routes blocked for security
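A template like this can also be generated programmatically instead of hand-edited. A hypothetical sketch (the `build_robots` function and its defaults are mine, not a real library):

```python
def build_robots(allow_bots, block_bots, sitemap,
                 disallow_paths=("/admin/", "/api/", "/private/")):
    """Assemble a robots.txt string from allow/block lists."""
    lines = ["User-agent: *", "Allow: /"]
    lines += [f"Disallow: {p}" for p in disallow_paths]
    for bot in allow_bots:                      # explicit per-bot sections
        lines += ["", f"User-agent: {bot}", "Allow: /"]
    for bot in block_bots:                      # scrapers with no search value
        lines += ["", f"User-agent: {bot}", "Disallow: /"]
    lines += ["", f"Sitemap: {sitemap}"]
    return "\n".join(lines) + "\n"

txt = build_robots(["GPTBot", "ClaudeBot", "PerplexityBot"],
                   ["CCBot", "DataForSeoBot"],
                   "https://example.com/sitemap.xml")
```

Keeping the bot lists in code (or config) makes it easy to add the next AI crawler without rewriting the file by hand.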
Need to generate a properly formatted robots.txt? Our Robots.txt Generator lets you toggle individual AI bots on and off and generates the file for you.
Already have a robots.txt? Use our Robots.txt Tester to check which bots your current file allows or blocks.
Beyond Robots.txt: The Full AI Visibility Stack
Robots.txt is the first layer, the gate. But getting AI crawlers through the gate isn’t enough. You also need to give them content worth citing.
The complete AI visibility stack includes:
1. Robots.txt (access control) Allow the right crawlers in. Block the ones that don’t add value. You’re reading about this now.
2. Schema markup (structured data) Tell AI systems exactly what your page is about, who wrote it, and what your business does. Schema markup helps AI systems cite your content by providing structured signals they can trust. Our Schema Markup Validator checks whether your structured data is complete and valid.
3. llms.txt (AI site guide) A newer standard that gives AI a structured summary of your site. While robots.txt says “you can come in,” llms.txt says “here’s what’s important.” Generate one with our llms.txt Generator.
4. Content structure (headings, FAQs, clear answers) AI systems prefer content with clear headings, direct answers to questions, FAQ sections, and logical structure. Pages with FAQPage schema are 3.2x more likely to appear in Google AI Overviews.
5. Technical health (speed, crawlability, sitemap) AI crawlers need to be able to access and parse your pages quickly. A clean sitemap, fast load times, and proper canonical tags all contribute to crawlability.
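Item 4's FAQPage markup can be made concrete. A small sketch that builds FAQPage JSON-LD from question/answer pairs (the `faq_jsonld` helper is hypothetical; the schema.org `@type` values are real):

```python
import json

def faq_jsonld(qa_pairs):
    """Build FAQPage JSON-LD (schema.org) from (question, answer) pairs."""
    return json.dumps({
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": q,
                "acceptedAnswer": {"@type": "Answer", "text": a},
            }
            for q, a in qa_pairs
        ],
    }, indent=2)

snippet = faq_jsonld([
    ("Does robots.txt affect AI search visibility?",
     "Yes: AI crawlers check robots.txt before crawling your site."),
])
# Embed the result in the page inside <script type="application/ld+json">
```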
Want to check where you stand across all of these? The AI Search Visibility Checker evaluates your robots.txt, schema markup, headings, and other GEO signals in one audit. For businesses that want hands-on help building their AI visibility stack, our GEO services cover the full pipeline from robots.txt configuration to schema implementation.
Common Robots.txt Mistakes That Kill AI Visibility
The most dangerous mistake we see: businesses copy-pasting "block all AI bots" code snippets from security blogs without realizing they're making themselves invisible to the fastest-growing discovery channel in history. For news publishers, 79% of whom block AI crawlers, this can make sense: they're protecting paywalled content. For a business selling products or services, it's self-sabotage.
How to Check Your Current Robots.txt
Before making changes, audit what you have:
- View your current file: Go to `yourdomain.com/robots.txt` in your browser
- Test it: Run your URL through our Robots.txt Tester to see which bots are allowed or blocked
- Check for AI crawlers: Look for any rules mentioning GPTBot, ClaudeBot, PerplexityBot, Google-Extended, or ChatGPT-User
- Identify gaps: If you have no AI-specific rules, your `User-agent: *` rule applies to everything
- Check your overall AI readiness: Run the AI Search Visibility Checker for a full GEO audit that includes robots.txt analysis
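The audit steps above can also be scripted. A sketch, assuming your site serves robots.txt over HTTPS (`fetch_robots` and `find_ai_rules` are hypothetical helper names):

```python
import urllib.request

# AI user-agents to look for, per this article
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended",
           "ChatGPT-User", "OAI-SearchBot", "Applebot-Extended"]

def fetch_robots(domain: str) -> str:
    """Download a site's robots.txt as text."""
    with urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10) as r:
        return r.read().decode("utf-8", errors="replace")

def find_ai_rules(robots_txt: str) -> dict:
    """Which AI user-agents does the file mention explicitly?"""
    lower = robots_txt.lower()
    return {bot: bot.lower() in lower for bot in AI_BOTS}
```

If `find_ai_rules` returns all `False`, you have no AI-specific rules and your wildcard rule is deciding your AI visibility for you.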
If your robots.txt needs updating, the Robots.txt Generator creates a properly formatted file with individual toggles for each AI crawler. For a deeper dive into syntax rules, valid vs invalid examples, and the full list of common mistakes, see our complete guide on how to check and test your robots.txt file.
Frequently Asked Questions
Does robots.txt affect AI search visibility?
Yes. Every major AI crawler checks robots.txt before crawling. Blocking retrieval bots removes you from AI answers immediately; blocking training bots keeps future models from learning about your business.

Should I block or allow AI crawlers in robots.txt?
For most businesses, allow them. The downside of AI invisibility outweighs the theoretical risk of being crawled. Block only data scrapers that offer no search or AI visibility benefit, such as CCBot.

What is the difference between AI training bots and retrieval bots?
Training bots (GPTBot, ClaudeBot, Google-Extended) collect content for future model training, so blocking them plays out over 3-12 months. Retrieval bots (ChatGPT-User, OAI-SearchBot, Perplexity-User) fetch content in real time to answer queries, so blocking them removes you from AI search results today.

Which AI crawlers should I allow in robots.txt?
At minimum: Googlebot, GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, PerplexityBot, Google-Extended, and Applebot-Extended. These power the major AI search and assistant platforms.

Does blocking GPTBot affect Google rankings?
No. GPTBot is OpenAI's crawler; Google Search uses Googlebot. Blocking GPTBot only affects whether future GPT models know about your content.

What is the crawl-to-refer ratio for AI bots?
The number of pages a platform crawls per referral visit it sends back. It ranges from about 1.5:1 for DuckDuckGo to roughly 24,000:1 for Anthropic's ClaudeBot, though referral clicks aren't the only value AI visibility provides.

How do I check if my robots.txt is blocking AI crawlers?
Visit yourdomain.com/robots.txt and look for rules naming AI user-agents like GPTBot or ClaudeBot, or run your URL through our Robots.txt Tester.

What is llms.txt and how does it relate to robots.txt?
llms.txt is a newer standard that gives AI systems a structured summary of your site. Robots.txt controls which crawlers can come in; llms.txt tells them what's important once they're there.
Check Your AI Search Readiness
See whether AI crawlers can access your content, check your schema markup, and audit your site’s GEO signals. Free, instant results.


