Your robots.txt file is a few lines of plain text that control how every search engine and AI bot interacts with your website. Get it right, and crawlers efficiently index your best content. Get it wrong, and you could accidentally block Google from your entire site or make yourself invisible to AI search.
This guide covers everything you need to know about robots.txt: how to find yours, how to read it, what valid syntax looks like, how to test it, and the most common mistakes that cause real damage.
If you’re specifically interested in how robots.txt affects AI search and GEO, read our companion article on why robots.txt matters for AI search and GEO.
How to Find Your Robots.txt File
Every robots.txt file lives at the same location: the root of your domain.
To check yours: Add /robots.txt to the end of your domain name in your browser:
https://yourdomain.com/robots.txt

What you’ll see:
- A text file with rules: Your site has a robots.txt. Read on to learn how to interpret it.
- A 404 error: Your site doesn’t have a robots.txt. All crawlers can access everything. This is the default, and it’s fine for many sites, but you’re missing an opportunity to manage crawl behavior.
- An HTML page instead of plain text: Something is wrong. Your server is returning a webpage instead of a text file, which can confuse crawlers.
Important: The file must be at the exact root of your domain. A file at yourdomain.com/folder/robots.txt is ignored, and a file at blog.yourdomain.com/robots.txt only covers that subdomain — each subdomain needs its own separate robots.txt.
Robots.txt Syntax: The Complete Reference
Robots.txt uses a simple syntax with only a handful of directives. Here’s every directive you need to know:
The 5 Core Directives
| Directive | What It Does | Example |
|---|---|---|
| User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot |
| Disallow | Blocks crawlers from accessing a path | Disallow: /admin/ |
| Allow | Overrides a Disallow for a specific path | Allow: /admin/login |
| Sitemap | Tells crawlers where to find your sitemap | Sitemap: https://example.com/sitemap.xml |
| Crawl-delay | Adds delay between requests (not supported by Google) | Crawl-delay: 10 |
Syntax Rules You Must Follow

- One directive per line — never combine a User-agent line and a rule on the same line
- Every group starts with a User-agent line; the rules below it apply until the next User-agent line
- Directive names are case-insensitive, but paths are case-sensitive (/Admin/ and /admin/ are different rules)
- Paths must start with /
- Comments start with #
- The file must be plain text (UTF-8), served at the root, and named exactly robots.txt
Wildcards and Pattern Matching
Google and Bing support two wildcard characters in robots.txt paths, even though they’re not part of the original specification:
| Character | Meaning | Example | Matches |
|---|---|---|---|
| * | Matches any sequence of characters | Disallow: /*.pdf | All PDF files anywhere on the site |
| $ | Matches the end of the URL | Disallow: /*.php$ | URLs ending in .php (not /page.php?id=1) |
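To build intuition for how these wildcards behave, the matching rule can be sketched by translating a robots.txt pattern into a regular expression. This is an illustration of the documented behavior, not Google’s actual implementation, and it ignores edge cases like a $ in the middle of a pattern:

```python
import re

def pattern_matches(pattern: str, path: str) -> bool:
    """Sketch of robots.txt wildcard matching:
    '*' matches any character sequence; a trailing '$'
    anchors the pattern to the end of the URL."""
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"        # any sequence of characters
        elif ch == "$":
            regex += "$"         # end-of-URL anchor
        else:
            regex += re.escape(ch)
    # robots.txt rules are prefix matches, so anchor at the start only
    return re.match(regex, path) is not None

# The examples from the table above:
print(pattern_matches("/*.pdf", "/docs/report.pdf"))   # True
print(pattern_matches("/*.php$", "/page.php"))         # True
print(pattern_matches("/*.php$", "/page.php?id=1"))    # False
```

Note how the $ anchor is what excludes /page.php?id=1: without it, the prefix match succeeds as soon as .php appears anywhere in the path.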
Valid vs. Invalid Robots.txt: Side-by-Side Examples
Understanding what makes a robots.txt valid or broken is easier with examples:
Example 1: Allow all crawlers, block admin

Valid:

```
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Sitemap: https://example.com/sitemap.xml
```

Invalid:

```
User-agent: * Disallow: /admin/
Allow /
sitemap: sitemap.xml
```

Issues: directive on the same line as User-agent, missing colon after Allow, and a relative sitemap URL (needs the full https:// URL).
Example 2: Block specific bot from specific directory

Valid:

```
User-agent: GPTBot
Disallow: /private/
Disallow: /customer-data/

User-agent: *
Allow: /
```

Invalid:

```
User-agent: GPTBot
Disallow: private/
Disallow: customer-data

User-agent: *
Allow: /
```

Issues: paths are missing the leading / (must be /private/, not private/). The second path also lacks a trailing /, and robots.txt rules are prefix matches, so /customer-data would match any path starting with that string (including /customer-database/) rather than cleanly blocking the /customer-data/ directory.
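You can sanity-check a rule set like the valid version above with Python’s standard-library parser. A quick sketch (example.com is a placeholder; note that urllib.robotparser does not implement Google’s wildcard extensions):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /private/
Disallow: /customer-data/

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot is blocked from /private/; other bots fall through to the * group
print(rp.can_fetch("GPTBot", "https://example.com/private/report"))    # False
print(rp.can_fetch("Googlebot", "https://example.com/private/report")) # True
print(rp.can_fetch("GPTBot", "https://example.com/blog/post"))         # True
```

Feeding the file contents to parse() lets you test rules before deploying, without any network request.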
Example 3: Block everything (the nuclear option)

```
User-agent: *
Disallow: /
```

This blocks ALL crawlers from ALL pages. Your site will disappear from Google, AI search, everything. Only use this on staging/dev sites.

A safer alternative:

```
User-agent: *
Allow: /
Disallow: /staging/
Disallow: /test/
```

Allow everything except specific directories you want hidden.
How to Test Your Robots.txt: 3 Methods
Method 1: Robots.txt tester tool (quickest)
Use a robots.txt tester tool to analyze your file instantly:
- Enter your website URL
- The tool fetches your robots.txt and parses every directive
- See which user agents have rules, what’s allowed, what’s blocked
- Check specific URLs against specific bots to verify behavior
This is the fastest way to catch syntax errors, conflicting rules, and accidental blocks.
Method 2: Google Search Console (most authoritative)
Google Search Console has a built-in robots.txt report:
- Go to Google Search Console
- Select your property
- Navigate to Settings in the left sidebar
- Look for the robots.txt section
- Google shows the last crawled version, any warnings, and errors
What to look for:
- Warnings: Google found rules that might not work as intended
- Errors: Syntax issues that prevent proper parsing
- Last crawled: When Google last fetched your robots.txt (should be recent)
You can also test specific URLs: enter a URL and a user agent to see whether the page is allowed or blocked.
Method 3: Manual review (for simple files)
For small robots.txt files, you can review manually:
- Go to yourdomain.com/robots.txt
- Read through each group (each User-agent: section)
- Verify that important pages aren’t accidentally blocked
- Check that a Sitemap line is present
- Look for the AI crawler rules we covered in why robots.txt matters for AI search
The 10 Most Common Robots.txt Mistakes
These are the mistakes we see most often when testing client websites. Some of them can cause serious SEO damage.
1. Blocking your entire site by accident
The mistake: Leaving Disallow: / under User-agent: * from a development or staging environment.
How it happens: You block crawlers during development, then push the staging robots.txt to production. Or a CMS plugin adds a blanket disallow that you don’t notice.
The fix: Always verify your robots.txt after deploying to production. Set up monitoring to alert you if the file changes.
2. No robots.txt file at all
The issue: Without a robots.txt file, the request returns a 404 and crawlers assume full access. This isn’t an error, but it means you can’t manage crawl behavior, can’t direct crawlers to your sitemap, and can’t make informed decisions about AI crawler access.
The fix: Create a basic robots.txt file. Even a minimal one helps. Use our Robots.txt Generator if you’re not sure where to start.
3. Missing leading slash on paths
The mistake: Writing Disallow: admin/ instead of Disallow: /admin/.
Why it matters: Paths in robots.txt must start with /. Without it, the rule may not be recognized or may behave unpredictably.
The fix: Always start paths with /.
4. Confusing Disallow with noindex
The mistake: Using Disallow when you actually want to prevent a page from appearing in search results.
Why it matters: Disallow tells crawlers not to crawl a page, but Google can still index and display the URL in search results if other sites link to it. You’ll see the page in Google with a “No information is available for this page” message.
The fix: For pages you want removed from search results, use a noindex meta tag or X-Robots-Tag header. Disallow is for crawl budget management, not for hiding pages from search.
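For reference, the two forms look like this (illustrative snippets — the meta tag goes in the page’s HTML, the header in your server’s response):

```
<!-- HTML pages: add inside <head> -->
<meta name="robots" content="noindex">

HTTP response header (for PDFs, images, and other non-HTML files):
X-Robots-Tag: noindex
```

Remember that a crawler can only see either one if it’s allowed to fetch the URL — which is exactly why combining noindex with a robots.txt Disallow on the same page defeats the purpose.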
5. Conflicting Allow and Disallow rules
The mistake: Having rules that contradict each other:
```
User-agent: *
Disallow: /blog/
Allow: /blog/
```

How Google resolves it: When Allow and Disallow rules conflict, Google uses the most specific rule (the one with the longest path). If they’re the same length, Google defaults to Allow. But other crawlers may handle this differently.
The fix: Be explicit. Don’t rely on conflict resolution rules. Make your intent clear with non-overlapping paths.
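Google’s documented resolution rule (longest matching path wins; Allow wins ties) can be sketched in a few lines. This is an illustration of the rule, not Google’s code, and it omits wildcards for simplicity:

```python
def resolve(rules, path):
    """Pick the winning directive for a path, Google-style.

    rules: list of (directive, pattern) tuples, e.g. ("disallow", "/blog/").
    Longest matching pattern wins; on a tie, "allow" beats "disallow".
    """
    matches = [(len(pattern), directive == "allow", directive)
               for directive, pattern in rules
               if path.startswith(pattern)]
    if not matches:
        return "allow"  # no rule matches: crawling is allowed by default
    # max() compares pattern length first, then lets allow beat disallow on ties
    return max(matches)[2]

# Same-length conflict from the example above: Allow wins the tie
print(resolve([("disallow", "/blog/"), ("allow", "/blog/")], "/blog/post"))
# A longer (more specific) Allow overrides a shorter Disallow
print(resolve([("disallow", "/blog/"), ("allow", "/blog/archive/")], "/blog/archive/x"))
```

Running this prints "allow" in both cases — which is exactly why relying on conflict resolution is fragile: other crawlers may apply first-match or other schemes instead.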
6. Blocking CSS, JavaScript, or images
The mistake: Blocking crawlers from accessing CSS, JS, or image files that are needed to render your pages.
Why it matters: Google needs to render your pages to understand their content. If you block CSS or JavaScript, Google can’t see your page the way users do. This can harm your rankings and prevent proper indexing.
The fix: Don’t block /css/, /js/, /images/, or /assets/ directories unless you have a specific reason. Check Google’s URL Inspection tool to see if rendering issues exist.
7. Wrong file location
The mistake: Placing robots.txt in a subdirectory like /pages/robots.txt instead of the root.
Why it matters: Crawlers only check the root domain for robots.txt. A file anywhere else is ignored.
The fix: robots.txt must be at https://yourdomain.com/robots.txt. No exceptions.
8. Protocol mismatch (http vs https)
The mistake: Having robots.txt only on http:// when your site runs on https:// (or vice versa).
Why it matters: Robots.txt rules only apply to the protocol where the file is hosted. If your site is HTTPS but your robots.txt is only accessible via HTTP, the rules don’t apply to your HTTPS pages — crawlers treat them as unrestricted.
The fix: Ensure robots.txt is accessible at the same protocol your site uses. If you redirect HTTP to HTTPS, make sure the robots.txt is available at the HTTPS URL.
9. Relative sitemap URL
The mistake: Using Sitemap: /sitemap.xml instead of the full URL.
Why it matters: The Sitemap directive requires an absolute URL including the protocol and domain.
The fix: Always use the full URL: Sitemap: https://yourdomain.com/sitemap.xml
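A quick way to catch a relative sitemap URL before it ships — a small sketch using Python’s standard library:

```python
from urllib.parse import urlparse

def is_absolute_sitemap_url(url: str) -> bool:
    """True only for full URLs with an http(s) scheme and a host."""
    parsed = urlparse(url)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

print(is_absolute_sitemap_url("https://yourdomain.com/sitemap.xml"))  # True
print(is_absolute_sitemap_url("/sitemap.xml"))                        # False
```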
10. Not updating for AI crawlers
The mistake: Having a robots.txt from 2020 that doesn’t mention GPTBot, ClaudeBot, PerplexityBot, or any other AI crawler.
Why it matters: Without explicit rules for AI bots, they fall under your User-agent: * rule, which may not reflect your actual intent. You’re either letting all AI bots in by default or blocking them all by accident.
The fix: Add explicit rules for AI crawlers. Our article on why robots.txt matters for AI search and GEO includes a complete template with every major AI bot.
Robots.txt vs. Meta Robots vs. X-Robots-Tag
These three mechanisms all control crawler behavior, but they do different things:
| Method | Scope | Controls | Use When |
|---|---|---|---|
| robots.txt | Entire site or directories | Crawling (what bots can access) | Managing crawl budget, blocking directories, controlling AI bot access |
| Meta robots tag | Individual pages | Indexing (noindex, nofollow) | Preventing specific pages from appearing in search results |
| X-Robots-Tag | Individual URLs (HTTP header) | Indexing (noindex, nofollow) | Non-HTML files (PDFs, images) or when you can’t edit page HTML |
Key insight: You often need more than one of these. Use robots.txt for crawl-level control and meta robots or X-Robots-Tag for index-level control. They’re complementary, not alternatives. And for AI search specifically, consider adding an llms.txt file alongside your robots.txt. While robots.txt controls access, llms.txt provides AI systems with a structured summary of your site’s most important content. Together, they form the foundation of a solid GEO strategy.
Robots.txt Checklist: Before Going Live
Use this checklist every time you create or update a robots.txt file:

- File is at the root: https://yourdomain.com/robots.txt
- File is served as plain text, not an HTML page
- Every path starts with /
- No accidental Disallow: / under User-agent: *
- CSS, JavaScript, and image directories aren’t blocked
- Sitemap line uses the full absolute URL
- Explicit rules for the AI crawlers you care about (GPTBot, ClaudeBot, PerplexityBot)
- Tested with a robots.txt tester and Google Search Console
Next Steps
Your robots.txt is one piece of your site’s technical SEO foundation. Here’s how everything connects:
- Test your current file with our Robots.txt Tester to find issues instantly
- Generate a new file with our Robots.txt Generator that includes AI crawler toggles
- Understand the AI angle in our guide on why robots.txt matters for AI search and GEO
- Check your full SEO foundation by validating your sitemap, meta tags, and schema markup
- Audit AI search readiness with our AI Search Visibility Checker that includes robots.txt analysis
- Get professional help through our SEO services or GEO services
Frequently Asked Questions
How do I find my website's robots.txt file?
What happens if my website has no robots.txt file?
What is the difference between Disallow and noindex?
Is robots.txt case-sensitive?
How do I test if my robots.txt is working correctly?
What does 'User-agent: *' mean in robots.txt?
Can I use robots.txt to remove a page from Google?
How often should I review my robots.txt file?
Test Your Robots.txt Now
Find syntax errors, conflicting rules, and accidental blocks. Free, instant results.


