How to Check and Test Your Robots.txt File: The Complete Guide

Author: Lucky Oleg

Your robots.txt file is a few lines of plain text that control how every search engine and AI bot interacts with your website. Get it right, and crawlers efficiently index your best content. Get it wrong, and you could accidentally block Google from your entire site or make yourself invisible to AI search.

This guide covers everything you need to know about robots.txt: how to find yours, how to read it, what valid syntax looks like, how to test it, and the most common mistakes that cause real damage.

If you’re specifically interested in how robots.txt affects AI search and GEO, read our companion article on why robots.txt matters for AI search and GEO.

How to Find Your Robots.txt File

Every robots.txt file lives at the same location: the root of your domain.

To check yours: Add /robots.txt to the end of your domain name in your browser:

https://yourdomain.com/robots.txt

What you’ll see:

  • A text file with rules: Your site has a robots.txt. Read on to learn how to interpret it.
  • A 404 error: Your site doesn’t have a robots.txt. All crawlers can access everything. This is the default, and it’s fine for many sites, but you’re missing an opportunity to manage crawl behavior.
  • An HTML page instead of plain text: Something is wrong. Your server is returning a webpage instead of a text file, which can confuse crawlers.

Important: The file must be at the exact root of your host. It won’t work at yourdomain.com/folder/robots.txt, and a robots.txt on your main domain doesn’t cover subdomains: blog.yourdomain.com needs its own file at blog.yourdomain.com/robots.txt.
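If you’d rather script this check than open a browser, here’s a minimal sketch using Python’s standard library. It fetches the file, confirms the response looks like plain text, and prints the first rules (yourdomain.com is a placeholder; swap in your own domain):

# Fetch a robots.txt and sanity-check the response.
import urllib.error
import urllib.request

url = "https://yourdomain.com/robots.txt"  # placeholder: use your own domain
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type", "")
        body = resp.read().decode("utf-8", errors="replace")
        print(f"HTTP {resp.status}, Content-Type: {content_type}")
        if "text/plain" not in content_type:
            print("Warning: not served as text/plain; crawlers may mishandle it")
        print(body[:500])  # first few rules
except urllib.error.HTTPError as e:
    if e.code == 404:
        print("No robots.txt: crawlers assume full access (the default)")
    else:
        print(f"Unexpected status: HTTP {e.code}")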

Robots.txt Syntax: The Complete Reference

Robots.txt uses a simple syntax with only a handful of directives. Here’s every directive you need to know:

The 5 Core Directives

Directive | What It Does | Example
User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot
Disallow | Blocks crawlers from accessing a path | Disallow: /admin/
Allow | Overrides a Disallow for a specific path | Allow: /admin/login
Sitemap | Tells crawlers where to find your sitemap | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Adds a delay between requests (not supported by Google) | Crawl-delay: 10

Syntax Rules You Must Follow

  • One directive per line (never put User-agent and Disallow on the same line)
  • Each group starts with a User-agent: line followed by one or more Disallow/Allow lines
  • Separate groups with a blank line
  • Path values are case-sensitive (/Photos/ is different from /photos/)
  • Directive names are case-insensitive (User-agent and user-agent both work)
  • Paths must start with / (forward slash)
  • Comments start with # and are ignored by crawlers
  • The file must be saved as plain text (UTF-8 encoding), not HTML or rich text
  • Sitemap directives can go anywhere in the file (they're not group-specific)
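Several of these rules are easy to machine-check. The sketch below is a deliberately naive linter, not a full parser: it only covers the missing-colon, leading-slash, and absolute-sitemap-URL rules above, and it won’t catch every mistake.

# Naive robots.txt lint sketch (checks a few of the rules above).
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text: str) -> list[str]:
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if not line:
            continue  # blank lines separate groups
        if ":" not in line:
            problems.append(f"line {n}: missing ':' after the directive name")
            continue
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name not in KNOWN:
            problems.append(f"line {n}: unknown directive '{name}'")
        elif name in ("disallow", "allow") and value and not value.startswith("/"):
            problems.append(f"line {n}: path should start with '/'")
        elif name == "sitemap" and not value.startswith(("http://", "https://")):
            problems.append(f"line {n}: sitemap should be an absolute URL")
    return problems

print(lint_robots("User-agent: *\nDisallow: admin/\nsitemap: sitemap.xml"))
# -> flags the relative path on line 2 and the relative sitemap on line 3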

Wildcards and Pattern Matching

Google and Bing support two wildcard characters in robots.txt paths, even though they’re not part of the original specification:

Character | Meaning | Example | Matches
* | Matches any sequence of characters | Disallow: /*.pdf | All PDF files anywhere on the site
$ | Matches the end of the URL | Disallow: /*.php$ | URLs ending in .php (but not /page.php?id=1)
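To make the semantics concrete, here’s a small sketch of how these wildcards translate into regular expressions. It mirrors the documented behavior for illustration; it is not Google’s actual matcher:

# Translate a Google-style robots.txt pattern into a regex.
import re

def pattern_to_regex(path: str) -> re.Pattern:
    # '$' at the end anchors the match; '*' becomes '.*'.
    anchored = path.endswith("$")
    body = path[:-1] if anchored else path
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = pattern_to_regex("/*.php$")
print(bool(rule.match("/page.php")))       # True  -> blocked
print(bool(rule.match("/page.php?id=1")))  # False -> not blocked ($ anchor)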

Valid vs. Invalid Robots.txt: Side-by-Side Examples

Understanding what makes a robots.txt valid or broken is easier with examples:

Example 1: Allow all crawlers, block admin

Valid
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

Invalid
User-agent: * Disallow: /admin/
Allow /
sitemap: sitemap.xml

Issues: directive on same line as User-agent, missing colon after Allow, sitemap URL is relative (needs full https:// URL)

Example 2: Block specific bot from specific directory

Valid
User-agent: GPTBot
Disallow: /private/
Disallow: /customer-data/

User-agent: *
Allow: /

Invalid
User-agent: GPTBot
Disallow: private/
Disallow: customer-data

User-agent: *
Allow: /

Issues: paths are missing the leading / (must be /private/, not private/), so crawlers may ignore the rules entirely. Note also that Disallow matches by prefix: /customer-data would block /customer-data-export too, while /customer-data/ restricts the rule to that directory.

Example 3: Block everything (the nuclear option)

Valid but dangerous
User-agent: *
Disallow: /

This blocks ALL crawlers from ALL pages. Your site will stop being crawled and will gradually disappear from Google, AI search, everything. Only use this on staging/dev sites.

What you probably want instead
User-agent: *
Allow: /
Disallow: /staging/
Disallow: /test/

Allow everything except specific directories you want hidden.

How to Test Your Robots.txt: 3 Methods

Method 1: Robots.txt tester tool (quickest)

Use a robots.txt tester tool to analyze your file instantly:

  1. Enter your website URL
  2. The tool fetches your robots.txt and parses every directive
  3. See which user agents have rules, what’s allowed, what’s blocked
  4. Check specific URLs against specific bots to verify behavior

This is the fastest way to catch syntax errors, conflicting rules, and accidental blocks.
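You can also script this kind of spot check. Python’s built-in urllib.robotparser implements the original robots.txt spec (note: it does not support the Google-style * and $ wildcards covered above), which is enough for simple files:

# Check specific URLs against specific bots with the stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://yourdomain.com/robots.txt")  # placeholder domain
rp.read()  # fetches and parses the live file

for agent, url in [
    ("Googlebot", "https://yourdomain.com/admin/"),
    ("GPTBot", "https://yourdomain.com/blog/post"),
]:
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")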

Method 2: Google Search Console (most authoritative)

Google Search Console has a built-in robots.txt report:

  1. Go to Google Search Console
  2. Select your property
  3. Navigate to Settings in the left sidebar
  4. Look for the robots.txt section
  5. Google shows the last crawled version, any warnings, and errors

What to look for:

  • Warnings: Google found rules that might not work as intended
  • Errors: Syntax issues that prevent proper parsing
  • Last crawled: When Google last fetched your robots.txt (should be recent)

You can also check specific URLs with the URL Inspection tool: enter a page URL and the crawl details show whether Googlebot is blocked by robots.txt.

Method 3: Manual review (for simple files)

For small robots.txt files, you can review manually:

  1. Go to yourdomain.com/robots.txt
  2. Read through each group (each User-agent: section)
  3. Verify that important pages aren’t accidentally blocked
  4. Check that a Sitemap line is present
  5. Look for the AI crawler rules we covered in why robots.txt matters for AI search

The 10 Most Common Robots.txt Mistakes

These are the mistakes we see most often when testing client websites. Some of them can cause serious SEO damage.

1. Blocking your entire site by accident

The mistake: Leaving Disallow: / under User-agent: * from a development or staging environment.

How it happens: You block crawlers during development, then push the staging robots.txt to production. Or a CMS plugin adds a blanket disallow that you don’t notice.

The fix: Always verify your robots.txt after deploying to production. Set up monitoring to alert you if the file changes.
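Monitoring can be as simple as a scheduled script that compares a hash of the live file against a stored snapshot. A minimal sketch (the URL and snapshot path are placeholders; run it from cron or CI):

# Naive robots.txt change monitor: alert when the live file changes.
import hashlib
import pathlib
import urllib.request

URL = "https://yourdomain.com/robots.txt"       # placeholder domain
SNAPSHOT = pathlib.Path("robots.snapshot")      # stores the last-seen hash

with urllib.request.urlopen(URL, timeout=10) as resp:
    digest = hashlib.sha256(resp.read()).hexdigest()

if SNAPSHOT.exists() and SNAPSHOT.read_text() != digest:
    print("ALERT: robots.txt changed -- review before crawlers pick it up")
SNAPSHOT.write_text(digest)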

2. No robots.txt file at all

The issue: Without a robots.txt file, the request returns a 404 and crawlers assume full access. This isn’t an error, but it means you can’t manage crawl behavior, can’t direct crawlers to your sitemap, and can’t make informed decisions about AI crawler access.

The fix: Create a basic robots.txt file. Even a minimal one helps. Use our Robots.txt Generator if you’re not sure where to start.

3. Missing leading slash on paths

The mistake: Writing Disallow: admin/ instead of Disallow: /admin/.

Why it matters: Paths in robots.txt must start with /. Without it, the rule may not be recognized or may behave unpredictably.

The fix: Always start paths with /.

4. Confusing Disallow with noindex

The mistake: Using Disallow when you actually want to prevent a page from appearing in search results.

Why it matters: Disallow tells crawlers not to crawl a page, but Google can still index and display the URL in search results if other sites link to it. You’ll see the page in Google with a “No information is available for this page” message.

The fix: For pages you want removed from search results, use a noindex meta tag or X-Robots-Tag header. Disallow is for crawl budget management, not for hiding pages from search.
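For reference, the two index-level controls look like this: a meta tag in the page’s HTML head, or an HTTP response header for non-HTML files such as PDFs:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex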

5. Conflicting Allow and Disallow rules

The mistake: Having rules that contradict each other:

User-agent: *
Disallow: /blog/
Allow: /blog/

How Google resolves it: When Allow and Disallow rules conflict, Google uses the most specific rule (the one with the longest path). If they’re the same length, Google defaults to Allow. But other crawlers may handle this differently.

The fix: Be explicit. Don’t rely on conflict resolution rules. Make your intent clear with non-overlapping paths.
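For intuition, Google’s documented longest-match tie-breaking can be sketched in a few lines. This simplification ignores wildcards and percent-encoding, which real parsers handle:

# Longest matching path wins; Allow wins ties (Google's documented rule).
def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    best_directive, best_path = "allow", ""  # no matching rule means allowed
    for directive, path in rules:
        if not url_path.startswith(path):
            continue
        if len(path) > len(best_path):
            best_directive, best_path = directive.lower(), path
        elif len(path) == len(best_path) and directive.lower() == "allow":
            best_directive = "allow"  # Allow wins ties
    return best_directive == "allow"

rules = [("Disallow", "/blog/"), ("Allow", "/blog/public/")]
print(is_allowed("/blog/private", rules))      # False: /blog/ is the longest match
print(is_allowed("/blog/public/post", rules))  # True: /blog/public/ is more specific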

6. Blocking CSS, JavaScript, or images

The mistake: Blocking crawlers from accessing CSS, JS, or image files that are needed to render your pages.

Why it matters: Google needs to render your pages to understand their content. If you block CSS or JavaScript, Google can’t see your page the way users do. This can harm your rankings and prevent proper indexing.

The fix: Don’t block /css/, /js/, /images/, or /assets/ directories unless you have a specific reason. Check Google’s URL Inspection tool to see if rendering issues exist.

7. Wrong file location

The mistake: Placing robots.txt in a subdirectory like /pages/robots.txt instead of the root.

Why it matters: Crawlers only check the root domain for robots.txt. A file anywhere else is ignored.

The fix: robots.txt must be at https://yourdomain.com/robots.txt. No exceptions.

8. Protocol mismatch (http vs https)

The mistake: Having robots.txt only on http:// when your site runs on https:// (or vice versa).

Why it matters: Robots.txt rules only apply to the protocol and host where the file is hosted. If your site is HTTPS but your robots.txt is only reachable over HTTP, those rules don’t govern your HTTPS pages.

The fix: Ensure robots.txt is accessible at the same protocol your site uses. If you redirect HTTP to HTTPS, make sure the robots.txt is available at the HTTPS URL.

9. Relative sitemap URL

The mistake: Using Sitemap: /sitemap.xml instead of the full URL.

Why it matters: The Sitemap directive requires an absolute URL including the protocol and domain.

The fix: Always use the full URL: Sitemap: https://yourdomain.com/sitemap.xml

10. Not updating for AI crawlers

The mistake: Having a robots.txt from 2020 that doesn’t mention GPTBot, ClaudeBot, PerplexityBot, or any other AI crawler.

Why it matters: Without explicit rules for AI bots, they fall under your User-agent: * rule, which may not reflect your actual intent. You’re either letting all AI bots in by default or blocking them all by accident.

The fix: Add explicit rules for AI crawlers. Our article on why robots.txt matters for AI search and GEO includes a complete template with every major AI bot.
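As a minimal illustration of what explicit AI-crawler rules look like (these user-agent tokens are the ones the companies publish; the policy choices here are just an example, not a recommendation):

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /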

Robots.txt vs. Meta Robots vs. X-Robots-Tag

These three mechanisms all control crawler behavior, but they do different things:

Method | Scope | Controls | Use When
robots.txt | Entire site or directories | Crawling (what bots can access) | Managing crawl budget, blocking directories, controlling AI bot access
Meta robots tag | Individual pages | Indexing (noindex, nofollow) | Preventing specific pages from appearing in search results
X-Robots-Tag | Individual URLs (HTTP header) | Indexing (noindex, nofollow) | Non-HTML files (PDFs, images) or when you can’t edit page HTML

Key insight: You often need more than one of these. Use robots.txt for crawl-level control and meta robots or X-Robots-Tag for index-level control. They’re complementary, not alternatives. And for AI search specifically, consider adding an llms.txt file alongside your robots.txt. While robots.txt controls access, llms.txt provides AI systems with a structured summary of your site’s most important content. Together, they form the foundation of a solid GEO strategy.

Robots.txt Checklist: Before Going Live

Use this checklist every time you create or update a robots.txt file:

  • File is at the root of your domain (yourdomain.com/robots.txt)
  • File returns HTTP 200 (not 404, 500, or a redirect)
  • File is plain text, not HTML or rich text
  • Every group starts with a User-agent: line
  • One directive per line (no combining on a single line)
  • All paths start with / (forward slash)
  • No accidental Disallow: / under User-agent: * (unless intentional)
  • CSS, JS, and image directories are not blocked
  • Sitemap directive uses a full absolute URL (https://...)
  • AI crawler rules are explicitly set (GPTBot, ClaudeBot, PerplexityBot)
  • Groups are separated by blank lines
  • Tested with a robots.txt tester tool after every change
  • No conflicting Allow/Disallow rules for the same path

Next Steps

Your robots.txt is one piece of your site’s technical SEO foundation. Once it passes the checklist above, keep it under regular review alongside the rest of your crawl setup.

Frequently Asked Questions

How do I find my website's robots.txt file?
Add /robots.txt to your domain name (example.com/robots.txt). If a file exists, you'll see its contents as plain text. If you get a 404, your site doesn't have one.

What happens if my website has no robots.txt file?
All crawlers assume they can access every page. This is fine for many websites, but you lose the ability to manage crawl behavior or direct crawlers to your sitemap.

What is the difference between Disallow and noindex?
Disallow tells crawlers not to crawl a page, but Google can still index the URL if other pages link to it. Noindex tells Google not to show the page in search results at all. Use noindex when you want a page completely removed from search.

Is robots.txt case-sensitive?
Path rules are case-sensitive: /Photos/ is different from /photos/. Directive names (User-agent, Disallow) are case-insensitive. This is a common source of mistakes.

How do I test if my robots.txt is working correctly?
Use Google Search Console's robots.txt report, a robots.txt tester tool, or view the file directly in your browser. Always test after making changes.

What does 'User-agent: *' mean in robots.txt?
The asterisk is a wildcard matching all crawlers. Rules under User-agent: * apply to every bot that doesn't have its own specific section in the file.

Can I use robots.txt to remove a page from Google?
No. Disallow prevents crawling but not indexing. Use the noindex meta tag or X-Robots-Tag header to remove pages from search results.

How often should I review my robots.txt file?
After any site redesign, migration, or plugin update. Also quarterly to catch accidental changes. Review whenever new AI crawlers emerge to decide whether to allow them.

Test Your Robots.txt Now

Find syntax errors, conflicting rules, and accidental blocks. Free, instant results.


Lucky Oleg

Lucky Oleg is the founder of Web Aloha, a web design & SEO agency helping businesses ride the digital wave. With years of experience in WordPress, technical SEO, and web performance, he writes about what actually works in the real world.