How to Check and Test Your Robots.txt File: The Complete Guide

Author: Lucky Oleg

Your robots.txt file is a few lines of plain text that control how every search engine and AI bot interacts with your website. Get it right, and crawlers efficiently index your best content. Get it wrong, and you could accidentally block Google from your entire site or make yourself invisible to AI search.

This guide covers everything you need to know about robots.txt: how to find yours, how to read it, what valid syntax looks like, how to test it, and the most common mistakes that cause real damage.

If you’re specifically interested in how robots.txt affects AI search and GEO, read our companion article on why robots.txt matters for AI search and GEO.

How to Find Your Robots.txt File

Every robots.txt file lives at the same location: the root of your domain.

To check yours: Add /robots.txt to the end of your domain name in your browser:

https://yourdomain.com/robots.txt

What you’ll see:

  • A text file with rules: Your site has a robots.txt. Read on to learn how to interpret it.
  • A 404 error: Your site doesn’t have a robots.txt. All crawlers can access everything. This is the default, and it’s fine for many sites, but you’re missing an opportunity to manage crawl behavior.
  • An HTML page instead of plain text: Something is wrong. Your server is returning a webpage instead of a text file, which can confuse crawlers.

Important: The file must be at the exact root of your host. It won’t work at yourdomain.com/folder/robots.txt, and a robots.txt on your main domain doesn’t cover subdomains: blog.yourdomain.com needs its own file at blog.yourdomain.com/robots.txt.
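If you’d rather script this check than open a browser, here’s a minimal sketch using Python’s standard library. It fetches the file, confirms the response looks like plain text, and prints the first rules (yourdomain.com is a placeholder; swap in your own domain):

# Fetch a robots.txt and sanity-check the response.
import urllib.error
import urllib.request

url = "https://yourdomain.com/robots.txt"  # placeholder: use your own domain
try:
    with urllib.request.urlopen(url, timeout=10) as resp:
        content_type = resp.headers.get("Content-Type", "")
        body = resp.read().decode("utf-8", errors="replace")
        print(f"HTTP {resp.status}, Content-Type: {content_type}")
        if "text/plain" not in content_type:
            print("Warning: not served as text/plain; crawlers may mishandle it")
        print(body[:500])  # first few rules
except urllib.error.HTTPError as e:
    if e.code == 404:
        print("No robots.txt: crawlers assume full access (the default)")
    else:
        print(f"Unexpected status: HTTP {e.code}")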

Robots.txt Syntax: The Complete Reference

Robots.txt uses a simple syntax with only a handful of directives. Here’s every directive you need to know:

The 5 Core Directives

Directive | What It Does | Example
User-agent | Specifies which crawler the following rules apply to | User-agent: Googlebot
Disallow | Blocks crawlers from accessing a path | Disallow: /admin/
Allow | Overrides a Disallow for a specific path | Allow: /admin/login
Sitemap | Tells crawlers where to find your sitemap | Sitemap: https://example.com/sitemap.xml
Crawl-delay | Adds a delay between requests (not supported by Google) | Crawl-delay: 10

Syntax Rules You Must Follow

  • One directive per line (never put User-agent and Disallow on the same line)
  • Each group starts with a User-agent: line followed by one or more Disallow/Allow lines
  • Separate groups with a blank line
  • Path values are case-sensitive (/Photos/ is different from /photos/)
  • Directive names are case-insensitive (User-agent and user-agent both work)
  • Paths must start with / (forward slash)
  • Comments start with # and are ignored by crawlers
  • The file must be saved as plain text (UTF-8 encoding), not HTML or rich text
  • Sitemap directives can go anywhere in the file (they're not group-specific)
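Several of these rules are easy to machine-check. The sketch below is a deliberately naive linter, not a full parser: it only covers the missing-colon, leading-slash, and absolute-sitemap-URL rules above, and it won’t catch every mistake.

# Naive robots.txt lint sketch (checks a few of the rules above).
KNOWN = {"user-agent", "disallow", "allow", "sitemap", "crawl-delay"}

def lint_robots(text: str) -> list[str]:
    problems = []
    for n, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()  # comments are ignored
        if not line:
            continue  # blank lines separate groups
        if ":" not in line:
            problems.append(f"line {n}: missing ':' after the directive name")
            continue
        name, _, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if name not in KNOWN:
            problems.append(f"line {n}: unknown directive '{name}'")
        elif name in ("disallow", "allow") and value and not value.startswith("/"):
            problems.append(f"line {n}: path should start with '/'")
        elif name == "sitemap" and not value.startswith(("http://", "https://")):
            problems.append(f"line {n}: sitemap should be an absolute URL")
    return problems

print(lint_robots("User-agent: *\nDisallow: admin/\nsitemap: sitemap.xml"))
# -> flags the relative path on line 2 and the relative sitemap on line 3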

Wildcards and Pattern Matching

Google and Bing support two wildcard characters in robots.txt paths, even though they’re not part of the original specification:

Character | Meaning | Example | Matches
* | Matches any sequence of characters | Disallow: /*.pdf | All PDF files anywhere on the site
$ | Matches the end of the URL | Disallow: /*.php$ | URLs ending in .php (but not /page.php?id=1)
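To make the semantics concrete, here’s a small sketch of how these wildcards translate into regular expressions. It mirrors the documented behavior for illustration; it is not Google’s actual matcher:

# Translate a Google-style robots.txt pattern into a regex.
import re

def pattern_to_regex(path: str) -> re.Pattern:
    # '$' at the end anchors the match; '*' becomes '.*'.
    anchored = path.endswith("$")
    body = path[:-1] if anchored else path
    regex = re.escape(body).replace(r"\*", ".*")
    return re.compile("^" + regex + ("$" if anchored else ""))

rule = pattern_to_regex("/*.php$")
print(bool(rule.match("/page.php")))       # True  -> blocked
print(bool(rule.match("/page.php?id=1")))  # False -> not blocked ($ anchor)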

Valid vs. Invalid Robots.txt: Side-by-Side Examples

Understanding what makes a robots.txt valid or broken is easier with examples:

Example 1: Allow all crawlers, block admin

Valid
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/

Sitemap: https://example.com/sitemap.xml

Invalid
User-agent: * Disallow: /admin/
Allow /
sitemap: sitemap.xml

Issues: directive on same line as User-agent, missing colon after Allow, sitemap URL is relative (needs full https:// URL)

Example 2: Block specific bot from specific directory

Valid
User-agent: GPTBot
Disallow: /private/
Disallow: /customer-data/

User-agent: *
Allow: /

Invalid
User-agent: GPTBot
Disallow: private/
Disallow: customer-data

User-agent: *
Allow: /

Issues: paths are missing the leading / (must be /private/, not private/), so crawlers may ignore the rules entirely. Note also that Disallow matches by prefix: /customer-data would block /customer-data-export too, while /customer-data/ restricts the rule to that directory.

Example 3: Block everything (the nuclear option)

Valid but dangerous
User-agent: *
Disallow: /

This blocks ALL crawlers from ALL pages. Your site will stop being crawled and will gradually disappear from Google, AI search, everything. Only use this on staging/dev sites.

What you probably want instead
User-agent: *
Allow: /
Disallow: /staging/
Disallow: /test/

Allow everything except specific directories you want hidden.

How to Test Your Robots.txt: 3 Methods

Method 1: Robots.txt tester tool (quickest)

Use a robots.txt tester tool to analyze your file instantly:

  1. Enter your website URL
  2. The tool fetches your robots.txt and parses every directive
  3. See which user agents have rules, what’s allowed, what’s blocked
  4. Check specific URLs against specific bots to verify behavior

This is the fastest way to catch syntax errors, conflicting rules, and accidental blocks.
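You can also script this kind of spot check. Python’s built-in urllib.robotparser implements the original robots.txt spec (note: it does not support the Google-style * and $ wildcards covered above), which is enough for simple files:

# Check specific URLs against specific bots with the stdlib parser.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://yourdomain.com/robots.txt")  # placeholder domain
rp.read()  # fetches and parses the live file

for agent, url in [
    ("Googlebot", "https://yourdomain.com/admin/"),
    ("GPTBot", "https://yourdomain.com/blog/post"),
]:
    verdict = "allowed" if rp.can_fetch(agent, url) else "blocked"
    print(f"{agent} -> {url}: {verdict}")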

Method 2: Google Search Console (most authoritative)

Google Search Console has a built-in robots.txt report:

  1. Go to Google Search Console
  2. Select your property
  3. Navigate to Settings in the left sidebar
  4. Look for the robots.txt section
  5. Google shows the last crawled version, any warnings, and errors

What to look for:

  • Warnings: Google found rules that might not work as intended
  • Errors: Syntax issues that prevent proper parsing
  • Last crawled: When Google last fetched your robots.txt (should be recent)

You can also check specific URLs with the URL Inspection tool: enter a page URL and the crawl details show whether Googlebot is blocked by robots.txt.

Method 3: Manual review (for simple files)

For small robots.txt files, you can review manually:

  1. Go to yourdomain.com/robots.txt
  2. Read through each group (each User-agent: section)
  3. Verify that important pages aren’t accidentally blocked
  4. Check that a Sitemap line is present
  5. Look for the AI crawler rules we covered in why robots.txt matters for AI search

The 10 Most Common Robots.txt Mistakes

These are the mistakes we see most often when testing client websites. Some of them can cause serious SEO damage.

1. Blocking your entire site by accident

The mistake: Leaving Disallow: / under User-agent: * from a development or staging environment.

How it happens: You block crawlers during development, then push the staging robots.txt to production. Or a CMS plugin adds a blanket disallow that you don’t notice.

The fix: Always verify your robots.txt after deploying to production. Set up monitoring to alert you if the file changes.
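Monitoring can be as simple as a scheduled script that compares a hash of the live file against a stored snapshot. A minimal sketch (the URL and snapshot path are placeholders; run it from cron or CI):

# Naive robots.txt change monitor: alert when the live file changes.
import hashlib
import pathlib
import urllib.request

URL = "https://yourdomain.com/robots.txt"       # placeholder domain
SNAPSHOT = pathlib.Path("robots.snapshot")      # stores the last-seen hash

with urllib.request.urlopen(URL, timeout=10) as resp:
    digest = hashlib.sha256(resp.read()).hexdigest()

if SNAPSHOT.exists() and SNAPSHOT.read_text() != digest:
    print("ALERT: robots.txt changed -- review before crawlers pick it up")
SNAPSHOT.write_text(digest)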

2. No robots.txt file at all

The issue: Without a robots.txt file, the request returns a 404 and crawlers assume full access. This isn’t an error, but it means you can’t manage crawl behavior, can’t direct crawlers to your sitemap, and can’t make informed decisions about AI crawler access.

The fix: Create a basic robots.txt file. Even a minimal one helps. Use our Robots.txt Generator if you’re not sure where to start.

3. Missing leading slash on paths

The mistake: Writing Disallow: admin/ instead of Disallow: /admin/.

Why it matters: Paths in robots.txt must start with /. Without it, the rule may not be recognized or may behave unpredictably.

The fix: Always start paths with /.

4. Confusing Disallow with noindex

The mistake: Using Disallow when you actually want to prevent a page from appearing in search results.

Why it matters: Disallow tells crawlers not to crawl a page, but Google can still index and display the URL in search results if other sites link to it. You’ll see the page in Google with a “No information is available for this page” message.

The fix: For pages you want removed from search results, use a noindex meta tag or X-Robots-Tag header. Disallow is for crawl budget management, not for hiding pages from search.
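For reference, the two index-level controls look like this: a meta tag in the page’s HTML head, or an HTTP response header for non-HTML files such as PDFs:

<meta name="robots" content="noindex">

X-Robots-Tag: noindex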

5. Conflicting Allow and Disallow rules

The mistake: Having rules that contradict each other:

User-agent: *
Disallow: /blog/
Allow: /blog/

How Google resolves it: When Allow and Disallow rules conflict, Google uses the most specific rule (the one with the longest path). If they’re the same length, Google defaults to Allow. But other crawlers may handle this differently.

The fix: Be explicit. Don’t rely on conflict resolution rules. Make your intent clear with non-overlapping paths.
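For intuition, Google’s documented longest-match tie-breaking can be sketched in a few lines. This simplification ignores wildcards and percent-encoding, which real parsers handle:

# Longest matching path wins; Allow wins ties (Google's documented rule).
def is_allowed(url_path: str, rules: list[tuple[str, str]]) -> bool:
    best_directive, best_path = "allow", ""  # no matching rule means allowed
    for directive, path in rules:
        if not url_path.startswith(path):
            continue
        if len(path) > len(best_path):
            best_directive, best_path = directive.lower(), path
        elif len(path) == len(best_path) and directive.lower() == "allow":
            best_directive = "allow"  # Allow wins ties
    return best_directive == "allow"

rules = [("Disallow", "/blog/"), ("Allow", "/blog/public/")]
print(is_allowed("/blog/private", rules))      # False: /blog/ is the longest match
print(is_allowed("/blog/public/post", rules))  # True: /blog/public/ is more specific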

6. Blocking CSS, JavaScript, or images

The mistake: Blocking crawlers from accessing CSS, JS, or image files that are needed to render your pages.

Why it matters: Google needs to render your pages to understand their content. If you block CSS or JavaScript, Google can’t see your page the way users do. This can harm your rankings and prevent proper indexing.

The fix: Don’t block /css/, /js/, /images/, or /assets/ directories unless you have a specific reason. Check Google’s URL Inspection tool to see if rendering issues exist.

7. Wrong file location

The mistake: Placing robots.txt in a subdirectory like /pages/robots.txt instead of the root.

Why it matters: Crawlers only check the root domain for robots.txt. A file anywhere else is ignored.

The fix: robots.txt must be at https://yourdomain.com/robots.txt. No exceptions.

8. Protocol mismatch (http vs https)

The mistake: Having robots.txt only on http:// when your site runs on https:// (or vice versa).

Why it matters: Robots.txt rules only apply to the protocol and host where the file is hosted. If your site is HTTPS but your robots.txt is only reachable over HTTP, those rules don’t govern your HTTPS pages.

The fix: Ensure robots.txt is accessible at the same protocol your site uses. If you redirect HTTP to HTTPS, make sure the robots.txt is available at the HTTPS URL.

9. Relative sitemap URL

The mistake: Using Sitemap: /sitemap.xml instead of the full URL.

Why it matters: The Sitemap directive requires an absolute URL including the protocol and domain.

The fix: Always use the full URL: Sitemap: https://yourdomain.com/sitemap.xml

10. Not updating for AI crawlers

The mistake: Having a robots.txt from 2020 that doesn’t mention GPTBot, ClaudeBot, PerplexityBot, or any other AI crawler.

Why it matters: Without explicit rules for AI bots, they fall under your User-agent: * rule, which may not reflect your actual intent. You’re either letting all AI bots in by default or blocking them all by accident.

The fix: Add explicit rules for AI crawlers. Our article on why robots.txt matters for AI search and GEO includes a complete template with every major AI bot.
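As a minimal illustration of what explicit AI-crawler rules look like (these user-agent tokens are the ones the companies publish; the policy choices here are just an example, not a recommendation):

User-agent: GPTBot
Disallow: /

User-agent: PerplexityBot
Allow: /

User-agent: ClaudeBot
Allow: /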

Robots.txt vs. Meta Robots vs. X-Robots-Tag

These three mechanisms all control crawler behavior, but they do different things:

Method | Scope | Controls | Use When
robots.txt | Entire site or directories | Crawling (what bots can access) | Managing crawl budget, blocking directories, controlling AI bot access
Meta robots tag | Individual pages | Indexing (noindex, nofollow) | Preventing specific pages from appearing in search results
X-Robots-Tag | Individual URLs (HTTP header) | Indexing (noindex, nofollow) | Non-HTML files (PDFs, images) or when you can’t edit page HTML

Key insight: You often need more than one of these. Use robots.txt for crawl-level control and meta robots or X-Robots-Tag for index-level control. They’re complementary, not alternatives. And for AI search specifically, consider adding an llms.txt file alongside your robots.txt. While robots.txt controls access, llms.txt provides AI systems with a structured summary of your site’s most important content. Together, they form the foundation of a solid GEO strategy.

Robots.txt Checklist: Before Going Live

Use this checklist every time you create or update a robots.txt file:

  • File is at the root of your domain (yourdomain.com/robots.txt)
  • File returns HTTP 200 (not 404, 500, or a redirect)
  • File is plain text, not HTML or rich text
  • Every group starts with a User-agent: line
  • One directive per line (no combining on a single line)
  • All paths start with / (forward slash)
  • No accidental Disallow: / under User-agent: * (unless intentional)
  • CSS, JS, and image directories are not blocked
  • Sitemap directive uses a full absolute URL (https://...)
  • AI crawler rules are explicitly set (GPTBot, ClaudeBot, PerplexityBot)
  • Groups are separated by blank lines
  • Tested with a robots.txt tester tool after every change
  • No conflicting Allow/Disallow rules for the same path

Next Steps

Your robots.txt is one piece of your site’s technical SEO foundation. Once it passes the checklist above, keep it under regular review alongside the rest of your crawl setup.

Frequently Asked Questions

How do I find my website's robots.txt file?
Add /robots.txt to your domain name (example.com/robots.txt). If a file exists, you'll see its contents as plain text. If you get a 404, your site doesn't have one.

What happens if my website has no robots.txt file?
All crawlers assume they can access every page. This is fine for many websites, but you lose the ability to manage crawl behavior or direct crawlers to your sitemap.

What is the difference between Disallow and noindex?
Disallow tells crawlers not to crawl a page, but Google can still index the URL if other pages link to it. Noindex tells Google not to show the page in search results at all. Use noindex when you want a page completely removed from search.

Is robots.txt case-sensitive?
Path rules are case-sensitive: /Photos/ is different from /photos/. Directive names (User-agent, Disallow) are case-insensitive. This is a common source of mistakes.

How do I test if my robots.txt is working correctly?
Use Google Search Console's robots.txt report, a robots.txt tester tool, or view the file directly in your browser. Always test after making changes.

What does 'User-agent: *' mean in robots.txt?
The asterisk is a wildcard matching all crawlers. Rules under User-agent: * apply to every bot that doesn't have its own specific section in the file.

Can I use robots.txt to remove a page from Google?
No. Disallow prevents crawling but not indexing. Use the noindex meta tag or X-Robots-Tag header to remove pages from search results.

How often should I review my robots.txt file?
After any site redesign, migration, or plugin update. Also quarterly to catch accidental changes. Review whenever new AI crawlers emerge to decide whether to allow them.

Test Your Robots.txt Now

Find syntax errors, conflicting rules, and accidental blocks. Free, instant results.


Lucky Oleg

Lucky Oleg is the founder of Web Aloha, a web design & SEO agency helping businesses ride the digital wave. With years of experience in WordPress, technical SEO, and web performance, he writes about what actually works in the real world.