robots.txt: The Complete Guide to Crawl Control

A misconfigured robots.txt can make your entire site invisible to Google overnight. This guide covers every directive, common mistakes, and how to test safely.

Marcus Webb · 7 min read · March 28, 2026

SEO consultant, 9 years experience, formerly Head of SEO at two Series B startups

robots.txt is a plain-text file placed at the root of your domain (e.g., yourdomain.com/robots.txt) that tells web crawlers which pages they're allowed to access. A single misconfiguration — a Disallow: / left over from development — can make your entire site invisible to Google. Understanding robots.txt syntax isn't optional for anyone managing an SEO-dependent website.

How robots.txt works

When Googlebot visits your site, it first requests yourdomain.com/robots.txt. If the file exists, Googlebot reads the directives and crawls only the paths they permit. If the file doesn't exist, Googlebot assumes full crawl access with no restrictions. Keep in mind that robots.txt directives are voluntary: they only affect well-behaved crawlers such as Googlebot and Bingbot that respect the standard, and malicious crawlers ignore them entirely.
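The fetch-and-check flow can be sketched with Python's standard-library robots.txt parser. The rules and URLs below are illustrative, not from a real site; in production you would point the parser at your live file with set_url() and read() instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; a real crawler fetches https://yourdomain.com/robots.txt
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# No group names Googlebot directly, so it falls back to the * group.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/users"))  # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/pricing"))      # True
```

This mirrors what a compliant crawler does on every visit: fetch the file once, then consult the parsed rules before requesting each URL.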

⚠️ Warning

Robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in Google's index if other pages link to it — Google knows it exists from the links even if it can't crawl it. To prevent indexing, use a noindex meta tag (which requires the page to be crawlable). Robots.txt and noindex serve different purposes.
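The noindex signal the warning refers to lives in the page itself, not in robots.txt. A typical form looks like this (remember the page must stay crawlable for Googlebot to see the tag at all):

```html
<!-- In the page's <head>: tell crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent signal can be sent as an `X-Robots-Tag: noindex` HTTP response header.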

Basic robots.txt syntax

# robots.txt — basic structure

# Apply rules to all crawlers
User-agent: *
Disallow: /admin/          # Block the admin section
Disallow: /private/        # Block private files
Allow: /                   # Allow everything else

# Apply different rules to Googlebot specifically
User-agent: Googlebot
Disallow: /internal-tools/

# Point crawlers to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml

The most common robots.txt directives

  • User-agent: * — applies the following rules to all crawlers
  • User-agent: Googlebot — applies rules only to Google's crawler
  • Disallow: /path/ — blocks crawlers from accessing this path and all sub-paths
  • Allow: /path/ — explicitly permits a path that might otherwise be blocked by a broader Disallow rule
  • Sitemap: URL — points crawlers to your XML sitemap location
  • Crawl-delay: 10 — asks crawlers to wait 10 seconds between requests (note: Googlebot ignores this directive and manages its own crawl rate automatically; Bing and some other crawlers honor it)
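Group selection, meaning which User-agent block applies to a given crawler, can be sketched with the standard-library parser. One caveat: Python's implementation has its own rule-precedence behavior that can differ from Google's longest-match rule in edge cases, so treat this as an approximation of group matching, not a byte-for-byte Googlebot emulator. The rules below are illustrative:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /internal-tools/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, so only that group's rules apply to it.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/internal-tools/x"))  # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/private/x"))         # True
# Crawlers without a named group fall back to the * group.
print(parser.can_fetch("Bingbot", "https://yourdomain.com/private/x"))           # False
```

Note the non-obvious consequence: once a crawler matches a named group, the * group no longer applies to it, which is why Googlebot may crawl /private/ in this example.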

Critical robots.txt mistakes

Mistake 1: Blocking the entire site

The most catastrophic robots.txt error. Set during development to prevent Google from indexing an unfinished site, then never removed at launch. Result: zero indexed pages, zero organic traffic.

# DANGEROUS — blocks all crawlers from everything
User-agent: *
Disallow: /

# CORRECT — allows full crawl access
User-agent: *
Allow: /
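A pre-launch guard against this mistake can be as simple as failing the deploy if a blanket block is present. A minimal sketch (the function name and the sample rule strings are illustrative, not part of any standard tooling):

```python
def has_blanket_block(robots_txt: str) -> bool:
    """Return True if any line is a bare 'Disallow: /' rule."""
    for raw in robots_txt.splitlines():
        # Strip inline comments, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path == "/":
                return True
    return False

# Hypothetical pre-deploy check:
staging_rules = "User-agent: *\nDisallow: /\n"
launch_rules = "User-agent: *\nAllow: /\n"
print(has_blanket_block(staging_rules))  # True
print(has_blanket_block(launch_rules))   # False
```

Wired into CI as a required check, this turns the "forgot to remove the development block" failure mode from a silent traffic disaster into a failed build.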

Mistake 2: Blocking CSS and JavaScript

Blocking /wp-content/ or static asset directories prevents Googlebot from rendering your pages correctly. If Google can't load your CSS and JavaScript, it sees a broken, unrendered version of your site — which can hurt rankings significantly.

Mistake 3: Using robots.txt instead of noindex for sensitive pages

If you want a page to not appear in Google's index, blocking it in robots.txt doesn't guarantee that. Google may still list the URL in search results if other sites link to it — it just can't read the content. Use noindex meta tags for pages that must not appear in search results.

How to test your robots.txt

Google Search Console provides a robots.txt report under Settings → robots.txt. It shows the version of the file Google last fetched, when it was fetched, and any parse errors. (The older standalone robots.txt Tester tool has been retired; to check whether a specific URL is crawlable under your current rules, use the URL Inspection tool.) Always test before deploying changes to robots.txt: a typo in a path can block thousands of pages.

  • Check GSC Settings → robots.txt to view your current file and any fetch or parse errors
  • Test every critical URL type: homepage, product pages, blog posts, sitemap
  • After any robots.txt change, request a recrawl of the file via the GSC robots.txt report
  • Monitor the GSC Page indexing (formerly Coverage) report for spikes in 'Blocked by robots.txt' errors after changes
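The "test every critical URL type" step above can be automated as a smoke test that runs proposed rules against one representative URL per page type before deploy. The rules and URLs here are illustrative placeholders:

```python
from urllib.robotparser import RobotFileParser

proposed_rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

# One representative URL per critical page type, including a CSS asset,
# since blocking assets breaks rendering (see Mistake 2 above).
critical_urls = [
    "https://yourdomain.com/",
    "https://yourdomain.com/products/widget-42",
    "https://yourdomain.com/blog/robots-txt-guide",
    "https://yourdomain.com/assets/site.css",
]

parser = RobotFileParser()
parser.parse(proposed_rules.splitlines())

blocked = [u for u in critical_urls if not parser.can_fetch("Googlebot", u)]
print(blocked)  # prints [] when every critical URL stays crawlable
```

Failing the deploy whenever `blocked` is non-empty catches path typos before they reach production, instead of discovering them in the Page indexing report days later.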

robots.txt best practices

  • Always include a Sitemap: directive pointing to your XML sitemap
  • Block admin, login, and internal tool paths from all crawlers
  • Do not block CSS, JavaScript, or font files — Google needs them to render your pages
  • Test any change against your critical URLs (e.g., with a robots.txt parser or GSC's URL Inspection tool) before deploying to production
  • Remove development Disallow: / rules before launch — set a deployment checklist item
  • Keep the file simple — complex robots.txt files with many conflicting rules cause unpredictable behavior

💡 Tip

Practice this in the game: Chapter 1-1 (The Silent Launch) puts you in the middle of a Disallow: / disaster — a 2,000-product e-commerce store invisible to Google because of one line in robots.txt.

Learn this by doing — not just reading.

SEOdisaster.com teaches SEO through interactive disaster scenarios. Put these concepts into practice in the game.

Play Free →