robots.txt: The Complete Guide to Crawl Control

A misconfigured robots.txt can make your entire site invisible to Google overnight. This guide covers every directive, common mistakes, and how to test safely.

Marcus Webb · 7 min read · March 28, 2026

SEO consultant, 9 years experience, formerly Head of SEO at two Series B startups

robots.txt is a plain-text file placed at the root of your domain (e.g., yourdomain.com/robots.txt) that tells web crawlers which pages they're allowed to access. A single misconfiguration — a Disallow: / left over from development — can make your entire site invisible to Google. Understanding robots.txt syntax isn't optional for anyone managing an SEO-dependent website.

How robots.txt works

When Googlebot visits your site, it first requests yourdomain.com/robots.txt. If the file exists, Googlebot reads the directives and crawls only the paths they permit. If the file doesn't exist, Googlebot assumes full crawl access with no restrictions. Keep in mind that robots.txt directives are voluntary: they only affect well-behaved crawlers such as Googlebot and Bingbot that respect the standard, and malicious crawlers ignore them entirely.
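The fetch-and-check flow can be sketched with Python's standard-library robots.txt parser. The rules and URLs below are illustrative, not from a real site; in production you would point the parser at your live file with set_url() and read() instead of parsing a string:

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules; a real crawler fetches https://yourdomain.com/robots.txt
rules = """\
User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# No group names Googlebot directly, so it falls back to the * group.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/admin/users"))  # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/pricing"))      # True
```

This mirrors what a compliant crawler does on every visit: fetch the file once, then consult the parsed rules before requesting each URL.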

⚠️ Warning

Robots.txt controls crawling, not indexing. A page blocked by robots.txt can still appear in Google's index if other pages link to it — Google knows it exists from the links even if it can't crawl it. To prevent indexing, use a noindex meta tag (which requires the page to be crawlable). Robots.txt and noindex serve different purposes.
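The noindex signal the warning refers to lives in the page itself, not in robots.txt. A typical form looks like this (remember the page must stay crawlable for Googlebot to see the tag at all):

```html
<!-- In the page's <head>: tell crawlers not to index this page -->
<meta name="robots" content="noindex">
```

For non-HTML resources such as PDFs, the equivalent signal can be sent as an `X-Robots-Tag: noindex` HTTP response header.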

Basic robots.txt syntax

# robots.txt — basic structure

# Apply rules to all crawlers
User-agent: *
Disallow: /admin/          # Block the admin section
Disallow: /private/        # Block private files
Allow: /                   # Allow everything else

# Apply different rules to Googlebot specifically
User-agent: Googlebot
Disallow: /internal-tools/

# Point crawlers to your sitemap
Sitemap: https://yourdomain.com/sitemap.xml

The most common robots.txt directives

  • User-agent: * — applies the following rules to all crawlers
  • User-agent: Googlebot — applies rules only to Google's crawler
  • Disallow: /path/ — blocks crawlers from accessing this path and all sub-paths
  • Allow: /path/ — explicitly permits a path that might otherwise be blocked by a broader Disallow rule
  • Sitemap: URL — points crawlers to your XML sitemap location
  • Crawl-delay: 10 — asks crawlers to wait 10 seconds between requests (note: Googlebot ignores this directive and manages its own crawl rate automatically; Bing and some other crawlers honor it)
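Group selection, meaning which User-agent block applies to a given crawler, can be sketched with the standard-library parser. One caveat: Python's implementation has its own rule-precedence behavior that can differ from Google's longest-match rule in edge cases, so treat this as an approximation of group matching, not a byte-for-byte Googlebot emulator. The rules below are illustrative:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow: /internal-tools/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot matches its own group, so only that group's rules apply to it.
print(parser.can_fetch("Googlebot", "https://yourdomain.com/internal-tools/x"))  # False
print(parser.can_fetch("Googlebot", "https://yourdomain.com/private/x"))         # True
# Crawlers without a named group fall back to the * group.
print(parser.can_fetch("Bingbot", "https://yourdomain.com/private/x"))           # False
```

Note the non-obvious consequence: once a crawler matches a named group, the * group no longer applies to it, which is why Googlebot may crawl /private/ in this example.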

Critical robots.txt mistakes

Mistake 1: Blocking the entire site

The most catastrophic robots.txt error. Set during development to prevent Google from indexing an unfinished site, then never removed at launch. Result: zero indexed pages, zero organic traffic.

# DANGEROUS — blocks all crawlers from everything
User-agent: *
Disallow: /

# CORRECT — allows full crawl access
User-agent: *
Allow: /
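A pre-launch guard against this mistake can be as simple as failing the deploy if a blanket block is present. A minimal sketch (the function name and the sample rule strings are illustrative, not part of any standard tooling):

```python
def has_blanket_block(robots_txt: str) -> bool:
    """Return True if any line is a bare 'Disallow: /' rule."""
    for raw in robots_txt.splitlines():
        # Strip inline comments, then surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if line.lower().startswith("disallow:"):
            path = line.split(":", 1)[1].strip()
            if path == "/":
                return True
    return False

# Hypothetical pre-deploy check:
staging_rules = "User-agent: *\nDisallow: /\n"
launch_rules = "User-agent: *\nAllow: /\n"
print(has_blanket_block(staging_rules))  # True
print(has_blanket_block(launch_rules))   # False
```

Wired into CI as a required check, this turns the "forgot to remove the development block" failure mode from a silent traffic disaster into a failed build.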

Mistake 2: Blocking CSS and JavaScript

Blocking /wp-content/ or static asset directories prevents Googlebot from rendering your pages correctly. If Google can't load your CSS and JavaScript, it sees a broken, unrendered version of your site — which can hurt rankings significantly.

Mistake 3: Using robots.txt instead of noindex for sensitive pages

If you want a page to not appear in Google's index, blocking it in robots.txt doesn't guarantee that. Google may still list the URL in search results if other sites link to it — it just can't read the content. Use noindex meta tags for pages that must not appear in search results.

How to test your robots.txt

Google Search Console provides a robots.txt report under Settings → robots.txt. It shows the version of the file Google last fetched, when it was fetched, and any parse errors. (The older standalone robots.txt Tester tool has been retired; to check whether a specific URL is crawlable under your current rules, use the URL Inspection tool.) Always test before deploying changes to robots.txt: a typo in a path can block thousands of pages.

  • Check GSC Settings → robots.txt to view your current file and any fetch or parse errors
  • Test every critical URL type: homepage, product pages, blog posts, sitemap
  • After any robots.txt change, request a recrawl of the file via the GSC robots.txt report
  • Monitor the GSC Page indexing (formerly Coverage) report for spikes in 'Blocked by robots.txt' errors after changes
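The "test every critical URL type" step above can be automated as a smoke test that runs proposed rules against one representative URL per page type before deploy. The rules and URLs here are illustrative placeholders:

```python
from urllib.robotparser import RobotFileParser

proposed_rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

# One representative URL per critical page type, including a CSS asset,
# since blocking assets breaks rendering (see Mistake 2 above).
critical_urls = [
    "https://yourdomain.com/",
    "https://yourdomain.com/products/widget-42",
    "https://yourdomain.com/blog/robots-txt-guide",
    "https://yourdomain.com/assets/site.css",
]

parser = RobotFileParser()
parser.parse(proposed_rules.splitlines())

blocked = [u for u in critical_urls if not parser.can_fetch("Googlebot", u)]
print(blocked)  # prints [] when every critical URL stays crawlable
```

Failing the deploy whenever `blocked` is non-empty catches path typos before they reach production, instead of discovering them in the Page indexing report days later.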

robots.txt best practices

  • Always include a Sitemap: directive pointing to your XML sitemap
  • Block admin, login, and internal tool paths from all crawlers
  • Do not block CSS, JavaScript, or font files — Google needs them to render your pages
  • Test any change against your critical URLs (e.g., with a robots.txt parser or GSC's URL Inspection tool) before deploying to production
  • Remove development Disallow: / rules before launch — set a deployment checklist item
  • Keep the file simple — complex robots.txt files with many conflicting rules cause unpredictable behavior

💡 Tip

Practice this in the game: Chapter 1-1 (The Silent Launch) puts you in the middle of a Disallow: / disaster — a 2,000-product e-commerce store invisible to Google because of one line in robots.txt.

Learn this by doing — not just reading.

SEOdisaster.com teaches SEO through interactive disaster scenarios. Put these concepts into practice in the game.

Play Free →