Robots.txt

robots.txt is a plain-text file at the root of your domain that tells crawlers which URLs they're allowed to visit. It's one of the highest-impact, lowest-effort technical files on your site. Get it right and you direct crawl budget efficiently. Get it wrong and you accidentally block crawling of the entire site. This page walks through what robots.txt does, what it doesn't do, the common mistakes that wipe out sites, and the decision tree for when to use robots.txt vs noindex vs canonical.

The one thing robots.txt gets mistaken for

People think robots.txt prevents pages from showing up in search. It doesn't. A URL blocked by robots.txt can still appear in Google's index if external sites link to it. Google just can't see the content, so the listing in search will be sparse ("no info available").

robots.txt is about crawling, not indexing. Getting this distinction wrong is the single biggest source of robots.txt-related SEO disasters.
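If the goal is to keep a page out of search results, the tool is a noindex directive, not robots.txt, and the page must stay crawlable so Google can see it. Both standard forms, first in the HTML head:

<meta name="robots" content="noindex">

or, for non-HTML resources, as an HTTP response header:

X-Robots-Tag: noindex

Google has to fetch the page to read either one, which is exactly why pairing noindex with a robots.txt block backfires.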

Where it lives

Always at https://yourdomain.com/robots.txt. Not at any other path; bots only check this one location. It's also per-origin: every subdomain and protocol needs its own file, so https://blog.yourdomain.com/robots.txt is a separate file with separate rules.

Basic syntax

User-agent: *
Disallow: /private/
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml

User-agent names the crawler a group of rules applies to (* means all of them). Disallow and Allow take URL-path prefixes. Sitemap is a standalone line, independent of any group, that points crawlers at your sitemap.
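To make the semantics concrete, here's how Python's standard-library parser reads those exact rules. This is a sketch for illustration, with placeholder URLs; note that urllib.robotparser implements the original standard and won't understand the wildcard extensions shown later on this page.

from urllib.robotparser import RobotFileParser

# Parse the example rules directly, without fetching anything.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Disallowed prefix: blocked for every crawler.
print(parser.can_fetch("*", "https://yourdomain.com/private/report.html"))  # False
# Allowed prefix: crawlable.
print(parser.can_fetch("*", "https://yourdomain.com/public/page.html"))     # True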

What robots.txt does and doesn't do

What it does

Tells well-behaved bots what to skip. Malicious bots and most scrapers ignore it. It's a crawl directive, not a security mechanism.

What it doesn't do

It doesn't keep URLs out of the index (external links can still get a blocked URL indexed), it doesn't hide content from people (the file is public and anyone can read it), and it doesn't protect anything sensitive.

Common patterns

Allow everything (the default)

User-agent: *
Allow: /

You actually don't need a file for this. If robots.txt doesn't exist, bots assume everything is allowed.

Block admin and dev paths

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/

Remember that robots.txt is public. Listing /admin/ here tells curious humans exactly where to look; it keeps well-behaved bots out and does nothing else. Anything sensitive needs real access control.

Block parameter-heavy URLs (be careful)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
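The caution is warranted because wildcard matching is literal about everything except the * itself. (* and $ are extensions honored by Google, Bing, and other major crawlers, not part of the original standard.) Under Google-style matching, the rules above behave like this:

/products?sort=price             blocked
/products?sort=price&color=red   blocked
/products?color=red&sort=price   crawlable (the literal text "?sort=" never appears; here sort follows "&")
/products                        crawlable

To catch a parameter in any position you'd need Disallow: /*sort=, at the cost of also blocking any URL that happens to contain "sort=" anywhere, such as ?resort=beach.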

Allow specific bots special access

User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/

Note the repeated Disallow under Googlebot. Groups don't stack: a crawler follows only the most specific group that names it, so Googlebot ignores the * group entirely. Without its own Disallow line, Googlebot could crawl all of /api/. And when an Allow and a Disallow both match a URL, the longer (more specific) path wins, which is why /api/public/ stays crawlable for Googlebot.

Common mistakes that wipe out sites

The classic is Disallow: /. One character of path, entire site blocked. It's most often left over from a staging environment accidentally deployed to production. The fix is fast (remove the line), but recovery in rankings takes weeks.

The subtler mistake follows from the crawl-vs-index distinction above: blocking a page in robots.txt and giving it a noindex tag. Google can't crawl the page, so it never sees the noindex, and the URL can stay in the index indefinitely.
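Because the staging-leftover failure mode is so common, it's worth automating the check at deploy time. A minimal sketch in Python; the URL is a placeholder, and the string match is deliberately naive (it doesn't track which user-agent group a rule belongs to), but it catches the classic case:

import sys
import urllib.error
import urllib.request

ROBOTS_URL = "https://yourdomain.com/robots.txt"  # placeholder: your production URL

def blocks_everything(url: str) -> bool:
    """Return True if the live file contains a bare "Disallow: /" rule."""
    try:
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return False  # no robots.txt at all means allow-all, which is safe
    for line in body.splitlines():
        rule = line.split("#", 1)[0].strip().lower()  # drop comments, normalize
        if rule.replace(" ", "") == "disallow:/":
            return True
    return False

if __name__ == "__main__":
    if blocks_everything(ROBOTS_URL):
        sys.exit("robots.txt disallows the entire site; failing the deploy")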

Testing robots.txt
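Google Search Console has a robots.txt report (under Settings) that shows the file Googlebot last fetched, when it fetched it, and any parse errors; treat that as the ground truth for how Google reads your rules. For a quick local spot-check, here's a sketch using Python's standard library. yourdomain.com and the paths are placeholders, and note again that urllib.robotparser ignores wildcard extensions, so verify wildcard rules in Search Console instead.

from urllib.robotparser import RobotFileParser

# Fetch and parse the live file (assumes the site is reachable).
parser = RobotFileParser()
parser.set_url("https://yourdomain.com/robots.txt")
parser.read()

# Spot-check the URLs you care about most.
for url in [
    "https://yourdomain.com/",
    "https://yourdomain.com/admin/",
    "https://yourdomain.com/public/page.html",
]:
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)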

The decision tree: robots.txt vs noindex vs canonical

Use robots.txt when the problem is crawling: bot traffic on sections with no search value, infinite URL spaces from filters and parameters, admin and staging paths. It saves crawl budget; it does not remove anything from search.

Use noindex when the problem is indexing: the page should not appear in search results. The page must stay crawlable, because Google has to fetch it to see the directive.

Use a canonical tag when the problem is duplication: the same content lives at several URLs and you want one of them to rank. All versions stay crawlable and indexable; the canonical consolidates the signals.

What to do with this

Open yourdomain.com/robots.txt right now and read every line. Is there anything you don't recognize? Does anything say Disallow: /? Any modern site should have a minimal, intentional robots.txt. If yours is long or weird, clean it up.

Next: canonical tags, the tool that tells Google which version of a duplicate URL should rank.
