Robots.txt

robots.txt is a plain-text file at the root of your domain that tells crawlers which URLs they're allowed to visit. It's one of the highest-impact, lowest-effort technical files on your site. Get it right and you direct crawl budget efficiently. Get it wrong and you accidentally block crawling of the entire site. This page walks through what robots.txt does, what it doesn't do, the common mistakes that wipe out sites, and the decision tree for when to use robots.txt vs noindex vs canonical.

The one thing robots.txt gets mistaken for

People think robots.txt prevents pages from showing up in search. It doesn't. A URL blocked by robots.txt can still appear in Google's index if external sites link to it. Google just can't see the content, so the listing in search will be sparse ("no info available").

robots.txt is about crawling, not indexing. Getting this distinction wrong is the single biggest source of robots.txt-related SEO disasters.
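If the goal is to keep a page out of search results, the tool is a noindex directive, not robots.txt, and the page must stay crawlable so Google can see it. Both standard forms, first in the HTML head:

<meta name="robots" content="noindex">

or, for non-HTML resources, as an HTTP response header:

X-Robots-Tag: noindex

Google has to fetch the page to read either one, which is exactly why pairing noindex with a robots.txt block backfires.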

Where it lives

Always at https://yourdomain.com/robots.txt. Not at any other path; bots only check this one location. It's also per-origin: every subdomain and protocol needs its own file, so https://blog.yourdomain.com/robots.txt is a separate file with separate rules.

Basic syntax

User-agent: *
Disallow: /private/
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml

User-agent names the crawler a group of rules applies to (* means all of them). Disallow and Allow take URL-path prefixes. Sitemap is a standalone line, independent of any group, that points crawlers at your sitemap.
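To make the semantics concrete, here's how Python's standard-library parser reads those exact rules. This is a sketch for illustration, with placeholder URLs; note that urllib.robotparser implements the original standard and won't understand the wildcard extensions shown later on this page.

from urllib.robotparser import RobotFileParser

# Parse the example rules directly, without fetching anything.
rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Disallowed prefix: blocked for every crawler.
print(parser.can_fetch("*", "https://yourdomain.com/private/report.html"))  # False
# Allowed prefix: crawlable.
print(parser.can_fetch("*", "https://yourdomain.com/public/page.html"))     # True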

What robots.txt does and doesn't do

What it does

Tells well-behaved bots what to skip. Malicious bots and most scrapers ignore it. It's a crawl directive, not a security mechanism.

What it doesn't do

It doesn't keep URLs out of the index (external links can still get a blocked URL indexed), it doesn't hide content from people (the file is public and anyone can read it), and it doesn't protect anything sensitive.

Common patterns

Allow everything (the default)

User-agent: *
Allow: /

You actually don't need a file for this. If robots.txt doesn't exist, bots assume everything is allowed.

Block admin and dev paths

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/

Remember that robots.txt is public. Listing /admin/ here tells curious humans exactly where to look; it keeps well-behaved bots out and does nothing else. Anything sensitive needs real access control.

Block parameter-heavy URLs (be careful)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
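The caution is warranted because wildcard matching is literal about everything except the * itself. (* and $ are extensions honored by Google, Bing, and other major crawlers, not part of the original standard.) Under Google-style matching, the rules above behave like this:

/products?sort=price             blocked
/products?sort=price&color=red   blocked
/products?color=red&sort=price   crawlable (the literal text "?sort=" never appears; here sort follows "&")
/products                        crawlable

To catch a parameter in any position you'd need Disallow: /*sort=, at the cost of also blocking any URL that happens to contain "sort=" anywhere, such as ?resort=beach.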

Allow specific bots special access

User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/

Note the repeated Disallow under Googlebot. Groups don't stack: a crawler follows only the most specific group that names it, so Googlebot ignores the * group entirely. Without its own Disallow line, Googlebot could crawl all of /api/. And when an Allow and a Disallow both match a URL, the longer (more specific) path wins, which is why /api/public/ stays crawlable for Googlebot.

Common mistakes that wipe out sites

The classic is Disallow: /. One character of path, entire site blocked. It's most often left over from a staging environment accidentally deployed to production. The fix is fast (remove the line), but recovery in rankings takes weeks.

The subtler mistake follows from the crawl-vs-index distinction above: blocking a page in robots.txt and giving it a noindex tag. Google can't crawl the page, so it never sees the noindex, and the URL can stay in the index indefinitely.
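Because the staging-leftover failure mode is so common, it's worth automating the check at deploy time. A minimal sketch in Python; the URL is a placeholder, and the string match is deliberately naive (it doesn't track which user-agent group a rule belongs to), but it catches the classic case:

import sys
import urllib.error
import urllib.request

ROBOTS_URL = "https://yourdomain.com/robots.txt"  # placeholder: your production URL

def blocks_everything(url: str) -> bool:
    """Return True if the live file contains a bare "Disallow: /" rule."""
    try:
        with urllib.request.urlopen(url) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError:
        return False  # no robots.txt at all means allow-all, which is safe
    for line in body.splitlines():
        rule = line.split("#", 1)[0].strip().lower()  # drop comments, normalize
        if rule.replace(" ", "") == "disallow:/":
            return True
    return False

if __name__ == "__main__":
    if blocks_everything(ROBOTS_URL):
        sys.exit("robots.txt disallows the entire site; failing the deploy")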

Testing robots.txt
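Google Search Console has a robots.txt report (under Settings) that shows the file Googlebot last fetched, when it fetched it, and any parse errors; treat that as the ground truth for how Google reads your rules. For a quick local spot-check, here's a sketch using Python's standard library. yourdomain.com and the paths are placeholders, and note again that urllib.robotparser ignores wildcard extensions, so verify wildcard rules in Search Console instead.

from urllib.robotparser import RobotFileParser

# Fetch and parse the live file (assumes the site is reachable).
parser = RobotFileParser()
parser.set_url("https://yourdomain.com/robots.txt")
parser.read()

# Spot-check the URLs you care about most.
for url in [
    "https://yourdomain.com/",
    "https://yourdomain.com/admin/",
    "https://yourdomain.com/public/page.html",
]:
    verdict = "crawlable" if parser.can_fetch("Googlebot", url) else "blocked"
    print(url, "->", verdict)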

The decision tree: robots.txt vs noindex vs canonical

Use robots.txt when the problem is crawling: bot traffic on sections with no search value, infinite URL spaces from filters and parameters, admin and staging paths. It saves crawl budget; it does not remove anything from search.

Use noindex when the problem is indexing: the page should not appear in search results. The page must stay crawlable, because Google has to fetch it to see the directive.

Use a canonical tag when the problem is duplication: the same content lives at several URLs and you want one of them to rank. All versions stay crawlable and indexable; the canonical consolidates the signals.

What to do with this

Open yourdomain.com/robots.txt right now and read every line. Is there anything you don't recognize? Does anything say Disallow: /? Any modern site should have a minimal, intentional robots.txt. If yours is long or weird, clean it up.

Next: canonical tags, the tool that tells Google which version of a duplicate URL should rank.
