Robots.txt

robots.txt is a plain-text file at the root of your domain that tells crawlers which URLs they're allowed to request. Get it right and you steer crawl budget toward the pages that matter; get it wrong and you can accidentally block crawlers from your entire site.

Where it lives

Always at https://yourdomain.com/robots.txt, never at any other path. Crawlers ignore robots.txt files in subdirectories, and the file applies per origin (scheme + host + port), so each subdomain and protocol needs its own copy: https://blog.yourdomain.com/robots.txt is a separate file from the one above.
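As a sketch of that per-origin rule, the robots.txt location can be derived from any page URL by keeping only the scheme and host and forcing the path to /robots.txt. The helper name below is hypothetical; it uses only Python's standard urllib.parse:

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url: str) -> str:
    # robots.txt applies per origin (scheme + host + port),
    # and always lives at the fixed path /robots.txt.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://shop.example.com/products?id=1"))
# → https://shop.example.com/robots.txt
```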

Basic syntax

User-agent: *
Disallow: /private/
Allow: /public/

Sitemap: https://yourdomain.com/sitemap.xml
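You can check how rules like these behave with Python's standard-library urllib.robotparser. The rules string below repeats the example above; the page URLs are illustrative:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Disallow/Allow rules are prefix matches against the URL path.
print(rp.can_fetch("*", "https://yourdomain.com/private/page"))  # False
print(rp.can_fetch("*", "https://yourdomain.com/public/page"))   # True
```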

What it does

Tells well-behaved bots which URLs to skip, so crawl budget goes to the pages you actually want crawled.

What it doesn't do

It doesn't secure anything: it's a crawl directive, not an access control, and malicious bots and most scrapers ignore it. It also doesn't reliably remove pages from search results: a URL blocked in robots.txt can still be indexed from external links, just without its content.

Common patterns

Allow everything

User-agent: *
Allow: /

(This is the default behavior: if no robots.txt exists, crawlers assume everything is allowed, so strictly speaking you don't need a file at all.)

Block admin/development paths

User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/

Block parameter-heavy URLs (cautiously)

User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
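One reason for caution: * and $ wildcards are not part of the original 1994 convention, and simpler parsers (Python's urllib.robotparser among them) do plain prefix matching, though major crawlers such as Googlebot do support them. As a rough sketch of how a wildcard-aware crawler interprets such a rule, this hypothetical matcher translates patterns into regular expressions:

```python
import re

def rule_to_regex(pattern: str) -> re.Pattern:
    # Wildcard-style matching: '*' matches any character sequence,
    # '$' anchors the end of the URL; everything else is literal.
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += ".*"
        elif ch == "$":
            regex += "$"
        else:
            regex += re.escape(ch)
    return re.compile(regex)

blocked = [rule_to_regex(p) for p in ("/*?sort=", "/*?filter=")]

def is_disallowed(path: str) -> bool:
    # Rules match as prefixes unless anchored with '$',
    # so re.match (anchored at the start only) is enough.
    return any(r.match(path) for r in blocked)

print(is_disallowed("/shop?sort=price"))  # True
print(is_disallowed("/shop/item/42"))     # False
```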

Allow specific bots special access

User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
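Group selection is all-or-nothing: a crawler picks the single most specific User-agent group that matches it and ignores the rest. The Googlebot group above therefore replaces the * rules for Googlebot rather than extending them, which you can observe with urllib.robotparser (URLs illustrative):

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# Other bots fall through to the '*' group.
print(rp.can_fetch("somebot", "https://yourdomain.com/api/data"))        # False
# Googlebot uses only its own group; /api/public/ is explicitly allowed...
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/public/x"))  # True
# ...and anything its group doesn't mention is allowed by default.
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/data"))      # True
```

If you want Googlebot to keep the general /api/ block, repeat Disallow: /api/ inside its own group alongside the Allow line.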

Common mistakes

Shipping a staging robots.txt (Disallow: /) to production and blocking the entire site.

Listing sensitive paths to "hide" them: robots.txt is public, so it advertises exactly the URLs you don't want found.

Combining a Disallow rule with a noindex tag on the same page: if crawlers can't fetch the page, they never see the noindex.

Forgetting the trailing slash: Disallow: /private is a prefix match and blocks /private-offers too, while Disallow: /private/ does not.

Testing
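One pragmatic approach (a sketch; the rules and URLs below are placeholders for your own): run candidate rules against URLs you expect to be crawlable and URLs you expect to be blocked, before the file ever reaches production. Python's standard urllib.robotparser is enough for non-wildcard rules:

```python
from urllib.robotparser import RobotFileParser

# Candidate rules, e.g. read from the file about to be deployed.
rules = """\
User-agent: *
Disallow: /admin/
Disallow: /staging/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

urls = [
    "https://yourdomain.com/",
    "https://yourdomain.com/blog/post",
    "https://yourdomain.com/admin/login",
]
for url in urls:
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(f"{url} -> {verdict}")
```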

When to use robots.txt vs noindex vs canonical

robots.txt controls crawling: use it to stop bots from fetching URLs at all. noindex controls indexing: the page is fetched but kept out of search results, so it must stay crawlable or the directive is never seen. A canonical tag handles duplicates: the page stays crawlable and indexable, but you point search engines at the preferred version.