Robots.txt
Updated 2026-04-18
robots.txt is a plain-text file at the root of your domain that tells crawlers which URLs they're allowed to visit. Get it right and you direct crawl budget; get it wrong and you can accidentally de-index your site.
Where it lives
Always at https://yourdomain.com/robots.txt, the root of the host. Crawlers check only that exact path, and each subdomain needs its own file.
Basic syntax
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://yourdomain.com/sitemap.xml
- User-agent: which bot the following rules apply to. * means all bots; the most specific matching group (e.g. User-agent: Googlebot) replaces the general one for that bot.
- Disallow: a path the bot should not crawl.
- Allow: a path the bot may crawl, overriding a broader Disallow.
- Sitemap: the full URL of your sitemap file.
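As a quick sanity check, Python's standard-library urllib.robotparser can evaluate rules like the ones above (it implements the original prefix-matching spec, which is enough here since no wildcards are involved; the paths are placeholders):

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("*", "/private/page"))  # False: matches Disallow
print(rp.can_fetch("*", "/public/page"))   # True: matches Allow
print(rp.can_fetch("*", "/other"))         # True: no rule matches, crawling is allowed by default
```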
What it does
Tells well-behaved bots what to skip. Malicious bots and most scrapers ignore it. It's a crawl directive, not a security mechanism.
What it doesn't do
- It doesn't prevent indexing. A URL blocked by robots.txt can still be indexed if discovered via external links; Google just can't see the content. To prevent indexing, use a noindex meta tag (<meta name="robots" content="noindex">) or the X-Robots-Tag HTTP header, which requires the URL to be crawlable so Google can read the directive.
- It doesn't hide URLs. robots.txt is public; anyone can visit it. Don't list "secret" URLs in it: a disallow list is a roadmap for people looking for things to poke at.
Common patterns
Allow everything
User-agent: *
Allow: /
(This is the default; with no robots.txt file at all, everything is crawlable.)
Block admin/development paths
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/
Block parameter-heavy URLs (cautiously)
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
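The * wildcard is a Google extension (later standardized in RFC 9309) rather than part of the original 1994 protocol, and Python's stdlib urllib.robotparser does not implement it. A minimal sketch of the matching semantics, where a rule matches if its pattern matches a prefix of the URL path (rule_matches is a hypothetical helper for illustration, not a library function):

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style robots.txt matching: '*' matches any run of
    characters, a trailing '$' anchors the end of the URL, and
    otherwise the pattern is a prefix match on the path."""
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"  # trailing $ becomes an end anchor
    return re.match(regex, path) is not None

print(rule_matches("/*?sort=", "/products?sort=price"))  # True
print(rule_matches("/*?sort=", "/products"))             # False
print(rule_matches("/*.pdf$", "/docs/manual.pdf"))       # True
print(rule_matches("/*.pdf$", "/docs/manual.pdf?x=1"))   # False: $ anchors the end
```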
Allow specific bots special access
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
(Groups don't merge: a bot obeys only the single most specific group that matches it, so the Googlebot group must repeat the general Disallow.)
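The group-selection behavior can be checked with the stdlib parser. One caveat: urllib.robotparser evaluates rules in file order (first match wins) rather than Google's longest-match rule, so this sketch lists Allow before Disallow in the Googlebot group to get the same result under both semantics; the Disallow is repeated there because groups don't merge:

```python
from urllib.robotparser import RobotFileParser

RULES = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(RULES.splitlines())

print(rp.can_fetch("SomeOtherBot", "/api/public/data"))  # False: falls back to the * group
print(rp.can_fetch("Googlebot", "/api/public/data"))     # True: Allow applies
print(rp.can_fetch("Googlebot", "/api/internal"))        # False: Disallow applies
```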
Common mistakes
- Disallow: / blocks the entire site from crawling. Common on staging/dev sites; catastrophic if accidentally deployed to production.
- Blocking CSS/JS. Googlebot needs these to render your pages properly. Since ~2015, blocking CSS/JS has been explicitly discouraged.
- Disallow + noindex. If you Disallow a URL, Google can't crawl it, so it never sees the noindex tag and may still index the URL based on inbound links. To de-index, allow crawling and add noindex.
- Trailing slash inconsistency. /admin and /admin/ are different rules: Disallow: /admin also blocks /admin/ and /administrator (prefix match), while Disallow: /admin/ blocks only paths under the directory. Be consistent.
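A quick way to see the trailing-slash difference, using Python's stdlib urllib.robotparser (which, like the original spec, does plain prefix matching):

```python
from urllib.robotparser import RobotFileParser

def parser_for(rules: str) -> RobotFileParser:
    rp = RobotFileParser()
    rp.parse(rules.splitlines())
    return rp

rp_no_slash = parser_for("User-agent: *\nDisallow: /admin\n")
rp_slash = parser_for("User-agent: *\nDisallow: /admin/\n")

print(rp_no_slash.can_fetch("*", "/administrator"))  # False: prefix match catches it
print(rp_slash.can_fetch("*", "/administrator"))     # True
print(rp_slash.can_fetch("*", "/admin"))             # True: /admin itself isn't under /admin/
print(rp_slash.can_fetch("*", "/admin/users"))       # False
```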
Testing
- Google Search Console's robots.txt report (Settings → robots.txt): see how Google fetched and parsed your file. (The standalone robots.txt Tester was retired in 2023.)
- curl -A "Googlebot" https://yourdomain.com/robots.txt: fetch the file with Google's user-agent string to confirm the server serves the same rules to Googlebot.
- After any change: re-check Search Console for coverage issues
When to use robots.txt vs noindex vs canonical
- Don't want it crawled and don't care about indexing: robots.txt disallow
- Want it crawled but NOT indexed: meta noindex
- Duplicate content, want to indicate the primary: canonical tag
- Don't want it to exist at all: 410 response or remove + let it 404