robots.txt is a plain-text file at the root of your domain that tells crawlers which URLs they're allowed to visit. It's one of the highest-impact, lowest-effort technical files on your site. Get it right and you direct crawl budget efficiently. Get it wrong and you accidentally de-index the entire site. This page walks through what robots.txt does, what it doesn't do, the common mistakes that wipe out sites, and the decision tree for when to use robots.txt vs noindex vs canonical.
People think robots.txt prevents pages from showing up in search. It doesn't. A URL blocked by robots.txt can still appear in Google's index if external sites link to it. Google just can't see the content, so the listing in search will be sparse ("no info available").
robots.txt is about crawling, not indexing. Getting this distinction wrong is the single biggest source of robots.txt-related SEO disasters.
The file must live at https://yourdomain.com/robots.txt and nowhere else. Bots check only that one location.
```
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://yourdomain.com/sitemap.xml
```
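If you want to sanity-check rules like these before a crawler does, Python's standard-library `urllib.robotparser` can parse them and answer allow/deny questions. The domain and the `SomeBot` name below are placeholders:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /public/
Sitemap: https://yourdomain.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # against a live site: rp.set_url(...) then rp.read()

# "SomeBot" has no group of its own, so the * group applies.
print(rp.can_fetch("SomeBot", "https://yourdomain.com/public/page"))   # True
print(rp.can_fetch("SomeBot", "https://yourdomain.com/private/page"))  # False
print(rp.site_maps())  # ['https://yourdomain.com/sitemap.xml']  (Python 3.8+)
```

This is handy in CI: a test that asserts your key landing pages are fetchable will catch a bad robots.txt before it deploys.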
* means all bots; a group for a specific bot (Googlebot, Bingbot) overrides the * group for that bot.

To allow everything:

```
User-agent: *
Allow: /
```

You don't actually need a file for this. If robots.txt doesn't exist, bots assume everything is allowed.
To keep crawlers out of admin and staging areas:

```
User-agent: *
Disallow: /admin/
Disallow: /wp-admin/
Disallow: /staging/
```
To stop crawl budget being wasted on sorted and filtered URL variants (the * wildcard is a widely supported extension to the original convention):

```
User-agent: *
Disallow: /*?sort=
Disallow: /*?filter=
```
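The * wildcard isn't part of the original 1994 convention; it's an extension that Google and Bing honor and that RFC 9309 later standardized ($ anchors the end of the path). Python's `urllib.robotparser` does plain prefix matching and ignores wildcards, so as a sketch of how the extended matching works, `rule_matches` below is a hypothetical helper, not a real API:

```python
import re

def rule_matches(pattern: str, path: str) -> bool:
    """Google-style rule matching: '*' matches any run of characters,
    '$' anchors the end of the path; everything else is literal."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\$", "$")
    return re.match(regex, path) is not None

print(rule_matches("/*?sort=", "/products?sort=price"))  # True
print(rule_matches("/*?sort=", "/products"))             # False
print(rule_matches("/*.pdf$", "/files/report.pdf"))      # True
```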
To block an API for all bots while letting Googlebot into its public part. Note that a group for a named bot replaces the * group entirely for that bot, so Googlebot needs its own Disallow line:

```
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
```
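The stdlib parser can confirm this per-bot behavior too. One caveat: `urllib.robotparser` applies Allow/Disallow rules in file order (first match wins), while Google uses longest-path matching, so the Allow line is listed first here to get the same result from both:

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /api/

User-agent: Googlebot
Allow: /api/public/
Disallow: /api/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("SomeBot", "https://yourdomain.com/api/data"))           # False
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/public/data"))  # True
print(rp.can_fetch("Googlebot", "https://yourdomain.com/api/internal"))     # False
```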
The Disallow: / mistake is catastrophic. It's most often left over from staging environments accidentally deployed to production. The fix is fast (remove the line), but recovery in rankings takes weeks.
Open yourdomain.com/robots.txt right now and read every line. Is there anything you don't recognize? Does anything say Disallow: /? A modern site should have a minimal, intentional robots.txt; if yours is long or weird, clean it up.
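That check is easy to automate. `find_sitewide_blocks` below is a hypothetical helper (not a standard API) that flags any bare Disallow: / in a robots.txt file you've already fetched:

```python
def find_sitewide_blocks(robots_txt: str) -> list[str]:
    """Return any directives that block the whole site (a bare 'Disallow: /')."""
    hits = []
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0]          # robots.txt comments start with '#'
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip() == "/":
            hits.append(line.strip())
    return hits

# A leftover staging config is the classic way this reaches production:
staging = "User-agent: *\nDisallow: /  # blocks everything!"
print(find_sitewide_blocks(staging))  # ['Disallow: /']
```

Wire this into a deploy check and the staging-file mistake never reaches production.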
Three more things worth knowing. First, robots.txt tells well-behaved bots what to skip; malicious bots and most scrapers ignore it entirely. It's a crawl directive, not a security mechanism, so never rely on it to hide sensitive content. Second, if your goal is keeping a page out of the index, use a noindex meta tag, which requires the URL to be crawlable so Google can read the tag; blocking that URL in robots.txt defeats the noindex. Third, watch trailing slashes: /admin and /admin/ are different paths, so be consistent.

Next: canonical tags, the tool that tells Google which version of a duplicate URL should rank.