Crawling + indexability
📖 8 min read · Updated 2026-04-19
The first two questions any technical SEO audit has to answer. Can Google find my pages? Can it index them? Until both answers are "yes," every other SEO effort is wasted. You can write the greatest page in the world. If Googlebot can't reach it, or reaches it but declines to add it to the index, it doesn't exist for search purposes. This page walks through what actually blocks crawling, what blocks indexing, and how to debug a page that should be ranking but isn't.
The mental model
Crawling and indexing are two different gates your page has to pass. Both have to succeed. Either can fail silently.
Crawling is reaching the page. Indexing is Google deciding to add it to its searchable database. A page can be crawled and then not indexed. It can be indexable in theory but never crawled. Separate the two in your head and you'll troubleshoot faster.
The full pipeline: discover → crawl → render → index → rank. A page can fall out at any stage, and the first two stages are where technical SEO problems live.
Crawling: what blocks it
- robots.txt disallow. Your site explicitly tells bots not to enter. Often accidental.
- Authentication walls. Page requires login. Bots can't sign in.
- Orphan pages. No internal or external links point to the URL. Bots have no path.
- JavaScript-only navigation. Menu only appears after JS execution. Bots may miss it.
- Server issues. Slow response times, rate limits, errors. Bots give up.
- Crawl budget exhaustion. On large sites, bots run out of budget before reaching all pages.
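The first blocker on the list is also the easiest to check programmatically. A minimal sketch, using only the Python standard library, of testing whether a URL is disallowed for Googlebot; the robots.txt body here is a made-up example, not any real site's file:

```python
# Check whether a URL is blocked by robots.txt rules.
# The robots.txt content below is a hypothetical example.
from urllib import robotparser

robots_txt = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

parser = robotparser.RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot falls back to the "*" group when no Googlebot-specific
# group exists, which is what this file has.
print(parser.can_fetch("Googlebot", "https://example.com/blog/post"))   # True
print(parser.can_fetch("Googlebot", "https://example.com/admin/tools")) # False
```

Running this against your live robots.txt (fetch it, then `parse()` the lines) catches the "accidental disallow" case before Google does.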
Indexability: what blocks it
- noindex meta tag. Your page literally tells Google not to index.
- Canonical tag pointing elsewhere. Your page says "another URL is primary, don't index me."
- Duplicate content without a clear canonical. Google picks one and ignores the rest.
- Thin content. Google may decline to index pages it considers too shallow to be useful.
- HTTP errors. 404, 500, 503. Page doesn't exist or failed to render.
- Soft 404. Page returns 200 OK but content is effectively "not found." Google detects and skips.
- Low site quality. If the overall site is low-trust, Google skips marginal pages.
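The first two indexability blockers, noindex tags and misdirected canonicals, can be detected by scanning the page's HTML. A minimal sketch with the standard-library parser; the sample HTML is illustrative:

```python
# Scan HTML for a noindex robots meta tag and a canonical link
# pointing at a different URL. Standard library only.
from html.parser import HTMLParser

class IndexabilityCheck(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            if "noindex" in attrs.get("content", "").lower():
                self.noindex = True
        if tag == "link" and attrs.get("rel", "").lower() == "canonical":
            self.canonical = attrs.get("href")

html = """<html><head>
<meta name="robots" content="noindex, follow">
<link rel="canonical" href="https://example.com/other-page">
</head><body>...</body></html>"""

check = IndexabilityCheck()
check.feed(html)
print(check.noindex)    # True
print(check.canonical)  # https://example.com/other-page
```

Note this only covers the HTML; a noindex can also arrive via the `X-Robots-Tag` HTTP header, so check response headers too.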
Debugging: is my page indexed?
Step-by-step debug process
- Search `site:yourdomain.com/your-url` in Google. If it appears, it's indexed. If not, keep going.
- Open Search Console. Paste the URL into URL Inspection. Check coverage status.
- Match the status against the crawling and indexability blockers listed above
- Fix the underlying cause (not the symptom)
- Click "Request Indexing" once the cause is fixed
- Wait a few days. Recheck.
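The "match the status to a cause" step can be sketched as a small rule table. A hypothetical helper, with illustrative messages rather than Search Console's exact status strings; the checks run in blocker order, crawling before indexing:

```python
# Map observed signals to the most likely blocker, checked in the
# order the pipeline fails: crawl access, server response, then
# index directives. Function and messages are illustrative.
def diagnose(status_code, blocked_by_robots, has_noindex, canonical_elsewhere):
    if blocked_by_robots:
        return "crawling blocked: remove the robots.txt disallow"
    if status_code >= 400:
        return f"HTTP {status_code}: fix the server response first"
    if has_noindex:
        return "noindex tag: remove it if the page should rank"
    if canonical_elsewhere:
        return "canonical points elsewhere: fix it or accept consolidation"
    return "no technical blocker found: look at content quality"

print(diagnose(200, False, True, False))
# noindex tag: remove it if the page should rank
```

The ordering matters: a robots.txt block hides every downstream signal, which is why you fix the cause, not the symptom.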
Crawl budget, when it matters
Google allocates a finite crawl budget per site: roughly how many pages per day it will fetch. For sites under a few thousand pages, this is never a concern. Google has plenty of budget.
For sites over 10,000 pages, crawl budget matters. Poorly optimized crawl budget means important pages get crawled less often, take longer to update, and drift out of sync with what's actually on the site.
Optimizing crawl budget:
- Block low-value pages (tag archives, duplicate sort URLs) in robots.txt; a noindex alone still costs a crawl, since Google must fetch the page to see the tag
- Fix 404s and reduce redirect chains
- Keep sitemap clean and current
- Minimize low-value parameter URLs (`?sort=`, `?filter=`)
- Use pagination and facets intentionally, not blindly
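Flagging the parameter URLs in a crawl export is a one-liner's worth of work. A minimal sketch using the two parameters named above; extend `LOW_VALUE_PARAMS` to match your own site:

```python
# Flag URLs whose query string contains low-value parameters.
# The parameter set here is illustrative.
from urllib.parse import urlparse, parse_qs

LOW_VALUE_PARAMS = {"sort", "filter"}

def is_low_value(url):
    params = parse_qs(urlparse(url).query)
    return any(p in LOW_VALUE_PARAMS for p in params)

urls = [
    "https://example.com/shoes",
    "https://example.com/shoes?sort=price",
    "https://example.com/shoes?filter=red&page=2",
]
print([u for u in urls if is_low_value(u)])
# ['https://example.com/shoes?sort=price', 'https://example.com/shoes?filter=red&page=2']
```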
The tools
- Google Search Console. URL Inspection, Index Coverage report, Crawl Stats. Free and essential.
- Screaming Frog. Crawl your site the way Google would. Catches technical blockers.
- Sitebulb. Similar to Screaming Frog with better visualizations.
- Log file analyzers. See what Googlebot actually requests from your server, the only ground truth on crawl behavior.
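The core of what a log file analyzer does fits in a few lines: count bot hits per path. A minimal sketch over combined-format access log lines; the sample lines are made up, and a real analysis should verify the bot via reverse DNS, since the user-agent string is trivial to fake:

```python
# Count Googlebot requests per path in combined-format log lines.
# Sample log lines are fabricated for illustration.
from collections import Counter

log_lines = [
    '66.249.66.1 - - [19/Apr/2026:10:00:00 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '66.249.66.1 - - [19/Apr/2026:10:00:05 +0000] "GET /blog/post-1 HTTP/1.1" 200 5120 "-" "Googlebot/2.1"',
    '203.0.113.9 - - [19/Apr/2026:10:00:07 +0000] "GET /blog/post-2 HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
]

hits = Counter(
    line.split('"')[1].split()[1]   # request path from "GET /path HTTP/1.1"
    for line in log_lines
    if "Googlebot" in line
)
print(hits.most_common())  # [('/blog/post-1', 2)]
```

Pages Googlebot never requests are your orphan and crawl-budget candidates.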
The quick-win list
- Submit a fresh XML sitemap to Search Console
- Link from your homepage to every major category
- Fix or remove internal 404 links
- Audit for unintended noindex tags or wrong canonical tags
- Check robots.txt isn't blocking anything important
- Inspect your "Crawled - currently not indexed" list in Search Console and improve or delete those pages
What to do with this
Open Search Console right now. Go to the Pages report. Look at the "Why pages aren't indexed" breakdown. Every bucket there is an opportunity. Work through the largest bucket first. This is often the highest-leverage afternoon of SEO you'll have.
Next: XML sitemaps, the main way you tell Google which pages exist and matter.