Log analysis tools
5 min read · Updated 2026-04-18
Log file analysis shows what Googlebot actually crawls: how often, where it goes, and where it fails. For large sites, it's the only way to see crawl budget in action, and the tools below do the heavy lifting.
Why logs matter
Google Search Console (GSC) shows you what Google reports it crawls. Server logs show what actually happened. The two don't always match. Log analysis catches:
- Pages Googlebot never crawls (orphan, deep, or blocked)
- Pages Googlebot crawls too often (wasting budget)
- Response codes Googlebot hits (404s, 500s)
- Redirect chains
- Slow response times specifically to Googlebot
- User-agent spoofing (bots pretending to be Googlebot)
Major tools
Screaming Frog Log File Analyser
Desktop app (separate from the Screaming Frog crawler). Imports log files, joins with crawl data, visualizes.
- Pros: flat annual license (cheaper long-run than cloud platforms), full control, no data sent externally
- Cons: requires your own logs, manual import
- Price: ~$150/year license
OnCrawl
Cloud platform. Integrates log analysis with crawl data + Analytics + GSC.
- Pros: rich segmentation, cross-joins crawl + log + traffic data
- Cons: expensive, requires log upload
- Price: $$$, custom pricing
Botify
Enterprise platform. Similar to OnCrawl but aimed at very large sites.
- Pros: scale, deep analytics, real-time log ingestion options
- Cons: most expensive option
- Price: enterprise-only, $$$$
Semrush Log File Analyzer
Bundled with Semrush subscriptions. Basic log analysis.
- Pros: if you already have Semrush, no extra cost
- Cons: less deep than dedicated tools
Custom (Python / ELK stack)
For teams with engineering resources, custom pipelines using Python, Elasticsearch + Kibana, or BigQuery provide ultimate flexibility.
- Pros: unlimited customization, scales to any size
- Cons: engineering overhead, no out-of-box reports
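As a starting point for a custom pipeline, here is a minimal parser for the combined log format (the Nginx/Apache default). The regex assumes the stock format; custom log_format directives will need their own pattern:

```python
import re

# Combined log format: ip - - [time] "METHOD path proto" status bytes "referer" "ua"
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

def googlebot_hits(lines):
    """Yield parsed request dicts for lines whose user-agent claims to be Googlebot."""
    for line in lines:
        m = LINE_RE.match(line)
        if m and "Googlebot" in m.group("ua"):
            yield m.groupdict()
```

This filters on the user-agent string only, which spoofed bots can fake — pair it with the DNS verification described below before trusting the numbers.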
What to look for
Crawl frequency per URL type
For each section of the site (blog, category, product pages, legal), how often does Googlebot visit? High-value pages should be crawled often; low-value pages less.
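Given parsed log entries (dicts with at least a "path" key — an assumed shape from your own parsing pipeline, not a standard), per-section frequency is a simple count over the first path segment:

```python
from collections import Counter
from urllib.parse import urlsplit

def crawl_freq_by_section(hits):
    """Count hits per top-level path segment, e.g. '/blog/post' -> 'blog'."""
    counts = Counter()
    for hit in hits:
        path = urlsplit(hit["path"]).path  # drop any query string
        section = path.strip("/").split("/")[0] or "(root)"
        counts[section] += 1
    return counts
```

Segmenting on the first path segment is a rough heuristic; sites with flat URL structures will need their own mapping from URL to section.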
Crawl-to-index ratio
Of the URLs Googlebot crawls, how many end up indexed? A ratio below 50% points to quality issues.
404 rate from Googlebot
Percentage of Googlebot requests returning 404. Should be below 1%; anything higher means broken links to fix.
5xx rate
Should be near 0. Any consistent 5xx traffic indicates server reliability problems.
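Both rates fall out of the same parsed entries. A minimal sketch, assuming each entry carries the HTTP status as a string under a "status" key (a hypothetical shape for your pipeline):

```python
def status_rates(hits):
    """Fraction of Googlebot requests returning 404 and 5xx, as 0..1 floats."""
    total = len(hits)
    if total == 0:
        return {"404_rate": 0.0, "5xx_rate": 0.0}
    n404 = sum(1 for h in hits if h["status"] == "404")
    n5xx = sum(1 for h in hits if h["status"].startswith("5"))
    return {"404_rate": n404 / total, "5xx_rate": n5xx / total}
```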
Average response time to Googlebot
Should be under 500ms. Slow responses waste crawl budget, and rankings can suffer.
Redirect chains
Any URL where Googlebot follows 3+ redirects is a chain. Clean these up by pointing links and redirects straight at the final URL.
Orphan URLs in logs
URLs Googlebot crawls that aren't internally linked anymore. Often old URLs from a prior version of the site.
URLs never crawled
In your sitemap but Googlebot never visited. Crawl budget or discoverability issue.
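One way to surface these is to diff sitemap paths against the paths seen in your logs. A sketch using the standard sitemap XML namespace, with `crawled_paths` standing in for whatever set of paths your log parsing produced:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_paths(xml_text):
    """Extract URL paths from sitemap XML."""
    root = ET.fromstring(xml_text)
    return {urlsplit(loc.text.strip()).path
            for loc in root.iterfind("sm:url/sm:loc", SITEMAP_NS)}

def never_crawled(xml_text, crawled_paths):
    """Sitemap paths that never appear in the Googlebot log entries."""
    return sorted(sitemap_paths(xml_text) - set(crawled_paths))
```

For sitemap index files (sitemaps of sitemaps) you'd recurse one level first; the snippet assumes a plain urlset.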
Getting the logs
Depends on hosting:
- Direct hosting (Nginx/Apache): access logs at /var/log/nginx/access.log or similar
- Cloudflare: dashboard → Logs (Enterprise plan)
- AWS CloudFront: S3 logs or CloudWatch
- Vercel / Netlify: platform-specific log access
- Managed WordPress / shared hosting: check cPanel for "Raw Access Logs"
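However you obtain them, production logs are usually rotated and gzipped. A small helper can stream lines across both forms — the glob pattern is a placeholder for your own log path:

```python
import glob
import gzip

def iter_log_lines(pattern):
    """Stream lines across plain and gzip-rotated log files matching a glob."""
    for path in sorted(glob.glob(pattern)):
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as fh:
            yield from fh
```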
Verifying Googlebot
Not all "Googlebot" in logs is real. Verify via reverse DNS:
- Take IP from log
- Reverse DNS lookup → hostname should end with googlebot.com or google.com
- Forward DNS on that hostname → should match original IP
All of the tools above automate this check.
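The steps above can be sketched with the standard library alone: a pure suffix check on the hostname, plus a reverse-then-forward DNS round trip:

```python
import socket

def is_google_hostname(hostname):
    """True if a reverse-DNS hostname belongs to Google's crawler domains."""
    return hostname.rstrip(".").endswith((".googlebot.com", ".google.com"))

def verify_googlebot(ip):
    """Reverse-DNS the IP, check the domain, then forward-confirm the hostname."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)          # steps 1-2: reverse lookup
        if not is_google_hostname(hostname):
            return False
        return ip in socket.gethostbyname_ex(hostname)[2]  # step 3: forward-confirm
    except OSError:  # NXDOMAIN, timeouts, unreachable resolver
        return False
```

Note the suffix check requires a leading dot, so a spoofed hostname like fake-googlebot.com fails even though it contains "googlebot.com".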
Cadence
- Small sites: quarterly review sufficient
- Medium: monthly
- Enterprise: weekly or continuous dashboard
After major changes (migrations, redesigns), daily for 2-4 weeks.
When logs save you
Real scenarios where log analysis surfaced issues invisible elsewhere:
- Googlebot crawling the staging site in production after a deploy misconfiguration
- Bot traps (infinite parameter URLs) absorbing all crawl budget
- Parameter URLs being crawled 1000x more than canonical URLs
- Key pages not being crawled because of buried nav after a redesign
- 5xx spikes during specific hours (server capacity issues)