Log file analysis
4 min read · Updated 2026-04-18
Every request to your server is logged. Among those logs: every single Googlebot visit. Log analysis tells you what Google actually crawls (not what you think it crawls), where it spends time, what it skips, and where it hits errors.
When log analysis is worth it
- Sites >10k URLs (crawl budget matters)
- E-commerce sites (parameter handling issues)
- After a major migration or site restructure
- When Search Console shows unexplained indexing issues
- For enterprise SEO teams building data-driven strategies
What you can learn
- What Googlebot crawls daily, compared against your sitemap and total URL count
- Crawl frequency per page: high-value pages should be crawled often, low-value pages rarely
- Response codes Googlebot hits: 404s, 500s, slow responses
- Redirect chains: 301→301→301 sequences waste crawl budget
- URL parameters crawled unnecessarily: filter, sort, and session parameters
- Orphan pages: URLs Google found via your sitemap but that nothing on your site links to
- Bot verification: are requests claiming to be Googlebot actually Googlebot? (Many scrapers spoof the user-agent.)
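The first and sixth items above reduce to a set diff between your sitemap and the crawled URL set. A minimal sketch, assuming both inputs are already normalized URL paths:

```python
def crawl_coverage(sitemap_urls, crawled_urls):
    """Diff the sitemap against what Googlebot actually requested."""
    sitemap, crawled = set(sitemap_urls), set(crawled_urls)
    return {
        "covered": sitemap & crawled,        # in sitemap and crawled
        "never_crawled": sitemap - crawled,  # sitemap URLs Googlebot skips
        "off_sitemap": crawled - sitemap,    # parameter URLs, orphans, stale URLs
    }
```

The "off_sitemap" bucket is where parameter bloat and orphan pages show up; "never_crawled" is your crawl-coverage gap.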
Getting the logs
Depends on hosting:
- Cloudflare → Logs in the dashboard (Enterprise plan)
- AWS CloudFront → CloudWatch or S3 logs
- Nginx/Apache → standard access log files
- Shared hosting → cPanel often has raw logs
- CDN-fronted → your origin logs may not capture cached requests; check CDN logs
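Once you have raw Nginx/Apache logs, extracting the self-declared Googlebot requests is a regex over the combined log format. A sketch (the regex assumes the default combined format; adjust it if your log_format differs):

```python
import re

# Matches the default Nginx/Apache combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"'
)

def googlebot_hits(log_path):
    """Yield (path, status) for every line claiming a Googlebot user-agent.

    Note: this trusts the user-agent string; verify IPs separately.
    """
    with open(log_path) as f:
        for line in f:
            m = LOG_PATTERN.match(line)
            if m and "Googlebot" in m.group("agent"):
                yield m.group("path"), int(m.group("status"))
```

This only filters on the user-agent string, so it captures spoofers too; the verification step below separates real Googlebot from pretenders.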
Tools
- Screaming Frog Log File Analyser: desktop app; imports raw logs, filters by user-agent
- OnCrawl: SaaS; integrates log data with crawl data
- Botify: enterprise-grade log analysis plus crawl-data integration
- Custom with Python/ELK: for teams with engineering resources
Verifying Googlebot
Bad actors spoof the Googlebot user-agent. Verify by doing a reverse DNS lookup on the IP:
- Take the IP from the log
- Run nslookup on that IP; it should resolve to a hostname ending in *.googlebot.com or *.google.com
- Forward-lookup that hostname; it should resolve back to the original IP
If the reverse and forward lookups don't match, it's not real Googlebot. Most log analysis tools automate this check.
Key ratios to compute
- Crawl budget per URL type: for each section of the site (blog, category, products), how often does Googlebot visit?
- Crawl-to-index ratio: of all URLs crawled, how many end up indexed? Low ratio = quality issues.
- Response code distribution: >1% 404s or 5xxs on Googlebot traffic = problem.
- Average response time to Googlebot: slow = crawl budget wasted.
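Given verified Googlebot requests as (path, status, response_ms) tuples (field names are assumptions about your parsed log schema), the distribution and response-time ratios reduce to a few aggregations:

```python
from collections import Counter
from statistics import mean

def crawl_report(hits):
    """hits: iterable of (path, status, response_ms) for verified Googlebot requests."""
    hits = list(hits)
    total = len(hits)
    statuses = Counter(status for _, status, _ in hits)
    # 404s and 5xxs over 1% of bot traffic is the threshold flagged above
    error_rate = sum(n for s, n in statuses.items() if s == 404 or s >= 500) / total
    return {
        "total_requests": total,
        "unique_urls": len({path for path, _, _ in hits}),
        "status_distribution": dict(statuses),
        "error_rate": round(error_rate, 4),
        "avg_response_ms": round(mean(ms for _, _, ms in hits), 1),
    }
```

The crawl-to-index ratio needs a second data source (indexed URLs from Search Console), so it isn't computable from logs alone.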
Findings that matter
- Googlebot crawling stuff you don't care about: filter/sort/session parameter URLs. Handle with robots.txt or better URL hygiene.
- Googlebot NOT crawling stuff you do care about: low crawl frequency on money pages. Fix architecture or freshness signals.
- Crawl frequency drop on specific sections: can precede ranking drops. Treat it as an early warning.
- High 404 rate: fix the links or remove the pages cleanly.
- Slow response times to the bot: your LCP and crawl efficiency are both hurting.
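The first finding, parameter-URL waste, can be surfaced by counting Googlebot hits per query-parameter name. A sketch (parameter names in the example are illustrative):

```python
from collections import Counter
from urllib.parse import urlsplit, parse_qs

def parameter_waste(paths):
    """Count Googlebot hits per query-parameter name to spot
    filter/sort/session parameters eating crawl budget."""
    counts = Counter()
    for path in paths:
        query = urlsplit(path).query
        for param in parse_qs(query, keep_blank_values=True):
            counts[param] += 1
    return counts
```

Parameters with high counts but no indexing value are the first candidates for a robots.txt Disallow rule.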
Cadence
- Small site: quarterly log analysis
- Medium site: monthly
- Enterprise: weekly, often part of a dashboard
After big changes (migrations, restructures): daily for 2-4 weeks.