Every web page is roughly 10% content and 90% navigation, ads, menus, footers, cookie banners, and boilerplate. Naive HTML-to-text extraction feeds all of it into your index, polluting both retrieval and the embedding space. Clean web extraction is a discipline.
HTTP GET (or a headless browser for JS-heavy sites). Handle redirects, cookies, user agents, rate limits. Respect robots.txt when applicable.
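The fetch step can be sketched with nothing but the standard library. The user-agent string and timeout below are assumptions, and `is_allowed` takes an already-fetched robots.txt body so the policy check stays separable from any network call:

```python
import urllib.request
import urllib.robotparser

USER_AGENT = "example-rag-ingester/0.1"  # hypothetical UA; identify yourself honestly

def is_allowed(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a fetched robots.txt body against a target URL."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(user_agent, url)

def fetch(url: str, timeout: float = 15.0) -> str:
    """GET with an explicit user agent; urllib follows redirects by default."""
    req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        charset = resp.headers.get_content_charset() or "utf-8"
        return resp.read().decode(charset, errors="replace")
```

Cookies, retries, and per-domain rate limiting sit on top of this; the point is that the robots check and the fetch are two separate, testable pieces.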
If the site is an SPA (React, Vue, Angular), the raw HTML is nearly empty. You need a headless browser (Playwright, Puppeteer) to render the DOM before extraction, which is significantly slower and more expensive than static fetching.
Strip navigation, sidebars, footers, comments, ads. This is the make-or-break step.
Preserve structure: headings, paragraphs, lists, code blocks, tables.
Resolve relative URLs to absolute ones. Keep link targets as metadata if they matter for your use case.
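Resolving links against a base URL needs nothing beyond the standard library. A minimal sketch using `html.parser` and `urljoin` (the class and function names here are made up for illustration):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect <a href> targets, resolved against the page's base URL."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html: str, base_url: str) -> list[str]:
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links
```

The resolved URLs can then be stored as chunk metadata or used to drive the crawl frontier.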
Python libraries that heuristically extract "main content" from HTML. Trafilatura is my default: it has the best signal-to-noise ratio on general-purpose web content and handles blog posts, news articles, and documentation pages well.
Readability, the algorithm that powers browser reader modes. A good baseline that works well for article-style content.
Low-level HTML parsers. Use them when you know the structure and can write specific selectors (e.g., "every .docs-content div"). Precise but brittle: selectors break when sites redesign.
Managed services that return clean markdown from URLs. Jina Reader is free for light use, Firecrawl is the commercial leader for RAG-oriented scraping. Worth it when you have many sources and don't want to maintain extraction logic.
Naive converters. Avoid them for RAG: they preserve all the junk.
For RAG, convert HTML to Markdown rather than plain text. Markdown preserves structure (headings, lists, code, links) in a format that both chunking and embeddings handle well; flattening to plain text throws all of that structure away.
My stack: trafilatura or Firecrawl → Markdown → chunker. Skipping to plaintext drops too much signal.
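To see why Markdown keeps signal that plain text drops, here is a deliberately tiny HTML-to-Markdown sketch covering only headings, paragraphs, and list items. A real pipeline should use a mature converter; this just shows the structure surviving the conversion:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy converter: h1-h3 become #-headings, li becomes a bullet.
    Not production code -- inline markup, tables, and nesting are ignored."""
    def __init__(self):
        super().__init__()
        self.out: list[str] = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p", "li"):
            self.out.append("\n")
            self.prefix = ""

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.out.append(self.prefix + text)
            self.prefix = ""

def to_markdown(html: str) -> str:
    parser = MiniMarkdown()
    parser.feed(html)
    return "".join(self_out if (self_out := parser.out) is None else parser.out).strip() if False else "".join(parser.out).strip()
```

A heading that survives as `## Setup` still anchors a chunk boundary; the same heading flattened to plain text is just another line.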
Developer docs, knowledge bases, and product help centers usually have predictable structure, which makes targeted, selector-based extraction worthwhile for them.
Many sites have print versions, AMP versions, paginated archives, and tag pages that duplicate core content. Deduplicate aggressively during ingestion.
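A cheap first pass at dedup is hashing whitespace- and case-normalized text, so print, AMP, and paginated variants of the same article collapse to one fingerprint. Function names here are illustrative:

```python
import hashlib
import re

def content_fingerprint(text: str) -> str:
    """Hash of normalized text: collapse whitespace, lowercase, then SHA-256."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def dedupe(pages: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep the first (url, text) pair seen for each fingerprint."""
    seen: set[str] = set()
    kept = []
    for url, text in pages:
        fp = content_fingerprint(text)
        if fp not in seen:
            seen.add(fp)
            kept.append((url, text))
    return kept
```

Exact hashing only catches byte-identical content after normalization; near-duplicates (same article with different bylines or ads) need shingling or MinHash on top.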
A site might have a 2015 blog post right next to a 2024 product doc. Without publish dates as metadata, retrieval can surface outdated info. Always capture dates when available and use them as filters or boosts.
Public URL ≠ public content. Check what your scraper actually sees: many sites serve a paywall page that carries the article's title but none of its body.
If curl returns "Loading..." and nothing else, you need a headless browser. This adds 3-10x latency per page and requires significantly more infrastructure.
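Before paying the headless-browser cost on every page, you can cheaply flag pages whose static HTML contains almost no visible text and route only those through rendering. The 200-character threshold is an assumption to tune:

```python
import re

def looks_like_js_shell(raw_html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: strip scripts/styles and tags, then check how much text remains.
    A React/Vue shell usually leaves only 'Loading...' or an empty root div."""
    no_scripts = re.sub(r"(?is)<(script|style|noscript)[^>]*>.*?</\1>", " ", raw_html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    text = re.sub(r"\s+", " ", text).strip()
    return len(text) < min_text_chars
```

Pages that pass this check go through the fast static path; the rest get queued for a rendered fetch.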
Aggressive scraping gets blocked. Use polite delays, rotate user agents, and respect Retry-After headers. For large-scale scraping of a single domain, ask the site owner about a partnership or an official feed before you get blocked.
Same as for PDFs: sample 10 pages randomly from your extracted corpus. Look at the raw extracted text. Is it the article? Is it the article plus ten lines of nav menu? Is it empty because the extraction failed silently? Fix the extraction before embedding.
Next: Tables and figures.