Parsing HTML + web pages

Every web page has ~10% content and ~90% navigation, ads, menus, footers, cookie banners, and boilerplate. Naive HTML-to-text extraction feeds all of that into your index, which pollutes retrieval and embedding space. Clean web extraction is a discipline.

The extraction stack

1. Fetch

HTTP GET (or a headless browser for JS-heavy sites). Handle redirects, cookies, user agents, rate limits. Respect robots.txt when applicable.
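The fetch step can be sketched with the standard library alone. The user-agent string and robots rules below are made-up examples; in production you would fetch the site's real robots.txt with `set_url` and `read`:

```python
import urllib.request
from urllib import robotparser

# Hypothetical crawler identity -- use a real contact URL in production.
USER_AGENT = "example-crawler/1.0 (+https://example.com/bot)"

def build_request(url: str) -> urllib.request.Request:
    """A GET request with an explicit User-Agent (urllib follows redirects by default)."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

# robots.txt check, parsed from literal lines here for illustration; normally
# you would call rp.set_url("https://example.com/robots.txt") and rp.read().
rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])

allowed = rp.can_fetch(USER_AGENT, "https://example.com/docs/intro")
blocked = rp.can_fetch(USER_AGENT, "https://example.com/private/notes")
```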

2. Dynamic content rendering

If the site is a single-page app (React, Vue, Angular), the raw HTML is an empty shell: the content only appears after JavaScript runs. You need a headless browser (Playwright, Puppeteer) to render the DOM before extraction. This is significantly slower and more expensive than static fetching.

3. Boilerplate removal

Strip navigation, sidebars, footers, comments, ads. This is the make-or-break step.

4. Content extraction

Preserve structure: headings, paragraphs, lists, code blocks, tables.

5. Link rewriting

Convert relative URLs to absolute ones. Keep link targets as metadata if they matter for your use case.
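The link-rewriting step is one call in the standard library; the page URL below is an arbitrary example:

```python
from urllib.parse import urljoin

page_url = "https://example.com/docs/guide/intro.html"

# urljoin resolves relative hrefs against the page they appeared on;
# already-absolute URLs pass through unchanged.
hrefs = ["../api/", "/pricing", "setup.html", "https://other.site/x"]
absolute = [urljoin(page_url, href) for href in hrefs]
```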

The library landscape

trafilatura / readability-lxml

Python libraries that heuristically extract the "main content" from HTML. Trafilatura is my default: it has the best signal-to-noise ratio on general-purpose web content and handles blog posts, news articles, and documentation pages well.

Readability.js / Mozilla Readability

The algorithm that powers browser reader modes. Good baseline. Works well for article-style content.

BeautifulSoup / lxml

Low-level HTML parsers. Use them when you know the structure and can write specific selectors (e.g., "every .docs-content div"). Precise but brittle: they break when sites redesign.
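A minimal selector-based extraction with BeautifulSoup, using a hypothetical `.docs-content` container like the one above:

```python
from bs4 import BeautifulSoup

html = """
<html><body>
  <nav>Home | Docs | Pricing</nav>
  <div class="docs-content">
    <h1>Installation</h1>
    <p>Run pip install example-pkg to get started.</p>
  </div>
  <footer>Copyright 2024</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
# Target only the known content container; nav and footer never enter the index.
main = soup.select_one("div.docs-content")
text = main.get_text(" ", strip=True)
```

The precision is the appeal and the risk: if the site renames `docs-content`, `select_one` returns None and extraction fails, so monitor for empty output.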

Jina Reader / Diffbot / Firecrawl

Managed services that return clean Markdown from URLs. Jina Reader is free for light use; Firecrawl is the commercial leader for RAG-oriented scraping. Worth it when you have many sources and don't want to maintain extraction logic.

html-to-text / html-to-markdown

Naive converters. Avoid them for RAG: they preserve all the junk along with the content.

HTML → Markdown is usually the right format

For RAG, convert HTML to Markdown rather than plaintext. Markdown preserves structure (headings, lists, code, links) in a format that chunking and embeddings both handle well. Plain text loses everything.

My stack: trafilatura or Firecrawl → Markdown → chunker. Skipping to plaintext drops too much signal.
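To illustrate what "preserving structure" buys you, here is a toy HTML-to-Markdown converter. Real pipelines should use trafilatura or a maintained converter; this sketch exists only to show headings and lists surviving the conversion:

```python
from html.parser import HTMLParser

class MiniMarkdown(HTMLParser):
    """Toy HTML -> Markdown: headings, paragraphs, list items, inline code.
    Illustrative only -- not a production converter."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "h2", "h3"):
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "li":
            self.prefix = "- "
        elif tag == "code":
            self.out.append("`")

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "h3", "p", "li"):
            self.out.append("\n")
            self.prefix = ""
        elif tag == "code":
            self.out.append("`")

    def handle_data(self, data):
        if not data.strip():
            return
        text = " ".join(data.split())
        if data[-1:].isspace():
            text += " "  # keep the space before an inline element
        self.out.append(self.prefix + text)
        self.prefix = ""

    def convert(self, html):
        self.feed(html)
        return "".join(self.out)

md = MiniMarkdown().convert("<h2>Setup</h2><ul><li>Run <code>make</code></li></ul>")
```

The `## Setup` heading and `-` list item survive, which is exactly the structure a chunker can split on and a plaintext dump would have thrown away.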

Patterns for documentation sites

Developer docs, knowledge bases, and product help centers usually have predictable structure. For these, specific selectors (BeautifulSoup/lxml) beat heuristic extractors: target the main-content container directly and keep the page's section hierarchy as metadata.

Patterns for messy sources (blogs, news, general web)

Here the structure is unpredictable, so heuristic extractors (trafilatura, Readability) are the right tool: let them find the main content, then deduplicate and quality-check the output.

The hidden problems

Duplicate content

Many sites have print versions, AMP versions, paginated archives, and tag pages that duplicate core content. Deduplicate aggressively during ingestion.
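Deduplication can start as simple as hashing whitespace-normalized text; shingling or MinHash is the heavier upgrade when exact hashing misses near-duplicates. The URLs and texts below are made-up examples:

```python
import hashlib

def content_key(text: str) -> str:
    """Hash of lowercased, whitespace-collapsed text, so trivial
    variants (print pages, AMP pages) map to the same key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

pages = {
    "https://example.com/post": "The Big Idea.\nFull text here.",
    "https://example.com/post?print=1": "the big idea. full   text here.",
    "https://example.com/other": "A different article entirely.",
}

seen, unique = set(), []
for url, text in pages.items():
    key = content_key(text)
    if key not in seen:
        seen.add(key)
        unique.append(url)
```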

Stale + evergreen mixed

A site might have a 2015 blog post right next to a 2024 product doc. Without publish dates as metadata, retrieval can surface outdated info. Always capture dates when available and use them as filters or boosts.
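One way to use dates as a boost is an exponential decay multiplier on the retrieval score. The one-year half-life below is an arbitrary starting point to tune per corpus:

```python
from datetime import datetime, timezone

def recency_boost(score, published, half_life_days=365.0, now=None):
    """Multiply a retrieval score by an exponential decay on document age.
    Documents with no publish date pass through unboosted."""
    if published is None:
        return score
    now = now or datetime.now(timezone.utc)
    age_days = (now - published).total_seconds() / 86400.0
    return score * 0.5 ** (age_days / half_life_days)
```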

Paywalls and auth walls

Public URL ≠ public content. Check what your scraper actually sees: many sites serve a paywall page with the article's title but none of its body.

Single-page apps that break fetching

If curl returns "Loading..." and nothing else, you need a headless browser. This adds 3-10x latency per page and requires significantly more infrastructure.
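A crude pre-flight check can route pages to the expensive headless-browser path only when needed. The thresholds here are guesses to tune against your own corpus:

```python
import re

def needs_rendering(html: str) -> bool:
    """Heuristic: script tags present but almost no visible text
    usually means an SPA shell that needs a headless browser."""
    scripts = len(re.findall(r"<script\b", html, re.I))
    stripped = re.sub(r"<script[\s\S]*?</script>", " ", html, flags=re.I)
    visible = " ".join(re.sub(r"<[^>]+>", " ", stripped).split())
    return scripts >= 1 and len(visible) < 100  # arbitrary cutoff

spa_shell = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
article = "<html><body><p>" + "Real article text. " * 20 + "</p></body></html>"
```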

Rate limiting

Aggressive scraping gets blocked. Use polite delays, rotate user agents, respect Retry-After headers. For large-scale scraping of a single domain, talk to them about a partnership or an official feed before you get blocked.
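Polite retry timing in sketch form: honor Retry-After when the server sends it, otherwise exponential backoff with jitter. The base and cap values are arbitrary defaults:

```python
import random

def retry_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry `attempt` (0-based)."""
    if retry_after is not None:
        return float(retry_after)          # the server's word is final
    delay = min(cap, base * 2 ** attempt)  # exponential, capped
    return delay * (0.5 + random.random() / 2)  # jitter: 50-100% of the delay
```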

The quality check

Same as for PDFs: sample 10 pages randomly from your extracted corpus. Look at the raw extracted text. Is it the article? Is it the article plus ten lines of nav menu? Is it empty because the extraction failed silently? Fix the extraction before embedding.
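The spot check is easy to script. The 200-character threshold below is a made-up floor for "probably a failed extraction"; adjust it to your content:

```python
import random

def spot_check(corpus, k=10, seed=42):
    """Sample k extracted pages and flag likely silent failures for manual review."""
    rng = random.Random(seed)
    urls = rng.sample(sorted(corpus), min(k, len(corpus)))
    report = {}
    for url in urls:
        text = corpus[url].strip()
        if len(text) < 200:  # arbitrary floor for a real article
            report[url] = "SUSPECT: near-empty extraction"
        else:
            report[url] = f"ok ({len(text)} chars)"
    return report

corpus = {
    "https://example.com/good": "A long, fully extracted article. " * 20,
    "https://example.com/bad": "Loading...",
}
report = spot_check(corpus, k=10)
```

The report only catches empty extractions automatically; the nav-menu-contamination case still needs your eyes on the sampled text.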

Next: Tables and figures.