Semantic chunking

Semantic chunking splits text where the meaning changes, not where a character counter runs out. Instead of "1000 characters from here, then overlap, then 1000 more," it asks: "where does one idea end and the next begin?" The result is chunks that look more like coherent thoughts, which usually retrieve better than arbitrary slices.

The core idea

  1. Split the document into candidate boundaries (sentences or paragraphs)
  2. Embed each candidate
  3. Measure the similarity between adjacent candidates
  4. When similarity drops below a threshold, you've hit a topic boundary; start a new chunk there

The output: variable-length chunks that each cover a single topic or subtopic.
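The boundary-detection signal at the heart of steps 2-4 can be sketched in a few lines. This is a toy illustration, assuming numpy and substituting tiny hand-made 2-d vectors for real sentence embeddings; `boundary_indices` is a name invented here, not from any library:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def boundary_indices(embeddings, threshold=0.5):
    # Mark a boundary after sentence i when sim(i, i+1) falls below threshold.
    sims = [cosine(embeddings[i], embeddings[i + 1])
            for i in range(len(embeddings) - 1)]
    return [i for i, s in enumerate(sims) if s < threshold]

# Two sentences on one topic, then a shift: the similarity drop sits
# between index 1 and index 2.
embs = [np.array([1.0, 0.1]), np.array([0.9, 0.2]), np.array([0.1, 1.0])]
print(boundary_indices(embs))  # → [1]
```

With real embeddings the vectors are high-dimensional and the threshold needs tuning per model, but the shape of the computation is the same.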

The algorithm in detail

Sentence-level approach

1. Split document into sentences
2. Embed each sentence
3. Compute cosine similarity between sentence i and sentence i+1
4. If similarity < threshold, mark as boundary
5. Form chunks by grouping consecutive sentences between boundaries
6. If a chunk is too small, merge it with a neighbor
7. If a chunk is too large, split it at its weakest internal similarity
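The seven steps above can be sketched as one function. This is a minimal runnable version, assuming numpy; sentence splitting is left to the caller, `embed` is whatever model you choose, and `min_sents`/`max_sents` are illustrative parameter names, not from any library:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embed, threshold=0.5, min_sents=2, max_sents=6):
    # Steps 2-3: embed each sentence, score each adjacent pair.
    vecs = [embed(s) for s in sentences]
    sims = [cosine(vecs[i], vecs[i + 1]) for i in range(len(vecs) - 1)]

    # Steps 4-5: cut wherever adjacent similarity drops below the threshold.
    groups, current = [], [0]
    for i in range(1, len(sentences)):
        if sims[i - 1] < threshold:
            groups.append(current)
            current = []
        current.append(i)
    groups.append(current)

    # Step 6: merge undersized chunks into their left neighbor.
    merged = []
    for g in groups:
        if merged and len(g) < min_sents:
            merged[-1].extend(g)
        else:
            merged.append(g)

    # Step 7: split oversized chunks at their weakest internal similarity.
    final, stack = [], merged[::-1]
    while stack:
        g = stack.pop()
        if len(g) <= max_sents:
            final.append(g)
        else:
            weakest = min(range(len(g) - 1), key=lambda j: sims[g[j]])
            stack.append(g[weakest + 1:])   # right half, handled after...
            stack.append(g[:weakest + 1])   # ...the left half, to keep order
    return [" ".join(sentences[i] for i in g) for g in final]
```

A production version would add sentence splitting, batch the embedding calls, and tune the three knobs per corpus, but the control flow mirrors the steps above directly.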

Window-based approach

Comparing single sentences to each other is noisy. Instead, compare the average embedding of a window of N sentences to that of the next N: the smoother signal yields better boundaries.
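A sketch of the windowed comparison, again assuming numpy: at each candidate gap, the mean embedding of the sentences before it is compared with the mean of the sentences after it. `windowed_sims` is a name invented here:

```python
import numpy as np

def windowed_sims(vecs, n=3):
    # Similarity across each gap, computed between the mean of up to n
    # sentence vectors before the gap and up to n after it.
    sims = []
    for i in range(1, len(vecs)):
        left = np.mean(vecs[max(0, i - n):i], axis=0)
        right = np.mean(vecs[i:i + n], axis=0)
        sims.append(float(np.dot(left, right) /
                          (np.linalg.norm(left) * np.linalg.norm(right))))
    return sims
```

A single off-topic sentence barely moves a window mean, so spurious dips that would fool pairwise comparison get averaged away.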

Percentile-based thresholding

Instead of a fixed similarity threshold (which varies by embedding model), use the Nth percentile of all adjacent similarities in the document: for example, split at the bottom 5% of similarities. This adapts to each document's intrinsic similarity distribution.
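Percentile thresholding is a one-liner with numpy's `percentile`; `percentile_boundaries` is a name invented for this sketch:

```python
import numpy as np

def percentile_boundaries(sims, pct=5.0):
    # Cut at the gaps whose similarity falls at or below this document's
    # pct-th percentile, rather than at a fixed absolute threshold.
    cutoff = np.percentile(sims, pct)
    return [i for i, s in enumerate(sims) if s <= cutoff]
```

Because the cutoff is computed per document, a uniformly self-similar document still yields a few splits, and a choppy one doesn't shatter into single sentences.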

When semantic chunking wins

When it's overkill

The cost

Semantic chunking requires embedding every sentence or window during ingestion, often 5-20x more embedding calls than you'd need for retrieval alone. For a 100M-token corpus, this is a material cost.

Mitigation: use a cheap embedding model for chunking (text-embedding-3-small, open-source models) and a better one for the retrieval index. The chunking-time embeddings don't have to match your retrieval embeddings.
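The decoupling amounts to threading two embedders through ingestion. A sketch under stated assumptions: `cheap_embed`, `quality_embed`, `chunker`, and `index` are all placeholder names for whatever components you actually use, not a real API:

```python
def ingest(documents, cheap_embed, quality_embed, chunker, index):
    for doc in documents:
        # Boundary detection uses the cheap model on every sentence...
        chunks = chunker(doc, embed=cheap_embed)
        # ...but the retrieval index stores vectors from the better model,
        # computed once per chunk rather than once per sentence.
        for chunk in chunks:
            index.add(chunk, quality_embed(chunk))
```

The chunk-time model only needs to signal topic shifts reliably; it never touches the vectors your queries are compared against.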

The tuning knobs

Implementations

The diminishing returns question

Semantic chunking usually beats fixed-size by 5-15% on retrieval metrics for long-form prose. For highly structured content, structure-aware chunking beats both. Before switching to semantic, ask: do my documents have structure I could use instead? If yes, use that. Semantic is the fallback for unstructured content.

Next: Recursive chunking.