Semantic chunking
📖 5 min read · Updated 2026-04-18
Semantic chunking splits text where the meaning changes, not where a character counter runs out. Instead of "1000 characters from here, then overlap, then 1000 more," it asks: "where does one idea end and the next begin?" The result is chunks that look more like coherent thoughts, which usually retrieve better than arbitrary slices.
The core idea
- Split the document into candidate boundaries (sentences or paragraphs)
- Embed each candidate
- Measure the similarity between adjacent candidates
- When similarity drops below a threshold, you've hit a topic boundary: start a new chunk
The output: variable-length chunks that each cover a single topic or subtopic.
The algorithm in detail
Sentence-level approach
1. Split document into sentences
2. Embed each sentence
3. Compute cosine similarity between sentence i and sentence i+1
4. If similarity < threshold, mark as boundary
5. Form chunks by grouping consecutive sentences between boundaries
6. If a chunk is too small, merge with neighbor
7. If a chunk is too large, split at the weakest similarity within it
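The sentence-level steps above fit in a few lines of plain Python. This is a sketch, not a library API: embeddings are supplied by the caller as plain float lists, the names are illustrative, and the min/max handling from steps 6-7 is omitted for brevity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (step 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences into chunks, starting a new chunk
    whenever adjacent-sentence similarity drops below `threshold`
    (steps 4-5). Steps 6-7 (merge/split by size) are omitted here."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # topic boundary found
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

With toy 2-D embeddings where the first two sentences point one way and the last two another, `semantic_chunks(["a", "b", "c", "d"], [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])` returns two chunks, split at the direction change.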
Window-based approach
Instead of comparing sentence-to-sentence (noisy), compare windows of N sentences to the next N. Smoother signal, better boundaries.
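One way to sketch the window comparison: at each gap between sentences, average the embeddings of the N sentences before the gap and the N after, then compare the two means. Names and window handling at the document edges are illustrative choices.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def window_similarities(embeddings, n=3):
    """For each gap between sentences, compare the mean embedding of up
    to n sentences before the gap to the mean of up to n after it.
    Averaging smooths out single-sentence noise."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    sims = []
    for gap in range(1, len(embeddings)):
        before = embeddings[max(0, gap - n):gap]
        after = embeddings[gap:gap + n]
        sims.append(cosine(mean(before), mean(after)))
    return sims
```

On a toy document whose first three sentences point one way and last three another, the similarity signal bottoms out exactly at the middle gap, which is the boundary you want.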
Percentile-based thresholding
Instead of a fixed similarity threshold (which varies by embedding model), use the Nth percentile of all adjacent similarities in the document. E.g., split at the bottom 5% of similarities. Adapts to each document's intrinsic similarity distribution.
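A minimal version of the percentile rule, assuming you already have the list of adjacent similarities; splitting strictly below the returned value puts roughly the bottom `pct` percent of gaps on chunk boundaries:

```python
def percentile_threshold(similarities, pct=5):
    """Return the similarity value at the bottom `pct` percent of this
    document's adjacent-similarity distribution. Split wherever a
    similarity falls strictly below it; adapts per document, unlike a
    fixed threshold."""
    ranked = sorted(similarities)
    k = min(len(ranked) - 1, int(len(ranked) * pct / 100))
    return ranked[k]
```

For 20 gaps and `pct=5`, exactly the single lowest-similarity gap ends up below the threshold.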
When semantic chunking is worth it
Wins
- Long-form content with shifting topics (books, long articles, research papers)
- Mixed content where one part is explanation, another is examples, another is references
- Content where manual structure (headings) isn't available or isn't reliable
- Cases where fixed-size chunking is demonstrably splitting mid-thought
Overkill
- Short documents where one or two chunks cover the whole thing anyway
- Highly structured content (docs with clear headings): use structure-aware chunking instead
- Very homogeneous content (FAQs, product catalog entries) where topic boundaries are already obvious from structure
- When embedding costs matter and the corpus is large
The cost
Semantic chunking requires embedding every sentence or window during ingestion, often 5-20x more embedding calls than you'd need for retrieval alone. For a 100M-token corpus, this is a material cost.
Mitigation: use a cheap embedding model for chunking (text-embedding-3-small, open-source models) and a better one for the retrieval index. The chunking-time embeddings don't have to match your retrieval embeddings.
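A back-of-envelope for the 100M-token corpus above. The 10x overhead and the per-token price are illustrative assumptions, not quoted rates; the point is that a cheap chunking-time model keeps even the inflated call volume affordable.

```python
# All numbers below are illustrative assumptions for a rough estimate.
corpus_tokens = 100_000_000    # 100M-token corpus (from the text)
overhead = 10                  # semantic chunking: ~5-20x more embedding calls
price_per_million = 0.02       # hypothetical $/1M tokens for a cheap model

retrieval_only = corpus_tokens / 1e6 * price_per_million
with_semantic = retrieval_only * overhead
print(f"retrieval-only: ${retrieval_only:.2f}, semantic chunking: ${with_semantic:.2f}")
# → retrieval-only: $2.00, semantic chunking: $20.00
```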
The tuning knobs
- Similarity threshold (or percentile): controls how often you split. Higher = more boundaries, smaller chunks; lower = fewer, larger chunks.
- Window size: sentence-level is noisy, window of 3-5 sentences is smoother.
- Min/max chunk size: prevent degenerate cases. Clamp between, say, 100 and 1500 tokens.
- Merge strategy for small chunks: merge with previous, next, or highest-similarity neighbor.
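The simplest of the merge strategies listed, merging a too-small chunk into its previous neighbor, can be sketched as follows (token counts are approximated by whitespace word counts, an illustrative stand-in for a real tokenizer):

```python
def merge_small_chunks(chunks, min_tokens=100):
    """Fold any chunk below `min_tokens` into the previous chunk.
    Merging into the next or the highest-similarity neighbor are
    equally valid variants of this strategy."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk  # absorb into previous
        else:
            merged.append(chunk)
    return merged
```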
Implementations
- LlamaIndex SemanticSplitterNodeParser: the reference implementation. Works. Configurable.
- LangChain SemanticChunker: similar, different defaults.
- Custom: the algorithm is ~30 lines of Python. Most teams who care end up writing their own.
The diminishing returns question
Semantic chunking usually beats fixed-size by 5-15% on retrieval metrics for long-form prose. For highly structured content, structure-aware chunking beats both. Before switching to semantic, ask: do my documents have structure I could use instead? If yes, use that. Semantic is the fallback for unstructured content.
What to do with this
- Reach for semantic when your content is long-form + unstructured and fixed-size is demonstrably splitting mid-thought.
- Use a cheap model for chunking-time embeddings to keep costs reasonable.
- Always clamp min/max chunk size to avoid degenerate outputs.