Semantic chunking
📖 5 min read · Updated 2026-04-18
Semantic chunking splits text where the meaning changes, not where a character counter runs out. Instead of "1000 characters from here, then overlap, then 1000 more," it asks: "where does one idea end and the next begin?" The result is chunks that look more like coherent thoughts, which usually retrieve better than arbitrary slices.
The core idea
- Split the document into candidate boundaries (sentences or paragraphs)
- Embed each candidate
- Measure the similarity between adjacent candidates
- When similarity drops below a threshold, you've hit a topic boundary: start a new chunk
The output: variable-length chunks that each cover a single topic or subtopic.
The algorithm in detail
Sentence-level approach
1. Split document into sentences
2. Embed each sentence
3. Compute cosine similarity between sentence i and sentence i+1
4. If similarity < threshold, mark as boundary
5. Form chunks by grouping consecutive sentences between boundaries
6. If a chunk is too small, merge with neighbor
7. If a chunk is too large, split at the weakest similarity within it
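The sentence-level steps above fit in a few lines of plain Python. This is a sketch, not a library API: embeddings are supplied by the caller as plain float lists, the names are illustrative, and the min/max handling from steps 6-7 is omitted for brevity.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors (step 3)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def semantic_chunks(sentences, embeddings, threshold=0.5):
    """Group consecutive sentences into chunks, starting a new chunk
    whenever adjacent-sentence similarity drops below `threshold`
    (steps 4-5). Steps 6-7 (merge/split by size) are omitted here."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            chunks.append(" ".join(current))  # topic boundary found
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

With toy 2-D embeddings where the first two sentences point one way and the last two another, `semantic_chunks(["a", "b", "c", "d"], [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]])` returns two chunks, split at the direction change.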
Window-based approach
Instead of comparing sentence-to-sentence (noisy), compare windows of N sentences to the next N. Smoother signal, better boundaries.
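One way to sketch the window comparison: at each gap between sentences, average the embeddings of the N sentences before the gap and the N after, then compare the two means. Names and window handling at the document edges are illustrative choices.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def window_similarities(embeddings, n=3):
    """For each gap between sentences, compare the mean embedding of up
    to n sentences before the gap to the mean of up to n after it.
    Averaging smooths out single-sentence noise."""
    def mean(vectors):
        return [sum(col) / len(vectors) for col in zip(*vectors)]
    sims = []
    for gap in range(1, len(embeddings)):
        before = embeddings[max(0, gap - n):gap]
        after = embeddings[gap:gap + n]
        sims.append(cosine(mean(before), mean(after)))
    return sims
```

On a toy document whose first three sentences point one way and last three another, the similarity signal bottoms out exactly at the middle gap, which is the boundary you want.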
Percentile-based thresholding
Instead of a fixed similarity threshold (which varies by embedding model), use the Nth percentile of all adjacent similarities in the document. E.g., split at the bottom 5% of similarities. Adapts to each document's intrinsic similarity distribution.
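A minimal version of the percentile rule, assuming you already have the list of adjacent similarities; splitting strictly below the returned value puts roughly the bottom `pct` percent of gaps on chunk boundaries:

```python
def percentile_threshold(similarities, pct=5):
    """Return the similarity value at the bottom `pct` percent of this
    document's adjacent-similarity distribution. Split wherever a
    similarity falls strictly below it; adapts per document, unlike a
    fixed threshold."""
    ranked = sorted(similarities)
    k = min(len(ranked) - 1, int(len(ranked) * pct / 100))
    return ranked[k]
```

For 20 gaps and `pct=5`, exactly the single lowest-similarity gap ends up below the threshold.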
When semantic chunking is worth it
Wins
- Long-form content with shifting topics (books, long articles, research papers)
- Mixed content where one part is explanation, another is examples, another is references
- Content where manual structure (headings) isn't available or isn't reliable
- Cases where fixed-size chunking is demonstrably splitting mid-thought
Overkill
- Short documents where one or two chunks cover the whole thing anyway
- Highly structured content (docs with clear headings): use structure-aware chunking instead
- Very homogeneous content (FAQs, product catalog entries) where topic boundaries are already obvious from structure
- When embedding costs matter and the corpus is large
The cost
Semantic chunking requires embedding every sentence or window during ingestion, often 5-20x more embedding calls than you'd need for retrieval alone. For a 100M-token corpus, this is a material cost.
Mitigation: use a cheap embedding model for chunking (text-embedding-3-small, open-source models) and a better one for the retrieval index. The chunking-time embeddings don't have to match your retrieval embeddings.
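A back-of-envelope for the 100M-token corpus above. The 10x overhead and the per-token price are illustrative assumptions, not quoted rates; the point is that a cheap chunking-time model keeps even the inflated call volume affordable.

```python
# All numbers below are illustrative assumptions for a rough estimate.
corpus_tokens = 100_000_000    # 100M-token corpus (from the text)
overhead = 10                  # semantic chunking: ~5-20x more embedding calls
price_per_million = 0.02       # hypothetical $/1M tokens for a cheap model

retrieval_only = corpus_tokens / 1e6 * price_per_million
with_semantic = retrieval_only * overhead
print(f"retrieval-only: ${retrieval_only:.2f}, semantic chunking: ${with_semantic:.2f}")
# → retrieval-only: $2.00, semantic chunking: $20.00
```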
The tuning knobs
- Similarity threshold (or percentile): controls how often you split. Higher = more boundaries, smaller chunks; lower = fewer, larger chunks.
- Window size: sentence-level is noisy, window of 3-5 sentences is smoother.
- Min/max chunk size: prevent degenerate cases. Clamp between, say, 100 and 1500 tokens.
- Merge strategy for small chunks: merge with previous, next, or highest-similarity neighbor.
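The simplest of the merge strategies listed, merging a too-small chunk into its previous neighbor, can be sketched as follows (token counts are approximated by whitespace word counts, an illustrative stand-in for a real tokenizer):

```python
def merge_small_chunks(chunks, min_tokens=100):
    """Fold any chunk below `min_tokens` into the previous chunk.
    Merging into the next or the highest-similarity neighbor are
    equally valid variants of this strategy."""
    merged = []
    for chunk in chunks:
        if merged and len(chunk.split()) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk  # absorb into previous
        else:
            merged.append(chunk)
    return merged
```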
Implementations
- LlamaIndex SemanticSplitterNodeParser: the reference implementation. Works. Configurable.
- LangChain SemanticChunker: similar, different defaults.
- Custom: the algorithm is ~30 lines of Python. Most teams who care end up writing their own.
The diminishing returns question
Semantic chunking usually beats fixed-size by 5-15% on retrieval metrics for long-form prose. For highly structured content, structure-aware chunking beats both. Before switching to semantic, ask: do my documents have structure I could use instead? If yes, use that. Semantic is the fallback for unstructured content.
What to do with this
- Reach for semantic when your content is long-form + unstructured and fixed-size is demonstrably splitting mid-thought.
- Use a cheap model for chunking-time embeddings to keep costs reasonable.
- Always clamp min/max chunk size to avoid degenerate outputs.