Semantic chunking splits text where the meaning changes, not where a character counter runs out. Instead of "1000 characters from here, then overlap, then 1000 more," it asks: "where does one idea end and the next begin?" The result is chunks that look more like coherent thoughts, which usually retrieve better than arbitrary slices.
The output: variable-length chunks that each cover a single topic or subtopic.
1. Split the document into sentences
2. Embed each sentence
3. Compute cosine similarity between sentence i and sentence i+1
4. If similarity < threshold, mark a boundary
5. Form chunks by grouping consecutive sentences between boundaries
6. If a chunk is too small, merge it with a neighbor
7. If a chunk is too large, split it at the weakest similarity within it
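The steps above can be sketched in a few lines. This is a minimal illustration, not a production implementation: it assumes sentence splitting and embedding have already happened (you pass in the sentences and their vectors), and it covers boundary detection and small-chunk merging but omits the final "split oversized chunks" step. The function names are mine, not from any particular library.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def semantic_chunks(sentences, embeddings, threshold=0.5, min_len=2):
    # Mark a boundary before sentence i+1 wherever adjacent similarity dips
    boundaries = [
        i + 1
        for i in range(len(sentences) - 1)
        if cosine_sim(embeddings[i], embeddings[i + 1]) < threshold
    ]
    # Group consecutive sentences between boundaries
    chunks, start = [], 0
    for b in boundaries + [len(sentences)]:
        chunks.append(list(sentences[start:b]))
        start = b
    # Merge too-small chunks into the previous neighbor
    merged = []
    for c in chunks:
        if merged and len(c) < min_len:
            merged[-1].extend(c)
        else:
            merged.append(c)
    return merged
```

With four sentences whose embeddings form two clear clusters, this returns two chunks of two sentences each, split at the one low-similarity gap.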
Instead of comparing sentence-to-sentence (noisy), compare windows of N sentences to the next N. Smoother signal, better boundaries.
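One way to sketch the windowed variant (names and the mean-pooling choice are my assumptions): at each candidate boundary, average the embeddings of the N sentences on each side and compare those means instead of single sentences.

```python
import numpy as np

def windowed_similarities(embeddings, n=3):
    # At each boundary i, compare the mean embedding of up to n sentences
    # before i with the mean of up to n sentences after i.
    E = np.asarray(embeddings, dtype=float)
    sims = []
    for i in range(1, len(E)):
        left = E[max(0, i - n):i].mean(axis=0)
        right = E[i:i + n].mean(axis=0)
        sims.append(float(left @ right /
                          (np.linalg.norm(left) * np.linalg.norm(right))))
    return sims  # sims[i-1] is the similarity across the boundary before sentence i
```

A single off-topic sentence no longer produces two spurious dips; the minimum of this smoothed signal lands on the real topic shift.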
Instead of a fixed similarity threshold (which varies by embedding model), use the Nth percentile of all adjacent similarities in the document. E.g., split at the bottom 5% of similarities. Adapts to each document's intrinsic similarity distribution.
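A percentile cutoff is a one-liner on top of the similarity list (sketch; the function name is mine):

```python
import numpy as np

def percentile_boundaries(similarities, pct=5):
    # Split wherever adjacent similarity falls in this document's bottom pct%.
    cutoff = np.percentile(similarities, pct)
    return [i + 1 for i, s in enumerate(similarities) if s <= cutoff]
```

The same `pct` then transfers across documents and embedding models, because the cutoff is computed from each document's own similarity distribution rather than hard-coded.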
Semantic chunking requires embedding every sentence or window during ingestion, often 5-20x more embedding calls than you'd need for retrieval alone. For a 100M-token corpus, this is a material cost.
Mitigation: use a cheap embedding model for chunking (text-embedding-3-small, open-source models) and a better one for the retrieval index. The chunking-time embeddings don't have to match your retrieval embeddings.
Semantic chunking usually beats fixed-size by 5-15% on retrieval metrics for long-form prose. For highly structured content, structure-aware chunking beats both. Before switching to semantic, ask: do my documents have structure I could use instead? If yes, use that. Semantic is the fallback for unstructured content.
Next: Recursive chunking.