Why chunking matters
📖 5 min read · Updated 2026-04-18
Chunking is where the most avoidable quality loss in RAG happens. Most teams use LangChain's default 1000-character splitter and never revisit the decision, yet that default is wrong for almost every real corpus. Getting chunking right is one of the highest-leverage changes you can make to retrieval quality.
Why chunking exists at all
Three reasons you can't just embed whole documents:
- Embedding context windows. Most embedding models cap at 512-8192 tokens. Documents often exceed this.
- Retrieval granularity. If the answer to a question lives in one paragraph, retrieving the whole 50-page document buries it.
- Prompt context limits. Even with long-context LLMs, passing 10 whole documents of retrieved context is expensive and dilutes attention.
What you're optimizing for
A good chunk has three properties:
- Self-contained meaning. A reader can understand it without surrounding context.
- Single topic. The chunk is about one thing, not a mix.
- Retrievable. It matches the kind of query a user would make.
Fixed-size chunking often fails on all three. A 1000-character window can split mid-sentence, blend two topics, and lose enough context that the chunk is meaningless standalone.
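A toy example makes the failure concrete. The splitter below is a minimal sketch of pure fixed-size chunking, and the policy text is hypothetical:

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size chunking: slice every `size` characters,
    with no regard for word, sentence, or topic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

policy = (
    "Refunds are issued within 30 days of purchase. "
    "Shipping costs are non-refundable."
)
chunks = fixed_size_chunks(policy, 40)
# The first chunk ends mid-word ("...of pur"), stranding the refund
# window from the sentence that gives it meaning.
```

Embedded on its own, that first fragment no longer reads as a complete statement about refunds, which is exactly the self-containment failure described above.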
The chunking trade-off curve
Too small
- Fragments without context
- High recall, low precision (too many false-positive matches)
- LLM can't reason about the isolated snippet
- More chunks = more embeddings = more cost
Too large
- Single chunk covers multiple topics, embedding becomes a muddled average
- Lower recall (the specific answer is buried in surrounding text)
- LLM context budget wasted on irrelevant content
- Harder to cite precisely
Goldilocks
Usually 200-800 tokens for prose, with some overlap, and respecting natural boundaries (paragraphs, sections). Exact sweet spot depends heavily on content type.
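One way to land in that range while respecting natural boundaries is greedy paragraph packing: merge whole paragraphs until a token budget is hit, so no chunk splits mid-paragraph. This sketch approximates tokens with whitespace-separated words; a real pipeline would count with the embedding model's tokenizer:

```python
def chunk_paragraphs(text: str, max_tokens: int = 400) -> list[str]:
    """Greedy paragraph packing: accumulate whole paragraphs until
    adding the next one would exceed the budget, then start a new
    chunk. Tokens are approximated by whitespace-separated words."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        n = len(para.split())
        if current and count + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Paragraphs longer than the budget pass through whole here; a production version would fall back to a finer splitter for those.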
The overlap question
Overlap (including some tokens from the end of the previous chunk in the start of the next) prevents information loss at boundaries. Common settings: 10-20% overlap.
Overlap helps when:
- Important information spans arbitrary boundaries
- Context before/after a fact matters for answering
- You're using fixed-size chunking without structure awareness
Overlap hurts when:
- You have good natural boundaries (paragraphs, headings) and should use those instead
- You care about precision (overlapping chunks can match the same query and both show up in top-k, reducing diversity)
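Mechanically, overlap just means each window starts before the previous one ends. A minimal sketch over a pre-tokenized sequence, using 15% (inside the common 10-20% range):

```python
def overlapping_windows(tokens: list[str], size: int,
                        overlap: float = 0.15) -> list[list[str]]:
    """Fixed-size windows where each window repeats the trailing
    `overlap` fraction of the previous one, so facts that straddle
    a boundary appear whole in at least one chunk."""
    step = max(1, int(size * (1 - overlap)))
    chunks, i = [], 0
    while i < len(tokens):
        chunks.append(tokens[i:i + size])
        if i + size >= len(tokens):
            break
        i += step
    return chunks
```

Note the cost side of the trade-off: at 15% overlap you store and embed roughly 18% more tokens than the corpus contains.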
The chunking-for-query principle
Chunks should match the unit in which answers live, not the unit in which documents are stored. If users ask "what's our refund policy?" the right chunk is the refund policy paragraph, not arbitrary 1000-character slices of a policy document.
This means chunking strategy depends on the queries you expect, not just the documents you have. For many corpora, semantic or structure-aware chunking beats fixed-size because it produces chunks that look more like complete thoughts.
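For markdown-style sources, structure-aware chunking can be as simple as splitting on headings, so a query like "refund policy" maps to one whole section. A sketch (the document format and section markup are assumptions; adapt the pattern to your corpus):

```python
import re

def split_by_headings(markdown: str) -> dict[str, str]:
    """Structure-aware chunking: each markdown heading starts a new
    chunk keyed by its heading text, so a section is retrieved as
    one complete thought rather than arbitrary slices."""
    sections: dict[str, str] = {}
    current = "preamble"
    for line in markdown.splitlines():
        m = re.match(r"#{1,6}\s+(.*)", line)
        if m:
            current = m.group(1).strip()
            sections[current] = ""
        else:
            sections[current] = sections.get(current, "") + line + "\n"
    return {h: body.strip() for h, body in sections.items() if body.strip()}
```

Storing the heading alongside the body also gives the retriever query-like text to match against, since headings tend to be phrased the way users ask.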
Why the defaults are wrong
LangChain's default RecursiveCharacterTextSplitter with 1000 chars and 200 overlap is:
- Too large for short-query retrieval
- Too small for reasoning-intensive content
- Blind to structure (doesn't care about headings, sections, or paragraphs)
- Character-based (ignores that tokens vary in size across languages)
It's a reasonable zero-config starting point. It's a terrible final answer.
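What the recursive strategy actually does can be sketched in a few lines. This is a simplified reimplementation for illustration, not LangChain's actual code: try the coarsest separator first, and only fall back to finer ones when a piece still exceeds the size limit.

```python
def recursive_split(text: str, size: int,
                    separators=("\n\n", "\n", " ", "")) -> list[str]:
    """Simplified recursive character splitting: split on the first
    separator, greedily repack pieces up to `size`, and recurse with
    finer separators on any piece that is still too large."""
    if len(text) <= size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":  # last resort: hard character slicing
        return [text[i:i + size] for i in range(0, len(text), size)]
    chunks, buf = [], ""
    for piece in text.split(sep):
        candidate = buf + sep + piece if buf else piece
        if len(candidate) <= size:
            buf = candidate
        else:
            if buf:
                chunks.append(buf)
            if len(piece) > size:
                chunks.extend(recursive_split(piece, size, tuple(rest)))
                buf = ""
            else:
                buf = piece
    if buf:
        chunks.append(buf)
    return chunks
```

The sketch makes the structure-blindness visible: the separator ladder knows about blank lines, newlines, and spaces, but nothing above the paragraph, so a heading and the section under it are just more characters.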
Chunking strategies by content type
- Documentation / knowledge bases: structure-aware, by heading sections
- Long prose / books: semantic or recursive, 400-800 tokens
- FAQs / Q&A: one question+answer per chunk
- Code: function/class/file boundaries (see chunking code)
- Chat logs / conversational: by conversation turn or topic shift
- Legal / contracts: by clause with surrounding context
- Scientific papers: by section, with special handling for abstracts and references
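The FAQ case is the simplest to implement and shows the pattern for the rest: find the natural unit and split exactly there. A sketch, assuming a plain-text `Q:`/`A:` format (adapt the split pattern to your FAQ's actual markup):

```python
import re

def chunk_faq(text: str) -> list[str]:
    """One question+answer pair per chunk: split wherever a line
    beginning with 'Q:' starts a new entry."""
    entries = re.split(r"\n(?=Q:)", text.strip())
    return [e.strip() for e in entries if e.strip()]
```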
The experiment mindset
Chunking is one of the easiest things to A/B test in RAG. Index the same corpus with different chunking strategies, run the same eval set against both, compare retrieval metrics. Most teams never do this. The teams that do typically find their chunking baseline was leaving 20-40% of retrieval quality on the table.
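The shape of such an experiment fits in a few lines. Here keyword overlap stands in for embedding similarity so the sketch is self-contained; the eval set and both chunkings are made-up illustrations:

```python
def recall_at_k(eval_set, chunks, k=3):
    """Fraction of (query, expected_answer) pairs for which some
    top-k chunk contains the answer. Ranking is naive keyword
    overlap, a stand-in for real embedding similarity."""
    hits = 0
    for query, answer in eval_set:
        q_words = set(query.lower().split())
        ranked = sorted(chunks,
                        key=lambda c: -len(q_words & set(c.lower().split())))
        if any(answer.lower() in c.lower() for c in ranked[:k]):
            hits += 1
    return hits / len(eval_set)

eval_set = [("refunds within how many days", "30 days")]
by_paragraph = ["Refunds are issued within 30 days.",
                "We ship worldwide in 2 business days."]
by_fixed_width = ["Refunds are issued wi", "thin 30 days. We ship",
                  " worldwide in 2 business days."]
# Same corpus, same eval set, two chunkings - only the boundaries differ.
```

Run the same eval set against each index and compare the numbers; with a real retriever the harness is identical, only the scoring function changes.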
Next: Fixed-size chunking.