Why chunking matters

Chunking is where the most avoidable quality loss in RAG happens. Most teams use LangChain's default 1000-character splitter and never revisit the decision, yet that default is wrong for almost every real corpus. Getting chunking right is one of the highest-leverage changes you can make to retrieval quality.

Why chunking exists at all

Three reasons you can't just embed whole documents:

  1. Embedding context windows. Most embedding models cap at 512-8192 tokens. Documents often exceed this.
  2. Retrieval granularity. If the answer to a question lives in one paragraph, retrieving the whole 50-page document buries it.
  3. Prompt context limits. Even with long-context LLMs, passing 10 whole documents of retrieved context is expensive and dilutes attention.
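The first constraint above can be sketched with a quick fit-check. This is a rough illustration, not a real tokenizer: the 4-characters-per-token ratio is a common approximation for English text, and `MODEL_MAX_TOKENS = 512` is a hypothetical embedding-model limit.

```python
MODEL_MAX_TOKENS = 512  # hypothetical embedding-model context window


def approx_tokens(text: str) -> int:
    """Very rough token estimate (~4 characters per token for English)."""
    return len(text) // 4


def needs_chunking(text: str, max_tokens: int = MODEL_MAX_TOKENS) -> bool:
    """True if the document exceeds the embedding window and must be split."""
    return approx_tokens(text) > max_tokens


print(needs_chunking("A short internal memo."))  # fits in one embedding
print(needs_chunking("x" * 10_000))              # must be chunked first
```

In practice you would use the embedding model's own tokenizer for the count; the decision logic stays the same.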

What you're optimizing for

A good chunk has three properties:

  1. Self-contained. It makes sense without the surrounding text.
  2. Single-topic. It covers one idea, so its embedding is crisp rather than an average of several topics.
  3. Boundary-respecting. It starts and ends at natural breaks (sentences, paragraphs, sections).

Fixed-size chunking often fails on all three. A 1000-character window can split mid-sentence, blend two topics, and lose enough context that the chunk is meaningless standalone.
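The mid-sentence failure is easy to demonstrate. A toy sketch with a 60-character window (instead of 1000, to keep the output short):

```python
text = ("Refunds are available within 30 days of purchase. "
        "To request one, contact support with your order number.")

chunk_size = 60
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
for c in chunks:
    print(repr(c))
# The first chunk ends mid-sentence ("... To request"), so neither chunk
# contains the complete instruction for requesting a refund.
```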

The chunking trade-off curve

Too small

Chunks lose the context needed to interpret them, retrieval returns disconnected fragments, and answering one question requires stitching together many chunks.

Too large

Each chunk blends multiple topics, its embedding becomes a muddy average, and retrieved chunks bury the relevant sentences in surrounding noise.

Goldilocks

Usually 200-800 tokens for prose, with some overlap, and respecting natural boundaries (paragraphs, sections). The exact sweet spot depends heavily on content type.
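The Goldilocks approach can be sketched as a greedy packer that keeps paragraphs whole and merges them up to a token budget. The 4-chars-per-token estimate and the 400-token default are illustrative assumptions, not recommendations:

```python
def chunk_by_paragraph(text: str, target_tokens: int = 400) -> list[str]:
    """Greedily pack whole paragraphs into chunks of roughly target_tokens,
    so every chunk boundary falls on a paragraph break."""
    def approx_tokens(s: str) -> int:
        return max(1, len(s) // 4)  # rough estimate; use a real tokenizer

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_tokens = 0
    for p in paragraphs:
        t = approx_tokens(p)
        if current and current_tokens + t > target_tokens:
            chunks.append("\n\n".join(current))
            current, current_tokens = [], 0
        current.append(p)
        current_tokens += t
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A single paragraph longer than the budget still becomes one oversized chunk here; a production version would fall back to sentence-level splitting for that case.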

The overlap question

Overlap (including some tokens from the end of the previous chunk in the start of the next) prevents information loss at boundaries. Common settings: 10-20% overlap.
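A minimal sketch of the sliding window described above, over a pre-tokenized list (words stand in for tokens here); 30/200 is 15% overlap, inside the common 10-20% range:

```python
def chunk_with_overlap(tokens: list, chunk_size: int = 200,
                       overlap: int = 30) -> list[list]:
    """Sliding window: each chunk repeats the last `overlap` tokens
    of the previous chunk, so boundary-spanning facts survive intact
    in at least one chunk."""
    assert 0 <= overlap < chunk_size
    step = chunk_size - overlap
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), step)]
```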

Overlap helps when:

  - you use fixed-size chunking, where boundaries are arbitrary and routinely cut through sentences or arguments
  - answers tend to span adjacent chunks (a claim at the end of one paragraph, its qualification at the start of the next)

Overlap hurts when:

  - chunks already respect natural boundaries, so the duplicated tokens add index size and embedding cost without preventing any loss
  - near-duplicate chunks crowd the top-k results, wasting retrieval slots on redundant text

The chunking-for-query principle

Chunks should match the unit a question retrieves, not the unit a document is written in. If users ask "what's our refund policy?" the right chunk is the refund policy paragraph, not arbitrary 1000-character slices of a policy document.

This means chunking strategy depends on the queries you expect, not just the documents you have. For many corpora, semantic or structure-aware chunking beats fixed-size because it produces chunks that look more like complete thoughts.
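One simple form of structure-aware chunking splits markdown on its headers, so each chunk is one titled section. A minimal sketch (top-level loop only; a fuller version would also track the header hierarchy):

```python
import re


def split_markdown_sections(doc: str) -> list[str]:
    """Split a markdown document at headers so each chunk is one section,
    keeping the heading with its body as retrieval context."""
    sections: list[str] = []
    current: list[str] = []
    for line in doc.splitlines():
        if re.match(r"#{1,6} ", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return sections
```

For the refund-policy example above, the "Refunds" section comes back as a single chunk, heading included, instead of being sliced mid-paragraph.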

Why the defaults are wrong

LangChain's default RecursiveCharacterTextSplitter with 1000 chars and 200 overlap is:

  - content-agnostic: the same setting for legal contracts, code, and chat transcripts
  - character-based rather than token-based, so chunk sizes drift relative to embedding-model limits
  - tuned for nothing in particular: it was never optimized for your corpus or your queries

It's a reasonable zero-config starting point. It's a terrible final answer.

Chunking strategies by content type

Different content calls for different boundaries:

  - Prose (docs, articles): paragraph or section boundaries, merged up to a token budget
  - Markdown/HTML: split on headers so each chunk maps to a titled section
  - Code: split on function and class boundaries, not line counts
  - Tables and FAQs: keep each table or Q&A pair intact as one chunk
  - Transcripts: split on speaker turns or topic shifts
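For code, boundary-aware splitting is especially cheap because the language gives you the structure. A sketch for Python source using the standard-library `ast` module (top-level definitions only; nested and module-level statements would need extra handling):

```python
import ast


def chunk_python_source(source: str) -> list[str]:
    """Split Python source on top-level function/class boundaries,
    so each chunk is one complete, syntactically whole definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: list[str] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks
```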

The experiment mindset

Chunking is one of the easiest things to A/B test in RAG. Index the same corpus with different chunking strategies, run the same eval set against both, compare retrieval metrics. Most teams never do this. The teams that do typically find their chunking baseline was leaving 20-40% of retrieval quality on the table.

Next: Fixed-size chunking.