Recursive chunking

Recursive chunking is the pragmatic middle ground: split on natural boundaries (paragraphs, then sentences, then spaces) while enforcing a size limit. It's simple, fast, and usually the best default. It's the approach behind LangChain's RecursiveCharacterTextSplitter, and because it cuts mid-sentence only as a last resort, it's a clear improvement over pure fixed-size chunking.

The algorithm

  1. Define a list of separators in order of preference: typically ["\n\n", "\n", ". ", " ", ""]
  2. Try to split the text on the first separator
  3. If any resulting piece is still larger than the target size, recursively split that piece using the next separator
  4. Continue until all pieces fit within the target size
  5. Combine adjacent pieces to approach the target size without exceeding it

The result: chunks that prefer paragraph boundaries, fall back to sentence boundaries, and only split mid-sentence as a last resort.
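The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain's implementation: the function names (recursive_split, merge_pieces, recursive_chunk) are hypothetical, size is measured in characters rather than tokens, and each separator is kept attached to the piece before it so that no text is lost.

```python
DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ", "")


def recursive_split(text, max_size, separators=DEFAULT_SEPARATORS):
    """Steps 1-4: split on the preferred separator, recursing on oversized pieces."""
    if len(text) <= max_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every max_size characters.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    parts = text.split(sep)
    # Re-attach the separator to every part except the last one.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces = []
    for part in parts:
        if part:
            # Parts that already fit come back unchanged; larger ones are
            # split again with the next separator in the list.
            pieces.extend(recursive_split(part, max_size, rest))
    return pieces


def merge_pieces(pieces, max_size):
    """Step 5: greedily combine adjacent pieces without exceeding max_size."""
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


def recursive_chunk(text, max_size, separators=DEFAULT_SEPARATORS):
    return merge_pieces(recursive_split(text, max_size, separators), max_size)


if __name__ == "__main__":
    sample = (
        "First paragraph is short.\n\n"
        "Second paragraph has two sentences. Here is the second one.\n\n"
        "Third."
    )
    for chunk in recursive_chunk(sample, max_size=40):
        print(repr(chunk))
```

Because separators stay attached, joining the chunks reproduces the input exactly, which makes the sketch easy to check. A production splitter would usually also add overlap between chunks and count tokens instead of characters.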

Why it works well as a default

The separator list matters

Default separator lists are reasonable for generic prose. For specific content types, customize:
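As an illustration of what customizing can look like, here is a sketch with separator lists for three content types. The specific lists for Markdown and Python source are assumptions chosen for this example, not recommendations from a particular library, and first_matching_separator is a hypothetical helper that only shows which boundary a recursive splitter would try first on a given text.

```python
# Illustrative separator lists, ordered from most to least preferred.
# The "markdown" and "python" entries are assumptions for this example.
SEPARATORS = {
    "prose": ["\n\n", "\n", ". ", " ", ""],
    "markdown": ["\n## ", "\n### ", "\n\n", "\n", ". ", " ", ""],
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " ", ""],
}


def first_matching_separator(text, content_type):
    """Return the most preferred separator that actually occurs in text."""
    for sep in SEPARATORS[content_type]:
        if sep == "" or sep in text:
            return sep
    return ""


if __name__ == "__main__":
    doc = "# Title\n\n## Setup\nInstall it.\n\n## Usage\nRun it."
    print(repr(first_matching_separator(doc, "markdown")))
    print(repr(first_matching_separator(doc, "prose")))
```

The idea is that a Markdown document gets split at section headings before it is ever split at blank lines, while the generic prose list would treat a heading like any other paragraph break.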

Size tuning

Same sweet spots as fixed-size:

When recursive falls short

The hybrid pattern

The strongest practical chunking strategy for most corpora:

  1. Parse the document into structural elements (headings, paragraphs, lists, tables)
  2. Apply recursive chunking within each structural element
  3. Respect element boundaries: never merge a heading into a paragraph, and never split a table across chunks

This combines the simplicity of recursive chunking with the respect for document structure that structure-aware chunking provides. See structure-aware chunking.
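A minimal sketch of the hybrid pattern, assuming the parsing step has already produced a list of typed elements. The Element class, the kind labels, and the hybrid_chunk function are hypothetical names for this example; a real pipeline would get its elements from a document parser. Headings and tables are treated as atomic here, so a table may exceed max_size rather than be split.

```python
from dataclasses import dataclass


@dataclass
class Element:
    kind: str  # "heading", "paragraph", "list", or "table"
    text: str


def split_recursive(text, max_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Compact recursive splitter: split, recurse on large parts, then merge."""
    if len(text) <= max_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard cut every max_size characters.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    parts = text.split(sep)
    # Keep each separator attached to the part that precedes it.
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces = []
    for part in parts:
        if part:
            pieces.extend(split_recursive(part, max_size, rest))
    # Greedily merge adjacent pieces back up toward max_size.
    chunks, current = [], ""
    for piece in pieces:
        if current and len(current) + len(piece) > max_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


def hybrid_chunk(elements, max_size):
    """Chunk each element separately so chunks never cross element boundaries."""
    chunks = []
    for el in elements:
        if el.kind in ("heading", "table"):
            # Atomic elements: emitted whole, even if larger than max_size.
            chunks.append((el.kind, el.text))
        else:
            chunks.extend((el.kind, c) for c in split_recursive(el.text, max_size))
    return chunks


if __name__ == "__main__":
    elements = [
        Element("heading", "Intro"),
        Element("paragraph", "Short sentence one. Short sentence two. Short sentence three."),
        Element("table", "a|b\n1|2\n" * 20),
    ]
    for kind, text in hybrid_chunk(elements, max_size=30):
        print(kind, len(text))
```

Keeping the element kind alongside each chunk is a design choice for this sketch: it lets later stages attach the nearest heading as metadata or handle oversized tables separately.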

Common mistakes

Next: Structure-aware chunking.