Chunking code

Code has syntax and structure that prose doesn't. Fixed-size chunking of a Python file can split a function in the middle, which makes both halves useless for retrieval. Chunking code needs its own rules, and the rules are straightforward once you accept that code is not prose.

The principle

Code chunks should align with semantic units of the language:

Never split a function across two chunks. Never put half a class definition in one chunk and half in another.
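
The principle can be sketched with Python's standard ast module, splitting only at top-level definition boundaries (a minimal illustration, not a production chunker; the helper name and sample source are my own):

```python
import ast

def chunk_by_top_level_defs(source: str) -> list[str]:
    """Split Python source into chunks, one per top-level definition.

    Uses the AST's line numbers so a function or class is never split
    in the middle.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = '''\
import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''
chunks = chunk_by_top_level_defs(src)
# Two chunks: the whole function and the whole class, never half of either.
```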

Language-aware parsers

Tree-sitter

Industry-standard incremental parser with grammars for 100+ languages. Produces concrete syntax trees from source code. The backbone of most serious code-chunking implementations.

LangChain CodeSplitter

Uses tree-sitter under the hood for common languages. Works for prototyping.

Language-specific AST tools
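
For Python specifically, the standard library's ast module is one such tool, and it is enough for many chunking tasks without an external parser. A sketch (sample source is illustrative) that lists each top-level definition with its line span and docstring:

```python
import ast

src = '''\
def process_refund(order_id, amount):
    """Refund a paid order."""
    ...

class PaymentProcessor:
    """Coordinates payment providers."""
'''

tree = ast.parse(src)
# (name, start line, end line, docstring) for every top-level definition
defs = [
    (node.name, node.lineno, node.end_lineno, ast.get_docstring(node))
    for node in tree.body
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
]
# defs[0] -> ('process_refund', 1, 3, 'Refund a paid order.')
```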

The chunking strategy by code type

Small file (< 500 tokens)

One chunk for the whole file.

Medium file (500-2000 tokens)

One chunk per top-level definition (function, class, or constant group). Keep imports and file-level context attached to the first chunk.

Large file (> 2000 tokens)

Split by class or function. For large classes, one chunk per method with class signature prepended as context.
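
The per-method strategy can be sketched with Python's stdlib ast; chunk_large_class is a hypothetical helper, and the "..." line stands in for the elided rest of the class:

```python
import ast

def chunk_large_class(source: str) -> list[str]:
    """One chunk per method, each prefixed with the class signature line.

    Sketch of the large-file strategy; assumes top-level class definitions.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            header = lines[node.lineno - 1]  # e.g. "class PaymentProcessor:"
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    body = "\n".join(lines[item.lineno - 1:item.end_lineno])
                    chunks.append(f"{header}\n    ...\n{body}")
    return chunks

src = '''\
class PaymentProcessor:
    def charge(self, amount):
        return amount

    def refund(self, amount):
        return -amount
'''
chunks = chunk_large_class(src)
# Each chunk carries the class signature, so "self" has visible context.
```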

Monolithic function (rare but real)

If a single function exceeds the chunk size, split it by logical blocks while respecting block boundaries (brace or indentation matching). This is a code smell anyway; surface it in documentation.
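
A minimal fallback splitter for this case, assuming Python-style source where blank lines separate logical blocks; split_long_function and the line-count budget are illustrative stand-ins for a real token budget:

```python
def split_long_function(body_lines: list[str], max_lines: int = 40) -> list[list[str]]:
    """Fallback splitter for a single oversized function.

    Breaks only at blank lines (a rough proxy for logical block
    boundaries), so indented blocks are never cut mid-statement.
    """
    chunks, current = [], []
    for line in body_lines:
        current.append(line)
        # Only split at a blank line, and only once the budget is reached
        if not line.strip() and len(current) >= max_lines:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

parts = split_long_function(["a", "b", "c", "", "d", "e", "f", "", "g"], max_lines=4)
# Three chunks, each ending at a blank-line boundary (plus the remainder).
```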

Context enrichment

Each code chunk benefits from surrounding context: the file path, module docstring, enclosing class, and relevant imports.

Pattern: prepend a context header to each chunk:

# File: services/payments/processor.py
# Module docstring: Handles payment processing and refunds.
# Class: PaymentProcessor
# Imports: stripe, pydantic, ..repositories.orders

def process_refund(self, order_id: str, amount: Decimal) -> RefundResult:
    ...
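
A header like the one above can be assembled from the parsed module; context_header is a hypothetical helper, and the caller supplies the file path and enclosing class name:

```python
import ast

def context_header(path, source, class_name=None):
    """Build a context header from parsed source.

    Sketch: pulls the module docstring and import names via ast;
    the path and class name come from the caller.
    """
    tree = ast.parse(source)
    imports = []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports += [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.level counts leading dots in a relative import
            imports.append("." * node.level + (node.module or ""))
    lines = [f"# File: {path}"]
    doc = ast.get_docstring(tree)
    if doc:
        lines.append(f"# Module docstring: {doc.splitlines()[0]}")
    if class_name:
        lines.append(f"# Class: {class_name}")
    if imports:
        lines.append(f"# Imports: {', '.join(imports)}")
    return "\n".join(lines)

src = '''\
"""Handles payment processing and refunds."""
import stripe
from ..repositories import orders
'''
header = context_header("services/payments/processor.py", src, "PaymentProcessor")
```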

Embedding models for code

General-purpose text embeddings work on code, but not optimally; models trained specifically on code handle identifiers, syntax, and code-to-natural-language matching better.

For systems that primarily retrieve code, a code-specific embedding model typically gives 15-30% better retrieval quality.

Code + docs hybrid

Most documentation RAG systems need to retrieve across code AND prose:

Strategy: chunk prose with prose rules and code with code rules, then embed everything into the same index with element_type metadata. At retrieval, optionally filter or boost by type depending on query intent.

Code search use case specifics

If the use case is "find the function that does X" (semantic code search), chunk at function granularity so each retrieved result is a complete, self-contained definition.
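
One common approach (an assumption here, not prescribed above) is to embed each function's signature plus docstring, the part that reads most like a natural-language query, and keep the full body as the retrieval payload:

```python
import ast

def embedding_text(func_source: str) -> str:
    """Build the text to embed for semantic code search over one function.

    Sketch: signature plus first docstring, with the full source kept
    separately as the payload to return on a hit.
    """
    node = ast.parse(func_source).body[0]
    signature = func_source.splitlines()[node.lineno - 1].strip()
    doc = ast.get_docstring(node) or ""
    return f"{signature} {doc}".strip()

src = '''\
def process_refund(order_id, amount):
    """Refund a captured payment back to the original card."""
    ...
'''
text = embedding_text(src)
# "def process_refund(order_id, amount): Refund a captured payment back to the original card."
```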

The common mistakes

Fixed-size splitting that cuts a function in half, stripping the imports and file-level context a chunk needs to be interpretable on its own, and applying the same chunking rules to code and prose.

Next: What embeddings are.