Code has syntax and structure that prose doesn't. Fixed-size chunking of a Python file can split a function in the middle, which makes both halves useless for retrieval. Chunking code needs its own rules, and the rules are straightforward once you accept that code is not prose.
Code chunks should align with semantic units of the language:
Never split a function across two chunks. Never put half a class definition in one chunk and half in another.
Industry-standard incremental parser with grammars for 100+ languages. Produces concrete syntax trees from source code. The backbone of most serious code-chunking implementations.
Uses tree-sitter under the hood for common languages. Works for prototyping.
ast module@babel/parser, tsc's compiler APIgo/astsynOne chunk for the whole file.
One chunk per top-level definition (function, class, module). Keep imports and file-level context attached to the first chunk.
Split by class or function. For large classes, one chunk per method with class signature prepended as context.
If a single function exceeds chunk size, split by logical blocks (while respecting brace matching). This is a code smell anyway, surface it in documentation.
Each code chunk benefits from surrounding context:
Pattern: prepend a context header to each chunk:
# File: services/payments/processor.py
# Module docstring: Handles payment processing and refunds.
# Class: PaymentProcessor
# Imports: stripe, pydantic, ..repositories.orders
def process_refund(self, order_id: str, amount: Decimal) -> RefundResult:
...
General-purpose text embeddings work for code but not optimally. Code-specific embedding models include:
For systems that primarily retrieve code, a code-specific embedding model typically gives 15-30% better retrieval quality.
Most documentation RAG systems need to retrieve across code AND prose:
Strategy: chunk prose and code with their respective strategies, embed everything into the same index with element_type metadata. At retrieval, you can optionally filter or boost by type depending on query intent.
If the use case is "find the function that does X" (semantic code search):
Next: What embeddings are.