Chunking code

Code has syntax and structure that prose doesn't. Fixed-size chunking of a Python file can split a function in the middle, which makes both halves useless for retrieval. Chunking code needs its own rules, and the rules are straightforward once you accept that code is not prose.

The principle

Code chunks should align with semantic units of the language:

Never split a function across two chunks. Never put half a class definition in one chunk and half in another.
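
The principle can be sketched with Python's standard ast module, splitting only at top-level definition boundaries (a minimal illustration, not a production chunker; the helper name and sample source are my own):

```python
import ast

def chunk_by_top_level_defs(source: str) -> list[str]:
    """Split Python source into chunks, one per top-level definition.

    Uses the AST's line numbers so a function or class is never split
    in the middle.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

src = '''\
import math

def area(r):
    return math.pi * r * r

class Circle:
    def __init__(self, r):
        self.r = r
'''
chunks = chunk_by_top_level_defs(src)
# Two chunks: the whole function and the whole class, never half of either.
```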

Language-aware parsers

Tree-sitter

Industry-standard incremental parser with grammars for 100+ languages. Produces concrete syntax trees from source code. The backbone of most serious code-chunking implementations.

LangChain CodeSplitter

Uses tree-sitter under the hood for common languages. Works for prototyping.

Language-specific AST tools
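
For Python specifically, the standard library's ast module is one such tool, and it is enough for many chunking tasks without an external parser. A sketch (sample source is illustrative) that lists each top-level definition with its line span and docstring:

```python
import ast

src = '''\
def process_refund(order_id, amount):
    """Refund a paid order."""
    ...

class PaymentProcessor:
    """Coordinates payment providers."""
'''

tree = ast.parse(src)
# (name, start line, end line, docstring) for every top-level definition
defs = [
    (node.name, node.lineno, node.end_lineno, ast.get_docstring(node))
    for node in tree.body
    if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
]
# defs[0] -> ('process_refund', 1, 3, 'Refund a paid order.')
```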

The chunking strategy by code type

Small file (< 500 tokens)

One chunk for the whole file.

Medium file (500-2000 tokens)

One chunk per top-level definition (function, class, or constant group). Keep imports and file-level context attached to the first chunk.

Large file (> 2000 tokens)

Split by class or function. For large classes, one chunk per method with class signature prepended as context.
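
The per-method strategy can be sketched with Python's stdlib ast; chunk_large_class is a hypothetical helper, and the "..." line stands in for the elided rest of the class:

```python
import ast

def chunk_large_class(source: str) -> list[str]:
    """One chunk per method, each prefixed with the class signature line.

    Sketch of the large-file strategy; assumes top-level class definitions.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            header = lines[node.lineno - 1]  # e.g. "class PaymentProcessor:"
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    body = "\n".join(lines[item.lineno - 1:item.end_lineno])
                    chunks.append(f"{header}\n    ...\n{body}")
    return chunks

src = '''\
class PaymentProcessor:
    def charge(self, amount):
        return amount

    def refund(self, amount):
        return -amount
'''
chunks = chunk_large_class(src)
# Each chunk carries the class signature, so "self" has visible context.
```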

Monolithic function (rare but real)

If a single function exceeds the chunk size, split it by logical blocks while respecting block boundaries (brace or indentation matching). This is a code smell anyway; surface it in documentation.
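
A minimal fallback splitter for this case, assuming Python-style source where blank lines separate logical blocks; split_long_function and the line-count budget are illustrative stand-ins for a real token budget:

```python
def split_long_function(body_lines: list[str], max_lines: int = 40) -> list[list[str]]:
    """Fallback splitter for a single oversized function.

    Breaks only at blank lines (a rough proxy for logical block
    boundaries), so indented blocks are never cut mid-statement.
    """
    chunks, current = [], []
    for line in body_lines:
        current.append(line)
        # Only split at a blank line, and only once the budget is reached
        if not line.strip() and len(current) >= max_lines:
            chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks

parts = split_long_function(["a", "b", "c", "", "d", "e", "f", "", "g"], max_lines=4)
# Three chunks, each ending at a blank-line boundary (plus the remainder).
```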

Context enrichment

Each code chunk benefits from surrounding context: the file path, module docstring, enclosing class, and relevant imports.

Pattern: prepend a context header to each chunk:

# File: services/payments/processor.py
# Module docstring: Handles payment processing and refunds.
# Class: PaymentProcessor
# Imports: stripe, pydantic, ..repositories.orders

def process_refund(self, order_id: str, amount: Decimal) -> RefundResult:
    ...
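
A header like the one above can be assembled from the parsed module; context_header is a hypothetical helper, and the caller supplies the file path and enclosing class name:

```python
import ast

def context_header(path, source, class_name=None):
    """Build a context header from parsed source.

    Sketch: pulls the module docstring and import names via ast;
    the path and class name come from the caller.
    """
    tree = ast.parse(source)
    imports = []
    for node in tree.body:
        if isinstance(node, ast.Import):
            imports += [a.name for a in node.names]
        elif isinstance(node, ast.ImportFrom):
            # node.level counts leading dots in a relative import
            imports.append("." * node.level + (node.module or ""))
    lines = [f"# File: {path}"]
    doc = ast.get_docstring(tree)
    if doc:
        lines.append(f"# Module docstring: {doc.splitlines()[0]}")
    if class_name:
        lines.append(f"# Class: {class_name}")
    if imports:
        lines.append(f"# Imports: {', '.join(imports)}")
    return "\n".join(lines)

src = '''\
"""Handles payment processing and refunds."""
import stripe
from ..repositories import orders
'''
header = context_header("services/payments/processor.py", src, "PaymentProcessor")
```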

Embedding models for code

General-purpose text embeddings work on code, but not optimally; models trained specifically on code handle identifiers, syntax, and code-to-natural-language matching better.

For systems that primarily retrieve code, a code-specific embedding model typically gives 15-30% better retrieval quality.

Code + docs hybrid

Most documentation RAG systems need to retrieve across code AND prose:

Strategy: chunk prose with prose rules and code with code rules, then embed everything into the same index with element_type metadata. At retrieval, optionally filter or boost by type depending on query intent.

Code search use case specifics

If the use case is "find the function that does X" (semantic code search), chunk at function granularity so each retrieved result is a complete, self-contained definition.
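
One common approach (an assumption here, not prescribed above) is to embed each function's signature plus docstring, the part that reads most like a natural-language query, and keep the full body as the retrieval payload:

```python
import ast

def embedding_text(func_source: str) -> str:
    """Build the text to embed for semantic code search over one function.

    Sketch: signature plus first docstring, with the full source kept
    separately as the payload to return on a hit.
    """
    node = ast.parse(func_source).body[0]
    signature = func_source.splitlines()[node.lineno - 1].strip()
    doc = ast.get_docstring(node) or ""
    return f"{signature} {doc}".strip()

src = '''\
def process_refund(order_id, amount):
    """Refund a captured payment back to the original card."""
    ...
'''
text = embedding_text(src)
# "def process_refund(order_id, amount): Refund a captured payment back to the original card."
```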

The common mistakes

Fixed-size splitting that cuts a function in half, stripping the imports and file-level context a chunk needs to be interpretable on its own, and applying the same chunking rules to code and prose.

Next: What embeddings are.