RAG over codebases powers modern coding assistants, internal code search, and automated code generation. Code has different structure, different retrieval patterns, and different quality metrics than prose. Here's what's specific to code.
A code RAG corpus typically includes:
Chunk at function/class/file-level boundaries, not by fixed size. See chunking code.
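As a minimal sketch of boundary-based chunking, the stdlib `ast` module can split a Python file into one chunk per top-level definition. (Real chunkers are typically tree-sitter based so they work across languages; this illustrative version is Python-only.)

```python
import ast

def chunk_by_definition(source: str) -> list[dict]:
    """Split a Python module into one chunk per top-level function/class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Extract the exact source text of the definition
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk carries the symbol name and line span, which also doubles as metadata for retrieval.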
General text embeddings work but code-specific embeddings work better:
These models are trained on code and understand syntax, function patterns, and common abstractions better.
Rich metadata helps code retrieval:
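One way to picture this is a per-chunk metadata record. The fields below are purely illustrative assumptions, not a prescribed schema; the point is that repo, path, symbol, and modification time all become filterable or boostable signals at query time.

```python
from dataclasses import dataclass, field

@dataclass
class CodeChunkMetadata:
    # Hypothetical fields; adapt to whatever your indexer can extract.
    repo: str                    # enables per-repo filtering
    path: str                    # file path within the repo
    language: str                # e.g. "python", "typescript"
    symbol: str                  # function/class name for identifier search
    kind: str                    # "function" | "class" | "file"
    imports: list[str] = field(default_factory=list)  # for context expansion
    last_modified: float = 0.0   # unix timestamp, used for recency boosts

meta = CodeChunkMetadata(
    repo="service-a", path="auth/session.py", language="python",
    symbol="validate_token", kind="function",
    imports=["jwt"], last_modified=1_700_000_000.0,
)
```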
Initial query retrieves a function. Expand by pulling in:
Provides rich context for the generator without requiring the user to spell it all out.
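The expansion step above can be sketched as a bounded breadth-first walk over a precomputed reference graph. The graph structure here (symbol to the symbols it references) is an assumption; in practice it would come from static analysis of imports, call sites, and type usage.

```python
def expand_context(seed: str, ref_graph: dict[str, list[str]],
                   max_hops: int = 1) -> set[str]:
    """From a retrieved symbol, pull in the symbols it references,
    up to max_hops away, without revisiting anything."""
    frontier, seen = {seed}, {seed}
    for _ in range(max_hops):
        frontier = {ref for sym in frontier
                    for ref in ref_graph.get(sym, [])} - seen
        seen |= frontier
    return seen
```

Capping `max_hops` keeps the expanded context from swallowing the whole codebase; one or two hops is usually enough to cover callees, imported types, and sibling helpers.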
For large multi-repo environments, filter by repo. User working on service-A usually wants service-A code first.
Recent code is more likely to be correct, relevant, and well-understood. Boost recently modified code.
BM25 is critical for code:
Dense vectors handle semantic queries ("function that validates email addresses"). BM25 handles specific identifier queries ("validateEmail").
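A common, simple way to combine the two rankings is reciprocal-rank fusion (RRF); this sketch assumes each retriever returns an ordered list of chunk IDs, and uses the conventional `k=60` constant.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Merge a dense (semantic) ranking and a BM25 (identifier) ranking.
    Each document scores 1/(k + rank) per list it appears in, so items
    ranked well by either retriever surface near the top."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between retrievers, which is why it tends to be the default for hybrid code search.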
User is in a file; assistant needs context from related files. Retrieve related code and pass as context. This is roughly how GitHub Copilot's broader workspace context works.
User asks "how does our auth flow work?" Retrieve relevant code + docs, generate explanation, cite specific functions and files.
User wants a new feature similar to existing ones. Retrieve similar existing code as examples, generate new code in the same style.
Code RAG quality is more concrete than prose RAG:
Ground-truth evaluation is easier because correctness is more objective.
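One concrete, objective signal along these lines: check whether generated Python even parses. It is a weak check compared to running the project's test suite, but it is cheap and fully automatic.

```python
def compiles(snippet: str) -> bool:
    """Return True if the snippet is syntactically valid Python.
    A minimal eval signal; passing the repo's tests is the stronger one."""
    try:
        compile(snippet, "<generated>", "exec")
        return True
    except SyntaxError:
        return False
```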
Code RAG lives in IDEs:
Latency is tight: sub-500ms ideal for inline completions, 2-3 seconds acceptable for chat interactions.
Codebases change rapidly. Stale indexes serve stale code. Patterns:
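One common freshness pattern is hash-based change detection: store a content hash per indexed file and re-embed only what changed. This is a minimal sketch; real systems often hook git commits or filesystem watchers instead of rescanning.

```python
import hashlib

def changed_files(indexed_hashes: dict[str, str],
                  working_tree: dict[str, str]) -> dict[str, str]:
    """Compare stored content hashes against the current working tree
    (path -> file text) and return only files that need re-embedding,
    mapped to their new hash."""
    updated = {}
    for path, text in working_tree.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(path) != digest:
            updated[path] = digest
    return updated
```

Deletions would be handled symmetrically: any path in `indexed_hashes` missing from the working tree gets evicted from the index.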
Code is often the most sensitive data a company has. Many organizations run code RAG entirely on-prem or in VPC:
The compliance story is simpler with full isolation.
Code RAG with good retrieval enables downstream agents:
These higher-order tasks compound the value of the underlying retrieval system. Good retrieval is the foundation; capabilities stack on top.
Next: Legal and compliance RAG.