Code search + generation RAG

RAG over codebases powers modern coding assistants, internal code search, and automated code generation. Code has different structure, retrieval patterns, and quality metrics from prose. Here's what's specific.

What code RAG does

The content

A code RAG corpus typically includes source files, tests, READMEs and internal docs, and API references.

Chunking code

Chunk at function, class, or file boundaries, not by fixed size. See chunking code.
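As a sketch of boundary-based chunking (Python-only here; multi-language chunkers typically use a parser such as tree-sitter), the standard-library `ast` module can split a file at top-level function and class definitions:

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into function/class-level chunks.

    A minimal sketch: a production chunker would also handle nested
    definitions, decorator lines, module docstrings, and other languages.
    """
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno are 1-based and inclusive (Python 3.8+)
            text = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({"name": node.name, "kind": type(node).__name__, "text": text})
    return chunks
```

Each chunk carries the symbol name, which doubles as retrieval metadata.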

Embeddings for code

General text embeddings work, but code-specific embeddings work better.

Models trained on code understand syntax, naming conventions, and common abstractions that general text models miss.

Metadata for code

Rich metadata helps code retrieval: repository, file path, language, symbol names, and last-modified time are all useful filter and ranking signals.

Retrieval patterns

Query-then-expand

Initial query retrieves a function. Expand by pulling in its callees, its callers, imported definitions, and related tests.

This provides rich context for the generator without requiring the user to spell it all out.
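A minimal sketch of the expand step, assuming the index stores a pre-built call graph. The data shapes here (`call_graph` mapping a symbol to its callers/callees, `chunks` mapping symbols to source text) are illustrative assumptions, not a fixed API:

```python
def expand_context(hit: str,
                   call_graph: dict[str, dict[str, list[str]]],
                   chunks: dict[str, str],
                   budget: int = 4) -> list[str]:
    """Query-then-expand: start from the retrieved function and pull in
    its callees and callers from a pre-built call graph, up to a budget.
    """
    names = [hit]
    edges = call_graph.get(hit, {})
    for neighbor in edges.get("callees", []) + edges.get("callers", []):
        if neighbor not in names and len(names) < budget:
            names.append(neighbor)
    # Return source text for every expanded symbol that exists in the index
    return [chunks[n] for n in names if n in chunks]
```

The budget cap matters in practice: call graphs fan out quickly, and the generator's context window is finite.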

Repository-aware retrieval

For large multi-repo environments, filter by repository: a user working on service-A usually wants service-A code first.

Recency bias

Recent code is more likely to be correct, relevant, and well-understood. Boost recently modified code.

Hybrid retrieval essential

BM25 is critical for code: queries frequently contain exact identifiers that dense retrieval blurs.

Dense vectors handle semantic queries ("function that validates email addresses"). BM25 handles specific identifier queries ("validateEmail").
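One common way to combine the two is reciprocal-rank fusion (RRF), which needs only the two ranked lists, not score normalization across systems:

```python
def rrf_fuse(bm25_ranking: list[str], dense_ranking: list[str],
             k: int = 60) -> list[str]:
    """Merge BM25 and dense rankings with reciprocal-rank fusion.

    Each document earns 1 / (k + rank) per list it appears in; k=60 is
    the commonly used default. Returns doc ids best-first.
    """
    scores: dict[str, float] = {}
    for ranking in (bm25_ranking, dense_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort doc ids by fused score, highest first
    return sorted(scores, key=scores.get, reverse=True)
```

Documents ranked well by both retrievers rise to the top, which is exactly the behavior you want when an identifier query also has semantic matches.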

Generation patterns

Code completion with context

The user is in a file; the assistant needs context from related files. Retrieve related code and pass it as context. This is roughly how GitHub Copilot's broader workspace context works.

Answer technical questions with citations

User asks "how does our auth flow work?" Retrieve relevant code + docs, generate explanation, cite specific functions and files.

Generate code using examples

The user wants a new feature similar to existing ones. Retrieve similar existing code as examples and generate new code in the same style.
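A sketch of the prompt-assembly step, assuming retrieval has already returned stylistically similar chunks. The character budget stands in for real token counting, and the prompt framing is illustrative, not a fixed template:

```python
def build_generation_prompt(task: str, retrieved_examples: list[str],
                            max_chars: int = 6000) -> str:
    """Assemble a few-shot prompt from retrieved similar code."""
    header = "Generate code for this repository. Match the style of the examples.\n"
    parts = [header]
    used = len(header)
    for i, example in enumerate(retrieved_examples, start=1):
        block = f"\n# Example {i}\n{example}\n"
        if used + len(block) > max_chars:
            break  # stop adding examples once the budget is exhausted
        parts.append(block)
        used += len(block)
    parts.append(f"\n# Task\n{task}\n")
    return "".join(parts)
```

Putting the task last keeps it adjacent to the completion, which most generation models handle better than burying it among examples.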

Evaluation for code RAG

Code RAG quality is more concrete than prose RAG: generated code can be compiled, type-checked, and run against tests.

Ground-truth evaluation is easier because correctness is more objective.
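Because completions can be executed, functional-correctness metrics apply. A widely used one is pass@k: the probability that at least one of k sampled completions passes the tests, estimated without bias from n samples of which c passed:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator.

    n = completions sampled per problem, c = how many passed the tests,
    k = the sample budget being scored. Computes the probability that a
    draw of k completions (without replacement) contains a passing one.
    """
    if n - c < k:
        return 1.0  # fewer than k failing samples exist, so some draw must pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a RAG system, comparing pass@k with and without retrieved context directly measures how much retrieval helps generation.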

IDE integration

Code RAG lives in IDEs, primarily as inline completions and chat interactions.

Latency budgets are tight: sub-500 ms is ideal for inline completions; 2-3 seconds is acceptable for chat interactions.

The "outdated code" problem

Codebases change rapidly, and stale indexes serve stale code. Common mitigations include incremental re-indexing triggered by commits or file saves, and pruning chunks whose source files were deleted.
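A minimal sketch of incremental re-indexing by content hash. The data shapes are assumptions: `indexed_hashes` maps path to the sha256 recorded at index time, `working_tree` maps path to current text. Real systems often key off git commits instead:

```python
import hashlib

def files_to_reindex(indexed_hashes: dict[str, str],
                     working_tree: dict[str, str]) -> tuple[set[str], set[str]]:
    """Diff the stored index against the current files by content hash.

    Returns (paths to re-chunk and re-embed, paths whose chunks should
    be deleted from the index).
    """
    changed = set()
    for path, text in working_tree.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if indexed_hashes.get(path) != digest:
            changed.add(path)  # new file, or content differs from indexed copy
    deleted = set(indexed_hashes) - set(working_tree)
    return changed, deleted
```

Only the `changed` set needs embedding work, which keeps re-index cost proportional to churn rather than repository size.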

Security considerations

Self-hosted is common

Code is often the most sensitive data a company has. Many organizations run code RAG entirely on-prem or in a VPC, self-hosting the embedding model, the vector store, and the generation model.

The compliance story is simpler with full isolation.

The compound quality benefit

Code RAG with good retrieval enables downstream agents: automated bug fixing, test generation, and multi-file refactoring all depend on finding the right code first.

These higher-order tasks compound the value of the underlying retrieval system. Good retrieval is the foundation; capabilities stack on top.

Next: Legal and compliance RAG.