RAG over codebases powers modern coding assistants, internal code search, and automated code generation. Code has different structure, different retrieval patterns, and different quality metrics than prose. Here's what's specific to code.
A code RAG corpus typically includes:
Chunk at function/class/file-level boundaries, not by fixed size. See chunking code.
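As a minimal sketch of boundary-based chunking, the stdlib `ast` module can split a Python file into one chunk per top-level definition. (Real chunkers are typically tree-sitter based so they work across languages; this illustrative version is Python-only.)

```python
import ast

def chunk_by_definition(source: str) -> list[dict]:
    """Split a Python module into one chunk per top-level function/class."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Extract the exact source text of the definition
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk carries the symbol name and line span, which also doubles as metadata for retrieval.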
General text embeddings work but code-specific embeddings work better:
These models are trained on code and understand syntax, function patterns, and common abstractions better.
Rich metadata helps code retrieval:
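One way to picture this is a per-chunk metadata record. The fields below are purely illustrative assumptions, not a prescribed schema; the point is that repo, path, symbol, and modification time all become filterable or boostable signals at query time.

```python
from dataclasses import dataclass, field

@dataclass
class CodeChunkMetadata:
    # Hypothetical fields; adapt to whatever your indexer can extract.
    repo: str                    # enables per-repo filtering
    path: str                    # file path within the repo
    language: str                # e.g. "python", "typescript"
    symbol: str                  # function/class name for identifier search
    kind: str                    # "function" | "class" | "file"
    imports: list[str] = field(default_factory=list)  # for context expansion
    last_modified: float = 0.0   # unix timestamp, used for recency boosts

meta = CodeChunkMetadata(
    repo="service-a", path="auth/session.py", language="python",
    symbol="validate_token", kind="function",
    imports=["jwt"], last_modified=1_700_000_000.0,
)
```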
Initial query retrieves a function. Expand by pulling in:
Provides rich context for the generator without requiring the user to spell it all out.
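The expansion step above can be sketched as a bounded breadth-first walk over a precomputed reference graph. The graph structure here (symbol to the symbols it references) is an assumption; in practice it would come from static analysis of imports, call sites, and type usage.

```python
def expand_context(seed: str, ref_graph: dict[str, list[str]],
                   max_hops: int = 1) -> set[str]:
    """From a retrieved symbol, pull in the symbols it references,
    up to max_hops away, without revisiting anything."""
    frontier, seen = {seed}, {seed}
    for _ in range(max_hops):
        frontier = {ref for sym in frontier
                    for ref in ref_graph.get(sym, [])} - seen
        seen |= frontier
    return seen
```

Capping `max_hops` keeps the expanded context from swallowing the whole codebase; one or two hops is usually enough to cover callees, imported types, and sibling helpers.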
For large multi-repo environments, filter by repo. User working on service-A usually wants service-A code first.
Recent code is more likely to be correct, relevant, and well-understood. Boost recently modified code.
BM25 is critical for code:
Dense vectors handle semantic queries ("function that validates email addresses"). BM25 handles specific identifier queries ("validateEmail").
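A common, simple way to combine the two rankings is reciprocal-rank fusion (RRF); this sketch assumes each retriever returns an ordered list of chunk IDs, and uses the conventional `k=60` constant.

```python
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    """Merge a dense (semantic) ranking and a BM25 (identifier) ranking.
    Each document scores 1/(k + rank) per list it appears in, so items
    ranked well by either retriever surface near the top."""
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF needs no score normalization between retrievers, which is why it tends to be the default for hybrid code search.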
User is in a file; assistant needs context from related files. Retrieve related code and pass as context. This is roughly how GitHub Copilot's broader workspace context works.
User asks "how does our auth flow work?" Retrieve relevant code + docs, generate explanation, cite specific functions and files.
User wants a new feature similar to existing ones. Retrieve similar existing code as examples, generate new code in the same style.
Code RAG quality is more concrete than prose RAG:
Ground-truth evaluation is easier because correctness is more objective.
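One concrete, objective signal along these lines: check whether generated Python even parses. It is a weak check compared to running the project's test suite, but it is cheap and fully automatic.

```python
def compiles(snippet: str) -> bool:
    """Return True if the snippet is syntactically valid Python.
    A minimal eval signal; passing the repo's tests is the stronger one."""
    try:
        compile(snippet, "<generated>", "exec")
        return True
    except SyntaxError:
        return False
```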
Code RAG lives in IDEs:
Latency is tight: sub-500ms ideal for inline completions, 2-3 seconds acceptable for chat interactions.
Codebases change rapidly. Stale indexes serve stale code. Patterns:
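One common freshness pattern is hash-based change detection: store a content hash per indexed file and re-embed only what changed. This is a minimal sketch; real systems often hook git commits or filesystem watchers instead of rescanning.

```python
import hashlib

def changed_files(indexed_hashes: dict[str, str],
                  working_tree: dict[str, str]) -> dict[str, str]:
    """Compare stored content hashes against the current working tree
    (path -> file text) and return only files that need re-embedding,
    mapped to their new hash."""
    updated = {}
    for path, text in working_tree.items():
        digest = hashlib.sha256(text.encode()).hexdigest()
        if indexed_hashes.get(path) != digest:
            updated[path] = digest
    return updated
```

Deletions would be handled symmetrically: any path in `indexed_hashes` missing from the working tree gets evicted from the index.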
Code is often the most sensitive data a company has. Many organizations run code RAG entirely on-prem or in VPC:
The compliance story is simpler with full isolation.
Code RAG with good retrieval enables downstream agents:
These higher-order tasks compound the value of the underlying retrieval system. Good retrieval is the foundation; capabilities stack on top.
Next: Legal and compliance RAG.