Code search + generation RAG
📖 4 min read · Updated 2026-04-18
RAG over codebases powers modern coding assistants, internal code search, and automated code generation. Code has different structure, different retrieval patterns, and different quality metrics than prose. Here's what's specific to code.
What code RAG does
- Semantic code search: "find the function that handles refunds"
- Code generation with context: "write a function like this one but for products"
- Documentation generation: generate API docs from code + comments
- Debugging assistance: "why is this failing" with relevant code pulled in
- Refactoring helpers: find all usages, suggest migration paths
The content
A code RAG corpus typically includes:
- Source code (primary)
- Docstrings and inline comments
- README and architecture docs
- Commit messages
- Test files (show how code is used)
- Issue tracker content
- PR descriptions
Chunking code
Function/class/file-level boundaries, not size-based. See chunking code.
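For Python sources, the standard-library ast module gives you those boundaries directly; a minimal sketch (the chunk shape is illustrative):

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Split a Python file into function/class-level chunks.

    Falls back to a single whole-file chunk when parsing fails,
    e.g. for partial or broken code.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return [{"name": "<file>", "start": 1, "code": source}]

    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # end_lineno is populated by the parser on Python 3.8+
            code = "\n".join(lines[node.lineno - 1 : node.end_lineno])
            chunks.append({"name": node.name, "start": node.lineno, "code": code})
    return chunks

src = "def add(a, b):\n    return a + b\n\nclass Cart:\n    def total(self):\n        return 0\n"
print([c["name"] for c in chunk_python_source(src)])  # ['add', 'Cart']
```

Other languages need their own parsers (tree-sitter is a common choice), but the principle is the same: chunk at syntactic boundaries, not byte offsets.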
Embeddings for code
General text embeddings work but code-specific embeddings work better:
- Voyage Code 3
- jina-embeddings-v2-base-code
- nomic-embed-code
These models are trained on code and better capture syntax, function patterns, and common abstractions.
Metadata for code
Rich metadata helps code retrieval:
- File path (often encodes module/component)
- Language
- Function signature
- Imports used
- Class (if method)
- Last modified timestamp
- Recent authors
- Test coverage (if available)
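Concretely, each indexed chunk can carry these fields alongside its embedding; a sketch of one possible record shape (the schema and field names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class CodeChunk:
    """Hypothetical chunk record mirroring the metadata fields above."""
    code: str
    file_path: str                      # often encodes module/component
    language: str
    signature: Optional[str] = None
    parent_class: Optional[str] = None  # set when the chunk is a method
    imports: list[str] = field(default_factory=list)
    last_modified: Optional[str] = None  # ISO-8601 timestamp
    authors: list[str] = field(default_factory=list)

chunk = CodeChunk(
    code="def refund(order): ...",
    file_path="billing/refunds.py",
    language="python",
    signature="refund(order)",
)
# The component can often be inferred from the path for filtering:
print(chunk.file_path.split("/")[0])  # billing
```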
Retrieval patterns
Query-then-expand
Initial query retrieves a function. Expand by pulling in:
- Functions it calls
- Functions that call it
- Its tests
- Related docs
Provides rich context for the generator without requiring the user to spell it all out.
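A sketch of the expansion step, assuming a call graph and test map have already been extracted from the repo by static analysis (all names here are hypothetical):

```python
# Hypothetical pre-computed structures: who calls whom, and which tests
# exercise which functions. In practice these come from static analysis.
CALLS = {"refund": ["charge_gateway", "log_event"], "checkout": ["refund"]}
TESTS = {"refund": ["tests/test_refund.py"]}

def expand(function: str) -> dict:
    """Given one retrieved function, pull in its callees, callers, and tests."""
    callees = CALLS.get(function, [])
    callers = [f for f, called in CALLS.items() if function in called]
    return {
        "function": function,
        "callees": callees,
        "callers": callers,
        "tests": TESTS.get(function, []),
    }

ctx = expand("refund")
print(ctx["callers"])  # ['checkout']
```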
Repository-aware retrieval
For large multi-repo environments, filter by repo. User working on service-A usually wants service-A code first.
Recency bias
Recent code is more likely to be correct, relevant, and well-understood. Boost recently modified code.
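One simple way to implement the boost is exponential decay with a tunable half-life; a sketch (the 30-day default is an assumption, not a recommendation):

```python
def recency_boost(score: float, age_days: float, half_life_days: float = 30.0) -> float:
    """Halve a chunk's retrieval score for every half_life_days of age."""
    return score * 0.5 ** (age_days / half_life_days)

print(recency_boost(1.0, 0))   # 1.0
print(recency_boost(1.0, 30))  # 0.5
```

Apply it at re-ranking time, after retrieval, so the decay parameters can be tuned without re-indexing.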
Hybrid retrieval essential
BM25 is critical for code:
- Function names are specific identifiers that BM25 matches exactly
- Error messages are often copy-pasted verbatim
- API names, config keys, specific strings
Dense vectors handle semantic queries ("function that validates email addresses"). BM25 handles specific identifier queries ("validateEmail").
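A common way to combine the two signals is reciprocal rank fusion; a sketch (k=60 is the conventional default, and the rankings are illustrative):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: merge several ranked lists into one.

    Each list contributes 1/(k + rank) per document; documents ranked
    well by both BM25 and dense retrieval rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["validateEmail", "parseEmail", "sendEmail"]
dense_hits = ["checkEmailFormat", "validateEmail"]
print(rrf([bm25_hits, dense_hits])[0])  # validateEmail
```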
Generation patterns
Code completion with context
User is in a file; assistant needs context from related files. Retrieve related code and pass as context. This is roughly how GitHub Copilot's broader workspace context works.
Answer technical questions with citations
User asks "how does our auth flow work?" Retrieve relevant code + docs, generate explanation, cite specific functions and files.
Generate code using examples
User wants a new feature similar to existing ones. Retrieve similar existing code as examples, generate new code in the same style.
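A minimal sketch of example-conditioned generation, assuming retrieval has already returned a few similar snippets (the plain-text prompt format is illustrative, not any particular assistant's):

```python
def build_generation_prompt(task: str, examples: list[str]) -> str:
    """Assemble a few-shot prompt from retrieved code examples.

    Real assistants typically use structured chat messages; this flat
    format just shows the shape of example-conditioned generation.
    """
    parts = ["Write code matching the style of these examples.", ""]
    for i, example in enumerate(examples, start=1):
        parts += [f"# Example {i}", example, ""]
    parts += ["# Task", task]
    return "\n".join(parts)

prompt = build_generation_prompt(
    "create_product(name, price)",
    ["def create_order(items):\n    ..."],
)
print("# Example 1" in prompt)  # True
```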
Evaluation for code RAG
Code RAG quality is more concrete than prose RAG:
- Does retrieved code actually compile/run?
- Do retrieved tests pass on retrieved code?
- Do generated changes pass existing tests?
- Is the generated code style-consistent with the codebase?
Ground-truth evaluation is easier because correctness is more objective.
IDE integration
Code RAG lives in IDEs:
- VS Code extensions
- JetBrains plugins
- Terminal tools
- Git hooks
Latency budgets are tight: sub-500 ms is ideal for inline completions; 2-3 seconds is acceptable for chat interactions.
The "outdated code" problem
Codebases change rapidly, and stale indexes serve stale code. Common mitigation patterns:
- Event-driven ingestion (webhooks from git)
- Periodic full re-index
- Branch-aware indexing (different branch, different index)
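The first and third patterns can be sketched together: route a push event to a branch-scoped index and re-ingest only the touched files. The payload shape here (ref / repo / commits) is a simplified assumption, loosely modeled on common git-hosting webhooks:

```python
def index_name(repo: str, branch: str) -> str:
    """Branch-aware indexing: one index namespace per repo + branch."""
    return f"{repo}__{branch}".replace("/", "_")

def on_push(event: dict, reindex) -> None:
    """Hypothetical webhook handler: re-index only the files a push touched."""
    branch = event["ref"].removeprefix("refs/heads/")
    changed = sorted({
        path
        for commit in event["commits"]
        for path in commit.get("added", []) + commit.get("modified", [])
    })
    reindex(index_name(event["repo"], branch), changed)

calls = []
on_push(
    {"ref": "refs/heads/main", "repo": "service-A",
     "commits": [{"modified": ["billing/refunds.py"], "added": ["billing/credits.py"]}]},
    lambda index, files: calls.append((index, files)),
)
print(calls[0][0])  # service-A__main
```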
Security considerations
- Filter secrets at ingestion (API keys, credentials committed to code) so they never reach the index
- Access control per repo (not everyone should see all code)
- Don't send proprietary code to external LLM APIs without an appropriate data-processing agreement in place
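Ingestion-time secret filtering can start with pattern-based redaction; a sketch with two illustrative patterns only (production pipelines should use dedicated scanners such as gitleaks or detect-secrets, which add far richer rule sets and entropy checks):

```python
import re

# Illustrative patterns, not a complete rule set.
SECRET_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID shape
    re.compile(r"(?i)(api[_-]?key|secret|token)\s*[:=]\s*['\"][^'\"]{8,}['\"]"),
]

def redact(code: str) -> str:
    """Replace anything matching a secret pattern before it is indexed.

    Note: the second pattern redacts the whole assignment, variable name
    included, which is the safe default.
    """
    for pattern in SECRET_PATTERNS:
        code = pattern.sub("[REDACTED]", code)
    return code

print(redact('API_KEY = "sk_live_abcdef1234"'))  # [REDACTED]
```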
Self-hosted is common
Code is often the most sensitive data a company has. Many organizations run code RAG entirely on-prem or in VPC:
- Self-hosted embedding model
- Self-hosted vector DB
- Self-hosted or VPC-deployed LLM
The compliance story is simpler with full isolation.
The compound quality benefit
Code RAG with good retrieval enables downstream agents:
- Automated refactoring
- Dependency updates
- Bug fixes with PR generation
- Code review automation
These higher-order tasks compound the value of the underlying retrieval system. Good retrieval is the foundation; capabilities stack on top.
What to do with this
- Use a code-specific embedding model; it meaningfully outperforms generic text embeddings.
- Chunk on function / class boundaries. Fixed-size chunking splits functions apart and destroys meaning.
- Filter secrets at ingestion. Never index credentials.