Vanilla RAG retrieves once and generates once. Agentic RAG lets the model decide when to retrieve, what to retrieve, and whether the retrieval was sufficient. It's the natural evolution from "search then answer" to "reason, search, search again, refine, answer." Most serious production systems end up in some version of this architecture.
Instead of hardcoded retrieve-then-generate, you expose retrieval as a tool the LLM can call:
1. User asks a question
2. LLM thinks: "I need information about X"
3. LLM calls: search_documents(query="X")
4. Tool returns documents
5. LLM evaluates: "I have enough to answer Y but need more on Z"
6. LLM calls: search_documents(query="Z related to X")
7. Tool returns more documents
8. LLM synthesizes an answer from all retrieved context
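The loop above can be sketched in a few lines. This is a minimal illustration, not a real client: `fake_llm` stands in for a tool-calling model and the toy corpus stands in for a vector store, but the control flow — let the model call `search_documents` repeatedly until it decides to answer — is the core pattern.

```python
def search_documents(query: str) -> list[str]:
    """Stand-in retriever: returns matching chunks from a toy corpus."""
    corpus = {
        "refund policy": "Refunds are issued within 30 days of purchase.",
        "shipping": "Orders ship within 2 business days.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]

def fake_llm(messages: list[dict]) -> dict:
    """Stand-in for a tool-calling LLM: retrieves once, then answers."""
    has_tool_result = any(m["role"] == "tool" for m in messages)
    if not has_tool_result:
        return {"tool_call": {"name": "search_documents",
                              "arguments": {"query": "refund policy"}}}
    context = " ".join(m["content"] for m in messages if m["role"] == "tool")
    return {"content": f"Based on our docs: {context}"}

def agentic_rag(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = fake_llm(messages)
        if "tool_call" in response:           # model wants to retrieve
            call = response["tool_call"]
            docs = search_documents(**call["arguments"])
            messages.append({"role": "tool", "content": " ".join(docs)})
        else:                                 # model is ready to answer
            return response["content"]
    return "Step budget exhausted."

print(agentic_rag("What's our refund policy?"))
```

Swap `fake_llm` for a real tool-calling model and `search_documents` for your retriever and the skeleton carries over unchanged.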
```json
{
  "name": "search_documents",
  "description": "Search the internal knowledge base for documents relevant to a query. Returns top 5 most relevant document chunks.",
  "parameters": {
    "query": {"type": "string", "description": "search query"},
    "filters": {"type": "object", "description": "optional filters on source, date, type"}
  }
}
```
An agentic RAG system often has multiple retrieval tools:
- search_docs(query): knowledge base vector search
- search_tickets(query): past support tickets
- query_database(sql): structured data
- get_document(id): fetch a specific document by ID when the model needs more context from a partial retrieval
- list_related(document_id): find related documents

The LLM picks tools based on the query. "What's our refund policy?" → search_docs. "How many refunds last quarter?" → query_database. "How did we handle this customer's last ticket?" → search_tickets.
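With several tools in play, the glue code is a registry that maps tool names to handlers so whichever tool the model selects can be dispatched uniformly. A sketch, with placeholder handler bodies:

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a callable tool, keyed by its name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_docs(query: str) -> str:
    return f"[vector search results for {query!r}]"

@tool
def query_database(sql: str) -> str:
    return f"[rows for {sql!r}]"

def dispatch(tool_call: dict) -> str:
    """Execute the tool the model chose; fail softly on unknown names."""
    name, args = tool_call["name"], tool_call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"   # feed back to the model
    return TOOLS[name](**args)

# "How many refunds last quarter?" -> the model picks query_database
print(dispatch({"name": "query_database",
                "arguments": {"sql": "SELECT COUNT(*) FROM refunds"}}))
```

Returning an error string (rather than raising) for unknown tools lets the model see its mistake and retry with a valid tool name.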
ReAct (Reasoning + Acting) is the classical scaffold: the model alternates Thought (reasoning about what it needs), Action (a tool call), and Observation (the tool's result) until it has enough to produce a final answer.
Models trained with tool-use support (GPT-4, Claude, Gemini) do ReAct-style reasoning natively.
Infinite loops: the agent keeps retrieving without terminating. Mitigation: max steps, max cost budget, timeout.
Over-retrieval: the model makes 5 calls when one would do, and cost balloons. Mitigation: instruct the model to retrieve only when necessary; log and alert on high-call-count queries.
Under-retrieval: the model answers too quickly from insufficient context and hallucinates. Mitigation: a strong system prompt about when to retrieve vs. answer from its own knowledge.
Wrong tool choice: the model uses search_docs when the answer is in the database. Mitigation: clear tool descriptions and examples.
Agentic systems are harder to debug. Invest in tracing:
See observability.
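At minimum, record every tool call with its arguments, latency, and result size so a misbehaving run can be reconstructed. A minimal sketch (in production you'd ship these spans to your observability stack rather than an in-memory list):

```python
import time
import json
import functools

TRACE: list[dict] = []

def traced(fn):
    """Wrap a tool so every call is appended to TRACE as a span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": json.dumps(kwargs if kwargs else list(args)),
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            "result_chars": len(str(result)),
        })
        return result
    return wrapper

@traced
def search_docs(query: str) -> str:
    return f"results for {query!r}"

search_docs(query="refund policy")
print(TRACE[0]["tool"])   # search_docs
```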
Agentic RAG is slower and more expensive than one-shot: every extra tool call adds a full round of LLM inference plus retrieval latency, so a three-step agent run can easily cost several times a single retrieve-then-generate pass.
For simple queries, one-shot is fine. Agentic RAG pays off on complex queries where one-shot fails.
Advanced pattern: classify the query first. If it's simple and answerable with one retrieval, use one-shot. If it's multi-hop or ambiguous, use agentic. Saves cost on easy queries, preserves quality on hard ones.
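That router can be sketched as a classify-then-dispatch step. A real system would use an LLM call or a small trained classifier for the routing decision; the keyword heuristic below is only a placeholder to show the shape:

```python
# Markers that suggest a multi-hop or comparative query (illustrative list).
MULTI_HOP_MARKERS = ("compare", "and then", "across", "why did", "how did")

def classify(query: str) -> str:
    """Route a query to the cheap or the expensive pipeline."""
    q = query.lower()
    if any(marker in q for marker in MULTI_HOP_MARKERS) or q.count("?") > 1:
        return "agentic"
    return "one_shot"

def answer(query: str) -> str:
    if classify(query) == "one_shot":
        return "retrieve once, generate once"   # cheap path
    return "run the multi-step retrieval loop"  # expensive path

print(classify("What's our refund policy?"))            # one_shot
print(classify("Compare refund rates across regions"))  # agentic
```

The routing call itself should be far cheaper than the pipelines it chooses between, otherwise the router eats the savings it was meant to create.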
Next: GraphRAG.