Agentic RAG

Vanilla RAG retrieves once and generates once. Agentic RAG lets the model decide when to retrieve, what to retrieve, and whether the retrieval was sufficient. It's the natural evolution from "search then answer" to "reason, search, search again, refine, answer." Most serious production systems end up in some version of this architecture.

The pattern

Instead of hardcoded retrieve-then-generate, you expose retrieval as a tool the LLM can call:

1. User asks question
2. LLM thinks: "I need information about X"
3. LLM calls: search_documents(query="X")
4. Tool returns documents
5. LLM evaluates: "I have enough to answer Y but need more on Z"
6. LLM calls: search_documents(query="Z related to X")
7. Tool returns more documents
8. LLM synthesizes answer from all retrieved context
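The loop above can be sketched in a few lines of Python. Everything here is a stub: `call_llm` stands in for your model API and `search_documents` for your retriever (both names are hypothetical); only the control flow is the point.

```python
def search_documents(query: str) -> list[str]:
    # Stub retriever: in production this would hit a vector store.
    corpus = {
        "refund policy": ["Refunds are issued within 14 days."],
        "refund exceptions": ["Digital goods are non-refundable."],
    }
    return corpus.get(query, [])

def call_llm(question: str, context: list[str]) -> dict:
    # Stub model: asks for more context until both topics are covered,
    # then answers. A real LLM would emit tool calls instead.
    if not any("14 days" in c for c in context):
        return {"action": "search", "query": "refund policy"}
    if not any("non-refundable" in c for c in context):
        return {"action": "search", "query": "refund exceptions"}
    return {"action": "answer", "text": " ".join(context)}

def agentic_rag(question: str, max_steps: int = 5) -> str:
    context: list[str] = []
    for _ in range(max_steps):                  # step budget (see failure modes)
        decision = call_llm(question, context)
        if decision["action"] == "answer":      # step 8: synthesize
            return decision["text"]
        context += search_documents(decision["query"])  # steps 3-7
    return "Unable to answer within step budget."
```

Note that the loop terminates either when the model decides it has enough context or when the step budget runs out; the latter guard matters once real models are in the loop.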

Why this beats one-shot RAG
One-shot RAG commits to a single query written before the model has seen any evidence. The agentic loop can reformulate a query that retrieved poorly, decompose a multi-hop question into several targeted searches, and check whether the retrieved context actually suffices before answering.

The tool definition

{
  "name": "search_documents",
  "description": "Search the internal knowledge base for documents relevant to a query. Returns the top 5 most relevant document chunks.",
  "parameters": {
    "type": "object",
    "properties": {
      "query": {"type": "string", "description": "search query"},
      "filters": {"type": "object", "description": "optional filters on source, date, type"}
    },
    "required": ["query"]
  }
}

Multi-tool setups

An agentic RAG system often has multiple retrieval tools: for example, search_docs over unstructured documentation, query_database for structured analytics data, and search_tickets over support history.

The LLM picks tools based on the query. "What's our refund policy?" → search_docs. "How many refunds last quarter?" → query_database. "How did we handle this customer's last ticket?" → search_tickets.
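Declared side by side, those three tools might look like this. All names and descriptions here are illustrative, using the same OpenAI-style "parameters" schema as search_documents; the descriptions deliberately spell out when each tool applies, since that is what the model routes on.

```python
# Shared schema: each tool takes a single free-text query.
QUERY_PARAM = {
    "type": "object",
    "properties": {"query": {"type": "string", "description": "search query"}},
    "required": ["query"],
}

TOOLS = [
    {
        "name": "search_docs",
        "description": "Semantic search over policies and product documentation. "
                       "Use for questions about rules, policies, or how things work.",
        "parameters": QUERY_PARAM,
    },
    {
        "name": "query_database",
        "description": "Run aggregate queries against the analytics database. "
                       "Use for counts, sums, and trends over structured data.",
        "parameters": QUERY_PARAM,
    },
    {
        "name": "search_tickets",
        "description": "Look up a customer's past support tickets. "
                       "Use for questions about a specific customer's history.",
        "parameters": QUERY_PARAM,
    },
]
```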

The ReAct pattern

ReAct (Reasoning + Acting) is the classic scaffold:

  1. Thought: what do I need to figure out?
  2. Action: call a tool
  3. Observation: what did the tool return?
  4. Thought: do I have enough, or do I need more?
  5. ...repeat until ready to answer
  6. Final answer

Models trained for tool use (GPT-4, Claude, Gemini) interleave reasoning and tool calls natively, so you rarely need to prompt the ReAct format explicitly.
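For models without native tool calling, you can run ReAct over plain text by parsing each turn. A regex sketch; the Thought/Action/Observation line format is a convention, not a standard, so adjust the patterns to whatever format your prompt establishes:

```python
import re

def parse_react_turn(text: str) -> dict:
    # An Action line like: Action: search_documents("refund policy")
    action = re.search(r"Action:\s*(\w+)\((.*)\)", text)
    if action:
        return {"type": "action",
                "tool": action.group(1),
                "arg": action.group(2).strip('"')}
    # A terminating turn: Final Answer: ...
    final = re.search(r"Final Answer:\s*(.*)", text, re.DOTALL)
    if final:
        return {"type": "final", "answer": final.group(1).strip()}
    # Otherwise the model is still thinking; prompt it to continue.
    return {"type": "thought"}
```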

The failure modes

Infinite loops

The agent keeps retrieving without ever terminating. Mitigations: cap the number of steps, set a cost budget, enforce a timeout.
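Those three guards can be combined into one budget object checked on every step. A sketch; the thresholds are illustrative, not recommendations:

```python
import time

class BudgetExceeded(Exception):
    pass

class RunBudget:
    def __init__(self, max_steps=8, max_tokens=50_000, timeout_s=60.0):
        self.max_steps, self.max_tokens, self.timeout_s = max_steps, max_tokens, timeout_s
        self.steps = 0
        self.tokens = 0
        self.start = time.monotonic()

    def charge(self, tokens: int) -> None:
        # Call once per agent step, before issuing the next model call.
        self.steps += 1
        self.tokens += tokens
        if self.steps > self.max_steps:
            raise BudgetExceeded("max steps")
        if self.tokens > self.max_tokens:
            raise BudgetExceeded("token budget")
        if time.monotonic() - self.start > self.timeout_s:
            raise BudgetExceeded("timeout")
```

Raising rather than returning a flag forces the caller to handle the stop explicitly, e.g. by answering from whatever context has been gathered so far.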

Over-retrieval

The model makes five calls when one would do, and cost balloons. Mitigations: instruct the model to retrieve only when necessary; log and alert on high-call-count queries.

Under-retrieval

The model answers too quickly from insufficient context and hallucinates. Mitigation: a strong system prompt spelling out when to retrieve versus when to answer from parametric knowledge.

Wrong tool choice

The model uses search_docs when the answer lives in the database. Mitigations: clear, contrastive tool descriptions with examples of when each tool applies.

Observability is critical

Agentic systems are harder to debug than one-shot pipelines: a single answer may involve many model and tool calls. Invest in tracing: record every tool call with its query and returned chunks, per-step latency and token counts, and the final answer alongside the context it was synthesized from.

See observability.
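Tracing can start as simply as a structured per-step log. A hand-rolled sketch for illustration; a production system would more likely use a tracing framework such as OpenTelemetry:

```python
import json
import time

class Trace:
    def __init__(self, query: str):
        self.query = query
        self.steps = []

    def record(self, kind: str, **data) -> None:
        # One entry per agent step: tool calls, observations, final answer.
        self.steps.append({"t": time.time(), "kind": kind, **data})

    def dump(self) -> str:
        # Serialize the whole run for log storage or replay.
        return json.dumps({"query": self.query, "steps": self.steps})
```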

Cost and latency

Agentic RAG is slower and more expensive than one-shot: every reasoning step is an additional LLM call, so latency and token cost scale with the number of retrieval rounds the agent takes.

For simple queries, one-shot is fine. Agentic RAG pays off on complex queries where one-shot fails.

Hybrid: routing between one-shot and agentic

Advanced pattern: classify the query first. If it's simple and answerable with one retrieval, use one-shot. If it's multi-hop or ambiguous, use agentic. Saves cost on easy queries, preserves quality on hard ones.
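A routing sketch, with a keyword heuristic standing in for the classifier; in practice you would use a small LLM call or a trained classifier rather than string matching, and the marker list here is purely illustrative:

```python
# Surface cues that a query is multi-hop or analytical (hypothetical list).
MULTI_HOP_MARKERS = ("compare", "trend", "across", "and then", "why did")

def route(query: str) -> str:
    # Returns "agentic" for queries that likely need multiple retrievals,
    # "one_shot" for simple single-lookup questions.
    q = query.lower()
    if any(m in q for m in MULTI_HOP_MARKERS) or q.count("?") > 1:
        return "agentic"
    return "one_shot"
```

The router itself must be cheap, or it eats the savings it is supposed to create; that is the argument for a heuristic or a small model rather than the main LLM.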

When to use agentic RAG
Use it when queries are multi-hop, ambiguous enough to need reformulation, or span multiple sources (documentation, databases, ticket history), and when answer quality matters more than latency.

When to stick with one-shot
Stick with one-shot when most queries are simple, a single retrieval usually suffices, and latency or cost budgets are tight; the hybrid routing pattern above avoids having to choose globally.

Next: GraphRAG.