Vanilla RAG retrieves once and generates once. Agentic RAG lets the model decide when to retrieve, what to retrieve, and whether the retrieval was sufficient. It's the natural evolution from "search then answer" to "reason, search, search again, refine, answer." Most serious production systems end up in some version of this architecture.
Instead of hardcoded retrieve-then-generate, you expose retrieval as a tool the LLM can call:
1. User asks a question
2. LLM thinks: "I need information about X"
3. LLM calls: search_documents(query="X")
4. Tool returns documents
5. LLM evaluates: "I have enough to answer Y but need more on Z"
6. LLM calls: search_documents(query="Z related to X")
7. Tool returns more documents
8. LLM synthesizes an answer from all retrieved context
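The loop above can be sketched in a few lines. This is a minimal illustration, not a real client: `fake_llm` stands in for a tool-calling model and the toy corpus stands in for a vector store, but the control flow — let the model call `search_documents` repeatedly until it decides to answer — is the core pattern.

```python
def search_documents(query: str) -> list[str]:
    """Stand-in retriever: returns matching chunks from a toy corpus."""
    corpus = {
        "refund policy": "Refunds are issued within 30 days of purchase.",
        "shipping": "Orders ship within 2 business days.",
    }
    return [text for key, text in corpus.items() if key in query.lower()]

def fake_llm(messages: list[dict]) -> dict:
    """Stand-in for a tool-calling LLM: retrieves once, then answers."""
    has_tool_result = any(m["role"] == "tool" for m in messages)
    if not has_tool_result:
        return {"tool_call": {"name": "search_documents",
                              "arguments": {"query": "refund policy"}}}
    context = " ".join(m["content"] for m in messages if m["role"] == "tool")
    return {"content": f"Based on our docs: {context}"}

def agentic_rag(question: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        response = fake_llm(messages)
        if "tool_call" in response:           # model wants to retrieve
            call = response["tool_call"]
            docs = search_documents(**call["arguments"])
            messages.append({"role": "tool", "content": " ".join(docs)})
        else:                                 # model is ready to answer
            return response["content"]
    return "Step budget exhausted."

print(agentic_rag("What's our refund policy?"))
```

Swap `fake_llm` for a real tool-calling model and `search_documents` for your retriever and the skeleton carries over unchanged.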
```json
{
  "name": "search_documents",
  "description": "Search the internal knowledge base for documents relevant to a query. Returns top 5 most relevant document chunks.",
  "parameters": {
    "query": {"type": "string", "description": "search query"},
    "filters": {"type": "object", "description": "optional filters on source, date, type"}
  }
}
```
An agentic RAG system often has multiple retrieval tools:
- search_docs(query): knowledge base vector search
- search_tickets(query): past support tickets
- query_database(sql): structured data
- get_document(id): fetch a specific document by ID when the model needs more context from a partial retrieval
- list_related(document_id): find related documents

The LLM picks tools based on the query. "What's our refund policy?" → search_docs. "How many refunds last quarter?" → query_database. "How did we handle this customer's last ticket?" → search_tickets.
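With several tools in play, the glue code is a registry that maps tool names to handlers so whichever tool the model selects can be dispatched uniformly. A sketch, with placeholder handler bodies:

```python
from typing import Callable

TOOLS: dict[str, Callable[..., str]] = {}

def tool(fn: Callable[..., str]) -> Callable[..., str]:
    """Register a function as a callable tool, keyed by its name."""
    TOOLS[fn.__name__] = fn
    return fn

@tool
def search_docs(query: str) -> str:
    return f"[vector search results for {query!r}]"

@tool
def query_database(sql: str) -> str:
    return f"[rows for {sql!r}]"

def dispatch(tool_call: dict) -> str:
    """Execute the tool the model chose; fail softly on unknown names."""
    name, args = tool_call["name"], tool_call.get("arguments", {})
    if name not in TOOLS:
        return f"error: unknown tool {name!r}"   # feed back to the model
    return TOOLS[name](**args)

# "How many refunds last quarter?" -> the model picks query_database
print(dispatch({"name": "query_database",
                "arguments": {"sql": "SELECT COUNT(*) FROM refunds"}}))
```

Returning an error string (rather than raising) for unknown tools lets the model see its mistake and retry with a valid tool name.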
ReAct (Reasoning + Acting) is the classical scaffold: the model alternates Thought (reasoning about what it needs), Action (a tool call), and Observation (the tool's result) until it has enough to produce a final answer.
Models trained with tool-use support (GPT-4, Claude, Gemini) do ReAct-style reasoning natively.
Infinite loops: the agent keeps retrieving without terminating. Mitigation: max steps, max cost budget, timeout.
Over-retrieval: the model makes 5 calls when one would do, and cost balloons. Mitigation: instruct the model to retrieve only when necessary; log and alert on high-call-count queries.
Under-retrieval: the model answers too quickly from insufficient context and hallucinates. Mitigation: a strong system prompt about when to retrieve vs. answer from its own knowledge.
Wrong tool choice: the model uses search_docs when the answer is in the database. Mitigation: clear tool descriptions and examples.
Agentic systems are harder to debug. Invest in tracing:
See observability.
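At minimum, record every tool call with its arguments, latency, and result size so a misbehaving run can be reconstructed. A minimal sketch (in production you'd ship these spans to your observability stack rather than an in-memory list):

```python
import time
import json
import functools

TRACE: list[dict] = []

def traced(fn):
    """Wrap a tool so every call is appended to TRACE as a span."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        TRACE.append({
            "tool": fn.__name__,
            "args": json.dumps(kwargs if kwargs else list(args)),
            "latency_ms": round((time.perf_counter() - start) * 1000, 2),
            "result_chars": len(str(result)),
        })
        return result
    return wrapper

@traced
def search_docs(query: str) -> str:
    return f"results for {query!r}"

search_docs(query="refund policy")
print(TRACE[0]["tool"])   # search_docs
```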
Agentic RAG is slower and more expensive than one-shot: every extra tool call adds a full round of LLM inference plus retrieval latency, so a three-step agent run can easily cost several times a single retrieve-then-generate pass.
For simple queries, one-shot is fine. Agentic RAG pays off on complex queries where one-shot fails.
Advanced pattern: classify the query first. If it's simple and answerable with one retrieval, use one-shot. If it's multi-hop or ambiguous, use agentic. Saves cost on easy queries, preserves quality on hard ones.
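That router can be sketched as a classify-then-dispatch step. A real system would use an LLM call or a small trained classifier for the routing decision; the keyword heuristic below is only a placeholder to show the shape:

```python
# Markers that suggest a multi-hop or comparative query (illustrative list).
MULTI_HOP_MARKERS = ("compare", "and then", "across", "why did", "how did")

def classify(query: str) -> str:
    """Route a query to the cheap or the expensive pipeline."""
    q = query.lower()
    if any(marker in q for marker in MULTI_HOP_MARKERS) or q.count("?") > 1:
        return "agentic"
    return "one_shot"

def answer(query: str) -> str:
    if classify(query) == "one_shot":
        return "retrieve once, generate once"   # cheap path
    return "run the multi-step retrieval loop"  # expensive path

print(classify("What's our refund policy?"))            # one_shot
print(classify("Compare refund rates across regions"))  # agentic
```

The routing call itself should be far cheaper than the pipelines it chooses between, otherwise the router eats the savings it was meant to create.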
Next: GraphRAG.