RAG solves a specific problem: grounding an LLM's answers in data the model didn't see during training. If your problem isn't that, you don't need RAG. Here's when I push teams toward simpler or different architectures.
If your "knowledge base" is 30 pages of internal docs, put it in a system prompt and move on. Retrieval pipelines add latency, cost, and surface area. For small, bounded knowledge that rarely changes, a long system prompt beats RAG every time.
Rule of thumb: if all your content fits in 40-60K tokens comfortably, skip RAG. You can always add it later if the knowledge base grows.
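The budget check above is cheap to automate. A minimal sketch, assuming plain-text docs on disk and the common rough heuristic of ~4 characters per token for English text (an approximation, not a real tokenizer):

```python
# Rough token-budget check before reaching for RAG.
# ~4 chars/token is a coarse heuristic for English; swap in a real
# tokenizer (e.g. tiktoken) if you need precision.
from pathlib import Path

TOKEN_BUDGET = 50_000  # middle of the 40-60K comfort range
CHARS_PER_TOKEN = 4    # rough heuristic, varies by model and language

def estimated_tokens(docs_dir: str) -> int:
    """Estimate total tokens across all markdown files in a directory."""
    total_chars = sum(
        len(p.read_text(encoding="utf-8"))
        for p in Path(docs_dir).rglob("*.md")
    )
    return total_chars // CHARS_PER_TOKEN

def needs_rag(docs_dir: str) -> bool:
    """If the whole corpus fits in the prompt budget, skip retrieval."""
    return estimated_tokens(docs_dir) > TOKEN_BUDGET
```

Run this against your corpus before designing anything; if `needs_rag` returns `False`, the rest of this post's cautions apply.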
"Generate a marketing email in our brand voice" doesn't need RAG. It needs a well-crafted system prompt with examples. "Summarize this document" doesn't need RAG. You already have the document.
RAG is for questions that require finding the right document. If there's no retrieval problem, there's no RAG.
If the user uploads one PDF and asks complex questions about it, you're better off sending the whole document (or large sections of it) to a long-context model than chunking-retrieving-generating. RAG chunking can break chains of reasoning that span multiple sections of a single document.
For "analyze this contract" or "review this codebase file," long-context inference outperforms chunked retrieval.
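The single-document pattern is mostly prompt construction. A minimal sketch; the message format mirrors common chat-completion APIs, and whatever client you use to send it is up to you:

```python
# Long-context single-document Q&A: send the whole document instead of
# chunking, retrieving, and hoping the right sections came back.
def build_single_doc_prompt(document: str, question: str) -> list[dict]:
    """Build chat messages that ground the model in one full document."""
    return [
        {
            "role": "system",
            "content": (
                "Answer strictly from the document provided. "
                "If the answer is not in the document, say so."
            ),
        },
        {
            "role": "user",
            "content": f"<document>\n{document}\n</document>\n\n{question}",
        },
    ]
```

Because the whole document is present, reasoning chains that span distant sections stay intact, which is exactly what chunked retrieval tends to break.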
RAG is optimistic retrieval: fetch the most likely documents and let the model figure it out. For legal discovery, compliance auditing, or anything where a missed document is a lawsuit, you want exhaustive search with human review, not approximate vector similarity. Traditional keyword search with Boolean operators and proper workflows beats RAG in these cases.
If your "documents" are rows in a database, you don't want vector retrieval, you want SQL. An LLM with structured tool access and a well-defined query interface beats RAG for any task where the underlying data is tabular and queryable.
Text-to-SQL or tool-calling agents are the right pattern here. RAG over a database export is usually worse than querying the database directly.
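The tool the agent calls can be very small. A minimal sketch using SQLite; the read-only guard and row limit are the load-bearing parts, since the model, not a human, is writing the SQL:

```python
# Structured data wants SQL, not embeddings: a minimal read-only query
# tool an LLM agent can call instead of vector retrieval.
import sqlite3

def run_readonly_query(db_path: str, sql: str, limit: int = 50) -> list[tuple]:
    """Execute a SELECT against the database; reject anything else."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("only SELECT queries are allowed")
    # mode=ro opens the file read-only, so even a malformed or
    # adversarial query cannot mutate the database.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(sql).fetchmany(limit)
    finally:
        conn.close()
```

Wire this up as the single tool in a tool-calling loop and the model answers "how many orders last month?" by writing a query, not by hoping the right CSV chunk surfaced from a vector index.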
Full RAG adds roughly 200-1500ms of latency (embedding + search + rerank + generation). If you're building something real-time (voice, streaming autocomplete), RAG may be too slow. Consider caching retrievals for common queries, precomputing context before the user asks, or moving retrieval off the critical path entirely.
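Before committing, do the arithmetic. A back-of-envelope latency budget; the per-stage numbers here are illustrative, not benchmarks, so substitute your own measurements:

```python
# Back-of-envelope latency budget for a RAG stack.
# Stage estimates are illustrative placeholders; measure your own.
RAG_STAGES_MS = {
    "embed_query": 30,
    "vector_search": 50,
    "rerank": 150,
    "generation_first_token": 400,
}

def fits_budget(stages_ms: dict, budget_ms: int) -> bool:
    """True if the summed stage latencies fit the end-to-end target."""
    return sum(stages_ms.values()) <= budget_ms
```

With these numbers the stack totals 630ms, so a voice agent targeting ~500ms to first token can't afford it, while a 1s chat budget can.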
RAG is a live system. It needs document ingestion pipelines, index maintenance, embedding recomputation when you change models, and monitoring. If your team won't own this, RAG will rot. The index will drift out of sync with the actual data, and answers will silently get worse.
For one-off projects or prototypes that won't have ongoing ownership, a long-context approach is often better. You pay more per query but you avoid an operational burden nobody will carry.
Revisit: if you need stable output structure, consistent tone, or latency-critical task-specific behavior, fine-tuning may beat RAG. See "Why RAG over fine-tuning" for how I draw that line.
If the information is in the model's training data (common knowledge, standard programming patterns, general reasoning), RAG adds no value. Trying to RAG-retrieve "how do I reverse a string in Python" is worse than just asking the model.
A quick test: does the base model, without any context, get this question mostly right? If yes, RAG is solving the wrong problem.
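That test is easy to run over a question set. A minimal sketch; `ask_model`, `retrieve`, and `grade` are hypothetical stand-ins for your own client, retriever, and scoring function:

```python
# "Does RAG add value?" baseline check: answer each question with and
# without retrieved context, and see how often retrieval actually helps.
# ask_model, retrieve, and grade are placeholders you supply.
def rag_adds_value(questions, ask_model, retrieve, grade,
                   win_threshold: float = 0.5) -> bool:
    """True if retrieval improves the graded answer on most questions."""
    wins = 0
    for q in questions:
        base_score = grade(q, ask_model(q))
        rag_score = grade(q, ask_model(q, context=retrieve(q)))
        if rag_score > base_score:
            wins += 1
    # If retrieval rarely beats the bare model, the knowledge is
    # already in the weights and RAG is solving the wrong problem.
    return wins / len(questions) > win_threshold
```

Run it on a representative sample before building the pipeline; a low win rate is the cheapest "no" you'll ever get.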
Whenever a team says "we want to use RAG for...", I ask them to complete the sentence without the word RAG. "We want to ground answers in our product docs." "We want users to ask questions about their uploaded files." "We want to search across internal wikis with a natural-language interface." Those are real use cases. "We want to use RAG" is architecture looking for a problem.
Next: The ingestion pipeline.