Why RAG over fine-tuning

Every time a team says "we want to fine-tune a model on our company docs," I ask one question: do you want to change how the model behaves, or do you want to make facts available to it? If it's the second one, and it almost always is, you want RAG, not fine-tuning. This is the most common architectural mistake I see in AI projects.

The clean split

Use RAG when you need facts the model can cite accurately: knowledge that changes often, private documents, answers grounded in a specific corpus.

Use fine-tuning when you need different behavior: output format, tone, domain vocabulary, a consistent response structure.

Use both when you need a small model that responds in your house style while staying current on the facts.
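The split reduces to a two-question decision. A minimal sketch, where the function name and the "prompting" fallback for the neither-case are my additions:

```python
# Hypothetical decision helper mirroring the split above.
def choose_approach(needs_current_facts: bool, needs_behavior_change: bool) -> str:
    """Map the two questions from the intro onto an architecture."""
    if needs_current_facts and needs_behavior_change:
        return "both"  # fine-tune for behavior, RAG for facts
    if needs_current_facts:
        return "rag"
    if needs_behavior_change:
        return "fine-tuning"
    return "prompting"  # neither: a good prompt may be all you need
```

Most teams asking about "fine-tuning on our docs" answer yes only to the first question.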

The economics are not close

Fine-tuning a serious model costs anywhere from a few hundred to tens of thousands of dollars per run. Updating a RAG index costs pennies. When your knowledge base changes, which it does constantly, the RAG system is updated in seconds. The fine-tuned model is a stale artifact until the next training run.

For any system where facts update more than once a quarter, RAG wins on cost alone.
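To make the cost asymmetry concrete, here is a toy sketch of an index update: when one document changes, you re-embed and upsert one entry, versus relaunching an entire training run. The `embed()` function is a stand-in, not a real embedding model:

```python
import hashlib

# Toy vector index: a dict from doc id to embedding.
# embed() is a placeholder; a real system calls an embedding model.
def embed(text: str) -> list[float]:
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]  # fake 8-dim "embedding"

index: dict[str, list[float]] = {}

def upsert(doc_id: str, text: str) -> None:
    # O(1) work per changed document -- no GPU hours, no training job.
    index[doc_id] = embed(text)

upsert("pricing-page", "Pro plan costs $49/month")
# The fact changes: updating the index takes seconds, not a retrain.
upsert("pricing-page", "Pro plan costs $59/month")
```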

The failure mode nobody warns you about

Fine-tuning on factual data teaches the model to produce tokens that look like your data. Not to reason about it. The model will confidently invent answers that sound like your docs but contradict them. This is especially bad with small datasets, where the model memorizes surface patterns without generalizing.

I've seen teams fine-tune on internal wikis and get a model that hallucinates internal wiki content, which is worse than useless. RAG, done well, doesn't have this failure mode because the actual source text is in the prompt at generation time.
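That "source text in the prompt" property can be sketched end to end. This toy retriever uses bag-of-words cosine similarity instead of real embeddings, and the documents and prompt template are illustrative:

```python
import math
from collections import Counter

docs = {
    "vpn": "Connect to the VPN before accessing the staging cluster.",
    "oncall": "The on-call rotation changes every Monday at 09:00 UTC.",
}

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by word-overlap similarity to the query.
    q = Counter(query.lower().split())
    ranked = sorted(docs.values(),
                    key=lambda d: cosine(q, Counter(d.lower().split())),
                    reverse=True)
    return ranked[:k]

def grounded_prompt(query: str) -> str:
    # The key move: the actual source text travels with the question,
    # so the model quotes it instead of imitating its style from memory.
    context = "\n".join(retrieve(query))
    return (f"Answer using ONLY the sources below.\n\n"
            f"Sources:\n{context}\n\nQuestion: {query}")
```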

The hybrid pattern, when it pays

The right hybrid: fine-tune a smaller/cheaper model on the shape of responses you want (format, style, common patterns), then use RAG to inject the specific facts at inference time. You get the efficiency of a smaller model, the consistency of fine-tuning, and the factuality of RAG.

This is the architecture most serious production systems converge on. Not "RAG or fine-tuning" but "fine-tuning for behavior, RAG for facts."
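A minimal sketch of one inference request in that architecture, assuming a chat-style API; the model name is a hypothetical fine-tuned small model, and the message layout is illustrative:

```python
# Hybrid pattern: behavior comes from the fine-tuned model,
# facts come from retrieval at inference time.
def build_request(question: str, retrieved_chunks: list[str]) -> dict:
    return {
        # Hypothetical small model fine-tuned on response shape:
        # format, tone, common patterns. It owns the "how".
        "model": "acme-support-7b-ft",
        "messages": [
            {"role": "system",
             "content": "Answer in the trained house style, citing the facts provided."},
            # RAG owns the "what": specific facts injected per request.
            {"role": "user",
             "content": "Facts:\n" + "\n".join(retrieved_chunks)
                        + f"\n\nQuestion: {question}"},
        ],
    }
```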

A quick test

Ask yourself: if a new fact is added to our knowledge base tomorrow, does the system need to answer questions about it? If yes, you need retrieval. No training schedule keeps up with tomorrow.

Long context isn't a replacement

Some teams now argue that with 1M+ token context windows, you can just stuff everything in. In practice: you can't afford to, your latency balloons, and the model's effective recall (what it actually attends to) degrades in the middle of massive contexts. The "lost in the middle" phenomenon is real. Long context is a tool in the RAG toolbox, not a replacement for retrieval.

The right mental model: use a long context window to pass more retrieved chunks, not to skip retrieval.
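One way to sketch that mental model: pack ranked chunks into a token budget best-first, then reorder so the strongest chunks sit at the edges of the prompt, a common mitigation for the lost-in-the-middle effect. Whitespace token counting here is an approximation; a real system would use the model's tokenizer:

```python
def pack_chunks(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    """Fit relevance-ranked chunks into a context budget, best-first."""
    kept, used = [], 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude proxy for token count
        if used + cost > budget_tokens:
            break  # budget spent: weaker chunks never make it in
        kept.append(chunk)
        used += cost
    # Reorder so ranks 1, 3, 5... run from the front and 2, 4, 6...
    # from the back, pushing the weakest chunks toward the middle,
    # where attention degrades most.
    front, back = kept[0::2], kept[1::2]
    return front + back[::-1]
```

The budget parameter is what a bigger context window actually buys you: more retrieved chunks, not a reason to skip retrieval.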

Next: The RAG architecture map.