Every time a team says "we want to fine-tune a model on our company docs," I ask one question: do you want to change how the model behaves, or do you want to make facts available to it? If it's the second one, and it almost always is, you want RAG, not fine-tuning. This is the most common architectural mistake I see in AI projects.
Fine-tuning a serious model costs anywhere from a few hundred to tens of thousands of dollars per run. Updating a RAG index costs pennies. When your knowledge base changes, which it does constantly, the RAG system is updated in seconds. The fine-tuned model is a stale artifact until the next training run.
For any system where facts update more than once a quarter, RAG wins on cost alone.
Fine-tuning on factual data teaches the model to produce tokens that look like your data. Not to reason about it. The model will confidently invent answers that sound like your docs but contradict them. This is especially bad with small datasets, where the model memorizes surface patterns without generalizing.
I've seen teams fine-tune on internal wikis and get a model that hallucinates internal wiki content, which is worse than useless. RAG, done well, doesn't have this failure mode because the actual source text is in the prompt at generation time.
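To make that concrete, here's a toy sketch of the RAG side, assuming nothing beyond the standard library. The keyword-overlap scorer is a deliberate placeholder for a real embedding index; the point is only that the source text lands in the prompt at generation time:

```python
# Minimal sketch of why RAG avoids the failure mode above: retrieved
# text is injected verbatim into the prompt, so the model quotes your
# docs instead of reconstructing them from weights. Keyword overlap is
# a toy stand-in for embedding-based retrieval.

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Rank docs by word overlap with the query; return the top k."""
    q_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query: str, docs: list[str]) -> str:
    """Place the retrieved source text directly in the prompt."""
    context = "\n---\n".join(retrieve(query, docs))
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

docs = [
    "The VPN requires MFA enrollment before first login.",
    "Expense reports are due by the 5th of each month.",
    "Parking passes are issued by the facilities team.",
]
prompt = build_prompt("When are expense reports due?", docs)
```

Swap the scorer for a vector index and the shape stays the same: rank, select, inject.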
The right hybrid: fine-tune a smaller/cheaper model on the shape of responses you want (format, style, common patterns), then use RAG to inject the specific facts at inference time. You get the efficiency of a smaller model, the consistency of fine-tuning, and the factuality of RAG.
This is the architecture most serious production systems converge on. Not "RAG or fine-tuning" but "fine-tuning for behavior, RAG for facts."
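The division of labor is easy to sketch. Here `call_finetuned_model` is a hypothetical stand-in for your small fine-tuned model's inference client; the names and prompt format are illustrative, not any particular API:

```python
# Hybrid sketch: the fine-tuned model owns the response *shape*;
# retrieval owns the *facts*, injected fresh at inference time.

def call_finetuned_model(prompt: str) -> str:
    # Hypothetical stub for a small model fine-tuned on your response
    # format (tone, structure, citation style). Swap in a real client.
    return f"[styled response]\n{prompt}"

def answer(query: str, retrieve) -> str:
    """Facts come from the index at inference time, never from weights."""
    chunks = retrieve(query)
    context = "\n".join(f"- {c}" for c in chunks)
    return call_finetuned_model(f"Facts:\n{context}\n\nQuestion: {query}")
```

Because facts arrive through `retrieve`, updating the knowledge base never requires retraining; because the model is only fine-tuned on response shape, it can be small and cheap.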
Ask yourself: if a new fact lands in your knowledge base tomorrow, does the system need to answer questions about it? If yes, retrieval belongs in the architecture.
Some teams now argue that with 1M+ token context windows, you can just stuff everything in. In practice: you can't afford to, your latency triples, and recall (what the model actually attends to) degrades in the middle of massive contexts. The "lost in the middle" phenomenon is real. Long context is a tool in the RAG toolbox, not a replacement for retrieval.
The right mental model: use a long context window to pass more retrieved chunks, not to skip retrieval.
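In code terms, the window size becomes a parameter of retrieval rather than a reason to skip it. A sketch under the assumption that the retriever returns chunks best-first (whitespace token counting is a crude approximation; a real system would use the model's tokenizer):

```python
# Pack best-first retrieved chunks into a context budget. A bigger
# window just means a bigger budget: more chunks, still ranked,
# still retrieved.

def pack_chunks(ranked_chunks: list[str], budget_tokens: int) -> list[str]:
    packed, used = [], 0
    for chunk in ranked_chunks:  # assumed best-first from the retriever
        cost = len(chunk.split())  # crude token estimate
        if used + cost > budget_tokens:
            break
        packed.append(chunk)
        used += cost
    return packed
```

Stopping at a budget of ranked chunks, instead of stuffing the whole corpus in, is also what keeps your best evidence from getting lost in the middle.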
Next: The RAG architecture map.