Corrective RAG (CRAG) adds a quality-check step between retrieval and generation. If retrieved documents are judged insufficient or irrelevant, the system triggers a fallback strategy, typically a web search, query rewriting, or a different retrieval approach. It's a practical pattern for handling retrieval failures gracefully.
1. Retrieve from internal corpus
2. Evaluate retrieved documents:
   - Correct (high confidence, relevant)
   - Ambiguous (some relevant, some not)
   - Incorrect (nothing relevant found)
3. Based on evaluation:
   - Correct: generate answer from retrieved docs
   - Ambiguous: refine retrieval, add web search, or both
   - Incorrect: fall back to web search or "I don't know"
4. Generate final answer
The key component is the retrieval evaluator, a model (small classifier, LLM with a prompt, or score-based heuristic) that judges whether retrieval gave enough to answer.
```
SYSTEM: Given a user query and retrieved documents, judge whether
the documents contain enough information to answer the query.
Respond with:
- SUFFICIENT: can answer directly
- PARTIAL: some info present, need more
- INSUFFICIENT: no relevant info

USER:
Query: [query]
Documents: [doc1, doc2, doc3]
```
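One practical detail the prompt glosses over: the model's raw response needs to be mapped back to one of the three labels, and unexpected output should fail safe. A minimal parser sketch (the function name and default behavior are my own choices, not from any particular library):

```python
def parse_judgment(raw: str) -> str:
    """Map an evaluator's raw text response to one of three labels.

    Checks INSUFFICIENT before SUFFICIENT because the former contains
    the latter as a substring. Unrecognized output defaults to
    INSUFFICIENT, so a confused evaluator triggers fallback rather
    than letting generation proceed on bad context.
    """
    text = raw.strip().upper()
    for label in ("INSUFFICIENT", "PARTIAL", "SUFFICIENT"):
        if label in text:
            return label
    return "INSUFFICIENT"
```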
If all retrieved documents have cosine similarity below 0.7 (example threshold), flag as low-confidence. Cheap, no extra LLM call, but less reliable than LLM evaluation.
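The score-based check is a few lines of arithmetic over the embeddings you already have. A sketch, assuming the query and documents are already embedded as plain vectors (the 0.7 threshold is the example value from above, not a recommendation):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def low_confidence(query_emb, doc_embs, threshold=0.7):
    """Flag retrieval as low-confidence when no document clears the threshold."""
    return all(cosine(query_emb, e) < threshold for e in doc_embs)
```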
Score-based first pass (fast). For borderline cases, LLM evaluator.
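The hybrid can be expressed as a two-band routing function. This is a sketch with illustrative thresholds; `llm_evaluate` stands in for whatever callable wraps your LLM evaluator prompt:

```python
def evaluate_hybrid(max_score, llm_evaluate, low=0.5, high=0.8):
    """Two-stage evaluation: scores settle clear cases, the LLM settles the rest.

    `low` and `high` are hypothetical thresholds you would tune on
    your own retrieval score distribution.
    """
    if max_score >= high:
        return "SUFFICIENT"    # clearly good retrieval: skip the LLM call
    if max_score < low:
        return "INSUFFICIENT"  # clearly bad retrieval: skip the LLM call
    return llm_evaluate()      # borderline: pay for the LLM judgment
```

The payoff is that the expensive evaluator only runs on the borderline band, which is usually a minority of queries.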
When the internal corpus lacks info, search the web (via Serper, Tavily, Brave Search, etc.). Append web results to the context. Useful for current events, general knowledge, and questions outside your corpus.
Rewrite the query with different terminology, retry retrieval. Simple, no external dependencies.
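A rewrite step can itself be a single LLM call. A sketch, where `llm` is a placeholder for any callable that takes a prompt string and returns text, and the prompt wording is illustrative:

```python
REWRITE_PROMPT = (
    "Rewrite the following search query using different terminology: "
    "expand abbreviations and substitute synonyms. "
    "Return only the rewritten query.\n"
    "Query: {query}"
)

def rewrite_query(query, llm):
    """Ask the model for an alternative phrasing, then retry retrieval with it."""
    return llm(REWRITE_PROMPT.format(query=query)).strip()
```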
Escalate to broader knowledge sources: general docs, encyclopedic sources, parent organization's docs.
If nothing is found, acknowledge the gap instead of hallucinating. This is the underrated answer most systems skip.
Without CRAG, the alternative is that the model receives low-quality context and hallucinates a plausible but wrong answer. CRAG treats "retrieval was bad" as a first-class state, not a silent failure.
CRAG adds one evaluation step per query. If the evaluator is a small fast model, this adds 100-300ms. For queries that trigger fallback, total latency grows (web search adds seconds).
Evaluate per-document (which docs are relevant?) or per-set (is the overall set sufficient?). Per-set is simpler, per-document is more precise.
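The two granularities can be sketched over a list of relevance scores (the threshold is illustrative):

```python
def per_document(scores, threshold=0.7):
    """Per-document evaluation: return indices of individually relevant docs,
    so irrelevant ones can be dropped before generation."""
    return [i for i, s in enumerate(scores) if s >= threshold]

def per_set(scores, threshold=0.7):
    """Per-set evaluation: one judgment for the whole batch.
    Here, sufficient if any document clears the bar."""
    return any(s >= threshold for s in scores)
```

Per-document output lets you prune noise from the context window; per-set output only tells you whether to trigger fallback.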
How strict is the evaluator? Strict = more fallbacks (better quality, higher cost). Lenient = fewer fallbacks (faster, cheaper, more hallucinations).
Sequential: try retrieval, if bad, fall back. Parallel: always run retrieval + web search, use the better result. Parallel is faster for queries that need fallback but wastes compute on queries that don't.
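The parallel variant is a small amount of orchestration. A sketch using `concurrent.futures`, where `retrieve`, `web_search`, and `score` are placeholders for your own retriever, search client, and result-quality scorer:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_crag(query, retrieve, web_search, score):
    """Run internal retrieval and web search concurrently; keep the better result.

    Trades wasted compute on easy queries for lower latency on queries
    that would otherwise wait for a sequential fallback.
    """
    with ThreadPoolExecutor(max_workers=2) as pool:
        internal = pool.submit(retrieve, query)
        web = pool.submit(web_search, query)
        candidates = [internal.result(), web.result()]
    return max(candidates, key=score)
```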
```python
def corrective_rag(query):
    retrieved = retrieve(query, top_k=10)
    judgment = evaluate(query, retrieved)

    if judgment == "SUFFICIENT":
        return generate(query, retrieved)

    if judgment == "PARTIAL":
        # Some relevant docs found: augment them with web results.
        web_results = web_search(query)
        return generate(query, retrieved + web_results)

    if judgment == "INSUFFICIENT":
        # Rewrite and retry internal retrieval once, with a wider net.
        rewritten = rewrite_query(query)
        retry_retrieved = retrieve(rewritten, top_k=20)
        retry_judgment = evaluate(query, retry_retrieved)
        if retry_judgment != "INSUFFICIENT":
            return generate(query, retry_retrieved)
        # Still nothing usable internally: fall back to web search alone.
        web_results = web_search(query)
        return generate(query, web_results)
```
You don't need a full CRAG implementation to benefit from the idea. Start with a single sufficiency check on the retrieved documents, plus an explicit "I don't know" fallback when that check fails. This simple version already stops most hallucination failures in production RAG.