Guide · 10 min

RAG vs Fine-tuning Decision Framework

A 7-factor framework for picking between RAG, fine-tuning, or both — with a decision table you can actually use in a planning meeting.


Nine times out of ten, the answer to "should we RAG or fine-tune?" is RAG. The tenth time, it's both. But the question itself is usually the wrong one — what you're really deciding is where new knowledge lives, how fast it needs to change, and who pays the latency tax. This guide walks through seven factors we use on every engagement, with a decision table at the end you can take straight into a scoping meeting.

If you can't point to the specific failure mode you're trying to fix, don't fine-tune. Fine-tuning is not a knowledge transplant — it's a behavior nudge.

The seven factors

1. Data freshness

How often does the underlying knowledge change?

  • Daily or faster (prices, inventory, tickets, policies): RAG. Fine-tuning cycles are measured in days-to-weeks. You will never catch up.
  • Monthly: RAG still wins. The cost of a retraining cadence is usually higher than the cost of a retriever.
  • Rarely (product taxonomy, legal citations, a fixed corpus): either works. Lean fine-tune only if other factors push you there.

2. Citation and traceability requirements

If a lawyer, regulator, or enterprise procurement team will read the answer, you need citations. RAG gives you sources for free — you already have the chunks. Fine-tuning does not, and "the model knows it" is not an audit trail.

If your domain is healthcare, legal, financial services, or B2B enterprise sales: RAG, or hybrid with RAG on top.

3. Knowledge base size

  • < 10K tokens of "stable truth": put it in the system prompt. Don't build anything.
  • 10K – 10M tokens: RAG with a single vector store and a decent chunker.
  • 10M – 1B tokens: RAG with hierarchical retrieval, hybrid BM25 + dense, and rerankers.
  • > 1B tokens: RAG plus careful sharding. Fine-tuning at this scale competes with RAG on cost, not on capability.
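Once you're in the hybrid BM25 + dense tier, the usual first step is merging the two ranked lists rather than tuning either retriever. A minimal sketch of reciprocal rank fusion, assuming you already have ranked doc-ID lists from a BM25 pass and a dense-embedding pass (the doc IDs below are hypothetical):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one combined ranking.

    rankings: list of lists, each ordered best-first (e.g. one from
    BM25, one from a dense retriever). k=60 is the conventional RRF
    constant; it damps the influence of any single list's top ranks.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical outputs from a BM25 pass and a dense-embedding pass.
bm25 = ["doc_pricing", "doc_faq", "doc_api"]
dense = ["doc_api", "doc_pricing", "doc_changelog"]
fused = reciprocal_rank_fusion([bm25, dense])
# doc_pricing ranks highly in both lists, so it wins the fused ranking.
```

RRF is popular precisely because it needs no score normalization between the keyword and dense retrievers; only the ranks matter.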

4. Behavior vs knowledge

This is the factor people get wrong most often. Ask: do I want the model to know something new, or act differently?

  • Know (facts, products, procedures) → RAG.
  • Act (tone, format, structured output, classification, refusal behavior, code style) → fine-tuning, or a very good prompt.

A fine-tune that tries to teach facts will hallucinate confidently. A RAG pipeline that tries to teach tone will sound like a different writer on every query.

5. Latency and cost per request

Fine-tuned models skip the retrieval hop and can run on smaller base models. For high-QPS, cost-sensitive workloads (classification at scale, internal routing, agent sub-steps), a fine-tuned small model beats a frontier model + RAG on both axes.

Rule of thumb: if you're serving more than 100 QPS and the task is narrow, fine-tune a small model.

6. Output structure and consistency

If you need strict JSON, a specific schema, or a domain-specific format that frontier models get 85% right — fine-tuning or constrained decoding wins. RAG does not help here; the problem is output shape, not knowledge.

7. Deployment constraints

Air-gapped, on-prem, edge, or data-residency constraints push you toward self-hosted models, which makes fine-tuning (or continued pre-training) viable and sometimes necessary. RAG still applies — you just host the retriever too.

Decision table

Factor                             | Lean RAG | Lean Fine-tune | Hybrid
-----------------------------------|----------|----------------|-------
Data changes daily/weekly          | Yes      | No             | Yes
Need citations                     | Yes      | No             | Yes
Knowledge base > 100K tokens       | Yes      | No             | Yes
Goal is new knowledge              | Yes      | No             | Yes
Goal is tone/format/classification | No       | Yes            | Maybe
> 100 QPS, cost-sensitive          | No       | Yes            | Yes
Strict output schema               | No       | Yes            | Maybe
Narrow domain, stable vocabulary   | No       | Yes            | Yes
Regulated industry                 | Yes      | No             | Yes
Edge / air-gapped deploy           | Maybe    | Yes            | Yes

Read the table as signal strength, not a vote count. One "need citations" yes outweighs three "narrow domain" yeses in a regulated setting.

When to do both (hybrid)

The hybrid pattern shows up more than people expect. Typical shapes:

  1. Fine-tuned router + RAG responder. A small fine-tuned classifier decides the query type and retrieval strategy; a frontier model with RAG answers. Cuts cost and improves retrieval precision.
  2. RAG for facts, fine-tune for format. Retrieval grounds the answer. A fine-tune on (retrieved_context, ideal_answer) pairs teaches the model to use the context in your house style.
  3. Fine-tuned embedder + RAG. Off-the-shelf embeddings underperform on domain jargon (finance tickers, medical codes, internal project names). A fine-tuned embedder on a few thousand pairs often doubles retrieval recall.
  4. RAG with a fine-tuned reranker. Cheapest hybrid. Keep your retriever generic; fine-tune a cross-encoder reranker on thumbs-up/down data.

If you're going to fine-tune anything in a RAG system, fine-tune the embedder or reranker before you fine-tune the generator. The leverage is 5-10x higher.

Common anti-patterns

  • Fine-tuning a frontier model to "teach it our product." You'll spend $50K and the model will still invent SKUs. Put your product catalog in a retriever.
  • RAG with a 200-page PDF and no chunking strategy. The retrieval step is the product. Spend time there, not on prompt tweaks.
  • Evaluating fine-tuning against RAG on a 20-example eyeball test. You need 200+ eval cases and a held-out set. See our LLM Eval Starter Kit.
  • Fine-tuning to fix hallucination. This rarely works. Hallucination is usually a grounding problem (RAG) or a prompt problem.
  • RAG when the answer lives in a structured database. Use SQL / tool calling. A vector search over a customer table is embarrassing for everyone involved.

A worked example

A client came to us wanting to fine-tune GPT-4o to "sound like our support team." They had 18 months of Zendesk tickets.

What we actually shipped:

  1. RAG over their knowledge base and resolved tickets. Solved 80% of the "sound like us" problem because the retrieved examples carried the tone.
  2. Fine-tuned embedder on ticket similarity pairs. Retrieval recall@5 went from 0.62 to 0.89.
  3. Small prompt-level style guide for tone (10 lines).
  4. Skipped fine-tuning the generator. Re-evaluated quarterly; still not needed.

Total cost came in at roughly 1/8 of the original fine-tuning plan, shipped in 3 weeks instead of 3 months. This is the shape of most real projects.

Next step

If you're staring at a build-vs-buy-vs-fine-tune slide and want a second opinion before the PO goes out, get in touch. We'll tell you if fine-tuning is worth it, and more often, why it isn't.
