RAG vs Fine-Tuning: Choosing the Right LLM Strategy for Your Use Case
RAG and fine-tuning are the two dominant approaches to customizing LLMs for enterprise use. Here's how to choose between them based on your data, latency requirements, and budget.
The Core Question
When you need an LLM that knows about your company's specific products, policies, or domain knowledge, you have two primary paths: Retrieval-Augmented Generation (RAG) or fine-tuning. They solve different problems, and using the wrong one is expensive.
What RAG Actually Does
RAG doesn't change the model. It changes what the model sees.
At inference time, you retrieve relevant documents from a vector store and inject them into the prompt as context. The model reasons over that context to produce its answer.
# Assumes `embedder`, `vector_store`, and `llm` are already-initialized clients for
# your embedding model, vector database, and LLM (placeholders, not a specific library).
def rag_query(user_question: str) -> str:
    # 1. Embed the question
    question_embedding = embedder.encode(user_question)

    # 2. Retrieve the top 5 chunks, dropping any that score below 0.75
    relevant_docs = vector_store.search(
        query_vector=question_embedding,
        k=5,
        score_threshold=0.75,
    )

    # 3. Build context
    context = "\n\n".join(doc.content for doc in relevant_docs)

    # 4. Generate with context
    prompt = f"""Answer the question based on the provided context.

Context:
{context}

Question: {user_question}

If the context doesn't contain enough information to answer, say so clearly."""
    return llm.complete(prompt)
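To make `k` and `score_threshold` concrete, here is a minimal in-memory sketch of what the search step might do. It assumes cosine similarity as the score and uses hand-written 3-dimensional vectors in place of real embeddings. `Chunk`, `cosine_similarity`, and `search` are hypothetical names for illustration, not the API of any particular vector database.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    content: str
    vector: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(chunks, query_vector, k=5, score_threshold=0.75):
    """Return up to k (score, chunk) pairs scoring at least score_threshold, best first."""
    scored = [(cosine_similarity(query_vector, c.vector), c) for c in chunks]
    kept = [pair for pair in scored if pair[0] >= score_threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


# Toy vectors standing in for real embeddings (assumption for illustration).
chunks = [
    Chunk("API keys are reset under Settings > API Keys.", [0.9, 0.1, 0.0]),
    Chunk("Invoices are emailed on the first of the month.", [0.0, 0.2, 0.9]),
    Chunk("Regenerating a key invalidates the old one.", [0.8, 0.3, 0.1]),
]
query = [1.0, 0.2, 0.0]  # pretend embedding of "How do I reset my API key?"

for score, chunk in search(chunks, query, k=2):
    print(f"{score:.2f}  {chunk.content}")
```

The threshold matters as much as `k`: without it, the unrelated invoice chunk would still be returned whenever fewer than `k` good matches exist, and the model would be handed irrelevant context.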
RAG is best when:
- Your knowledge base is large and changes frequently
- You need source citations ("this answer comes from document X")
- You want to reduce hallucinations on factual questions by grounding answers in retrieved text
- You're working with a budget under $10k and need results in days, not weeks
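Source citations come almost for free with RAG if each retrieved chunk carries its origin into the prompt. The sketch below shows one way to do that. The `RetrievedDoc` type, its `source` field, and the bracketed-number convention are assumptions for illustration, not part of any specific library.

```python
from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    source: str   # e.g. a file name or URL (hypothetical field)
    content: str


def build_cited_prompt(question: str, docs: list[RetrievedDoc]) -> str:
    """Number each chunk so the model can cite it as [1], [2], ..."""
    blocks = [
        f"[{i}] (source: {doc.source})\n{doc.content}"
        for i, doc in enumerate(docs, start=1)
    ]
    context = "\n\n".join(blocks)
    return (
        "Answer the question using only the numbered context below. "
        "Cite the chunks you used, like [1].\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "If the context doesn't contain enough information to answer, say so clearly."
    )


docs = [
    RetrievedDoc("api-keys.md", "Go to Settings > API Keys > Regenerate."),
    RetrievedDoc("security.md", "Regenerating a key invalidates the old key."),
]
print(build_cited_prompt("How do I reset my API key?", docs))
```

Because the numbers map back to `docs`, the application can turn a `[2]` in the answer into a link to `security.md` without trusting the model to reproduce the file name.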
What Fine-Tuning Actually Does
Fine-tuning modifies the model weights using examples of the behavior you want. The model learns patterns, style, format, and domain-specific reasoning.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload training data (the context manager closes the file handle afterwards)
with open("training_data.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")

# Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,
        "batch_size": 4,
    },
)
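Creating the job only queues it; training runs asynchronously. A minimal polling sketch follows, assuming the OpenAI Python SDK's `client.fine_tuning.jobs.retrieve` call and its `status` and `fine_tuned_model` fields. The helper name `wait_for_fine_tune` and the 30-second interval are hypothetical choices, and the `sleep` parameter exists only so the loop can be exercised without waiting.

```python
import time

# Statuses after which the job will not change again.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_fine_tune(client, job_id: str, poll_seconds: float = 30.0, sleep=time.sleep) -> str:
    """Poll a fine-tuning job until it finishes; return the fine-tuned model name."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        if job.status in TERMINAL_STATUSES:
            break
        sleep(poll_seconds)
    if job.status != "succeeded":
        raise RuntimeError(f"Fine-tuning job {job_id} ended with status {job.status!r}")
    return job.fine_tuned_model


# Usage, continuing from the job created above (needs the openai package and an API key):
# model_name = wait_for_fine_tune(client, job.id)
# reply = client.chat.completions.create(
#     model=model_name,
#     messages=[{"role": "user", "content": "How do I reset my API key?"}],
# )
```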
Your training data format (JSONL). Each example occupies a single line in the file; it is wrapped here for readability:
{"messages": [
  {"role": "system", "content": "You are a support agent for ADS."},
  {"role": "user", "content": "How do I reset my API key?"},
  {"role": "assistant", "content": "To reset your API key: navigate to Settings > API Keys > Regenerate. Your old key will be invalidated immediately."}
]}
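JSONL means one JSON object per physical line, so pretty-printed examples have to be collapsed before upload. The sketch below shows one possible way to validate and serialize examples. The checks are a simplified assumption covering plain chat examples only (no tool calls), and `validate_example` and `to_jsonl` are hypothetical helper names.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}


def validate_example(example: dict) -> None:
    """Raise ValueError if the example is not a well-formed plain chat transcript."""
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example must have a non-empty 'messages' list")
    for message in messages:
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            raise ValueError("every message needs non-empty string content")
    if messages[-1]["role"] != "assistant":
        raise ValueError("the last message should be the assistant's target reply")


def to_jsonl(examples: list[dict]) -> str:
    """Serialize examples as JSONL: one compact JSON object per line."""
    lines = []
    for example in examples:
        validate_example(example)
        lines.append(json.dumps(example, ensure_ascii=False))
    return "\n".join(lines) + "\n"


examples = [
    {"messages": [
        {"role": "system", "content": "You are a support agent for ADS."},
        {"role": "user", "content": "How do I reset my API key?"},
        {"role": "assistant", "content": "To reset your API key: navigate to Settings > API Keys > Regenerate."},
    ]},
]
print(to_jsonl(examples), end="")
```

Writing the returned string to `training_data.jsonl` produces a file in the shape the upload step expects.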
Fine-tuning is best when:
- You need a specific output format consistently (JSON, structured reports)
- You're teaching a communication style or persona
- You have high-volume, latency-sensitive inference where injecting long context is too slow
- You have 500+ high-quality labeled examples
Decision Matrix
| Dimension | RAG | Fine-Tuning |
|---|---|---|
| Setup time | 1–3 days | 1–3 weeks |
| Cost | Low (retrieval + inference) | High (training) + ongoing inference |
| Knowledge freshness | Real-time (update index) | Requires re-training |
| Source attribution | Native | Difficult |
| Output format control | Moderate | Excellent |
| Hallucination risk | Lower (grounded) | Higher on facts |
| Latency | Higher (retrieval step) | Lower (no retrieval) |
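The matrix can be read as a few rules of thumb. The function below is a deliberately crude sketch of that reading, not a substitute for judgment: the `UseCase` fields, the `recommend` name, and the choice to default to RAG (because it is cheaper and faster to set up) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class UseCase:
    knowledge_changes_often: bool
    needs_source_citations: bool
    needs_strict_output_format: bool
    labeled_examples: int


def recommend(use_case: UseCase, min_examples: int = 500) -> str:
    """Return 'rag', 'fine-tuning', or 'hybrid' using the rules of thumb above."""
    wants_rag = use_case.knowledge_changes_often or use_case.needs_source_citations
    can_fine_tune = use_case.labeled_examples >= min_examples
    wants_fine_tuning = use_case.needs_strict_output_format and can_fine_tune
    if wants_rag and wants_fine_tuning:
        return "hybrid"
    if wants_fine_tuning:
        return "fine-tuning"
    return "rag"  # cheapest and fastest to set up, so it is the default here


print(recommend(UseCase(True, True, False, labeled_examples=0)))
print(recommend(UseCase(True, False, True, labeled_examples=800)))
```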
The Hybrid Approach (What We Usually Recommend)
Use RAG for factual knowledge retrieval and fine-tuning for output format and style.
- Fine-tune a small model (GPT-4o-mini, Llama 3 8B) on 500 examples of your desired output format
- Use RAG to inject current, relevant context at inference time
- The fine-tuned model knows how to respond; RAG tells it what to respond about
This combination gets you consistent format, current knowledge, lower latency than pure RAG (the prompt no longer has to carry long format instructions or few-shot examples), and lower cost than fine-tuning on all your knowledge.
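As a rough sketch of how the two pieces fit together at inference time: retrieval supplies the facts, and the fine-tuned model is called with only a short system prompt because format and tone are assumed to live in its weights. The function name `hybrid_answer` and the `retrieve` and `complete` callables are hypothetical stand-ins; in practice they would wrap your vector store and your fine-tuned model's chat endpoint.

```python
from typing import Callable


def hybrid_answer(
    question: str,
    retrieve: Callable[[str], list[str]],
    complete: Callable[[list[dict]], str],
) -> str:
    """Retrieve context (the 'what'), then let the fine-tuned model answer (the 'how')."""
    chunks = retrieve(question)
    context = "\n\n".join(chunks) if chunks else "(no relevant context found)"
    messages = [
        # Short system prompt: format and tone are assumed to come from fine-tuning.
        {"role": "system", "content": "You are a support agent for ADS."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]
    return complete(messages)


# Stand-ins so the sketch runs offline; swap in a real retriever and model client.
def fake_retrieve(question: str) -> list[str]:
    return ["To reset your API key: Settings > API Keys > Regenerate."]


def fake_complete(messages: list[dict]) -> str:
    return f"(model reply based on {len(messages)} messages)"


print(hybrid_answer("How do I reset my API key?", fake_retrieve, fake_complete))
```

Keeping the system prompt identical to the one used in the training examples is a sensible default, since the fine-tuned behavior was learned under that prompt.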
The Mistake to Avoid
Don't fine-tune to memorize facts. LLMs are notoriously bad at reliably recalling specific data (dates, numbers, names) even after fine-tuning. Use RAG for facts, fine-tuning for behavior.