Fine-Tuning
When and how to fine-tune LLMs — dataset preparation, training approaches (full, LoRA, QLoRA), evaluation of fine-tuned models, and cost-benefit vs. prompting.
Fine-tuning is the process of further training a pre-trained LLM on your own dataset to improve its performance on a specific task, domain, or style. It’s a powerful tool — but also one that’s frequently reached for too early, when better prompting or RAG would be a faster and cheaper solution.
Fine-Tune Last, Not First
The typical mistake: reaching for fine-tuning when you hit quality issues, before exhausting prompt engineering and RAG. Fine-tuning is expensive, slow to iterate, and hard to debug. Prompt engineering is free to iterate; RAG can be updated continuously. Use fine-tuning when both have been maximised.
When to Fine-Tune vs. Prompt
| Situation | Solution |
|---|---|
| Model doesn’t follow a consistent output format | Prompt engineering (few-shot examples) |
| Model lacks knowledge about your domain | RAG |
| Model is too verbose / terse for your use case | Prompt engineering (system prompt) |
| Model consistently fails at a specific reasoning pattern | Fine-tuning or chain-of-thought prompting |
| You need to internalise a specific writing style at scale | Fine-tuning |
| Response latency is critical (smaller model needed) | Fine-tune a smaller model to match frontier quality |
| API costs are prohibitive at scale | Fine-tune a smaller/cheaper model |
| Model needs to follow complex, multi-step instructions consistently | Fine-tuning |
| Domain has specialised terminology + knowledge base | Fine-tuning + RAG |
The Decision Tree
```text
Does the model have the knowledge? ─── No ──→ Use RAG
              │
             Yes
              │
Does prompt engineering fix it? ────── Yes ──→ Describe format in prompt
              │                                + add few-shot examples
              No
              │
Is it a style/format problem? ──────── Yes ──→ Fine-tune with 50-200 examples
              │
              No (core reasoning problem)
              │
Fine-tune with a large, high-quality dataset
```
Types of Fine-Tuning
Full Fine-Tuning
Updates all model weights. Maximum performance, maximum cost. Only practical for open-source models (Llama, Mistral) or if you have A100/H100 GPU access.
```python
# Using HuggingFace Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./models/finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
Requirements: 40-80GB+ GPU VRAM for 7B models, 160GB+ for 70B models.
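That VRAM figure can be sanity-checked with back-of-envelope arithmetic. A sketch under stated assumptions (fp16 weights and gradients, Adam keeping two fp32 moment estimates per parameter; activations and framework overhead excluded):

```python
# Rough VRAM budget for full fine-tuning a 7B model with Adam.
# Assumptions: fp16 weights and gradients, two fp32 Adam moments per
# parameter; activations and overhead not counted.
params = 7e9
weights_gb = params * 2 / 1e9   # fp16 weights:    14 GB
grads_gb = params * 2 / 1e9     # fp16 gradients:  14 GB
adam_gb = params * 8 / 1e9      # 2 fp32 moments:  56 GB
total_gb = weights_gb + grads_gb + adam_gb
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB
```

Memory-saving optimisers (8-bit Adam, ZeRO sharding) pull this down, which is why the requirement is quoted as a range rather than a single number.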
LoRA (Low-Rank Adaptation)
Trains a small number of additional parameters (adapter weights) while keeping the base model frozen. Dramatically reduces VRAM and computation requirements with modest quality trade-off.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # LoRA rank — higher = more capacity, more params
    lora_alpha=32,   # Scaling factor (usually 2× rank)
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints its report; no need to wrap in print()
# trainable params: 8,388,608 / 6,738,415,616 = 0.12% 🤯
```
Requirements: 10-20GB GPU VRAM for 7B models. The standard approach for most fine-tuning use cases.
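The trainable-parameter count can be reproduced by hand: each adapted weight matrix gains two low-rank factors, A (r × d_in) and B (d_out × r). A sketch, assuming a Llama-2-7B-style architecture (32 decoder layers, hidden size 4096, q_proj and v_proj adapted with r=16):

```python
# Back-of-envelope LoRA trainable-parameter count.
# Architecture values are assumptions: 32 layers, hidden size 4096,
# two adapted projections (q_proj, v_proj) per layer.
hidden, layers, r, modules = 4096, 32, 16, 2
per_matrix = r * hidden + hidden * r   # A: r×d_in, plus B: d_out×r
trainable = layers * modules * per_matrix
total = 6_738_415_616                  # Llama-2-7B parameter count
print(f"{trainable:,} / {total:,} = {trainable / total:.2%}")
# 8,388,608 / 6,738,415,616 = 0.12%
```

Halving the rank roughly halves the adapter size, which is why r is the first knob to turn when tuning capacity against overfitting.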
QLoRA (Quantised LoRA)
Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of large models (70B+) on consumer-grade hardware:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
Requirements: 6-10GB GPU VRAM for 7B models. Enables fine-tuning on a single RTX 4090 (24GB).
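The same style of arithmetic shows why QLoRA fits on consumer hardware. A sketch under assumed numbers (NF4 base weights at ~0.5 bytes/parameter, bf16 LoRA adapters of ~8.4M parameters, fp32 Adam moments kept for the adapters only):

```python
# Rough QLoRA VRAM budget for a 7B base model.
# Assumptions: NF4 base weights (~0.5 bytes/param), bf16 adapters,
# fp32 Adam moments for adapter params only.
params = 7e9
lora_params = 8.4e6                   # assumed adapter size
base_gb = params * 0.5 / 1e9          # 4-bit base weights: 3.5 GB
adapter_gb = lora_params * 2 / 1e9    # bf16 adapters
optimizer_gb = lora_params * 8 / 1e9  # fp32 Adam moments
total_gb = base_gb + adapter_gb + optimizer_gb
print(f"~{total_gb:.1f} GB before activations and KV cache")
```

The remaining headroom up to the quoted 6-10GB goes to activations, gradients, and CUDA overhead, all of which scale with batch size and sequence length.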
Managed Fine-Tuning (API-based)
For proprietary models, use the provider’s managed fine-tuning:
```typescript
// OpenAI fine-tuning
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// 1. Upload training file (JSONL format)
const file = await openai.files.create({
  file: fs.createReadStream('training-data.jsonl'),
  purpose: 'fine-tune',
});

// 2. Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: { n_epochs: 3 },
});

// 3. Monitor progress
const jobStatus = await openai.fineTuning.jobs.retrieve(job.id);
```
Dataset Preparation
The quality of your fine-tuned model is bounded by the quality of your training data. “Garbage in, garbage out” is even more true here than in traditional ML.
Data Format
Most LLM fine-tuning uses instruction-following format (JSONL):
```jsonl
{"messages": [{"role": "system", "content": "You are Aircury's code review assistant."}, {"role": "user", "content": "Review this TypeScript class for SOLID violations:\n\nclass UserService {\n  db = new PostgresDatabase();\n  ...\n}"}, {"role": "assistant", "content": "**DIP Violation Found**\n\nThe `UserService` class directly instantiates `PostgresDatabase`, violating the Dependency Inversion Principle.\n\n**Issues:**\n1. `db` is a concrete dependency, not an abstraction\n2. The class is directly coupled to PostgreSQL\n\n**Fix:** Inject a `UserRepository` interface via constructor:\n```typescript\nclass UserService {\n  constructor(private readonly users: UserRepository) {}\n}\n```"}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
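A cheap pre-flight check catches malformed records before you pay for a training run. A minimal sketch (the schema checks are illustrative, not a provider's official validator):

```python
import json

def validate_jsonl(path):
    """Return (line_number, error) pairs for records that fail basic checks."""
    errors = []
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                messages = json.loads(line)["messages"]
                # Every turn needs a role and non-empty content,
                # and the last turn must be the assistant's target response.
                assert all(m["role"] and m["content"].strip() for m in messages)
                assert messages[-1]["role"] == "assistant"
            except (json.JSONDecodeError, KeyError, AssertionError) as e:
                errors.append((i, repr(e)))
    return errors
```

Run it over the training file and refuse to start the job unless the returned list is empty.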
Dataset Size Guidelines
| Task complexity | Minimum examples | Recommended |
|---|---|---|
| Simple format/style | 50-100 | 200-500 |
| Domain knowledge | 200-500 | 1000-5000 |
| Complex reasoning | 500-1000 | 5000-10000 |
| High accuracy critical | 1000+ | 10000+ |
Collect real examples
The best training data comes from real interactions. Capture examples where: the model produced excellent outputs, expert humans wrote ideal responses, or your current system failed and you have corrected versions.
Ensure quality and diversity
- Every example must be high quality — one bad example can hurt more than 10 good ones help
- Cover the full distribution of inputs your system will see
- Include edge cases and challenging examples
- Maintain balanced representation across categories
Split and validate
- Training set: 80% of examples (model learns from these)
- Validation set: 10% of examples (monitor for overfitting during training)
- Test set: 10% of examples (evaluate final model on never-seen examples)

Clean and format

Remove duplicates, fix formatting issues, ensure instruction/response pairs are consistent. A data quality audit before training saves many hours of debugging bad fine-tuned models.
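Deduplication and the 80/10/10 split can be sketched in a few lines. Assumptions: exact-duplicate detection via a hash of the serialised record, and a seeded shuffle so the split is reproducible across runs:

```python
import hashlib
import json
import random

def dedupe_and_split(examples, seed=42):
    """Drop exact duplicates, then shuffle and split 80/10/10."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)   # seeded: same split every run
    n = len(unique)
    train_end, val_end = int(n * 0.8), int(n * 0.9)
    return unique[:train_end], unique[train_end:val_end], unique[val_end:]
```

Near-duplicate detection (e.g. normalising whitespace or embedding-based similarity) is a worthwhile next step; exact hashing only catches byte-identical records.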
Evaluating Fine-Tuned Models
Never deploy a fine-tuned model without rigorous evaluation:
- Held-out test set performance: Quality on examples the model never saw during training
- Regression testing: Does the fine-tuned model perform worse on general tasks? (Models can “forget” general capability — this is called catastrophic forgetting)
- Head-to-head comparison: Compare fine-tuned vs. base model + prompt on your eval dataset
- Edge case probing: Test adversarial inputs, out-of-distribution examples, inputs the model might refuse
```typescript
// Head-to-head evaluation
const results = await Promise.all(evalDataset.map(async (example) => {
  const [baseResponse, ftResponse] = await Promise.all([
    baseModel.complete(example.input),
    finetunedModel.complete(example.input),
  ]);
  const baseScore = await judge.evaluate(example.input, baseResponse, example.criteria);
  const ftScore = await judge.evaluate(example.input, ftResponse, example.criteria);
  return { baseScore, ftScore, improved: ftScore > baseScore };
}));

const improvementRate = results.filter(r => r.improved).length / results.length;
console.log(`Fine-tuned model improved on ${(improvementRate * 100).toFixed(1)}% of examples`);
```
Cost-Benefit Analysis
Fine-tuning has significant upfront costs but can reduce inference costs dramatically if you can shift from a large model to a smaller fine-tuned one:
| | Base model (GPT-4o) | Fine-tuned (GPT-4o-mini) |
|---|---|---|
| Per-token cost | ~$5/M tokens | ~$0.30/M tokens |
| Quality on your task | Good (via prompting) | Target: equal or better |
| Time to iterate | Minutes (change prompt) | Hours/days (retrain) |
| Update when data changes | Instant | Requires retraining |
| Breakeven volume | — | Tens of millions of tokens/month, depending on training cost |
The rule of thumb: If you’re spending >$1000/month on inference and a smaller fine-tuned model can match quality, fine-tuning has a positive ROI.
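The rule of thumb falls out of simple arithmetic. A sketch with assumed numbers: a one-off training spend of $500 and the per-token prices from the comparison table:

```python
# Breakeven sketch. Assumptions: $500 one-off fine-tuning spend;
# $5/M tokens for the base model vs $0.30/M for the fine-tuned one.
training_cost = 500.0
base_per_m, ft_per_m = 5.00, 0.30
saving_per_m = base_per_m - ft_per_m        # $4.70 saved per 1M tokens

monthly_tokens_m = 300                      # assumed volume: 300M tokens/month
monthly_saving = monthly_tokens_m * saving_per_m
months_to_recoup = training_cost / monthly_saving
print(f"Saves ${monthly_saving:,.0f}/month; recoups training in {months_to_recoup:.1f} months")
# Saves $1,410/month; recoups training in 0.4 months
```

At low volume the saving never covers the training spend plus the slower iteration loop, which is why a monthly-spend threshold is a reasonable gate before committing to fine-tuning.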