Fine-Tuning

When and how to fine-tune LLMs — dataset preparation, training approaches (full, LoRA, QLoRA), evaluation of fine-tuned models, and cost-benefit vs. prompting.


Fine-tuning is the process of further training a pre-trained LLM on your own dataset to improve its performance on a specific task, domain, or style. It’s a powerful tool — but also one that’s frequently reached for too early, when better prompting or RAG would be a faster and cheaper solution.

Fine-Tune Last, Not First

The typical mistake: reaching for fine-tuning when you hit quality issues, before exhausting prompt engineering and RAG. Fine-tuning is expensive, slow to iterate, and hard to debug. Prompt engineering is free to iterate; RAG can be updated continuously. Use fine-tuning when both have been maximised.

When to Fine-Tune vs. Prompt

| Situation | Solution |
| --- | --- |
| Model doesn’t follow a consistent output format | Prompt engineering (few-shot examples) |
| Model lacks knowledge about your domain | RAG |
| Model is too verbose / terse for your use case | Prompt engineering (system prompt) |
| Model consistently fails at a specific reasoning pattern | Fine-tuning or chain-of-thought prompting |
| You need to internalise a specific writing style at scale | Fine-tuning |
| Response latency is critical (smaller model needed) | Fine-tune a smaller model to match frontier quality |
| API costs are prohibitive at scale | Fine-tune a smaller/cheaper model |
| Model needs to follow complex, multi-step instructions consistently | Fine-tuning |
| Domain has specialised terminology + knowledge base | Fine-tuning + RAG |

The Decision Tree

Does the model have the knowledge? ─── No ──→ Use RAG
            │
           Yes
            │
Does prompt engineering fix it? ─────── Yes ──→ Describe format in prompt
            │                                   + add few-shot examples
            No
            │
Is it a style/format problem? ──────── Yes ──→ Fine-tune with 50-200 examples
            │
            No (core reasoning problem)
            │
Fine-tune with large, high-quality dataset

Types of Fine-Tuning

Full Fine-Tuning

Updates all model weights. Maximum performance, maximum cost. Only practical with open-weight models (Llama, Mistral) and access to A100/H100-class GPUs.

# Using HuggingFace Trainer
# (assumes `model`, `train_dataset` and `eval_dataset` are already prepared)
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./models/finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()

Requirements: 40-80GB+ GPU VRAM for 7B models, 160GB+ for 70B models.
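
These figures follow from simple arithmetic: mixed-precision Adam training costs roughly 2 bytes per parameter for fp16 weights, 2 for gradients, and 8 for the fp32 optimiser moments, before activations and framework overhead. A back-of-the-envelope sketch (the per-parameter byte count is a rule-of-thumb assumption, not an exact figure):

```python
def full_finetune_vram_gb(n_params: float, bytes_per_param: int = 12) -> float:
    """Rough VRAM floor for full fine-tuning with Adam in mixed precision.

    ~12 bytes/param = 2 (fp16 weights) + 2 (fp16 gradients) + 8 (fp32 Adam
    moments m and v). Activations and overhead come on top.
    """
    return n_params * bytes_per_param / 1e9

print(f"7B model:  ~{full_finetune_vram_gb(7e9):.0f} GB before activations")
print(f"70B model: ~{full_finetune_vram_gb(70e9):.0f} GB (sharded across many GPUs in practice)")
```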

LoRA (Low-Rank Adaptation)

Trains a small number of additional parameters (adapter weights) while keeping the base model frozen. Dramatically reduces VRAM and compute requirements with only a modest quality trade-off.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,               # LoRA rank — higher = more capacity, more params
    lora_alpha=32,      # Scaling factor (usually 2× rank)
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 8,388,608 / 6,738,415,616 = 0.12% 🤯

Requirements: 10-20GB GPU VRAM for 7B models. The standard approach for most fine-tuning use cases.
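
The trainable-parameter count falls straight out of the LoRA construction: each adapted linear layer of shape d_out × d_in gains two small matrices, A (r × d_in) and B (d_out × r), while the base weights stay frozen. A quick sanity check with hypothetical Llama-2-7B-style dimensions (32 layers, q_proj and v_proj both 4096 × 4096):

```python
def lora_trainable_params(r: int, d_in: int, d_out: int) -> int:
    # A is (r, d_in) and B is (d_out, r); both are trained.
    return r * d_in + d_out * r

# 32 layers x 2 target modules (q_proj, v_proj), each 4096 -> 4096, rank 16
total = 32 * 2 * lora_trainable_params(16, 4096, 4096)
print(f"{total:,} trainable params")  # 8,388,608 — a sliver of ~6.7B base weights
```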

QLoRA (Quantised LoRA)

Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of large models (70B+) on consumer-grade hardware:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)

Requirements: 6-10GB GPU VRAM for 7B models. Enables fine-tuning on a single RTX 4090 (24GB).
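
The same arithmetic explains the drop: at 4-bit, the frozen base weights cost roughly half a byte per parameter, and only the small LoRA adapters need gradients and optimiser state. A sketch of the base-weight footprint alone (adapters, KV cache and activations come on top):

```python
def quantised_base_vram_gb(n_params: float, bits: int = 4) -> float:
    """VRAM for the frozen, quantised base weights alone."""
    return n_params * bits / 8 / 1e9

print(f"7B at 4-bit: ~{quantised_base_vram_gb(7e9):.1f} GB")
print(f"7B at fp16:  ~{quantised_base_vram_gb(7e9, bits=16):.1f} GB")
```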

Managed Fine-Tuning (API-based)

For proprietary models, use the provider’s managed fine-tuning:

// OpenAI fine-tuning
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// 1. Upload training file (JSONL format)
const file = await openai.files.create({
  file: fs.createReadStream('training-data.jsonl'),
  purpose: 'fine-tune',
});

// 2. Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: { n_epochs: 3 },
});

// 3. Monitor progress
const jobStatus = await openai.fineTuning.jobs.retrieve(job.id);

Dataset Preparation

The quality of your fine-tuned model is bounded by the quality of your training data. “Garbage in, garbage out” is even more true here than in traditional ML.

Data Format

Most LLM fine-tuning uses instruction-following format (JSONL):

{"messages": [{"role": "system", "content": "You are Aircury's code review assistant."}, {"role": "user", "content": "Review this TypeScript class for SOLID violations:\n\nclass UserService {\n  db = new PostgresDatabase();\n  ...\n}"}, {"role": "assistant", "content": "**DIP Violation Found**\n\nThe `UserService` class directly instantiates `PostgresDatabase`, violating the Dependency Inversion Principle.\n\n**Issues:**\n1. `db` is a concrete dependency, not an abstraction\n2. The class is directly coupled to PostgreSQL\n\n**Fix:** Inject an `UserRepository` interface via constructor:\n```typescript\nclass UserService {\n  constructor(private readonly users: UserRepository) {}\n}\n```"}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
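
Before uploading, it's worth validating the file mechanically: every line must parse as JSON, contain a `messages` array, and end with an assistant turn. A minimal checker sketch (the checks shown are the basics, not the provider's full validation):

```python
import json

def validate_jsonl(path: str) -> list[str]:
    """Return human-readable problems found in a fine-tuning JSONL file."""
    problems = []
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f, start=1):
            if not line.strip():
                continue  # ignore blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as e:
                problems.append(f"line {i}: invalid JSON ({e})")
                continue
            messages = record.get("messages")
            if not isinstance(messages, list) or not messages:
                problems.append(f"line {i}: missing 'messages' array")
            elif messages[-1].get("role") != "assistant":
                problems.append(f"line {i}: last message must be from the assistant")
    return problems
```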

Dataset Size Guidelines

| Task complexity | Minimum examples | Recommended |
| --- | --- | --- |
| Simple format/style | 50-100 | 200-500 |
| Domain knowledge | 200-500 | 1000-5000 |
| Complex reasoning | 500-1000 | 5000-10000 |
| High accuracy critical | 1000+ | 10000+ |

Collect real examples

The best training data comes from real interactions. Capture examples where: the model produced excellent outputs, expert humans wrote ideal responses, or your current system failed and you have corrected versions.

Ensure quality and diversity

  • Every example must be high quality — one bad example can hurt more than 10 good ones help
  • Cover the full distribution of inputs your system will see
  • Include edge cases and challenging examples
  • Keep representation balanced across categories

Split and validate

Training set:   80% of examples  (model learns from these)
Validation set: 10% of examples  (monitor for overfitting during training)
Test set:       10% of examples  (evaluate final model on never-seen examples)
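
The split itself is a few lines: shuffle once with a fixed seed so it is reproducible, then slice. A minimal sketch:

```python
import random

def split_dataset(examples: list, seed: int = 42):
    """Shuffle and split into 80% train / 10% validation / 10% test."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    train_end = int(n * 0.8)
    val_end = int(n * 0.9)
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```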

Clean and format

Remove duplicates, fix formatting issues, ensure instruction/response pairs are consistent. A data quality audit before training saves many hours of debugging bad fine-tuned models.

Evaluating Fine-Tuned Models

Never deploy a fine-tuned model without rigorous evaluation:

  1. Held-out test set performance: Quality on examples the model never saw during training
  2. Regression testing: Does the fine-tuned model perform worse on general tasks? (Models can “forget” general capability — this is called catastrophic forgetting)
  3. Head-to-head comparison: Compare fine-tuned vs. base model + prompt on your eval dataset
  4. Edge case probing: Test adversarial inputs, out-of-distribution examples, inputs the model might refuse

// Head-to-head evaluation (assumes `baseModel`, `finetunedModel` and an LLM `judge` wrapper exist)
const results = await Promise.all(evalDataset.map(async (example) => {
  const [baseResponse, ftResponse] = await Promise.all([
    baseModel.complete(example.input),
    finetunedModel.complete(example.input),
  ]);
  
  const baseScore = await judge.evaluate(example.input, baseResponse, example.criteria);
  const ftScore = await judge.evaluate(example.input, ftResponse, example.criteria);
  
  return { baseScore, ftScore, improved: ftScore > baseScore };
}));

const improvementRate = results.filter(r => r.improved).length / results.length;
console.log(`Fine-tuned model improved on ${(improvementRate * 100).toFixed(1)}% of examples`);

Cost-Benefit Analysis

Fine-tuning has significant upfront costs but can reduce inference costs dramatically if you can shift from a large model to a smaller fine-tuned one:

|  | Base model (GPT-4o) | Fine-tuned (GPT-4o-mini) |
| --- | --- | --- |
| Per-token cost | ~$5/M tokens | ~$0.30/M tokens |
| Quality on your task | Good (via prompting) | Target: equal or better |
| Time to iterate | Minutes (change prompt) | Hours/days (retrain) |
| Update when data changes | Instant | Requires retraining |
| Breakeven volume |  | ~1M tokens/month |

The rule of thumb: If you’re spending >$1000/month on inference and a smaller fine-tuned model can match quality, fine-tuning has a positive ROI.
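
The breakeven volume can be sanity-checked with simple arithmetic, given a one-off training bill and the per-token prices above. A sketch, assuming a $5 training cost (plausible for a few hundred short examples at current small-model training rates, but an illustrative figure, not a quote):

```python
def breakeven_tokens(training_cost_usd: float,
                     base_price_per_m: float,
                     ft_price_per_m: float) -> float:
    """Tokens at which cumulative inference savings repay the one-off training cost."""
    savings_per_m = base_price_per_m - ft_price_per_m
    return training_cost_usd / savings_per_m * 1_000_000

# e.g. $5 of training, $5.00/M base vs $0.30/M fine-tuned
tokens = breakeven_tokens(5, 5.00, 0.30)
print(f"~{tokens / 1e6:.1f}M tokens to recoup the training cost")
```

At larger training bills the breakeven scales linearly, which is why the calculation is worth redoing for your own dataset size and provider pricing.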