Fine-Tuning
When and how to fine-tune LLMs — dataset preparation, training approaches (full, LoRA, QLoRA), evaluation of fine-tuned models, and cost-benefit vs. prompting.
Fine-tuning is the process of further training a pre-trained LLM on your own dataset to improve its performance on a specific task, domain, or style. It’s a powerful tool — but also one that’s frequently reached for too early, when better prompting or RAG would be a faster and cheaper solution.
Fine-Tune Last, Not First
The typical mistake: reaching for fine-tuning when you hit quality issues, before exhausting prompt engineering and RAG. Fine-tuning is expensive, slow to iterate, and hard to debug. Prompt engineering is free to iterate; RAG can be updated continuously. Use fine-tuning when both have been maximised.
When to Fine-Tune vs. Prompt
| Situation | Solution |
|---|---|
| Model doesn’t follow a consistent output format | Prompt engineering (few-shot examples) |
| Model lacks knowledge about your domain | RAG |
| Model is too verbose / terse for your use case | Prompt engineering (system prompt) |
| Model consistently fails at a specific reasoning pattern | Fine-tuning or chain-of-thought prompting |
| You need to internalise a specific writing style at scale | Fine-tuning |
| Response latency is critical (smaller model needed) | Fine-tune a smaller model to match frontier quality |
| API costs are prohibitive at scale | Fine-tune a smaller/cheaper model |
| Model needs to follow complex, multi-step instructions consistently | Fine-tuning |
| Domain has specialised terminology + knowledge base | Fine-tuning + RAG |
The Decision Tree
```text
Does the model have the knowledge? ─── No ──→ Use RAG
              │
             Yes
              │
Does prompt engineering fix it? ────── Yes ──→ Describe format in prompt
              │                                + add few-shot examples
              No
              │
Is it a style/format problem? ──────── Yes ──→ Fine-tune with 50-200 examples
              │
              No (core reasoning problem)
              │
Fine-tune with a large, high-quality dataset
```
Types of Fine-Tuning
Full Fine-Tuning
Updates all model weights. Maximum performance, maximum cost. Only practical for open-source models (Llama, Mistral) or if you have A100/H100 GPU access.
```python
# Using HuggingFace Trainer
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='./models/finetuned',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    fp16=True,
    save_strategy='epoch',
    evaluation_strategy='epoch',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
Requirements: 40-80GB+ GPU VRAM for 7B models, 160GB+ for 70B models.
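That VRAM figure can be sanity-checked with back-of-envelope arithmetic. A sketch under stated assumptions (fp16 weights and gradients, Adam keeping two fp32 moment estimates per parameter; activations and framework overhead excluded):

```python
# Rough VRAM budget for full fine-tuning a 7B model with Adam.
# Assumptions: fp16 weights and gradients, two fp32 Adam moments per
# parameter; activations and overhead not counted.
params = 7e9
weights_gb = params * 2 / 1e9   # fp16 weights:    14 GB
grads_gb = params * 2 / 1e9     # fp16 gradients:  14 GB
adam_gb = params * 8 / 1e9      # 2 fp32 moments:  56 GB
total_gb = weights_gb + grads_gb + adam_gb
print(f"~{total_gb:.0f} GB before activations")  # ~84 GB
```

Memory-saving optimisers (8-bit Adam, ZeRO sharding) pull this down, which is why the requirement is quoted as a range rather than a single number.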
LoRA (Low-Rank Adaptation)
Trains a small number of additional parameters (adapter weights) while keeping the base model frozen. Dramatically reduces VRAM and computation requirements with modest quality trade-off.
```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,            # LoRA rank — higher = more capacity, more params
    lora_alpha=32,   # Scaling factor (usually 2× rank)
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # prints its report; no need to wrap in print()
# trainable params: 8,388,608 / 6,738,415,616 = 0.12% 🤯
```
Requirements: 10-20GB GPU VRAM for 7B models. The standard approach for most fine-tuning use cases.
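The trainable-parameter count can be reproduced by hand: each adapted weight matrix gains two low-rank factors, A (r × d_in) and B (d_out × r). A sketch, assuming a Llama-2-7B-style architecture (32 decoder layers, hidden size 4096, q_proj and v_proj adapted with r=16):

```python
# Back-of-envelope LoRA trainable-parameter count.
# Architecture values are assumptions: 32 layers, hidden size 4096,
# two adapted projections (q_proj, v_proj) per layer.
hidden, layers, r, modules = 4096, 32, 16, 2
per_matrix = r * hidden + hidden * r   # A: r×d_in, plus B: d_out×r
trainable = layers * modules * per_matrix
total = 6_738_415_616                  # Llama-2-7B parameter count
print(f"{trainable:,} / {total:,} = {trainable / total:.2%}")
# 8,388,608 / 6,738,415,616 = 0.12%
```

Halving the rank roughly halves the adapter size, which is why r is the first knob to turn when tuning capacity against overfitting.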
QLoRA (Quantised LoRA)
Combines LoRA with 4-bit quantisation of the base model, enabling fine-tuning of large models (70B+) on consumer-grade hardware:
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb_config,
    device_map="auto",
)
```
Requirements: 6-10GB GPU VRAM for 7B models. Enables fine-tuning on a single RTX 4090 (24GB).
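The same style of arithmetic shows why QLoRA fits on consumer hardware. A sketch under assumed numbers (NF4 base weights at ~0.5 bytes/parameter, bf16 LoRA adapters of ~8.4M parameters, fp32 Adam moments kept for the adapters only):

```python
# Rough QLoRA VRAM budget for a 7B base model.
# Assumptions: NF4 base weights (~0.5 bytes/param), bf16 adapters,
# fp32 Adam moments for adapter params only.
params = 7e9
lora_params = 8.4e6                   # assumed adapter size
base_gb = params * 0.5 / 1e9          # 4-bit base weights: 3.5 GB
adapter_gb = lora_params * 2 / 1e9    # bf16 adapters
optimizer_gb = lora_params * 8 / 1e9  # fp32 Adam moments
total_gb = base_gb + adapter_gb + optimizer_gb
print(f"~{total_gb:.1f} GB before activations and KV cache")
```

The remaining headroom up to the quoted 6-10GB goes to activations, gradients, and CUDA overhead, all of which scale with batch size and sequence length.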
Managed Fine-Tuning (API-based)
For proprietary models, use the provider’s managed fine-tuning:
```typescript
// OpenAI fine-tuning
import fs from 'node:fs';
import OpenAI from 'openai';

const openai = new OpenAI();

// 1. Upload training file (JSONL format)
const file = await openai.files.create({
  file: fs.createReadStream('training-data.jsonl'),
  purpose: 'fine-tune',
});

// 2. Create fine-tuning job
const job = await openai.fineTuning.jobs.create({
  training_file: file.id,
  model: 'gpt-4o-mini-2024-07-18',
  hyperparameters: { n_epochs: 3 },
});

// 3. Monitor progress
const jobStatus = await openai.fineTuning.jobs.retrieve(job.id);
```
Dataset Preparation
The quality of your fine-tuned model is bounded by the quality of your training data. “Garbage in, garbage out” is even more true here than in traditional ML.
Data Format
Most LLM fine-tuning uses instruction-following format (JSONL):
```jsonl
{"messages": [{"role": "system", "content": "You are Aircury's code review assistant."}, {"role": "user", "content": "Review this TypeScript class for SOLID violations:\n\nclass UserService {\n  db = new PostgresDatabase();\n  ...\n}"}, {"role": "assistant", "content": "**DIP Violation Found**\n\nThe `UserService` class directly instantiates `PostgresDatabase`, violating the Dependency Inversion Principle.\n\n**Issues:**\n1. `db` is a concrete dependency, not an abstraction\n2. The class is directly coupled to PostgreSQL\n\n**Fix:** Inject a `UserRepository` interface via constructor:\n```typescript\nclass UserService {\n  constructor(private readonly users: UserRepository) {}\n}\n```"}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
```
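A cheap pre-flight check catches malformed records before you pay for a training run. A minimal sketch (the schema checks are illustrative, not a provider's official validator):

```python
import json

def validate_jsonl(path):
    """Return (line_number, error) pairs for records that fail basic checks."""
    errors = []
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f, 1):
            if not line.strip():
                continue
            try:
                messages = json.loads(line)["messages"]
                # Every turn needs a role and non-empty content,
                # and the last turn must be the assistant's target response.
                assert all(m["role"] and m["content"].strip() for m in messages)
                assert messages[-1]["role"] == "assistant"
            except (json.JSONDecodeError, KeyError, AssertionError) as e:
                errors.append((i, repr(e)))
    return errors
```

Run it over the training file and refuse to start the job unless the returned list is empty.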
Dataset Size Guidelines
| Task complexity | Minimum examples | Recommended |
|---|---|---|
| Simple format/style | 50-100 | 200-500 |
| Domain knowledge | 200-500 | 1000-5000 |
| Complex reasoning | 500-1000 | 5000-10000 |
| High accuracy critical | 1000+ | 10000+ |
Collect real examples
The best training data comes from real interactions. Capture examples where: the model produced excellent outputs, expert humans wrote ideal responses, or your current system failed and you have corrected versions.
Ensure quality and diversity
- Every example must be high quality — one bad example can hurt more than 10 good ones help
- Cover the full distribution of inputs your system will see
- Include edge cases and challenging examples
- Maintain balanced representation across categories
Split and validate
- Training set: 80% of examples (model learns from these)
- Validation set: 10% of examples (monitor for overfitting during training)
- Test set: 10% of examples (evaluate final model on never-seen examples)

Clean and format

Remove duplicates, fix formatting issues, ensure instruction/response pairs are consistent. A data quality audit before training saves many hours of debugging bad fine-tuned models.
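Deduplication and the 80/10/10 split can be sketched in a few lines. Assumptions: exact-duplicate detection via a hash of the serialised record, and a seeded shuffle so the split is reproducible across runs:

```python
import hashlib
import json
import random

def dedupe_and_split(examples, seed=42):
    """Drop exact duplicates, then shuffle and split 80/10/10."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)   # seeded: same split every run
    n = len(unique)
    train_end, val_end = int(n * 0.8), int(n * 0.9)
    return unique[:train_end], unique[train_end:val_end], unique[val_end:]
```

Near-duplicate detection (e.g. normalising whitespace or embedding-based similarity) is a worthwhile next step; exact hashing only catches byte-identical records.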
Evaluating Fine-Tuned Models
Never deploy a fine-tuned model without rigorous evaluation:
- Held-out test set performance: Quality on examples the model never saw during training
- Regression testing: Does the fine-tuned model perform worse on general tasks? (Models can “forget” general capability — this is called catastrophic forgetting)
- Head-to-head comparison: Compare fine-tuned vs. base model + prompt on your eval dataset
- Edge case probing: Test adversarial inputs, out-of-distribution examples, inputs the model might refuse
```typescript
// Head-to-head evaluation
const results = await Promise.all(evalDataset.map(async (example) => {
  const [baseResponse, ftResponse] = await Promise.all([
    baseModel.complete(example.input),
    finetunedModel.complete(example.input),
  ]);
  const baseScore = await judge.evaluate(example.input, baseResponse, example.criteria);
  const ftScore = await judge.evaluate(example.input, ftResponse, example.criteria);
  return { baseScore, ftScore, improved: ftScore > baseScore };
}));

const improvementRate = results.filter(r => r.improved).length / results.length;
console.log(`Fine-tuned model improved on ${(improvementRate * 100).toFixed(1)}% of examples`);
```
Cost-Benefit Analysis
Fine-tuning has significant upfront costs but can reduce inference costs dramatically if you can shift from a large model to a smaller fine-tuned one:
| | Base model (GPT-4o) | Fine-tuned (GPT-4o-mini) |
|---|---|---|
| Per-token cost | ~$5/M tokens | ~$0.30/M tokens |
| Quality on your task | Good (via prompting) | Target: equal or better |
| Time to iterate | Minutes (change prompt) | Hours/days (retrain) |
| Update when data changes | Instant | Requires retraining |
| Breakeven volume | — | Tens of millions of tokens/month, depending on training cost |
The rule of thumb: If you’re spending >$1000/month on inference and a smaller fine-tuned model can match quality, fine-tuning has a positive ROI.
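The rule of thumb falls out of simple arithmetic. A sketch with assumed numbers: a one-off training spend of $500 and the per-token prices from the comparison table:

```python
# Breakeven sketch. Assumptions: $500 one-off fine-tuning spend;
# $5/M tokens for the base model vs $0.30/M for the fine-tuned one.
training_cost = 500.0
base_per_m, ft_per_m = 5.00, 0.30
saving_per_m = base_per_m - ft_per_m        # $4.70 saved per 1M tokens

monthly_tokens_m = 300                      # assumed volume: 300M tokens/month
monthly_saving = monthly_tokens_m * saving_per_m
months_to_recoup = training_cost / monthly_saving
print(f"Saves ${monthly_saving:,.0f}/month; recoups training in {months_to_recoup:.1f} months")
# Saves $1,410/month; recoups training in 0.4 months
```

At low volume the saving never covers the training spend plus the slower iteration loop, which is why a monthly-spend threshold is a reasonable gate before committing to fine-tuning.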