Evaluation & Evals
How to measure AI output quality systematically — metrics, evaluation frameworks, LLM-as-judge, and building eval datasets that give you confidence before you ship.
Evaluation is the most under-invested area in AI Engineering and the most consequential. Without a rigorous evaluation strategy, you’re shipping blind — you don’t know if your system is working well, degrading, or failing for a subset of users. Evals are to AI systems what tests are to traditional software: the mechanism that separates guessing from knowing.
The Most Common AI Engineering Mistake
Shipping an LLM-powered feature without evals. You’ll get anecdotal feedback from a few test cases, feel good, ship, and discover in production that 20% of real-world inputs produce degraded output. Evals catch this before it reaches users.
What Are Evals?
Evals (evaluations) are structured tests for AI system quality. Unlike unit tests that verify specific outputs, evals probe the system’s quality across a distribution of inputs.
Unit Test: given(X) → expect(Y) [deterministic pass/fail]
Eval: over(dataset) → quality_score ≥ threshold [probabilistic]
A good eval suite tells you:
- Overall quality: What percentage of outputs meet quality standards?
- Failure modes: Which categories of inputs produce bad outputs?
- Regression detection: Did a prompt change make things better or worse?
- Model comparison: Is GPT-4o better than Claude 3.5 for this task?
Types of Evaluation Metrics
Exact Match
For tasks with a single correct answer (classification, extraction):
def exact_match_score(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels), "prediction/label count mismatch"
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
Use when: Sentiment classification, entity extraction, yes/no questions, structured output parsing.
Reference-Based Metrics
BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated text and reference text. Originally designed for machine translation; sometimes applied to other generation tasks that have reference outputs.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures recall of n-grams from the reference in the output. Better for summarisation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)
# scores['rouge1'].fmeasure → ROUGE-1 F1
Limitations: These metrics don’t capture semantic similarity or factual accuracy. A paraphrase that means the same thing scores poorly; a fluent-sounding factual error scores well. Use as signals, not ground truth.
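To make the paraphrase problem concrete, here is a toy unigram-F1 scorer — a deliberate simplification of ROUGE-1, not the `rouge_score` implementation — showing how a faithful paraphrase gets punished:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Toy ROUGE-1-style F1: overlap of unigrams between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
print(unigram_f1(reference, "the cat sat on the mat"))       # 1.0 — identical wording
print(unigram_f1(reference, "a feline rested upon the rug")) # ≈ 0.17 — same meaning, low score
```

The paraphrase scores near zero despite being semantically equivalent — exactly the failure mode that motivates the embedding-based metrics below.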
Semantic Similarity
Embedding-based metrics catch quality that n-gram metrics miss:
import { OpenAI } from 'openai';

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticSimilarity(text1: string, text2: string): Promise<number> {
  const openai = new OpenAI();
  const [emb1, emb2] = await Promise.all([
    openai.embeddings.create({ model: 'text-embedding-3-small', input: text1 }),
    openai.embeddings.create({ model: 'text-embedding-3-small', input: text2 }),
  ]);
  return cosineSimilarity(emb1.data[0].embedding, emb2.data[0].embedding);
}
Human Evaluation
The gold standard. Humans judge whether outputs are correct, helpful, and appropriate.
| Format | Description | Best for |
|---|---|---|
| Binary | Good / Bad | First-pass screening |
| Rating scale | 1-5 on specific dimensions | Detailed quality assessment |
| Preference | A vs B, which is better? | Model comparison, A/B testing prompts |
| Error taxonomy | Categorise the type of error | Diagnosing systematic failures |
Human eval is expensive — use it to calibrate automatic metrics, not for continuous evaluation.
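Human labels are themselves noisy, so before using them to calibrate automatic metrics it's worth checking that raters agree with each other. A minimal Cohen's kappa for two raters with binary good/bad labels (a sketch — real pipelines typically use a stats library):

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning binary labels (0 = bad, 1 = good)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

Kappa near 0 means your raters agree no better than chance — fix the rating rubric before trusting the labels.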
LLM-as-Judge
Use a powerful LLM (GPT-4o, Claude Opus) to evaluate the output of another LLM. This scales to large volumes while preserving semantic understanding.
const judgePrompt = `
You are evaluating the quality of an AI assistant's response to a user question.
Question: ${question}
AI Response: ${response}
Expected criteria:
- Factually accurate
- Directly answers the question
- Appropriate length (not too long or short)
- No hallucinations
Rate the response quality on a scale of 1-5, where:
1 = Completely wrong or harmful
3 = Partially correct with significant issues
5 = Excellent, fully meets criteria
Respond with JSON: { "score": number, "reasoning": string }
`;
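In practice, judge models don't always return clean JSON — they may wrap it in a Markdown fence or add surrounding commentary. A defensive parser (hypothetical helper, not part of any SDK) extracts and validates the verdict:

```python
import json
import re

def parse_judge_verdict(raw: str) -> dict:
    """Extract a {"score": ..., "reasoning": ...} object from a judge response,
    tolerating Markdown fences and surrounding prose."""
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object found in judge output: {raw!r}")
    verdict = json.loads(match.group(0))
    if not 1 <= verdict.get("score", 0) <= 5:
        raise ValueError(f"Score out of range: {verdict}")
    return verdict

fenced = '```json\n{"score": 4, "reasoning": "Accurate, slightly long."}\n```'
print(parse_judge_verdict(fenced))  # {'score': 4, 'reasoning': 'Accurate, slightly long.'}
```

Raising on malformed or out-of-range verdicts (rather than silently defaulting to a score) keeps a broken judge from quietly passing your eval suite.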
LLM Judge Biases
LLM judges have known biases: they favour longer answers, favour their own outputs, and prefer confident-sounding text even when wrong. Mitigate by: using a different model family as judge than the one being evaluated, providing explicit rubrics, and spot-checking judge decisions against human ratings.
Building an Eval Dataset
An eval dataset is a set of (input, expected_output) or (input, evaluation_criteria) pairs.
Collect representative inputs
Start with real user inputs if available. If building from scratch:
- Write 20-30 inputs covering the typical distribution of use cases
- Include 5-10 edge cases (unusual formats, ambiguous requests, empty inputs)
- Include 5-10 adversarial cases (inputs that might cause hallucination or failure)
Define expected outputs or criteria
For each input, define what “good” looks like:
- Classification tasks: The correct label
- Generation tasks: A rubric (must mention X, must not be longer than Y, must be factually accurate about Z)
- Extraction tasks: The exact values that should be extracted
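Concretely, a common storage format is JSON Lines — one example per line, mixing exact expected outputs and rubric-style criteria. The field names and example content below are illustrative, not a standard:

```python
import json

examples = [
    {"input": "I want a refund for order #1234", "expected_label": "refund_request"},
    {"input": "Summarise our return policy",
     "criteria": ["mentions the return window", "under 100 words", "no invented policies"]},
]

# Write the dataset, one JSON object per line
with open("evals.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reload it — each line parses independently, so the file diffs cleanly in git
with open("evals.jsonl") as f:
    dataset = [json.loads(line) for line in f]

print(len(dataset))  # 2
```

JSONL's line-per-example structure also makes "add the production failure to the eval set" a one-line append.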
Establish a baseline
Run your current system on the eval set. Record the scores. This is your baseline. Every future change should be measured against it.
Grow the dataset over time
When you find a production failure, add it to the eval set immediately. This is called adversarial data collection — your eval dataset should grow to capture every failure mode you discover.
Integrating Evals into the Development Workflow
Evals shouldn’t live in a spreadsheet you run once. They should be part of your CI/CD pipeline.
// eval-runner.ts — runs on every prompt change
import { EvalDataset } from './eval-dataset';
import { llmJudge } from './judges/llm-judge';
import { system } from './system'; // the LLM pipeline under test

const QUALITY_THRESHOLD = 0.85; // 85% of cases must pass

async function runEvals(): Promise<void> {
  const dataset = await EvalDataset.load('./evals/customer-support.jsonl');
  const scores: number[] = [];

  for (const example of dataset) {
    const response = await system.process(example.input);
    const score = await llmJudge.evaluate(example.input, response, example.criteria);
    scores.push(score);
  }

  // A case "passes" if the judge scores it 3 or above on the 1-5 scale
  const passRate = scores.filter(s => s >= 3).length / scores.length;

  if (passRate < QUALITY_THRESHOLD) {
    console.error(`Eval failed: ${(passRate * 100).toFixed(1)}% pass rate (threshold: ${QUALITY_THRESHOLD * 100}%)`);
    process.exit(1);
  }
  console.log(`Evals passed: ${(passRate * 100).toFixed(1)}% pass rate ✓`);
}

runEvals();
The Eval-Driven Prompt Improvement Loop
Establish baseline scores
↓
Make a prompt change
↓
Run evals against full dataset
↓
Compare scores to baseline
↓
Did scores improve overall?
    → Yes → Keep change
    → No  → Investigate regressions, iterate
This loop makes prompt optimisation empirical and systematic, replacing “feels better” with “scored better on N examples.”
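The "compare scores to baseline" step is worth doing per example, not just in aggregate, so a net improvement can't hide individual regressions. A sketch, assuming scores are stored as `example_id → score` maps (an assumed format, not a standard):

```python
def compare_to_baseline(baseline: dict[str, float], current: dict[str, float]) -> dict:
    """Per-example diff of eval scores against a stored baseline."""
    improved = [k for k in baseline if current.get(k, 0.0) > baseline[k]]
    regressed = [k for k in baseline if current.get(k, 0.0) < baseline[k]]
    return {
        "mean_delta": sum(current.get(k, 0.0) - baseline[k] for k in baseline) / len(baseline),
        "improved": improved,
        "regressed": regressed,
    }

result = compare_to_baseline(
    {"ex1": 3.0, "ex2": 5.0, "ex3": 4.0},
    {"ex1": 4.0, "ex2": 4.0, "ex3": 4.0},
)
print(result["regressed"])  # ['ex2'] — mean score is unchanged, but ex2 got worse
```

Here the mean delta is zero, yet one example regressed — exactly the signal that "investigate regressions, iterate" depends on.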