Evaluation & Evals
How to measure AI output quality systematically — metrics, evaluation frameworks, LLM-as-judge, and building eval datasets that give you confidence before you ship.
Evaluation is the most under-invested area in AI Engineering and the most consequential. Without a rigorous evaluation strategy, you’re shipping blind — you don’t know if your system is working well, degrading, or failing for a subset of users. Evals are to AI systems what tests are to traditional software: the mechanism that separates guessing from knowing.
The Most Common AI Engineering Mistake
Shipping an LLM-powered feature without evals. You’ll get anecdotal feedback from a few test cases, feel good, ship, and discover in production that 20% of real-world inputs produce degraded output. Evals catch this before it reaches users.
What Are Evals?
Evals (evaluations) are structured tests for AI system quality. Unlike unit tests that verify specific outputs, evals probe the system’s quality across a distribution of inputs.
Unit Test: given(X) → expect(Y) [deterministic pass/fail]
Eval: over(dataset) → quality_score ≥ threshold [probabilistic]
A good eval suite tells you:
- Overall quality: What percentage of outputs meet quality standards?
- Failure modes: Which categories of inputs produce bad outputs?
- Regression detection: Did a prompt change make things better or worse?
- Model comparison: Is GPT-4o better than Claude 3.5 for this task?
Types of Evaluation Metrics
Exact Match
For tasks with a single correct answer (classification, extraction):
def exact_match_score(predictions: list[str], labels: list[str]) -> float:
    assert len(predictions) == len(labels), "prediction/label count mismatch"
    correct = sum(p == l for p, l in zip(predictions, labels))
    return correct / len(labels)
Use when: Sentiment classification, entity extraction, yes/no questions, structured output parsing.
Reference-Based Metrics
BLEU (Bilingual Evaluation Understudy): Measures n-gram overlap between generated text and reference text. Originally designed for machine translation; sometimes applied to other generation tasks that have reference outputs.
ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Measures recall of n-grams from the reference in the output. Better for summarisation.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, generated)
# scores['rouge1'].fmeasure → ROUGE-1 F1
Limitations: These metrics don’t capture semantic similarity or factual accuracy. A paraphrase that means the same thing scores poorly; a fluent-sounding factual error scores well. Use as signals, not ground truth.
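To make the paraphrase problem concrete, here is a toy unigram-F1 scorer — a deliberate simplification of ROUGE-1, not the `rouge_score` implementation — showing how a faithful paraphrase gets punished:

```python
from collections import Counter

def unigram_f1(reference: str, candidate: str) -> float:
    """Toy ROUGE-1-style F1: overlap of unigrams between reference and candidate."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the cat sat on the mat"
print(unigram_f1(reference, "the cat sat on the mat"))       # 1.0 — identical wording
print(unigram_f1(reference, "a feline rested upon the rug")) # ≈ 0.17 — same meaning, low score
```

The paraphrase scores near zero despite being semantically equivalent — exactly the failure mode that motivates the embedding-based metrics below.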
Semantic Similarity
Embedding-based metrics catch quality that n-gram metrics miss:
import { OpenAI } from 'openai';

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

async function semanticSimilarity(text1: string, text2: string): Promise<number> {
  const openai = new OpenAI();
  const [emb1, emb2] = await Promise.all([
    openai.embeddings.create({ model: 'text-embedding-3-small', input: text1 }),
    openai.embeddings.create({ model: 'text-embedding-3-small', input: text2 }),
  ]);
  return cosineSimilarity(emb1.data[0].embedding, emb2.data[0].embedding);
}
Human Evaluation
The gold standard. Humans judge whether outputs are correct, helpful, and appropriate.
| Format | Description | Best for |
|---|---|---|
| Binary | Good / Bad | First-pass screening |
| Rating scale | 1-5 on specific dimensions | Detailed quality assessment |
| Preference | A vs B, which is better? | Model comparison, A/B testing prompts |
| Error taxonomy | Categorise the type of error | Diagnosing systematic failures |
Human eval is expensive — use it to calibrate automatic metrics, not for continuous evaluation.
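Human labels are themselves noisy, so before using them to calibrate automatic metrics it's worth checking that raters agree with each other. A minimal Cohen's kappa for two raters with binary good/bad labels (a sketch — real pipelines typically use a stats library):

```python
def cohens_kappa(rater_a: list[int], rater_b: list[int]) -> float:
    """Cohen's kappa for two raters assigning binary labels (0 = bad, 1 = good)."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's label frequencies
    p_a1 = sum(rater_a) / n
    p_b1 = sum(rater_b) / n
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))  # 0.5
```

Kappa near 0 means your raters agree no better than chance — fix the rating rubric before trusting the labels.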
LLM-as-Judge
Use a powerful LLM (GPT-4o, Claude Opus) to evaluate the output of another LLM. This scales to large volumes while preserving semantic understanding.
const judgePrompt = `
You are evaluating the quality of an AI assistant's response to a user question.
Question: ${question}
AI Response: ${response}
Expected criteria:
- Factually accurate
- Directly answers the question
- Appropriate length (not too long or short)
- No hallucinations
Rate the response quality on a scale of 1-5, where:
1 = Completely wrong or harmful
3 = Partially correct with significant issues
5 = Excellent, fully meets criteria
Respond with JSON: { "score": number, "reasoning": string }
`;
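In practice, judge models don't always return clean JSON — they may wrap it in a Markdown fence or add surrounding commentary. A defensive parser (hypothetical helper, not part of any SDK) extracts and validates the verdict:

```python
import json
import re

def parse_judge_verdict(raw: str) -> dict:
    """Extract a {"score": ..., "reasoning": ...} object from a judge response,
    tolerating Markdown fences and surrounding prose."""
    match = re.search(r'\{.*\}', raw, re.DOTALL)
    if not match:
        raise ValueError(f"No JSON object found in judge output: {raw!r}")
    verdict = json.loads(match.group(0))
    if not 1 <= verdict.get("score", 0) <= 5:
        raise ValueError(f"Score out of range: {verdict}")
    return verdict

fenced = '```json\n{"score": 4, "reasoning": "Accurate, slightly long."}\n```'
print(parse_judge_verdict(fenced))  # {'score': 4, 'reasoning': 'Accurate, slightly long.'}
```

Raising on malformed or out-of-range verdicts (rather than silently defaulting to a score) keeps a broken judge from quietly passing your eval suite.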
LLM Judge Biases
LLM judges have known biases: they favour longer answers, favour their own outputs, and prefer confident-sounding text even when wrong. Mitigate by: using a different model family as judge than the one being evaluated, providing explicit rubrics, and spot-checking judge decisions against human ratings.
Building an Eval Dataset
An eval dataset is a set of (input, expected_output) or (input, evaluation_criteria) pairs.
Collect representative inputs
Start with real user inputs if available. If building from scratch:
- Write 20-30 inputs covering the typical distribution of use cases
- Include 5-10 edge cases (unusual formats, ambiguous requests, empty inputs)
- Include 5-10 adversarial cases (inputs that might cause hallucination or failure)
Define expected outputs or criteria
For each input, define what “good” looks like:
- Classification tasks: The correct label
- Generation tasks: A rubric (must mention X, must not be longer than Y, must be factually accurate about Z)
- Extraction tasks: The exact values that should be extracted
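Concretely, a common storage format is JSON Lines — one example per line, mixing exact expected outputs and rubric-style criteria. The field names and example content below are illustrative, not a standard:

```python
import json

examples = [
    {"input": "I want a refund for order #1234", "expected_label": "refund_request"},
    {"input": "Summarise our return policy",
     "criteria": ["mentions the return window", "under 100 words", "no invented policies"]},
]

# Write the dataset, one JSON object per line
with open("evals.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Reload it — each line parses independently, so the file diffs cleanly in git
with open("evals.jsonl") as f:
    dataset = [json.loads(line) for line in f]

print(len(dataset))  # 2
```

JSONL's line-per-example structure also makes "add the production failure to the eval set" a one-line append.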
Establish a baseline
Run your current system on the eval set. Record the scores. This is your baseline. Every future change should be measured against it.
Grow the dataset over time
When you find a production failure, add it to the eval set immediately. This is called adversarial data collection — your eval dataset should grow to capture every failure mode you discover.
Integrating Evals into the Development Workflow
Evals shouldn’t live in a spreadsheet you run once. They should be part of your CI/CD pipeline.
// eval-runner.ts — runs on every prompt change
import { EvalDataset } from './eval-dataset';
import { llmJudge } from './judges/llm-judge';
import { system } from './system'; // the LLM pipeline under test

const QUALITY_THRESHOLD = 0.85; // 85% of cases must pass

async function runEvals(): Promise<void> {
  const dataset = await EvalDataset.load('./evals/customer-support.jsonl');
  const scores: number[] = [];

  for (const example of dataset) {
    const response = await system.process(example.input);
    const score = await llmJudge.evaluate(example.input, response, example.criteria);
    scores.push(score);
  }

  // A case "passes" if the judge scores it 3 or above on the 1-5 scale
  const passRate = scores.filter(s => s >= 3).length / scores.length;

  if (passRate < QUALITY_THRESHOLD) {
    console.error(`Eval failed: ${(passRate * 100).toFixed(1)}% pass rate (threshold: ${QUALITY_THRESHOLD * 100}%)`);
    process.exit(1);
  }
  console.log(`Evals passed: ${(passRate * 100).toFixed(1)}% pass rate ✓`);
}

runEvals();
The Eval-Driven Prompt Improvement Loop
Establish baseline scores
↓
Make a prompt change
↓
Run evals against full dataset
↓
Compare scores to baseline
↓
Did scores improve overall?
    → Yes → Keep change
    → No  → Investigate regressions, iterate
This loop makes prompt optimisation empirical and systematic, replacing “feels better” with “scored better on N examples.”
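The "compare scores to baseline" step is worth doing per example, not just in aggregate, so a net improvement can't hide individual regressions. A sketch, assuming scores are stored as `example_id → score` maps (an assumed format, not a standard):

```python
def compare_to_baseline(baseline: dict[str, float], current: dict[str, float]) -> dict:
    """Per-example diff of eval scores against a stored baseline."""
    improved = [k for k in baseline if current.get(k, 0.0) > baseline[k]]
    regressed = [k for k in baseline if current.get(k, 0.0) < baseline[k]]
    return {
        "mean_delta": sum(current.get(k, 0.0) - baseline[k] for k in baseline) / len(baseline),
        "improved": improved,
        "regressed": regressed,
    }

result = compare_to_baseline(
    {"ex1": 3.0, "ex2": 5.0, "ex3": 4.0},
    {"ex1": 4.0, "ex2": 4.0, "ex3": 4.0},
)
print(result["regressed"])  # ['ex2'] — mean score is unchanged, but ex2 got worse
```

Here the mean delta is zero, yet one example regressed — exactly the signal that "investigate regressions, iterate" depends on.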