AI Engineering Foundations
What AI Engineering is, how LLMs actually work at a conceptual level, the model landscape, and the mental models that help you work effectively with AI systems.
AI Engineering is the discipline of building reliable, production-grade software systems that incorporate AI capabilities, particularly Large Language Models (LLMs). It's a relatively new field, but the patterns and problems are recurring enough that it's worth understanding them properly.
What is AI Engineering
AI Engineering sits at the intersection of software engineering and AI/ML. It's distinct from traditional ML Engineering or Data Science:
| Discipline | Primary focus | Typical outputs |
|---|---|---|
| Data Science | Insights from data, statistical modelling | Notebooks, reports, models |
| ML Engineering | Training, evaluating, deploying ML models | Training pipelines, model artifacts |
| AI Engineering | Building products and systems using AI | Applications, pipelines, APIs |
An AI Engineer typically doesn't train models from scratch. They use frontier models (GPT, Claude, Gemini, Llama) as components integrated into larger systems. The core challenges are different:
- Reliability: LLMs are probabilistic. How do you build reliable systems on non-deterministic components?
- Evaluation: How do you know the system is working well? Traditional unit tests aren't sufficient.
- Cost: LLM inference at scale is expensive. How do you optimise usage without sacrificing quality?
- Safety: How do you prevent the system from producing harmful, incorrect, or inappropriate output?
The central challenge
Traditional software is deterministic: given input X, it always produces output Y. LLMs are stochastic: given input X, they produce outputs drawn from a probability distribution. Building reliable systems with probabilistic components is the central challenge of AI Engineering.
How LLMs work (an engineerβs mental model)
You don't need to understand backpropagation to be an effective AI Engineer. You do need a mental model of how LLMs behave.
The completion machine
At its core, an LLM is a function: text → next_token_probability_distribution. Given a sequence of text (the prompt), it predicts what token is most likely to come next. LLMs are trained to optimise this prediction on enormous amounts of text.
This has an important implication: LLMs don't "know" things; they predict what plausible text would come next. When an LLM gives you a wrong answer with full confidence, it's because that kind of wrong-but-confident text exists in the training data.
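The "function from text to a distribution over next tokens" view can be made concrete with a toy sketch. The vocabulary and probabilities below are invented for illustration; real models produce a distribution over tens of thousands of tokens.

```python
# Toy illustration of the completion-machine mental model: a prompt
# maps to a probability distribution over candidate next tokens.
# These tokens and probabilities are made up, not from any real model.
def next_token_distribution(prompt: str) -> dict[str, float]:
    if prompt.endswith("The capital of France is"):
        return {" Paris": 0.92, " Lyon": 0.05, " a": 0.03}
    return {" the": 0.4, " a": 0.3, " an": 0.3}

def greedy_next_token(prompt: str) -> str:
    # Greedy decoding: always pick the highest-probability token.
    dist = next_token_distribution(prompt)
    return max(dist, key=dist.get)

print(greedy_next_token("The capital of France is"))  # " Paris"
```

Note that even the "wrong" continuations get probability mass; sampling (rather than greedy decoding) is what makes real model output vary between runs.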
The context window
Every LLM has a context window: the amount of text it can "see" at once when generating a response. This is measured in tokens (roughly 4 characters or about three-quarters of a word per token in English).
Context Window = System Prompt + Chat History + Your Current Message + Space for Response
What matters here:
- Information outside the context window doesn't exist to the model
- Older parts of a long conversation have less weight (recency bias)
- Context windows have grown significantly: from 4K tokens (GPT-3.5) to 200K+ (Claude 3.5)
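The budget equation above can be sketched as a simple check, using the rough 4-characters-per-token heuristic from the text. The window and reserve sizes are illustrative defaults, not any provider's actual limits; real tokenizers give exact counts.

```python
# Rough context-window budgeting. The 4-chars-per-token heuristic
# and the default window/reserve sizes are illustrative only.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(system_prompt: str, history: list[str],
                    message: str, *, window: int = 200_000,
                    reserve_for_response: int = 4_096) -> bool:
    # Context = system prompt + chat history + current message,
    # plus headroom reserved for the model's response.
    used = (estimate_tokens(system_prompt)
            + sum(estimate_tokens(m) for m in history)
            + estimate_tokens(message))
    return used + reserve_for_response <= window
```

In practice you would use the provider's tokenizer for exact counts and truncate or summarise the oldest history messages when the check fails.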
Temperature and sampling
Temperature controls how much the model's output varies:
- `temperature: 0.0` → near-deterministic, picks the highest-probability token
- `temperature: 1.0` → standard sampling from the probability distribution
- `temperature: 2.0` → high variation, can produce incoherent output
For coding or factual tasks: low temperature (0.0β0.3). For creative writing or brainstorming: higher temperature (0.7β1.0).
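Mechanically, temperature divides the model's logits before sampling: low values sharpen the distribution toward the top token, high values flatten it. A minimal sketch, with invented logit values:

```python
import math
import random

# Temperature-scaled sampling over token logits. The logits passed
# in are invented for illustration; real models emit one per
# vocabulary entry.
def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    if temperature == 0.0:
        # Degenerate case: greedy, near-deterministic decoding.
        return max(logits, key=logits.get)
    # Divide logits by temperature, then apply a softmax
    # (shifted by the max for numerical stability).
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(v - m) for t, v in scaled.items()}
    # Weighted random choice proportional to softmax weight.
    r = rng.random() * sum(weights.values())
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # floating-point edge case fallback
```

Raising the temperature increases the chance of sampling lower-probability tokens, which is exactly why high temperatures suit brainstorming and low temperatures suit factual or coding tasks.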
The LLM landscape
Model families
| Provider | Family | Strengths |
|---|---|---|
| Anthropic | Claude 3.5/4.x | Instruction-following, code, long context, safety |
| OpenAI | GPT-4o, o1, o3 | Broad capability, tool use, vision, reasoning |
| Google | Gemini 1.5/2.x | Very long context (1M tokens), multimodal, Google integration |
| Meta | Llama 3.x | Open weights, self-hostable, privacy, fine-tuning |
| Mistral | Mistral/Mixtral | Efficient, multilingual, self-hostable |
Choosing a model
The right model depends on:
- Task type: code generation, analysis, creative writing, summarisation, classification
- Context needs: how much text needs to be in context at once?
- Privacy requirements: can data leave your infrastructure? (No → self-hosted Llama)
- Cost: tokens × price. Claude Haiku << Claude Opus in cost, comparable on many tasks.
- Latency: smaller models are faster. Streaming helps with perceived latency.
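The tokens × price arithmetic is worth doing early. A back-of-the-envelope sketch; the model names and per-million-token prices below are hypothetical placeholders, not any provider's actual price sheet:

```python
# Hypothetical (input, output) prices in USD per 1M tokens.
# Placeholder numbers only -- check your provider's price sheet.
PRICES = {
    "small-model": (0.25, 1.25),
    "large-model": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int,
                 output_tokens: int) -> float:
    # Cost = input tokens x input price + output tokens x output price.
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# e.g. 10,000 requests/month at 2K input / 500 output tokens each:
monthly = 10_000 * request_cost("large-model", 2_000, 500)
```

Output tokens are usually priced several times higher than input tokens, which is why capping response length matters as much as trimming prompts.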
Multimodal capabilities
Modern frontier models are multimodal: they process text, images, audio, and sometimes video:
- Vision: analyse screenshots, diagrams, photos (GPT-4o, Claude 3.5, Gemini)
- Code interpretation: execute code, analyse outputs (Code Interpreter / GPT-4o)
- Voice: real-time speech-to-speech (GPT-4o Realtime)
AI Engineering core competencies
A well-rounded AI Engineer should understand:
┌───────────────────────────────────────────────────┐
│                  AI Engineering                   │
├───────────────┬──────────────┬────────────────────┤
│  Prompting    │  Evaluation  │  System Design     │
│               │              │                    │
│  • Prompt     │  • Metrics   │  • RAG             │
│    patterns   │  • Evals     │  • Agents          │
│  • System     │  • Datasets  │  • Fine-tuning     │
│    prompts    │  • LLM judge │  • Orchestration   │
├───────────────┴──────────────┴────────────────────┤
│               Operations (LLMOps)                 │
│   Monitoring · Cost · Reliability · Deployment    │
└───────────────────────────────────────────────────┘
Each of these areas has its own page in this wiki. Start with Prompt Engineering: it's the foundation everything else builds on.
A shift in perspective
Working with LLMs requires moving from deterministic to probabilistic thinking:
| Traditional engineering | AI Engineering |
|---|---|
| "Does the test pass?" | "Does the output quality distribution meet our threshold?" |
| "Is the function correct?" | "Is the system reliable across the full input distribution?" |
| "Deploy and monitor for errors" | "Evaluate before deploy, then monitor for quality drift" |
| "Debug the specific failure" | "Identify systematic failure modes in the model's behaviour" |
The evaluation gap
The most common mistake when starting in AI Engineering is shipping without evaluations. Traditional tests verify specific inputs. LLM-based systems need evals: a test set that probes the system's quality across the distribution of real inputs. Without evals, you're operating blind.
That's why Evaluation is one of the most important skills. Evals get built before launch, and quality gets monitored continuously afterward.
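A minimal eval harness is little more than a fixed test set, a call into the system, and a scoring rule. In this sketch `ask_llm` is a stand-in for your real model call, and the scoring is simple keyword matching; real evals typically combine exact checks with LLM-as-judge grading.

```python
# Minimal eval-harness sketch. `ask_llm` is a hypothetical stand-in
# for a real API call; the eval case is illustrative.
def ask_llm(prompt: str) -> str:
    return "Paris is the capital of France."

EVAL_SET = [
    {"prompt": "What is the capital of France?", "must_contain": "Paris"},
]

def run_evals(eval_set: list[dict]) -> float:
    # Score each case: did the output contain the required answer?
    passed = sum(
        1 for case in eval_set
        if case["must_contain"].lower() in ask_llm(case["prompt"]).lower()
    )
    return passed / len(eval_set)

score = run_evals(EVAL_SET)
# Gate deploys on a threshold, e.g. require score >= 0.95.
```

The key discipline is treating the eval set like a test suite: version it, grow it from real failures, and run it on every prompt or model change.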