AI Engineering Foundations

What AI Engineering is, how LLMs actually work at a conceptual level, the model landscape, and the mental models that help you work effectively with AI systems.


AI Engineering is the discipline of building reliable, production-grade software systems that incorporate AI capabilities, particularly Large Language Models (LLMs). It’s a relatively new field, but the patterns and problems are recurring enough that it’s worth understanding them properly.

What is AI Engineering

AI Engineering sits at the intersection of software engineering and AI/ML. It’s distinct from traditional ML Engineering or Data Science:

| Discipline | Primary focus | Typical outputs |
|---|---|---|
| Data Science | Insights from data, statistical modelling | Notebooks, reports, models |
| ML Engineering | Training, evaluating, deploying ML models | Training pipelines, model artifacts |
| AI Engineering | Building products and systems using AI | Applications, pipelines, APIs |

An AI Engineer typically doesn’t train models from scratch. They use frontier models (GPT, Claude, Gemini, Llama) as components integrated into larger systems. The core challenges are different:

  • Reliability: LLMs are probabilistic. How do you build reliable systems on non-deterministic components?
  • Evaluation: How do you know the system is working well? Traditional unit tests aren’t sufficient.
  • Cost: LLM inference at scale is expensive. How do you optimise usage without sacrificing quality?
  • Safety: How do you prevent the system from producing harmful, incorrect, or inappropriate output?

The central challenge

Traditional software is deterministic: given input X, it always produces output Y. LLMs are stochastic: given input X, they produce outputs drawn from a probability distribution. Building reliable systems with probabilistic components is the central challenge of AI Engineering.

How LLMs work (an engineer’s mental model)

You don’t need to understand backpropagation to be an effective AI Engineer. You do need a mental model of how LLMs behave.

The completion machine

At its core, an LLM is a function: text → next_token_probability_distribution. Given a sequence of text (the prompt), it assigns a probability to every possible next token. LLMs are trained to optimise this prediction on enormous amounts of text.

This has an important implication: LLMs don’t “know” things; they predict what plausible text would come next. When an LLM gives you a wrong answer with full confidence, it’s because confident-sounding text is a statistically plausible continuation, not because the model has verified the claim.
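The completion-machine view can be made concrete with a toy distribution. Everything here is invented for illustration; a real model produces a distribution over tens of thousands of tokens, not four:

```python
# Toy illustration: an LLM maps a prompt to a probability
# distribution over candidate next tokens (values invented).
# Imagine the prompt was "The capital of France is".
next_token_probs = {
    "Paris": 0.72,
    "Lyon": 0.09,
    "the": 0.05,
    "London": 0.03,
}

# Greedy decoding: always pick the highest-probability token.
best_token = max(next_token_probs, key=next_token_probs.get)
print(best_token)  # Paris
```

Note that nothing in this process checks facts: the model simply ranks continuations by plausibility, which is why confident wrong answers are possible.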

The context window

Every LLM has a context window: the amount of text it can “see” at once when generating a response. This is measured in tokens (roughly 4 characters, or about three-quarters of a word, per token in English).

Context Window = System Prompt + Chat History + Your Current Message + Space for Response

What matters here:

  • Information outside the context window doesn’t exist to the model
  • Attention is uneven in practice: content at the start and end of the context tends to be used more reliably than content buried in the middle of a long conversation
  • Context windows have grown significantly: from 4K tokens (GPT-3.5) to 200K+ (Claude 3.5)
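The budget arithmetic above can be sketched as follows. The four-characters-per-token heuristic is only an approximation (real APIs count tokens with the model’s own tokenizer), and the window and response-budget numbers are illustrative:

```python
def estimate_tokens(text: str) -> int:
    # Crude heuristic: ~4 characters per token for English text.
    # A real tokenizer gives exact counts.
    return max(1, len(text) // 4)

CONTEXT_WINDOW = 200_000   # e.g. a Claude 3.5-class model
RESPONSE_BUDGET = 4_096    # tokens reserved for the model's reply

def remaining_budget(system_prompt: str, history: str, message: str) -> int:
    # Context window = system prompt + history + current message + response space.
    used = sum(estimate_tokens(t) for t in (system_prompt, history, message))
    return CONTEXT_WINDOW - RESPONSE_BUDGET - used

print(remaining_budget("You are helpful.",
                       "user: hi\nassistant: hello",
                       "Summarise this."))
```

Anything that doesn’t fit in this budget must be dropped or summarised before the request is sent; the model never sees it.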

Temperature and sampling

Temperature controls how much the model’s output varies:

  • temperature: 0.0 → near-deterministic, picks the highest-probability token
  • temperature: 1.0 → standard sampling from the probability distribution
  • temperature: 2.0 → high variation, can produce incoherent output

For coding or factual tasks: low temperature (0.0–0.3). For creative writing or brainstorming: higher temperature (0.7–1.0).
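One way to picture what temperature does: the model’s raw scores (logits) are divided by the temperature before being converted to probabilities, so low temperatures sharpen the distribution and high temperatures flatten it. The logits below are invented for illustration:

```python
import math

def softmax_with_temperature(logits, temperature):
    # Divide logits by temperature, then apply softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # invented scores for three candidate tokens

cold = softmax_with_temperature(logits, 0.2)  # near-deterministic
hot = softmax_with_temperature(logits, 2.0)   # much flatter

# At low temperature the top token dominates; at high temperature
# lower-ranked tokens get a meaningful share of the probability mass.
print(cold[0] > hot[0])
```

This is why temperature 0.0 behaves almost deterministically: as temperature approaches zero, essentially all probability mass collapses onto the single highest-scoring token.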

The LLM landscape

Model families

| Provider | Family | Strengths |
|---|---|---|
| Anthropic | Claude 3.5/4.x | Instruction-following, code, long context, safety |
| OpenAI | GPT-4o, o1, o3 | Broad capability, tool use, vision, reasoning |
| Google | Gemini 1.5/2.x | Very long context (1M tokens), multimodal, Google integration |
| Meta | Llama 3.x | Open weights, self-hostable, privacy, fine-tuning |
| Mistral | Mistral/Mixtral | Efficient, multilingual, self-hostable |

Choosing a model

It depends on:

  1. Task type: code generation, analysis, creative writing, summarisation, classification
  2. Context needs: how much text needs to be in context at once?
  3. Privacy requirements: can data leave your infrastructure? (No β†’ self-hosted Llama)
  4. Cost: tokens × price. Claude Haiku costs a small fraction of Claude Opus and is comparable on many tasks.
  5. Latency: smaller models are faster. Streaming helps with perceived latency.
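The cost arithmetic in point 4 is just token counts times per-token prices, with input and output usually priced differently. A sketch with invented model names and prices (always check your provider’s current pricing page):

```python
# Invented example prices, in USD per million tokens.
PRICES = {
    "small-model": {"input": 0.25, "output": 1.25},
    "large-model": {"input": 15.00, "output": 75.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    # Input and output tokens are billed at different rates.
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# 10,000 requests/day at 2,000 input + 500 output tokens each:
daily = 10_000 * request_cost("large-model", 2_000, 500)
print(f"${daily:,.2f}/day")
```

Running the same workload through the cheap model, and reserving the expensive one for the hard cases, is one of the most common cost optimisations in practice.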

Multimodal capabilities

Modern frontier models are multimodal, processing text, images, audio, and sometimes video:

  • Vision: analyse screenshots, diagrams, photos (GPT-4o, Claude 3.5, Gemini)
  • Code interpretation: execute code, analyse outputs (Code Interpreter / GPT-4o)
  • Voice: real-time speech-to-speech (GPT-4o Realtime)

AI Engineering core competencies

A well-rounded AI Engineer should understand:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                AI Engineering                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚   Prompting  β”‚  Evaluation  β”‚   System Design    β”‚
β”‚              β”‚              β”‚                    β”‚
β”‚ β€’ Prompt     β”‚ β€’ Metrics    β”‚ β€’ RAG              β”‚
β”‚   patterns  β”‚ β€’ Evals      β”‚ β€’ Agents           β”‚
β”‚ β€’ System     β”‚ β€’ Datasets   β”‚ β€’ Fine-tuning      β”‚
β”‚   prompts   β”‚ β€’ LLM judge  β”‚ β€’ Orchestration    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚            Operations (LLMOps)                   β”‚
β”‚   Monitoring Β· Cost Β· Reliability Β· Deployment   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Each of these areas has its own page in this wiki. Start with Prompt Engineering β€” it’s the foundation everything else builds on.

A shift in perspective

Working with LLMs requires moving from deterministic to probabilistic thinking:

| Traditional engineering | AI Engineering |
|---|---|
| "Does the test pass?" | "Does the output quality distribution meet our threshold?" |
| "Is the function correct?" | "Is the system reliable across the full input distribution?" |
| "Deploy and monitor for errors" | "Evaluate before deploy, then monitor for quality drift" |
| "Debug the specific failure" | "Identify systematic failure modes in the model's behaviour" |

The evaluation gap

The most common mistake when starting in AI Engineering is shipping without evaluations. Traditional tests verify specific inputs. LLM-based systems need evals: a test set that probes the system’s quality across the distribution of real inputs. Without evals, you’re operating blind.
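The shape of an eval harness can be sketched in a few lines. Here `fake_model` stands in for a real LLM call, and the test cases and pass-rate threshold are invented; the point is that you gate on an aggregate quality score over a test set rather than asserting on individual outputs:

```python
def fake_model(prompt: str) -> str:
    # Stand-in for a real LLM call; a toy sentiment classifier.
    return "positive" if "love" in prompt else "negative"

# A miniature eval set: (input, expected output) pairs drawn
# from the distribution of real inputs you expect in production.
EVAL_SET = [
    ("I love this product", "positive"),
    ("This is terrible", "negative"),
    ("Absolutely love it", "positive"),
]

def run_evals(model, cases, threshold=0.9):
    # Score the system across the whole set and gate on the pass rate.
    passed = sum(model(prompt) == expected for prompt, expected in cases)
    pass_rate = passed / len(cases)
    return pass_rate, pass_rate >= threshold

rate, ok = run_evals(fake_model, EVAL_SET)
print(rate, ok)  # 1.0 True
```

Real evals replace the exact-match comparison with task-appropriate scoring (similarity metrics, rubric checks, or an LLM judge), but the structure, a dataset plus an aggregate threshold, stays the same.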

That’s why Evaluation is one of the most important skills. Evals get built before launch, and quality gets monitored continuously afterward.