AI Engineering Foundations
What AI Engineering is, how LLMs actually work at a conceptual level, the model landscape, and the mental models that help you work effectively with AI systems.
AI Engineering is the discipline of building reliable, production-grade software systems that incorporate AI capabilities, particularly Large Language Models (LLMs). It's a relatively new field, but the patterns and problems are recurring enough that it's worth understanding them properly.
What is AI Engineering
AI Engineering sits at the intersection of software engineering and AI/ML. It's distinct from traditional ML Engineering or Data Science:
| Discipline | Primary focus | Typical outputs |
|---|---|---|
| Data Science | Insights from data, statistical modelling | Notebooks, reports, models |
| ML Engineering | Training, evaluating, deploying ML models | Training pipelines, model artifacts |
| AI Engineering | Building products and systems using AI | Applications, pipelines, APIs |
An AI Engineer typically doesn't train models from scratch. They use frontier models (GPT, Claude, Gemini, Llama) as components integrated into larger systems. The core challenges are different:
- Reliability: LLMs are probabilistic. How do you build reliable systems on non-deterministic components?
- Evaluation: How do you know the system is working well? Traditional unit tests aren't sufficient.
- Cost: LLM inference at scale is expensive. How do you optimise usage without sacrificing quality?
- Safety: How do you prevent the system from producing harmful, incorrect, or inappropriate output?
The central challenge
Traditional software is deterministic: given input X, it always produces output Y. LLMs are stochastic: given input X, they produce outputs drawn from a probability distribution. Building reliable systems with probabilistic components is the central challenge of AI Engineering.
How LLMs work (an engineerβs mental model)
You don't need to understand backpropagation to be an effective AI Engineer. You do need a mental model of how LLMs behave.
The completion machine
At its core, an LLM is a function: text → next_token_probability_distribution. Given a sequence of text (the prompt), it predicts what token is most likely to come next. LLMs are trained to optimise this prediction on enormous amounts of text.
This has an important implication: LLMs don't "know" things; they predict what plausible text would come next. When an LLM gives you a wrong answer with full confidence, it's because that kind of wrong-but-confident text exists in the training data.
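The "function from text to a distribution over next tokens" view can be made concrete with a toy sketch. The vocabulary and probabilities below are invented for illustration; real models produce a distribution over tens of thousands of tokens.

```python
# Toy illustration of the completion-machine mental model: a prompt
# maps to a probability distribution over candidate next tokens.
# These tokens and probabilities are made up, not from any real model.
def next_token_distribution(prompt: str) -> dict[str, float]:
    if prompt.endswith("The capital of France is"):
        return {" Paris": 0.92, " Lyon": 0.05, " a": 0.03}
    return {" the": 0.4, " a": 0.3, " an": 0.3}

def greedy_next_token(prompt: str) -> str:
    # Greedy decoding: always pick the highest-probability token.
    dist = next_token_distribution(prompt)
    return max(dist, key=dist.get)

print(greedy_next_token("The capital of France is"))  # " Paris"
```

Note that even the "wrong" continuations get probability mass; sampling (rather than greedy decoding) is what makes real model output vary between runs.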
The context window
Every LLM has a context window: the amount of text it can "see" at once when generating a response. This is measured in tokens (roughly 4 characters or about three-quarters of a word per token in English).
Context Window = System Prompt + Chat History + Your Current Message + Space for Response
What matters here:
- Information outside the context window doesn't exist to the model
- Older parts of a long conversation have less weight (recency bias)
- Context windows have grown significantly: from 4K tokens (GPT-3.5) to 200K+ (Claude 3.5)
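The budget equation above can be sketched as a simple check, using the rough 4-characters-per-token heuristic from the text. The window and reserve sizes are illustrative defaults, not any provider's actual limits; real tokenizers give exact counts.

```python
# Rough context-window budgeting. The 4-chars-per-token heuristic
# and the default window/reserve sizes are illustrative only.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def fits_in_context(system_prompt: str, history: list[str],
                    message: str, *, window: int = 200_000,
                    reserve_for_response: int = 4_096) -> bool:
    # Context = system prompt + chat history + current message,
    # plus headroom reserved for the model's response.
    used = (estimate_tokens(system_prompt)
            + sum(estimate_tokens(m) for m in history)
            + estimate_tokens(message))
    return used + reserve_for_response <= window
```

In practice you would use the provider's tokenizer for exact counts and truncate or summarise the oldest history messages when the check fails.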
Temperature and sampling
Temperature controls how much the model's output varies:
- `temperature: 0.0` → near-deterministic, picks the highest-probability token
- `temperature: 1.0` → standard sampling from the probability distribution
- `temperature: 2.0` → high variation, can produce incoherent output
For coding or factual tasks: low temperature (0.0β0.3). For creative writing or brainstorming: higher temperature (0.7β1.0).
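Mechanically, temperature divides the model's logits before sampling: low values sharpen the distribution toward the top token, high values flatten it. A minimal sketch, with invented logit values:

```python
import math
import random

# Temperature-scaled sampling over token logits. The logits passed
# in are invented for illustration; real models emit one per
# vocabulary entry.
def sample(logits: dict[str, float], temperature: float,
           rng: random.Random) -> str:
    if temperature == 0.0:
        # Degenerate case: greedy, near-deterministic decoding.
        return max(logits, key=logits.get)
    # Divide logits by temperature, then apply a softmax
    # (shifted by the max for numerical stability).
    scaled = {t: v / temperature for t, v in logits.items()}
    m = max(scaled.values())
    weights = {t: math.exp(v - m) for t, v in scaled.items()}
    # Weighted random choice proportional to softmax weight.
    r = rng.random() * sum(weights.values())
    for token, w in weights.items():
        r -= w
        if r <= 0:
            return token
    return token  # floating-point edge case fallback
```

Raising the temperature increases the chance of sampling lower-probability tokens, which is exactly why high temperatures suit brainstorming and low temperatures suit factual or coding tasks.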
The LLM landscape
Model families
| Provider | Family | Strengths |
|---|---|---|
| Anthropic | Claude 3.5/4.x | Instruction-following, code, long context, safety |
| OpenAI | GPT-4o, o1, o3 | Broad capability, tool use, vision, reasoning |
| Google | Gemini 1.5/2.x | Very long context (1M tokens), multimodal, Google integration |
| Meta | Llama 3.x | Open weights, self-hostable, privacy, fine-tuning |
| Mistral | Mistral/Mixtral | Efficient, multilingual, self-hostable |
Choosing a model
The right model depends on:
- Task type: code generation, analysis, creative writing, summarisation, classification
- Context needs: how much text needs to be in context at once?
- Privacy requirements: can data leave your infrastructure? (No → self-hosted Llama)
- Cost: tokens × price. Claude Haiku << Claude Opus in cost, comparable on many tasks.
- Latency: smaller models are faster. Streaming helps with perceived latency.
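The tokens × price arithmetic is worth doing early. A back-of-the-envelope sketch; the model names and per-million-token prices below are hypothetical placeholders, not any provider's actual price sheet:

```python
# Hypothetical (input, output) prices in USD per 1M tokens.
# Placeholder numbers only -- check your provider's price sheet.
PRICES = {
    "small-model": (0.25, 1.25),
    "large-model": (15.00, 75.00),
}

def request_cost(model: str, input_tokens: int,
                 output_tokens: int) -> float:
    # Cost = input tokens x input price + output tokens x output price.
    inp, out = PRICES[model]
    return (input_tokens * inp + output_tokens * out) / 1_000_000

# e.g. 10,000 requests/month at 2K input / 500 output tokens each:
monthly = 10_000 * request_cost("large-model", 2_000, 500)
```

Output tokens are usually priced several times higher than input tokens, which is why capping response length matters as much as trimming prompts.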
Multimodal capabilities
Modern frontier models are multimodal: they process text, images, audio, and sometimes video:
- Vision: analyse screenshots, diagrams, photos (GPT-4o, Claude 3.5, Gemini)
- Code interpretation: execute code, analyse outputs (Code Interpreter / GPT-4o)
- Voice: real-time speech-to-speech (GPT-4o Realtime)
AI Engineering core competencies
A well-rounded AI Engineer should understand:
┌───────────────────────────────────────────────────┐
│                  AI Engineering                   │
├───────────────┬──────────────┬────────────────────┤
│  Prompting    │  Evaluation  │  System Design     │
│               │              │                    │
│  • Prompt     │  • Metrics   │  • RAG             │
│    patterns   │  • Evals     │  • Agents          │
│  • System     │  • Datasets  │  • Fine-tuning     │
│    prompts    │  • LLM judge │  • Orchestration   │
├───────────────┴──────────────┴────────────────────┤
│               Operations (LLMOps)                 │
│   Monitoring · Cost · Reliability · Deployment    │
└───────────────────────────────────────────────────┘
Each of these areas has its own page in this wiki. Start with Prompt Engineering: it's the foundation everything else builds on.
A shift in perspective
Working with LLMs requires moving from deterministic to probabilistic thinking:
| Traditional engineering | AI Engineering |
|---|---|
| "Does the test pass?" | "Does the output quality distribution meet our threshold?" |
| "Is the function correct?" | "Is the system reliable across the full input distribution?" |
| "Deploy and monitor for errors" | "Evaluate before deploy, then monitor for quality drift" |
| "Debug the specific failure" | "Identify systematic failure modes in the model's behaviour" |
The evaluation gap
The most common mistake when starting in AI Engineering is shipping without evaluations. Traditional tests verify specific inputs. LLM-based systems need evals: a test set that probes the system's quality across the distribution of real inputs. Without evals, you're operating blind.
That's why Evaluation is one of the most important skills. Evals get built before launch, and quality gets monitored continuously afterward.
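A minimal eval harness is little more than a fixed test set, a call into the system, and a scoring rule. In this sketch `ask_llm` is a stand-in for your real model call, and the scoring is simple keyword matching; real evals typically combine exact checks with LLM-as-judge grading.

```python
# Minimal eval-harness sketch. `ask_llm` is a hypothetical stand-in
# for a real API call; the eval case is illustrative.
def ask_llm(prompt: str) -> str:
    return "Paris is the capital of France."

EVAL_SET = [
    {"prompt": "What is the capital of France?", "must_contain": "Paris"},
]

def run_evals(eval_set: list[dict]) -> float:
    # Score each case: did the output contain the required answer?
    passed = sum(
        1 for case in eval_set
        if case["must_contain"].lower() in ask_llm(case["prompt"]).lower()
    )
    return passed / len(eval_set)

score = run_evals(EVAL_SET)
# Gate deploys on a threshold, e.g. require score >= 0.95.
```

The key discipline is treating the eval set like a test suite: version it, grow it from real failures, and run it on every prompt or model change.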