Operations (LLMOps)
Running AI systems in production — monitoring, observability, cost management, rate limiting, caching, fallback strategies, and deployment patterns for LLM-powered applications.
Building an LLM feature that works in a demo is table stakes. Running it reliably in production, at scale, with acceptable cost and latency — that’s LLMOps. Production AI systems fail in unique ways that traditional software monitoring doesn’t catch: they degrade gradually rather than crashing, they fail probabilistically rather than deterministically, and their failures are often invisible to conventional health checks.
The LLMOps Stack
┌────────────────────────────────────────────────────────┐
│ Your Application │
├────────────────┬───────────────────┬────────────────────┤
│ Cost Control │ Reliability │ Quality │
│ - Caching │ - Rate limiting │ - Monitoring │
│ - Model tiers │ - Retries │ - Evals in prod │
│ - Token opt. │ - Fallbacks │ - Drift detection │
├────────────────┴───────────────────┴────────────────────┤
│ Observability Layer │
│ Traces · Spans · Logs · Metrics · Prompt logging │
├─────────────────────────────────────────────────────────┤
│ LLM Provider APIs │
│ OpenAI · Anthropic · Gemini · Self-hosted │
└─────────────────────────────────────────────────────────┘
Observability and Monitoring
Traditional APM (Application Performance Monitoring) tools don’t capture what you need for LLM systems. You need to trace:
- What was in the prompt? (inputs)
- What did the model return? (outputs)
- How long did it take? (latency)
- How many tokens were used? (cost)
- Did the output meet quality standards? (quality)
LLM-Specific Tracing
// Option 1: LangSmith (LangChain's observability platform)
import { Client } from 'langsmith';

const client = new Client({ apiKey: process.env.LANGSMITH_API_KEY });

// Option 2: Roll your own using OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api';
// `llm` and `estimateTokens` are your client and token-estimation helpers
async function tracedLLMCall(prompt: string, model: string): Promise<string> {
  const tracer = trace.getTracer('llm-service');
  return tracer.startActiveSpan('llm.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_tokens': estimateTokens(prompt),
      'llm.prompt_preview': prompt.substring(0, 200),
    });
    try {
      const start = Date.now();
      const response = await llm.complete(prompt);
      span.setAttributes({
        'llm.completion_tokens': estimateTokens(response),
        'llm.latency_ms': Date.now() - start,
        'llm.response_preview': response.substring(0, 200),
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      throw error;
    } finally {
      span.end();
    }
  });
}
Key Metrics to Track
| Metric | Why it matters | Alert threshold |
|---|---|---|
| Latency (p50/p95/p99) | User experience | p95 > 5s for most use cases |
| Token usage per request | Cost efficiency | Set budget alerts |
| Error rate | Reliability | > 1% errors warrant investigation |
| Quality score | Output quality | Below baseline from eval suite |
| Cache hit rate | Cost savings | Low rates suggest caching opportunities |
| Rate limit hits | Capacity planning | Frequent hits → need higher tier or backpressure |
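Most of these metrics can be computed from the span attributes already being traced. As a minimal sketch, assuming no metrics backend is wired up yet, an in-memory histogram (the class below is illustrative, not a real library) is enough to compute latency percentiles and check the p95 threshold from the table:

```typescript
// Minimal sketch: an in-memory histogram for per-request latency percentiles.
// In production you would use a real metrics client (Prometheus, StatsD, etc.).
class LatencyHistogram {
  private samples: number[] = [];

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
  }

  // Nearest-rank percentile over all recorded samples
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }
}

const latency = new LatencyHistogram();
[420, 380, 510, 4900, 450, 390, 470, 5300, 440, 410].forEach((ms) => latency.record(ms));

const p95 = latency.percentile(95);
if (p95 > 5000) {
  console.log(`ALERT: p95 latency ${p95}ms exceeds 5s threshold`);
}
```

A couple of slow outliers are enough to trip the p95 alert even when the median looks healthy, which is exactly why the table tracks p95/p99 rather than averages.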
Production Quality Monitoring
Unlike traditional software, quality can degrade in production without errors. Implement continuous quality sampling:
import { llmJudge } from './judges/llm-judge';
async function monitoredCompletion(input: string): Promise<string> {
  const response = await llm.complete(input);
  // Sample 5% of production traffic for quality evaluation
  if (Math.random() < 0.05) {
    // Async — doesn't block the response
    setImmediate(async () => {
      const qualityScore = await llmJudge.evaluate(input, response);
      metrics.record('llm.quality_score', qualityScore, { model: 'gpt-4o' });
      if (qualityScore < QUALITY_THRESHOLD) {
        logger.warn('Low quality response detected', { input, response, qualityScore });
        alerts.trigger('quality_degradation', { score: qualityScore });
      }
    });
  }
  return response;
}
Cost Management
LLM inference costs scale with usage in ways that traditional infrastructure doesn’t. A feature that costs $0.001 per call becomes $1000/day at 1M calls.
The Cost Formula
Cost = (input_tokens × input_price) + (output_tokens × output_price)
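The formula in code, with placeholder prices (per-million-token rates vary by model and change often, so check your provider's pricing page):

```typescript
// Per-request cost from token counts. The rates below are illustrative
// placeholders, not real pricing.
interface ModelPricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

function requestCost(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing,
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMTok +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}

// Example: a 2,000-token prompt with a 500-token response at
// hypothetical rates of $2.50 in / $10.00 out per million tokens:
const cost = requestCost(2_000, 500, { inputPerMTok: 2.5, outputPerMTok: 10 });
// 0.005 + 0.005 = roughly $0.01 per call
```

Multiplying that per-call figure by expected daily volume, before shipping, is the cheapest cost-control measure available.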
Token optimisation strategies:
Compress system prompts
Long system prompts are paid on every request. Keep them focused and concise. Consider prompt compression techniques (remove verbose explanations, use denser language).
Use the right model tier
Not all calls need the most expensive model. Route by task complexity:
function selectModel(taskType: string): string {
  const modelMap: Record<string, string> = {
    'classification': 'gpt-4o-mini', // simple → cheap model
    'summarisation': 'gpt-4o-mini',
    'code-generation': 'gpt-4o', // complex → capable model
    'architecture-review': 'claude-opus-4', // critical → best model
  };
  return modelMap[taskType] || 'gpt-4o-mini';
}
Implement semantic caching
Cache responses for semantically similar queries — not just exact string matches:
// GPTCache is one (Python) implementation of this pattern; in TypeScript,
// back `embed` and `cache` with your embedding client and vector store
async function cachedComplete(query: string): Promise<string> {
  // Embed the query and find semantically similar cached responses
  const queryEmbedding = await embed(query);
  const cachedResult = await cache.findSimilar(queryEmbedding, { threshold: 0.95 });
  if (cachedResult) {
    metrics.increment('cache.hits');
    return cachedResult.response;
  }
  const response = await llm.complete(query);
  await cache.store(queryEmbedding, query, response);
  metrics.increment('cache.misses');
  return response;
}
Cache hit rates of 30-50% are achievable for many applications, translating directly to cost reduction.
Set token budgets
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  max_tokens: 1024, // Hard cap — won't generate more
  messages: [...],
});

// Monitor when max_tokens is being hit
if (response.choices[0].finish_reason === 'length') {
  logger.warn('Response hit max_tokens limit — may be truncated');
}
Reliability Patterns
Retry with Exponential Backoff
LLM APIs have transient errors — rate limits, timeouts, 500s. Always retry:
import { retry } from 'ts-retry-promise';
const response = await retry(
  () => openai.chat.completions.create({ model: 'gpt-4o', messages }),
  {
    retries: 3,
    delay: 1000, // 1s initial delay
    backoff: 'EXPONENTIAL',
    maxBackOff: 10000, // max 10s between retries
    retryIf: (error) => {
      // Retry on rate limits and server errors, not on auth failures
      return error.status === 429 || error.status >= 500;
    },
  }
);
Fallback Chains
When your primary model is unavailable, fail over to alternatives:
async function resilientComplete(prompt: string): Promise<string> {
  const providers = [
    () => openai.complete(prompt, { model: 'gpt-4o' }),
    () => anthropic.complete(prompt, { model: 'claude-3-5-sonnet' }),
    () => openai.complete(prompt, { model: 'gpt-4o-mini' }), // last resort: cheaper model
  ];
  for (const provider of providers) {
    try {
      return await provider();
    } catch (error) {
      logger.warn('Provider failed, trying next', { error });
      continue;
    }
  }
  throw new AllProvidersFailedError();
}
Rate Limiting and Backpressure
Protect your API quota and your backend:
import Bottleneck from 'bottleneck';
const limiter = new Bottleneck({
  maxConcurrent: 10, // max 10 parallel API calls
  minTime: 100, // minimum 100ms between calls
  reservoir: 100, // token bucket size (for burst)
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000, // refill every minute
});

const response = await limiter.schedule(() =>
  openai.chat.completions.create({ model: 'gpt-4o', messages })
);
Deployment Patterns
Blue-Green Deployment for Prompts
When updating prompts in production, treat them like code deployments:
// Feature flags for prompt versions
const promptVersion = featureFlags.get('llm_prompt_version', 'v1');
const prompts = {
  v1: 'You are a helpful assistant...',
  v2: 'You are Aircury\'s specialised engineering assistant...',
};
// Gradually route traffic to new version
const response = await llm.complete(prompts[promptVersion]);
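One simple way to implement the gradual routing (a sketch, assuming a stable user ID is available; the hash and rollout knob are illustrative) is deterministic bucketing: hash each user into a 0–99 bucket and serve the new prompt version to buckets below the rollout percentage:

```typescript
// Deterministic percentage rollout: the same user always lands in the same
// bucket, so they see a consistent prompt version as you ramp up.
function bucketFor(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100; // bucket in [0, 99]
}

function promptVersionFor(userId: string, rolloutPercent: number): 'v1' | 'v2' {
  // Users in buckets below the rollout percentage get the new prompt
  return bucketFor(userId) < rolloutPercent ? 'v2' : 'v1';
}

// Start at 5%, watch the quality metrics, then ramp to 50% and 100%
const version = promptVersionFor('user-1234', 5);
```

Because the assignment is deterministic, you can compare quality scores between the v1 and v2 cohorts before ramping further, exactly as you would for a code rollout.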
Observability Tools
| Tool | Best for |
|---|---|
| LangSmith | LangChain integration, prompt management, eval tracking |
| Helicone | OpenAI proxy with logging, analytics, caching |
| Braintrust | Eval-first observability, dataset management |
| Arize Phoenix | Open source, self-hosted, LLM + ML unified |
| Grafana + OpenTelemetry | Custom metrics if you already use Grafana |
Start with Helicone or LangSmith
If you’re just adding LLM observability to an existing project, Helicone is the fastest to integrate (one-line proxy change). If you’re building a new LangChain-based system, LangSmith is the native observability solution. Both offer generous free tiers.
Agent Maturity Model
Not all teams are in the same place with agent adoption. The maturity model below describes five levels of organisational capability. Understanding which level your team is at helps you identify what to build next — and what gaps are most likely to cause problems.
| Level | Name | Description |
|---|---|---|
| 1 | Experimentation | Individual engineers testing agents on their own. No coordination, no security controls, no cost tracking. |
| 2 | Individual adoption | Regular use for personal productivity. Some informal knowledge sharing. No team standards. |
| 3 | Team integration | AGENTS.md in all repos, MCP servers for internal tools, basic review guidelines, cost tracking in place. |
| 4 | Orchestration | Multi-agent workflows, fine-grained authorisation (OpenFGA), full OpenTelemetry tracing, Conductor Model as default working style. |
| 5 | Autonomy | Background agents without human supervision, anomaly detection, auto-remediation, outcome-based review. |
Measure your floor, not your ceiling
A common mistake is measuring maturity by the most advanced capability in use rather than the weakest. A team running Level 4 orchestration with Level 1 security practices is not at Level 4 — it’s at Level 1 with a significant risk exposure. The level that matters is the level of your least mature dimension.
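In other words, maturity aggregates with min, not max. A trivial sketch (the dimension names are illustrative, not a fixed taxonomy):

```typescript
// Overall maturity is the weakest dimension, not the strongest
type MaturityDimensions = {
  tooling: number;       // agent tooling and workflows
  security: number;      // authorisation, guardrails
  observability: number; // tracing, monitoring
  costControl: number;   // budgets, alerts
};

function effectiveMaturity(d: MaturityDimensions): number {
  return Math.min(d.tooling, d.security, d.observability, d.costControl);
}

// Level 4 orchestration with Level 1 security is Level 1 overall:
const level = effectiveMaturity({ tooling: 4, security: 1, observability: 3, costControl: 3 });
// → 1
```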
Most teams in 2026 are at Level 2–3. The largest value jump happens in the transition from Level 2 to Level 3 — moving from individual tool use to shared infrastructure and standards. The largest risk is introduced in the transition from Level 3 to Level 4, where agents start acting without continuous human oversight.
The Economic Case
The financial argument for investing in agent infrastructure is straightforward:
- A software engineer costs $150K–$250K per year fully loaded
- An agent handling 30% of routine work (tests, documentation, simple bug fixes) saves roughly $45K–$75K per engineer per year
- The infrastructure to support that agent work costs $30K–$100K per year for a small team
That’s a rough 5–10x ROI before accounting for quality improvements and the compound effect of engineers spending more time on higher-value work.
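As a worked example, taking the midpoints of those ranges for a hypothetical 10-engineer team:

```typescript
// Back-of-envelope ROI using the midpoints of the ranges above.
// All figures are illustrative assumptions, not benchmarks.
const teamSize = 10;
const savingsPerEngineer = 60_000; // midpoint of $45K–$75K per engineer per year
const infraCost = 65_000;          // midpoint of $30K–$100K for a small team

const annualSavings = teamSize * savingsPerEngineer; // $600,000
const roi = annualSavings / infraCost;               // roughly 9.2x
```

The ratio is sensitive to team size: at 3 engineers the same assumptions give under 3x, which is why small teams should start with cheaper shared infrastructure.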
On the cost side, model selection matters significantly. There’s roughly a 35x cost difference between frontier models and efficient small models. A team that routes 70% of tasks to cheaper models (classification, summarisation, simple code edits) and reserves expensive models for complex reasoning tasks can reduce inference costs by 60–70% without meaningful quality loss.
// Route by task complexity — don't use your most expensive model for everything
function selectModel(task: TaskType): string {
  const routing: Record<TaskType, string> = {
    'classification': 'claude-haiku-4-5', // fast, cheap, sufficient
    'summarisation': 'claude-haiku-4-5',
    'code-generation': 'claude-sonnet-4-6', // balanced capability/cost
    'architecture-review': 'claude-opus-4-6', // complex reasoning, worth the cost
  };
  return routing[task];
}
Adoption Roadmap
The following is a practical week-by-week path for a team moving from Level 2 to Level 4. Each phase builds on the previous — don’t skip ahead.
Weeks 1–2: Foundation
- Install agents in IDEs (Cursor, Claude Code, GitHub Copilot)
- Start with low-risk tasks: documentation, tests, simple bug fixes
- Track personal productivity changes — gather data before making team-wide decisions
Months 1–2: Team Integration
- Add AGENTS.md to all active repositories (see Context Engineering for a full anatomy)
- Set up MCP servers for internal APIs and tools your team uses regularly
- Establish basic review guidelines: what to check, what to trust, what requires extra scrutiny
- Implement cost tracking and set budget alerts
- Add automated pre-review to CI/CD (linting, type checking, test runs triggered by agent PRs)
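The cost-tracking bullet can start very small. A sketch of a daily budget tracker with a warning threshold (the class, the $100 limit, and the 80% threshold are all illustrative):

```typescript
// Accumulate spend for the day and escalate as thresholds are crossed.
// Wire the return value into whatever alerting you already use.
class DailyBudget {
  private spentUsd = 0;

  constructor(
    private readonly limitUsd: number,
    private readonly warnAt: number = 0.8, // warn at 80% of budget
  ) {}

  record(costUsd: number): 'ok' | 'warn' | 'exceeded' {
    this.spentUsd += costUsd;
    if (this.spentUsd >= this.limitUsd) return 'exceeded';
    if (this.spentUsd >= this.limitUsd * this.warnAt) return 'warn';
    return 'ok';
  }
}

const budget = new DailyBudget(100); // $100/day team budget
budget.record(70); // 'ok'
budget.record(15); // 'warn', now at 85% of budget
```

Even this much is enough to catch a runaway agent loop on the day it happens rather than on the invoice.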
Months 2–4: Orchestration
- Deploy multi-agent workflows for complex tasks (research + implement + review pipeline)
- Implement OpenFGA for agent authorisation — define what each agent can and cannot do
- Configure OpenTelemetry for full trace visibility across agent decisions and actions
- Formalise the two-layer review policy: automated checks handle mechanical issues, human review handles judgement calls
- Set cost budgets and alerts at the task and team level
- Run Conductor Model training for the team (see Playbooks)
6+ Months: Autonomy
- Deploy background agents (asynchronous, without session-level supervision)
- Implement defence in depth fully (see Agents for the backpressure hierarchy)
- Set up anomaly detection for agent behaviour drift
- Transition to outcome-based review: review results and metrics, not individual actions
- Continuously improve context and tooling based on what the agents get wrong
Common Adoption Failures
Teams adopting agents at scale run into a predictable set of problems. Most of them are avoidable:
| Failure | Root cause | Fix |
|---|---|---|
| Rubber-stamping agent PRs | Reviewers don’t know what to look for | Establish a review checklist before scaling volume — see Quality Guardrails |
| Shadow AI without guardrails | Team uses agents informally without infrastructure | Invest in sanctioned tooling early; the cleanup cost later exceeds the investment up front |
| Skipping maturity levels | Wanting to go directly to Level 4 | Each level is a prerequisite for the next — shortcuts create gaps |
| Measuring only velocity | Ignoring burnout and quality signals | Track overtime, error rates, and team stress alongside throughput |
| Too many tools | Exposing everything as an MCP tool | A model with 100 tools makes worse decisions than one with 20 well-designed ones. Curate the toolset. |
| Inconsistent AGENTS.md | Files go stale or aren’t reviewed | Assign ownership; add AGENTS.md review to your sprint retrospectives |