Operations (LLMOps)
Running AI systems in production — monitoring, observability, cost management, rate limiting, caching, fallback strategies, and deployment patterns for LLM-powered applications.
Building an LLM feature that works in a demo is table stakes. Running it reliably in production, at scale, with acceptable cost and latency — that’s LLMOps. Production AI systems fail in unique ways that traditional software monitoring doesn’t catch: they degrade gradually rather than crashing, they fail probabilistically rather than deterministically, and their failures are often invisible to conventional health checks.
The LLMOps Stack
┌────────────────────────────────────────────────────────┐
│ Your Application │
├────────────────┬───────────────────┬────────────────────┤
│ Cost Control │ Reliability │ Quality │
│ - Caching │ - Rate limiting │ - Monitoring │
│ - Model tiers │ - Retries │ - Evals in prod │
│ - Token opt. │ - Fallbacks │ - Drift detection │
├────────────────┴───────────────────┴────────────────────┤
│ Observability Layer │
│ Traces · Spans · Logs · Metrics · Prompt logging │
├─────────────────────────────────────────────────────────┤
│ LLM Provider APIs │
│ OpenAI · Anthropic · Gemini · Self-hosted │
└─────────────────────────────────────────────────────────┘
Observability and Monitoring
Traditional APM (Application Performance Monitoring) tools don’t capture what you need for LLM systems. You need to trace:
- What was in the prompt? (inputs)
- What did the model return? (outputs)
- How long did it take? (latency)
- How many tokens were used? (cost)
- Did the output meet quality standards? (quality)
LLM-Specific Tracing
// Option 1: LangSmith (LangChain's observability platform)
import { Client } from 'langsmith';

const client = new Client({ apiKey: process.env.LANGSMITH_API_KEY });

// Option 2: Roll your own using OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api';
// `llm` and `estimateTokens` are your client and token-estimation helpers
async function tracedLLMCall(prompt: string, model: string): Promise<string> {
  const tracer = trace.getTracer('llm-service');
  return tracer.startActiveSpan('llm.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_tokens': estimateTokens(prompt),
      'llm.prompt_preview': prompt.substring(0, 200),
    });
    try {
      const start = Date.now();
      const response = await llm.complete(prompt);
      span.setAttributes({
        'llm.completion_tokens': estimateTokens(response),
        'llm.latency_ms': Date.now() - start,
        'llm.response_preview': response.substring(0, 200),
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      throw error;
    } finally {
      span.end();
    }
  });
}
Key Metrics to Track
| Metric | Why it matters | Alert threshold |
|---|---|---|
| Latency (p50/p95/p99) | User experience | p95 > 5s for most use cases |
| Token usage per request | Cost efficiency | Set budget alerts |
| Error rate | Reliability | > 1% errors warrant investigation |
| Quality score | Output quality | Below baseline from eval suite |
| Cache hit rate | Cost savings | Low rates suggest caching opportunities |
| Rate limit hits | Capacity planning | Frequent hits → need higher tier or backpressure |
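Most of these metrics can be computed from the span attributes already being traced. As a minimal sketch, assuming no metrics backend is wired up yet, an in-memory histogram (the class below is illustrative, not a real library) is enough to compute latency percentiles and check the p95 threshold from the table:

```typescript
// Minimal sketch: an in-memory histogram for per-request latency percentiles.
// In production you would use a real metrics client (Prometheus, StatsD, etc.).
class LatencyHistogram {
  private samples: number[] = [];

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
  }

  // Nearest-rank percentile over all recorded samples
  percentile(p: number): number {
    if (this.samples.length === 0) return 0;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const idx = Math.min(sorted.length - 1, Math.ceil((p / 100) * sorted.length) - 1);
    return sorted[Math.max(0, idx)];
  }
}

const latency = new LatencyHistogram();
[420, 380, 510, 4900, 450, 390, 470, 5300, 440, 410].forEach((ms) => latency.record(ms));

const p95 = latency.percentile(95);
if (p95 > 5000) {
  console.log(`ALERT: p95 latency ${p95}ms exceeds 5s threshold`);
}
```

A couple of slow outliers are enough to trip the p95 alert even when the median looks healthy, which is exactly why the table tracks p95/p99 rather than averages.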
Production Quality Monitoring
Unlike traditional software, quality can degrade in production without errors. Implement continuous quality sampling:
import { llmJudge } from './judges/llm-judge';
async function monitoredCompletion(input: string): Promise<string> {
  const response = await llm.complete(input);
  // Sample 5% of production traffic for quality evaluation
  if (Math.random() < 0.05) {
    // Async — doesn't block the response
    setImmediate(async () => {
      const qualityScore = await llmJudge.evaluate(input, response);
      metrics.record('llm.quality_score', qualityScore, { model: 'gpt-4o' });
      if (qualityScore < QUALITY_THRESHOLD) {
        logger.warn('Low quality response detected', { input, response, qualityScore });
        alerts.trigger('quality_degradation', { score: qualityScore });
      }
    });
  }
  return response;
}
Cost Management
LLM inference costs scale with usage in ways that traditional infrastructure doesn’t. A feature that costs $0.001 per call becomes $1000/day at 1M calls.
The Cost Formula
Cost = (input_tokens × input_price) + (output_tokens × output_price)
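The formula in code, with placeholder prices (per-million-token rates vary by model and change often, so check your provider's pricing page):

```typescript
// Per-request cost from token counts. The rates below are illustrative
// placeholders, not real pricing.
interface ModelPricing {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

function requestCost(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing,
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMTok +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}

// Example: a 2,000-token prompt with a 500-token response at
// hypothetical rates of $2.50 in / $10.00 out per million tokens:
const cost = requestCost(2_000, 500, { inputPerMTok: 2.5, outputPerMTok: 10 });
// 0.005 + 0.005 = roughly $0.01 per call
```

Multiplying that per-call figure by expected daily volume, before shipping, is the cheapest cost-control measure available.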
Token optimisation strategies:
Compress system prompts
Long system prompts are paid on every request. Keep them focused and concise. Consider prompt compression techniques (remove verbose explanations, use denser language).
Use the right model tier
Not all calls need the most expensive model. Route by task complexity:
function selectModel(taskType: string): string {
  const modelMap: Record<string, string> = {
    'classification': 'gpt-4o-mini', // simple → cheap model
    'summarisation': 'gpt-4o-mini',
    'code-generation': 'gpt-4o', // complex → capable model
    'architecture-review': 'claude-opus-4', // critical → best model
  };
  return modelMap[taskType] || 'gpt-4o-mini';
}
Implement semantic caching
Cache responses for semantically similar queries — not just exact string matches:
// GPTCache is one (Python) implementation of this pattern; in TypeScript,
// back `embed` and `cache` with your embedding client and vector store
async function cachedComplete(query: string): Promise<string> {
  // Embed the query and find semantically similar cached responses
  const queryEmbedding = await embed(query);
  const cachedResult = await cache.findSimilar(queryEmbedding, { threshold: 0.95 });
  if (cachedResult) {
    metrics.increment('cache.hits');
    return cachedResult.response;
  }
  const response = await llm.complete(query);
  await cache.store(queryEmbedding, query, response);
  metrics.increment('cache.misses');
  return response;
}
Cache hit rates of 30-50% are achievable for many applications, translating directly to cost reduction.
Set token budgets
const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  max_tokens: 1024, // Hard cap — won't generate more
  messages: [...],
});

// Monitor when max_tokens is being hit
if (response.choices[0].finish_reason === 'length') {
  logger.warn('Response hit max_tokens limit — may be truncated');
}
Reliability Patterns
Retry with Exponential Backoff
LLM APIs have transient errors — rate limits, timeouts, 500s. Always retry:
import { retry } from 'ts-retry-promise';
const response = await retry(
  () => openai.chat.completions.create({ model: 'gpt-4o', messages }),
  {
    retries: 3,
    delay: 1000, // 1s initial delay
    backoff: 'EXPONENTIAL',
    maxBackOff: 10000, // max 10s between retries
    retryIf: (error) => {
      // Retry on rate limits and server errors, not on auth failures
      return error.status === 429 || error.status >= 500;
    },
  }
);
Fallback Chains
When your primary model is unavailable, fail over to alternatives:
async function resilientComplete(prompt: string): Promise<string> {
  const providers = [
    () => openai.complete(prompt, { model: 'gpt-4o' }),
    () => anthropic.complete(prompt, { model: 'claude-3-5-sonnet' }),
    () => openai.complete(prompt, { model: 'gpt-4o-mini' }), // last resort: cheaper model
  ];
  for (const provider of providers) {
    try {
      return await provider();
    } catch (error) {
      logger.warn('Provider failed, trying next', { error });
      continue;
    }
  }
  throw new AllProvidersFailedError();
}
Rate Limiting and Backpressure
Protect your API quota and your backend:
import Bottleneck from 'bottleneck';
const limiter = new Bottleneck({
  maxConcurrent: 10, // max 10 parallel API calls
  minTime: 100, // minimum 100ms between calls
  reservoir: 100, // token bucket size (for burst)
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000, // refill every minute
});

const response = await limiter.schedule(() =>
  openai.chat.completions.create({ model: 'gpt-4o', messages })
);
Deployment Patterns
Blue-Green Deployment for Prompts
When updating prompts in production, treat them like code deployments:
// Feature flags for prompt versions
const promptVersion = featureFlags.get('llm_prompt_version', 'v1');
const prompts = {
  v1: 'You are a helpful assistant...',
  v2: 'You are Aircury\'s specialised engineering assistant...',
};
// Gradually route traffic to new version
const response = await llm.complete(prompts[promptVersion]);
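One simple way to implement the gradual routing (a sketch, assuming a stable user ID is available; the hash and rollout knob are illustrative) is deterministic bucketing: hash each user into a 0–99 bucket and serve the new prompt version to buckets below the rollout percentage:

```typescript
// Deterministic percentage rollout: the same user always lands in the same
// bucket, so they see a consistent prompt version as you ramp up.
function bucketFor(userId: string): number {
  let hash = 0;
  for (const ch of userId) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple 32-bit rolling hash
  }
  return hash % 100; // bucket in [0, 99]
}

function promptVersionFor(userId: string, rolloutPercent: number): 'v1' | 'v2' {
  // Users in buckets below the rollout percentage get the new prompt
  return bucketFor(userId) < rolloutPercent ? 'v2' : 'v1';
}

// Start at 5%, watch the quality metrics, then ramp to 50% and 100%
const version = promptVersionFor('user-1234', 5);
```

Because the assignment is deterministic, you can compare quality scores between the v1 and v2 cohorts before ramping further, exactly as you would for a code rollout.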
Observability Tools
| Tool | Best for |
|---|---|
| LangSmith | LangChain integration, prompt management, eval tracking |
| Helicone | OpenAI proxy with logging, analytics, caching |
| Braintrust | Eval-first observability, dataset management |
| Arize Phoenix | Open source, self-hosted, LLM + ML unified |
| Grafana + OpenTelemetry | Custom metrics if you already use Grafana |
Start with Helicone or LangSmith
If you’re just adding LLM observability to an existing project, Helicone is the fastest to integrate (one-line proxy change). If you’re building a new LangChain-based system, LangSmith is the native observability solution. Both offer generous free tiers.
Agent Maturity Model
Not all teams are in the same place with agent adoption. The maturity model below describes five levels of organisational capability. Understanding which level your team is at helps you identify what to build next — and what gaps are most likely to cause problems.
| Level | Name | Description |
|---|---|---|
| 1 | Experimentation | Individual engineers testing agents on their own. No coordination, no security controls, no cost tracking. |
| 2 | Individual adoption | Regular use for personal productivity. Some informal knowledge sharing. No team standards. |
| 3 | Team integration | AGENTS.md in all repos, MCP servers for internal tools, basic review guidelines, cost tracking in place. |
| 4 | Orchestration | Multi-agent workflows, fine-grained authorisation (OpenFGA), full OpenTelemetry tracing, Conductor Model as default working style. |
| 5 | Autonomy | Background agents without human supervision, anomaly detection, auto-remediation, outcome-based review. |
Measure your floor, not your ceiling
A common mistake is measuring maturity by the most advanced capability in use rather than the weakest. A team running Level 4 orchestration with Level 1 security practices is not at Level 4 — it’s at Level 1 with a significant risk exposure. The level that matters is the level of your least mature dimension.
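In other words, maturity aggregates with min, not max. A trivial sketch (the dimension names are illustrative, not a fixed taxonomy):

```typescript
// Overall maturity is the weakest dimension, not the strongest
type MaturityDimensions = {
  tooling: number;       // agent tooling and workflows
  security: number;      // authorisation, guardrails
  observability: number; // tracing, monitoring
  costControl: number;   // budgets, alerts
};

function effectiveMaturity(d: MaturityDimensions): number {
  return Math.min(d.tooling, d.security, d.observability, d.costControl);
}

// Level 4 orchestration with Level 1 security is Level 1 overall:
const level = effectiveMaturity({ tooling: 4, security: 1, observability: 3, costControl: 3 });
// → 1
```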
Most teams in 2026 are at Level 2–3. The largest value jump happens in the transition from Level 2 to Level 3 — moving from individual tool use to shared infrastructure and standards. The largest risk is introduced in the transition from Level 3 to Level 4, where agents start acting without continuous human oversight.
The Economic Case
The financial argument for investing in agent infrastructure is straightforward:
- A software engineer costs $150K–$250K per year fully loaded
- An agent handling 30% of routine work (tests, documentation, simple bug fixes) saves roughly $45K–$75K per engineer per year
- The infrastructure to support that agent work costs $30K–$100K per year for a small team
That’s a rough 5–10x ROI before accounting for quality improvements and the compound effect of engineers spending more time on higher-value work.
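As a worked example, taking the midpoints of those ranges for a hypothetical 10-engineer team:

```typescript
// Back-of-envelope ROI using the midpoints of the ranges above.
// All figures are illustrative assumptions, not benchmarks.
const teamSize = 10;
const savingsPerEngineer = 60_000; // midpoint of $45K–$75K per engineer per year
const infraCost = 65_000;          // midpoint of $30K–$100K for a small team

const annualSavings = teamSize * savingsPerEngineer; // $600,000
const roi = annualSavings / infraCost;               // roughly 9.2x
```

The ratio is sensitive to team size: at 3 engineers the same assumptions give under 3x, which is why small teams should start with cheaper shared infrastructure.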
On the cost side, model selection matters significantly. There’s roughly a 35x cost difference between frontier models and efficient small models. A team that routes 70% of tasks to cheaper models (classification, summarisation, simple code edits) and reserves expensive models for complex reasoning tasks can reduce inference costs by 60–70% without meaningful quality loss.
// Route by task complexity — don't use your most expensive model for everything
function selectModel(task: TaskType): string {
  const routing: Record<TaskType, string> = {
    'classification': 'claude-haiku-4-5', // fast, cheap, sufficient
    'summarisation': 'claude-haiku-4-5',
    'code-generation': 'claude-sonnet-4-6', // balanced capability/cost
    'architecture-review': 'claude-opus-4-6', // complex reasoning, worth the cost
  };
  return routing[task];
}
Adoption Roadmap
The following is a practical week-by-week path for a team moving from Level 2 to Level 4. Each phase builds on the previous — don’t skip ahead.
Weeks 1–2: Foundation
- Install agents in IDEs (Cursor, Claude Code, GitHub Copilot)
- Start with low-risk tasks: documentation, tests, simple bug fixes
- Track personal productivity changes — gather data before making team-wide decisions
Months 1–2: Team Integration
- Add AGENTS.md to all active repositories (see Context Engineering for a full anatomy)
- Set up MCP servers for internal APIs and tools your team uses regularly
- Establish basic review guidelines: what to check, what to trust, what requires extra scrutiny
- Implement cost tracking and set budget alerts
- Add automated pre-review to CI/CD (linting, type checking, test runs triggered by agent PRs)
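The cost-tracking bullet can start very small. A sketch of a daily budget tracker with a warning threshold (the class, the $100 limit, and the 80% threshold are all illustrative):

```typescript
// Accumulate spend for the day and escalate as thresholds are crossed.
// Wire the return value into whatever alerting you already use.
class DailyBudget {
  private spentUsd = 0;

  constructor(
    private readonly limitUsd: number,
    private readonly warnAt: number = 0.8, // warn at 80% of budget
  ) {}

  record(costUsd: number): 'ok' | 'warn' | 'exceeded' {
    this.spentUsd += costUsd;
    if (this.spentUsd >= this.limitUsd) return 'exceeded';
    if (this.spentUsd >= this.limitUsd * this.warnAt) return 'warn';
    return 'ok';
  }
}

const budget = new DailyBudget(100); // $100/day team budget
budget.record(70); // 'ok'
budget.record(15); // 'warn', now at 85% of budget
```

Even this much is enough to catch a runaway agent loop on the day it happens rather than on the invoice.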
Months 2–4: Orchestration
- Deploy multi-agent workflows for complex tasks (research + implement + review pipeline)
- Implement OpenFGA for agent authorisation — define what each agent can and cannot do
- Configure OpenTelemetry for full trace visibility across agent decisions and actions
- Formalise the two-layer review policy: automated checks handle mechanical issues, human review handles judgement calls
- Set cost budgets and alerts at the task and team level
- Run Conductor Model training for the team (see Playbooks)
6+ Months: Autonomy
- Deploy background agents (asynchronous, without session-level supervision)
- Implement defence in depth fully (see Agents for the backpressure hierarchy)
- Set up anomaly detection for agent behaviour drift
- Transition to outcome-based review: review results and metrics, not individual actions
- Continuously improve context and tooling based on what the agents get wrong
Common Adoption Failures
Teams adopting agents at scale run into a predictable set of problems. Most of them are avoidable:
| Failure | Root cause | Fix |
|---|---|---|
| Rubber-stamping agent PRs | Reviewers don’t know what to look for | Establish a review checklist before scaling volume — see Quality Guardrails |
| Shadow AI without guardrails | Team uses agents informally without infrastructure | Invest in sanctioned tooling early; the cleanup cost later exceeds the investment up front |
| Skipping maturity levels | Wanting to go directly to Level 4 | Each level is a prerequisite for the next — shortcuts create gaps |
| Measuring only velocity | Ignoring burnout and quality signals | Track overtime, error rates, and team stress alongside throughput |
| Too many tools | Exposing everything as an MCP tool | A model with 100 tools makes worse decisions than one with 20 well-designed ones. Curate the toolset. |
| Inconsistent AGENTS.md | Files go stale or aren’t reviewed | Assign ownership; add AGENTS.md review to your sprint retrospectives |