
Operations (LLMOps)

Running AI systems in production — monitoring, observability, cost management, rate limiting, caching, fallback strategies, and deployment patterns for LLM-powered applications.


Building an LLM feature that works in a demo is table stakes. Running it reliably in production, at scale, with acceptable cost and latency — that’s LLMOps. Production AI systems fail in unique ways that traditional software monitoring doesn’t catch: they degrade gradually rather than crashing, they fail probabilistically rather than deterministically, and their failures are often invisible to conventional health checks.

The LLMOps Stack

┌─────────────────────────────────────────────────────────┐
│                    Your Application                     │
├────────────────┬───────────────────┬────────────────────┤
│  Cost Control  │    Reliability    │   Quality          │
│  - Caching     │  - Rate limiting  │  - Monitoring      │
│  - Model tiers │  - Retries        │  - Evals in prod   │
│  - Token opt.  │  - Fallbacks      │  - Drift detection │
├────────────────┴───────────────────┴────────────────────┤
│                 Observability Layer                     │
│  Traces · Spans · Logs · Metrics · Prompt logging       │
├─────────────────────────────────────────────────────────┤
│                 LLM Provider APIs                       │
│   OpenAI · Anthropic · Gemini · Self-hosted             │
└─────────────────────────────────────────────────────────┘

Observability and Monitoring

Traditional APM (Application Performance Monitoring) tools don’t capture what you need for LLM systems. You need to trace:

  • What was in the prompt? (inputs)
  • What did the model return? (outputs)
  • How long did it take? (latency)
  • How many tokens were used? (cost)
  • Did the output meet quality standards? (quality)

LLM-Specific Tracing

// Option 1: LangSmith (LangChain's observability platform)
import { Client } from 'langsmith';

const client = new Client({ apiKey: process.env.LANGSMITH_API_KEY });

// Option 2: Roll your own using OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api';

// `llm.complete` and `estimateTokens` are application-level helpers
async function tracedLLMCall(prompt: string, model: string): Promise<string> {
  const tracer = trace.getTracer('llm-service');
  
  return tracer.startActiveSpan('llm.completion', async (span) => {
    span.setAttributes({
      'llm.model': model,
      'llm.prompt_tokens': estimateTokens(prompt),
      'llm.prompt_preview': prompt.substring(0, 200),
    });
    
    try {
      const start = Date.now();
      const response = await llm.complete(prompt);
      
      span.setAttributes({
        'llm.completion_tokens': estimateTokens(response),
        'llm.latency_ms': Date.now() - start,
        'llm.response_preview': response.substring(0, 200),
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return response;
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(error) });
      throw error;
    } finally {
      span.end();
    }
  });
}

Key Metrics to Track

| Metric | Why it matters | Alert threshold |
|---|---|---|
| Latency (p50/p95/p99) | User experience | p95 > 5s for most use cases |
| Token usage per request | Cost efficiency | Set budget alerts |
| Error rate | Reliability | > 1% errors warrant investigation |
| Quality score | Output quality | Below baseline from eval suite |
| Cache hit rate | Cost savings | Low rates suggest caching opportunities |
| Rate limit hits | Capacity planning | Frequent hits → need higher tier or backpressure |
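
As a sketch, the latency thresholds above can be checked with a nearest-rank percentile over a window of samples (in production, prefer your metrics backend's histogram support; `percentile` here is an illustrative helper):

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const latencies = [120, 340, 95, 4800, 210, 180, 6200, 150, 300, 250];
const p95 = percentile(latencies, 95);
if (p95 > 5000) {
  console.warn(`p95 latency ${p95}ms exceeds the 5s alert threshold`);
}
```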

Production Quality Monitoring

Unlike traditional software, quality can degrade in production without errors. Implement continuous quality sampling:

import { llmJudge } from './judges/llm-judge';

const QUALITY_THRESHOLD = 0.7; // tune against your eval-suite baseline

async function monitoredCompletion(input: string): Promise<string> {
  const response = await llm.complete(input);
  
  // Sample 5% of production traffic for quality evaluation
  if (Math.random() < 0.05) {
    // Async — doesn't block the response
    setImmediate(async () => {
      const qualityScore = await llmJudge.evaluate(input, response);
      metrics.record('llm.quality_score', qualityScore, { model: 'gpt-4o' });
      
      if (qualityScore < QUALITY_THRESHOLD) {
        logger.warn('Low quality response detected', { input, response, qualityScore });
        alerts.trigger('quality_degradation', { score: qualityScore });
      }
    });
  }
  
  return response;
}

Cost Management

LLM inference costs scale with usage in ways that traditional infrastructure doesn’t. A feature that costs $0.001 per call becomes $1000/day at 1M calls.

The Cost Formula

Cost = (input_tokens × input_price) + (output_tokens × output_price)
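
In code (the pricing numbers below are placeholders; providers quote prices per million tokens and change them often, so check current rates):

```typescript
// Per-request cost from token counts. Prices are USD per 1M tokens.
interface ModelPricing {
  inputPerMTok: number;
  outputPerMTok: number;
}

function requestCost(
  inputTokens: number,
  outputTokens: number,
  pricing: ModelPricing
): number {
  return (
    (inputTokens / 1_000_000) * pricing.inputPerMTok +
    (outputTokens / 1_000_000) * pricing.outputPerMTok
  );
}

// Example: 2,000 input + 500 output tokens at $2.50 / $10 per MTok
const cost = requestCost(2000, 500, { inputPerMTok: 2.5, outputPerMTok: 10 });
console.log(cost.toFixed(4)); // "0.0100"
```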

Token optimisation strategies:

Compress system prompts

Long system prompts are paid on every request. Keep them focused and concise. Consider prompt compression techniques (remove verbose explanations, use denser language).

Use the right model tier

Not all calls need the most expensive model. Route by task complexity:

function selectModel(taskType: string): string {
  const modelMap: Record<string, string> = {
    'classification':      'gpt-4o-mini',   // simple → cheap model
    'summarisation':       'gpt-4o-mini',
    'code-generation':     'gpt-4o',        // complex → capable model
    'architecture-review': 'claude-opus-4', // critical → best model
  };
  return modelMap[taskType] || 'gpt-4o-mini';
}

Implement semantic caching

Cache responses for semantically similar queries — not just exact string matches:

// Semantic cache sketch — `embed` and `cache` are application helpers
// backed by your embedding model and vector store.

async function cachedComplete(query: string): Promise<string> {
  // Embed the query and find semantically similar cached responses
  const queryEmbedding = await embed(query);
  const cachedResult = await cache.findSimilar(queryEmbedding, 0.95); // similarity threshold
  
  if (cachedResult) {
    metrics.increment('cache.hits');
    return cachedResult.response;
  }
  
  const response = await llm.complete(query);
  await cache.store(queryEmbedding, query, response);
  metrics.increment('cache.misses');
  return response;
}

Cache hit rates of 30-50% are achievable for many applications, translating directly to cost reduction.
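
The `cache.findSimilar` step above can be sketched with plain cosine similarity over in-memory entries (a real implementation would use a vector store's approximate nearest-neighbour index; the names below are illustrative):

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry { embedding: number[]; response: string; }

// Return the most similar cached entry at or above the threshold, if any.
function findSimilar(
  entries: CacheEntry[],
  query: number[],
  threshold: number
): CacheEntry | undefined {
  let best: CacheEntry | undefined;
  let bestScore = threshold;
  for (const entry of entries) {
    const score = cosineSimilarity(entry.embedding, query);
    if (score >= bestScore) {
      bestScore = score;
      best = entry;
    }
  }
  return best;
}
```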

Set token budgets

const response = await openai.chat.completions.create({
  model: 'gpt-4o',
  max_tokens: 1024,  // Hard cap — won't generate more
  messages: [...],
});

// Monitor when max_tokens is being hit
if (response.choices[0].finish_reason === 'length') {
  logger.warn('Response hit max_tokens limit — may be truncated');
}

Reliability Patterns

Retry with Exponential Backoff

LLM APIs have transient errors — rate limits, timeouts, 500s. Always retry:

import { retry } from 'ts-retry-promise';

const response = await retry(
  () => openai.chat.completions.create({ model: 'gpt-4o', messages }),
  {
    retries: 3,
    delay: 1000,       // 1s initial delay
    backoff: 'EXPONENTIAL',
    maxBackOff: 10000, // max 10s between retries
    retryIf: (error) => {
      // Retry on rate limits and server errors, not on auth failures
      return error.status === 429 || error.status >= 500;
    },
  }
);

Fallback Chains

When your primary model is unavailable, fail over to alternatives:

// `openai.complete` / `anthropic.complete` are thin application wrappers
// around each provider's SDK, normalised to a common signature.
async function resilientComplete(prompt: string): Promise<string> {
  const providers = [
    () => openai.complete(prompt, { model: 'gpt-4o' }),
    () => anthropic.complete(prompt, { model: 'claude-3-5-sonnet' }),
    () => openai.complete(prompt, { model: 'gpt-4o-mini' }), // last resort: cheaper model
  ];
  
  for (const provider of providers) {
    try {
      return await provider();
    } catch (error) {
      logger.warn('Provider failed, trying next', { error });
      continue;
    }
  }
  
  throw new AllProvidersFailedError();
}

Rate Limiting and Backpressure

Protect your API quota and your backend:

import Bottleneck from 'bottleneck';

const limiter = new Bottleneck({
  maxConcurrent: 10,    // max 10 parallel API calls
  minTime: 100,         // minimum 100ms between calls
  reservoir: 100,       // token bucket size (for burst)
  reservoirRefreshAmount: 100,
  reservoirRefreshInterval: 60 * 1000, // refill every minute
});

const response = await limiter.schedule(() =>
  openai.chat.completions.create({ model: 'gpt-4o', messages })
);

Deployment Patterns

Blue-Green Deployment for Prompts

When updating prompts in production, treat them like code deployments:

// Feature flags for prompt versions
const promptVersion = featureFlags.get('llm_prompt_version', 'v1');

const prompts = {
  v1: 'You are a helpful assistant...',
  v2: 'You are Aircury\'s specialised engineering assistant...',
};

// Gradually route traffic to new version
const response = await llm.complete(prompts[promptVersion]);
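
One common way to do the gradual routing is deterministic bucketing on a stable ID, so a given user always sees the same prompt version during the rollout (the helper below is an illustrative sketch, not a specific feature-flag library's API):

```typescript
import { createHash } from 'node:crypto';

// Hash the user ID into a stable bucket (0–99); users below the
// rollout percentage get the new prompt version.
function promptVersionFor(userId: string, rolloutPercent: number): 'v1' | 'v2' {
  const hash = createHash('sha256').update(userId).digest();
  const bucket = hash.readUInt16BE(0) % 100;
  return bucket < rolloutPercent ? 'v2' : 'v1';
}

// Start at 5%, watch quality metrics, then ramp up
const version = promptVersionFor('user-42', 5);
```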

Observability Tools

| Tool | Best for |
|---|---|
| LangSmith | LangChain integration, prompt management, eval tracking |
| Helicone | OpenAI proxy with logging, analytics, caching |
| Braintrust | Eval-first observability, dataset management |
| Arize Phoenix | Open source, self-hosted, LLM + ML unified |
| Grafana + OpenTelemetry | Custom metrics if you already use Grafana |

Start with Helicone or LangSmith

If you’re just adding LLM observability to an existing project, Helicone is the fastest to integrate (one-line proxy change). If you’re building a new LangChain-based system, LangSmith is the native observability solution. Both offer generous free tiers.
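
The proxy integration follows this pattern (taken from Helicone's documented OpenAI setup; check their docs for the current base URL, and note the `apiKey` fallback is only so the sketch runs without credentials):

```typescript
import OpenAI from 'openai';

// Route OpenAI traffic through Helicone's proxy — requests get logged
// automatically, with no other code changes.
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY ?? 'sk-placeholder', // placeholder for local sketching only
  baseURL: 'https://oai.helicone.ai/v1',
  defaultHeaders: {
    'Helicone-Auth': `Bearer ${process.env.HELICONE_API_KEY}`,
  },
});
```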

Agent Maturity Model

Not all teams are in the same place with agent adoption. The maturity model below describes five levels of organisational capability. Understanding which level your team is at helps you identify what to build next — and what gaps are most likely to cause problems.

| Level | Name | Description |
|---|---|---|
| 1 | Experimentation | Individual engineers testing agents on their own. No coordination, no security controls, no cost tracking. |
| 2 | Individual adoption | Regular use for personal productivity. Some informal knowledge sharing. No team standards. |
| 3 | Team integration | AGENTS.md in all repos, MCP servers for internal tools, basic review guidelines, cost tracking in place. |
| 4 | Orchestration | Multi-agent workflows, fine-grained authorisation (OpenFGA), full OpenTelemetry tracing, Conductor Model as default working style. |
| 5 | Autonomy | Background agents without human supervision, anomaly detection, auto-remediation, outcome-based review. |

Measure your floor, not your ceiling

A common mistake is measuring maturity by the most advanced capability in use rather than the weakest. A team running Level 4 orchestration with Level 1 security practices is not at Level 4 — it’s at Level 1 with a significant risk exposure. The level that matters is the level of your least mature dimension.
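
As a trivial formalisation (the dimension names here are illustrative, not a standard assessment):

```typescript
// Effective maturity is the minimum across dimensions, not the maximum.
interface MaturityAssessment {
  tooling: number;
  security: number;
  observability: number;
  costTracking: number;
  review: number;
}

function effectiveLevel(a: MaturityAssessment): number {
  return Math.min(...Object.values(a));
}

// Level 4 orchestration with Level 1 security is effectively Level 1
const team = { tooling: 4, security: 1, observability: 3, costTracking: 3, review: 2 };
console.log(effectiveLevel(team)); // 1
```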

Most teams in 2026 are at Level 2–3. The largest value jump happens in the transition from Level 2 to Level 3 — moving from individual tool use to shared infrastructure and standards. The largest risk is introduced in the transition from Level 3 to Level 4, where agents start acting without continuous human oversight.

The Economic Case

The financial argument for investing in agent infrastructure is straightforward:

  • A software engineer costs $150K–$250K per year fully loaded
  • An agent handling 30% of routine work (tests, documentation, simple bug fixes) saves roughly $45K–$75K per engineer per year
  • The infrastructure to support that agent work costs $30K–$100K per year for a small team

That’s a rough 5–10x ROI before accounting for quality improvements and the compound effect of engineers spending more time on higher-value work.
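
Worked through with the figures above (a sketch for plugging in your own numbers, nothing more):

```typescript
// All inputs are the text's rough estimates, in USD per year.
function annualRoi(
  engineers: number,
  savingsPerEngineer: number, // ~45_000–75_000 at 30% routine-work coverage
  infraCost: number           // ~30_000–100_000 for a small team
): number {
  return (engineers * savingsPerEngineer) / infraCost;
}

// Five engineers at the middle of both ranges:
annualRoi(5, 60_000, 60_000); // → 5
```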

On the cost side, model selection matters significantly. There’s roughly a 35x cost difference between frontier models and efficient small models. A team that routes 70% of tasks to cheaper models (classification, summarisation, simple code edits) and reserves expensive models for complex reasoning tasks can reduce inference costs by 60–70% without meaningful quality loss.

// Route by task complexity — don't use your most expensive model for everything
function selectModel(task: TaskType): string {
  const routing: Record<TaskType, string> = {
    'classification':      'claude-haiku-4-5',   // fast, cheap, sufficient
    'summarisation':       'claude-haiku-4-5',
    'code-generation':     'claude-sonnet-4-6',  // balanced capability/cost
    'architecture-review': 'claude-opus-4-6',    // complex reasoning, worth the cost
  };
  return routing[task];
}

Adoption Roadmap

The following is a practical week-by-week path for a team moving from Level 2 to Level 4. Each phase builds on the previous — don’t skip ahead.

Weeks 1–2: Foundation

  • Install agents in IDEs (Cursor, Claude Code, GitHub Copilot)
  • Start with low-risk tasks: documentation, tests, simple bug fixes
  • Track personal productivity changes — gather data before making team-wide decisions

Months 1–2: Team Integration

  • Add AGENTS.md to all active repositories (see Context Engineering for a full anatomy)
  • Set up MCP servers for internal APIs and tools your team uses regularly
  • Establish basic review guidelines: what to check, what to trust, what requires extra scrutiny
  • Implement cost tracking and set budget alerts
  • Add automated pre-review to CI/CD (linting, type checking, test runs triggered by agent PRs)

Months 2–4: Orchestration

  • Deploy multi-agent workflows for complex tasks (research + implement + review pipeline)
  • Implement OpenFGA for agent authorisation — define what each agent can and cannot do
  • Configure OpenTelemetry for full trace visibility across agent decisions and actions
  • Formalise the two-layer review policy: automated checks handle mechanical issues, human review handles judgement calls
  • Set cost budgets and alerts at the task and team level
  • Run Conductor Model training for the team (see Playbooks)

6+ Months: Autonomy

  • Deploy background agents (asynchronous, without session-level supervision)
  • Implement defence in depth fully (see Agents for the backpressure hierarchy)
  • Set up anomaly detection for agent behaviour drift
  • Transition to outcome-based review: review results and metrics, not individual actions
  • Continuously improve context and tooling based on what the agents get wrong

Common Adoption Failures

Teams adopting agents at scale run into a predictable set of problems. Most of them are avoidable:

| Failure | Root cause | Fix |
|---|---|---|
| Rubber-stamping agent PRs | Reviewers don't know what to look for | Establish a review checklist before scaling volume — see Quality Guardrails |
| Shadow AI without guardrails | Team uses agents informally without infrastructure | Invest in guardrails early — the cleanup cost later exceeds the investment up front |
| Skipping maturity levels | Wanting to go directly to Level 4 | Each level is a prerequisite for the next — shortcuts create gaps |
| Measuring only velocity | Ignoring burnout and quality signals | Track overtime, error rates, and team stress alongside throughput |
| Too many tools | Exposing everything as an MCP tool | A model with 100 tools makes worse decisions than one with 20 well-designed ones. Curate the toolset. |
| Inconsistent AGENTS.md | Files go stale or aren't reviewed | Assign ownership; add AGENTS.md review to your sprint retrospectives |