Observability

Observability isn’t monitoring. Monitoring tells you when something is wrong. Observability tells you why. The difference is that observability has to be designed in — it can’t be bolted on after the fact.

When agents generate code, they generate observable code only if the spec and design explicitly require it. Left to their defaults, they tend to produce code that passes its tests and then fails silently in production.

The silent failure problem

Agent-generated code is optimised for test passage, not production operation. Tests don’t need structured logs. Tests don’t emit metrics. Tests don’t need distributed traces. If observability isn’t in the spec, it won’t be in the code.

The three pillars

Logs — discrete events with context. What happened, when, and what was the state at that moment. Logs are for developers debugging specific incidents.

Metrics — numerical measurements over time. Request rates, error rates, latency percentiles, queue depths. Metrics are for understanding system behaviour at scale.

Traces — the path of a single request across services and layers. Traces are for understanding why a specific operation took as long as it did, or where it failed.

All three are necessary. Metrics tell you that the error rate spiked. Traces tell you which requests failed and where. Logs tell you what the application state was when they failed.

Observability in design documents

Every design document for a non-trivial feature should explicitly state:

What events this feature logs (and at which level)
What metrics this feature emits
What trace spans this feature creates
What an on-call engineer needs to diagnose a failure in this feature

This isn’t over-engineering. It takes one short section of the design document. The alternative is debugging a production incident with no instruments. For an order-creation feature, that section might read:

## Observability

### Logs
- INFO: Order created (order_id, customer_id, item_count, total_amount)
- INFO: Payment charged successfully (order_id, payment_id, amount)
- WARN: Payment retry attempt (order_id, attempt_number, reason)
- ERROR: Payment failed after all retries (order_id, final_error, attempts)

### Metrics
- orders.created (counter)
- orders.payment_success (counter)
- orders.payment_failed (counter)
- orders.processing_duration_ms (histogram)

### Traces
- Span: CreateOrder (includes payment charge as child span)
- Span: ProcessPayment (includes Stripe API call as child span)

Structured logging

Logs are only useful if they’re queryable. Free-text logs are not queryable at scale. Structured logs — JSON or key-value pairs — are.

// ❌ Unstructured — can't filter, can't aggregate
console.log(`Order ${orderId} failed with error: ${error.message}`);

// ✓ Structured — filterable by any field
logger.error('Order payment failed', {
  order_id: orderId.value,
  customer_id: customerId.value,
  error_code: error.code,
  error_message: error.message,
  attempt: attemptNumber,
});
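To make "filterable by any field" concrete, here is a small sketch that writes one JSON object per log line and then queries the lines by field. `JsonLineLogger` and the sample IDs are assumptions made for illustration; in production the filtering would happen in the log platform rather than in application code.

```typescript
type LogFields = Record<string, unknown>;

// Hypothetical logger that keeps each entry as a single JSON line.
class JsonLineLogger {
  readonly lines: string[] = [];

  info(message: string, fields: LogFields = {}): void {
    this.write('INFO', message, fields);
  }

  error(message: string, fields: LogFields = {}): void {
    this.write('ERROR', message, fields);
  }

  private write(level: string, message: string, fields: LogFields): void {
    // Caller fields are spread first so they cannot overwrite level, message or timestamp.
    const entry = { ...fields, level, message, timestamp: new Date().toISOString() };
    this.lines.push(JSON.stringify(entry));
  }
}

const jsonLogger = new JsonLineLogger();
jsonLogger.info('Order created', { order_id: 'ord_1', customer_id: 'cus_9' });
jsonLogger.error('Order payment failed', {
  order_id: 'ord_2',
  customer_id: 'cus_9',
  error_code: 'card_declined',
  attempt: 3,
});
jsonLogger.error('Order payment failed', {
  order_id: 'ord_3',
  customer_id: 'cus_4',
  error_code: 'timeout',
  attempt: 1,
});

// Because every entry is structured, a query is a plain field comparison.
const declinedPayments = jsonLogger.lines
  .map((line) => JSON.parse(line) as LogFields)
  .filter((entry) => entry.level === 'ERROR' && entry.error_code === 'card_declined');
```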

Log levels

Level	When to use	Examples
`ERROR`	Something failed and requires attention	Payment charge failed, database connection lost
`WARN`	Something unexpected but handled	Payment retry, fallback triggered, rate limit hit
`INFO`	Normal, significant events	Order created, user authenticated, job completed
`DEBUG`	Diagnostic detail for development	Query parameters, response body, timing breakdown

DEBUG should be off in production by default. INFO and above should always be on.
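One way to apply that default is a level threshold resolved from the environment. The sketch below assumes a hypothetical convention in which production defaults to INFO, every other environment defaults to DEBUG, and an explicit override wins when it names a valid level. The function names and the override mechanism are illustrative, not taken from a specific logging library.

```typescript
// Levels in ascending order of severity.
const LEVELS = ['DEBUG', 'INFO', 'WARN', 'ERROR'] as const;
type Level = (typeof LEVELS)[number];

// True when an entry at `level` should be emitted for the configured minimum.
function shouldLog(level: Level, minLevel: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(minLevel);
}

// Hypothetical resolution rule: a valid override wins, otherwise
// production gets INFO and everything else gets DEBUG.
function resolveMinLevel(environment: string, override?: string): Level {
  const requested = LEVELS.find((level) => level === override);
  if (requested) return requested;
  return environment === 'production' ? 'INFO' : 'DEBUG';
}

const productionMin = resolveMinLevel('production');
const debugInProduction = shouldLog('DEBUG', productionMin); // false: DEBUG is off by default
const infoInProduction = shouldLog('INFO', productionMin); // true: INFO and above stay on
```

Keeping the override explicit lets an on-call engineer turn DEBUG on temporarily without a code change.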

Metrics instrumentation

Metrics should be emitted at use case boundaries — not inside domain entities. The application layer knows when a business operation succeeds or fails; the domain layer doesn’t need to know it’s being measured.

class CreateOrderUseCase implements CreateOrderUseCasePort {
  constructor(
    private readonly orderRepository: OrderRepository,
    private readonly paymentGateway: PaymentGateway,
    private readonly metrics: MetricsPort,
    private readonly logger: LoggerPort,
  ) {}

  async execute(command: CreateOrderCommand): Promise<OrderId> {
    const start = Date.now();

    try {
      const order = Order.create(command.customerId, command.items);
      await this.paymentGateway.charge(order.total(), command.paymentMethod);
      await this.orderRepository.save(order);

      this.metrics.increment('orders.created');
      this.metrics.histogram('orders.processing_duration_ms', Date.now() - start);
      this.logger.info('Order created', { order_id: order.id.value });

      return order.id;
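    } catch (error) {
      this.metrics.increment('orders.creation_failed');
      // Log selected fields only: the raw command carries payment data, which must never be logged
      this.logger.error('Order creation failed', {
        customer_id: command.customerId.value,
        item_count: command.items.length,
        error_message: error instanceof Error ? error.message : String(error),
      });
      throw error;
    }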
  }
}

Note that MetricsPort and LoggerPort are injected as interfaces — the application layer doesn’t know whether metrics go to Datadog, Prometheus, or a test stub.
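The port definitions are not shown in the use case, so the shapes below are assumptions inferred from how it calls them. The in-memory stubs (`InMemoryMetrics`, `InMemoryLogger`) are likewise hypothetical. They sketch how a test can assert on emitted telemetry without a metrics backend.

```typescript
// Assumed port shapes, inferred from the calls made by the use case.
interface MetricsPort {
  increment(name: string): void;
  histogram(name: string, value: number): void;
}

interface LoggerPort {
  info(message: string, fields?: Record<string, unknown>): void;
  error(message: string, fields?: Record<string, unknown>): void;
}

interface LogEntry {
  level: 'INFO' | 'ERROR';
  message: string;
  fields: Record<string, unknown>;
}

// Test stub: records counters and histogram observations in memory.
class InMemoryMetrics implements MetricsPort {
  readonly counters = new Map<string, number>();
  readonly observations = new Map<string, number[]>();

  increment(name: string): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + 1);
  }

  histogram(name: string, value: number): void {
    const values = this.observations.get(name) ?? [];
    values.push(value);
    this.observations.set(name, values);
  }
}

// Test stub: keeps every log entry so assertions can inspect fields.
class InMemoryLogger implements LoggerPort {
  readonly entries: LogEntry[] = [];

  info(message: string, fields: Record<string, unknown> = {}): void {
    this.entries.push({ level: 'INFO', message, fields });
  }

  error(message: string, fields: Record<string, unknown> = {}): void {
    this.entries.push({ level: 'ERROR', message, fields });
  }
}

// Exercise the stubs the way the success path of the use case would.
const stubMetrics = new InMemoryMetrics();
const stubLogger = new InMemoryLogger();
stubMetrics.increment('orders.created');
stubMetrics.histogram('orders.processing_duration_ms', 42);
stubLogger.info('Order created', { order_id: 'ord_1' });
```

With stubs like these, "emits `orders.created` on success" becomes an assertion in the use case's unit test, which is how observability requirements in the spec get enforced.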

Observability rules for AGENTS.md

## Observability Rules

- All use cases must inject LoggerPort and emit INFO on success, ERROR on failure
- Log fields must be structured key-value pairs — no string interpolation in logs
- Use WARN for handled errors (retries, fallbacks), ERROR for unhandled failures
- Never log PII (email addresses, phone numbers, payment data, tokens)
- Metrics must be emitted at use case boundaries, not inside domain entities
- Error logs must include enough context to diagnose without access to the database
- Use trace spans for any operation that crosses a service or makes an external call
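The "never log PII" rule is easier to keep when it is enforced in one place. The sketch below assumes a hypothetical deny-list of field names and a `redactPii` helper that a logger adapter could apply before writing. The key list is illustrative, and a real project would agree its own list in the spec.

```typescript
// Hypothetical deny-list of field names treated as PII or secrets.
const PII_KEYS = new Set(['email', 'phone', 'card_number', 'token', 'password']);

// True only for plain object literals, so Dates and class instances are left alone.
function isPlainObject(value: unknown): value is Record<string, unknown> {
  return (
    typeof value === 'object' &&
    value !== null &&
    Object.getPrototypeOf(value) === Object.prototype
  );
}

// Returns a copy of `fields` with deny-listed keys masked, recursing into nested plain objects.
// Limitation of this sketch: arrays are passed through without inspection.
function redactPii(fields: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(fields)) {
    if (PII_KEYS.has(key.toLowerCase())) {
      safe[key] = '[REDACTED]';
    } else if (isPlainObject(value)) {
      safe[key] = redactPii(value);
    } else {
      safe[key] = value;
    }
  }
  return safe;
}

const redactedFields = redactPii({
  order_id: 'ord_1',
  email: 'jane@example.com',
  customer: { id: 'cus_9', phone: '+15550100' },
});
```

A deny-list only catches known field names. It complements the rule rather than replacing it.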

What goes in every error log

An error log is useful only if the on-call engineer can diagnose the issue without asking the user to reproduce it. Every error log needs:

What failed — the operation name
Why it failed — the error message and code
What the state was — the relevant IDs and context at the time of failure
What was attempted — inputs, retry count if applicable

logger.error('Payment charge failed', {
  order_id: order.id.value,
  customer_id: order.customerId.value,
  amount_cents: order.total().inCents(),
  currency: order.total().currency,
  payment_provider: 'stripe',
  error_code: stripeError.code,
  error_message: stripeError.message,
  attempt: attemptNumber,
  correlation_id: traceContext.traceId,
});

This log tells the on-call engineer everything they need to investigate — without touching the database.
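The four parts listed above can be assembled by a small helper so that no error log omits one of them. This is a sketch under assumptions: `describeError` and `buildErrorFields` are hypothetical names, and reading a string `code` property off the thrown error is a convention some SDKs follow rather than a guarantee.

```typescript
interface ErrorDescription {
  error_code: string;
  error_message: string;
}

// Normalises whatever was thrown into the "why it failed" fields.
function describeError(error: unknown): ErrorDescription {
  if (error instanceof Error) {
    // Assumption: provider errors may expose a string `code`; fall back to the error name.
    const code = (error as Error & { code?: unknown }).code;
    return {
      error_code: typeof code === 'string' ? code : error.name,
      error_message: error.message,
    };
  }
  return { error_code: 'UNKNOWN', error_message: String(error) };
}

// Combines what failed, why, the state, and what was attempted into one flat payload.
function buildErrorFields(
  operation: string,
  error: unknown,
  state: Record<string, string | number>,
  attempted: Record<string, string | number>,
): Record<string, string | number> {
  return { operation, ...describeError(error), ...state, ...attempted };
}

const declinedError = Object.assign(new Error('Card was declined'), { code: 'card_declined' });
const errorFields = buildErrorFields(
  'payment.charge',
  declinedError,
  { order_id: 'ord_1', amount_cents: 4999, currency: 'EUR' },
  { attempt: 3 },
);
```

The result can be passed directly as the second argument to `logger.error`, keeping the payload flat and filterable.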