Operability

Operability is the set of properties that make software safe to deploy, operate, and change in production. Code that passes tests but can’t be deployed safely, rolled back cleanly, or degraded gracefully under load is not finished — it’s a liability waiting to manifest.

Like observability and security, operability must be designed in. An agent building a feature won’t add feature flags, circuit breakers, or health checks unless they’re in the spec.

Feature flags

Feature flags decouple deployment from release. Code can be deployed at any time; the feature becomes available only when the flag is enabled. This reduces deployment risk, because a misbehaving feature can be switched off in production without a rollback.

When to require a feature flag (in AGENTS.md or per-design):

New user-facing behaviour that isn’t backwards-compatible
Features that affect billing, permissions, or data processing
Features being rolled out gradually (% of users)
Anything that might need to be disabled quickly in production

Flag implementation pattern

Keep flags outside the domain layer — they’re an infrastructure concern:

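// Port: the domain doesn't know about flagging infrastructure
interface FlagContext {
  userId?: string; // lets the adapter target individual users or a percentage
}

interface FeatureFlags {
  isEnabled(flag: string, context?: FlagContext): boolean;
}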

// Use case — receives flags through injection
class CreateOrderUseCase {
  constructor(
    private readonly orderRepository: OrderRepository,
    private readonly flags: FeatureFlags,
  ) {}

  async execute(command: CreateOrderCommand): Promise<OrderId> {
    if (this.flags.isEnabled('new-pricing-engine', { userId: command.customerId })) {
      // new path
    } else {
      // existing path
    }
  }
}
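One way to back the port for a gradual rollout is an adapter that hashes the user into a stable bucket. The sketch below is an assumption, not a prescribed implementation: `PercentageFeatureFlags`, `bucketFor`, and the in-memory rollout map are hypothetical names, and a real system would more likely delegate to a flag service.

```typescript
// Port shapes repeated here so the sketch is self-contained.
interface FlagContext {
  userId?: string;
}

interface FeatureFlags {
  isEnabled(flag: string, context?: FlagContext): boolean;
}

// Deterministic FNV-1a-style hash over UTF-16 code units, mapped to [0, 100).
function bucketFor(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0) % 100;
}

// Hypothetical in-memory adapter: each flag maps to a rollout percentage (0 to 100).
class PercentageFeatureFlags implements FeatureFlags {
  constructor(private readonly rollout: ReadonlyMap<string, number>) {}

  isEnabled(flag: string, context?: FlagContext): boolean {
    const percentage = this.rollout.get(flag) ?? 0; // unknown flags are off
    if (percentage >= 100) return true;
    if (percentage <= 0 || !context?.userId) return false;
    // Hash flag + user so a given user always gets the same answer,
    // and different flags bucket users independently.
    return bucketFor(`${flag}:${context.userId}`) < percentage;
  }
}

// Usage: roughly a quarter of users see the new pricing engine.
const rolloutFlags = new PercentageFeatureFlags(new Map([['new-pricing-engine', 25]]));
const enabledForCustomer = rolloutFlags.isEnabled('new-pricing-engine', { userId: 'customer-42' });
```

Because the bucket depends only on the flag name and user ID, raising the percentage only ever adds users; nobody who already has the feature loses it mid-rollout.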

Flags are temporary

Feature flags accumulate technical debt. Every flag is a code path that must be maintained, tested, and eventually removed. Add a cleanup task to AGENTS.md when you add a flag, and remove the flag once the rollout is complete.

AGENTS.md rules for feature flags

## Feature flags

- New user-facing features must be behind a feature flag
- Flags are injected as FeatureFlags interface — never read from environment directly in domain
- Flag names: kebab-case, descriptive (new-pricing-engine, not flag_v2)
- Every flag added must have a corresponding removal task created
- Tests must cover both flag-enabled and flag-disabled paths

Health checks

Every service must expose health check endpoints. Agents won’t add them without an explicit requirement.

// Minimal — is the service alive?
app.get('/health', (req, res) => res.status(200).json({ status: 'ok' }));

// Deep — is the service ready to handle traffic?
app.get('/ready', async (req, res) => {
  const dbStatus = await checkDatabaseConnection();
  const cacheStatus = await checkCacheConnection();

  const ready = dbStatus.ok && cacheStatus.ok;
  res.status(ready ? 200 : 503).json({
    status: ready ? 'ready' : 'not_ready',
    checks: { database: dbStatus, cache: cacheStatus },
  });
});

Two endpoints:

/health: is the process running? Used as a liveness check; an orchestrator restarts the instance when it fails.
/ready: can the service handle requests? Used as a readiness check; load balancers and orchestrators send traffic only while it passes.
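A readiness endpoint is only useful if it always answers, even when a dependency hangs or throws. The sketch below shows one way to get there by running checks in parallel with a per-check timeout. `runCheck`, `evaluateReadiness`, and the 1000 ms default are assumptions for illustration, not part of any framework.

```typescript
interface CheckResult {
  ok: boolean;
  detail?: string;
}

type Check = () => Promise<CheckResult>;

// Runs one check, turning a thrown error or a timeout into a failed result
// so the readiness endpoint never hangs on a stuck dependency.
async function runCheck(check: Check, timeoutMs: number): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, detail: `timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch (error) {
    return { ok: false, detail: error instanceof Error ? error.message : String(error) };
  } finally {
    clearTimeout(timer);
  }
}

interface ReadinessReport {
  httpStatus: number;
  body: { status: string; checks: Record<string, CheckResult> };
}

// Runs all checks in parallel and maps the outcome to an HTTP status.
async function evaluateReadiness(
  checks: Record<string, Check>,
  timeoutMs = 1000,
): Promise<ReadinessReport> {
  const report: Record<string, CheckResult> = {};
  await Promise.all(
    Object.keys(checks).map(async (name) => {
      report[name] = await runCheck(checks[name], timeoutMs);
    }),
  );
  const ready = Object.keys(report).every((name) => report[name].ok);
  return {
    httpStatus: ready ? 200 : 503,
    body: { status: ready ? 'ready' : 'not_ready', checks: report },
  };
}
```

A /ready handler could then pass its database and cache checks to `evaluateReadiness` and respond with the returned `httpStatus` and `body`.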

Graceful degradation

Services fail. The question is whether they fail catastrophically or gracefully. Resilience patterns let a service continue operating — in a degraded state — when dependencies are unavailable.

Circuit breaker

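A circuit breaker prevents repeated calls to a failing dependency. After a threshold of failures, the circuit “opens” and calls fail fast for a cooldown period. The breaker then lets a single test call through: if it succeeds, the circuit closes again; if it fails, the circuit stays open for another period.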

// Specify in design docs: which external calls get a circuit breaker?
interface PaymentGateway {
  charge(amount: Money, card: CardToken): Promise<PaymentResult>;
}

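// Infrastructure adapter wraps the real client with a breaker.
// The options assume an opossum-style CircuitBreaker; StripeClient stands in
// for the real SDK client type.
class ResilientStripeGateway implements PaymentGateway {
  private readonly breaker: CircuitBreaker;

  constructor(stripeClient: StripeClient) {
    // Wrap the call in an arrow function so the client keeps its `this` binding
    this.breaker = new CircuitBreaker(
      (amount: Money, card: CardToken) => stripeClient.charge(amount, card),
      {
        timeout: 3000, // ms before a call counts as a failure
        errorThresholdPercentage: 50, // open once half of recent calls fail
        resetTimeout: 30000, // ms to wait before allowing a test call
      },
    );
  }

  async charge(amount: Money, card: CardToken): Promise<PaymentResult> {
    return this.breaker.fire(amount, card);
  }
}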
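To make the closed, open, and half-open states concrete, here is a minimal hand-rolled breaker. It is a simplified sketch under stated assumptions: it counts consecutive failures rather than an error percentage, `SimpleCircuitBreaker` is a hypothetical name, and the injectable `now` clock exists only to make the behaviour testable. In production a maintained library is usually the better choice.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Minimal consecutive-failure circuit breaker with an injectable clock.
class SimpleCircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold: number,
    private readonly resetTimeoutMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  // Should the next call be attempted, or should it fail fast?
  allowRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.resetTimeoutMs) {
      this.state = 'half-open'; // cooldown elapsed: let one test call through
      return true;
    }
    return this.state === 'closed';
  }

  recordSuccess(): void {
    this.state = 'closed';
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed test call, or too many consecutive failures, opens the circuit.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  // Wraps an async call: fail fast while open, otherwise record the outcome.
  async call<T>(operation: () => Promise<T>): Promise<T> {
    if (!this.allowRequest()) throw new Error('circuit open: failing fast');
    try {
      const result = await operation();
      this.recordSuccess();
      return result;
    } catch (error) {
      this.recordFailure();
      throw error;
    }
  }
}
```

While the breaker is half-open, further calls are rejected until the test call reports back, so a recovering dependency is probed by one request rather than by the full traffic load.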

Retry with backoff

For transient failures (network blips, rate limits), retry with exponential backoff before failing:

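// Resolves after `ms` milliseconds
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 100,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (attempt === maxAttempts) throw error;
      // The delay doubles after every failed attempt
      await sleep(baseDelayMs * Math.pow(2, attempt - 1));
    }
  }
  // Only reachable when maxAttempts is less than 1
  throw new Error('withRetry: maxAttempts must be at least 1');
}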
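The delay schedule is easier to reason about when it is computed separately from the retry loop. The sketch below works through the arithmetic for a 100 ms base delay and adds two refinements that go beyond the function above and are assumptions here: a maximum delay cap and random jitter. `backoffDelayMs` and `jitteredDelayMs` are hypothetical helper names.

```typescript
// Delay before the retry that follows failed attempt `attempt` (1-based):
// baseDelayMs * 2^(attempt - 1), capped at maxDelayMs.
function backoffDelayMs(attempt: number, baseDelayMs = 100, maxDelayMs = 5000): number {
  return Math.min(maxDelayMs, baseDelayMs * 2 ** (attempt - 1));
}

// Jittered variant: pick a random delay in [0, cappedDelay) so that clients
// which failed together do not all retry at the same moment.
function jitteredDelayMs(
  attempt: number,
  baseDelayMs = 100,
  maxDelayMs = 5000,
  random: () => number = Math.random,
): number {
  return Math.floor(random() * backoffDelayMs(attempt, baseDelayMs, maxDelayMs));
}

// With the defaults, the first four delays are 100, 200, 400, and 800 ms.
const schedule = [1, 2, 3, 4].map((attempt) => backoffDelayMs(attempt));
```

With three attempts, only the first two delays are ever used, so the worst case adds about 300 ms of waiting on top of the three calls themselves.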

AGENTS.md rules for resilience

## Resilience rules

- External HTTP calls must have explicit timeouts (never rely on provider defaults)
- External calls that can fail transiently: wrap with retry + exponential backoff
- High-value external calls (payment, email): wrap with circuit breaker
- Timeout values: state the value for each external call in the design doc
- Specify fallback behaviour: what happens when the external call fails entirely?
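The timeout and fallback rules can be combined in two small wrappers. This is a sketch under assumptions: `withTimeout`, `withFallback`, and `TimeoutError` are hypothetical names, and the timeout only stops waiting for the result; it does not cancel the underlying request.

```typescript
class TimeoutError extends Error {}

// Rejects with TimeoutError if `operation` has not settled within `timeoutMs`.
// The underlying work keeps running; only the caller stops waiting for it.
async function withTimeout<T>(operation: () => Promise<T>, timeoutMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new TimeoutError(`timed out after ${timeoutMs}ms`)),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([operation(), timeout]);
  } finally {
    clearTimeout(timer);
  }
}

// Returns the fallback value when the primary call fails for any reason.
async function withFallback<T>(
  primary: () => Promise<T>,
  fallback: (error: unknown) => T,
): Promise<T> {
  try {
    return await primary();
  } catch (error) {
    return fallback(error);
  }
}
```

For example, a non-critical recommendations call could be wrapped as `withFallback(() => withTimeout(loadRecommendations, 500), () => [])`, where `loadRecommendations` is a hypothetical function: the page renders without recommendations instead of failing. For a payment call the right fallback is usually an explicit error to the user, which is why the design doc has to state the behaviour per dependency.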

Rollback strategy

Every deployment should have a defined rollback path before it goes out. Agents can implement database migrations but won’t design rollback strategies without being asked.

For each migration, design documents must state:

Is this migration reversible? (Adding a column is reversible; removing one is not, because the dropped data is lost unless it was backed up first.)
What’s the rollback command?
Does a rollback require a code rollback too?
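The three questions above can be captured as a small structured record that a review step checks for completeness. The shape below is a hypothetical format, not a standard one, and the `orders.currency` column is an invented example.

```typescript
// Hypothetical record answering the three rollback questions for one migration.
interface MigrationRollbackPlan {
  migration: string; // the forward change
  reversible: boolean; // can it be undone without losing data?
  rollbackCommand: string | null; // null when no safe automated rollback exists
  requiresCodeRollback: boolean; // must the application be rolled back too?
}

// Returns a list of problems; an empty list means the plan is complete.
function reviewRollbackPlan(plan: MigrationRollbackPlan): string[] {
  const problems: string[] = [];
  if (plan.reversible && !plan.rollbackCommand) {
    problems.push(`${plan.migration}: marked reversible but has no rollback command`);
  }
  return problems;
}

// Adding a column: reversible, as long as code that reads it is rolled back first.
const addCurrency: MigrationRollbackPlan = {
  migration: 'ALTER TABLE orders ADD COLUMN currency TEXT',
  reversible: true,
  rollbackCommand: 'ALTER TABLE orders DROP COLUMN currency',
  requiresCodeRollback: true,
};

// Dropping a column: the data is gone, so there is no rollback command to list.
const dropLegacyTotal: MigrationRollbackPlan = {
  migration: 'ALTER TABLE orders DROP COLUMN legacy_total',
  reversible: false,
  rollbackCommand: null,
  requiresCodeRollback: true,
};
```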

Migration safety pattern — expand/contract:

Phase 1: Expand: add the new column; old code still works
Phase 2: Dual-write: new code writes both the old and the new column
Phase 3: Migrate data: backfill the new column for rows written before dual-write began
Phase 4: Switch: new code reads from the new column only
Phase 5: Contract: remove the old column

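This pattern keeps each phase small enough to roll back on its own. The exception is the final contract phase: dropping the old column destroys data, so run it only after the switch has been stable in production. Agents can execute the pattern, but you have to specify it in the design document.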
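At the application level, the dual-write and switch phases come down to two small functions. The sketch below assumes a hypothetical rename of `customers.name` to `customers.full_name` and models rows as plain objects; the names and the `ReadSource` toggle are illustrative only.

```typescript
// Hypothetical row shape while both columns exist.
interface CustomerRow {
  id: string;
  name: string | null; // old column, removed in the contract phase
  full_name: string | null; // new column, added in the expand phase
}

type ReadSource = 'old' | 'new';

// Dual-write: every write populates both columns, so either version of the
// code can still read the row and the deployment can be rolled back.
function writeCustomerName(row: CustomerRow, fullName: string): CustomerRow {
  return { ...row, name: fullName, full_name: fullName };
}

// Switch: flipping `source` to 'new' moves reads to the new column.
// Flipping it back is the rollback, because the old column is still written.
function readCustomerName(row: CustomerRow, source: ReadSource): string | null {
  return source === 'new' ? row.full_name : row.name;
}

// Data migration: copy the old value into rows that predate dual-write.
function backfillCustomer(row: CustomerRow): CustomerRow {
  return row.full_name === null ? { ...row, full_name: row.name } : row;
}
```

Driving `source` from a feature flag makes the switch phase reversible without a deployment.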

Operability checklist for design review

Before implementation starts, verify the design document answers:

Which new behaviours need a feature flag?
What are the health check requirements for any new service endpoints?
Which external dependencies need circuit breakers or retry logic?
What are the explicit timeout values for external calls?
What’s the rollback strategy for any database migrations?
What happens to in-flight requests during deployment?
What does graceful degradation look like if a key dependency is unavailable?