
Human-Agent Collaboration

The Conductor model, when to delegate vs. intervene, AI fatigue, and how to sustain effective work with agents at scale.


Working effectively with agents is a skill. It’s not the same skill as writing code, and it doesn’t develop automatically just from using agents. This page describes what effective collaboration looks like, what it costs, and how to structure it sustainably.

The Conductor model

The right mental model for working with AI agents isn’t “pair programmer” — it’s conductor. A conductor doesn’t play each instrument. They understand what each instrument can do, set the tempo, make interpretation decisions, and ensure the ensemble produces something coherent. The agents are the instruments. You are the conductor.

This isn’t a reduction in responsibility — it’s a shift in where judgement is applied. The engineer’s role moves from writing code to defining contracts and verifying whether the agent honours them.

How the distribution of time shifts:

| Activity | Before agents | With agents |
| --- | --- | --- |
| Writing code | 60% | 15% |
| Reviewing code | 15% | 35% |
| Specifying intent | 5% | 20% |
| Architecture decisions | 10% | 20% |
| Debugging | 10% | 10% |

Total time spent is the same. Where it’s spent changes. The leverage per hour increases because you’re operating at a higher level of abstraction.

When to delegate, when to intervene

Not every task is equally suited for agent delegation. The right call depends on how well the constraints can be specified upfront.

Delegate when:

  • The task has a clear definition of done
  • The relevant context fits in a spec + design doc
  • Failure modes are catchable by tests
  • The task is bounded to a limited set of files
  • You’ve done similar tasks before and know what good output looks like

Intervene when:

  • The agent is violating architecture rules after two corrections
  • The implementation is diverging from the spec and the agent isn’t noticing
  • The session has been running long enough that context decay is setting in
  • The task requires judgement about business trade-offs the agent can’t have
  • You’re seeing patterns you don’t recognise and can’t evaluate quickly

Rebuild context when:

  • You start a new feature or a new layer of an existing one
  • The session has been running for more than 1–2 hours of complex work
  • You’re getting consistent architecture violations despite explicit rules

The cost of not intervening early is high. An agent that’s been building in the wrong direction for 30 minutes produces 30 minutes of work to throw away.

The five skills of the conductor

| Skill | Why it matters |
| --- | --- |
| Specification writing | Output quality is directly proportional to spec quality. Vague specs produce vague outputs. |
| Task decomposition | Breaking complex work into 15–60 minute pieces with a clear definition of done. Agents handle bounded tasks much better than open-ended ones. |
| Context curation | Knowing what to give the agent and what to withhold. Too much context creates noise; too little creates drift. |
| Output evaluation | Agent-generated code has no narrative — you can't read the author's intent. Evaluating it requires reading the code on its own terms, against the spec. |
| Risk calibration | Knowing when to let the agent run and when to step in. Calibrated by experience with a specific agent on a specific codebase. |

Writing effective task specifications

A good task spec eliminates ambiguity before the agent starts. It doesn’t need to be long — it needs to be specific. Each sentence removes one possible misinterpretation.

OBJECTIVE:    What should be different when the task is complete?
              "The login form should validate email format before submitting."

CONTEXT:      What does the agent need to know?
              "The form is in src/components/LoginForm.tsx and uses React Hook Form.
              The validation utils are in src/utils/validators.ts."

CONSTRAINTS:  What must the agent NOT do?
              "Don't modify the form layout. Don't add new dependencies."

VERIFICATION: How will we know it's complete?
              "The form shows an error for invalid email formats.
              Existing tests still pass. A new test for email validation is added."

SCOPE:        Which files can be touched?
              "Only LoginForm.tsx and validators.ts."

A practical daily rhythm

A structured day prevents the context-switching cost that comes from supervising multiple agents while also doing deep work.

  • Morning (first 2 hours): Review output from the previous session. Approve clean work, give precise feedback on what needs adjustment. This is reactive time — don’t start new agent tasks yet.
  • Mid-morning (2 hours): Specifying and delegating. Write specs for the day’s work. Launch agents and move on to other tasks while they run.
  • Afternoon (2 hours): Protect this for deep work — architecture decisions, complex design, mentoring, reviewing complex changes. Don’t check agent status during this block.
  • Late afternoon (2 hours): Review and iterate. Evaluate the day’s delegations. Update AGENTS.md with patterns discovered during the day.

The key discipline is protecting the deep work block. Agents running in parallel create a pull towards constant status-checking. That fragmentation is costly.

AI fatigue

AI fatigue is a real and underreported phenomenon in engineering teams that have scaled agent use. It’s worth naming explicitly because teams that don’t acknowledge it tend to attribute its symptoms to individual performance rather than systemic causes.

Patterns that appear at scale:

Review burden. Before agents, a team might process 20–25 PRs per week. With agents running, that can jump to 100+. The review time per PR doesn’t decrease — it may increase because agent-generated code requires more scrutiny. If you scale PR volume without scaling review infrastructure, reviewers burn out.

Context switching costs. Multiple agents running in parallel create constant interruption pressure. Each status check carries a cognitive cost. The aggregate effect is exhausting in a way that’s hard to articulate.

Perceived cost aversion. An agent produces 70% of a solution in one minute. Spending an hour refining it feels irrational, even when that hour would prevent significant technical debt. Teams start shipping incomplete work because the cost of completion feels disproportionate to what the agent already delivered.

Variable reward fragility. Agents work exceptionally well sometimes and poorly other times. That intermittent pattern makes it harder to step back and assess whether the overall approach is working.

The fix is structural, not personal. Telling individuals to “take breaks” or “use agents less” doesn’t address the root cause.

What works:

  • Automated backpressure — CI checks that catch errors before human review, so reviewers spend attention on judgement calls, not mechanical issues
  • Two-layer review policy — automated checks handle formatting, types, test coverage; human review handles architectural fit and business correctness
  • Visible cost budgets — when the cost of “one more attempt” is visible, decisions about when to stop become more calibrated
  • Protected deep work time — the daily rhythm above, enforced as a team norm

Developing junior engineers in an agentic team

The tasks historically used to develop junior engineers — boilerplate, simple fixes, documentation, straightforward tests — are precisely the tasks agents handle well. This creates a real pipeline problem: if juniors aren’t doing junior work, how do they develop the judgement that makes them senior?

The emerging pattern is a role shift: junior engineers move from writing code to reviewing agent output. Reviewing 20 agent PRs per day teaches more about code quality and failure patterns than writing 2 PRs per day — but only with deliberate mentoring on what to look for.

What to train juniors to evaluate:

  • Does the implementation match the spec? Not just “does it work?” but “does it do exactly what was asked?”
  • Are the architecture rules from AGENTS.md followed?
  • Are the tests meaningful? Agent-generated tests often verify implementation rather than behaviour.
  • What would break if this code were slightly wrong?
  • What is the agent optimising for that a human reviewer would weigh differently?
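The "tests verify implementation rather than behaviour" failure is easiest to teach with a contrast. A hedged sketch, using an invented `slugify` helper (the helper and both tests are hypothetical, purely to show the difference):

```typescript
// Hypothetical helper under review.
function slugify(title: string): string {
  return title
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
}

// Implementation-shaped test (anti-pattern): it mirrors the regex,
// so it passes even if the regex encodes the wrong behaviour.
function implementationShapedTest(): boolean {
  const input = "Hello World";
  const expected = input
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return slugify(input) === expected;
}

// Behaviour-shaped test: states what the spec actually requires.
function behaviourShapedTest(): boolean {
  return (
    slugify("  Hello, World!  ") === "hello-world" && // punctuation collapses
    slugify("already-a-slug") === "already-a-slug"    // idempotent on clean input
  );
}
```

The first test can never fail independently of the code; the second can. Training juniors to spot the difference is the point.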

This requires explicit mentoring, not just exposure. Senior engineers need to pair with juniors on reviews initially, explaining what they’re looking for and why. The skill of evaluating agent output is learnable — it doesn’t develop automatically.

Definition of Ready and Definition of Done

These criteria apply to every feature regardless of who implements it — human or agent.

Definition of Ready (before implementation starts)

A feature is ready to implement when:

  • Proposal exists and is approved
  • Spec files cover all scenarios with WHEN/THEN format
  • Design document defines the architecture, ports, and file structure
  • Acceptance criteria are clear and testable
  • Affected context boundaries and existing specs are identified
  • Task list is broken into 15–60 minute units

A feature that isn’t ready isn’t ready for the agent either. “The agent will figure it out” is not a substitute for a spec.
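The WHEN/THEN format maps directly onto tests, which is what makes it worth the discipline. A minimal sketch with an invented discount rule (the domain logic is hypothetical; the scenario-to-test mapping is the point):

```typescript
// Hypothetical domain rule, in integer cents to avoid float rounding.
function discountedTotal(cents: number, isMember: boolean): number {
  // Members get 10% off orders of 100.00 or more.
  return isMember && cents >= 10000 ? cents - Math.round(cents / 10) : cents;
}

// Scenario: member discount
//   WHEN a member places an order of 100.00 or more
//   THEN a 10% discount is applied
function scenarioMemberDiscount(): boolean {
  return discountedTotal(20000, true) === 18000;
}

// Scenario: non-member pays full price
//   WHEN a non-member places the same order
//   THEN no discount is applied
function scenarioNonMemberFullPrice(): boolean {
  return discountedTotal(20000, false) === 20000;
}
```

Each scenario in the spec becomes one test; if a scenario can't be written this way, the spec isn't ready.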

Definition of Done (before merge)

A feature is done when:

  • All OpenSpec scenarios have corresponding Gherkin tests that pass
  • Domain logic has unit test coverage
  • Adapters have integration test coverage
  • No infrastructure imports in domain or application layers
  • All dependencies are constructor-injected as interfaces
  • Conventional Commits used throughout
  • Architecture lint passes
  • PR reviewed: automated checks passed, human reviewed architecture and business logic
  • Change archived in openspec/specs/
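Two of the checklist items — "no infrastructure imports in domain or application layers" and "all dependencies constructor-injected as interfaces" — can be sketched in a few lines. Names here are illustrative assumptions; the real ports belong in the design document:

```typescript
// Domain layer: depends only on a port (interface), never on an
// infrastructure module — the property the architecture lint enforces.
interface UserRepository {
  findEmail(userId: string): string | undefined;
}

class WelcomeEmailService {
  // The port is constructor-injected, so the domain stays testable
  // and the dependency direction is explicit.
  constructor(private readonly users: UserRepository) {}

  recipientFor(userId: string): string {
    const email = this.users.findEmail(userId);
    if (!email) throw new Error(`no email for user ${userId}`);
    return email;
  }
}

// Test double standing in for the real infrastructure adapter.
class InMemoryUserRepository implements UserRepository {
  constructor(private readonly emails: Record<string, string>) {}
  findEmail(userId: string): string | undefined {
    return this.emails[userId];
  }
}
```

The same structure is what lets agent-generated domain code be unit-tested without touching infrastructure.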

DoD applies to the agent too

The Definition of Done applies to agent-generated code exactly as it does to human-written code. “The agent did it” is not a reason to skip the checklist.