Measurement
The metrics that tell you whether AI-assisted engineering is actually working — lead time, acceptance rate, defect rate, rework, and cost per change.
Engineering teams adopting AI assistance tend to feel productive. Whether they are productive is a different question, and it requires measurement to answer.
The risk is optimising for the wrong signal. Lines of code generated per hour is not a useful metric. PRs merged per day is not a useful metric if half of them introduce rework. What you need are metrics that capture the full cycle — from idea to stable production.
The metrics that matter
Lead time for changes
Definition: Time from “work begins on a feature” to “feature is in production.”
This is the primary velocity metric. AI assistance should reduce it — if it doesn’t, something in the process is absorbing the gains (review backlog, rework, integration friction).
How to measure: Track from first commit on a branch to deployment. Most CI/CD tools expose this natively.
What to watch for: If lead time increases after adopting agents, the bottleneck is usually review throughput — too many PRs for the available reviewers.
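As a sketch of the computation (assuming you can export per-PR timestamps from your CI/CD tool; the timestamps and format below are illustrative):

```python
from datetime import datetime
from statistics import median

def lead_time_days(first_commit: str, deployed: str) -> float:
    """Lead time for one change: first commit on the branch to deployment."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(first_commit, fmt)
    end = datetime.strptime(deployed, fmt)
    return (end - start).total_seconds() / 86400  # seconds per day

# Hypothetical export: (first_commit_at, deployed_at) per merged change.
changes = [
    ("2024-05-01T09:00:00", "2024-05-03T15:00:00"),
    ("2024-05-02T10:00:00", "2024-05-09T10:00:00"),
    ("2024-05-06T08:00:00", "2024-05-07T08:00:00"),
]

times = [lead_time_days(a, b) for a, b in changes]
print(f"median lead time: {median(times):.2f} days")  # → 2.25 days
```

Median is usually the better summary than mean here: a single long-running change shouldn't mask an improvement across the rest of the pipeline.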
AI PR acceptance rate
Definition: Percentage of agent-generated PRs that are accepted without requiring a full rewrite.
This measures the quality of the agent’s output relative to your standards. A low acceptance rate (< 60%) usually means one of three things: specs are too vague, AGENTS.md is missing critical rules, or the review criteria aren’t aligned with what the agent can produce.
How to measure: Tag PRs as “agent-generated” (many teams add a label automatically) and track merged vs. closed without merge.
Target: > 80% acceptance rate. Below 60% is a signal to fix upstream (specs, context) rather than accept the rework as normal.
Acceptance rate vs. review iteration count
Distinguish between “PR accepted after one round of feedback” (good) and “PR accepted after three major rewrites” (hidden rework). Track iteration count alongside acceptance rate.
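A minimal sketch of tracking both signals together, assuming you export per-PR records (the tuple layout and numbers are hypothetical):

```python
# Hypothetical per-PR records: (agent_generated, merged, review_iterations).
prs = [
    (True, True, 1),
    (True, True, 3),
    (True, False, 2),   # closed without merge
    (True, True, 1),
    (False, True, 1),   # human-authored, excluded from the agent metric
]

agent_prs = [p for p in prs if p[0]]
merged = [p for p in agent_prs if p[1]]

acceptance_rate = len(merged) / len(agent_prs)
avg_iterations = sum(p[2] for p in merged) / len(merged)

print(f"agent PR acceptance rate: {acceptance_rate:.0%}")        # → 75%
print(f"avg review iterations on merged PRs: {avg_iterations:.1f}")  # → 1.7
```

Reporting the two numbers side by side is the point: an 80% acceptance rate with three rewrites per PR is hidden rework, not success.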
Defect rate
Definition: Number of bugs introduced per feature, measured by defects found in production or in QA.
AI assistance should reduce defect rate — specs as tests, BDD coverage, and architecture rules all exist to catch errors early. If defect rate isn’t decreasing, the testing layer isn’t doing its job.
How to measure: Track bugs opened against features, tagged by which sprint introduced them. Most issue trackers support this with labels.
What to watch for: If defects increase after adopting agents, the most common cause is insufficient BDD coverage — the agent produced code that passes unit tests but doesn’t fulfil the actual business requirements.
Rework percentage
Definition: Percentage of code that is rewritten within 30 days of being merged.
Rework is the hidden cost of speed. A feature delivered in 2 days that requires 3 days of rework cost 5 days, not 2. Agents can produce high volumes of rework-generating code when specs are incomplete or review standards are lax.
How to measure: Track churn — lines changed within 30 days of initial merge. git log + git diff on a rolling window. Some CI tools have this built in.
Target: < 20% rework rate. Above 40% is a sign that something in the spec-to-implementation pipeline is consistently wrong.
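As a sketch, assuming you have already derived per-PR churn figures (e.g. from git log and git diff over a rolling 30-day window; the records below are hypothetical):

```python
# Hypothetical churn export: for each merged PR, lines merged and lines of
# that PR's code rewritten within 30 days of the merge.
merges = [
    {"pr": 101, "lines": 400, "churn_30d": 60},
    {"pr": 102, "lines": 120, "churn_30d": 0},
    {"pr": 103, "lines": 250, "churn_30d": 180},
]

total_lines = sum(m["lines"] for m in merges)
total_churn = sum(m["churn_30d"] for m in merges)
rework_pct = total_churn / total_lines

print(f"30-day rework rate: {rework_pct:.0%}")  # → 31%
```

Aggregating line counts rather than averaging per-PR percentages keeps one tiny PR with high churn from distorting the team-level figure.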
Cost per change
Definition: Aggregate cost (engineer time + infrastructure + AI API costs) to deliver one unit of change (one feature, one user story, one ticket).
This is the metric that tells you whether AI assistance is delivering ROI. It requires combining:
- Engineer hours on the feature (spec, review, iteration)
- AI API cost for the sessions
- Infrastructure cost during development
How to measure: Engineer hours from time-tracking or sprint velocity. AI costs from API billing dashboards (Anthropic, OpenAI, etc.) — most provide per-project breakdowns. Infrastructure cost from CI billing.
How to use these metrics
Metrics are useful for diagnosis, not judgement. Use them to identify where the process is breaking down — not to evaluate individual engineers.
| Metric | If it’s bad | Look at |
|---|---|---|
| Lead time increasing | Review backlog, rework cycles | PR throughput, acceptance rate |
| Acceptance rate < 60% | Spec quality, AGENTS.md completeness | Spec review process, context engineering |
| Defect rate increasing | Test coverage gaps | BDD coverage, acceptance criteria clarity |
| Rework > 40% | Spec-implementation gap | DoR checklist, review quality |
| Cost per change not decreasing | Time absorbed in one phase of the cycle | Per-phase time tracking |
A minimal measurement setup
You don’t need a full analytics platform to start measuring. A simple spreadsheet tracking per-PR data works:
| PR | Date | Agent-generated? | Lines | Iterations to merge | Defects (30d) | Notes |
|---|---|---|---|---|---|---|
After 10–20 features, patterns emerge. Invest in proper tooling once you know which metrics matter most for your team.
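Even a CSV export of that spreadsheet is enough to start surfacing patterns; a sketch, with hypothetical column names and data:

```python
import csv
import io

# Hypothetical CSV export of the per-PR spreadsheet.
raw = """pr,date,agent,lines,iterations,defects_30d
101,2024-05-01,yes,400,1,0
102,2024-05-03,yes,120,3,2
103,2024-05-08,no,250,1,0
104,2024-05-10,yes,80,1,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
agent = [r for r in rows if r["agent"] == "yes"]

avg_iter = sum(int(r["iterations"]) for r in agent) / len(agent)
defect_rate = sum(int(r["defects_30d"]) for r in agent) / len(agent)

print(f"agent PRs: {len(agent)}")                       # → 3
print(f"avg iterations to merge: {avg_iter:.1f}")       # → 1.7
print(f"defects per agent PR (30d): {defect_rate:.1f}") # → 0.7
```

Twenty rows of this is enough to spot, for example, that the high-iteration PRs cluster in one part of the codebase.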
What not to measure
Lines of code generated. It measures output volume, not output quality. A 500-line PR that requires a 400-line rewrite is worse than a 50-line PR that ships clean.
Number of PRs per day. Volume without quality is rework waiting to happen.
Agent uptime or usage rate. These are input metrics. You want output metrics.
Individual engineer acceptance rates. Acceptance rate varies by task complexity. Use it at team level, not individual level.
The feedback loop
Measurement only creates value if it closes a feedback loop:
Measure → Diagnose → Change process → Measure again
A concrete example: acceptance rate drops from 82% to 65% over three sprints. Diagnose: PRs from the new payments context have low acceptance rate. Root cause: the payments context AGENTS.md is missing authorisation rules. Fix: add authorisation rules to AGENTS.md. Measure: acceptance rate recovers to 80% over the next three sprints.
That feedback loop — measure, diagnose, fix upstream — is how AI-assisted engineering improves over time rather than plateauing at whatever quality level it started at.