Measurement
The metrics that tell you whether AI-assisted engineering is actually working — lead time, acceptance rate, defect rate, rework, and cost per change.
Engineering teams adopting AI assistance tend to feel productive. Whether they are productive is a different question, and it requires measurement to answer.
The risk is optimising for the wrong signal. Lines of code generated per hour is not a useful metric. PRs merged per day is not a useful metric if half of them introduce rework. What you need are metrics that capture the full cycle — from idea to stable production.
The metrics that matter
Lead time for changes
Definition: Time from “work begins on a feature” to “feature is in production.”
This is the primary velocity metric. AI assistance should reduce it — if it doesn’t, something in the process is absorbing the gains (review backlog, rework, integration friction).
How to measure: Track from first commit on a branch to deployment. Most CI/CD tools expose this natively.
What to watch for: If lead time increases after adopting agents, the bottleneck is usually review throughput — too many PRs for the available reviewers.
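As a sketch of the computation (assuming you can export per-PR timestamps from your CI/CD tool; the timestamps and format below are illustrative):

```python
from datetime import datetime
from statistics import median

def lead_time_days(first_commit: str, deployed: str) -> float:
    """Lead time for one change: first commit on the branch to deployment."""
    fmt = "%Y-%m-%dT%H:%M:%S"
    start = datetime.strptime(first_commit, fmt)
    end = datetime.strptime(deployed, fmt)
    return (end - start).total_seconds() / 86400  # seconds per day

# Hypothetical export: (first_commit_at, deployed_at) per merged change.
changes = [
    ("2024-05-01T09:00:00", "2024-05-03T15:00:00"),
    ("2024-05-02T10:00:00", "2024-05-09T10:00:00"),
    ("2024-05-06T08:00:00", "2024-05-07T08:00:00"),
]

times = [lead_time_days(a, b) for a, b in changes]
print(f"median lead time: {median(times):.2f} days")  # → 2.25 days
```

Median is usually the better summary than mean here: a single long-running change shouldn't mask an improvement across the rest of the pipeline.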
AI PR acceptance rate
Definition: Percentage of agent-generated PRs that are accepted without requiring a full rewrite.
This measures the quality of the agent’s output relative to your standards. A low acceptance rate (< 60%) usually means one of three things: specs are too vague, AGENTS.md is missing critical rules, or the review criteria aren’t aligned with what the agent can produce.
How to measure: Tag PRs as “agent-generated” (many teams add a label automatically) and track merged vs. closed without merge.
Target: > 80% acceptance rate. Below 60% is a signal to fix upstream (specs, context) rather than accept the rework as normal.
Acceptance rate vs. review iteration count
Distinguish between “PR accepted after one round of feedback” (good) and “PR accepted after three major rewrites” (hidden rework). Track iteration count alongside acceptance rate.
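A minimal sketch of tracking both signals together, assuming you export per-PR records (the tuple layout and numbers are hypothetical):

```python
# Hypothetical per-PR records: (agent_generated, merged, review_iterations).
prs = [
    (True, True, 1),
    (True, True, 3),
    (True, False, 2),   # closed without merge
    (True, True, 1),
    (False, True, 1),   # human-authored, excluded from the agent metric
]

agent_prs = [p for p in prs if p[0]]
merged = [p for p in agent_prs if p[1]]

acceptance_rate = len(merged) / len(agent_prs)
avg_iterations = sum(p[2] for p in merged) / len(merged)

print(f"agent PR acceptance rate: {acceptance_rate:.0%}")        # → 75%
print(f"avg review iterations on merged PRs: {avg_iterations:.1f}")  # → 1.7
```

Reporting the two numbers side by side is the point: an 80% acceptance rate with three rewrites per PR is hidden rework, not success.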
Defect rate
Definition: Number of bugs introduced per feature, measured by defects found in production or in QA.
AI assistance should reduce defect rate — specs as tests, BDD coverage, and architecture rules all exist to catch errors early. If defect rate isn’t decreasing, the testing layer isn’t doing its job.
How to measure: Track bugs opened against features, tagged by which sprint introduced them. Most issue trackers support this with labels.
What to watch for: If defects increase after adopting agents, the most common cause is insufficient BDD coverage — the agent produced code that passes unit tests but doesn’t fulfil the actual business requirements.
Rework percentage
Definition: Percentage of code that is rewritten within 30 days of being merged.
Rework is the hidden cost of speed. A feature delivered in 2 days that requires 3 days of rework cost 5 days, not 2. Agents can produce high volumes of rework-generating code when specs are incomplete or review standards are lax.
How to measure: Track churn — lines changed within 30 days of initial merge. git log + git diff on a rolling window. Some CI tools have this built in.
Target: < 20% rework rate. Above 40% is a sign that something in the spec-to-implementation pipeline is consistently wrong.
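As a sketch, assuming you have already derived per-PR churn figures (e.g. from git log and git diff over a rolling 30-day window; the records below are hypothetical):

```python
# Hypothetical churn export: for each merged PR, lines merged and lines of
# that PR's code rewritten within 30 days of the merge.
merges = [
    {"pr": 101, "lines": 400, "churn_30d": 60},
    {"pr": 102, "lines": 120, "churn_30d": 0},
    {"pr": 103, "lines": 250, "churn_30d": 180},
]

total_lines = sum(m["lines"] for m in merges)
total_churn = sum(m["churn_30d"] for m in merges)
rework_pct = total_churn / total_lines

print(f"30-day rework rate: {rework_pct:.0%}")  # → 31%
```

Aggregating line counts rather than averaging per-PR percentages keeps one tiny PR with high churn from distorting the team-level figure.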
Cost per change
Definition: Aggregate cost (engineer time + infrastructure + AI API costs) to deliver one unit of change (one feature, one user story, one ticket).
This is the metric that tells you whether AI assistance is delivering ROI. It requires combining:
- Engineer hours on the feature (spec, review, iteration)
- AI API cost for the sessions
- Infrastructure cost during development
How to measure: Engineer hours from time-tracking or sprint velocity. AI costs from API billing dashboards (Anthropic, OpenAI, etc.) — most provide per-project breakdowns. Infrastructure cost from CI billing.
How to use these metrics
Metrics are useful for diagnosis, not judgement. Use them to identify where the process is breaking down — not to evaluate individual engineers.
| Metric | If it’s bad | Look at |
|---|---|---|
| Lead time increasing | Review backlog, rework cycles | PR throughput, acceptance rate |
| Acceptance rate < 60% | Spec quality, AGENTS.md completeness | Spec review process, context engineering |
| Defect rate increasing | Test coverage gaps | BDD coverage, acceptance criteria clarity |
| Rework > 40% | Spec-implementation gap | DoR checklist, review quality |
| Cost per change not decreasing | Time absorbed in one phase of the cycle | Per-phase time tracking |
A minimal measurement setup
You don’t need a full analytics platform to start measuring. A simple spreadsheet tracking per-PR data works:
| PR | Date | Agent-generated? | Lines | Iterations to merge | Defects (30d) | Notes |
|---|---|---|---|---|---|---|
After 10–20 features, patterns emerge. Invest in proper tooling once you know which metrics matter most for your team.
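Even a CSV export of that spreadsheet is enough to start surfacing patterns; a sketch, with hypothetical column names and data:

```python
import csv
import io

# Hypothetical CSV export of the per-PR spreadsheet.
raw = """pr,date,agent,lines,iterations,defects_30d
101,2024-05-01,yes,400,1,0
102,2024-05-03,yes,120,3,2
103,2024-05-08,no,250,1,0
104,2024-05-10,yes,80,1,0
"""

rows = list(csv.DictReader(io.StringIO(raw)))
agent = [r for r in rows if r["agent"] == "yes"]

avg_iter = sum(int(r["iterations"]) for r in agent) / len(agent)
defect_rate = sum(int(r["defects_30d"]) for r in agent) / len(agent)

print(f"agent PRs: {len(agent)}")                       # → 3
print(f"avg iterations to merge: {avg_iter:.1f}")       # → 1.7
print(f"defects per agent PR (30d): {defect_rate:.1f}") # → 0.7
```

Twenty rows of this is enough to spot, for example, that the high-iteration PRs cluster in one part of the codebase.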
What not to measure
Lines of code generated. It measures output volume, not output quality. A 500-line PR that requires a 400-line rewrite is worse than a 50-line PR that ships clean.
Number of PRs per day. Volume without quality is rework waiting to happen.
Agent uptime or usage rate. These are input metrics. You want output metrics.
Individual engineer acceptance rates. Acceptance rate varies by task complexity. Use it at team level, not individual level.
The feedback loop
Measurement only creates value if it closes a feedback loop:
Measure → Diagnose → Change process → Measure again
A concrete example: acceptance rate drops from 82% to 65% over three sprints. Diagnose: PRs from the new payments context have low acceptance rate. Root cause: the payments context AGENTS.md is missing authorisation rules. Fix: add authorisation rules to AGENTS.md. Measure: acceptance rate recovers to 80% over the next three sprints.
That feedback loop — measure, diagnose, fix upstream — is how AI-assisted engineering improves over time rather than plateauing at whatever quality level it started at.