AI Engineering · 6 min read

Measuring AI Engineering Properly

Lines of code and "percent faster" were always vanity metrics. Now they are actively misleading. If generation is abundant, measure the scarce step: acceptance, rework and stability.

Part of Agentic Engineering · The AI Engineering Maturity Model

Most AI engineering dashboards are measuring the wrong end of the problem. They count lines written, suggestions accepted, pull requests merged and the developers' own sense of how much faster they feel. All of that sits on the generation side of the work. And generation, as we have argued throughout this series, is precisely the part that AI has made cheap.

The Acceptance Gap is the simple observation that once code generation becomes abundant, the binding constraint moves to acceptance: the judgement that a change is correct, safe and worth shipping. Value migrates to the scarce step. It follows, almost mechanically, that your measurement should migrate with it. If you keep instrumenting typing speed while the real cost sits in review, rework and the production incidents that follow premature acceptance, your metrics will look healthy right up until delivery quietly degrades.

Why volume metrics stopped meaning anything

Activity counts were always weak proxies for value. The honest critique predates this moment: McKinsey's developer-productivity framework was widely rejected by engineering leaders because most of its proposed measures captured effort and output rather than outcome, and rewarding output creates perverse incentives. The recommended alternatives, DORA and SPACE, both refuse to collapse productivity into a single number.

AI does not rescue these metrics. It breaks them further. When the marginal cost of producing a line approaches zero, the volume of lines tells you nothing about whether anything valuable shipped, and the count becomes trivially gameable. A team can now generate ten thousand lines before lunch. The question that matters is how many of them survive contact with review and with production, and that is a different number entirely.

Ban "percent faster" from the dashboard

The most seductive vanity metric is perceived speed, and it is the one with the clearest evidence against it. METR ran a randomised controlled trial with sixteen experienced open-source developers across 246 real issues. The developers predicted AI would make them roughly 24% faster. Afterwards they believed they had been about 20% faster. They were, in fact, 19% slower with AI than without it. The gap between felt speed and measured speed was not a rounding error; it was the entire result reversed.

This is decisive for measurement design. Self-reported "percent faster" is not a delivery metric, it is a perception metric, and a demonstrably unreliable one. It belongs in an engagement survey, not on a delivery dashboard. Treat it as a signal about morale and adoption, never about output.

Throughput rising while stability falls is not a paradox to explain away. It is the Acceptance Gap widening, rendered in numbers.

The system-level evidence

The individual-versus-system divergence is now well documented. DORA's 2024 research found that a 25% increase in AI adoption was associated with an estimated 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability, alongside a 2.6% fall in time spent on valuable work. Yet 75% of respondents reported individual productivity gains; in 2025 that figure passed 80% and adoption reached roughly 90%. People feel more productive while the system delivers less reliably.

The acceptance-gap divergence

the gap
Perceived individual gains- - System throughput (DORA)
The acceptance-gap divergence.

DORA's framing in 2025 is the right one: AI is a mirror and a multiplier. It amplifies whatever surrounds it. Returns come from stable pipelines, clean architecture, strong review and genuine product discovery, not from the tool. If your system is healthy, AI compounds the health. If acceptance is weak, AI floods that weakness with more generated change than it can absorb.

Instrument the gap, not the typing

Here is the part the industry debate keeps missing. The argument is still framed as activity versus outcomes, or does-AI-speed-people-up. The more useful framing is the Acceptance Gap: measure the delta between what is generated and what is shipped and survives. Three metrics do this directly.

  • Acceptance rate, read as a quality signal rather than a usage brag. GitHub reports an overall Copilot suggestion-acceptance rate around 30%, varying sharply by language and falling as tasks get harder (roughly 89% on easy tasks, 43% on hard ones). The trap is treating "accepted into the editor" as "shipped and stuck". The number worth tracking is narrower: what fraction of generated change survives review and then survives fourteen days in production.
  • Rework and churn rate, the cost of premature acceptance. GitClear's analysis of 211 million changed lines found churn (lines revised within two weeks of commit) rising from about 3.3% in 2020 to 5.7% in 2024, copy-pasted code overtaking refactored code for the first time, and refactoring collapsing from a quarter of changes to under a tenth. Rework is the receipt for acceptance that happened too early.
  • DORA stability, the lagging signal that acceptance judgement failed. Change-failure rate and time-to-restore tell you, after the fact, where the gap closed badly. Pair them with throughput so neither axis is optimised alone.

Trust data underlines why all three matter together: DORA 2025 found only 24% of developers express substantial trust in AI output. People are already reviewing heavily. Your metrics should make that scarce review work visible and valued, not invisible behind a generation count.

What this means for the maturity model

Acceptance instrumentation is also a maturity marker. Counting suggestions accepted is an Assistance-stage habit. Wiring acceptance survival, rework and stability into the same view, and acting on the delta between generated and shipped, is what separates Automation from Agentic Engineering. You cannot direct agents safely if you cannot see, in near real time, how much of what they produce actually holds. Measurement is not a reporting afterthought here; it is the control system that makes delegation responsible.

The discipline is unglamorous and it is the whole job. Stop celebrating volume. Retire perceived speed to the survey it came from. Watch throughput and stability as a pair, and treat rising throughput with falling stability as the warning it is. Then instrument the scarce step directly. The work was never the typing. It is closing the gap between generated and shipped, and you can only close what you measure.

If you want to know where your organisation sits on this, our five-stage maturity model maps acceptance instrumentation against the move from assistance to agentic engineering, and pairs with the assessment to find the next concrete step.