AI Engineering · 6 min read · Updated 2026-06-18

Measuring AI Engineering Properly

Lines of code and "percent faster" were always vanity metrics. Now they are actively misleading. If generation is abundant, measure the scarce step: acceptance, rework and stability.

By Priyanka Pandey · Founder & Editorial Lead

Reviewed and challenged by Sanjeev Purohit · Principal, Decision Architecture

Built from

Independent research
Data-backed
Original framework
Reviewed with field experience

Last substantively reviewed · 2026-06-18

Part of Agentic Engineering · The AI Engineering Maturity Model

In brief

Once generation is abundant, volume metrics (lines of code, “percent faster”) are not just vanity but actively misleading; measure the scarce step — acceptance, rework and stability.

Volume metrics stopped meaning anything; ban “percent faster” from the dashboard.
Throughput rising while stability falls is the Acceptance Gap widening, rendered in numbers.
Instrument the gap — acceptance, rework, stability — not the typing.

Best for

Any team trying to measure AI’s real delivery impact

Most AI engineering dashboards are measuring the wrong end of the problem. They count lines written, suggestions accepted, pull requests merged and the developers' own sense of how much faster they feel. All of that sits on the generation side of the work. And generation, as we have argued throughout this series, is precisely the part that AI has made cheap.

The Acceptance Gap is the simple observation that once code generation becomes abundant, the binding constraint moves to acceptance: the judgement that a change is correct, safe and worth shipping. Value migrates to the scarce step. It follows, almost mechanically, that your measurement should migrate with it. If you keep instrumenting typing speed while the real cost sits in review, rework and the production incidents that follow premature acceptance, your metrics will look healthy right up until delivery quietly degrades.

Why volume metrics stopped meaning anything

Activity counts were always weak proxies for value. The honest critique predates this moment: McKinsey's developer-productivity framework was widely rejected by engineering leaders because most of its proposed measures captured effort and output rather than outcome, and rewarding output creates perverse incentives. The recommended alternatives, DORA and SPACE, both refuse to collapse productivity into a single number.

AI does not rescue these metrics. It breaks them further. When the marginal cost of producing a line approaches zero, the volume of lines tells you nothing about whether anything valuable shipped, and the count becomes trivially gameable. A team can now generate ten thousand lines before lunch. The question that matters is how many of them survive contact with review and with production, and that is a different number entirely.

Ban "percent faster" from the dashboard

The most seductive vanity metric is perceived speed, and it is the one with the clearest evidence against it. METR ran a randomised controlled trial with sixteen experienced open-source developers across 246 real issues. The developers predicted AI would make them roughly 24% faster. Afterwards they believed they had been about 20% faster. In fact, tasks took them 19% longer with AI than without it. The gap between felt speed and measured speed was not a rounding error; the direction of the result was reversed.

This is decisive for measurement design. Self-reported "percent faster" is not a delivery metric, it is a perception metric, and a demonstrably unreliable one. It belongs in an engagement survey, not on a delivery dashboard. Treat it as a signal about morale and adoption, never about output.
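To make the size of that reversal concrete, here is a minimal sketch that puts the survey figure and the measured figure on one scale. It assumes both can be expressed as a percentage change in task completion time (negative means faster); the class and function names are illustrative, not part of METR's method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeedReading:
    """Change in task completion time, in percent. Negative means faster."""
    perceived_time_change_pct: float  # what developers reported in a survey
    measured_time_change_pct: float   # what timed, comparable tasks showed


def perception_gap(reading: SpeedReading) -> float:
    """Percentage points by which felt speed overstates measured speed."""
    return reading.measured_time_change_pct - reading.perceived_time_change_pct


def same_direction(reading: SpeedReading) -> bool:
    """True when perception and measurement at least agree on the sign."""
    felt_faster = reading.perceived_time_change_pct < 0
    was_faster = reading.measured_time_change_pct < 0
    return felt_faster == was_faster


# Figures quoted above: felt about 20% faster, measured 19% slower.
# Assumption: "20% faster" is read as a 20% reduction in completion time.
metr = SpeedReading(perceived_time_change_pct=-20.0, measured_time_change_pct=19.0)
print(perception_gap(metr))  # 39.0
print(same_direction(metr))  # False
```

A gap of this size, with the sign disagreeing, is why the survey number cannot stand in for a delivery number.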

Throughput rising while stability falls is not a paradox to explain away. It is the Acceptance Gap widening, rendered in numbers.

The system-level evidence

Vanity metric	Measure instead	Because
Lines of code / commits	Acceptance rate	Generation is abundant; acceptance is the constraint
“Percent faster”	Rework rate	Felt speed and real speed diverge
Output volume	DORA flow & stability	Only what lands and stays is value

Measure what lands, not what was typed.

The individual-versus-system divergence is now well documented. DORA's 2024 research found that a 25% increase in AI adoption was associated with an estimated 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability, alongside a 2.6% fall in time spent on valuable work. Yet 75% of respondents reported individual productivity gains; in 2025 that figure passed 80% and adoption reached roughly 90%. People feel more productive while the system delivers less reliably.

[Figure: The acceptance-gap divergence. Solid line: perceived individual gains. Dashed line: system throughput (DORA).]

DORA's framing in 2025 is the right one: AI is a mirror and a multiplier. It amplifies whatever surrounds it. Returns come from stable pipelines, clean architecture, strong review and genuine product discovery, not from the tool. If your system is healthy, AI compounds the health. If acceptance is weak, AI floods that weakness with more generated change than it can absorb.

Instrument the gap, not the typing

Here is the part the industry debate keeps missing. The argument is still framed as activity versus outcomes, or does-AI-speed-people-up. The more useful framing is the Acceptance Gap: measure the delta between what is generated and what is shipped and survives. Three metrics do this directly.

Acceptance rate, read as a quality signal rather than a usage brag. GitHub reports an overall Copilot suggestion-acceptance rate around 30%, varying sharply by language and falling as tasks get harder (roughly 89% on easy tasks, 43% on hard ones). The trap is treating "accepted into the editor" as "shipped and stuck". The number worth tracking is narrower: what fraction of generated change survives review and then survives fourteen days in production.
Rework and churn rate, the cost of premature acceptance. GitClear's analysis of 211 million changed lines found churn (lines revised within two weeks of commit) rising from about 3.3% in 2020 to 5.7% in 2024, copy-pasted code overtaking refactored code for the first time, and refactoring collapsing from a quarter of changes to under a tenth. Rework is the receipt for acceptance that happened too early.
DORA stability, the lagging signal that acceptance judgement failed. Change-failure rate and time-to-restore tell you, after the fact, where the gap closed badly. Pair them with throughput so neither axis is optimised alone.
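The three metrics above can be sketched in a few lines. This is a hedged illustration, not a reference implementation: the `Change` record and its field names are hypothetical, rework is counted per change rather than per line (GitClear's churn figure is line-based), unmerged changes are treated as rejected, and time-to-restore is left out for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

SURVIVAL_WINDOW = timedelta(days=14)  # the fourteen-day window named in the text


@dataclass(frozen=True)
class Change:
    """One generated change. All field names are hypothetical."""
    generated_at: datetime
    merged_at: Optional[datetime] = None    # None: never passed review
    reworked_at: Optional[datetime] = None  # first revert or rewrite after merge
    caused_failure: bool = False            # linked to a production incident


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def _window_closed(change: Change, as_of: datetime) -> bool:
    return change.merged_at is not None and change.merged_at + SURVIVAL_WINDOW <= as_of


def survived(change: Change) -> bool:
    """Merged, and not reworked inside the survival window."""
    if change.merged_at is None:
        return False
    if change.reworked_at is None:
        return True
    return change.reworked_at - change.merged_at > SURVIVAL_WINDOW


def survival_rate(changes: Sequence[Change], as_of: datetime) -> float:
    """Share of generated changes that merged and then held for the window."""
    # Skip merged changes whose window is still open; their outcome is unknown.
    settled = [c for c in changes if c.merged_at is None or _window_closed(c, as_of)]
    return _ratio(sum(survived(c) for c in settled), len(settled))


def rework_rate(changes: Sequence[Change], as_of: datetime) -> float:
    """Share of merged changes that were reworked inside the window."""
    merged = [c for c in changes if _window_closed(c, as_of)]
    return _ratio(sum(not survived(c) for c in merged), len(merged))


def change_failure_rate(changes: Sequence[Change]) -> float:
    """Share of merged changes linked to a production failure."""
    merged = [c for c in changes if c.merged_at is not None]
    return _ratio(sum(c.caused_failure for c in merged), len(merged))


t0 = datetime(2026, 1, 1)
day = timedelta(days=1)
sample = [
    Change(t0),                                                 # rejected in review
    Change(t0, merged_at=t0 + day),                             # merged and held
    Change(t0, merged_at=t0 + day, reworked_at=t0 + 5 * day),   # reworked after 4 days
    Change(t0, merged_at=t0 + day, reworked_at=t0 + 30 * day,   # held for the window,
           caused_failure=True),                                # but caused an incident
]
as_of = t0 + 60 * day
print(survival_rate(sample, as_of))              # 0.5
print(round(rework_rate(sample, as_of), 2))      # 0.33
print(round(change_failure_rate(sample), 2))     # 0.33
```

Reading the three numbers side by side is the point: the last sample change survives the window yet still shows up in change-failure rate, which is why no single one of these metrics is enough on its own.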

Trust data underlines why all three matter together: DORA 2025 found only 24% of developers express substantial trust in AI output. People are already reviewing heavily. Your metrics should make that scarce review work visible and valued, not invisible behind a generation count.

What this means for the maturity model

Acceptance instrumentation is also a maturity marker. Counting suggestions accepted is an Assistance-stage habit. Wiring acceptance survival, rework and stability into the same view, and acting on the delta between generated and shipped, is what separates Automation from Agentic Engineering. You cannot direct agents safely if you cannot see, in near real time, how much of what they produce actually holds. Measurement is not a reporting afterthought here; it is the control system that makes delegation responsible.

The discipline is unglamorous and it is the whole job. Stop celebrating volume. Retire perceived speed to the survey it came from. Watch throughput and stability as a pair, and treat rising throughput with falling stability as the warning it is. Then instrument the scarce step directly. The work was never the typing. It is closing the gap between generated and shipped, and you can only close what you measure.

If you want to know where your organisation sits on this, our five-stage maturity model maps acceptance instrumentation against the move from assistance to agentic engineering, and pairs with the assessment to find the next concrete step.

Our perspective

The common view

AI productivity means more code, faster.

The Ivaaya view

Generation is abundant; measure the scarce step — acceptance rate, rework and stability — not volume or perceived speed.

“Developers say they are much faster.” Perceived speed diverges from measured outcomes (METR: ~19% slower), so instrument acceptance and stability instead.

If you’re doing this tomorrow

Track acceptance rate, rework rate and DORA stability — not lines of code or “percent faster”.
Treat throughput-up / stability-down as a red flag, not a win.
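One way to wire the red flag into a dashboard is a paired check across two reporting periods. The sketch below is an assumption-laden illustration: `Period`, its fields and the one-percentage-point `tolerance` are hypothetical choices, with deployment count standing in for throughput and change-failure rate standing in for stability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Period:
    """Delivery figures for one reporting period. Field names are hypothetical."""
    deployments: int         # throughput proxy
    failed_deployments: int  # deployments that needed a fix, rollback or patch

    @property
    def change_failure_rate(self) -> float:
        return self.failed_deployments / self.deployments if self.deployments else 0.0


def gap_widening(previous: Period, current: Period, tolerance: float = 0.01) -> bool:
    """True when throughput rose while change-failure rate rose by more than `tolerance`."""
    throughput_up = current.deployments > previous.deployments
    failure_delta = current.change_failure_rate - previous.change_failure_rate
    return throughput_up and failure_delta > tolerance


last_quarter = Period(deployments=100, failed_deployments=10)  # 10% failure rate
this_quarter = Period(deployments=140, failed_deployments=21)  # 15% failure rate
print(gap_widening(last_quarter, this_quarter))  # True
```

The check deliberately refuses to report on either axis alone: more deployments with a flat failure rate does not trip it, and neither does a worse failure rate with flat throughput, which would be a different problem.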

Where teams go wrong

Counting lines of code or accepted suggestions as value
“Percent faster” on the dashboard
Celebrating throughput while change-failure and rework rise

At a glance

What: Measuring delivery by acceptance, rework and stability.
Why: Volume is abundant and misleading; the scarce step is acceptance.
When: Any team measuring AI’s delivery impact.
When not: When you only need a vanity number for a slide — but then don’t call it measurement.

The evidence & related ideas

What we’ve observed

METR’s randomised trial found experienced developers ~19% slower with AI while feeling faster — “percent faster” is unreliable.
GitHub reports that ~30% of Copilot suggestions are accepted, while DORA and GitClear show instability and rework rising alongside AI adoption, so measure acceptance and stability (SPACE/DORA), not output.

How certain are we?

Volume metrics mislead under abundant generation — established: Observed repeatedly across delivery programmes.
Perceived speed diverges from measured delivery — established: Observed repeatedly across delivery programmes.

About the author

Priyanka Pandey

Founder & Editorial Lead

Priyanka Pandey founded Ivaaya and leads its editorial voice, translating real delivery experience into practical thinking on AI-native engineering, decision-making and technology leadership. Her work focuses on helping senior leaders make sense of the changes reshaping software delivery without adding to the noise.

Reviewed and challenged by

Sanjeev Purohit

Principal, Decision Architecture

Sanjeev works across enterprise architecture, product strategy and AI-native delivery. The ideas in this article have been challenged against real programmes, production systems and organisational decision-making before publication.

Compare notes

If you are still being shown “percent faster” and lines generated, you are measuring the wrong step. Tell us what you track today — we will compare notes on metrics that survive abundant generation.
