Product Labs · Research

Delivery Telemetry & Engineering Evidence

If engineering productivity cannot be reduced to commits or lines, the only honest measure is whether intent reaches production safely and creates value — so we are learning to instrument that, not activity.

How do you know whether an engineering organisation is actually getting more done? The intuitive answers — commits, pull requests, lines of code, AI suggestions accepted — are the ones most likely to mislead you. This matters acutely now, because the industry is adopting AI coding tools faster than it is learning to measure them, and the early evidence suggests our intuitions about the gains are not just imprecise but directionally wrong. This is a Product Labs research theme: a set of convictions about measuring delivery that we are testing in the open as we build, not a finished method we are selling. We are seasoned delivery, architecture and product engineers applying emerging AI to real systems — not AI researchers with a proven track record. The honest position is that the field is young, the data is thin, and we are learning alongside everyone else. What follows is what we explore, what we build, and what we currently believe.

The thing worth measuring is the journey from intent to production

Engineering does not produce code; it produces working change in production that someone values. Most conventional metrics measure the wrong end of that pipeline: the typing, not the arriving. The patterns we build are organised around instrumenting flow rather than activity: the path a unit of intent takes from a decision, through implementation and review, to a safe release that does what was asked. We capture where that journey stalls, where it loops back on itself, and where meaning is lost in translation between the person who wanted something and the system that eventually delivered it. Concretely, that means instrumenting delivery the way you would instrument messages in a distributed system (published, delivered, acknowledged), so that a change is treated as an event with a verifiable lifecycle rather than a checkbox someone ticked. This is the spine of our work on Delivery Telemetry: Instrumenting the Path from Intent to Production So You Can See Where It Stalls, and it is deliberately vendor-neutral: the goal is evidence that maps onto industry benchmarks, not a bespoke dashboard nobody else can interpret.
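To make the idea of a change as an event with a verifiable lifecycle concrete, here is a minimal sketch in Python. The names (Stage, ChangeLifecycle, the change id CHG-1) are hypothetical, and mapping the three stages onto "intent recorded", "released" and "requester confirmed" is our assumption for illustration, not a description of any particular tool. The sketch also assumes a strictly linear journey; modelling loops back to an earlier stage would need a richer state machine.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum


class Stage(Enum):
    # Hypothetical mapping of messaging stages onto a unit of intent.
    PUBLISHED = 1      # the intent was recorded
    DELIVERED = 2      # the change was released to production
    ACKNOWLEDGED = 3   # the requester confirmed it does what was asked


@dataclass
class ChangeLifecycle:
    change_id: str
    events: list = field(default_factory=list)  # (Stage, datetime) pairs, in order

    def record(self, stage: Stage, at: datetime) -> None:
        # Enforce stage order and monotonic timestamps so the history is checkable.
        if len(self.events) >= len(Stage):
            raise ValueError(f"{self.change_id}: lifecycle already complete")
        expected = Stage(len(self.events) + 1)
        if stage is not expected:
            raise ValueError(
                f"{self.change_id}: expected {expected.name}, got {stage.name}"
            )
        if self.events and at < self.events[-1][1]:
            raise ValueError(f"{self.change_id}: timestamps must not go backwards")
        self.events.append((stage, at))

    def is_complete(self) -> bool:
        return len(self.events) == len(Stage)

    def durations(self) -> dict:
        # Elapsed time between each pair of consecutive stages.
        return {
            f"{a.name}->{b.name}": tb - ta
            for (a, ta), (b, tb) in zip(self.events, self.events[1:])
        }


start = datetime(2025, 1, 6, 9, 0)  # illustrative timestamps only
change = ChangeLifecycle("CHG-1")
change.record(Stage.PUBLISHED, start)
change.record(Stage.DELIVERED, start + timedelta(days=2))
change.record(Stage.ACKNOWLEDGED, start + timedelta(days=3))
print(change.is_complete(), change.durations())
```

The design choice worth noting is that a change cannot be marked acknowledged unless the earlier stages were recorded first, which is what separates a lifecycle from a ticked checkbox.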

The patterns we are building

We describe these generically on purpose; the value is in the pattern, not in any one implementation. There are four we keep returning to. First, flow instrumentation: measuring elapsed time end-to-end and separating active work from waiting, because most of the latency in delivery is queues and handoffs, not coding. Second, correctness corpora: maintaining test and parity corpora as standing evidence that behaviour is preserved across change, a body of cases that a release must satisfy before anyone trusts it. Third, deterministic quality gates: checks that pass or fail the same way every time, so that the bar a change must clear does not drift with mood, deadline or model version. Fourth, capturing the soft signals that usually go unrecorded: rework, acceptance, decision latency, and translation loss (the gap between what was asked for and what was built). Together these let us reason about Measuring Product Delivery: Beyond Velocity and Story Points as a property of the whole value stream rather than the output of an individual. They also feed directly into the harder question we explore under The Acceptance Gap: not 'did the code ship', but 'did it ship the thing that was actually wanted, and was it accepted without a second loop'.
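The second and third patterns combine naturally, and a small sketch may help. The Python below runs a parity corpus through a reference implementation and a candidate, and passes only if every case matches. All names here (parity_gate, GateResult, the two slug functions and the three-case corpus) are hypothetical, invented for illustration; a real corpus would be drawn from production-like inputs.

```python
from typing import Callable, Iterable, NamedTuple


class GateResult(NamedTuple):
    passed: bool
    failures: list  # (case, expected, actual) for every mismatch


def parity_gate(
    corpus: Iterable[str],
    reference: Callable[[str], str],
    candidate: Callable[[str], str],
) -> GateResult:
    # Deterministic by construction: same corpus and same code give the same verdict.
    failures = []
    for case in corpus:
        expected = reference(case)
        actual = candidate(case)
        if expected != actual:
            failures.append((case, expected, actual))
    return GateResult(passed=not failures, failures=failures)


def legacy_slug(text: str) -> str:
    # Hypothetical existing behaviour that the release must preserve.
    return text.strip().lower().replace(" ", "-")


def rewritten_slug(text: str) -> str:
    # Hypothetical rewrite; it collapses repeated spaces, which changes behaviour.
    return "-".join(text.lower().split())


corpus = ["Hello World", "  padded  ", "two  spaces"]
result = parity_gate(corpus, legacy_slug, rewritten_slug)
print(result.passed)
for case, expected, actual in result.failures:
    print(repr(case), repr(expected), repr(actual))
```

Here the gate fails on the double-space case. Whether that difference is a regression or an improvement is a human decision, but the gate makes sure the decision is taken explicitly rather than discovered in production.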

What the industry evidence is now telling us

The case for measuring delivery rather than activity has become much stronger in the last two years, and it comes from sources that do not agree with each other about much else. The sharpest single result is METR's randomised controlled trial: experienced open-source developers were 19% slower at completing real issues when allowed to use early-2025 AI tools, even though they had forecast a 24% speedup beforehand and still believed afterwards that AI had sped them up by around 20%. The full paper makes the perception gap starker still — economics experts predicted a 39% speedup and machine-learning experts predicted 38%, against a measured 19% slowdown. The lesson is not that AI is useless; it is that forecasts, intuitions and self-reported velocity are unreliable and must be replaced with measured before-and-after deltas on completion time and rework. That is the entire premise of Measuring AI Engineering Properly.
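As a rough sketch of what a measured before-and-after delta looks like next to a belief, the Python below computes the relative change in median completion time and the gap between that and what people thought had happened. The function names and the sample hours are hypothetical and purely illustrative, not data from any study. The final line plugs in the METR figures quoted above, on the assumption that a "20% speedup" can be read as a 20% reduction in completion time.

```python
from statistics import median


def completion_time_change(baseline_hours, treatment_hours):
    """Relative change in median completion time; positive means slower."""
    base = median(baseline_hours)
    return (median(treatment_hours) - base) / base


def perception_gap(believed_change, measured_change):
    """How far the measurement sits from the belief, in the same units."""
    return measured_change - believed_change


# Made-up task durations in hours, for illustration only.
without_tool = [4.0, 6.0, 5.0]
with_tool = [5.0, 7.0, 6.0]
measured = completion_time_change(without_tool, with_tool)
print(f"measured change in completion time: {measured:+.0%}")

# Believed -20% (faster), measured +19% (slower), per the figures in the text.
gap = perception_gap(believed_change=-0.20, measured_change=0.19)
print(f"perception gap: {gap:.2f}")
```

Comparing medians across two arbitrary sets of tasks is not a controlled design. The comparison only means something when the tasks are comparable or randomly assigned, which is what made the trial above informative.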

The DORA programme points at where the damage actually lands. The 2024 Accelerate State of DevOps report found that a 25% increase in AI adoption was associated with an estimated 1.5% decrease in delivery throughput and a 7.2% decrease in delivery stability — and it introduced 'rework rate', defined as unplanned deployments to fix a user-facing issue, as a first-class stability signal that correlates strongly with change failure rate. The 2025 follow-up found AI adoption had reached roughly 90% of organisations and now correlates positively with throughput, but continues to correlate with elevated instability: more change failures, more rework, slower recovery. DORA frames AI as a mirror and an amplifier — it magnifies whatever delivery discipline already exists. Independent commit-stream analysis corroborates the mechanism: GitClear's study of 211 million changed lines of code found copy-pasted lines rising from 8.3% in 2020 to 12.3% in 2024, refactored ('moved') lines falling from 24.1% to 9.5%, and short-horizon code churn rising from 3.1% to 5.7% — with 2024 the first year duplicated-code introduction exceeded refactoring. Throughput hides all of this; rework and churn telemetry surface it.
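A minimal sketch of computing rework rate from deployment records follows, in Python. It reads the definition above as the share of all deployments that were unplanned and made to fix a user-facing issue; that reading, the Deployment record and its field names are our assumptions for illustration, and the four records are made-up data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    deploy_id: str
    planned: bool
    fixes_user_facing_issue: bool


def rework_rate(deployments) -> float:
    # Share of deployments that were unplanned fixes for a user-facing issue.
    deployments = list(deployments)
    if not deployments:
        return 0.0
    rework = sum(
        1 for d in deployments if not d.planned and d.fixes_user_facing_issue
    )
    return rework / len(deployments)


# Illustrative records only.
history = [
    Deployment("d1", planned=True, fixes_user_facing_issue=False),
    Deployment("d2", planned=False, fixes_user_facing_issue=True),   # rework
    Deployment("d3", planned=True, fixes_user_facing_issue=True),    # scheduled fix
    Deployment("d4", planned=False, fixes_user_facing_issue=False),  # e.g. urgent config change
]
print(rework_rate(history))
```

Tracked alongside deployment frequency, a number like this shows when higher throughput is partly made up of fixes for the previous release.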

None of this is new in principle. The SPACE framework argued years ago that developer productivity cannot be captured by any single metric and that activity counts alone are actively misleading, proposing five dimensions instead: satisfaction and well-being, performance, activity, communication and collaboration, and efficiency and flow. The Flow Framework's notion of flow efficiency (active work time as a fraction of total flow time, often as low as 20%) explains why measuring coding speed misses most of the available leverage. And recent empirical work cautions that isolated coding benchmarks do not represent real development outcomes, which is precisely why we invest in real-context parity corpora rather than synthetic scores. We did not invent these ideas. We are applying them to a moment where they suddenly matter much more.
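Flow efficiency is simple enough to state as code. The Python sketch below divides active work time by total elapsed flow time for a single item. The function name, the interval representation and the dates are hypothetical, and it assumes the active intervals do not overlap and fall inside the item's start and finish.

```python
from datetime import datetime, timedelta


def flow_efficiency(started, finished, active_intervals) -> float:
    """Active work time as a fraction of total flow time for one item."""
    total = finished - started
    if total <= timedelta(0):
        raise ValueError("finished must be after started")
    # Assumes intervals are non-overlapping and within [started, finished].
    active = sum((end - begin for begin, end in active_intervals), timedelta(0))
    return active / total  # timedelta / timedelta yields a float


# Illustrative item: ten days elapsed, two days of hands-on work in total.
started = datetime(2025, 3, 3, 9, 0)
finished = started + timedelta(days=10)
active_intervals = [
    (started, started + timedelta(days=1)),
    (started + timedelta(days=6), started + timedelta(days=7)),
]
print(flow_efficiency(started, finished, active_intervals))
```

In this made-up case the item spent eight of its ten days waiting, so halving the coding time would shorten delivery by one day, while halving the waiting would shorten it by four.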

Perceived productivity and measured productivity have come apart. The discipline of this decade is refusing to trust the feeling — and instrumenting whether intent actually reached production, safely, instead.

An honest note on the stage

This is research, and we want to be precise about what that means. We hold these patterns as convictions tested against real building, not as battle-tested intellectual property. We have seen flow instrumentation and deterministic gates change how a team behaves; we have not run the controlled study that would let us claim a number. Some of what we believe will not survive contact with more evidence. The DORA signals are correlational, the METR sample is sixteen developers on mature repositories, and our own work is early. We are comfortable saying that out loud, because the alternative — dressing exploration up as a proven method — is exactly the overconfidence the evidence above is warning about. What we can commit to is measuring our own claims the way we ask others to measure theirs: before-and-after, in real context, with rework and acceptance counted honestly.

Where this goes next

The through-line is simple even if the practice is hard: stop measuring how busy engineering looks and start measuring whether intent reaches production safely and creates value. Everything we build under this theme — flow telemetry, correctness corpora, deterministic gates, rework and acceptance capture — is in service of that one shift. It connects upward to how we think about Delivery Architecture: The Translation Layer, because evidence is only as good as the system that emits it, and a delivery pipeline that cannot describe its own behaviour cannot be trusted with AI on top of it. We will keep publishing what we find, including the parts that contradict where we started. If the central thesis holds, the organisations that win the AI era will not be the ones that adopted the tools fastest, but the ones that could see, in real time, whether the tools were helping — and had the discipline to act on the answer.