We have become exceptional at watching software run and almost wilfully blind to how it gets made. A modern production estate emits traces, metrics and logs from every request path; teams can tell you the p99 latency of an endpoint that fires four times an hour. Yet ask the same organisation where, between a stated business intent and a deployed outcome, that intent decayed into rework, and you will get anecdote, not data. The delivery process, the journey from intent to production, is the least-instrumented system most companies own.
This was a tolerable blind spot when writing code was the slow, expensive step, because the bottleneck was visible and self-announcing: engineers were busy, backlogs were long, everyone could feel where the time went. AI has removed that comfort. When generation becomes abundant, the constraint does not disappear; it relocates, and it relocates upstream into the parts of delivery nobody measures.
The instrument we point in the wrong direction
OpenTelemetry defines observability as the ability to understand the internal state of a system by examining its outputs, and is blunt about the precondition: to make a system observable it must be instrumented, meaning it must emit traces, metrics or logs. We accept this discipline without question for runtime systems. The argument of this piece is simply that the delivery pipeline is also a system, with internal state, queues and failure modes, and it deserves the same treatment. Treat intent-to-production as an observable system and instrument it to emit signals about its own health.
The signals worth capturing are not deploy counts. They are the three places intent leaks. Decision latency: how long a decision waits between being needed and being made. Translation loss: how much of the original intent survives the relay from business outcome to specification to acceptance criteria to code. Acceptance friction: the time and rework consumed in confirming a change actually does what was meant, not merely what was asked. These are first-class telemetry, not retrospective guesses.
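To make the three signals concrete, here is a minimal sketch of how they might be derived for a single work item. The record shape, the field names and the proxy for translation loss (the fraction of original intent statements that no longer map to shipped behaviour) are all assumptions for illustration, not an established schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class WorkItemSignals:
    """Hypothetical per-item record; every field name here is an assumption."""

    decision_needed_at: datetime
    decision_made_at: datetime
    intent_statements: int           # outcome statements recorded with the original intent
    statements_traced_to_code: int   # of those, how many still map to shipped behaviour
    acceptance_hours: float          # time spent confirming the change does what was meant
    acceptance_rework_hours: float   # rework triggered during that confirmation

    @property
    def decision_latency(self) -> timedelta:
        # How long the decision waited between being needed and being made.
        return self.decision_made_at - self.decision_needed_at

    @property
    def translation_loss(self) -> float:
        # Assumed proxy: share of intent statements that did not survive the relay.
        if self.intent_statements == 0:
            return 0.0
        return 1.0 - self.statements_traced_to_code / self.intent_statements

    @property
    def acceptance_friction_hours(self) -> float:
        # Confirmation time plus the rework it provoked.
        return self.acceptance_hours + self.acceptance_rework_hours


# Illustrative values only.
item = WorkItemSignals(
    decision_needed_at=datetime(2025, 3, 3, 9, 0),
    decision_made_at=datetime(2025, 3, 6, 15, 0),
    intent_statements=5,
    statements_traced_to_code=4,
    acceptance_hours=6.0,
    acceptance_rework_hours=10.0,
)
print(item.decision_latency, round(item.translation_loss, 2), item.acceptance_friction_hours)
```

The point of the sketch is that each signal reduces to fields someone has to record at the moment of handover; none of them can be reconstructed reliably after the fact.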
The gaps between steps are where intent stalls
DORA's value stream management guidance frames the path as work flowing from idea to production and instructs teams to measure the time spent waiting between steps, because the wait and queue time between process stages is where hidden, non-value-adding inefficiency lives. This is the crucial inversion: the loss is not in the active work but in the gaps between it. Value stream research is sobering on the scale of this. Measured flow efficiency, the share of elapsed time in which work is actively progressing, ranges in most software delivery environments from the single digits to roughly fifteen per cent, meaning the overwhelming majority of elapsed time from request to production is queue and wait time. We have been optimising the fifteen per cent, at best, and leaving the remaining eighty-five or more unobserved.
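Flow efficiency is simple arithmetic once active and waiting time are recorded per stage: active time divided by total elapsed time. The sketch below assumes a hypothetical four-stage value stream with made-up hours; the stage names and numbers are illustrative, not drawn from any of the studies cited.

```python
# Each tuple is (stage name, active hours, waiting hours before the next stage).
# All values are hypothetical.
stages = [
    ("specify", 8.0, 72.0),
    ("build", 16.0, 40.0),
    ("review", 4.0, 60.0),
    ("release", 2.0, 38.0),
]


def flow_efficiency(stages):
    """Return active time as a fraction of total elapsed time."""
    active = sum(active_hours for _, active_hours, _ in stages)
    waiting = sum(wait_hours for _, _, wait_hours in stages)
    total = active + waiting
    if total <= 0:
        raise ValueError("total elapsed time must be positive")
    return active / total


def longest_wait(stages):
    """Return the name of the stage followed by the longest wait."""
    return max(stages, key=lambda stage: stage[2])[0]


print(f"flow efficiency: {flow_efficiency(stages):.1%}")
print(f"longest wait follows: {longest_wait(stages)}")
```

With these figures, 30 active hours sit inside 240 elapsed hours, a flow efficiency of 12.5 per cent, and the longest queue follows specification rather than build.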
Translation loss is the larger and more disguised cost. Requirements research has long attributed the majority of downstream waste to upstream intent decay: roughly half of defects and around eighty per cent of rework effort trace back to poor requirements, and defects found in the field cost fifty to two hundred times more than if caught early. That is translation loss expressed as money. It is also, almost everywhere, completely un-instrumented; nobody emits a signal when an acceptance criterion silently diverges from the outcome it was meant to encode.
AI moves the bottleneck to exactly where we cannot see
The empirical case that the constraint has moved is now strong. Faros AI analysed telemetry from over ten thousand developers across 1,255 teams, drawing up to two years of data from task management, IDEs, CI/CD, version control and incident systems. High-adoption teams merged ninety-eight per cent more pull requests and completed twenty-one per cent more tasks, yet review time rose ninety-one per cent and average PR size grew one hundred and fifty-four per cent. The work did not vanish; it piled up at review and integration. More tellingly, Faros found no significant correlation between AI adoption and improvement at the company level across throughput, DORA metrics and quality. Individual speed-ups were neutralised by organisational bottlenecks nobody was measuring, a textbook expression of Amdahl's Law: the overall speed-up of a system is capped by the share of the work that was not accelerated.
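Amdahl's Law can be made concrete with a few lines of arithmetic. If a fraction p of total lead time is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s). The sketch below assumes, purely for illustration, that coding accounts for a quarter of lead time; that figure is a hypothetical input, not a finding from the Faros data.

```python
def amdahl_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Overall speed-up when only part of a process gets faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


# Hypothetical: coding is 25% of lead time and AI makes it twice as fast.
doubled = amdahl_speedup(0.25, 2.0)

# Even if coding became effectively free, the other 75% still bounds the gain.
near_free = amdahl_speedup(0.25, 1e9)

print(f"coding 2x faster -> delivery {doubled:.2f}x faster")
print(f"coding ~free     -> delivery {near_free:.2f}x faster")
```

Under that assumption, doubling coding speed improves delivery by about fourteen per cent, and no amount of further acceleration can push the gain past roughly a third.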
Agoda reached the same conclusion from the other direction, reported via InfoQ: AI assistants had not sped up delivery because coding was never the bottleneck. The constraint had migrated to specification, defining what to build, and verification, confirming it meets intent, the two stages that demand human judgement. Agoda's Leonardo Stern describes human authority as migrating upward in the abstraction stack, from writing code to defining and governing intent. The team invokes Fred Brooks' observation that improving speed in only one part of the lifecycle yields diminishing returns overall, which is precisely the thesis: accelerate the build alone and you simply relocate the constraint to wherever you are not looking.
You cannot feel where delivery time goes. METR's randomised trial put experienced developers on mature repositories and found AI tools made them nineteen per cent slower, while they had predicted a twenty-four per cent speed-up and afterwards still believed they were twenty per cent faster. A perception-reality gap of nearly forty points, from a believed twenty per cent gain to a measured nineteen per cent loss, is not a rounding error; it is strong evidence that intuition is an unreliable instrument and that only measurement reveals the truth.
Designing telemetry that resists its own gaming
Instrumenting delivery invites an obvious failure: the moment a proxy becomes a target, it is gamed. Goodhart's Law is alive in engineering, where deployment frequency gets inflated with inconsequential changes the instant it appears on a dashboard. The discipline borrowed from the metrics literature is never to use a single measure in isolation: pair the DORA signals, which show what is happening, with SPACE or DevEx context, which explains why. DORA's own 2024-2025 research reinforces this. It adds rework rate as a stability signal, shows that throughput and stability gains depend on organisational factors such as psychological safety rather than tooling, and finds that the throughput-stability relationship strains when teams optimise for the metric itself. Delivery telemetry must therefore measure decision quality and acceptance, not just deploy counts, or it will decay into theatre.
There is also now a substrate for instrumenting authorship itself. SLSA v1.1, stable since 2024, defines verifiable provenance attestations, cryptographically signed in-toto statements recording author, committer and build environment, and GitHub shipped artifact attestation in June 2024. As authorship spans human, agent and model, these attestations become the trace spans of the delivery pipeline, letting each step be attributed and audited rather than assumed. Provenance is what turns delivery observability from a chart into evidence.
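For readers who have not seen one, the sketch below assembles an unsigned provenance statement as a plain dictionary. The field names follow my reading of the in-toto Statement v1 and SLSA provenance v1 layouts and should be checked against the specifications before use. The artifact, builder identifier, build type and the authored_by parameter are hypothetical, and a real attestation would additionally be wrapped in a signed envelope, which this sketch omits.

```python
import hashlib
import json


def provenance_statement(artifact_name, artifact_bytes, builder_id, build_type, external_parameters):
    """Build an unsigned, illustrative provenance statement for one artifact."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return {
        "_type": "https://in-toto.io/Statement/v1",
        "subject": [{"name": artifact_name, "digest": {"sha256": digest}}],
        "predicateType": "https://slsa.dev/provenance/v1",
        "predicate": {
            "buildDefinition": {
                "buildType": build_type,
                "externalParameters": external_parameters,
            },
            "runDetails": {"builder": {"id": builder_id}},
        },
    }


# Hypothetical values throughout; example.com identifiers are placeholders.
statement = provenance_statement(
    artifact_name="service.tar.gz",
    artifact_bytes=b"example artifact bytes",
    builder_id="https://example.com/builders/ci",
    build_type="https://example.com/build-types/container/v1",
    external_parameters={
        "repository": "https://example.com/org/service",
        "authored_by": "agent",  # assumed custom field, not part of any spec
    },
)
print(json.dumps(statement, indent=2))
```

Read as telemetry, the subject digest says which artifact the record is about, and the predicate says which builder produced it and from what inputs; that pairing is what lets a step be attributed rather than assumed.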
The instruction
Pick one product line. Map the path from a recorded business intent to its deployed outcome and place instrumentation at every handover: timestamp when a decision is needed versus made, capture how many times acceptance criteria are rewritten, record rework attributable to misread intent. Within a quarter you will have something you have never had, a picture of where intent decays, and you will almost certainly find it is not in the code. Once you can see it, you can manage it. That is the difference between a delivery organisation that feels fast and one that is measurably so.
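A first pass at this instrumentation needs nothing more than timestamped events at each handover. The sketch below assumes a hypothetical event log of (work item, handover, needed at, made at) tuples and a separate count of acceptance-criteria rewrites; the handover names, identifiers and values are invented for illustration.

```python
from collections import Counter, defaultdict
from datetime import datetime
from statistics import median

# (work item id, handover, decision needed at, decision made at); hypothetical data.
events = [
    ("A-1", "intent->spec", datetime(2025, 1, 6, 9), datetime(2025, 1, 9, 9)),
    ("A-2", "intent->spec", datetime(2025, 1, 7, 9), datetime(2025, 1, 8, 9)),
    ("A-1", "spec->build", datetime(2025, 1, 9, 9), datetime(2025, 1, 9, 15)),
    ("A-2", "spec->build", datetime(2025, 1, 8, 9), datetime(2025, 1, 8, 11)),
    ("A-1", "build->acceptance", datetime(2025, 1, 13, 9), datetime(2025, 1, 15, 9)),
    ("A-2", "build->acceptance", datetime(2025, 1, 10, 9), datetime(2025, 1, 14, 9)),
]

# One entry per rewrite of a work item's acceptance criteria; hypothetical data.
criteria_rewrites = Counter(["A-1", "A-2", "A-2", "A-2"])


def median_wait_hours_by_handover(events):
    """Median hours a decision waited at each handover."""
    waits = defaultdict(list)
    for _item, handover, needed_at, made_at in events:
        waits[handover].append((made_at - needed_at).total_seconds() / 3600)
    return {handover: median(hours) for handover, hours in waits.items()}


def slowest_handover(events):
    """Handover with the highest median wait."""
    by_handover = median_wait_hours_by_handover(events)
    return max(by_handover, key=by_handover.get)


print(median_wait_hours_by_handover(events))
print("slowest handover:", slowest_handover(events))
print("most rewritten criteria:", criteria_rewrites.most_common(1)[0][0])
```

In this invented data the handover into build is the fastest by a wide margin, and the longest waits sit at specification and acceptance, which is the pattern the argument above predicts.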
If this reframes how you think about what to measure, the natural next step is to read how we define the metrics themselves in "Measuring Product Delivery: Beyond Velocity and Story Points", why the upstream decision layer is so often missing in "The Missing Architecture Layer Between Strategy and Delivery", and why acceptance, not generation, becomes the binding constraint in "The Acceptance Gap".