
Transformation Leadership · 10 min read · Updated 2026-06-20

Why AI Pilots Stall Before ROI

Most AI pilots work — and still return nothing — because they are built to prove the technology, then handed across a last mile no one owns into an operating model that never changed. Adoption is not value. We call the chasm between a working pilot and a changed operating model the Production Gap.

By Priyanka Pandey · Founder & Editorial Lead

Reviewed and challenged by Sanjeev Purohit · Principal, Decision Architecture

Built from

  • Field experience
  • Independent research
  • Original framework
  • Reviewed with field experience

Last substantively reviewed · 2026-06-20

In brief

AI pilots stall before ROI because they are scoped to prove the technology, then handed across a last mile no one owns into an operating model that never changed — the gap is organisational, not technical.

  • The pilot usually succeeds; the organisation fails to do anything with it. Adoption and usage are not value.
  • The funnel is steep: ~60% evaluate, ~20% pilot, ~5% reach production (MIT NANDA; the figures are bounded to enterprise-grade tools, and the viral "95% zero return" headline is contested and not used here).
  • The Production Gap: there are two gaps — idea→demo (cheap, AI made it trivial) and demo→value (hard, organisational). Almost everyone stalls at the second.
  • Root cause is a scoping error: pilots run as a Proof of Technology when ROI needs a Proof of Production (changed workflow, owned last mile, moved P&L line).
  • Workflow / operating-model redesign is the single biggest driver of bottom-line impact (McKinsey); the model is a small fraction of a production system (Sculley); AI amplifies existing discipline or dysfunction (DORA).
  • Not all stalling is failure — distinguish a healthy kill (never worth scaling), a slow-burning bet (J-curve lag), and a genuine absorption failure. The Production Gap is about the third.

Walk into almost any large organisation in 2026 and you will find AI pilots — dozens of them. Walk in a year later and you will find most of them exactly where you left them: working, demoed, admired, and contributing nothing to the P&L. The uncomfortable truth is that the pilots usually succeed. It is the organisations that fail to do anything with them. The technology crosses the line; the value does not.

Two gaps, not one. AI made the first trivial; the second — into a changed operating model — is where pilots stall.

Adoption is not value

The funnel is steep. In one widely cited 2025 study of enterprise-grade GenAI tools, roughly 60% of organisations evaluated them, about 20% ran a pilot, and only around 5% reached production. That is a sharp drop-off, though one scoped to bought, task-specific tools rather than to all AI use. (We will set aside the more lurid headline from the same report: the “95% get zero return” claim has been contested enough that we will not lean on it. The direction, though, is not seriously in dispute.) The deeper problem is what gets counted as success. Seat licences, usage dashboards and “people are using it” are adoption metrics. They are not value. Deloitte’s 2026 enterprise survey found only about a fifth of organisations already growing revenue from AI, against the roughly three-quarters who merely expect to.
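To make the drop-off concrete, the quoted funnel shares can be turned into stage-to-stage conversion rates. This is a rough sketch only: it assumes the three stages are nested (every organisation that piloted also evaluated, and so on) and it works from the rounded figures above, so the results are indicative rather than reported numbers. The function name `stage_conversions` is our own illustration.

```python
# Approximate share of organisations reaching each stage, as quoted above.
# Assumption: stages are nested, so dividing adjacent shares gives a
# rough conversion rate between them.
funnel = [("evaluate", 0.60), ("pilot", 0.20), ("production", 0.05)]


def stage_conversions(stages):
    """Return (from_stage, to_stage, conversion_rate) for each adjacent pair."""
    return [
        (from_name, to_name, to_share / from_share)
        for (from_name, from_share), (to_name, to_share) in zip(stages, stages[1:])
    ]


for from_name, to_name, rate in stage_conversions(funnel):
    print(f"{from_name} -> {to_name}: {rate:.0%}")
# evaluate -> pilot: 33%
# pilot -> production: 25%
```

On these rounded figures, about one evaluator in three goes on to pilot, and about one pilot in four reaches production.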

The last mile no one owns

Ask why a specific pilot stalled and the answer is rarely “the model wasn’t good enough”. It is that the pilot was scoped to prove the technology worked, and then handed across a gap that no one’s job description covered — from the team that built it to the team that would have to live with it, inside a workflow that was never redesigned to use it. The engineering literature has said this for a decade: the model is a small fraction of a real production system, and the hardest part is the transition into operations, which most teams treat informally. McKinsey’s 2025 work puts a number on the other side of it — of the many things organisations could do, end-to-end workflow redesign is the single biggest driver of bottom-line impact from AI. The pilot is the easy part. The redesign is the hard part, and it is the part everyone skips.

It is organisational, not technical

This reframes the whole problem. RAND’s analysis of why AI projects fail puts the leading causes upstream and organisational — the wrong problem, miscommunicated intent, missing data and infrastructure — well before model capability. Google’s DORA research describes AI as an amplifier: it magnifies whatever discipline, or dysfunction, an organisation already has. Point a powerful amplifier at a broken operating model and you get a louder broken operating model, faster. None of this is a reason to slow down on the technology. It is a reason to stop pretending the technology is the hard part.

The Production Gap

We call the chasm between a pilot that works and an operating model that delivers value the Production Gap. There are really two gaps, not one. The first — from an idea to a working demo — AI has made almost trivial to cross; that is why pilots are everywhere. The second — from a working demo to changed, value-producing production — is exactly as hard as it has always been, because it is organisational, and it is where almost everyone stalls.

The Production Gap: the first gap is cheap; the second — Proof of Production — is the organisational one nobody scopes.

Underneath it is a scoping error. Most pilots are run as a Proof of Technology — can the model do the thing? — when the question that actually decides ROI is a Proof of Production: will this change how the work is done, who owns it, and what it earns? A Proof of Technology that passes tells you almost nothing about whether the value will ever arrive, because it never tested the part that was always going to be hard.

Scope the pilot as a Proof of Production, not a Proof of Technology — or it cannot tell you anything about ROI.

Not all stalling is failure

An honest caution, before anyone over-corrects into a kill-every-pilot panic. Some pilots stall because they should — they were never worth scaling, and stopping them is good portfolio discipline, not failure. (The companion to this argument is that, when building is cheap, deciding what not to build becomes the scarce skill.) Others lag rather than fail: the returns from a genuine operating-model change arrive on a curve, not in a quarter. The discipline is to tell the three apart — a healthy kill, a slow-burning bet, and a genuine absorption failure — and to be honest about which one you are looking at. The Production Gap is about the third: the worthwhile pilot that dies in the handoff.

The most dangerous sentence in an AI programme is “the pilot was a success.” A pilot that proved the model but changed no one’s working day has proved nothing that pays. The teams actually capturing value treat the pilot as a dress rehearsal for a changed operating model, not a technology demo — and they name an owner for the last mile before they start, not after it has already stalled.
Sanjeev Purohit, from our delivery work

Closing the gap

For the people accountable for the return — CIOs, CTOs, boards, transformation sponsors — the move is not another pilot. It is to scope the operating-model change into the pilot from the start: name the workflow it will redesign, the owner of the last mile, and the P&L line it is meant to move, and then measure the pilot against that path to production rather than against a usage dashboard. Treat senior ownership as non-negotiable — value tracks the leaders who own the change, not the technical teams left to push it uphill. The organisations that win the AI era will not be the ones that ran the most pilots. They will be the ones that crossed the second gap.
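One way to make that scoping discipline hard to skip is to write it down as a checklist that fails when an item is blank. The sketch below is illustrative, not a prescribed tool: the class `PilotScope`, its field names, and the example values are all hypothetical, chosen to mirror the three items named above (workflow, last-mile owner, P&L line).

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class PilotScope:
    """Hypothetical scoping record for a pilot; field names are illustrative."""
    workflow_to_redesign: Optional[str] = None
    last_mile_owner: Optional[str] = None
    target_pnl_line: Optional[str] = None


def missing_scope_items(scope: PilotScope) -> list[str]:
    """Return the names of scoping items that are still blank."""
    return [
        f.name
        for f in fields(scope)
        if not (getattr(scope, f.name) or "").strip()
    ]


def is_proof_of_production(scope: PilotScope) -> bool:
    """True only when every scoping item has been named."""
    return not missing_scope_items(scope)


# Example with made-up values: the owner of the last mile has not been named.
scope = PilotScope(
    workflow_to_redesign="invoice exception handling",
    target_pnl_line="cost to serve",
)
print(missing_scope_items(scope))     # ['last_mile_owner']
print(is_proof_of_production(scope))  # False
```

The check is deliberately crude. It cannot judge whether the named owner has real authority or whether the P&L line is the right one; it only shows at scoping time that one of the three was never named.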

Frequently asked

Why do most AI pilots fail to deliver ROI?
Usually not because the model is inadequate, but because the pilot was scoped to prove the technology and then handed across a last mile no one owns, into a workflow that was never redesigned. The blocker is organisational — operating-model change — not model capability.
Isn’t a successful pilot proof of value?
No. A working pilot is a Proof of Technology; usage and adoption are not value. ROI depends on a Proof of Production — a changed workflow, an owned last mile, and a moved P&L line — which a tech demo never tests.
What separates the teams that actually scale AI?
End-to-end workflow / operating-model redesign (McKinsey finds it the single biggest driver of bottom-line impact), senior leadership owning the change rather than delegating it, and narrow high-value use cases measured against P&L.
How should we measure an AI pilot?
Against a path to production and a specific P&L outcome, not a usage dashboard. Name the workflow it redesigns, the owner of the last mile, and the line it is meant to move — at scoping, not after.
Is the high failure rate unique to AI?
Read it against an already-high base rate for IT and data projects — some stalling is normal portfolio attrition, and some is healthy pruning of pilots that were never worth scaling. The concern here is the worthwhile pilot that dies in the handoff.

Our perspective

The common view

AI ROI is disappointing because the technology, tools or talent are not yet good enough, so the fix is more and better pilots and models.

The Ivaaya view

The pilot usually works; ROI fails because the pilot was scoped to prove the technology, not to change how value is delivered — so it dies in an unowned last mile and an unchanged operating model. ROI failure is a value-selection and workflow-redesign failure baked in at scoping. Close the Production Gap by scoping the operating-model change into the pilot and measuring path-to-production and P&L, not usage.

A high pilot failure rate just means the technology or use cases are immature.
Independent root-cause work (RAND) and the production-engineering literature (Sculley/Lavin) put the failure upstream and organisational, well before model capability; McKinsey isolates workflow redesign — not the model — as the biggest value driver. The model is rarely the binding constraint.
Isn’t all this just the known >80%/95% AI-failure statistics restated?
No. Those viral figures (MIT "95% zero return", ">80% fail / twice non-AI IT") are contested and we deliberately do not use them. We rely on the bounded funnel, McKinsey/Deloitte value data, and RAND/academic root causes, read against a high IT base rate.
So should we stop running pilots, or kill them faster?
Not indiscriminately. Distinguish a healthy kill (never worth scaling), a slow-burning bet (J-curve lag), and a genuine absorption failure. The Production Gap is about the last — the worthwhile pilot that dies in the handoff — not a licence to cancel everything.
  • Scope the operating-model change into the pilot from the start (workflow, last-mile owner, target P&L line).
  • Measure pilots on path-to-production and P&L outcomes, not adoption/usage dashboards.
  • Treat senior leadership ownership of the change as non-negotiable.
  • Run pilots as Proofs of Production, not Proofs of Technology.

What we’ve observed

  • Enterprise-grade GenAI tool funnel ~60% evaluate / ~20% pilot / ~5% production (MIT NANDA 2025; bounded to bought task-specific tools).
  • Deloitte 2026: only ~20% of organisations already growing revenue from AI vs ~74% who merely expect to; value tracks senior ownership not delegation.
  • McKinsey 2025: end-to-end workflow redesign is the single biggest driver of AI EBIT impact; ~6% are "high performers".
  • RAND: leading causes of AI project failure are organisational/upstream (wrong problem, miscommunicated intent, data, infrastructure), not model capability.
  • Sculley et al. (NeurIPS 2015) / Lavin et al.: the model is a small fraction of a production ML system; the production transition is the hard, informally handled part.
  • Gartner: >40% of agentic AI projects forecast cancelled by end-2027; DORA 2025: AI is an amplifier of existing organisational strengths/weaknesses.
  • A pilot that demoed beautifully then died in the handoff to a team that owned neither the model nor the workflow.
  • A board shown usage dashboards and seat counts, mistaking adoption for value.

How certain are we?

  • Workflow/operating-model redesign is the biggest driver of AI value capture. Observed: seen consistently in our own work.
  • AI project failure is predominantly organisational/upstream, not model capability. Observed: seen consistently in our own work.
  • Pilots stall at the demo→production "second gap" because it is organisational. Emerging: still early, but increasingly visible.
  • Scoping pilots as Proof of Production rather than Proof of Technology improves ROI odds. Emerging: still early, but increasingly visible.


About the author

Priyanka Pandey

Founder & Editorial Lead

Priyanka Pandey founded Ivaaya and leads its editorial voice, translating real delivery experience into practical thinking on AI-native engineering, decision-making and technology leadership. Her work focuses on helping senior leaders make sense of the changes reshaping software delivery without adding to the noise.

Reviewed and challenged by

Sanjeev Purohit

Principal, Decision Architecture

Sanjeev works across enterprise architecture, product strategy and AI-native delivery. The ideas in this article have been challenged against real programmes, production systems and organisational decision-making before publication.

Compare notes

If your AI pilots keep working but still do not pay, the gap is probably not the model; it is the operating model around it. Tell us where one is stuck between demo and value; we are comparing notes with teams crossing the second gap.
