MVPs, Pilots and Production Systems: Knowing the Difference

There is a sentence we hear in nearly every delivery review, usually said with a mixture of pride and quiet panic: "The pilot worked, so we're rolling it out." It is the moment a programme tends to go wrong. Not because the pilot was bad, but because "the pilot worked" answers a different question from the one production is about to ask. The team proved one thing and is now being held to a standard for proving something else entirely.

Most discussions of prototype, MVP, pilot and production treat them as sizes of the same thing: a small version, a medium version, a big version. That framing is the root of an enormous amount of wasted engineering. These are not sizes. They are different questions, each with a different definition of "done" and, crucially, a different acceptance bar. Confusing them is not a naming problem. It is an Intent Translation failure.

Four stages, four questions

The taxonomy itself is well established and worth stating plainly. A prototype or proof-of-concept answers "can we build it?" An MVP answers "will anyone adopt or pay for it?" A pilot answers "does this work operationally, in a real but controlled setting?" Production answers "can we run and scale this reliably, for everyone, indefinitely?" (UXPin; Dax Group). Eric Ries was careful about the middle term: an MVP is "that version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort" — it may be a video, a paper mock-up, or a manually-delivered service. It is a learning mechanism, not a small production system (Lean Startup).

Notice what changes between stages. It is not the line count or the headcount. It is who has to accept the outcome. That single shift is the whole game.

The 'who has to accept this?' test

For any piece of work, ask one question: who, specifically, has to accept this output for it to count as done? At prototype stage, the answer is an engineer or a sponsor — "yes, this is technically possible." At MVP stage, it is a real user, voting with adoption or a payment. At pilot stage, it is an operator — someone who has to fit the thing into a live workflow, train staff, and integrate it with the systems they already run. At production stage, the acceptors multiply and harden: every user, a 24/7 on-call rota, a security reviewer, an auditor, a finance owner carrying the maintenance line.

Each new acceptor adds a non-negotiable requirement. An auditor needs traceability. An on-call engineer needs observability, rollback and a runbook. A security reviewer needs threat modelling and access control. None of these are improvements to the MVP; they are the entry fee for a different room. This is why production readiness is a multi-dimensional, gated discipline spanning technical, governance and operational readiness — and why it should be scoped before a pilot begins, not retrofitted after launch (Cortex, 2024 State of Software Production Readiness; SoftwareSeni).
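One way to make the test concrete is to write each stage's acceptance bar down as data. The sketch below is illustrative only: the `Stage`, `AcceptanceBar` and `missing_requirements` names are hypothetical, and the requirement labels simply paraphrase the acceptors and demands described above rather than any standard checklist.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    PROTOTYPE = "prototype"
    MVP = "mvp"
    PILOT = "pilot"
    PRODUCTION = "production"


@dataclass(frozen=True)
class AcceptanceBar:
    question: str                  # the question this stage answers
    acceptors: tuple[str, ...]     # who must accept the outcome
    requirements: tuple[str, ...]  # what they demand (illustrative labels)


# Assumed encoding of the four stages; labels paraphrase the article.
ACCEPTANCE_BARS: dict[Stage, AcceptanceBar] = {
    Stage.PROTOTYPE: AcceptanceBar(
        question="Can we build it?",
        acceptors=("engineer", "sponsor"),
        requirements=("technical feasibility",),
    ),
    Stage.MVP: AcceptanceBar(
        question="Will anyone adopt or pay for it?",
        acceptors=("real user",),
        requirements=("adoption or payment",),
    ),
    Stage.PILOT: AcceptanceBar(
        question="Does it work operationally in a controlled setting?",
        acceptors=("operator",),
        requirements=("workflow fit", "staff training", "system integration"),
    ),
    Stage.PRODUCTION: AcceptanceBar(
        question="Can we run and scale it reliably, for everyone?",
        acceptors=("every user", "on-call rota", "security reviewer",
                   "auditor", "finance owner"),
        requirements=("traceability", "observability", "rollback", "runbook",
                      "threat modelling", "access control"),
    ),
}


def missing_requirements(stage: Stage, evidence: set[str]) -> list[str]:
    """Return the requirements of `stage` not yet covered by `evidence`."""
    bar = ACCEPTANCE_BARS[stage]
    return [req for req in bar.requirements if req not in evidence]


# A pilot that already has observability and rollback still owes the
# production acceptors everything else on their list.
gaps = missing_requirements(Stage.PRODUCTION, {"observability", "rollback"})
```

The point of the structure is that the production entry is not a longer version of the MVP entry; it names different people with different demands.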

Over- and under-engineering are the same mistake

Here is the contrarian point. Building production-grade observability, DR and a hardened security posture for an unvalidated hypothesis is over-engineering. Shipping pilot-grade code to your entire user base is under-engineering. These look like opposite sins. They are the identical error: a mismatch between the acceptance bar and the stage. In both cases, nobody translated business intent into a stage-appropriate definition of success, so the team built to the wrong bar.

This is our core thesis in miniature. Organisations rarely fail because they cannot build software; they fail because they cannot consistently turn intent into outcomes. The scarce capability is the translation layer that states, for this stage, what outcome we are accepting and at what cost. "Confusing the stages" is just that translation layer being absent.

A prototype proves the idea can exist. An MVP proves someone wants it. A pilot proves it survives contact with one team's reality. Production proves it survives contact with everyone's. They are not sizes of the same thing — they are four different acceptance bars, and most 'pilot purgatory' is just an undeclared, un-owned one.

The pilot-to-production gap is an acceptance gap

The headline statistic of 2025 — MIT NANDA's finding that roughly 95% of enterprise GenAI pilots delivered no measurable P&L impact, with only about 5% achieving rapid revenue acceleration — is usually recycled as proof that the technology disappoints. The report says the opposite. The dominant root cause is not model quality but the organisational and integration layer: a "learning gap," flawed enterprise integration, missing operational ownership, and an inability to embed into live workflows (MIT NANDA, The GenAI Divide, 2025; HBR, Nov 2025). McKinsey's State of AI 2025 corroborates the structural picture: only around 23% of organisations report scaling an agentic system somewhere, while nearly two-thirds have not begun to scale across the enterprise.

Translate that into our language: most of those pilots cleared, at best, the operator's bar and were then expected to clear the auditor's, the on-call rota's and the whole-population user's bars, without anyone ever declaring those bars, owning them, or budgeting for them. Pilot purgatory is rarely a stuck pilot. It is a system silently asked to graduate to an acceptance bar that no one defined.

Why AI sharpens the point, not softens it

AI collapses the cost of the early stages. Prototypes and MVPs are now nearly free to generate, which is exactly why the constraint moves. Once generation is cheap, the only thing separating a demo from a production system is whether anyone can accept and operate the outcome. DORA's research supplies the evidence: its 2024 report associated higher AI adoption with lower delivery stability and pointed to larger batch sizes as a likely cause, and its 2025 follow-up found throughput improving while instability persisted. More code, faster, does not mean more production-readiness; often it means the reverse. Cheap generation does not pay the entry fee for the production room; it just gets you to the door faster.

And the production room has a recurring bill. Sculley et al.'s work on hidden technical debt in ML systems made the foundational case a decade ago: productionising is a categorically different engineering problem, dominated by ongoing system-level maintenance that pilots and MVPs never surface (NeurIPS 2015). Governance frameworks now encode the same truth: NIST's AI RMF spans Govern, Map, Measure and Manage across the whole lifecycle, treating deployed systems as "living" rather than deploy-and-forget (NIST AI RMF 1.0, 2023; Generative AI Profile, 2024).

How to graduate a system deliberately

Two practical moves follow. First, declare the acceptance bar before you build, not after. For each stage, write down who must accept the outcome and what they will demand — then engineer to exactly that bar and no further. Second, use a reversibility and blast-radius lens to size the engineering: the right amount of rigour is a function of how many people are affected and how hard it is to roll back. A reversible feature touching ten internal users earns very little ceremony; an irreversible one touching every customer earns all of it.
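The reversibility and blast-radius lens can be sketched as a small scoring function. Everything numeric here is an assumption made for illustration: the user-count thresholds, the three reversibility levels and the "light" / "standard" / "full" labels are hypothetical and would need calibrating to a real organisation.

```python
from enum import IntEnum


class Reversibility(IntEnum):
    EASY = 1      # e.g. a feature flag or config toggle
    MODERATE = 2  # e.g. a redeploy or data fix is needed
    HARD = 3      # effectively irreversible


def blast_radius_score(affected_users: int) -> int:
    """Bucket the number of affected people (thresholds are assumed)."""
    if affected_users <= 10:
        return 1
    if affected_users <= 1000:
        return 2
    return 3


def rigour_level(affected_users: int, reversibility: Reversibility) -> str:
    """Combine blast radius and reversibility into a rigour band."""
    score = blast_radius_score(affected_users) * int(reversibility)
    if score <= 2:
        return "light"
    if score <= 4:
        return "standard"
    return "full"


# The two extremes from the paragraph above.
internal_toggle = rigour_level(10, Reversibility.EASY)
customer_migration = rigour_level(1_000_000, Reversibility.HARD)
```

Multiplying the two factors is a deliberate choice in this sketch: either a wide blast radius or a hard rollback alone raises the band, and both together push it to the top.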

Graduation between bars should be an explicit, owned decision with a named acceptor and a budget, not an accident of momentum. That ownership question (who holds the translation between strategy, architecture and operations) is precisely the gap we describe in our work on The Missing Architecture Layer Between Strategy and Delivery.
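A graduation decision of this kind can be recorded in a very small structure. The following is a minimal sketch under stated assumptions: `GraduationDecision` and its fields are hypothetical names, and the only rule it encodes is the one above, that a move between bars needs a named acceptor and a budget.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class GraduationDecision:
    from_stage: str
    to_stage: str
    acceptor: Optional[str] = None  # named person or role who owns the new bar
    budget: Optional[float] = None  # funding agreed for the target bar

    def blockers(self) -> list[str]:
        """List what is missing before this counts as an explicit decision."""
        problems = []
        if not self.acceptor or not self.acceptor.strip():
            problems.append("no named acceptor")
        if self.budget is None or self.budget <= 0:
            problems.append("no budget")
        return problems

    def is_explicit(self) -> bool:
        return not self.blockers()


# Momentum alone: nobody named, nothing funded.
drift = GraduationDecision("pilot", "production")

# An owned decision (the role and figure are made-up examples).
owned = GraduationDecision("pilot", "production",
                           acceptor="Head of Operations", budget=250_000.0)
```

A record like `drift` is what pilot purgatory looks like on paper: a target stage with no one attached to it.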

If there is one idea to carry away, it is this: stages are acceptance bars, and the discipline is matching engineering to the bar you are actually clearing. For the deeper argument on why acceptance becomes the binding constraint once generation is cheap, read our companion piece, The Acceptance Gap, alongside our end-to-end view in From Idea to Production: A Practical Product Delivery Lifecycle.