The AI Engineering Maturity Model

Most maturity conversations begin in the wrong place: with tooling. A team installs an assistant, code starts flowing, and someone declares the organisation 'AI-native'. The trouble is that adoption is not maturity. DORA's 2025 research puts it bluntly: AI does not fix a team; it amplifies what is already there. Where delivery discipline is strong, AI raises throughput; where it is weak, it raises instability. Trust lags too: roughly 30% of practitioners report little or no trust in the code AI produces. A maturity model is therefore not a scoreboard for how much AI you use. It is a map of how reliably you can turn intent into accepted, working software as the machine does more of the typing.

This is the spine that runs through our work on the acceptance gap: the distance between code that is generated and outcomes that are accepted into production and trusted by the people who own them. Each stage below is defined by how an organisation closes that gap, not by how fast it opens it.

The five stages

Stage 1 — Experimentation: Individuals trial AI tools informally. Gains are anecdotal, ungoverned and invisible in delivery data; the organisation has no shared stance on what AI is for.
Stage 2 — Assistance: AI is a sanctioned per-developer accelerant. It autocompletes, drafts and explains, but a human writes the intent and owns every line. Productivity claims outrun measured outcomes.
Stage 3 — Automation: AI becomes the primary code-generator for well-scoped work; humans shift from authoring to validating. This is the inflection point where local speed must be converted into delivered value, or it evaporates.
Stage 4 — Agentic Engineering: Agents execute multi-step tasks against specifications and curated context, with humans moving from 'in the loop' to 'over the loop' — setting intent, reviewing decisions, governing the boundary.
Stage 5 — AI-Native Organisation: Scaled AI ways of working are the default operating model, not a programme. Strategy, workflows, data, governance and platforms are re-engineered around AI; the function delivers measurably above peers.

The stages converge with the serious external models, which is reassuring rather than coincidental. MIT CISR's enterprise model (721 companies) finds the greatest financial impact comes precisely from the move out of its 'pilots and capabilities' stages into 'scaled AI ways of working', which corresponds to our Automation-to-Agentic-to-Native progression; its stage 3-4 organisations perform well above industry average and its stage 1-2 organisations below it. The CMU SEI and Accenture AI Adoption Maturity Model, drawing on the CMM/CMMI lineage, names five levels (Exploratory, Implemented, Aligned, Scaled, Future-Ready) assessed across eight dimensions, against a backdrop where roughly 95% of organisations realise no return on generative-AI spend. The pattern is the same everywhere: value lives in scaling, and most organisations are stuck below it.

How to advance: the capability gates

Stages are not reached by buying the next tool; they are earned by acquiring the capabilities that make the next stage safe. The canonical assessment basis is the DORA AI Capabilities Model (September 2025), which identifies seven capabilities that amplify AI's positive impact: a clear and communicated AI stance, healthy data ecosystems, AI-accessible internal data, strong version control, working in small batches, a user-centric focus, and quality internal platforms. Treat these as the entry criteria for each gate rather than as a checklist completed once.
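
As a rough illustration of treating the seven capabilities as recurring entry criteria rather than a one-off checklist, the sketch below re-checks all seven each time a gate is approached and reports the gaps. The capability names come from the list above; the function names, the boolean "evidence" format and the rule that every capability must be evidenced at every gate are assumptions made for this sketch, not part of the DORA model.

```python
# Capability names as listed in the text; everything else here is hypothetical.
DORA_CAPABILITIES = (
    "clear and communicated AI stance",
    "healthy data ecosystems",
    "AI-accessible internal data",
    "strong version control",
    "working in small batches",
    "user-centric focus",
    "quality internal platforms",
)


def gate_gaps(evidence: dict[str, bool]) -> list[str]:
    """Return capabilities lacking current evidence, in the order listed above."""
    unknown = set(evidence) - set(DORA_CAPABILITIES)
    if unknown:
        raise KeyError(f"unrecognised capabilities: {sorted(unknown)}")
    # A capability with no entry counts as a gap: absence of evidence is not a pass.
    return [c for c in DORA_CAPABILITIES if not evidence.get(c, False)]


def may_pass_gate(evidence: dict[str, bool]) -> bool:
    """Assumed rule for this sketch: a gate opens only when no capability is missing."""
    return not gate_gaps(evidence)
```

Calling gate_gaps with fresh evidence at each transition, rather than caching an earlier result, is what keeps the check from becoming a checklist completed once.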

Each transition has a characteristic failure that a gate is meant to catch. Augment Code's AI-SDLC model names them well: the 'verification tax' as you move from Assistance to Automation, where time saved generating is re-spent auditing and senior engineers become the review bottleneck; the '10% productivity plateau' on the way to Agentic Engineering, where local gains are absorbed by unchanged downstream constraints; and 'governance debt' at the top, where compliance and monitoring are retrofitted after scaling rather than built in. ELEKS' five-level model (Traditional through AI-Autonomous) makes the same point from the other side: its governing insight is that human expertise becomes more critical as you advance, with the autonomous stage requiring tiered change classification. Routine changes are auto-approved, refactors are AI-approved with logging, and feature or security changes get mandatory human review.

Strong foundations will scale faster and safer. Weak ones will fail more visibly. — ELEKS, AI-SDLC Maturity Model
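
The tiered change classification attributed to ELEKS above can be sketched as a small policy table. The three tiers and their mapping follow the text; the enum names, the function name, and the two rules that a mixed change takes its strictest tier and that an unclassified change defaults to human review are assumptions for illustration, not ELEKS' implementation.

```python
from enum import Enum


class ChangeKind(Enum):
    ROUTINE = "routine"
    REFACTOR = "refactor"
    FEATURE = "feature"
    SECURITY = "security"


class Approval(Enum):
    AUTO_APPROVE = "auto-approve"
    AI_APPROVE_WITH_LOGGING = "ai-approve-with-logging"
    HUMAN_REVIEW = "mandatory-human-review"


# Tier mapping as described in the text.
POLICY = {
    ChangeKind.ROUTINE: Approval.AUTO_APPROVE,
    ChangeKind.REFACTOR: Approval.AI_APPROVE_WITH_LOGGING,
    ChangeKind.FEATURE: Approval.HUMAN_REVIEW,
    ChangeKind.SECURITY: Approval.HUMAN_REVIEW,
}

# Tiers ordered from least to most strict, used to pick the strictest one.
STRICTNESS = [
    Approval.AUTO_APPROVE,
    Approval.AI_APPROVE_WITH_LOGGING,
    Approval.HUMAN_REVIEW,
]


def required_approval(kinds: set[ChangeKind]) -> Approval:
    """Return the strictest approval tier across everything a change touches."""
    if not kinds:
        # Assumption: a change nobody has classified gets the strictest tier.
        return Approval.HUMAN_REVIEW
    return max((POLICY[k] for k in kinds), key=STRICTNESS.index)
```

Under these assumptions a change tagged as both a refactor and a security fix resolves to mandatory human review, so the looser tier can never mask the stricter one.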

To operate at stages 4 and 5 you need a coherent stack, not a single agent. Thoughtworks' Sunit Parekh frames AI-native engineering as five building blocks: the Agent (autonomous execution), the Model (an increasingly task-specialised knowledge processor), the Methodology (a disciplined framework that prevents 'agent thrashing'), the Spec (precise requirements bridging intent and execution), and the Context (curated institutional knowledge and guardrails, expressed in artefacts like AGENTS.md). Absent these, an organisation that claims Agentic maturity is usually running Stage 2 with more autonomy and less accountability.

What to measure at each stage

The metric that matters changes as you climb, and using the wrong one is how organisations fool themselves.

Experimentation and Assistance: measure adoption breadth and the perception-versus-outcome gap, and be sceptical. A METR randomised controlled trial found experienced open-source developers took 19% longer with early-2025 AI tools while believing they had been sped up by 20%; self-reported velocity is the least trustworthy figure you will collect.
Automation: measure delivery throughput and stability together, never apart. GitHub's Octoverse 2025 reports 46% of the average developer's code is now AI-written (61% in Java), so the question is no longer how much AI writes but how much survives review and stays stable in production.
Agentic Engineering: measure the acceptance rate of agent-produced changes and the human cost of review per change, which is the verification tax made visible.
AI-Native: measure outcomes the business recognises, namely relative financial performance, cycle time from idea to accepted production, and trust, the 30% scepticism figure being the one to drive down.
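
Three of the measures named above reduce to simple arithmetic, sketched here under stated assumptions. The record type, field names and function names are hypothetical; 'accepted' is assumed to mean merged and not subsequently reverted, and the perception gap is expressed in percentage points of task time, treating a reported speed-up of 20% as a 20% reduction in time for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentChange:
    accepted: bool         # assumed: merged and not later reverted
    review_minutes: float  # human time spent reviewing this change


def acceptance_rate(changes: list[AgentChange]) -> float:
    """Share of agent-produced changes that were accepted."""
    if not changes:
        raise ValueError("no changes to measure")
    return sum(c.accepted for c in changes) / len(changes)


def review_minutes_per_change(changes: list[AgentChange]) -> float:
    """Average human review time per change, accepted or not."""
    if not changes:
        raise ValueError("no changes to measure")
    return sum(c.review_minutes for c in changes) / len(changes)


def perception_gap_points(perceived_time_change_pct: int,
                          measured_time_change_pct: int) -> int:
    """Measured minus perceived change in task time, in percentage points.

    Negative inputs mean 'faster'; a positive result means the team felt
    faster than it actually was.
    """
    return measured_time_change_pct - perceived_time_change_pct
```

With the METR figures quoted above, perception_gap_points(-20, 19) gives 39: developers believed they were 20% faster and were measured 19% slower. Rejected changes are deliberately included in the review average, since time spent reviewing work that is never accepted is still part of the verification tax.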

A model is only useful if it changes a decision. Read against this one, most organisations discover they are a stage lower than they assumed, because they counted tools instead of capabilities and velocity instead of acceptance. That is not a failure; it is the start of an honest plan. Architecture, here as everywhere, is decision quality — and the model's job is to tell you which decision to make next.

Where are you? Symptoms by stage

A faster read for executives — the symptom you would actually notice, the risk it carries, and the next move.

Experimentation — symptom: scattered individual use, no shared stance. Risk: fragmented, ungoverned adoption. Next move: set an explicit AI stance.
Assistance — symptom: copilots everywhere, everyone claims productivity, nobody measures outcomes. Risk: the productivity illusion. Next move: measure outcomes, not adoption.
Automation — symptom: AI generates most well-scoped code and reviewers become the bottleneck. Risk: the verification tax. Next move: build the harness and deterministic gates.
Agentic Engineering — symptom: agents run multi-step tasks while humans govern the boundary. Risk: autonomy without accountability. Next move: invest in context, evals and provenance.
AI-Native — symptom: AI is the default operating model. Risk: governance debt retrofitted after scaling. Next move: re-engineer strategy, data and governance around it.

The acceptance-gap overlay

Lay this model over The Acceptance Gap and the two move together: the gap is large and invisible at Experimentation and Assistance, becomes painfully visible at Automation, is actively managed at Agentic Engineering, and is small and measured at AI-Native. Maturity is not how much AI you use; it is how narrow and how known your acceptance gap is.

Diagnose your stage honestly

Do you have a shared, communicated AI stance — or scattered individual use?
Are you measuring delivered outcomes, or adoption and self-reported speed?
Do agents operate against curated context and specs, or improvise?
Is acceptance — rate and rework — measured, or assumed?
Was governance built in, or is it being retrofitted after scaling?

From insight to action

This model is a mirror, not a sales ladder, and the stage it shows is often lower than the one an organisation feels it has reached. We are explicit that nobody, ourselves included, has years of Stage-5 production experience; the field is too young for that claim to be honest. What we offer is a frank read of where you are and which capability gate comes next. Our AI Maturity Assessment turns the questions above into a score you can act on.

If you want to move from this assessment to a working instrument, start with what to actually count: see our companion piece on measuring AI engineering, then close the loop you have just opened with the acceptance gap.