Agentic Code Review

Once code generation becomes abundant, the binding constraint moves to acceptance: the judgement that a change is correct, safe and worth shipping. We have argued elsewhere that this is where value now concentrates. Code review is where that judgement is supposed to happen. So it is worth asking a blunt question that most teams have not stopped to ask: when the author is a machine, is review still the same job?

It is not. Reviewing AI code is a categorically different cognitive task from reviewing a colleague's, and most review practice has not caught up. The checklist, the heuristics, the instincts that served us for twenty years were calibrated for a human author. Applied unchanged to a fluent, tireless, confidently wrong machine, they quietly fail. Review is the operational mechanic of closing the Acceptance Gap, and it deserves to be specified properly rather than gestured at.

The heuristics you lost

A human pull request comes with two things you rarely notice you rely on. The first is recoverable intent. You can ask the author what they were trying to do, and the answer is a real, retrievable plan. The second is the texture of effort and uncertainty: hesitant code, an odd variable name, a TODO left in, a comment that says "not sure this is right". These are signposts. They tell an experienced reviewer where to look hard, because the author has already revealed, often without meaning to, where they were unsure.

Human PR · signs of doubt

function settle(order) {
  // TODO: handle partial refunds?
  const t = calcTax(order)   // unsure
  return total(order) - t
}

AI PR · uniform fluency

function settle(order) {
  const tax = calculateTax(order);
  const total = sum(order.items);
  return total - tax;
}

Lost heuristics: uniform fluency hides the gaps, and the absence of friction masks them.

AI strips both away. The output is uniformly fluent. It is internally consistent. It has no hesitation to read, no fingerprints of doubt, no TODOs born of a real worry. Every function looks equally finished. The plausible parts and the wrong parts wear the same clothes. The reviewer's most reliable instrument, the ability to feel where the author was on shaky ground, returns a flat reading on everything. You are reviewing prose written by something that is incapable of sounding unsure, including when it should be.

What AI gets wrong that humans rarely do

Aspect	Reviewing human code	Reviewing AI code
The author	Can tell you where they were unsure	Cannot: confidence is uniform and unreliable
Risk surface	Logic and edge cases	Plausible-but-wrong: invented dependencies, crossed boundaries
The checklist	Refined over years	Needs a new one, not the old one applied faster

Reviewing AI code is not reviewing human code — the gate needs a different checklist.

The failure modes are different in kind, not just degree. Human reviewers learn to anticipate human mistakes: a misread requirement, a forgotten edge case, a pattern copied without being fully understood. AI makes mistakes that humans almost never produce. It fabricates APIs that do not exist but look exactly as though they should. It writes tests that pass and assert nothing meaningful. It produces designs that are architecturally consistent with the surrounding code and also wrong, because consistency is what it optimises for, not correctness. It is confidently, plausibly, fluently mistaken in ways a tired human simply is not.
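A fabricated API is easier to recognise once you have seen one next to the real thing. The sketch below is illustrative only: `dedupeFabricated`, `dedupeReal` and `throwsWhenCalled` are hypothetical names invented for this example, not taken from any real pull request. The first function calls an array method that reads as though it should exist and does not; the second uses what the language actually provides.

```javascript
// Fabricated: reads naturally, but standard JavaScript arrays have no `unique` method.
function dedupeFabricated(values) {
  return values.unique();
}

// Real: a Set keeps the first occurrence of each value, in insertion order.
function dedupeReal(values) {
  return [...new Set(values)];
}

// Run a function and report whether it threw a TypeError.
function throwsWhenCalled(fn) {
  try {
    fn();
    return false;
  } catch (err) {
    return err instanceof TypeError;
  }
}

console.log(throwsWhenCalled(() => dedupeFabricated([1, 2, 2]))); // true
console.log(dedupeReal([1, 2, 2])); // [1, 2]
```

Nothing in the first function looks wrong on a casual read. In plain JavaScript the mistake only surfaces when the line is executed, which is why a reviewer cannot rely on the code looking finished.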

This is the dominant failure mode in the field, not an edge case. Stack Overflow's 2025 survey found the single largest developer frustration, cited by 45 percent, was AI output that is "almost right, but not quite"; 66 percent said they spend more time fixing such code, and trust in AI accuracy fell to 29 percent, down eleven points in a year. Almost-right is the expensive case precisely because it survives a casual read. It is the hidden tax on abundant generation.

The cost compounds because we are demonstrably bad at catching it. The research on automation bias is unkind to our self-image. Systematic-review evidence finds erroneous automated advice is followed at roughly a 26 percent higher rate, and when otherwise-reliable automation occasionally fails, operators detect only about 30 percent of its errors, against roughly 75 percent when failures are visible. Inexperience and higher complexity both push the miss rate up. A fluent, mostly-correct system is the most dangerous possible thing to review, because its reliability is what disarms you.

A human PR tells you where the author was unsure. An AI PR tells you nothing, because it is incapable of sounding unsure. Review the confidence, not just the code.

Re-specifying the gate

If the old checklist was "did they write it right", line by line, syntax and style, the new gate is "is this confident artefact actually correct, safe and worth shipping". Those are not the same question asked faster. They are different questions, and they point review at a different surface. Five things deserve the reviewer's scarce attention; almost nothing else does.

Intent. The AI had no recoverable plan, so you must supply and verify one. What was this change meant to achieve, and does the diff actually achieve that rather than something adjacent and plausible?
Architecture. AI optimises for local consistency. Ask whether the design is right, not whether it matches its neighbours. Consistent-but-wrong is the signature failure here.
Edge cases. The happy path will almost always work and almost always look convincing. The boundaries, the nulls, the concurrency, the failure handling are where confident generation quietly omits.
Security. NIST SP 800-218A is explicit that all source code in AI-assisted development should be reviewed, human-written or generated alike. Fabricated dependencies and overlooked input handling are AI-shaped risks.
Test quality. Not test presence. AI is excellent at producing tests that pass and prove nothing. A green suite is now a claim to be audited, not evidence to be trusted.
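
One of these checks is cheap enough to sketch. For fabricated dependencies, a reviewer can compare what a diff imports against what the manifest declares. What follows is a minimal sketch under stated assumptions: the helper names, the sample source and the package name `fast-retry-utils` are all invented for illustration, and the regular expression is a deliberate simplification of what a real parser would do.

```javascript
// Collect package names from `import ... from "x"` and `require("x")` in a source string.
// Assumption: a regex is adequate for a sketch; it misses side-effect imports
// such as `import "x"` and will misread unusual formatting.
function importedPackages(source) {
  const pattern = /(?:from\s+|require\(\s*)["']([^"']+)["']/g;
  const names = new Set();
  for (const match of source.matchAll(pattern)) {
    const specifier = match[1];
    // Skip relative paths, absolute paths and `node:`-prefixed built-ins.
    if (/^(\.|\/|node:)/.test(specifier)) continue;
    const parts = specifier.split("/");
    // Scoped packages keep two segments ("@scope/name"); others keep one.
    names.add(specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0]);
  }
  return [...names];
}

// Return imported packages that the manifest does not declare.
function undeclaredPackages(source, manifest) {
  const declared = new Set([
    ...Object.keys(manifest.dependencies || {}),
    ...Object.keys(manifest.devDependencies || {}),
  ]);
  return importedPackages(source).filter((name) => !declared.has(name));
}

// Hypothetical diff content and manifest, invented for illustration.
const diffSource = `
import express from "express";
import { retry } from "fast-retry-utils";
const helpers = require("./helpers");
`;
const manifest = { dependencies: { express: "^4.0.0" } };

console.log(undeclaredPackages(diffSource, manifest)); // ["fast-retry-utils"]
```

An undeclared import is a prompt to look, not a verdict. It may be a legitimate new dependency, and built-in modules imported without the `node:` prefix would be flagged by this sketch too. The point is that the question "does this package exist and did we choose it?" gets asked at all.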

Test quality is the one that catches teams out, so it is worth dwelling on. When the same system writes the code and the tests, a passing suite tells you the two agree with each other, not that either is correct. Thoughtworks' Technology Radar (Vol. 34) makes the related point well: as agents generate ever-larger volumes of code it warns of accumulating "cognitive debt", argues that established engineering discipline is "more vital than ever", and recommends feedback controls such as mutation testing to force self-correction before a human ever looks. Mutation testing is precisely the instrument for asking whether tests assert anything: deliberately break the code and see if the suite notices.
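
The idea fits in a few lines. The sketch below hand-writes three mutants of a small function and runs two candidate suites against them. Everything in it is hypothetical and invented for illustration: the `shippingFee` function, its threshold and fee, and both suites. Real tools such as StrykerJS generate mutants automatically; this version only shows the principle.

```javascript
// Hypothetical function under review: free shipping at or above a threshold.
function shippingFee(subtotalCents) {
  return subtotalCents >= 5000 ? 0 : 499;
}

// Hand-written mutants: each is the original with one small deliberate break.
const mutants = {
  boundaryFlipped: (s) => (s > 5000 ? 0 : 499), // >= became >
  conditionNegated: (s) => (s < 5000 ? 0 : 499), // >= became <
  constantChanged: (s) => (s >= 5000 ? 0 : 500), // 499 became 500
};

// Two candidate suites. Each returns true when every check passes.
const weakSuite = (fee) => typeof fee(6000) === "number" && fee(6000) >= 0;
const strongSuite = (fee) =>
  fee(6000) === 0 && fee(5000) === 0 && fee(4999) === 499;

// A mutant "survives" when the suite still passes against the broken code.
function survivingMutants(suite) {
  return Object.keys(mutants).filter((name) => suite(mutants[name]));
}

console.log(survivingMutants(weakSuite)); // all three mutants survive
console.log(survivingMutants(strongSuite)); // no mutants survive
```

Both suites are green against the original function. Only the second one notices when the code is broken, and that difference is invisible until something breaks the code on purpose.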

Why the human stays on the loop

The instinct, faced with volume, is to automate the review too. The evidence says be careful. An empirical study of thirteen code-review agents across 19,450 pull requests found that PRs reviewed by an agent alone merged at 45.2 percent against 68.4 percent for human-reviewed ones, that 60 percent of agent-only reviews sat in the lowest useful-signal band, and that twelve of the thirteen agents averaged signal ratios below 60 percent. The authors' conclusion is unambiguous: agents should augment, not replace, human reviewers. Review agents are useful as a first pass that triages noise. They are not the gate.
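
In policy terms, the distinction between first pass and gate might look like the following sketch. The pull-request shape, the field names and the severity labels are assumptions invented for illustration; no particular review tool or platform works exactly this way.

```javascript
// First pass: agent findings order the human's attention. They approve nothing.
function triage(pr) {
  const flagged = pr.agentFindings.filter((finding) => finding.severity !== "info");
  return { needsCloseRead: flagged.length > 0, flagged };
}

// The gate: passing checks plus at least one human approval, whatever the agent said.
function canMerge(pr) {
  return pr.checksPassed && pr.humanApprovals >= 1;
}

// Hypothetical pull request that only an agent has looked at.
const agentOnlyPr = {
  checksPassed: true,
  humanApprovals: 0,
  agentFindings: [
    { severity: "info", note: "naming is consistent" },
    { severity: "warning", note: "new dependency introduced" },
  ],
};

console.log(triage(agentOnlyPr).needsCloseRead); // true
console.log(canMerge(agentOnlyPr)); // false
console.log(canMerge({ ...agentOnlyPr, humanApprovals: 1 })); // true
```

The design choice is that the agent's output never appears in `canMerge`. It can raise a flag for the human; it cannot lower the bar.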

The wider data tells the same story from a different angle. DORA's 2025 report frames AI as an amplifier of existing engineering conditions rather than an automatic improvement: 90 percent of respondents use AI and most feel more productive, yet around 30 percent report little or no trust in AI-generated code, and adoption carries a negative relationship with delivery stability absent strong controls. GitClear's longitudinal analysis of more than 211 million changed lines shows the quality signatures degrading in step, with churn up, copy-pasted blocks up, and refactoring's share of changes falling below 10 percent. And the productivity premise itself is shakier than it feels: METR's randomised trial of experienced open-source developers found AI made them 19 percent slower on real tasks, even as they believed it had sped them up by 20 percent. Faster generation is not faster shipping. The burden has simply moved to the gate.

That is the whole argument in one line. AI does not remove the work of acceptance; it concentrates it. The human-on-the-loop reviewer is no longer asking whether the author wrote it correctly. They are asking whether a confident artefact, produced by something that cannot tell you where it was unsure, is correct, safe and worth shipping. That needs a different checklist, not the old one applied faster, and it needs a reviewer who has not been lulled into stepping out of the loop.

Agentic code review is where the AI-native operating model earns its keep, the point where roles move from doing to directing without surrendering judgement. If you want to know how mature your review gate actually is, that is a question the 5-stage maturity model is built to answer. Start there, or read What Is Agentic Engineering? for how the pieces fit together.