Security Review for AI-Generated Code

AI writes code that compiles about 95% of the time and is secure barely half the time, and the security figure has not moved in two years. The review that matters is no longer “does it run?” but “what did it quietly let in?”

Here is the uncomfortable number to start from. On Veracode’s 2025 measure — across more than a hundred models and tasks in four languages — roughly 45% of AI-generated code introduced a vulnerability from the OWASP Top 10. Their longitudinal follow-up into 2026 found the security pass rate stuck near 55%, virtually unchanged in two years, while syntactic correctness climbed past 95%. Treat that as one vendor’s benchmark rather than a universal constant — Veracode sells application security — but the shape is corroborated elsewhere, and the shape is the point: capability scaled, security did not.

Capability is not security

It is tempting to assume that as models get better at writing code they get better at writing safe code. The data says otherwise: the two curves have come apart. The most security-capable models on that benchmark, the reasoning-focused ones, top out around 70–72%, the largest improvement seen, and still leave close to three in ten snippets carrying a flaw. There is no model generation, on current evidence, whose output you can accept unreviewed. The better the model gets at producing plausible, working code, the more confidently it produces insecure code that looks right.

It fails exactly where it cannot see the whole program

The aggregate number hides a sharper, more useful pattern. AI-generated code passes well on flaws that have a documented surface fix: parameterised queries for SQL injection (~82%), standard library calls for weak cryptography (~86%). It fails badly on flaws that require reasoning about data as it flows across the whole program: cross-site scripting passes only ~15% of the time, log injection ~13%. Models reproduce the well-known local pattern; they cannot reliably perform the whole-program dataflow reasoning that output encoding and input sanitisation demand. For an AppSec function, that turns a vague “AI code is risky” into an actionable instruction: gate hardest on the dataflow-dependent classes.
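To make the two weakest classes concrete, here is a minimal Python sketch of neutralising untrusted data at the sink, which is where the dataflow reasoning has to land. The function names and the sample input are illustrative assumptions, not taken from the benchmark, and a real application would normally lean on its framework's templating and structured logging rather than hand-rolled helpers.

```python
import html
import logging


def neutralise_for_log(value: str) -> str:
    """Escape CR and LF so untrusted input cannot forge extra log lines."""
    return value.replace("\r", "\\r").replace("\n", "\\n")


def render_greeting(username: str) -> str:
    """Encode untrusted data at the HTML output boundary (hypothetical view)."""
    return f"<p>Hello, {html.escape(username)}</p>"


logging.basicConfig(level=logging.INFO)
log = logging.getLogger("demo")

# Hypothetical attacker-controlled value that tries to inject a fake log entry.
attacker_input = "alice\nINFO:demo:admin login succeeded"
log.info("login attempt user=%s", neutralise_for_log(attacker_input))

# The script tag is rendered as inert text rather than executed.
print(render_greeting("<script>alert(1)</script>"))
```

Neither helper is hard to write. The difficulty the pass rates point to is knowing that a value arriving at this line originated from a user several calls earlier, which is a question about the whole program rather than the snippet.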

The agent cannot review itself

The most dangerous assumption in agentic delivery is that you can ask the model to fix its own security. A controlled study of iterative refinement found the opposite: after five “improve this” iterations, critical vulnerabilities rose 37.6% and average flaws per sample tripled, and prompts that explicitly asked for security improvements still introduced new errors. The model cannot reliably recognise the flaw it just wrote, even when told to look for it. The crucial nuance, and the contrarian point worth holding onto, is that refinement degrades security only when the loop has no external check. The same research shows that refinement with static analysis or tool feedback in the loop reduces vulnerabilities sharply. The safety mechanism is the assurance tooling, not the model. Self-remediation is not assurance.
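One way to picture "an external check in the loop" is a refinement loop where a scanner the model does not control decides whether each revision is kept. The sketch below is an assumption-laden illustration: `revise` stands in for a model call, `scan` for a SAST tool, and the toy implementations exist only so the example runs; none of this reproduces the cited studies' setup.

```python
from typing import Callable, List, Tuple

Finding = str  # hypothetical: one scanner finding, represented as text


def refine_with_external_check(
    code: str,
    revise: Callable[[str, List[Finding]], str],
    scan: Callable[[str], List[Finding]],
    max_rounds: int = 5,
) -> Tuple[str, List[Finding]]:
    """Keep a revision only when the external scanner reports fewer findings."""
    findings = scan(code)
    for _ in range(max_rounds):
        if not findings:
            break
        candidate = revise(code, findings)
        candidate_findings = scan(candidate)
        # The scanner, not the model, decides whether this round was progress.
        if len(candidate_findings) < len(findings):
            code, findings = candidate, candidate_findings
    return code, findings


# Toy stand-ins so the sketch is runnable; real ones would call a model and a SAST tool.
def toy_scan(code: str) -> List[Finding]:
    return ["shell=True passed to subprocess"] * code.count("shell=True")


def toy_revise(code: str, findings: List[Finding]) -> str:
    return code.replace("shell=True", "shell=False")


fixed, remaining = refine_with_external_check(
    "subprocess.run(cmd, shell=True)", toy_revise, toy_scan
)
print(fixed, remaining)
```

Counting findings is a deliberately crude acceptance rule; a real gate would compare severities and classes. The design point is only that a revision which makes the scan worse is discarded instead of becoming the input to the next round.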

A new supply-chain seam: hallucinated packages

There is also an attack surface that did not exist before code generation. Models invent package names that do not exist: 2024 academic work measured rates of around 5% for commercial models and over 20% for open-source models, and catalogued more than two hundred thousand unique hallucinated names. Attackers register the common ones as malicious lookalikes; OWASP documents the chain. The complacent reading is that 2026 frontier models have cut hallucination to ~5–6%, so the problem is shrinking. The honest reading is that the same study found 127 names that five different frontier models all invented identically, dozens of them still registrable: a model-agnostic attack surface that falling per-model rates actively hide. Lower numbers here are false comfort.
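A cheap first defence against invented names is to refuse any dependency that is not on a reviewed allowlist. The following Python sketch assumes a simple requirements-style file and a hypothetical allowlist; the package `fastjson-utils` is a made-up stand-in for a hallucinated name. It does not handle URL or VCS requirements, and it checks names only, not versions or hashes.

```python
import re
from typing import List, Set

# Leading project name of a requirement line such as "Flask_Login>=0.6".
_NAME = re.compile(r"^([A-Za-z0-9][A-Za-z0-9._-]*)")


def normalise(name: str) -> str:
    """PEP 503 name normalisation: collapse runs of '-', '_', '.' and lowercase."""
    return re.sub(r"[-_.]+", "-", name).lower()


def declared_packages(requirements_text: str) -> List[str]:
    """Return normalised package names, skipping comments, blanks and options."""
    names = []
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue
        match = _NAME.match(line)
        if match:
            names.append(normalise(match.group(1)))
    return names


def unapproved(requirements_text: str, allowlist: Set[str]) -> List[str]:
    """List declared packages that nobody has reviewed and approved."""
    approved = {normalise(name) for name in allowlist}
    return [n for n in declared_packages(requirements_text) if n not in approved]


requirements = """
requests>=2
Flask_Login>=0.6  # auth
fastjson-utils
"""
print(unapproved(requirements, {"requests", "flask-login"}))
```

An allowlist fails closed: a name that five models all invent identically is rejected for the same reason as any other unknown name, whether or not an attacker has registered it yet.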

What review has to become

None of this argues against AI-generated code; it argues for treating “the agent wrote it” as raising the review bar, not lowering it. Concretely:

  • Mandatory, class-specific security gates: weighted hardest at the dataflow-dependent flaws (XSS, injection, log handling) where generation fails most, not a uniform pass.
  • External verification in the loop: SAST/DAST and dependency scanning that the model does not control; never let the agent be its own security reviewer.
  • Defend the supply chain explicitly: lockfiles, package allowlists or provenance checks, and pre-merge validation that every imported dependency actually exists and is the one you meant.
  • A named human who accepts the security posture: assurance is complete when someone owns it (the accountable core), not when the checks merely go green.
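The first bullet can be sketched as a merge gate that treats finding classes differently instead of applying one uniform threshold. This is a hypothetical policy for illustration: the choice of blocking classes, the `Finding` shape, and the three outcomes are assumptions, and a "pass" here means only that the automated gate passed, not that a human has accepted the posture.

```python
from dataclasses import dataclass
from typing import List

# Assumed policy: classes that block a merge outright.
# CWE-79 = cross-site scripting, CWE-89 = SQL injection,
# CWE-117 = improper output neutralisation for logs.
BLOCKING_CLASSES = {"CWE-79", "CWE-89", "CWE-117"}


@dataclass(frozen=True)
class Finding:
    cwe: str
    location: str  # hypothetical "file:line" string from a scanner report


def gate(findings: List[Finding]) -> str:
    """Return 'block', 'escalate' or 'pass' for a set of scanner findings."""
    if any(f.cwe in BLOCKING_CLASSES for f in findings):
        return "block"      # dataflow-dependent class: no merge until fixed
    if findings:
        return "escalate"   # other classes: route to the named human reviewer
    return "pass"           # automated checks are green; sign-off is separate


print(gate([Finding("CWE-327", "crypto.py:12")]))
print(gate([Finding("CWE-327", "crypto.py:12"), Finding("CWE-79", "views.py:40")]))
```

Keeping the policy in a small, reviewable table makes the weighting an explicit decision that someone owns rather than a default inherited from the scanner.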

Generation got cheap; the vulnerability rate did not fall. Security review is the part of delivery that does not get automated away by better models; on the evidence, better models make it more necessary, not less. It is where you pay back the speed, and it is increasingly the work that distinguishes a team shipping safely from one shipping fast.