The Integration Seam Is Where AI-Generated Software Breaks: Payments, Identity and the Limits of Generation

AI agents write clean code inside a service and confident nonsense at the boundary between systems. Idempotency, payment state machines, token refresh races and eventual consistency live in vendor quirks and production incidents, not in training data — and that is exactly where acceptance must now concentrate.

There is a comfortable story being told about AI-generated software, and it is mostly true: ask a capable agent to write a function, a service, a self-contained module, and it will hand you something coherent, idiomatic and frequently correct. The story holds right up to the point where one system has to talk to another. At the seam between services — a payment processor and your ledger, an identity provider and your session store, an event bus and your fulfilment logic — the same agent that wrote a clean module becomes dangerously confident about behaviour it has never actually observed. This is not a marginal weakness. It is where the most expensive production incidents originate, and it is precisely where generation is weakest.

The reason is structural. An agent reasons from patterns in its training data, and the inside of a service is well represented there: idioms, algorithms, framework conventions, the shape of a REST handler. The behaviour at a seam is not. Whether Stripe re-executes a failed request on the same idempotency key, whether your OAuth provider rotates refresh tokens, whether a webhook can arrive before the API has settled the underlying state — these are not patterns. They are vendor quirks and post-incident scar tissue, learned by teams the hard way and recorded, if at all, in changelogs and runbooks rather than in the public corpus. The agent has read the happy path. It has not been on call.

The evidence: defects concentrate where context is deepest

The aggregate numbers are now hard to wave away. A March 2026 empirical study, 'Debt Behind the AI Boom', analysed 304,362 verified AI-authored commits across 6,275 repositories spanning Copilot, Claude, Cursor, Gemini and Devin. Over 15% of commits from every assistant introduced at least one detectable problem, rising to 28.7% for Gemini; AI commits introduced nearly twice as many security issues as they fixed, and 24.2% of the issues they introduced were still present at the latest revision. CodeRabbit's December 2025 analysis of 470 pull requests reached a complementary conclusion: AI-written code produced roughly 1.7x more issues than human code, with logic and correctness defects up 75% and security vulnerabilities up 1.5 to 2x. The Register, covering the same data, noted that AI-authored pull requests carried 1.4x more critical and 1.7x more major issues, and 'need more attention' precisely in the categories that require deeper context.

That last phrase is the whole argument. The defects are not evenly distributed across the codebase. They cluster in the categories that demand context the model does not have — and integration logic is the densest such category. An ACM study of hallucinations in practical code generation found that API Knowledge Conflicts alone account for 20.41% of hallucinations: wrong parameters, missing guard conditions, similar-but-wrong APIs, and unhandled call exceptions. Maninger and colleagues, in their December 2025 work on errors in LLM-generated web API integrations, found that specific integration failure patterns recur far more often than others. These are not random mistakes. They are systematic blind spots at the boundary.

Four seams agents reliably get wrong

It helps to be concrete, because 'integration is hard' is a platitude until you name the failures. Consider four that recur in payments, identity and commerce.

First, webhook idempotency. Stripe's own engineering position is that idempotency is not a nicety but a consequence of distributed systems being inherently unreliable: you must plan for duplicate and partially-failed requests at the boundary. Stripe's API documentation describes saving the result of the first request made under a given idempotency key and replaying it on repeats, whether that request succeeded or failed, while a request that never began executing (one that failed validation, or conflicted with a concurrent request) can be retried under the same key. That is a distinction an agent rarely encodes correctly. Worse, payment webhooks can arrive out of order, and an adverse event such as a dispute may follow a succeeded one, which is why mature handlers verify status via the API before fulfilment rather than trusting event order. An agent, generating from the happy path, will cheerfully fulfil on the first 'succeeded' it sees.
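A minimal sketch of that handler shape, assuming an in-memory store and a hypothetical fetch_status callable standing in for a call to the processor's API. The event fields (id, type, payment_id) and the status strings are illustrative, not any vendor's schema.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class WebhookHandler:
    # Hypothetical callable: asks the processor for the payment's current status.
    fetch_status: Callable[[str], str]
    # In production these sets would live in a durable store, and the
    # dedup record would be committed in the same transaction as fulfilment.
    seen_event_ids: set = field(default_factory=set)
    fulfilled_payments: set = field(default_factory=set)

    def handle(self, event: dict) -> str:
        event_id = event["id"]
        payment_id = event["payment_id"]

        # Duplicate delivery of the same event: do nothing.
        if event_id in self.seen_event_ids:
            return "duplicate"
        self.seen_event_ids.add(event_id)

        if event["type"] != "payment.succeeded":
            return "ignored"

        # Do not trust event order: confirm the current status with the
        # processor before fulfilling.
        if self.fetch_status(payment_id) != "succeeded":
            return "not_fulfilled"

        # A different event for an already-fulfilled payment must not
        # trigger a second fulfilment.
        if payment_id in self.fulfilled_payments:
            return "already_fulfilled"
        self.fulfilled_payments.add(payment_id)
        return "fulfilled"
```

The point of the sketch is the ordering of the checks: deduplicate on the event, verify against the source of truth, then guard fulfilment on the payment itself.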

Second, payment state machines. A charge is not a boolean. It moves through pending, authorised, captured, failed, refunded, disputed — and the transitions are owned by the processor, not by you. Code that treats a payment as succeeded-or-not will double-charge on retry or fulfil against money it will later lose. Reliable handling, as the webhook-at-scale literature describes, means queue-first ingestion, at-least-once delivery paired with idempotency, state-machine retries with backoff and dead-letter queues — an architecture, not a function.
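To make 'not a boolean' concrete, here is a small sketch of an explicit transition table. The states are the ones named above; the allowed transitions are an assumption for illustration only, since the real table belongs to the processor and differs by vendor and payment method.

```python
from enum import Enum


class PaymentState(Enum):
    PENDING = "pending"
    AUTHORISED = "authorised"
    CAPTURED = "captured"
    FAILED = "failed"
    REFUNDED = "refunded"
    DISPUTED = "disputed"


S = PaymentState

# Illustrative transition table (an assumption, not any vendor's contract).
ALLOWED_TRANSITIONS = {
    S.PENDING: {S.AUTHORISED, S.FAILED},
    S.AUTHORISED: {S.CAPTURED, S.FAILED},
    S.CAPTURED: {S.REFUNDED, S.DISPUTED},
    S.DISPUTED: {S.CAPTURED, S.REFUNDED},
    S.FAILED: set(),
    S.REFUNDED: set(),
}


class InvalidTransition(Exception):
    """Raised when the reported state cannot follow the recorded one."""


def apply_transition(current: PaymentState, reported: PaymentState) -> PaymentState:
    # Redelivery of the state we already hold is a no-op, not an error.
    if reported is current:
        return current
    if reported not in ALLOWED_TRANSITIONS[current]:
        raise InvalidTransition(f"{current.value} -> {reported.value}")
    return reported


def can_fulfil(state: PaymentState) -> bool:
    # Fulfil only against captured funds, never against a mere authorisation.
    return state is PaymentState.CAPTURED
```

A rejected transition is a signal to re-read the payment from the processor or park the event in a dead-letter queue, not to force the local record forward.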

Third, token refresh races. When several concurrent requests each detect an expiring OAuth token and each trigger their own refresh, they end up using different token versions, and with refresh-token rotation you can overwrite a good new token with an already-expired one and lose the valid refresh entirely, forcing re-authentication. This is not theoretical: it was filed against Claude Code itself (issue #43392), where parallel agents sharing one OAuth profile raced on refresh and one agent's rotation invalidated another's session with a refresh_token_reused error. The fix is coordination — a single upfront refresh — which an agent generating per-call retry logic will not invent on its own.
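One way to coordinate refresh inside a single process is to put it behind a lock, so that concurrent callers share one refresh instead of each starting their own. The sketch below assumes a hypothetical refresh_fn that takes the current refresh token and returns (access_token, new_refresh_token, ttl_seconds); it does not model any particular provider's API, and agents running as separate processes would need a shared lock (a file or database lock, say) instead.

```python
import threading
import time
from typing import Callable, Optional, Tuple


class TokenManager:
    def __init__(
        self,
        refresh_fn: Callable[[str], Tuple[str, str, float]],  # hypothetical
        refresh_token: str,
        skew_seconds: float = 30.0,
    ) -> None:
        self._refresh_fn = refresh_fn
        self._refresh_token = refresh_token
        self._access_token: Optional[str] = None
        self._expires_at = 0.0
        self._skew = skew_seconds
        self._lock = threading.Lock()

    def get_access_token(self) -> str:
        # Every caller takes the lock; only the first one to find the token
        # missing or near expiry performs the refresh. Later callers see the
        # fresh token and skip the refresh entirely.
        with self._lock:
            now = time.monotonic()
            if self._access_token is None or now >= self._expires_at - self._skew:
                access, new_refresh, ttl = self._refresh_fn(self._refresh_token)
                # Store the rotated refresh token together with the access
                # token, so a stale value can never overwrite a newer one.
                self._access_token = access
                self._refresh_token = new_refresh
                self._expires_at = time.monotonic() + ttl
            return self._access_token
```

Holding the lock across the network call is a deliberate simplification: callers wait briefly, and in exchange the rotated refresh token is used exactly once.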

Fourth, eventual consistency in identity and commerce flows. A user provisioned in an identity provider is not instantly readable in every replica; an order created is not instantly reflected in inventory. Agents generate code that assumes read-after-write consistency because that is the textbook model, and the seam is exactly where that assumption breaks.
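A hedged sketch of the alternative to assuming read-after-write: poll with a bounded number of attempts and exponential backoff, and treat 'not visible yet' as an expected outcome. The read callable, the attempt count and the delays are all illustrative assumptions; the right bounds depend on the replication lag of the system in question.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def read_until_visible(
    read: Callable[[], Optional[T]],  # hypothetical: returns None until the write is visible
    attempts: int = 5,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Retry a read a bounded number of times, doubling the delay each time."""
    for attempt in range(attempts):
        value = read()
        if value is not None:
            return value
        # Back off before the next attempt, but do not sleep after the last one.
        if attempt < attempts - 1:
            sleep(base_delay * 2 ** attempt)
    # Still not visible: the caller decides whether to queue, alert or fail.
    return None
```

Returning None rather than raising keeps the decision with the caller, which is where the business rule for a late replica belongs.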

Why the corpus cannot save the agent

There is a deeper reason this will not be fixed by a larger model. As Nordic APIs argues, LLMs are fundamentally mismatched with API contracts: good APIs are predictable, while LLM output is probabilistic. Models will disregard specifications, ignore authentication and rate-limit constraints, and generate invalid calls, and research has found security API misuse in 70% of code instances, across 20 distinct misuse types. The behaviour at a seam is governed by a contract held by the other party, often under-specified, often contradicted by the actual implementation, and frequently changed without notice. No amount of next-token fluency substitutes for the lived knowledge of how a specific vendor behaves under partial failure. That knowledge is not in the weights because it was never written down in a form the weights could absorb.

Where accountability concentrates, attention must follow

This is the Acceptance Gap made physical. Once generation is abundant, the binding constraint is no longer producing code but accepting it, that is, deciding it is fit to run. And acceptance does not distribute evenly across a diff. It concentrates at the seams, because that is where the model's confidence and its competence diverge most sharply, and where a defect costs a duplicated payment or a locked-out user rather than a failing unit test. The practical implication is uncomfortable for teams optimising for throughput: human review and architectural attention should be deliberately concentrated on integration boundaries and pulled back from the well-trodden interior, not spread uniformly. A reviewer who scrutinises an agent's clean internal module while waving through its webhook handler has the allocation exactly backwards.

This reframes the architect's job, too. Architecture in an agentic world is decision quality at the boundary: which seams need idempotency keys, where the state machine lives and who owns it, how token refresh is coordinated, where you must verify rather than trust. These are decisions, not diagrams, and they are precisely the ones an agent will paper over with confident, plausible, wrong code. The CNCF CloudEvents specification, which standardises the event envelope, helps at the wire level, but the semantics of partial failure remain a human design responsibility.

The conclusion is not that agents should be kept away from integration code. It is that the seam is where human accountability now lives, and where review must be designed to live with it. If you want the discipline that makes this concrete — the practice of reviewing agent output where it matters rather than everywhere equally — start with how we think about Agentic Code Review as the place senior judgement is spent.