Event-Driven Architecture Beyond the Technology

Ask an engineering team why they are adopting event-driven architecture and the answer usually arrives as a product name. Kafka. SNS and SQS. A managed event bus. The conversation is about throughput, partitions and retention, and it is a comfortable conversation because those are tractable, well-documented problems. They are also, in our experience, the easy fifth of the work. The mature view — the one most teams reach only after their second production incident that monitoring never flagged — is that the hard parts of event-driven architecture are not technical at all. They are organisational: ownership, contracts, governance and operational accountability.

This sits squarely on the spine that runs through everything we write about architecture. Architecture is not about diagrams or technology; it is the discipline of making decisions that let an organisation change safely and deliver consistently. Event-driven architecture is one of the cleanest tests of that idea, because the broker will work whatever you decide. The system will compile and pass its smoke tests. Whether it lets the organisation change safely is settled long before, by decisions that have almost nothing to do with the message bus.

The schema is the contract, and a contract needs an owner

Start with the one piece of vendor guidance that is genuinely load-bearing. AWS's own best-practice writing on event-driven architecture (vendor framing) is blunt about it: the producing team owns the event schema and its semantics, and that schema is 'the only contract' — the only sanctioned coupling — between a producer and everything downstream. Get this wrong and you have not built loose coupling; you have built hidden coupling through field meanings nobody wrote down. The corollary, which the same guidance makes explicit, is that producers must run deliberate change management for both breaking and non-breaking changes. A schema without a published change policy is not a contract. It is a promise that can be revoked without notice.

Notice how quickly this stops being a technology question. A schema registry will happily enforce structural compatibility; it has nothing to say about who is allowed to retire a field, how long consumers have to migrate, or what the event actually means when an order is 'confirmed'. Those are ownership and governance decisions. AWS recommends a hybrid model that captures the tension well: decentralise broker ownership to the producing teams so they keep their autonomy, but centralise logging standards and observability strategy across the organisation. Autonomy where it aids change, standardisation where fragmentation would cost you.
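
To make that limit concrete, here is a minimal sketch of the kind of structural check a registry can automate. The schema shape (field name mapped to a type and a required flag) and the names Field and breaking_changes are our own assumptions for illustration; they are not any registry's real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical field description: a type name and a required flag."""
    type: str
    required: bool = True


Schema = dict[str, Field]


def breaking_changes(old: Schema, new: Schema) -> list[str]:
    """List changes that would break a consumer written against `old`."""
    problems = []
    for name, field in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name].type != field.type:
            problems.append(f"type changed: {name}")
        elif field.required and not new[name].required:
            problems.append(f"required field made optional: {name}")
    return problems


v1 = {"order_id": Field("string"), "status": Field("string")}
# Adding an optional field is structurally safe for existing consumers.
v2 = {**v1, "channel": Field("string", required=False)}
# Dropping a field is not.
v3 = {"order_id": Field("string")}

print(breaking_changes(v1, v2))
print(breaking_changes(v1, v3))
```

Note what the check cannot see. If the producer quietly changes what a status of 'confirmed' means, v1 and v2 still compare as compatible. That gap is the part only an owner and a published change policy can close.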

Delivery semantics are not business semantics

Here is the trap that catches careful teams. Most brokers default to at-least-once delivery — they would rather send a message twice than risk losing it. As practitioners have long noted, that means duplicate delivery is not an edge case to be designed around later; it is the contract. Every consumer that mutates business state must therefore be idempotent, and that correctness is the application team's accountability, not the infrastructure's. The broker delivered the message exactly as promised. If processing it twice charged the customer twice, that is not a Kafka bug. It is an architecture decision that was never made.
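
A minimal sketch of what that accountability looks like in code, assuming every event carries a stable unique id and that the consumer can record which ids it has already applied. The PaymentConsumer class, the event fields and the in-memory set are hypothetical; a real consumer would need to persist the processed id in the same transaction as the state change.

```python
from dataclasses import dataclass, field


@dataclass
class PaymentConsumer:
    """Hypothetical consumer that charges a customer at most once per event id."""
    processed_ids: set[str] = field(default_factory=set)
    charges: list[tuple[str, int]] = field(default_factory=list)

    def handle(self, event: dict) -> bool:
        """Apply the event once. Return False if it was a duplicate delivery."""
        event_id = event["id"]
        if event_id in self.processed_ids:
            # At-least-once delivery in action: acknowledge and do nothing.
            return False
        self.charges.append((event["customer"], event["amount_pence"]))
        self.processed_ids.add(event_id)
        return True


consumer = PaymentConsumer()
event = {"id": "evt-001", "customer": "c-42", "amount_pence": 1999}
consumer.handle(event)
consumer.handle(event)  # redelivered by the broker
print(len(consumer.charges))
```

The design choice worth noticing is where the deduplication key comes from. It is part of the event contract, so it is the producer's decision as much as the consumer's.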

This is where the contrarian thread starts to bite. Schema registries and 'at-least-once' guarantees give a powerful, false sense of safety. The platform is green. Lag is zero. Delivery is confirmed. And the business outcome is wrong, because a duplicate was processed, or because a consumer interpreted an ambiguous event differently from how the producer intended it. Infrastructure health and business correctness are different measurements, and conflating them is how accountability quietly evaporates the moment a message lands in a dead-letter queue that nobody owns.

Observability is a design decision, taken early or paid for late

The cost of treating observability as an afterthought is now reasonably well quantified. One 2025 study of event-driven systems (academic) found that roughly 60 per cent of production debugging time was spent simply reconstructing event sequences and causal relationships — not fixing the fault, just working out what happened in which order. The same work reported that introducing 'tracing by default' lifted observable event flows from 32.7 per cent to 99.3 per cent of production events within six months. The lesson is not 'add tracing'. It is that infrastructure tracing and business-level tracing are different things, and only the latter answers the question that matters during an incident: which business decision did this event represent, and what did it cause? Correlation IDs, OpenTelemetry and an event catalogue are operational requirements designed in from the start, not extras bolted on after the first silent failure.
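
As an illustration of why correlation identifiers need to be designed in, here is a small sketch that rebuilds the order of events for one business transaction from metadata alone. The field names correlation_id and causation_id are a common convention that we assume here, not something mandated by any one standard, and the sketch assumes each event causes at most one follow-on event.

```python
from collections import defaultdict


def causal_chains(events: list[dict]) -> dict[str, list[str]]:
    """Group events by correlation id and order each group by causation."""
    by_correlation = defaultdict(list)
    for event in events:
        by_correlation[event["correlation_id"]].append(event)

    chains = {}
    for correlation_id, group in by_correlation.items():
        # Index each event by the id of the event that caused it.
        caused_by = {e["causation_id"]: e for e in group}
        chain, seen = [], set()
        current = caused_by.get(None)  # the root event has no cause
        while current is not None and current["id"] not in seen:
            seen.add(current["id"])  # guard against malformed cycles
            chain.append(current["type"])
            current = caused_by.get(current["id"])
        chains[correlation_id] = chain
    return chains


# Events as they might arrive in a log: out of order.
log = [
    {"id": "e3", "type": "InvoiceIssued", "correlation_id": "c1", "causation_id": "e2"},
    {"id": "e1", "type": "OrderConfirmed", "correlation_id": "c1", "causation_id": None},
    {"id": "e2", "type": "PaymentCaptured", "correlation_id": "c1", "causation_id": "e1"},
]
print(causal_chains(log))
```

Without those two fields on every event, the same reconstruction has to be done by hand from timestamps and guesswork, which is where the debugging time goes.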

A business event is a decision artefact, not a Kafka topic. The broker carries it; it does not own it, define it, or answer for it when it goes wrong.

Boundaries before brokers

If events are how bounded contexts talk to each other, then your event design is really a statement about your team boundaries — whether you meant it to be or not. This is Conway's Law, and DORA's research on loosely coupled teams (analyst) leans on it directly, invoking the Inverse Conway Maneuver: deliberately structuring teams so their communication patterns produce the architecture you actually want. DORA's 2024 work, drawing on more than 39,000 respondents, is also refreshingly unromantic about technology. Loose coupling, it argues, is the ability to make large-scale changes, complete work and deploy independently without fine-grained cross-team coordination — and choosing microservices or an event bus does not, on its own, deliver any of that. It must be engineered, and proven through independent testability and deployability. Pick the broker before you have agreed the boundaries and you tend to get duplicated events, contested ownership and a topic list that mirrors your integration plumbing rather than your domain.

The standards community has been quietly converging on the same diagnosis. CloudEvents graduated in the CNCF in January 2024 (governance), standardising the event envelope and metadata in a vendor-neutral form so handling can be decoupled from any one broker or protocol. AsyncAPI describes the asynchronous interface; event catalogues add discoverability and an explicit answer to 'where are the events and who owns them'. These are excellent, and they resolve the technical layer well. What they cannot do is tell you which business events deserve to exist.
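
For readers who have not seen the envelope, here is a minimal sketch of a CloudEvents-shaped event built as a plain dictionary. The four required context attributes (specversion, id, source, type) come from the CloudEvents 1.0 specification; the helper names, the event type string and the source path are our own assumptions, and the sketch deliberately avoids any SDK.

```python
import uuid
from datetime import datetime, timezone

# Required context attributes in CloudEvents 1.0.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_event(event_type: str, source: str, data: dict) -> dict:
    """Wrap a business payload in a CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": source,
        "type": event_type,
        "time": datetime.now(timezone.utc).isoformat(),  # optional attribute
        "datacontenttype": "application/json",           # optional attribute
        "data": data,
    }


def missing_attributes(event: dict) -> list[str]:
    """Return the required attributes that are absent or empty."""
    return [name for name in REQUIRED_ATTRIBUTES if not event.get(name)]


# Hypothetical event type and source, for illustration only.
event = make_event("com.example.order.confirmed", "/orders", {"order_id": "o-1"})
print(missing_attributes(event))
```

The envelope tells every consumer how to route and parse the event. It says nothing about whether 'order confirmed' is a business event worth publishing, which is the point of the paragraph above.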

The missing translation layer

This is the white space, and it is the heart of our view. The literature splits into two camps that rarely meet. One is technical: Kafka, CloudEvents, AsyncAPI, schema registries. The other is modelling: domain-driven design and Brandolini's EventStorming, exploring which business events matter. Almost nobody connects the full spine of 'Delivery Architecture: The Translation Layer', in which business strategy decides which events are worth caring about, which determines who should own them, which fixes the data contract, which implies a team boundary, which carries operational accountability when the event fails. Skip that spine and events get designed bottom-up, from whatever was convenient to emit at an integration point, rather than top-down from domain decisions. Ownership becomes ambiguous because the event was never anyone's decision in the first place.

This is the same missing translation layer we describe elsewhere — the gap where strategy is meant to become structure and instead becomes plumbing. The practical remedy is to treat event design as an architecture-decision discipline, in the lineage of Nygard's Architecture Decision Records and Harmel-Law's advice process. Keep an ADR-style record per business event covering the questions that actually decide ownership:

Name and meaning: what real-world decision does this event represent, in domain language rather than table names?
Owner: which producing team is accountable for its schema and semantics?
Breaking-change policy: how are breaking and non-breaking changes versioned and communicated, and how long do consumers have to migrate?
Consumer SLAs: what guarantees can downstream teams rely on for freshness, ordering and idempotency?
On-call ownership: who is paged when this event fails or lands in a dead-letter queue? If you cannot name that person, you have not designed an event — you have published a message and hoped.
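
One way to keep such a record honest is to hold it as structured data next to the schema, so that an unanswered question is visible rather than implied. The sketch below is a minimal illustration under our own assumptions: the class name, field names and example values are hypothetical, and a real record would carry more detail than a single string per question.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EventDecisionRecord:
    """Hypothetical ADR-style record: one per business event."""
    name: str
    meaning: str
    owner_team: str
    breaking_change_policy: str
    consumer_slas: str
    on_call_owner: str


def unanswered(record: EventDecisionRecord) -> list[str]:
    """Return the questions that have been left blank."""
    return [f.name for f in fields(record) if not getattr(record, f.name).strip()]


record = EventDecisionRecord(
    name="OrderConfirmed",
    meaning="The customer has committed to buy and payment is authorised.",
    owner_team="orders",
    breaking_change_policy="New major version; consumers get 90 days to migrate.",
    consumer_slas="Published within 5 seconds; ordered per order_id.",
    on_call_owner="",  # nobody named yet
)
print(unanswered(record))
```

A check like this can run in review before a new topic is created, so that an event with no on-call owner never reaches production.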

Why this is about to matter more

There is a fresh and under-appreciated reason to get this right now. Agentic systems and AI consumers make stable, well-owned business events dramatically more load-bearing. A human consumer reading an ambiguous or duplicated event will often notice something is off and pause. An autonomous consumer will not. It will act, confidently, at scale, and fail silently, which is precisely the failure mode our work on the acceptance gap warns about. Narrowing the distance between a plausible action and a correct one depends on contract discipline upstream. If your business events are ambiguous, your AI consumers will industrialise that ambiguity.

So the mature perspective is not anti-Kafka or anti-queue; those choices are largely solved and well served by open standards. It is that the broker is the last decision, not the first. Decide which business events matter, who owns them, what they mean and who answers when they break — and align team topology to those boundaries — before you reach for a product. That ordering is the whole discipline. If you want the principle underneath it, read how we frame architecture as decision-making rather than diagrams, and how those decisions carry an enterprise transformation when the stakes are highest.