Every domain has a body of knowledge that experts trust because they can trace where it came from. A clinician, a surveyor, a compliance officer — none of them accept an answer without knowing the source, the rule that was applied, and where the judgement stops. The interesting question for us is not whether a language model can produce a fluent answer in such a domain. It plainly can. The question is whether you can build a product that a domain expert would stake their reputation on — and that question turns out to be almost entirely about provenance, constraint design and explicit reasoning boundaries. We are not AI experts; we are delivery, architecture and product engineers applying emerging AI to real systems, and this is what we are learning by building.
This matters now because the easy version is everywhere. Point a model at a corpus, wrap it in chat, ship it. It demos beautifully and fails quietly: it cannot show its work, it answers confidently when it should abstain, and nobody can tell a sound inference from a plausible-sounding guess. A domain reasoning product has to do the opposite. It has to be the kind of thing an expert can audit rather than something they must take on trust.
What we explore and build
The pattern we keep returning to puts deterministic logic, not the model, on the decision path. We encode specialist domain knowledge as machine-readable, schema-validated contracts — the rules, the vocabulary, the relationships an expert would recognise — so that the knowledge is explicit, versioned and inspectable rather than implied by weights. On top of that sits a deterministic resolver: given an input, it emits structural signals together with a full evidence chain, and there is no machine learning anywhere in that path. The resolver's output is reproducible and explainable by construction, because the same input always yields the same answer and the same justification.
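To make the shape of such a resolver concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not our production code: the rule identifiers, the "Handbook" citations, the signal names and the record fields are all hypothetical. What it shows is the property described above: every emitted signal carries the rule and source that produced it, and the same input always yields the same output.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Rule:
    rule_id: str                      # hypothetical identifier
    source: str                       # hypothetical citation into a source text
    signal: str                       # structural signal emitted when the rule fires
    applies: Callable[[dict], bool]   # pure predicate, no ML involved

@dataclass(frozen=True)
class Evidence:
    rule_id: str
    source: str
    signal: str

# Hypothetical rules; a real contract would be loaded from versioned, validated files.
RULES = (
    Rule("R1", "Handbook s2.1", "requires_review",
         lambda rec: rec["amount"] > 10_000),
    Rule("R2", "Handbook s3.4", "cross_border",
         lambda rec: rec["origin"] != rec["destination"]),
)

def resolve(record: dict, rules=RULES) -> tuple[list[str], list[Evidence]]:
    """Return the signals and the evidence chain that justifies each one."""
    # Rules are evaluated in a fixed order, so the output is reproducible.
    chain = [Evidence(r.rule_id, r.source, r.signal)
             for r in rules if r.applies(record)]
    signals = [e.signal for e in chain]
    return signals, chain
```

Calling `resolve({"amount": 12_000, "origin": "FR", "destination": "DE"})` returns both signals along with one `Evidence` entry per signal, so "why did it say that?" is answered by reading the chain rather than by guessing.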
Feeding it is an ETL pipeline that mines canonical source texts: optical character recognition, translation where the source is not in the working language, then strict validation before anything is admitted. The validation gate is the point — we check output rather than trust it, and a record that fails the contract fails early and loudly rather than silently corrupting downstream state. The product surface is a multi-tenant API with both native and conversational clients, so the same governed core serves a programmatic integration and a chat interface without the chat layer becoming a side door around the rules. And we hold the whole thing to large test and parity corpora, checked against independent references, so that 'it works' is a measured claim rather than a feeling. Treating those corpora as the real specification is what we mean by The Eval Is the Spec: Why Acceptance Criteria Become Executable Tests in Agentic Delivery.
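A parity check of the kind described above can be sketched in a few lines. This is a simplified illustration, not our actual harness: `toy_resolver`, the corpus entries and the report shape are hypothetical, and in practice the expected values would come from an independent reference rather than being typed in by hand.

```python
from typing import Callable, Iterable

def parity_report(cases: Iterable[tuple[dict, list[str]]],
                  resolver: Callable[[dict], list[str]]) -> dict:
    """Compare resolver output with independently derived expected signals."""
    total = 0
    mismatches = []
    for record, expected in cases:
        total += 1
        actual = resolver(record)
        if actual != expected:
            # Keep the full disagreement so it can be investigated, not just counted.
            mismatches.append({"input": record, "expected": expected, "actual": actual})
    return {"total": total, "passed": total - len(mismatches), "mismatches": mismatches}

# Hypothetical stand-in for the resolver under test.
def toy_resolver(record: dict) -> list[str]:
    return ["requires_review"] if record["amount"] > 10_000 else []

# Hypothetical corpus; the third case encodes a boundary the reference treats differently.
CORPUS = [
    ({"amount": 12_000}, ["requires_review"]),
    ({"amount": 500}, []),
    ({"amount": 10_000}, ["requires_review"]),
]

report = parity_report(CORPUS, toy_resolver)
```

Here the report records two passes out of three and keeps the boundary case as a named mismatch, which is the difference between a measured claim and a feeling: the disagreement with the reference is a concrete record to resolve.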
The model still earns its place. It assists interpretation — reading messy source text, proposing a reading, surfacing candidates a human or the resolver then adjudicates. It is a powerful interpreter sitting outside the decision boundary, never the boundary itself.
What the wider field is finding
We are not alone in pushing the source of truth out of the model. Microsoft's GraphRAG work provides source-grounding provenance with each generated response, letting a user audit the output directly against the original text; in Microsoft Research's own evaluation it consistently outperformed baseline vector retrieval on comprehensiveness and on what they call 'human enfranchisement' — giving people the supporting material to check the claim — as well as on diversity of perspective. That is the same instinct as a resolver that returns evidence chains rather than opaque answers.
The grounding effect is measurable, not aesthetic. A data.world benchmark found that backing a language model with a structured knowledge graph improved its accuracy on enterprise business questions roughly threefold across a forty-three-question set drawn from a real insurance data model. Most striking, the ungrounded model answered the schema-intensive questions correctly zero per cent of the time — the structure, not the model alone, was where correctness came from.
There is also a sharpening lesson here. The CRANE work on constrained generation shows that bolting strict grammars onto the generator can shrink the search space and degrade reasoning — syntactic constraints alone are not enough when you need both valid structure and sound thinking. That is precisely why we keep determinism in validation and resolution rather than welding it onto the model's output. The neuro-symbolic literature points the same way: a 2024 systematic review of the field found that explainability and trustworthiness accounted for only about 28% of papers — 44 of those reviewed — an explicit, measurable gap. The premise of keeping deterministic logic in the loop is sound; most work still under-invests in the verifiability that, for a domain product, is the entire point. This is the heart of what we call Provenance Engineering: Reconstructing Who Decided What When Humans, Agents and Models All Contributed.
What we are learning
The conviction underneath all of this is simple to state and hard to live by.
In a domain reasoning product, the model assists interpretation; it is not the source of truth. Provenance, constraints and explicit reasoning boundaries are the product — the model is a tenant inside them.
Three things follow that we did not fully appreciate until we built them. First, provenance is an architecture, not a feature you add later. The W3C's PROV work already models this domain-agnostically — entities (the state of data), activities (the transformations and lineage) and the agents responsible — as a family of W3C standards. C2PA's Content Credentials does the analogous thing for media: each edit preserves the prior provenance and appends the new change, producing a tamper-evident chain. If you do not design for the chain from the first commit, you cannot retrofit trust onto it.
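The chain idea can be illustrated with a small hash-linked log. This sketch borrows the entity, activity and agent vocabulary from PROV but is neither a PROV serialisation nor a C2PA implementation; the file names, service names and record layout are hypothetical. Each step stores the hash of the previous step, so editing an earlier step breaks verification.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first step

def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same body always hashes the same way.
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

def append_step(chain: list[dict], entity: str, activity: str, agent: str) -> list[dict]:
    """Return a new chain with one more step linked to the previous one."""
    prev = chain[-1]["hash"] if chain else GENESIS
    body = {"entity": entity, "activity": activity, "agent": agent, "prev": prev}
    return chain + [{**body, "hash": _digest(body)}]

def verify(chain: list[dict]) -> bool:
    """Recompute every hash and check each link to its predecessor."""
    prev = GENESIS
    for step in chain:
        body = {k: step[k] for k in ("entity", "activity", "agent", "prev")}
        if step["prev"] != prev or step["hash"] != _digest(body):
            return False
        prev = step["hash"]
    return True

# Hypothetical lineage for one record moving through an OCR, translate, validate pipeline.
chain = []
chain = append_step(chain, "scan-001.png", "ocr", "ocr-service")
chain = append_step(chain, "scan-001.txt", "translate", "translation-service")
chain = append_step(chain, "record-001.json", "validate", "contract-v3")
```

A bare hash chain only detects edits that leave the stored hashes untouched; anyone able to rewrite the whole log could recompute them. A real system would additionally sign the steps or anchor the head hash somewhere the writer cannot alter.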
Second, the constraint is where the domain knowledge actually lives. Treating a schema as an enforceable contract between producer and consumer, validated before application logic runs, forces the expert knowledge into the open — which is uncomfortable and clarifying in equal measure. Vagueness that a model would paper over becomes a validation error you have to resolve.
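As a small illustration of a schema acting as a contract, the sketch below admits a record only if it satisfies every stated rule, and otherwise raises with all violations listed. The `Measurement` type, its fields and the allowed units are hypothetical, chosen only to show how a vague value such as "about twelve paces" becomes an explicit error instead of being smoothed over.

```python
from dataclasses import dataclass

class ContractViolation(ValueError):
    """Raised when a record does not satisfy the contract."""

ALLOWED_UNITS = {"m", "ft"}  # hypothetical controlled vocabulary

@dataclass(frozen=True)
class Measurement:
    parcel_id: str
    length: float
    unit: str

def admit(raw: dict) -> Measurement:
    """Validate a raw record before any application logic sees it."""
    errors = []
    parcel_id = raw.get("parcel_id")
    if not isinstance(parcel_id, str) or not parcel_id:
        errors.append("parcel_id: required non-empty string")
    length = raw.get("length")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(length, bool) or not isinstance(length, (int, float)) or length <= 0:
        errors.append("length: required positive number")
    unit = raw.get("unit")
    if not isinstance(unit, str) or unit not in ALLOWED_UNITS:
        errors.append(f"unit: must be one of {sorted(ALLOWED_UNITS)}, got {unit!r}")
    if errors:
        # Fail early and loudly, reporting every violation at once.
        raise ContractViolation("; ".join(errors))
    return Measurement(parcel_id, float(length), unit)
```

`admit({"parcel_id": "P-1", "length": 12.5, "unit": "m"})` returns a typed `Measurement`, while a record with `"length": "about twelve"` and `"unit": "paces"` raises `ContractViolation` naming both fields. Someone then has to decide what the source actually meant, which is the uncomfortable and clarifying part.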
Third, the reasoning boundary has to be drawn deliberately, and the model must be allowed to abstain. The historical precedent is older than the current wave: the early expert systems of the 1970s ran their rules through an inference engine that could explain its reasoning step by step on demand — transparency was designed in, not bolted on. The frontier now is wrapping probabilistic generation in deterministic verification: recent work computes sound bounds on whether a model's output satisfies a stated constraint. That is the direction we find most credible — let the model interpret, then verify deterministically before anything counts.
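The interpret-then-verify pattern can be sketched as follows. This is far simpler than the bounded-verification research mentioned above: it assumes the model's proposal arrives as a plain string (no model is called here), and the two checks, an ISO date format and literal presence in the source text, are hypothetical examples of deterministic verification. The point it illustrates is that abstaining is a first-class outcome with a stated reason.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Decision:
    status: str             # "accepted" or "abstained"
    value: Optional[str]    # the verified value, or None when abstaining
    reason: str             # why the decision was made

DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")

def adjudicate(proposal: str, source_text: str) -> Decision:
    """Accept a proposed reading only if deterministic checks pass; otherwise abstain."""
    # Check 1: the proposal must have the expected shape.
    if not DATE_PATTERN.fullmatch(proposal):
        return Decision("abstained", None, "proposal is not an ISO date")
    # Check 2: the proposal must be traceable to the source text.
    if proposal not in source_text:
        return Decision("abstained", None, "proposal not found in source text")
    return Decision("accepted", proposal, "well-formed and present in source")
```

With the source text `"Signed on 2021-03-04 in Lyon."`, the proposal `"2021-03-04"` is accepted, while `"2021-03-05"` and `"March 4"` both yield an abstention with a reason attached. Nothing the model proposes counts until the deterministic side has agreed to it.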
An honest note on the stage
These patterns are in production, which means they survive contact with real source data, real tenants and real edge cases — not a sandbox. It does not mean we have a proven AI track record or a battle-tested method; the field is too young for anyone to claim that honestly, and we are early in it. What we have are convictions tested against real building. Some have held: keeping ML off the decision path has repeatedly paid for itself in debuggability and in the ability to answer 'why did it say that?' Others we are still refining — chiefly the OCR-and-correct front of the ETL, where the published research is sobering. Recent work on using language models to post-correct historical transcripts found them anything but reliable at the task, which is exactly why the deterministic validation gate, not the model, earns its keep there daily.
We treat the gap between a model that demos well and a product an expert will stake their name on as the real work. We call it the acceptance gap, and we are closing it in the open rather than claiming to have crossed it.
Where this points
Domain reasoning products are, for us, the proving ground for a larger belief: that the durable value of applied AI sits in the scaffolding around the model — the contracts, the evidence chains, the boundaries, the corpora — far more than in the model choice itself. That is also the bridge from The Governance-to-Value Ratio: provenance and constraints are not compliance overhead, they are what makes the output worth using. We will keep publishing what holds and what breaks as we build, because in a field this young, learning in public is the only honest way to earn trust.