Local AI & Private Engineering Systems

For most teams, "local or hosted?" gets framed as a budget question or an ideological one. We think both framings miss the point. The real question is narrower and more useful: what boundary does this data require, and what model and toolchain actually fit inside that boundary? Once you ask it that way, the architecture stops being a religion and becomes a design problem — one with a clear set of constraints, trade-offs and acceptance criteria. This is an Experiment-stage strand of our Product Labs work: we are seasoned delivery and architecture engineers applying emerging AI to that question, building small private systems, and learning in the open about where the line between local and public should fall. We are not claiming a mature method. We are reporting what we are finding as we build.

What we explore and build

The patterns we work with are deliberately unglamorous, and we describe them generically because that is the honest level of confidence we have. We build on-workstation local models — GPU-backed, running on RTX-class hardware — where the core inference path never touches a cloud provider. We stand up private coding environments that engineers reach only over a private network, so editor context and proprietary logic stay inside a controlled perimeter. We build internal context stores that ground a model in an organisation's own knowledge without that knowledge ever leaving the boundary. And, most often, we build hybrid designs: sensitive material stays local and governed, while public frontier models are used selectively for the work that genuinely benefits from them and carries no data-residency cost.

The tooling here has matured faster than the discourse. A single MIT-licensed runtime such as Ollama wraps llama.cpp and exposes both a native API and an OpenAI-compatible endpoint on localhost, which means a local model can be a near drop-in substitute for a cloud API in a private path. Editor-side assistants like Aider, AGENTS.md-driven agents, and MCP-mediated tool access all run against that local endpoint without a packet leaving the machine or the private subnet. None of this is exotic any more. The interesting work is in the boundary design, not the binaries.

Why this matters now: the industry context

Two forces have converged. The first is regulation moving from advice to obligation. Article 10 of the EU AI Act (Regulation (EU) 2024/1689) requires high-risk AI systems to be developed on datasets that meet defined quality criteria and to be subject to data governance covering collection, preparation, bias examination and gap identification — a binding duty ahead of the 2 August 2026 compliance deadline, not a best-practice nicety. In parallel, NIST published its Generative AI Profile (NIST-AI-600-1) in July 2024, naming data privacy and information security as distinct generative-AI risk categories and offering a catalogue of suggested actions mapped to the AI RMF functions. Auditable, controlled data handling is now something you can be measured against.

The second force is that the quality penalty for staying private has largely evaporated. Meta's Llama 3.2 launch ships small text models engineered for on-device use, explicitly so that running locally keeps data such as messages and calendar information off the cloud. Open-weight coding models have closed the gap at the top: Alibaba's Qwen3 technical report puts its flagship at 70.7 on LiveCodeBench v5 and 2056 on CodeForces, with the team recommending Ollama, LM Studio, MLX and llama.cpp for local deployment. The case for Private by Architecture: Running Local and Self-Hosted Models When Code, Payments or Identity Data Cannot Leave no longer rests on accepting a worse model — which is precisely why the decision has become a boundary decision rather than a capability one.

The risk that makes the boundary real is data egress. By design, cloud coding assistants send editor context — which can include proprietary algorithms and business logic — to vendor servers, and the privacy guarantees vary by plan: consumer and free tiers may use prompts and interactions to improve models, while business and enterprise tiers typically commit not to train on private code. For a regulated client, that distinction is not a hypothetical. The egress itself is the exact leak a network-isolated local environment is built to prevent.

What we are learning

The first thing we are learning is that cost is the wrong argument to lead with. The local-versus-cloud trade-off is utilisation-dependent: a 2026 break-even analysis finds on-premise H100 inference only beats hyperscaler on-demand pricing at roughly 50–83% sustained GPU utilisation, while most production teams run nearer 40–65%. So private deployment is justified by privacy, latency and data-residency control — not by a blanket claim that local is cheaper. That falsifies the naive pitch and, usefully, points straight at hybrid designs as the rational default.

The second thing we are learning is that the controlled boundary has a soft edge people forget: retrieval. Grounding a model in a private knowledge base avoids baking secrets into model parameters, but it widens the security perimeter. A 2026 systematic review of secure retrieval-augmented generation shows a retriever can expose confidential information if access controls are not enforced per user permission, and points to defences such as dynamic access control and encrypted or federated retrieval. An internal context store is only as private as its authorisation layer — the retrieval step has to honour the same permissions as the rest of the system, or the boundary leaks from the inside.

The question was never "local or hosted?" It is: what boundary does the data require — and does the model, the runtime and the retrieval layer all fit honestly inside it?

The third thing we are learning is that this is a recognised market direction, not a fringe preference. Sovereign and private AI is increasingly framed as an explicit architecture and governance pattern — the ability to govern data, infrastructure, models and policy within a chosen legal boundary — and a growing share of regulated enterprises are treating that boundary as a deliberate design choice rather than an afterthought. The hybrid, controlled-boundary-with-selective-public-models design we keep arriving at is the same shape the wider market appears to be converging on.

An honest note on the stage

This is an Experiment, and we want to be precise about what that means. We are not AI experts and we have no long, proven AI track record; the field is young enough that almost nobody honestly does. What we bring is years of delivery, architecture and product engineering — and we are applying that discipline to emerging AI rather than pretending to have mastered it. The patterns here are convictions tested against real building, not battle-tested intellectual property. Some will not survive contact with scale. The boundary that feels right on one engagement may be wrong for the next, and we expect to be corrected by the work. That posture is itself part of the thesis: we would rather show The Acceptance Gap between what AI can demonstrably do and what teams are willing to trust than paper over it with confidence we have not earned.

Where this points

If the question is which boundary the data requires, then the next questions follow naturally: which model belongs at each tier of that boundary, and how does an organisation grow into running these systems safely? Those are the threads we are pulling on next — Choosing Models for Engineering Teams as a disciplined choice rather than a default, and an The AI Engineering Maturity Model that treats private, governed AI as a capability you build up to, not a switch you flip. We will keep building small, keep publishing what breaks, and keep treating "local or hosted?" as the question it never really was.