Choosing Models for Engineering Teams

Most guidance on choosing a model is a leaderboard recap wearing a strategy hat. Pick the top of the table, swap your API key, and wait for the gains. It is a tidy story, and it quietly assumes the thing we have argued is no longer true: that the model is the product. Once generation is abundant, it is not. The binding constraint has moved to acceptance: the judgement that a change is correct, safe and worth shipping. So the first question for an engineering leader is not which model is best. It is what your harness needs a model to do, and where your data is allowed to go.

Start with the unglamorous evidence. On SWE-bench Verified, the 500 human-validated GitHub issues that have become the de facto coding benchmark, the frontier has bunched up. The leaders now cluster between 80 and 94 per cent, while the average across roughly 83 evaluated models sits near 63 per cent. Read those numbers as directional, not precise, because the benchmark is saturating. The practical signal is that the spread between the best models has narrowed to the point where, for most real work, the difference between two top-tier models is smaller than the difference your scaffolding makes.

The model is a commodity. The harness is the product.

That last claim deserves its own line of evidence. The gap between the best and worst agent frameworks on SWE-bench frequently exceeds the gap between the best and worst models. Hold the model constant and change the scaffolding (the planning loop, the tool schemas, the verification gates, the retry behaviour) and performance swings dramatically. Cursor's own CursorBench work in 2026 lands in the same place from a different angle: evaluation methodology and workflow alignment, not raw model swaps, decide perceived quality. The same model on the same task can look excellent or hopeless depending on the rig it sits in.

This is the AI-native operating model in miniature. The scarce step is not producing a candidate change; it is deciding the change is acceptable. Your harness (the verification, the planning loops, the review gates) is the machinery of that decision. It is where your engineering judgement compounds. A better model lifts the ceiling of what a good harness can do. It does not build the harness, and on its own it almost never closes the acceptance gap.

"Which model?" is the wrong first question. The right one is: what does our harness need this model to do, and where can the data go?

Four constraints the leaderboards ignore

Constraint	The question	Why the leaderboard misses it
Use case	What is this model actually for?	A benchmark average hides task fit
Cost at acceptance	What does a shipped change cost?	Cheap tokens ≠ cheap accepted output
Verified context	Can it see your conventions and contracts?	Context moves acceptance more than raw capability
Data boundary	Where may this data legally run?	Jurisdiction is now a first-class axis

The model is a commodity input; choose by the constraints the leaderboards ignore.

If selection is not a capability beauty contest, what is it? A portfolio and routing decision, governed by four constraints that no leaderboard surfaces.

First, cost-at-acceptance, not headline per-token price. Inference economics are collapsing in your favour, so do not over-optimise the wrong number. a16z's LLMflation argument puts the fall at roughly 10x per year for equivalent performance and about 1,000x over three years; the independent measurements from Epoch AI are messier and more useful, with task-dependent annual declines between 9x and 900x, accelerating after January 2024. What matters operationally is not the price of a token but the total cost of getting to an accepted change: tokens spent across retries, verification passes and human review time. A cheaper model that needs three attempts and a careful human read can cost more at acceptance than a pricier one that lands first time. Measure the unit that the business actually pays for.
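The arithmetic behind cost-at-acceptance can be sketched in a few lines. This is a minimal illustration, not a pricing model: the ModelProfile fields, the reviewer rate and every number below are hypothetical, and it assumes attempts are independent, so the expected number of attempts is one over the acceptance rate.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    """Hypothetical per-model figures; every value used below is illustrative."""
    name: str
    price_per_mtok: float     # blended price per million tokens
    tokens_per_attempt: int   # generation plus verification tokens per attempt
    acceptance_rate: float    # share of attempts that end in an accepted change
    review_minutes: float     # human review time spent on each attempt


def cost_at_acceptance(m: ModelProfile, reviewer_rate_per_hour: float) -> float:
    """Expected cost of one accepted change, assuming independent attempts."""
    if not 0 < m.acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    token_cost = m.tokens_per_attempt / 1_000_000 * m.price_per_mtok
    review_cost = m.review_minutes / 60 * reviewer_rate_per_hour
    expected_attempts = 1 / m.acceptance_rate  # mean of a geometric distribution
    return expected_attempts * (token_cost + review_cost)


# Made-up profiles: a cheap model that lands one attempt in three,
# and a model ten times the price that usually lands first time.
cheap = ModelProfile("cheap", price_per_mtok=1.0, tokens_per_attempt=200_000,
                     acceptance_rate=1 / 3, review_minutes=15)
pricey = ModelProfile("pricey", price_per_mtok=10.0, tokens_per_attempt=200_000,
                      acceptance_rate=0.9, review_minutes=10)

for profile in (cheap, pricey):
    print(profile.name, round(cost_at_acceptance(profile, reviewer_rate_per_hour=120.0), 2))
```

Under these invented inputs the cheap model comes out several times more expensive per accepted change, almost entirely because of review time. The point of the sketch is the shape of the calculation; the inputs have to come from your own telemetry.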

Second, effective context, not advertised context. A 128K window on the spec sheet is a marketing figure, not a capability. NVIDIA's RULER benchmark shows most models fall well short of their stated windows once you hold them to a real performance threshold: GPT-4-1106 advertises 128K but is effective to around 64K, and several nominal-128K models are effective only to about 32K. Treat context as something to verify against your own retrieval and prompt patterns, not a number to trust. This is also why context engineering (what you put in front of the model and how you structure it) often outperforms reaching for a model with a bigger advertised window.
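One way to turn "verify, do not trust" into a number is to score the model on your own tasks at increasing prompt lengths and report the largest length that still clears a threshold you choose. The sketch below follows the spirit of that idea rather than RULER's exact procedure; the function name, the threshold and the scores are all hypothetical.

```python
def effective_context(scores_by_length: dict[int, float], threshold: float) -> int:
    """Largest tested length whose score meets the threshold.

    Lengths are walked in ascending order and the walk stops at the first
    length that falls below the threshold. Returns 0 if none pass.
    """
    effective = 0
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break
        effective = length
    return effective


# Hypothetical accuracy on your own retrieval tasks at each prompt length (tokens).
measured = {
    4_000: 0.96,
    8_000: 0.95,
    16_000: 0.93,
    32_000: 0.90,
    64_000: 0.82,
    128_000: 0.70,
}

print(effective_context(measured, threshold=0.85))  # prints 32000
```

With these invented scores a nominal 128K window is effective only to 32K at an 0.85 threshold. The threshold itself is a judgement call and should reflect what your acceptance gates can tolerate.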

Third, tool-use and harness fit. Agentic work lives or dies on how reliably a model calls tools, follows structured-output schemas and behaves inside a planning loop. A model that reasons beautifully in chat but is sloppy with function calls is the wrong choice for an agent, regardless of its benchmark rank. Test the model in your harness, against your tool definitions, not in the abstract.
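Testing a model against your own tool definitions can start very simply: collect the raw tool calls it emits on representative tasks and measure how many are well formed. The sketch below uses only the standard library; the TOOLS table, the call shape (a JSON object with "tool" and "arguments" keys) and the sample outputs are assumptions for illustration, not any vendor's function-calling format.

```python
import json

# Hypothetical tool definitions: tool name -> {argument name: expected Python type}.
TOOLS = {
    "read_file": {"path": str},
    "run_tests": {"target": str, "timeout_s": int},
}


def is_valid_call(raw: str) -> bool:
    """True if raw is JSON naming a known tool with exactly the expected, correctly typed arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name, args = call.get("tool"), call.get("arguments")
    if not isinstance(name, str) or name not in TOOLS or not isinstance(args, dict):
        return False
    spec = TOOLS[name]
    if set(args) != set(spec):
        return False
    # Exact type match, so a JSON boolean is not accepted where an int is expected.
    return all(type(args[key]) is expected for key, expected in spec.items())


def valid_call_rate(outputs: list[str]) -> float:
    """Share of raw model outputs that are valid tool calls (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_valid_call(o) for o in outputs) / len(outputs)


# Invented sample outputs: one valid call, one wrong type, one unknown tool, one chatty reply.
samples = [
    '{"tool": "read_file", "arguments": {"path": "src/app.py"}}',
    '{"tool": "run_tests", "arguments": {"target": "unit", "timeout_s": "60"}}',
    '{"tool": "delete_repo", "arguments": {}}',
    "Sure! I will read the file now.",
]

print(valid_call_rate(samples))  # prints 0.25
```

A validity rate like this is only a first gate. It says nothing about whether the model chose the right tool, which still needs task-level checks inside the harness.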

Fourth, the data boundary. Some workloads cannot leave your perimeter, and that decides cloud-versus-on-prem before capability enters the conversation. Open-weight models (Qwen3 under Apache-2.0, Llama, Mistral) make private and on-premise deployment viable where data residency and confidentiality bind. And you do not always need a frontier model: Thoughtworks' Technology Radar now puts small language models in Assess, noting they are starting to offer better intelligence-per-dollar than large models for tasks like summarisation and basic coding. Routing the right-sized model to each use case beats paying frontier rates for everything.

Selection is a risk decision, and a routing decision

Governance frameworks already treat model choice this way, as risk management rather than capability shopping. NIST's Generative AI Profile catalogues twelve GenAI-specific risks (prompt injection, data poisoning, intellectual property and over-reliance among them) and stresses shared responsibility across model producers, system builders and acquirers. OWASP ranks prompt injection the number-one LLM risk for a second consecutive edition and has added an Agentic Applications Top 10. None of this appears on a leaderboard, yet for an enterprise it often outweighs a few percentage points of benchmark score.

The recommended mitigation also happens to be good engineering: an abstraction layer. Vendor lock-in is a recognised failure mode, and structured-output libraries such as Instructor and Pydantic AI, both in Thoughtworks' Adopt ring, let you keep a stable interface while swapping models behind it. Build the harness once, keep the models replaceable, and route per use case: a small open-weight model on-prem for sensitive summarisation, a frontier model for the hardest agentic refactors, mid-tier models for the broad middle. As prices fall by a task-dependent 9x to 900x a year and the frontier keeps bunching, a swappable architecture means you capture each improvement without re-platforming.
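The routing idea above can be sketched as a thin layer that maps each use case to a replaceable model and enforces the data boundary before any call is made. This is a standard-library illustration, not the Instructor or Pydantic AI API; the Route and Router names, the use-case keys, the model labels and the stub completions are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    model: str                      # hypothetical label, not a real product name
    on_prem: bool                   # True if the model runs inside your perimeter
    complete: Callable[[str], str]  # provider-specific call hidden behind one signature


class Router:
    """Maps use cases to models and refuses to send sensitive data off-prem."""

    def __init__(self, routes: dict[str, Route]) -> None:
        self._routes = routes

    def pick(self, use_case: str, sensitive: bool) -> Route:
        route = self._routes[use_case]
        if sensitive and not route.on_prem:
            raise PermissionError(
                f"{use_case!r} routes off-prem; sensitive data may not leave the perimeter"
            )
        return route

    def run(self, use_case: str, prompt: str, sensitive: bool = False) -> str:
        return self.pick(use_case, sensitive).complete(prompt)


def stub(label: str) -> Callable[[str], str]:
    """Stand-in for a real client call; it only echoes the prompt with a label."""
    return lambda prompt: f"[{label}] {prompt}"


router = Router({
    "sensitive_summary": Route("small-open-weight", on_prem=True, complete=stub("small-open-weight")),
    "hard_refactor": Route("frontier", on_prem=False, complete=stub("frontier")),
    "general": Route("mid-tier", on_prem=False, complete=stub("mid-tier")),
})

print(router.run("sensitive_summary", "summarise the incident report", sensitive=True))
print(router.run("hard_refactor", "split the billing module"))
```

Swapping a model then means changing one Route entry rather than touching callers, and the boundary check fails loudly instead of silently leaking data to an off-prem endpoint.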

So spend your scarce judgement where it pays. Not on the monthly question of which model tops the chart, but on the acceptance harness (verification, planning loops, tool schemas and review gates) and on the routing and data-boundary decisions behind it. The model is a commodity input. The harness is the product. If you want to know where your organisation actually sits on this, our five-stage maturity model, from Experimentation through Assistance, Automation and Agentic Engineering to AI-Native Organisation, is the place to calibrate, and the agentic engineering primer sets out why acceptance, not generation, is the work that remains.