Agentic Engineering Workbench

For two years the industry has been transfixed by generation: how fluently a model can produce code from a prompt. That problem is, for practical purposes, solved well enough. The question we keep running into when we actually build is the inverse one. Once a machine can write a plausible change in seconds, what does it take to accept that change into a system real people depend on? Acceptance is not a single act. It is context, validation, review, governance and trust, layered on top of each other. The Agentic Engineering Workbench is our incubation-stage attempt to treat that layer as the real engineering problem - to build the harness, conventions and gates that let emerging coding agents do useful work inside disciplined delivery, rather than scattering plausible diffs nobody can vouch for. We are seasoned delivery, architecture and product engineers applying AI to our own craft; we are not claiming to have solved it. We are building in the open and reporting what holds.

Why the hard part moved

When generation was scarce, every line a developer wrote carried review attention by default - it was expensive to produce, so it was expensive to ignore. Cheap generation breaks those economics. A model can now emit more code in an afternoon than a team can meaningfully read in a week, and the cost of producing a confident, wrong change has dropped to nearly zero while the cost of catching it has not. This is the gap we keep returning to: the Acceptance Gap between what a model can produce and what an organisation can responsibly absorb. Treating that gap as a tooling and process problem - not a model-quality problem - is the central conviction of this work. Better models narrow some failure modes; they do not, on their own, tell you whether a given change is safe to merge into your repository, against your conventions, with your tests passing.

What we explore and build

The workbench is a set of public-safe patterns we assemble and re-test rather than a finished product. We describe them generically on purpose.

A pair-programming harness that makes weaker, locally hosted models behave reliably through aggressive context optimisation - per-model context windows tuned to each model's real working memory, and curated repository context assembled per task rather than dumped wholesale. The bet is that a smaller model fed the right few thousand tokens can outperform a larger one drowning in irrelevant ones.
Autonomous verify loops: generate, run, feed the resulting error back, and repeat until the tests pass or the loop gives up and asks for a human. The error trace, not the prompt, becomes the steering signal. This is where most of the reliability comes from - and where most of the honest failures show up.
Repository-level instructions and conventions captured as a first-class, machine-readable artefact in the AGENTS.md style - build and dev-environment steps, testing instructions, and commit and PR conventions that an agent reads before it touches anything.
Reusable bootstrap and scaffold templates so an agent can initialise a working environment, a progress-tracking file and a first commit consistently, instead of improvising the same setup badly every time.
Review gates and acceptance metrics: explicit human checkpoints, small-batch changes, and measurement of whether the agentic path actually paid off rather than assuming it did.
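
The verify loop in the list above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a description of any particular harness: `stub_generate` is a hypothetical stand-in for a model call, the names `run_candidate` and `verify_loop` are ours for this example, and the candidate is executed in-process with `exec` purely to keep the sketch short. A real loop would run untrusted output in an isolated process with a timeout.

```python
import traceback
from typing import Callable, Optional

# Hypothetical generator signature: takes the previous error trace (or None on
# the first attempt) and returns candidate source code.
Generator = Callable[[Optional[str]], str]


def run_candidate(code: str, test_code: str) -> Optional[str]:
    """Execute candidate code plus its tests; return None on success, else the trace.

    Assumption: in-process exec keeps the sketch short. A real harness should
    run untrusted output in an isolated process or sandbox, with a timeout.
    """
    namespace: dict = {}
    try:
        exec(code + "\n" + test_code, namespace)
    except Exception:
        return traceback.format_exc()
    return None


def verify_loop(generate: Generator, test_code: str, max_attempts: int = 3) -> dict:
    """Generate, run, feed the error back; stop on passing tests or after max_attempts."""
    error: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        code = generate(error)  # the error trace, not a new prompt, steers the retry
        error = run_candidate(code, test_code)
        if error is None:
            return {"status": "passed", "attempts": attempt, "code": code}
    # Give up and hand over to a person rather than looping forever.
    return {"status": "needs_human", "attempts": max_attempts, "last_error": error}


def stub_generate(error: Optional[str]) -> str:
    """Stand-in for a model call: wrong on the first try, corrected once it sees a trace."""
    if error is None:
        return "def add(a, b):\n    return a - b"
    return "def add(a, b):\n    return a + b"


result = verify_loop(stub_generate, "assert add(2, 3) == 5")
print(result["status"], result["attempts"])  # prints: passed 2
```

The `needs_human` branch is the part that matters most: a loop that cannot give up will eventually find a way to make weak tests pass.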

None of these are exotic. They are the unglamorous scaffolding of context engineering (see Context Engineering: Context Is the New Architecture) and of an agentic SDLC (see The Agentic SDLC Is an Acceptance-Gate Problem), applied to a new class of contributor that happens to be a model. The work is in making them fit together and hold under real repositories.

What the wider evidence says

We are not building in a vacuum, and the external research is sobering in a useful way. Anthropic's context engineering guidance (see Context Engineering: Context Is the New Architecture) frames the model as having a finite attention budget subject to 'context rot' - as the window fills, recall degrades because the number of pairwise attention relationships grows quadratically with length - so the discipline is to curate the smallest set of high-signal tokens that maximises the chance of the outcome you want. That is precisely the principle behind our per-model context windows and curated repository context. Their companion guidance on long-running agents recommends an initialiser that bootstraps the environment on first run and per-session loops that read progress, run a basic test to catch undocumented bugs, and self-verify features before marking them done - which is the backbone of our bootstrap templates and verify loops.
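
To make "the smallest set of high-signal tokens" concrete, here is a minimal sketch of budgeted context selection. Everything in it is an assumption for illustration: the `Snippet` type, the relevance scores (which would come from some retrieval step not shown), the budget figures in `MODEL_BUDGETS`, and the four-characters-per-token estimate, which is a rough rule of thumb and no substitute for the model's real tokenizer.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snippet:
    path: str
    text: str
    relevance: float  # hypothetical score from a retrieval step; higher is better


# Hypothetical per-model working budgets in tokens. Illustrative values, not measurements.
MODEL_BUDGETS: Dict[str, int] = {"small-local": 2000, "large-hosted": 8000}


def estimate_tokens(text: str) -> int:
    """Crude estimate of roughly four characters per token; swap in a real tokenizer."""
    return max(1, len(text) // 4)


def curate_context(
    snippets: List[Snippet], model: str, budgets: Dict[str, int] = MODEL_BUDGETS
) -> Tuple[List[Snippet], int]:
    """Greedily keep the most relevant snippets that still fit the model's budget."""
    budget = budgets[model]
    chosen: List[Snippet] = []
    used = 0
    for snippet in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        cost = estimate_tokens(snippet.text)
        if used + cost <= budget:  # skip anything that would overflow the budget
            chosen.append(snippet)
            used += cost
    return chosen, used
```

Called with the same candidate snippets and a different `model` key, the function returns a different context, which is the point: the budget belongs to the model, not to the task.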

On conventions, AGENTS.md has gone from a convenient idea to a vendor-neutral open standard: adopted by more than 60,000 open-source projects, supported across tools including Cursor, GitHub Copilot, VS Code, Codex, Devin, Jules and Gemini CLI, and now stewarded by the Linux Foundation's Agentic AI Foundation, which was formed on 9 December 2025 and also hosts MCP and goose. Building on that convention rather than a proprietary instruction file is a deliberate, low-regret choice.
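
A bootstrap template built on that convention might look like the sketch below. It is an assumption-laden illustration: AGENTS.md is free-form Markdown, so the section headings here simply mirror the categories named earlier (build and dev environment, testing, commit and PR conventions), and the `PROGRESS.md` filename and the `bootstrap` function are hypothetical. The first commit is deliberately left out so the example has no dependency on a git installation.

```python
from pathlib import Path
from typing import List

# Hypothetical skeleton; AGENTS.md has no required sections.
AGENTS_MD_TEMPLATE = """\
# AGENTS.md

## Build and dev environment
- {build_command}

## Testing
- {test_command}

## Commit and PR conventions
- {commit_convention}
"""

# Hypothetical progress-tracking file an agent reads at the start of each session.
PROGRESS_TEMPLATE = """\
# Progress

- [x] Environment bootstrapped
- [ ] First feature verified
"""


def bootstrap(
    repo_root: Path, build_command: str, test_command: str, commit_convention: str
) -> List[Path]:
    """Write AGENTS.md and PROGRESS.md if missing; never overwrite existing files."""
    repo_root.mkdir(parents=True, exist_ok=True)
    files = {
        "AGENTS.md": AGENTS_MD_TEMPLATE.format(
            build_command=build_command,
            test_command=test_command,
            commit_convention=commit_convention,
        ),
        "PROGRESS.md": PROGRESS_TEMPLATE,
    }
    created: List[Path] = []
    for name, content in files.items():
        target = repo_root / name
        if not target.exists():  # idempotent: a second run creates nothing
            target.write_text(content, encoding="utf-8")
            created.append(target)
    return created
```

Making the function idempotent is the design choice worth keeping: an agent can call it at the start of every session without clobbering conventions a human has since edited.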

The acceptance case is made most forcefully by the people measuring it honestly. METR's randomised trial of 16 experienced open-source developers across 246 tasks on mature repositories found them 19% slower when allowed early-2025 AI tools, even though they forecast a 24% speedup and still believed afterwards that AI had sped them up. METR's February 2026 follow-up on newer tools saw an apparent swing back towards speedup but explicitly called it 'only very weak evidence', citing recruitment and task-selection bias, and is redesigning the experiment. The lesson is not that AI does not help; it is that perceived uplift and measured uplift diverge, so acceptance must be measured, not assumed. The same instinct underwrites the move from raw benchmarks to curated ones: SWE-bench Verified exists because automated grading was unreliable - it is a human-filtered subset of 500 instances, with a large share of the original tasks removed for unclear specifications or tests that could mark valid solutions wrong - and even a curated subset is eventually exposed to data contamination, pushing the field towards private, decontaminated, continuously updated evaluation. For deciding whether a change is safe in your codebase, repository-specific verify loops and your own acceptance tests are a better guide than a public leaderboard score.
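
Measuring rather than assuming can start very small. The sketch below compares a team's forecast against what its own timings show. It is a deliberately naive ratio of mean task times, not METR's methodology, and the timings are invented for illustration. Both figures are expressed as a fractional change in completion time, so a forecast 20% speedup is written as -0.20; that sign convention is our simplification.

```python
from statistics import mean
from typing import Sequence


def time_change(control_minutes: Sequence[float], ai_minutes: Sequence[float]) -> float:
    """Fractional change in mean task time when AI is allowed; positive means slower."""
    return mean(ai_minutes) / mean(control_minutes) - 1.0


def perception_gap(forecast_change: float, measured_change: float) -> float:
    """How far the measured change sits from the forecast, as a fraction of task time."""
    return measured_change - forecast_change


# Hypothetical task timings in minutes, invented for illustration only.
control = [60.0, 90.0, 150.0]  # tasks done without AI tools
with_ai = [70.0, 95.0, 165.0]  # comparable tasks done with AI tools allowed

measured = time_change(control, with_ai)
gap = perception_gap(forecast_change=-0.20, measured_change=measured)
print(f"measured {measured:+.0%}, gap versus forecast {gap:+.0%}")
# prints: measured +10%, gap versus forecast +30%
```

With so few tasks the number is noise, which is itself the lesson: a credible acceptance metric needs comparable tasks, enough of them, and an assignment rule that the people being measured do not choose.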

Where the verify loop genuinely earns its keep, the evidence is encouraging: research on self-improving coding systems that orchestrate generate-run-feed-error-back loops reports gains on curated benchmarks through iterative, failure-conditioned edits. And the organisational truth is plain enough - AI is an amplifier, and without strong version control, small batches and quality internal platforms it can amplify the wrong things just as readily as the right ones. That is review gates and scaffold investment stated as a precondition, not a nicety.
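
Stated as code, a review gate is little more than a list of preconditions that must all hold before a change lands. The sketch below is illustrative only: the `Change` fields, the `blocking_reasons` function and the 200-line small-batch threshold are hypothetical, and a real gate would read these values from the version control and CI systems rather than take them as arguments.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Change:
    lines_changed: int
    tests_passed: bool
    human_approved: bool


def blocking_reasons(change: Change, max_lines: int = 200) -> List[str]:
    """Return every reason the change may not merge; an empty list means it may."""
    reasons: List[str] = []
    if change.lines_changed > max_lines:  # hypothetical small-batch threshold
        reasons.append(f"batch too large: {change.lines_changed} > {max_lines} lines")
    if not change.tests_passed:
        reasons.append("tests did not pass")
    if not change.human_approved:
        reasons.append("no human approval recorded")
    return reasons


print(blocking_reasons(Change(lines_changed=40, tests_passed=True, human_approved=True)))
# prints: []
```

Returning every failing reason, rather than stopping at the first, gives both the agent and the reviewer the full picture in a single pass.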

What we are learning

The recurring lesson across our own building is uncomfortable for anyone selling AI velocity, and it is the thesis we keep testing against real repositories rather than slideware.

Generation is no longer the hard part. The hard part is acceptance - giving a machine the right context, verifying what it produced, reviewing it like work that matters, governing how it lands, and earning the trust to let it run again. The model writes the code; the harness decides whether anyone should keep it.

A second lesson follows from the first: most of the leverage is in subtraction. The biggest reliability wins we have seen come from giving the model less - a tighter context window, a smaller batch, a clearer convention - not from a bigger model or a cleverer prompt. This is the spirit of Beyond Vibe Coding: replacing the seductive single-shot prompt with a closed loop that proves its own output. It is also why we think about an AI engineering maturity model at all (see The AI Engineering Maturity Model) - the difference between teams that benefit from these tools and teams that quietly regress is rarely the model they use; it is whether they have the acceptance machinery around it.

An honest note on the stage

This is Incubation, and we will not dress it up. We are not AI experts and we have no proven AI track record to point to; the field is roughly two years old and the most rigorous studies in it are still arguing about their own methodology. What we bring is years of delivery, architecture and product engineering, applied to emerging AI with the same scepticism we would apply to any new tool that promised to change how software is built. The patterns described here are convictions tested against real building, not battle-tested intellectual property, and some of them will not survive contact with the next twelve months. Local pair-programming harnesses depend on hardware and model trajectories we do not control; verify loops can give false confidence when the tests themselves are weak; acceptance metrics are only as honest as the team reading them.

We are publishing this anyway, because the alternative - quiet claims of mastery - is exactly the posture the evidence punishes. The workbench will keep evolving as we learn which gates earn their cost and which we can automate away. What we are confident about is the direction: the consultancy that wins the agentic era will not be the one with the best model access, but the one that built the most trustworthy path from a generated change to a change a team is willing to own. We are building that path in the open, and we will report what holds and what breaks.