There is a quiet assumption in most agentic-engineering teams: that the bottleneck is the prompt. Get the instruction right, give the agent enough context, and good code falls out the other side. So we obsess over prompt libraries, context windows and system messages, while treating the question of whether the output is actually acceptable as an afterthought - a manual review somewhere downstream. That order is backwards. Once an agent can generate a plausible pull request in seconds, the prompt is no longer the binding artefact. The thing that decides whether the work is done - the evaluation harness - is.
This is the concrete engineering mechanism beneath what we have called the Acceptance Gap: when generation is abundant and nearly free, the constraint moves to acceptance. The economics are now stated plainly by practitioners. Mews engineer Nate Goethel describes the inversion bluntly - "the cost of writing code dropped to near zero... Writing is no longer the bottleneck. Review is" - and adds that "getting a human to sign off with real attention is the new constraint, and our SDLC policies haven't adapted to that." Addy Osmani, writing for O'Reilly Radar, sharpens the same point into a structural deficit he calls comprehension debt: "a junior engineer can now generate code faster than a senior engineer can critically audit it." If acceptance is the constraint, then the artefact that defines acceptance is the spec - and in agentic delivery that artefact is the eval.
From description to verdict
A prompt describes intent. An eval renders a verdict. The difference matters enormously once a non-human author is in the loop, because intent under-determines behaviour. Anthropic's engineering team puts the failure mode precisely: "two engineers reading the same initial spec could come away with different interpretations on how the AI should handle edge cases. An eval suite resolves this ambiguity." Their working definition of a good eval task is one where "two domain experts would independently reach the same pass/fail verdict." Read that again as an engineering requirement rather than a testing nicety: it is the operational definition of an acceptance criterion. A criterion that two competent reviewers would disagree about is not a criterion; it is a preference. The eval forces the criterion to become decidable before generation begins.
This reframes specification itself. GitHub's Spec Kit, open-sourced in September 2025, treats the specification as "the canonical, executable source of truth" and runs generation through four gated phases - Specify, Plan, Tasks, Implement - explicitly to counter unstructured vibe-coding. Microsoft's ASSERT framework, published by its Responsible AI group in June 2026, goes further on the binding step: it is "built on the premise that a behavior specification should be a first-class input to evaluation - not just background context," and exists to automate "the more difficult step" of turning written intent - "a product requirement, a policy document, a system prompt, a launch checklist, or a review note" - into maintainable, executable evaluation suites. The common move across both is to refuse the gap between what we said we wanted and what we can mechanically check.
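To make "what we can mechanically check" concrete, here is a minimal sketch of a written intent turned into decidable cases. Everything in it is an assumption for illustration: the requirement ("URLs should use readable slugs"), the `EvalCase` type, and the `run_eval` and `accepted` helpers are hypothetical and are not taken from Spec Kit, ASSERT or Anthropic's tooling.

```python
import re
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str
    raw_title: str
    expected_slug: str


# Hypothetical written intent: "URLs should use readable slugs".
# Each case pins down an edge case that the sentence alone leaves open.
CASES = [
    EvalCase("lowercases", "Hello World", "hello-world"),
    EvalCase("collapses whitespace", "a   b", "a-b"),
    EvalCase("drops punctuation", "What's new?", "whats-new"),
    EvalCase("empty title falls back", "", "untitled"),
]


def run_eval(candidate: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Return one pass/fail verdict per case, with no reviewer judgement involved."""
    return {c.name: candidate(c.raw_title) == c.expected_slug for c in cases}


def accepted(verdicts: dict[str, bool]) -> bool:
    """A candidate is accepted only if every case passes."""
    return all(verdicts.values())


def slugify(title: str) -> str:
    """One possible implementation, standing in for generated code."""
    words = re.sub(r"[^a-z0-9\s]", "", title.lower()).split()
    return "-".join(words) or "untitled"
```

A naive candidate such as `lambda t: t.lower().replace(" ", "-")` passes the first case and fails the other three. Two reviewers running the suite reach the same verdict, which is exactly the property a preference lacks.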
The team that writes the best evals, not the best prompts, controls quality and velocity. A prompt is a wish; an eval is a contract the system must satisfy before anyone's attention is spent.
The evidence: signal, not instruction, drives quality
The contrarian claim here is that prompt engineering is over-weighted and eval engineering under-weighted. The research is beginning to bear this out. TDFlow (Han et al., October 2025) operationalises tests as executable specification for agents: instead of natural-language descriptions, agents receive concrete test cases, and test-execution outcomes serve as the unambiguous termination criterion - the agent either satisfies the suite or keeps iterating. The approach is evaluated against SWE-bench Verified. The signal, not the instruction, governs when work is finished. A separate Test-Driven Agentic Development study reports an autonomous test-guided improvement loop lifting code generation from 28% to 80% and resolution from 12% to 60%, with zero regression across iterations. The lever was the evaluation feedback, not a cleverer prompt.
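The loop described above, where test outcomes rather than the instruction decide when work stops, can be sketched roughly as follows. This is not TDFlow's implementation: `iterate_until_green`, `failing_tests` and the `scripted_agent` stand-in (which replays three fixed attempts instead of calling a model) are hypothetical names, and the absolute-value task is invented for the example.

```python
from typing import Callable, Optional

Candidate = Callable[[int], int]
Case = tuple[str, int, int]  # (test name, input, expected output)

# Hypothetical task: implement absolute value.
CASES: list[Case] = [("positive", 3, 3), ("negative", -3, 3), ("zero", 0, 0)]


def failing_tests(candidate: Candidate, cases: list[Case]) -> list[str]:
    """Run every test and return the names of those that fail or raise."""
    failures = []
    for name, arg, expected in cases:
        try:
            ok = candidate(arg) == expected
        except Exception:
            ok = False
        if not ok:
            failures.append(name)
    return failures


def iterate_until_green(
    propose: Callable[[list[str]], Candidate],
    cases: list[Case],
    max_rounds: int = 5,
) -> tuple[Optional[Candidate], int]:
    """Ask for candidates until the suite passes or the round budget runs out."""
    feedback: list[str] = []
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)  # the failing test names are the only signal fed back
        feedback = failing_tests(candidate, cases)
        if not feedback:  # termination criterion: an all-green suite
            return candidate, round_no
    return None, max_rounds


def scripted_agent() -> Callable[[list[str]], Candidate]:
    """Stand-in for a real agent: replays three fixed attempts in order."""
    attempts = iter([lambda x: x, lambda x: -x, abs])
    return lambda feedback: next(attempts)
```

With the scripted stand-in, `iterate_until_green(scripted_agent(), CASES)` stops on the third round, when `abs` satisfies all three tests. The round budget is a design choice of this sketch: without it, an agent that never converges would loop indefinitely.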
The cost of omitting that lever is also now quantified. A practitioner survey cited in agentic-TDD research found that 36% of teams using AI code generation skip quality assurance entirely, 18% place uncritical trust in the output, and 10% delegate QA back to the same model that wrote the code. That is the Acceptance Gap rendered as a number: assuming the three groups do not overlap, they add up to 64% of teams, a clear majority with no executable definition of acceptable at all, so acceptance collapses into either endless manual review or, worse, no review. Anthropic's team is candid that rigour here feels like overhead - but argues it is non-optional past prototyping: "once an agent is in production and has started scaling, building without evals starts to break down," leaving teams "stuck in reactive loops - catching issues only in production."
Write the eval before you generate
The practical inversion is simple to state and hard to adopt: author acceptance criteria as runnable evals before generation begins, and treat the eval suite as the real specification of the system. The prompt becomes disposable; the eval persists. Thoughtworks' Liu Shangqi frames spec-driven development as a corrective to generation that is too fast rather than too slow - "vibe coding is too fast, spontaneous and haphazard," producing "too much unmaintainable, defective, one-off code." Writing the eval first is the discipline that slows the right thing down: it forces the team to decide what done means while the decision is still cheap, rather than relitigating it across every generated PR.
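One way to check that an eval really was written before generation is to run it against a do-nothing placeholder first and confirm that it fails. The sketch below assumes a hypothetical requirement (deduplicate a list, keeping first-seen order); `SUITE`, `verdicts`, `suite_has_teeth` and both implementations are illustrative names rather than part of any framework mentioned here.

```python
from typing import Callable

Impl = Callable[[list[int]], list[int]]

# Authored before any implementation exists. This suite is the persistent artefact;
# whichever prompt later produces an implementation is judged against it unchanged.
SUITE: dict[str, Callable[[Impl], bool]] = {
    "removes duplicates": lambda f: f([1, 1, 2]) == [1, 2],
    "keeps first-seen order": lambda f: f([3, 1, 3, 2]) == [3, 1, 2],
    "handles empty input": lambda f: f([]) == [],
}


def verdicts(impl: Impl) -> dict[str, bool]:
    """Evaluate an implementation against every check in the suite."""
    return {name: check(impl) for name, check in SUITE.items()}


def placeholder(items: list[int]) -> list[int]:
    """Do-nothing stand-in used before generation starts."""
    return items


def suite_has_teeth() -> bool:
    """A suite that accepts the placeholder is not yet a specification."""
    return not all(verdicts(placeholder).values())


def dedupe(items: list[int]) -> list[int]:
    """One possible generated implementation; dicts preserve insertion order."""
    return list(dict.fromkeys(items))
```

Here `suite_has_teeth()` holds because the placeholder fails the two duplicate-related checks, while `dedupe` passes all three. Replacing `dedupe` with output from a different prompt changes nothing about what done means.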
None of this makes the spec total, and pretending otherwise is its own hype. Osmani supplies the necessary caveat: tests and specs are necessary but insufficient. "A test suite capable of covering all observable behavior would, in many cases, be more complex than the code it validates," and no spec captures "the enormous number of implicit decisions... that no spec ever fully captures." Liu agrees from the other direction that "spec drift and hallucination are inherently difficult to avoid." So the eval is not a replacement for human judgement; it is a way of rationing it. A good eval suite absorbs the decidable acceptance work so that scarce senior attention is spent only where the spec genuinely cannot reach - the implicit, the architectural, the consequential.
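Rationing judgement in this way can be sketched as a routing rule: the harness settles what is decidable, and a reviewer sees only changes that touch areas the spec cannot reach. The policy below is an assumption made for illustration, including the `Route` and `Change` types and the path prefixes in `JUDGEMENT_PREFIXES`; a real team would choose its own boundaries.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    REJECT = "reject"
    HUMAN_REVIEW = "human_review"
    ACCEPT = "accept"


@dataclass(frozen=True)
class Change:
    evals_passed: bool
    touched_paths: tuple[str, ...]


# Hypothetical areas where passing evals are not considered sufficient on their own.
JUDGEMENT_PREFIXES = ("migrations/", "auth/", "docs/adr/")


def route(change: Change) -> Route:
    """Spend reviewer attention only where the evals cannot decide."""
    if not change.evals_passed:
        return Route.REJECT  # decidable failure: no human time spent
    if any(path.startswith(JUDGEMENT_PREFIXES) for path in change.touched_paths):
        return Route.HUMAN_REVIEW  # consequential area: the evals are necessary, not sufficient
    return Route.ACCEPT
```

For example, `route(Change(True, ("auth/session.py",)))` yields `Route.HUMAN_REVIEW`, while the same paths with failing evals are rejected before anyone reads the diff.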
For delivery leaders, the reallocation is the point. If your investment in agentic engineering goes into prompt scaffolding and your acceptance still depends on a human reading every diff, you have automated generation and left the constraint untouched. Move the craft to the eval. Make acceptance criteria executable, make pass/fail reproducible by two experts, and let the harness - not a reviewer's stamina - be the thing that scales with generation. That is how the Acceptance Gap is closed in practice rather than admired in theory; for the strategic case beneath this engineering mechanism, start with The Acceptance Gap.