AI Engineering · 7 min read · Updated 2026-06-18

The Eval Is the Spec: Why Acceptance Criteria Become Executable Tests in Agentic Delivery

When agents generate code for free, the prompt stops being the binding artefact. The evaluation harness becomes the real specification - and the team that writes the best evals, not the best prompts, controls quality and velocity.

By Priyanka Pandey · Founder & Editorial Lead

Reviewed and challenged by Sanjeev Purohit · Principal, Decision Architecture

Built from

Independent research
Data-backed
Original framework
Reviewed with field experience

Last substantively reviewed · 2026-06-18

Part of Agentic Engineering · The AI Engineering Maturity Model

In brief

When generation is free, the evaluation harness — not the prompt — becomes the binding specification: the team that writes the best evals controls quality and velocity.

A prompt is a wish; an eval is a contract the system must satisfy.
Signal, not instruction, drives quality — verdicts beat descriptions.
Write the eval before you generate.

Best for

Agent-generated changes that must meet a quality bar

Not for

Throwaway exploration where nothing must hold

There is a quiet assumption in most agentic-engineering teams: that the bottleneck is the prompt. Get the instruction right, give the agent enough context, and good code falls out the other side. So we obsess over prompt libraries, context windows and system messages, while treating the question of whether the output is actually acceptable as an afterthought - a manual review somewhere downstream. That order is backwards. Once an agent can generate a plausible pull request in seconds, the prompt is no longer the binding artefact. The thing that decides whether the work is done - the evaluation harness - is.

This is the concrete engineering mechanism beneath what we have called the Acceptance Gap: when generation is abundant and nearly free, the constraint moves to acceptance. The economics are now stated plainly by practitioners. Mews engineer Nate Goethel describes the inversion bluntly - "the cost of writing code dropped to near zero... Writing is no longer the bottleneck. Review is" - and adds that "getting a human to sign off with real attention is the new constraint, and our SDLC policies haven't adapted to that." Addy Osmani, writing for O'Reilly Radar, sharpens the same point into a structural deficit he calls comprehension debt: "a junior engineer can now generate code faster than a senior engineer can critically audit it." If acceptance is the constraint, then the artefact that defines acceptance is the spec - and in agentic delivery that artefact is the eval.

From description to verdict

Dimension	Prompt as the artefact (old)	Eval as the artefact (new)
What binds quality	A description of intent	A verdict — pass/fail evidence
When it is written	Before generation, loosely	Before generation, precisely
Who controls quality	The best prompter	The team with the best evals

When generation is free, the binding artefact stops being the prompt and becomes the eval.

A prompt describes intent. An eval renders a verdict. The difference matters enormously once a non-human author is in the loop, because intent under-determines behaviour. Anthropic's engineering team puts the failure mode precisely: "two engineers reading the same initial spec could come away with different interpretations on how the AI should handle edge cases. An eval suite resolves this ambiguity." Their working definition of a good eval task is one where "two domain experts would independently reach the same pass/fail verdict." Read that again as an engineering requirement rather than a testing nicety: it is the operational definition of an acceptance criterion. A criterion that two competent reviewers would disagree about is not a criterion; it is a preference. The eval forces the criterion to become decidable before generation begins.
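As a minimal sketch of what "decidable" means in practice, take a hypothetical requirement such as "slugs should be URL-friendly". The function name `slugify`, the cases and the `evaluate` helper below are illustrative assumptions, not taken from any of the sources quoted above. The point is only that the vague wording is replaced by input/expected pairs that any two reviewers would score identically.

```python
import re

def slugify(title: str) -> str:
    """Hypothetical candidate implementation (stands in for agent output)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower())
    return slug.strip("-")

# Each case is a decidable criterion: one input, one acceptable output.
ACCEPTANCE_CASES = [
    ("Hello, World!", "hello-world"),
    ("  spaces  everywhere  ", "spaces-everywhere"),
    ("Already-a-slug", "already-a-slug"),
    ("", ""),
]

def evaluate(candidate) -> list[str]:
    """Return failure messages; an empty list is a 'pass' verdict."""
    failures = []
    for given, expected in ACCEPTANCE_CASES:
        actual = candidate(given)
        if actual != expected:
            failures.append(f"{given!r}: expected {expected!r}, got {actual!r}")
    return failures
```

Calling `evaluate(slugify)` returns an empty list, while a candidate that merely lower-cases its input returns one message per violated case. Disagreement about whether the output is "friendly enough" has nowhere to hide: either the list is empty or it is not.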

This reframes specification itself. GitHub's Spec Kit, open-sourced in September 2025, treats the specification as "the canonical, executable source of truth" and runs generation through four gated phases - Specify, Plan, Tasks, Implement - explicitly to counter unstructured vibe-coding. Microsoft's ASSERT framework, published by its Responsible AI group in June 2026, goes further on the binding step: it is "built on the premise that a behavior specification should be a first-class input to evaluation - not just background context," and exists to automate "the more difficult step" of turning written intent - "a product requirement, a policy document, a system prompt, a launch checklist, or a review note" - into maintainable, executable evaluation suites. The common move across both is to refuse the gap between what we said we wanted and what we can mechanically check.

The team that writes the best evals, not the best prompts, controls quality and velocity. A prompt is a wish; an eval is a contract the system must satisfy before anyone's attention is spent.

The evidence: signal, not instruction, drives quality

The contrarian claim here is that prompt engineering is over-weighted and eval engineering under-weighted. The research is beginning to bear this out. TDFlow (Han et al., October 2025) operationalises tests as executable specification for agents: instead of natural-language descriptions, agents receive concrete test cases, and test-execution outcomes serve as the unambiguous termination criterion - the agent either satisfies the suite or keeps iterating, evaluated against SWE-bench Verified. The signal, not the instruction, governs when work is finished. A separate Test-Driven Agentic Development study reports an autonomous test-guided improvement loop lifting code generation from 28% to 80% and resolution from 12% to 60%, with zero regression across iterations. The lever was the evaluation feedback, not a cleverer prompt.
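The idea of test outcomes as the termination criterion can be sketched in a few lines. This is not TDFlow's implementation or API; it is a toy loop under stated assumptions, with a hypothetical `add(a, b)` task and a `fake_agent` that simply replays a fixed sequence of attempts in place of a real model. What it shows is the control flow: the suite decides when the work is finished, and the failures are the only thing fed back.

```python
from typing import Callable, Optional

Candidate = Callable[[int, int], int]

def run_suite(candidate: Candidate) -> list[str]:
    """Executable spec for a hypothetical `add(a, b)`; returns failure messages."""
    cases = [((1, 2), 3), ((-1, 1), 0), ((0, 0), 0)]
    failures = []
    for args, expected in cases:
        actual = candidate(*args)
        if actual != expected:
            failures.append(f"add{args}: expected {expected}, got {actual}")
    return failures

def iterate_until_green(
    propose: Callable[[list[str]], Candidate],
    max_attempts: int = 5,
) -> tuple[Optional[Candidate], int]:
    """Request candidates until the suite passes or the budget runs out.

    `propose` receives the previous failures as feedback; that feedback,
    not a reworded prompt, is what changes between attempts.
    """
    feedback: list[str] = []
    for attempt in range(1, max_attempts + 1):
        candidate = propose(feedback)
        feedback = run_suite(candidate)
        if not feedback:  # the suite, not the author, decides "done"
            return candidate, attempt
    return None, max_attempts

# Stand-in for an agent: a fixed sequence of attempts, worst first.
_attempts = iter([
    lambda a, b: a - b,  # fails the suite
    lambda a, b: a * b,  # fails the suite
    lambda a, b: a + b,  # satisfies the suite
])

def fake_agent(feedback: list[str]) -> Candidate:
    return next(_attempts)

accepted, attempts_used = iterate_until_green(fake_agent)
```

Here the third attempt is accepted. The attempt budget is a design choice worth keeping explicit: without it, a suite the agent cannot satisfy turns into an unbounded loop rather than a clear "not done" verdict.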

The cost of omitting that lever is also now quantified. A practitioner survey cited in agentic-TDD research found that 36% of teams using AI code generation skip quality assurance entirely, 18% place uncritical trust in the output, and 10% delegate QA back to the same model that wrote the code. That is the acceptance gap rendered as a number: if those three groups do not overlap, 64% of teams have no independent, executable definition of acceptable at all, so acceptance collapses into either endless manual review or, worse, no review. Anthropic's team is candid that rigour here feels like overhead - but argues it is non-optional past prototyping: "once an agent is in production and has started scaling, building without evals starts to break down," leaving teams "stuck in reactive loops - catching issues only in production."

Write the eval before you generate

The practical inversion is simple to state and hard to adopt: author acceptance criteria as runnable evals before generation begins, and treat the eval suite as the real specification of the system. The prompt becomes disposable; the eval persists. Thoughtworks' Liu Shangqi frames spec-driven development as a corrective to generation that is too fast rather than too slow - "vibe coding is too fast, spontaneous and haphazard," producing "too much unmaintainable, defective, one-off code." Writing the eval first is the discipline that slows the right thing down: it forces the team to decide what done means while the decision is still cheap, rather than relitigating it across every generated PR.
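One way to picture "the prompt becomes disposable; the eval persists" is a spec that exists as data before any implementation does. The sketch below is an assumption-laden illustration: the `apply_discount` task, the criteria and the `gate` and `exit_code` helpers are all hypothetical, and prices are integer cents to keep comparisons exact. The spec is written first; the candidate at the bottom stands in for whatever an agent later produces.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Criterion:
    """One acceptance criterion, written before any code is generated."""
    name: str
    check: Callable[[Callable[[int, str], int]], bool]

# The persistent artefact: a hypothetical spec for apply_discount(price_cents, tier).
SPEC = [
    Criterion("gold tier gets 20% off", lambda f: f(10_000, "gold") == 8_000),
    Criterion("unknown tier pays full price", lambda f: f(10_000, "none") == 10_000),
    Criterion("price never goes negative", lambda f: f(0, "gold") >= 0),
]

def gate(candidate) -> dict[str, bool]:
    """Run every criterion; a check that raises counts as a failure."""
    verdicts = {}
    for criterion in SPEC:
        try:
            verdicts[criterion.name] = bool(criterion.check(candidate))
        except Exception:
            verdicts[criterion.name] = False
    return verdicts

def exit_code(verdicts: dict[str, bool]) -> int:
    """0 if everything passed, 1 otherwise, so a CI job can fail the build."""
    return 0 if all(verdicts.values()) else 1

# A disposable candidate, standing in for generated code.
def apply_discount(price_cents: int, tier: str) -> int:
    percent_off = {"gold": 20, "silver": 10}.get(tier, 0)
    return price_cents * (100 - percent_off) // 100
```

`exit_code(gate(apply_discount))` is 0 for this candidate and 1 for one that ignores the tier. The candidate can be regenerated as often as needed; `SPEC` is what would be versioned and argued over.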

None of this makes the spec total, and pretending otherwise is its own hype. Osmani supplies the necessary caveat: tests and specs are necessary but insufficient. "A test suite capable of covering all observable behavior would, in many cases, be more complex than the code it validates," and no spec captures "the enormous number of implicit decisions... that no spec ever fully captures." Liu agrees from the other direction that "spec drift and hallucination are inherently difficult to avoid." So the eval is not a replacement for human judgement; it is a way of rationing it. A good eval suite absorbs the decidable acceptance work so that scarce senior attention is spent only where the spec genuinely cannot reach - the implicit, the architectural, the consequential.
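Rationing judgement can be made concrete as a routing rule. The following is a sketch only: the `Change` shape, the route names and the list of path prefixes are hypothetical, and where a team draws the "a green suite is not enough" line is a policy decision this article does not prescribe. It shows the eval absorbing the decidable work and the remaining attention being pointed at areas the team has flagged as needing judgement.

```python
from dataclasses import dataclass
from enum import Enum

class Route(Enum):
    RETURN_TO_AGENT = "return_to_agent"  # evals failed: no human time spent yet
    SENIOR_REVIEW = "senior_review"      # evals passed, but judgement is required
    LIGHT_REVIEW = "light_review"        # evals passed, routine area

@dataclass(frozen=True)
class Change:
    evals_passed: bool
    touched_paths: tuple[str, ...]

# Hypothetical areas where the team has decided a passing suite is not sufficient.
JUDGEMENT_REQUIRED_PREFIXES = ("migrations/", "auth/", "billing/")

def route(change: Change) -> Route:
    """Decide where a generated change goes next."""
    if not change.evals_passed:
        return Route.RETURN_TO_AGENT
    if any(path.startswith(JUDGEMENT_REQUIRED_PREFIXES) for path in change.touched_paths):
        return Route.SENIOR_REVIEW
    return Route.LIGHT_REVIEW
```

A failing change never reaches a reviewer, and a passing change that touches `auth/` still does. The path list is a crude proxy for "the implicit, the architectural, the consequential"; in practice it would need to be maintained as deliberately as the evals themselves.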

For delivery leaders, the reallocation is the point. If your investment in agentic engineering goes into prompt scaffolding and your acceptance still depends on a human reading every diff, you have automated generation and left the constraint untouched. Move the craft to the eval. Make acceptance criteria executable, make pass/fail reproducible by two experts, and let the harness - not a reviewer's stamina - be the thing that scales with generation. That is how the Acceptance Gap is closed in practice rather than admired in theory; for the strategic case beneath this engineering mechanism, start with The Acceptance Gap.

Our perspective

The common view

Better prompts get better AI output.

The Ivaaya view

Prompts do not bind; evals do. The eval harness is the real spec — author it first and it governs quality and speed.

“Good prompts are enough.” A prompt cannot fail a build; an eval can. Only executable acceptance criteria hold quality at agent speed.

If you’re doing this tomorrow

Write the eval / acceptance test before generating.
Invest in eval coverage over prompt tuning.

Where teams go wrong

Tuning prompts instead of writing evals
Accepting plausible output with no executable verdict
Comprehension debt from unreviewed generated code

At a glance

What: Executable acceptance criteria that gate generated work.
Why: Generation is free; the binding artefact is the verdict, not the wish.
When: Agent-generated changes at any scale.
When not: Throwaway exploration.

The evidence & related ideas

What we’ve observed

Spec-driven development (GitHub Spec Kit) and tests-as-specification research (TDFlow) turn acceptance criteria into executable inputs for agents.
Code generation is cheap; review and comprehension are the cost (“comprehension debt”), so leverage moves to the eval.

How certain are we?

The eval, not the prompt, is the binding spec — observed: Seen consistently in our own work.
Review/comprehension is the dominant cost of generated code — established: Observed repeatedly across delivery programmes.

About the author

Priyanka Pandey

Founder & Editorial Lead

Priyanka Pandey founded Ivaaya and leads its editorial voice, translating real delivery experience into practical thinking on AI-native engineering, decision-making and technology leadership. Her work focuses on helping senior leaders make sense of the changes reshaping software delivery without adding to the noise.

Reviewed and challenged by

Sanjeev Purohit

Principal, Decision Architecture

Sanjeev works across enterprise architecture, product strategy and AI-native delivery. The ideas in this article have been challenged against real programmes, production systems and organisational decision-making before publication.

Compare notes

If the eval harness is quietly becoming where your real spec lives — and where arguments about quality actually get settled — we would like to hear how that is playing out for you. Who writes the evals on your team, and do they hold the line?
