The demo was flawless, as demos are. The agent took a vague request, fanned out across three systems, and produced exactly the outcome the salesperson had promised. The room was impressed. Then the prospective buyer asked a quiet question: what happens when it gets that wrong in front of a customer — and can you show me where it decided to do what it did? The slide deck did not have an answer. That question, and a handful like it, tells you more about an agentic-AI product than any demo ever will.
Why the demo is the wrong evidence
A demo is built to show capability on a chosen path; production punishes you on the paths nobody demoed. The questions that predict success are therefore not about what the agent can do at its best, but about what it does at its worst — when it is wrong, when a tool fails, when the data is messy, when an attacker is probing it, and when you want to change vendors. Group your due diligence around five things that actually break: capability, control, evidence, economics and exit.
1. Capability — is it actually an agent, and good at your problem?
Before anything else, establish that you are buying what you think you are. Much of what is sold as agentic is a workflow with a chat interface — a distinction we draw in our piece on agent washing. Then test capability against your reality, not the vendor’s curated case.
- Show me where the agent decides a step you did not pre-author — and where it does not.
- How does it perform on our data and our edge cases, not the demo set? Can we run a time-boxed pilot on a real task?
- What is your evaluation methodology — how do you measure that it works, and can we see results on tasks like ours rather than a public benchmark?
- What does it do when it does not know? Does it stop and ask, or guess confidently?
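One way to make the pilot measurable is to tally outcomes on your own cases, separating "stopped and asked" from "guessed and was wrong". The sketch below is illustrative only: `evaluate`, the outcome labels, and the convention that an agent returns `None` when it declines to answer are all assumptions, not any vendor's API.

```python
from collections import Counter
from typing import Callable, Optional


def evaluate(agent: Callable[[str], Optional[str]],
             cases: list[tuple[str, str]]) -> Counter:
    """Tally pilot outcomes on (prompt, expected_answer) pairs.

    Assumption: the agent returns None when it stops and asks instead of answering.
    """
    tally: Counter = Counter()
    for prompt, expected in cases:
        answer = agent(prompt)
        if answer is None:
            tally["abstained"] += 1          # it stopped and asked
        elif answer == expected:
            tally["correct"] += 1
        else:
            tally["confidently_wrong"] += 1  # it guessed, and the guess was wrong
    return tally
```

The count worth watching is `confidently_wrong`: an agent that abstains on a hard case costs you a review, while one that guesses costs you an incident.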
2. Control — what is it allowed to touch, and who is accountable?
An agent that acts through tools is a new actor inside your systems, with its own potential for damage. The controls a chatbot never needed become essential.
- What identity does the agent act under, and is access scoped to least privilege — or does it borrow a broad, over-privileged token?
- Is authorisation enforced at the level of each action, or only at login?
- Where is the human in the loop for high-impact actions, and can we configure that gate?
- Is there a kill switch, and what is the blast radius if we use it mid-task?
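To make these questions concrete, here is a minimal sketch of what per-action authorisation with a human gate and a kill switch could look like. Every name in it (`ActionPolicy`, `run_action`, the scope strings) is hypothetical; a real product will structure this differently, and the point is only to show which checks you are asking the vendor about.

```python
from dataclasses import dataclass, field
from typing import Callable


class ActionDenied(Exception):
    """Raised when the agent attempts an action its policy does not allow."""


@dataclass
class ActionPolicy:
    # Hypothetical policy object: the scopes the agent's own identity holds,
    # and the high-impact actions that need a human decision.
    allowed_scopes: set[str]
    needs_approval: set[str] = field(default_factory=set)
    killed: bool = False  # kill switch: when True, every action is refused

    def authorise(self, action: str, scope: str,
                  approver: Callable[[str], bool]) -> None:
        # Evaluated on every action, not once at login.
        if self.killed:
            raise ActionDenied(f"kill switch engaged; refused {action}")
        if scope not in self.allowed_scopes:
            raise ActionDenied(f"{action} requires scope {scope!r}")
        if action in self.needs_approval and not approver(action):
            raise ActionDenied(f"human approver rejected {action}")


def run_action(policy: ActionPolicy, action: str, scope: str,
               approver: Callable[[str], bool], tool: Callable[[], str]) -> str:
    """Run a tool call only after the policy check passes."""
    policy.authorise(action, scope, approver)
    return tool()
```

In this sketch the kill switch only blocks future actions. It does nothing about a task that is half finished, which is exactly why the blast-radius question matters.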
3. Evidence — can you reconstruct what it did?
When an agent acts autonomously, the question is not only whether it works but whether you can prove what happened — for debugging, for customers, and increasingly for regulators.
- Show me the audit trail of a run that went wrong: what the agent saw, decided and did, step by step.
- Can we trace a single outcome back to the inputs and actions that produced it?
- How do you support our compliance obligations — logging, human-oversight records, documentation (relevant under regimes like the EU AI Act)?
- How do you detect and surface regressions when the underlying model changes under us?
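As a rough illustration of what "reconstruct what it did" implies, the sketch below records one entry per step and rebuilds a single run in order. The field names, `StepRecord` and `AuditLog` are assumptions made for this example, and the in-memory list stands in for whatever append-only storage a real system would use.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class StepRecord:
    run_id: str
    step: int
    observed: str       # what the agent saw (an input or a tool result)
    decided: str        # the decision it recorded
    action: str         # what it actually did
    model_version: str  # lets you tie a regression to a model change


class AuditLog:
    def __init__(self) -> None:
        self._lines: list[str] = []  # stand-in for append-only storage

    def append(self, record: StepRecord) -> None:
        # Serialise each step as one JSON line; existing lines are never edited.
        self._lines.append(json.dumps(asdict(record), sort_keys=True))

    def trace(self, run_id: str) -> list[StepRecord]:
        """Return every step of one run, in step order."""
        records = [StepRecord(**json.loads(line)) for line in self._lines]
        return sorted((r for r in records if r.run_id == run_id),
                      key=lambda r: r.step)
```

If a vendor cannot produce something equivalent to `trace` for a failed run, the remaining evidence questions are unlikely to have good answers either.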
4. Economics — what does it really cost at your scale?
Agentic workloads do not consume tokens in proportion to the number of requests: an agent may loop, retry or fan out on a hard task, so the demo price is rarely the production price.
- How is pricing structured — per seat, per task, per token — and what does a realistic month at our volume cost?
- How do costs behave under load and on hard tasks, where an agent may loop or fan out? What stops runaway spend?
- What is the all-in cost including the integration, oversight and review effort on our side, not just the licence?
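A back-of-envelope model helps when you ask for a realistic monthly figure. The sketch below assumes, purely for illustration, that each step of a task resends the context accumulated so far, which is one common reason costs grow faster than step count. The function names, the price parameter and `BudgetGuard` are hypothetical; substitute the vendor's actual pricing model.

```python
class BudgetExceeded(Exception):
    """Raised when a run spends more tokens than it was allowed."""


def task_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Input tokens for one task, assuming every step resends the growing context."""
    return sum(base_context + i * added_per_step for i in range(steps))


def monthly_cost(tasks: int, steps: int, base_context: int,
                 added_per_step: int, price_per_million_tokens: float) -> float:
    """Estimated monthly spend on input tokens, in the currency of the price."""
    tokens = tasks * task_tokens(steps, base_context, added_per_step)
    return tokens * price_per_million_tokens / 1_000_000


class BudgetGuard:
    """A simple cap: refuse further work once a run exceeds its allowance."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        self.spent += tokens
        if self.spent > self.max_tokens:
            raise BudgetExceeded(f"spent {self.spent} of {self.max_tokens} tokens")
```

With a 2,000-token starting context and 1,000 tokens added per step, a 5-step task consumes 20,000 input tokens and a 10-step task 65,000: doubling the steps more than triples the tokens. That is the behaviour to ask the vendor to quantify, along with whatever plays the role of `BudgetGuard` in their product.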
5. Exit — can you leave?
The most expensive question to ask late is how you get out. Agentic products embed deeply — in your data, your workflows, and the context you have built up.
- If we leave in two years, what do we keep — our data, our prompts, our evaluation sets, the context layer we have built?
- How much of the value is portable versus locked to your platform and your model choices?
- What is your model strategy — are we tied to one provider, and what happens to us when it changes, deprecates or reprices?
| Dimension | The question the demo dodges | What a weak answer tells you |
|---|---|---|
| Capability | How does it do on OUR data and edge cases? | It was tuned to the demo; expect a gap in production |
| Control | What identity and scope does it act under? | It borrows broad access; a breach is a matter of time |
| Evidence | Show the audit trail of a run that failed | You cannot govern or explain it after the fact |
| Economics | What does a real month at our volume cost? | The bill will surprise you under load |
| Exit | What do we keep if we leave? | You are buying lock-in, not a capability |
Buy the failure mode, not the demo. Any agentic product looks good on the path it was built to show you. What you are actually purchasing is how it behaves on the paths it was not — and whether, when it is wrong, you can see what it did and take control.
Using the checklist
You will rarely find a vendor who answers all five dimensions well; the category is too young. The point is not to demand perfection but to locate the risk: a strong agent with weak evidence is a governance project; strong capability with weak exit is a lock-in decision; great answers everywhere except economics are a budgeting problem. Score each dimension, decide which gaps you can live with, and price the rest into the deal. And start with the two questions that filter fastest: is it even an agent, and can you see what it did when it failed?
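The scoring step can be as simple as the sketch below. The 1-to-5 scale, the threshold of 3 and the function name `assess` are assumptions chosen for illustration; the only part taken from the checklist is the split between gaps you accept and gaps you price into the deal.

```python
DIMENSIONS = ("capability", "control", "evidence", "economics", "exit")


def assess(scores: dict[str, int], tolerated: set[str],
           threshold: int = 3) -> dict[str, list[str]]:
    """Split weak dimensions into gaps you accept and gaps to negotiate on.

    scores: hypothetical 1-5 rating per dimension (5 = strong answer).
    tolerated: dimensions where you have decided a weak answer is acceptable.
    """
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"unscored dimensions: {missing}")
    gaps = [d for d in DIMENSIONS if scores[d] < threshold]
    return {
        "accepted_gaps": [d for d in gaps if d in tolerated],
        "price_into_deal": [d for d in gaps if d not in tolerated],
    }
```

Raising an error on an unscored dimension is deliberate: a dimension nobody asked about is a gap too, just an invisible one.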
Frequently asked
- What should you ask an agentic-AI vendor?
  - Organise questions by what fails in production, across five dimensions: capability (is it really an agent, and good on your data?), control (what can it touch, under what identity, with what human oversight and kill switch?), evidence (can you reconstruct what it did?), economics (what does a real month at your volume cost?) and exit (what do you keep if you leave?). The demo proves capability on a chosen path; due diligence is about everything it skips.
- How do you evaluate an AI agent before buying?
  - Run a time-boxed pilot on a real task with your own data and edge cases, not the vendor’s demo set; ask for their evaluation methodology and results on tasks like yours; and test failure behaviour: what it does when it does not know, and whether you can see and control what it did when it went wrong.
- What are the biggest risks when buying agentic AI?
  - Over-privileged access (the agent acts under a broad token), no audit trail (you cannot explain or govern what it did), runaway cost (agentic workloads do not scale linearly) and lock-in (you cannot leave with your data, prompts and context). Each maps to a due-diligence dimension: control, evidence, economics and exit.
- How do I know if a vendor’s “agent” is really an agent?
  - Ask them to show where it decides a step you did not pre-author and acts through tools to change something, then observes and adapts. If the demo is a tour of a chat interface over a fixed workflow, it is automation relabelled: useful, perhaps, but buy and govern it as automation, not as an agent.