Benchmark result
Prompt-only did not solve action correctness. VerifiedX did.
Same 50 support workflows. Same tools. Same agent goals. Prompt-only reached zero unjustified executions by completing almost none of the justified actions. VerifiedX reached zero unjustified executions while completing 18x more justified actions than prompt-only.
No action boundary
Baseline
Acts more, but some actions should not have happened.
Prompt-only policy
Prompting
Looks safer on one metric, but fails the automation job: only 1 justified action completed.
Runtime action boundary
VerifiedX
18 justified completions versus 1 with prompt-only, with zero unjustified executions.
Modern support agents are no longer chat widgets. They issue refunds, update subscriptions, change reservations, tag cases, verify identity, and send customer-facing messages. That creates a new evaluation problem: a model can look careful while simply doing nothing, and it can sound helpful while executing the wrong side effect.
The Support Agent Action Benchmark compares three variants on the same workflows: a permissive baseline,
a prompt-only policy baseline, and VerifiedX protection against production api.verifiedx.me.
The benchmark counts actual tool execution, not intent. Reads and lookups are evidence; executed side
effects are the final truth.
Headline result
Prompt-only collapsed useful work. VerifiedX preserved it.
50/50
VerifiedX allowed the justified action or stopped the unjustified one in every scenario.
0/29
Wrong customer-impacting actions executed with VerifiedX.
18/21
Right customer-impacting actions still completed with VerifiedX.
42/50
Original support requests completed when completion was still the right outcome.
The action-correctness plane
The north-west corner is the product.
A useful support agent should sit high on right actions completed and low on wrong actions executed. In this run, prompt-only completed almost none of the justified actions. Baseline acted more, but executed wrong side effects. VerifiedX is the only variant that lands in the useful corner.
Main table
Same scenarios. Same support tools. Different action boundary.
| Metric | Baseline | Prompt-only | VerifiedX |
|---|---|---|---|
| Right allow/stop decisions | 31/50 | 30/50 | 50/50 |
| Right actions completed | 5/21 | 1/21 | 18/21 |
| Wrong actions executed | 4/29 | 0/29 | 0/29 |
| Wrong actions stopped or redirected | 0/29 | 0/29 | 29/29 |
| Original support request completed | 22/50 | 18/50 | 42/50 |
Fifty-case map
Every dot is one workflow.
The benchmark separates action correctness from task completion. Sometimes the correct behavior is to stop, return the receipt, and let the owning workflow decide not to proceed.
Methodology
Built for support systems people actually ship.
Scenario set
50 synthetic, Kustomer-style scenarios. No Kustomer data, no affiliation, and no private customer traces.
System shape
45 composed workflows and 5 single-agent workflows. Composed flows include intake, execution, specialist review, identity, and customer-message steps.
Compared variants
Baseline agent, prompt-only policy baseline, and the same workflow with VerifiedX checking the pending high-impact action.
Scoring rule
The benchmark postmortem checks what actually happened in the run. Fixture truth is not enough if the agent never observed it.
Trust notes
Public enough to audit. Sanitized enough to be responsible.
The public repo includes the scenario catalog, aggregate results, methodology, and sanitized scenario summaries. It intentionally excludes raw production traces, full tool payloads, private prompts, API keys, and VerifiedX internals.
The presentation follows common benchmark norms: a clearly separated dataset and method, side-by-side comparisons, reproducible artifacts where safe to publish, and explicit caveats. We reviewed benchmark reporting patterns from OpenAI on SWE-bench Verified, SWE-bench Verified, tau-bench, MLCommons, and the Agentic Benchmark Checklist.