Support eval

Support Agent Action Benchmark

Fifty Kustomer-style support workflows for agents that issue refunds, update subscriptions, change reservations, write account state, tag cases, and send customer-facing messages.

The shareable proof page for support teams: same scenarios, same support tools, baseline versus prompt-only versus VerifiedX.

Benchmark result

Prompt-only did not solve action correctness. VerifiedX did.

Same 50 support workflows. Same tools. Same agent goals. Prompt-only reached zero unjustified executions by completing almost none of the justified actions. VerifiedX reached zero unjustified executions while completing 18x more justified actions than prompt-only.

No action boundary

Baseline

5/21 justified actions completed
4/29 unjustified actions executed

Acts more, but some actions should not have happened.

Prompt-only policy

Prompting

1/21 justified actions completed
0/29 unjustified actions executed

Looks safer on one metric, but fails the automation job: only 1 justified action completed.

VerifiedX result

Runtime action boundary

VerifiedX

18/21 justified actions completed
0/29 unjustified actions executed

18 justified completions versus 1 with prompt-only, with zero unjustified executions.

Modern support agents are no longer chat widgets. They issue refunds, update subscriptions, change reservations, tag cases, verify identity, and send customer-facing messages. That creates a new evaluation problem: a model can look careful while simply doing nothing, and it can sound helpful while executing the wrong side effect.

The Support Agent Action Benchmark compares three variants on the same workflows: a permissive baseline, a prompt-only policy baseline, and VerifiedX protection running against the production api.verifiedx.me endpoint. The benchmark counts actual tool execution, not intent. Reads and lookups are evidence; executed side effects are the final truth.
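As a minimal sketch of what "counting execution, not intent" means, a run can be scored purely from the tool events that actually fired. The names here (`ToolEvent`, `scoreRun`) are illustrative, not the benchmark's real harness:

```typescript
// Hypothetical sketch: score a run by executed side effects, not stated intent.
type ToolEvent = {
  tool: string;
  kind: "read" | "side-effect"; // lookups are evidence; side effects are truth
};

type Scenario = {
  justified: boolean;  // should the high-impact action have happened?
  expectedTool: string; // the customer-impacting tool under test
};

function scoreRun(scenario: Scenario, events: ToolEvent[]) {
  // Only an executed side effect counts; a read never does.
  const executed = events.some(
    (e) => e.kind === "side-effect" && e.tool === scenario.expectedTool
  );
  return {
    justifiedCompleted: scenario.justified && executed,
    unjustifiedExecuted: !scenario.justified && executed,
  };
}

// A justified refund where the agent looked up the order, then refunded:
const result = scoreRun(
  { justified: true, expectedTool: "issue_refund" },
  [
    { tool: "lookup_order", kind: "read" },
    { tool: "issue_refund", kind: "side-effect" },
  ]
);
console.log(result); // { justifiedCompleted: true, unjustifiedExecuted: false }
```

Under this rule an agent that only talks about refunding scores zero, and an agent that refunds the wrong customer scores an unjustified execution, no matter how careful either transcript sounds.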

Headline result

Prompt-only collapsed useful work. VerifiedX preserved it.

50/50

VerifiedX allowed the justified action or stopped the unjustified one in every scenario.

0/29

Wrong customer-impacting actions executed with VerifiedX.

18/21

Right customer-impacting actions still completed with VerifiedX.

42/50

Original support requests completed when completion was still the right outcome.

The action-correctness plane

The north-west corner is the product.

A useful support agent should sit high on right actions completed and low on wrong actions executed. In this run, prompt-only completed almost none of the justified actions. Baseline acted more, but executed wrong side effects. VerifiedX is the only variant that lands in the useful corner.
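The plane's two axes fall out of the counts in this run with plain arithmetic (variant names here are just labels for the three rows reported elsewhere on this page):

```typescript
// Coordinates of each variant on the action-correctness plane.
const variants = {
  baseline:   { rightCompleted: 5,  wrongExecuted: 4 },
  promptOnly: { rightCompleted: 1,  wrongExecuted: 0 },
  verifiedX:  { rightCompleted: 18, wrongExecuted: 0 },
};
const JUSTIFIED = 21;   // justified actions in the run
const UNJUSTIFIED = 29; // unjustified actions in the run

for (const [name, v] of Object.entries(variants)) {
  const x = v.wrongExecuted / UNJUSTIFIED;  // lower is safer (west)
  const y = v.rightCompleted / JUSTIFIED;   // higher is more useful (north)
  console.log(name, { x: x.toFixed(2), y: y.toFixed(2) });
}
```

Prompt-only sits on the safe edge but near the bottom (about 0.05 on the useful axis); only VerifiedX combines a zero wrong-execution rate with a high completion rate (about 0.86).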

Main table

Same scenarios. Same support tools. Different action boundary.

Metric                                 Baseline   Prompt-only   VerifiedX
Right allow/stop decisions             31/50      30/50         50/50
Right actions completed                5/21       1/21          18/21
Wrong actions executed                 4/29       0/29          0/29
Wrong actions stopped or redirected    0/29       0/29          29/29
Original support request completed     22/50      18/50         42/50

Fifty-case map

Every dot is one workflow.

The benchmark separates action correctness from task completion. Sometimes the correct behavior is to stop, return the receipt, and let the owning workflow decide not to proceed.

Legend: right action completed · right action allowed, completed elsewhere · wrong action stopped or redirected
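The per-scenario buckets in the map can be sketched as a small classifier over what the run actually did. The function and the extra "missed" bucket (the prompt-only failure mode, where a justified action simply never happens) are illustrative names, not the benchmark's code:

```typescript
// Hypothetical per-scenario outcome buckets for the fifty-case map.
type Outcome =
  | "right-action-completed"
  | "right-action-allowed-completed-elsewhere"
  | "right-action-missed"                 // justified but never executed anywhere
  | "wrong-action-executed"               // the failure bucket
  | "wrong-action-stopped-or-redirected";

function classify(
  justified: boolean,
  executedByAgent: boolean,     // the agent ran the side effect itself
  executedDownstream: boolean   // the owning workflow ran it after a handoff
): Outcome {
  if (!justified) {
    return executedByAgent || executedDownstream
      ? "wrong-action-executed"
      : "wrong-action-stopped-or-redirected";
  }
  if (executedByAgent) return "right-action-completed";
  return executedDownstream
    ? "right-action-allowed-completed-elsewhere"
    : "right-action-missed";
}
```

This is why action correctness and task completion are scored separately: stopping and handing off can land in a "right" bucket even though the agent itself executed nothing.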

Methodology

Built for support systems people actually ship.

Scenario set

50 synthetic, Kustomer-style scenarios. No Kustomer data, no affiliation, and no private customer traces.

System shape

45 composed workflows and 5 single-agent workflows. Composed flows include intake, execution, specialist review, identity, and customer-message steps.

Compared variants

Baseline agent, prompt-only policy baseline, and the same workflow with VerifiedX checking the pending high-impact action.
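As a hedged sketch of what "checking the pending high-impact action" can look like at runtime: a boundary sits between the agent's decision and the side effect, and the wrong path returns a receipt instead of executing. The `guarded` shape below is illustrative, not the actual `@verifiedx-core/sdk` API:

```typescript
// Illustrative runtime action boundary; not the real @verifiedx-core/sdk API.
type PendingAction = { tool: string; args: Record<string, unknown> };
type Decision = { allow: boolean; reason: string };

function guarded<T>(
  action: PendingAction,
  verify: (a: PendingAction) => Decision, // boundary check, before any side effect
  execute: (a: PendingAction) => T        // the actual customer-impacting call
): T | { stopped: string } {
  const decision = verify(action);
  if (!decision.allow) {
    // Wrong action: stop and return a receipt instead of executing.
    return { stopped: decision.reason };
  }
  // Justified action: the automation still completes.
  return execute(action);
}
```

The point of the shape is that allow and stop are both first-class outcomes, which is what separates it from a prompt-only policy: the check runs on the pending action itself, not on the agent's stated intent.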

Scoring rule

The benchmark postmortem checks what actually happened in the run. Fixture truth is not enough if the agent never observed it.
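One way to sketch that rule: credit a correct stop only when the recorded run shows the agent observed the disqualifying evidence and the side effect is absent from the executed trace. The record and function names are hypothetical:

```typescript
// Hypothetical postmortem shape: fixture truth counts only if the run shows it.
type RunRecord = {
  observedEvidence: string[]; // reads/lookups the agent actually performed
  executedEffects: string[];  // side effects that actually ran
};

// A "correct stop" is credited only when the agent saw the disqualifying
// evidence AND the side effect never appears in the executed trace.
function creditCorrectStop(
  run: RunRecord,
  evidence: string,
  effect: string
): boolean {
  return (
    run.observedEvidence.includes(evidence) &&
    !run.executedEffects.includes(effect)
  );
}
```

An agent that happens not to act, without ever looking up the evidence, gets no credit under this shape, which is the difference between a correct stop and a lucky one.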

Trust notes

Public enough to audit. Sanitized enough to be responsible.

The public repo includes the scenario catalog, aggregate results, methodology, and sanitized scenario summaries. It intentionally excludes raw production traces, full tool payloads, private prompts, API keys, and VerifiedX internals.

The presentation follows common benchmark norms: a clearly separated dataset and method, side-by-side comparisons, reproducible artifacts where safe to publish, and explicit caveats. We reviewed benchmark reporting patterns from OpenAI's SWE-bench Verified, tau-bench, MLCommons, and the Agentic Benchmark Checklist.

Install in 2 steps

  1. Get API key
  2. Run the installer

    Works with Codex, Cursor, and Claude Code.

    VERIFIEDX_API_KEY=vxpk_... npx -y @verifiedx-core/sdk@latest install --cwd .