AI Agent term

Agent Benchmark

A repeatable test set for comparing AI agents on documented tasks.

Definition

An agent benchmark compares multiple agents under shared conditions. A useful benchmark defines tasks, run settings, scoring dimensions, failure tags, and limitations.
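One way to picture the parts listed above is as a small data structure. The following is a minimal Python sketch, not a reference implementation: every name in it (`Task`, `RunSettings`, `RunResult`, `Benchmark`, `validate`, `mean_scores`) and every default value is hypothetical and chosen only for illustration. It assumes each run yields one numeric score per scoring dimension.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Task:
    """One documented task (hypothetical shape)."""
    task_id: str
    prompt: str


@dataclass(frozen=True)
class RunSettings:
    """Settings shared by every agent; the defaults are illustrative assumptions."""
    max_steps: int = 20
    temperature: float = 0.0


@dataclass
class RunResult:
    """Outcome of one agent on one task."""
    agent: str
    task_id: str
    scores: dict[str, float]
    failure_tags: list[str] = field(default_factory=list)


@dataclass
class Benchmark:
    name: str
    tasks: list[Task]
    settings: RunSettings
    scoring_dimensions: list[str]
    failure_tags: list[str]
    limitations: list[str]

    def validate(self, result: RunResult) -> None:
        """Reject results that do not match the documented benchmark."""
        if result.task_id not in {t.task_id for t in self.tasks}:
            raise ValueError(f"unknown task: {result.task_id}")
        missing = set(self.scoring_dimensions) - set(result.scores)
        if missing:
            raise ValueError(f"missing scores: {sorted(missing)}")
        unknown = set(result.failure_tags) - set(self.failure_tags)
        if unknown:
            raise ValueError(f"undocumented failure tags: {sorted(unknown)}")

    def mean_scores(self, results: list[RunResult]) -> dict[str, dict[str, float]]:
        """Average each scoring dimension per agent over the given results."""
        totals: dict[str, dict[str, float]] = {}
        counts: dict[str, int] = {}
        for r in results:
            self.validate(r)
            counts[r.agent] = counts.get(r.agent, 0) + 1
            agent_totals = totals.setdefault(
                r.agent, {d: 0.0 for d in self.scoring_dimensions}
            )
            for d in self.scoring_dimensions:
                agent_totals[d] += r.scores[d]
        return {
            agent: {d: total / counts[agent] for d, total in dims.items()}
            for agent, dims in totals.items()
        }
```

Keeping the limitations and failure tags on the benchmark object itself means a result that uses an undocumented tag or skips a scoring dimension is rejected before it can affect a comparison.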

Why it matters

Without a documented benchmark, rankings cannot be checked or reproduced, so they amount to marketing. Repeatable evaluation helps teams judge whether an agent performs well on the work they actually need done.

Example

AAA.win compares agents on support, writing, and extraction tasks across Chinese, English, Japanese, and Spanish.