English Benchmarks Are Not Enough
Generated from batch: maa-preview-001
AAA.win tested 6 AI agents on 12 multilingual business tasks across 4 languages. This preview report is generated from structured run data and should be edited before publication.
Executive Summary
- Overall winner: Claude Main with an average score of 87.
- Lowest critical-failure rate: 8%, shared by Claude Main, OpenAI Main, Gemini Main, and DeepSeek Main.
- Most common failure mode: unsafe_refund_promise (17 occurrences), followed by literal_translation (16).
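- Strong overall performance does not imply winning every language or task type: Claude Main leads overall but wins two of four languages (Japanese, Spanish) and one of three task types (Support).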
Overall Leaderboard
| Rank | Agent | Score | Pass Rate | Critical Failure Rate | Cost Tier |
|---|---|---|---|---|---|
| 1 | Claude Main | 87 | 94% | 8% | premium |
| 2 | OpenAI Main | 86 | 94% | 8% | premium |
| 3 | Qwen Main | 84 | 92% | 11% | standard |
| 4 | Gemini Main | 80 | 86% | 8% | standard |
| 5 | DeepSeek Main | 79 | 67% | 8% | low |
| 6 | Grok Main | 75 | 42% | 33% | standard |
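The report does not say how the Pass Rate and Critical Failure Rate columns are derived. A minimal sketch, assuming each rate is the share of an agent's runs with the given flag set, rounded to a whole percent, could look like the following. The `RunResult` record and its field names are hypothetical, and the 36-run denominator (12 tasks times 3 runs per task-agent pair) is an assumption about how runs are pooled.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    # Hypothetical record for one run of one agent on one task.
    passed: bool
    critical_failure: bool


def rate_percent(flags):
    """Return the share of True values as a whole-number percentage."""
    flags = list(flags)
    if not flags:
        raise ValueError("no runs to summarize")
    return round(100 * sum(flags) / len(flags))


def summarize(runs):
    """Compute both rates over one agent's runs (assumed to be pooled equally)."""
    return {
        "pass_rate": rate_percent(r.passed for r in runs),
        "critical_failure_rate": rate_percent(r.critical_failure for r in runs),
    }


# Illustrative data only: 36 runs, 33 clean passes and 3 critical failures.
example_runs = (
    [RunResult(passed=True, critical_failure=False)] * 33
    + [RunResult(passed=False, critical_failure=True)] * 3
)
print(summarize(example_runs))  # {'pass_rate': 92, 'critical_failure_rate': 8}
```

Under this assumption, a single critical failure moves an agent's rate by roughly three percentage points, so small differences between agents in this column may reflect only one or two runs.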
Language Winners
| Language | Winner | Score | Critical Failure Rate |
|---|---|---|---|
| Chinese | Qwen Main | 89 | 0% |
| English | OpenAI Main | 93 | 0% |
| Japanese | Claude Main | 87 | 11% |
| Spanish | Claude Main | 89 | 0% |
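One way to see why the overall winner need not win each language is to pick winners per group rather than over the pooled data. The sketch below assumes a winner is the agent with the highest mean score within a group. The row layout, the agent names, and the scores are hypothetical and are not taken from the run data.

```python
from collections import defaultdict


def winners_by_group(rows):
    """Pick the agent with the highest mean score in each group.

    rows: iterable of (group, agent, score) tuples, where group can be a
    language or a task type. Ties go to the agent seen first.
    """
    scores = defaultdict(lambda: defaultdict(list))
    for group, agent, score in rows:
        scores[group][agent].append(score)

    winners = {}
    for group, per_agent in scores.items():
        averages = {agent: sum(vals) / len(vals) for agent, vals in per_agent.items()}
        best = max(averages, key=averages.get)
        winners[group] = (best, averages[best])
    return winners


# Illustrative rows only.
rows = [
    ("English", "Agent A", 90),
    ("English", "Agent A", 92),
    ("English", "Agent B", 94),
    ("Spanish", "Agent A", 88),
    ("Spanish", "Agent B", 80),
]
for group, (agent, avg) in winners_by_group(rows).items():
    print(group, agent, avg)
```

In this toy data, Agent A has the higher pooled average yet loses English to Agent B, which is the same pattern the tables in this report show.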
Task Type Winners
| Task Type | Winner | Score | Critical Failure Rate |
|---|---|---|---|
| Support | Claude Main | 90 | 17% |
| Writing / Localization | OpenAI Main | 89 | 0% |
| Extraction / Analysis | Qwen Main | 88 | 8% |
Failure Modes
| Failure Tag | Count |
|---|---|
| unsafe_refund_promise | 17 |
| literal_translation | 16 |
| weak_cta | 14 |
| unsupported_claim | 10 |
| invalid_json | 7 |
| wrong_intent | 5 |
| hallucinated_signing_date | 4 |
| missing_field | 4 |
| missed_dependency | 3 |
| too_verbose | 3 |
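A tally like the one above can be produced by counting tags across run records. The sketch below assumes each run is a dictionary with an optional `failure_tags` list. That record shape is hypothetical, and the sample runs are illustrative rather than drawn from the batch.

```python
from collections import Counter


def tally_failure_tags(runs):
    """Count failure tags across runs, most frequent first.

    A run may carry several tags or none, so the counts need not sum to
    the number of runs.
    """
    counts = Counter()
    for run in runs:
        counts.update(run.get("failure_tags", []))
    return counts.most_common()


# Illustrative records only.
sample_runs = [
    {"failure_tags": ["unsafe_refund_promise", "weak_cta"]},
    {"failure_tags": ["unsafe_refund_promise"]},
    {"failure_tags": []},
    {},  # a run with no tag field at all
]
print(tally_failure_tags(sample_runs))
# [('unsafe_refund_promise', 2), ('weak_cta', 1)]
```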
Task Results
| Task | Language | Type | Winner | Score | Primary Risk |
|---|---|---|---|---|---|
| Chinese Customer Complaint Triage | Chinese | Support | Qwen Main | 85 | unsafe_refund_promise |
| Chinese App Review Pain Point Summary | Chinese | Writing / Localization | OpenAI Main | 89 | hallucinated_issue |
| Chinese Contract Field Extraction | Chinese | Extraction / Analysis | Qwen Main | 96 | hallucinated_signing_date |
| SaaS Landing Page Hero Rewrite | English | Writing / Localization | OpenAI Main | 93 | generic_ai_copy |
| Meeting Notes Action Item Extraction | English | Extraction / Analysis | OpenAI Main | 89 | discussion_as_action |
| Refund Policy Boundary Reply | English | Support | OpenAI Main | 96 | unsafe_refund_promise |
| Japanese Business Email Politeness Rewrite | Japanese | Writing / Localization | OpenAI Main | 85 | unnatural_japanese |
| Japanese Appointment Intent Classification | Japanese | Support | Claude Main | 92 | wrong_intent |
| Japanese Product Specification Extraction | Japanese | Extraction / Analysis | Qwen Main | 91 | hallucinated_material |
| Spanish Support Reply for Wrong Item | Spanish | Support | Claude Main | 89 | unsafe_refund_promise |
| Spanish Ad Headline Localization | Spanish | Writing / Localization | Claude Main | 92 | literal_translation |
| Spanish Order Confirmation Extraction | Spanish | Extraction / Analysis | Claude Main | 85 | wrong_date_format |
Methodology Snapshot
- Runs per task-agent pair: 3
- Tools enabled: false
- Web browsing enabled: false
- Memory enabled: false
- Scores are computed from five dimensions: task success, language fit, instruction following, business safety, and output reliability.
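The report names the five scoring dimensions but not how they are combined. As a sketch only, the code below assumes each dimension is scored from 0 to 100, the dimensions are weighted equally unless weights are supplied, and the three runs for a task-agent pair are averaged. The function names, the dictionary keys, and the equal weighting are assumptions, not the benchmark's documented formula.

```python
from statistics import mean

# Dimension names follow the methodology list; the snake_case keys are assumed.
DIMENSIONS = (
    "task_success",
    "language_fit",
    "instruction_following",
    "business_safety",
    "output_reliability",
)


def run_score(dimension_scores, weights=None):
    """Weighted average of the five dimension scores for a single run."""
    weights = weights or {d: 1.0 for d in DIMENSIONS}
    missing = [d for d in DIMENSIONS if d not in dimension_scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    total_weight = sum(weights[d] for d in DIMENSIONS)
    return sum(dimension_scores[d] * weights[d] for d in DIMENSIONS) / total_weight


def pair_score(runs, weights=None):
    """Average the per-run scores for one task-agent pair."""
    return mean(run_score(r, weights) for r in runs)


# Illustrative scores only: three runs of one agent on one task.
runs = [
    {d: 90 for d in DIMENSIONS},
    {d: 80 for d in DIMENSIONS},
    {d: 70 for d in DIMENSIONS},
]
print(pair_score(runs))  # 80.0
```

If business safety should count for more than the other dimensions, a `weights` dictionary such as `{"business_safety": 2.0, ...}` can be passed in. Whether the real scoring does this is not stated in the report.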
Publication Notes
- Replace preview seed outputs with exact model outputs before public claims.
- Human-review all critical business-safety failures.
- Confirm model versions, pricing dates, and evaluation dates.
- Keep the vendor policy visible: vendors cannot pay to change scores.