Multilingual Agent Arena

Finden Sie den AI Agent, der in Ihrer Sprache gewinnt.

AAA.win testet Agents auf echter Arbeit in Chinesisch, Englisch, Japanisch und Spanisch.

Rangliste ansehen Berichte lesen

Gesamtrangliste

Sortiert nach mehrsprachiger Geschaeftsleistung, nicht nach Marketingversprechen.

Rank	Agent	Overall	Win rate	Pass rate	Critical	Best language	Best for	Cost
1	Claude Main Anthropic	87	55%	97%	12%	English	Support	premium
2	OpenAI Main OpenAI	86	35%	92%	12%	English	Text	premium
3	Qwen Main Alibaba	84	25%	93%	10%	中文	Extraktion	standard
4	Gemini Main Google	80	0%	82%	12%	English	Extraktion	standard
5	DeepSeek Main DeepSeek	80	5%	70%	7%	中文	Extraktion	low
6	Grok Main xAI	75	0%	37%	27%	English	Text	standard

Wichtige Erkenntnisse

Die nuetzliche Geschichte ist nicht immer der erste Gesamtrang.

English scores did not predict multilingual rank.

Several agents that looked strongest in English were weaker in Chinese support or Japanese business tone.

Support tasks exposed unsafe promises.

The biggest failures were often business-boundary failures, not grammar mistakes.

Japanese writing separated grammar from natural tone.

Correct Japanese was not enough. Natural, concise business phrasing mattered.

Extraction revealed the widest reliability gap.

Valid JSON, null handling, date formats, and missing-field discipline changed rankings.

Language Winners

Find the agent that wins the language you actually work in.

Am besten in 中文

89

Qwen Main

Extraktion7% kritisch

Am besten in English

93

OpenAI Main

Text7% kritisch

Am besten in 日本語

89

Claude Main

Support13% kritisch

Am besten in Español

88

Claude Main

Support13% kritisch

Failure Modes

The most common failures were not always language errors. They were business risks.

literal_translation

26

Preview-Laeufe

unsafe_refund_promise

23

Preview-Laeufe

weak_cta

21

Preview-Laeufe

unsupported_claim

17

Preview-Laeufe

invalid_json

13

Preview-Laeufe

Task Evidence

Every score should lead back to prompts, rubrics, outputs, and failure tags.

Chinese Customer Complaint Triage

Hauptrisiko: unsafe_refund_promise

Sieger: Qwen Main

unsafe_refund_promise

Chinese App Review Pain Point Summary

Hauptrisiko: hallucinated_issue

Sieger: OpenAI Main

hallucinated_issue

Chinese Contract Field Extraction

中文Extraktion

Hauptrisiko: hallucinated_signing_date

Sieger: Qwen Main

hallucinated_signing_date

Chinese Sales Call Summary

中文Extraktion

Hauptrisiko: missed_buying_signal

Sieger: Qwen Main

missed_buying_signal

Chinese Invoice Dispute Reply

Hauptrisiko: unauthorized_credit

Sieger: OpenAI Main

unauthorized_credit

SaaS Landing Page Hero Rewrite

Hauptrisiko: generic_ai_copy

Sieger: OpenAI Main

generic_ai_copy

Alle Aufgaben ansehen

Agent Profiles

Each profile reflects Multilingual Agent Arena #2, not a universal model ranking.

Claude Main

Strong writing and safety boundaries, especially in support tasks.

EnglishSupportpremium

too_verboseoverly_humbleunsafe_refund_promise

OpenAI Main

Strong generalist with balanced writing and support safety.

EnglishTextpremium

missed_dependencygeneric_ai_copyunsafe_refund_promise

Qwen Main

Strong Chinese business language and structured extraction.

中文Extraktionstandard

literal_translationunnatural_japaneseunauthorized_credit

Gemini Main

Reliable extraction profile with mixed localization performance.

EnglishExtraktionstandard

literal_translationwrong_date_formatunsafe_refund_promise

DeepSeek Main

Best value profile for structured extraction and classification.

中文Extraktionlow

weak_ctamissing_fieldhallucinated_issue

Grok Main

Fast outputs with higher variance on business constraints.

EnglishTextstandard

unsafe_refund_promiseunsupported_claiminvalid_json