English scores did not predict multilingual rank.
Several agents that looked strongest in English were weaker in Chinese support or Japanese business tone.
Ranked by multilingual business performance, not marketing copy.
| Rank | Agent | Overall | Win rate | Pass rate | Critical | Best language | Best for | Cost |
|---|---|---|---|---|---|---|---|---|
| 1 | Claude Main Anthropic | 87 | 55% | 97% | 12% | English | Support | premium |
| 2 | OpenAI Main OpenAI | 86 | 35% | 92% | 12% | English | Writing | premium |
| 3 | Qwen Main Alibaba | 84 | 25% | 93% | 10% | Chinese | Extraction | standard |
| 4 | Gemini Main | 80 | 0% | 82% | 12% | English | Extraction | standard |
| 5 | DeepSeek Main DeepSeek | 80 | 5% | 70% | 7% | Chinese | Extraction | low |
| 6 | Grok Main xAI | 75 | 0% | 37% | 27% | English | Writing | standard |
The story that matters in practice is not always the same as the overall #1.
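To make that concrete, here is a minimal sketch that re-sorts the leaderboard rows by a column other than the overall score. The numbers are copied from the table; the `rank_by` helper and the column keys are hypothetical names used only for illustration, and the sketch assumes the "Critical" column is a critical-failure rate where lower is better.

```python
# Rows copied from the leaderboard table:
# (agent, overall, win rate %, pass rate %, critical %)
ROWS = [
    ("Claude Main Anthropic", 87, 55, 97, 12),
    ("OpenAI Main OpenAI", 86, 35, 92, 12),
    ("Qwen Main Alibaba", 84, 25, 93, 10),
    ("Gemini Main", 80, 0, 82, 12),
    ("DeepSeek Main DeepSeek", 80, 5, 70, 7),
    ("Grok Main xAI", 75, 0, 37, 27),
]

# Hypothetical column keys mapped to tuple positions.
COLUMNS = {"overall": 1, "win": 2, "pass": 3, "critical": 4}


def rank_by(rows, column, descending=True):
    """Return agent names ordered by one column (ties keep table order)."""
    index = COLUMNS[column]
    ordered = sorted(rows, key=lambda row: row[index], reverse=descending)
    return [row[0] for row in ordered]


by_overall = rank_by(ROWS, "overall")
by_pass = rank_by(ROWS, "pass")
# Assumption: a lower "Critical" percentage is better, so sort ascending.
by_critical = rank_by(ROWS, "critical", descending=False)

print(by_overall[:3])
print(by_pass[:3])
print(by_critical[:3])
```

Under these assumptions, Qwen moves ahead of OpenAI on pass rate, and DeepSeek leads when the priority is the lowest critical rate, even though it sits fifth overall.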
The biggest failures were often business-boundary failures, not grammar mistakes.
Correct Japanese was not enough. Natural, concise business phrasing mattered.
Valid JSON, null handling, date formats, and missing-field discipline changed rankings.
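As an illustration of what such extraction checks might look like, the sketch below validates one model output for parseable JSON, required fields, explicit nulls, and date format. It is not the arena's actual grader: the function name, the tag strings, the placeholder list, and the assumption that dates should be ISO 8601 (`YYYY-MM-DD`) are all hypothetical.

```python
import json
from datetime import date

# Assumption: these strings count as "made-up" stand-ins for a missing value,
# where the rubric would expect an explicit JSON null instead.
PLACEHOLDERS = ("", "N/A", "unknown")


def check_extraction(raw, required_fields, date_fields=()):
    """Return a list of hypothetical failure tags for one extraction output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(data, dict):
        return ["invalid_json"]

    tags = []
    for field in required_fields:
        if field not in data:
            # Field omitted entirely instead of being present with null.
            tags.append(f"missing_field:{field}")
        elif data[field] in PLACEHOLDERS:
            tags.append(f"placeholder_instead_of_null:{field}")

    for field in date_fields:
        value = data.get(field)
        if value is None:
            continue  # an explicit null is acceptable for an unknown date
        try:
            if not isinstance(value, str):
                raise ValueError("date must be a string")
            date.fromisoformat(value)  # assumed target format: YYYY-MM-DD
        except ValueError:
            tags.append(f"bad_date_format:{field}")
    return tags


print(check_extraction(
    '{"customer": "Acme", "signing_date": "03/05/2024"}',
    required_fields=["customer", "signing_date"],
    date_fields=["signing_date"],
))
```

An output with `"signing_date": null` passes cleanly here, while a date in another format or a missing key produces a tag that can be traced back to the specific output.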
Find the agent that wins the language you actually work in.
The most common failures were not always language errors. They were business risks.
Every score should lead back to prompts, rubrics, outputs, and failure tags.
Top risk: unsafe_refund_promise
Top risk: hallucinated_issue
Top risk: hallucinated_signing_date
Top risk: missed_buying_signal
Top risk: unauthorized_credit
Top risk: generic_ai_copy
Each profile reflects Multilingual Agent Arena #2, not a universal model ranking.
Strong writing and safety boundaries, especially in support tasks.
Strong generalist with balanced writing and support safety.
Strong Chinese business language and structured extraction.
Reliable extraction profile with mixed localization performance.
Best value profile for structured extraction and classification.
Fast outputs with higher variance on business constraints.