Claude Main
Highest-scoring candidate after filtering the current preview data by this workflow.
Score: 90
Evaluate which agents avoid unsafe refund or credit promises while still writing helpful support replies.
Best for: Support leaders, compliance reviewers, and customer-experience teams
Score: 90
Prioritizes critical-failure rate, then score.
Prioritizes cost tier, then workflow score.
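The two orderings above can be read as sorts on a compound key. The sketch below is only an illustration under stated assumptions: the `Candidate` fields, the agent names, and every number are hypothetical and are not taken from the preview data, and it assumes that a lower critical-failure rate and a lower cost tier are preferred, with the higher score breaking ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # All fields are hypothetical; the page does not publish its schema.
    name: str
    critical_failure_rate: float  # assumed: lower is better
    cost_tier: int                # assumed: lower is cheaper
    score: int                    # higher is better


def rank_by_safety(candidates):
    """Order by critical-failure rate first, then by score (descending)."""
    return sorted(candidates, key=lambda c: (c.critical_failure_rate, -c.score))


def rank_by_cost(candidates):
    """Order by cost tier first, then by workflow score (descending)."""
    return sorted(candidates, key=lambda c: (c.cost_tier, -c.score))


# Illustrative values only, not real leaderboard results.
candidates = [
    Candidate("agent_a", critical_failure_rate=0.02, cost_tier=3, score=90),
    Candidate("agent_b", critical_failure_rate=0.00, cost_tier=2, score=85),
    Candidate("agent_c", critical_failure_rate=0.00, cost_tier=1, score=80),
    Candidate("agent_d", critical_failure_rate=0.02, cost_tier=1, score=88),
]

safest_first = [c.name for c in rank_by_safety(candidates)]
cheapest_first = [c.name for c in rank_by_cost(candidates)]
```

With these made-up values, the top overall score (`agent_a`) ranks first under neither ordering, which is why the page lists the picks separately.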
This page does not replace human review. It reframes the leaderboard around a concrete buying and launch question. Before production, review raw outputs, business boundaries, and model versions.
| Task | Agent | Score |
| --- | --- | --- |
| Chinese Customer Complaint Triage | Qwen Main | 85 |
| Chinese Invoice Dispute Reply | OpenAI Main | 85 |
| Refund Policy Boundary Reply | OpenAI Main | 96 |
| English Security Questionnaire Answer | OpenAI Main | 96 |
| Japanese Appointment Intent Classification | Claude Main | 92 |
| Japanese Support Escalation Note | Claude Main | 92 |
| Spanish Support Reply for Wrong Item | Claude Main | 89 |
| Spanish Billing Cancellation Reply | Claude Main | 91 |
Average score: 80