Task Evidence
Each task includes a prompt summary, rubric, primary risk, and task-specific winner.
Chinese Customer Complaint Triage
Primary risk: unsafe_refund_promise
80
Winner: Qwen Main
Chinese App Review Pain Point Summary
Primary risk: hallucinated_issue
82
Winner: OpenAI Main
Chinese Contract Field Extraction
Primary risk: hallucinated_signing_date
82
Winner: Qwen Main
SaaS Landing Page Hero Rewrite
Primary risk: generic_ai_copy
83
Winner: OpenAI Main
Meeting Notes Action Item Extraction
Primary risk: discussion_as_action
83
Winner: OpenAI Main
Refund Policy Boundary Reply
Primary risk: unsafe_refund_promise
85
Winner: OpenAI Main
Japanese Business Email Politeness Rewrite
Primary risk: unnatural_japanese
81
Winner: OpenAI Main
Japanese Appointment Intent Classification
Primary risk: wrong_intent
80
Winner: Claude Main
Japanese Product Specification Extraction
Primary risk: hallucinated_material
83
Winner: Qwen Main
Spanish Support Reply for Wrong Item
Primary risk: unsafe_refund_promise
80
Winner: Claude Main
Spanish Ad Headline Localization
Primary risk: literal_translation
81
Winner: Claude Main
Spanish Order Confirmation Extraction
Primary risk: wrong_date_format
82
Winner: Claude Main
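The per-task results listed above can be tallied with a short script. The following is a minimal sketch, not part of the original evaluation tooling: the names `TASKS`, `wins_by_model`, and `tasks_by_risk` are hypothetical, and the records are transcribed by hand from the task list. The unlabeled number shown for each task is left out because the section does not say what it measures.

```python
from collections import Counter

# (task, primary_risk, winner) records transcribed from the task list.
# The tuple layout and all names here are illustrative assumptions.
TASKS = [
    ("Chinese Customer Complaint Triage", "unsafe_refund_promise", "Qwen Main"),
    ("Chinese App Review Pain Point Summary", "hallucinated_issue", "OpenAI Main"),
    ("Chinese Contract Field Extraction", "hallucinated_signing_date", "Qwen Main"),
    ("SaaS Landing Page Hero Rewrite", "generic_ai_copy", "OpenAI Main"),
    ("Meeting Notes Action Item Extraction", "discussion_as_action", "OpenAI Main"),
    ("Refund Policy Boundary Reply", "unsafe_refund_promise", "OpenAI Main"),
    ("Japanese Business Email Politeness Rewrite", "unnatural_japanese", "OpenAI Main"),
    ("Japanese Appointment Intent Classification", "wrong_intent", "Claude Main"),
    ("Japanese Product Specification Extraction", "hallucinated_material", "Qwen Main"),
    ("Spanish Support Reply for Wrong Item", "unsafe_refund_promise", "Claude Main"),
    ("Spanish Ad Headline Localization", "literal_translation", "Claude Main"),
    ("Spanish Order Confirmation Extraction", "wrong_date_format", "Claude Main"),
]


def wins_by_model(tasks):
    """Count how many tasks each model won."""
    return Counter(winner for _, _, winner in tasks)


def tasks_by_risk(tasks):
    """Group task names under their primary risk label."""
    grouped = {}
    for task, risk, _ in tasks:
        grouped.setdefault(risk, []).append(task)
    return grouped


if __name__ == "__main__":
    for model, wins in wins_by_model(TASKS).most_common():
        print(f"{model}: {wins}")
```

With these twelve records, the tally is OpenAI Main 5, Claude Main 4, and Qwen Main 3. Grouping by risk shows that `unsafe_refund_promise` is the only primary risk shared by more than one task (three tasks).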