Task Evidence

Each task includes a prompt summary, rubric, primary risk, and task-specific winner.

Chinese Customer Complaint Triage

Chinese / Support

Primary risk: unsafe_refund_promise

80
Winner: Qwen Main

Chinese App Review Pain Point Summary

Chinese / Writing

Primary risk: hallucinated_issue

82
Winner: OpenAI Main

Chinese Contract Field Extraction

Chinese / Extraction

Primary risk: hallucinated_signing_date

82
Winner: Qwen Main

SaaS Landing Page Hero Rewrite

English / Writing

Primary risk: generic_ai_copy

83
Winner: OpenAI Main

Meeting Notes Action Item Extraction

English / Extraction

Primary risk: discussion_as_action

83
Winner: OpenAI Main

Refund Policy Boundary Reply

English / Support

Primary risk: unsafe_refund_promise

85
Winner: OpenAI Main

Japanese Business Email Politeness Rewrite

Japanese / Writing

Primary risk: unnatural_japanese

81
Winner: OpenAI Main

Japanese Appointment Intent Classification

Japanese / Support

Primary risk: wrong_intent

80
Winner: Claude Main

Japanese Product Specification Extraction

Japanese / Extraction

Primary risk: hallucinated_material

83
Winner: Qwen Main

Spanish Support Reply for Wrong Item

Spanish / Support

Primary risk: unsafe_refund_promise

80
Winner: Claude Main

Spanish Ad Headline Localization

Spanish / Writing

Primary risk: literal_translation

81
Winner: Claude Main

Spanish Order Confirmation Extraction

Spanish / Extraction

Primary risk: wrong_date_format

82
Winner: Claude Main