Extraction is where automation breaks
A fluent answer is not enough when the downstream system expects valid JSON, stable field names, correct dates, and honest null values. Extraction tasks make reliability measurable.
- valid_json is necessary but not sufficient.
- wrong_date_format can break operational workflows.
- hallucinated fields are worse than explicit missing values.
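The three checks above can be sketched in a few lines of Python. This is a minimal illustration, not how AAA.win implements its scoring: the schema (EXPECTED_FIELDS, DATE_FIELDS), the YYYY-MM-DD date format, and the issue labels other than wrong_date_format are assumptions made for the example.

```python
import json
from datetime import datetime

# Hypothetical schema for illustration; field names and date format are assumptions.
EXPECTED_FIELDS = {"invoice_id", "customer", "due_date"}
DATE_FIELDS = {"due_date"}
DATE_FORMAT = "%Y-%m-%d"


def check_extraction(raw: str) -> list[str]:
    """Return a list of issue labels; an empty list means every check passed."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid_json"]
    if not isinstance(record, dict):
        return ["not_an_object"]

    issues = []
    # Fields the schema never asked for are treated as hallucinated.
    for name in sorted(set(record) - EXPECTED_FIELDS):
        issues.append(f"hallucinated_field:{name}")
    # Every expected field must be present; null is the honest way to say "unknown".
    for name in sorted(EXPECTED_FIELDS - set(record)):
        issues.append(f"missing_field:{name}")
    # Date fields must be null or a string in the agreed format.
    for name in sorted(DATE_FIELDS & set(record)):
        value = record[name]
        if value is None:
            continue  # an explicit null is acceptable
        try:
            if not isinstance(value, str):
                raise ValueError("date must be a string")
            datetime.strptime(value, DATE_FORMAT)
        except ValueError:
            issues.append(f"wrong_date_format:{name}")
    return issues
```

Note that a record with "due_date": null passes, while a record that invents a "total" field fails even though it is valid JSON. That is the sense in which valid JSON is necessary but not sufficient.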
How AAA.win scores extraction
Extraction tasks are judged on five criteria: task success, language fit, instruction following, business safety, and output reliability. A top-scoring output should survive both human review and machine parsing.
Best production practice
Pair agent selection with schema validation, retry rules, null handling, and human review for high-value records. The benchmark helps choose the candidate; production controls keep it safe.
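As a rough sketch of how these controls might fit together, the Python below wraps an agent call with a retry budget, a required-field check, and a review flag for high-value records. Everything named here is hypothetical: call_agent stands in for whichever agent the benchmark selected, and MAX_ATTEMPTS, REVIEW_THRESHOLD, and the "amount" field are assumed values chosen for illustration.

```python
import json
from typing import Callable

MAX_ATTEMPTS = 3           # assumed retry budget
REVIEW_THRESHOLD = 10_000  # assumed cutoff for a "high-value" record


def extract_with_controls(
    call_agent: Callable[[str], str],  # hypothetical: takes a document, returns raw agent output
    document: str,
    required: set[str],
) -> dict:
    """Retry on unusable output; route failures and high-value records to human review."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        raw = call_agent(document)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # retry rule: unparseable output costs one attempt
        if isinstance(record, dict) and required <= set(record):
            # Null handling: a null amount stays null; it is never replaced by a guess.
            amount = record.get("amount")
            high_value = isinstance(amount, (int, float)) and amount >= REVIEW_THRESHOLD
            status = "needs_review" if high_value else "accepted"
            return {"status": status, "record": record, "attempts": attempt}
    # Retry budget exhausted: hand the document to a person instead of guessing.
    return {"status": "needs_review", "record": None, "attempts": MAX_ATTEMPTS}
```

The design choice worth noting is that exhausting the retry budget never yields a fabricated record. The failure is surfaced to a reviewer, which mirrors the preference for explicit missing values over hallucinated ones.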