Definition
Failure tags make benchmark results easier to audit. Instead of only showing a score, they identify recurring problems such as literal translation, invalid JSON, missing fields, or unsupported claims.
A label that explains what went wrong in an agent output.
Failure tags make benchmark results easier to audit. Instead of only showing a score, they identify recurring problems such as literal translation, invalid JSON, missing fields, or unsupported claims.
Tags help teams decide where to inspect raw outputs and where to add guardrails.
literal_translation means the output translated words but missed local tone or market convention.