Failures that matter in production
The most expensive failures are often not grammar mistakes. They are unsafe promises, invented fields, unsupported security claims, broken JSON, and local-language answers that sound unnatural.
- literal_translation shows localization risk.
- unsafe_refund_promise shows policy-boundary risk.
- invalid_json and missing_field show automation risk.
How to read failure tags
Failure tags are audit leads. A tag count tells you where to inspect raw outputs, not where to stop thinking. High-risk tags should trigger human review and workflow-specific retesting.
Best next test
Build a small red-team set from your own support, writing, and extraction workflows. Include edge cases where the agent is tempted to promise too much or invent missing data.