Definition
A ranking of agents by score, language, task type, or risk metric.
A leaderboard summarizes benchmark results, but it should not be treated as a purchase decision by itself. Good leaderboards link back to task evidence and methodology.
Rankings attract attention, but teams need to know why a model ranked highly and where it failed.
An overall leaderboard should be read together with the per-language winners and each agent's critical-failure rate.
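As a rough illustration of reading these views side by side, the sketch below computes an overall ranking, a critical-failure rate per agent, and a winner per language from a list of task results. The record layout (`TaskResult`), the function names, the use of a mean score, and the sample data are all assumptions made for this example; they do not describe any particular benchmark's schema or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    # Hypothetical record layout, assumed for illustration only.
    agent: str
    language: str
    score: float            # assumed per-task score in [0, 1]
    critical_failure: bool  # assumed flag for a severe failure on the task


def overall_ranking(results):
    """Return (agent, mean score, critical-failure rate), best mean score first."""
    by_agent = defaultdict(list)
    for r in results:
        by_agent[r.agent].append(r)
    rows = []
    for agent, rs in by_agent.items():
        mean_score = sum(r.score for r in rs) / len(rs)
        failure_rate = sum(r.critical_failure for r in rs) / len(rs)
        rows.append((agent, mean_score, failure_rate))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def language_winners(results):
    """Return {language: agent with the highest mean score in that language}."""
    scores = defaultdict(lambda: defaultdict(list))
    for r in results:
        scores[r.language][r.agent].append(r.score)
    return {
        lang: max(agents, key=lambda a: sum(agents[a]) / len(agents[a]))
        for lang, agents in scores.items()
    }


# Made-up sample data, not real benchmark results.
results = [
    TaskResult("agent_a", "python", 1.0, False),
    TaskResult("agent_a", "rust", 0.5, True),
    TaskResult("agent_b", "python", 0.5, False),
    TaskResult("agent_b", "rust", 0.75, False),
]

ranking = overall_ranking(results)
winners = language_winners(results)

for agent, mean_score, failure_rate in ranking:
    print(f"{agent}: mean score {mean_score:.3f}, critical-failure rate {failure_rate:.0%}")
print(winners)
```

In this made-up data, agent_a tops the overall ranking yet has a critical failure on half of its tasks and loses the rust column to agent_b, which is the kind of detail a single overall rank hides.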