Methodology

AAA.win evaluates specific agents on specific multilingual business tasks under documented conditions.

Scoring

Each run is scored on task success, language fit, instruction following, business safety, and output reliability.
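As an illustration only, the five dimensions above could be recorded per run in a structure like the one below. The class name `RunScore`, the field names, and the 0.0 to 1.0 scale are assumptions made for this sketch; the scale and any aggregation rule are not specified here, so none is implemented.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RunScore:
    """Hypothetical per-run record; one field per scoring dimension."""

    task_success: float
    language_fit: float
    instruction_following: float
    business_safety: float
    output_reliability: float

    def __post_init__(self) -> None:
        # Assumed scale: every dimension lies in [0.0, 1.0].
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0.0, 1.0], got {value}")


score = RunScore(
    task_success=1.0,
    language_fit=0.5,
    instruction_following=0.75,
    business_safety=1.0,
    output_reliability=0.25,
)
```

Keeping the dimensions as separate fields, rather than collapsing them into one number, leaves the per-dimension results visible for each run.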

Runs

Each agent runs each task 3 times with tools, browsing, and memory disabled.
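The run conditions above can be sketched as a small configuration plus a run plan. `RunConfig`, `plan_runs`, and the agent and task identifiers are hypothetical names chosen for this example, not part of any AAA.win tooling.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence, Tuple


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical run conditions mirroring the text above."""

    repetitions: int = 3
    tools_enabled: bool = False
    browsing_enabled: bool = False
    memory_enabled: bool = False


def plan_runs(
    agents: Sequence[str],
    tasks: Sequence[str],
    config: RunConfig = RunConfig(),
) -> Iterator[Tuple[str, str, int]]:
    """Yield (agent, task, attempt) for every agent/task pair and repetition."""
    for agent, task in product(agents, tasks):
        for attempt in range(1, config.repetitions + 1):
            yield agent, task, attempt


# Placeholder identifiers, used only to show the shape of the plan.
runs = list(plan_runs(["agent-a", "agent-b"], ["task-1", "task-2"]))
```

With two agents and two tasks, the plan contains 2 × 2 × 3 = 12 runs.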

Critical failures

A critical failure is an output that would be unsafe, misleading, unusable, or structurally invalid in a real business workflow.
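One possible way to record the four categories above is a flag set, so that a single run can carry more than one. The names `CriticalFailure` and `is_critical` are assumptions for this sketch; how a category gets assigned to a run (by a reviewer or an automated check) is not specified here.

```python
from enum import Flag, auto


class CriticalFailure(Flag):
    """Hypothetical flags, one per category named in the definition."""

    NONE = 0
    UNSAFE = auto()
    MISLEADING = auto()
    UNUSABLE = auto()
    STRUCTURALLY_INVALID = auto()


def is_critical(flags: CriticalFailure) -> bool:
    """A run counts as a critical failure if any category is set."""
    return flags != CriticalFailure.NONE


# Example: a run flagged under two categories at once.
flagged = CriticalFailure.UNSAFE | CriticalFailure.STRUCTURALLY_INVALID
```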

Vendor policy

Vendors cannot pay to change scores. Sponsored placements, if introduced later, will be labeled separately.