Benchmark

SCUBA: Salesforce Computer Use Benchmark (arXiv, 2025)

We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary …

GTA1: GUI Test-time Scaling Agent (arXiv, 2025)

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, …