Selected Publications | Dai, Yutong/ 戴宇童

Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ran Xu

September 2025 arXiv

SCUBA: Salesforce Computer Use Benchmark (arXiv, 2025)

We introduce SCUBA, a benchmark designed to evaluate computer-use agents on customer relationship management (CRM) workflows within the Salesforce platform. SCUBA contains 300 task instances derived from real user interviews, spanning three primary personas: platform administrators, sales representatives, and service agents. The tasks test a range of enterprise-critical abilities, including enterprise software UI navigation, data manipulation, workflow automation, information retrieval, and troubleshooting. To ensure realism, SCUBA operates in Salesforce sandbox environments with support for parallel execution and fine-grained evaluation metrics to capture milestone progress. We benchmark a diverse set of agents under both zero-shot and demonstration-augmented settings. We observe large performance gaps across agent design paradigms and between open-source and closed-source models. In the zero-shot setting, computer-use agents powered by open-source models that perform strongly on related benchmarks such as OSWorld achieve a success rate below 5% on SCUBA, while methods built on closed-source models reach a task success rate of up to 39%. In the demonstration-augmented setting, task success rates can be improved to 50% while simultaneously reducing time and costs by 13% and 16%, respectively. These findings highlight both the challenges of enterprise task automation and the promise of agentic solutions. By offering a realistic benchmark with interpretable evaluation, SCUBA aims to accelerate progress in building reliable computer-use agents for complex business software ecosystems.

PDF Code

Linxin Song, Yutong Dai, Viraj Prabhu, Jieyu Zhang, Taiwei Shi, Li Li, Junnan Li, Silvio Savarese, Zeyuan Chen, Jieyu Zhao, Ran Xu, Caiming Xiong

August 2025 arXiv

CoAct-1: Computer-using Agents with Coding as Actions (arXiv, 2025)

Autonomous agents that operate computers via Graphical User Interfaces (GUIs) often struggle with efficiency and reliability on complex, long-horizon tasks. While augmenting these agents with planners can improve task decomposition, they remain constrained by the inherent limitations of performing all actions through GUI manipulation, leading to brittleness and inefficiency. In this work, we introduce a more robust and flexible paradigm: enabling agents to use coding as an enhanced action. We present CoAct-1, a novel multi-agent system that synergistically combines GUI-based control with direct programmatic execution. CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary. We evaluate our system on the challenging OSWorld benchmark, where CoAct-1 achieves a new state-of-the-art success rate of 60.76%, significantly outperforming prior methods. Furthermore, our approach dramatically improves efficiency, reducing the average number of steps required to complete a task to just 10.15, compared to 15 for leading GUI agents. Our results demonstrate that integrating coding as a core action provides a more powerful, efficient, and scalable path toward generalized computer automation.

PDF Code

Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Silvio Savarese, Caiming Xiong, Junnan Li

July 2025 arXiv

GTA1: GUI Test-time Scaling Agent (arXiv, 2025)

Graphical user interface (GUI) agents autonomously complete tasks across platforms (e.g., Linux) by sequentially decomposing user instructions into action proposals that iteratively interact with visual elements in the evolving environment. However, two main challenges arise: i) planning (i.e., the action proposal sequence) under an expansive action space, where selecting an appropriate plan is nontrivial, as many valid ones may exist; ii) accurately grounding actions in complex and high-resolution interfaces, i.e., precisely interacting with visual targets. This paper investigates these challenges with our GUI Test-time Scaling Agent, namely GTA1. First, we conduct test-time scaling to select the most appropriate action proposal: at each step, multiple candidate proposals are sampled, and a judge model evaluates them and selects one. This trades computation for better decision quality through concurrent sampling. Second, we propose a model that improves the grounding of the selected action proposals to their corresponding visual elements. Our key insight is that reinforcement learning (RL) facilitates grounding through inherent objective alignments, rewarding successful clicks on interface elements. Experimentally, GTA1 achieves state-of-the-art performance on both grounding and agent task execution benchmarks.

PDF Code
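The per-step selection described in the GTA1 abstract (sample several candidate action proposals, then let a judge pick one) can be illustrated with a minimal sketch. This is not the GTA1 implementation: `select_action`, `propose`, `judge`, and the toy stand-ins below are hypothetical names introduced only for illustration, and candidates are sampled sequentially here for simplicity, whereas the abstract mentions concurrent sampling.

```python
import random
from typing import Callable, List


def select_action(
    propose: Callable[[str], str],
    judge: Callable[[str, List[str]], int],
    observation: str,
    num_candidates: int = 4,
) -> str:
    """Sample several candidate proposals and return the one the judge picks.

    `propose` and `judge` are hypothetical callables standing in for a
    planner model and a judge model; they are assumptions of this sketch.
    """
    candidates = [propose(observation) for _ in range(num_candidates)]
    best_index = judge(observation, candidates)
    return candidates[best_index]


def toy_propose(observation: str) -> str:
    """Toy planner: returns a random action string."""
    return random.choice(["click(search_box)", "scroll(down)", "type('hello')"])


def toy_judge(observation: str, candidates: List[str]) -> int:
    """Toy judge: prefers candidates that mention a word from the observation."""
    scores = [sum(word in c for word in observation.split()) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


if __name__ == "__main__":
    print(select_action(toy_propose, toy_judge, "search page", num_candidates=5))
```

Raising `num_candidates` spends more computation per step in exchange for a larger pool for the judge to choose from, which is the trade-off the abstract refers to.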

Yutong Dai, Tianyi Chen, Guanyi Wang, Daniel P. Robinson

May 2023 TMLR

An Adaptive Half-Space Projection Method for Stochastic Optimization Problems with Group Sparse Regularization (TMLR, 2023)

Optimization problems with group sparse regularization are ubiquitous in various popular downstream applications, such as feature selection and compression for Deep Neural Networks (DNNs). Nonetheless, the existing methods in the literature do not perform particularly well when such regularization is used in combination with a stochastic loss function. In particular, it is challenging to design an algorithm that is computationally efficient, has a convergence guarantee, and is able to compute group-sparse solutions. Recently, a half-space stochastic projected gradient (HSPG) method was proposed that partly addressed these challenges. In this paper, we present a substantially enhanced version of HSPG that we call AdaHSPG+ that makes two noticeable advances. First, AdaHSPG+ is shown to have a stronger convergence result under significantly looser assumptions than those required by HSPG. This improvement in convergence is achieved by integrating variance reduction techniques with a new adaptive strategy for iteratively predicting the support of a solution. Second, AdaHSPG+ requires significantly less parameter tuning compared to HSPG, thus making it more practical and user friendly. This advance is achieved by designing automatic and adaptive strategies for choosing the type of step employed at each iteration and for updating key hyperparameters. The numerical effectiveness of our proposed AdaHSPG+ algorithm is demonstrated on both convex and non-convex benchmark problems.

PDF

Yutong Dai, Guanyi Wang, Frank E. Curtis, Daniel P. Robinson

January 2023 AISTATS2023

A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization (AISTATS, 2023)

This paper introduces a new proximal stochastic gradient method with variance reduction and stabilization for minimizing the sum of a convex stochastic function and a group sparsity-inducing regularization function. Since the method may be viewed as a stabilized version of the recently proposed algorithm PStorm, we call our algorithm S-PStorm. Our analysis shows that S-PStorm has strong convergence results. In particular, we prove an upper bound on the number of iterations required by S-PStorm before its iterates correctly identify (with high probability) an optimal support (i.e., the zero and nonzero structure of an optimal solution). Most algorithms in the literature with such a support identification property use variance reduction techniques that require either periodically evaluating an exact gradient or storing a history of stochastic gradients. Unlike these methods, S-PStorm achieves variance reduction without requiring either of these, which is advantageous. Moreover, our support-identification result for S-PStorm shows that, with high probability, an optimal support will be identified correctly in all iterations with the index above a threshold. We believe that this type of result is new to the literature since the few existing other results prove that the optimal support is identified with high probability at each iteration with a sufficiently large index (meaning that the optimal support might be identified in some iterations, but not in others). Numerical experiments on regularized logistic loss problems show that S-PStorm outperforms existing methods in various metrics that measure how efficiently and robustly iterates of an algorithm identify an optimal support.

PDF Code Poster

Yutong Dai, Zeyuan Chen, Junnan Li, Shelby Heinecke, Lichao Sun, Ran Xu

December 2022 AAAI2023

Tackling Data Heterogeneity in Federated Learning with Class Prototypes (AAAI, 2023)

Data heterogeneity across clients in federated learning (FL) settings is a widely acknowledged challenge. In response, personalized federated learning (PFL) emerged as a framework to curate local models for clients’ tasks. In PFL, a common strategy is to develop local and global models jointly: the global model (for generalization) informs the local models, and the local models (for personalization) are aggregated to update the global model. A key observation is that if we can improve the generalization ability of local models, then we can improve the generalization of global models, which in turn builds better personalized models. In this work, we consider class imbalance, an overlooked type of data heterogeneity, in the classification setting. We propose FedNH, a novel method that improves the local models’ performance for both personalization and generalization by combining the uniformity and semantics of class prototypes. FedNH initially distributes class prototypes uniformly in the latent space and smoothly infuses the class semantics into class prototypes. We show that imposing uniformity helps to combat prototype collapse while infusing class semantics improves local models. Extensive experiments were conducted on popular classification datasets under the cross-device setting. Our results demonstrate the effectiveness and stability of our method over recent works.

Code Poster Slides arxiv Video: Bilibili Video: Google Drive

Yutong Dai, Daniel P. Robinson

November 2022

Inexact Proximal-Gradient Methods with Support Identification

We consider the proximal-gradient method for minimizing an objective function that is the sum of a smooth function and a non-smooth convex function. A feature that distinguishes our work from most in the literature is that we assume that the associated proximal operator does not admit a closed-form solution. To address this challenge, we study two adaptive and implementable termination conditions that dictate how accurately the proximal-gradient subproblem is solved. We prove that the number of iterations required for the inexact proximal-gradient method to reach a $τ > 0$ approximate first-order stationary point is $O(τ^{-2})$, which matches the similar result that holds when exact subproblem solutions are computed. Also, by focusing on the overlapping group regularizer, we propose an algorithm for approximately solving the proximal-gradient subproblem, and then prove that its iterates identify (asymptotically) the support of an optimal solution. If one imposes additional control over the accuracy to which each subproblem is solved, we give an upper bound on the maximum number of iterations before the support of an optimal solution is obtained.

Code arxiv

Frank E. Curtis, Yutong Dai, Daniel P. Robinson

May 2022 SIOPT

A Subspace Acceleration Method for Minimization Involving a Group Sparsity-Inducing Regularizer (SIOPT, 2022)

We consider the problem of minimizing an objective function that is the sum of a convex function and a group sparsity-inducing regularizer. Problems that integrate such regularizers arise in modern machine learning applications, often for the purpose of obtaining models that are easier to interpret and that have higher predictive accuracy. We present a new method for solving such problems that utilizes subspace acceleration, domain decomposition, and support identification. Our analysis shows, under common assumptions, that the iterate sequence generated by our framework is globally convergent, converges to an $ϵ$-approximate solution in at most $O(ϵ^{-(1+p)})$ (respectively, $O(ϵ^{-(2+p)})$) iterations for all $ϵ$ bounded above and large enough (respectively, all $ϵ$ bounded above) where $p > 0$ is an algorithm parameter, and exhibits superlinear local convergence. Preliminary numerical results for the task of binary classification based on regularized logistic regression show that our approach is efficient and robust, with the ability to outperform a state-of-the-art method.

PDF Code Poster arxiv
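For readers unfamiliar with the problem class in the abstract above, the following is a minimal sketch of a plain proximal-gradient baseline for a least-squares loss plus a non-overlapping group regularizer. It is not the subspace acceleration method of the paper: the least-squares loss, the function names, and the fixed step size 1/L are all assumptions made for illustration. The group soft-thresholding step sets whole groups exactly to zero, which is what makes the support (the zero and nonzero groups) a meaningful object to identify.

```python
import numpy as np


def group_soft_threshold(v, groups, threshold):
    """Proximal operator of threshold * sum_g ||v_g||_2 for non-overlapping groups."""
    out = np.zeros_like(v, dtype=float)
    for g in groups:
        norm = np.linalg.norm(v[g])
        if norm > threshold:
            # Shrink the group toward zero; groups with small norm stay exactly zero.
            out[g] = (1.0 - threshold / norm) * v[g]
    return out


def group_lasso_objective(A, b, x, groups, lam):
    """0.5 * ||Ax - b||^2 + lam * sum_g ||x_g||_2 (illustrative objective)."""
    penalty = sum(np.linalg.norm(x[g]) for g in groups)
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * penalty


def proximal_gradient(A, b, groups, lam, num_iters=500):
    """Plain proximal-gradient iteration with constant step 1/L."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # L = squared spectral norm of A
    x = np.zeros(A.shape[1])
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b)
        x = group_soft_threshold(x - step * grad, groups, step * lam)
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((40, 6))
    b = A @ np.array([1.0, -2.0, 0.0, 0.0, 0.0, 0.0])
    groups = [[0, 1], [2, 3], [4, 5]]
    x = proximal_gradient(A, b, groups, lam=2.0)
    print("group norms:", [float(np.linalg.norm(x[g])) for g in groups])
```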

Yutong Dai, Yang Weng

April 2019 JSSC

Synchronous Parallel Block Coordinate Descent Method for Nonsmooth Convex Function Minimization (JSSC, 2019)

In this paper, we propose a synchronous parallel block coordinate descent algorithm (PSUM) for minimizing a composite function, which consists of a smooth convex function plus a non-smooth but separable convex function. Owing to the generality of our method, some existing synchronous parallel algorithms can be considered as special cases. To tackle high-dimensional problems, we further develop a randomized variant, which randomly updates some blocks of coordinates at each round of computation. Both proposed parallel algorithms are proven to have a sublinear convergence rate under rather mild assumptions. Numerical experiments on large-scale regularized logistic regression with an $\ell_1$-norm penalty show that the implementation is quite efficient. We conclude with an explanation of the observed experimental results and a discussion of potential improvements.

PDF
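The randomized variant described in the abstract above (update a random subset of coordinate blocks in each round, all from the same iterate) can be illustrated with a small sketch. This is not the PSUM algorithm from the paper: it assumes a least-squares loss rather than logistic regression, uses a single conservative step size 1/L for every block, and simulates the synchronous parallel update serially. All function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1 (separable across coordinates)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def lasso_objective(A, b, x, lam):
    """0.5 * ||Ax - b||^2 + lam * ||x||_1 (illustrative objective)."""
    return 0.5 * np.sum((A @ x - b) ** 2) + lam * np.sum(np.abs(x))


def randomized_block_prox_descent(
    A, b, lam, num_blocks=4, blocks_per_round=2, num_rounds=300, seed=0
):
    """Each round, apply a proximal-gradient step to a random subset of blocks."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    blocks = np.array_split(np.arange(n), num_blocks)
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # conservative global step 1/L
    x = np.zeros(n)
    history = [lasso_objective(A, b, x, lam)]
    for _ in range(num_rounds):
        chosen = rng.choice(num_blocks, size=blocks_per_round, replace=False)
        # The gradient is evaluated once per round, so every chosen block is
        # updated from the same iterate (a serial stand-in for a synchronous
        # parallel update).
        grad = A.T @ (A @ x - b)
        for j in chosen:
            idx = blocks[j]
            x[idx] = soft_threshold(x[idx] - step * grad[idx], step * lam)
        history.append(lasso_objective(A, b, x, lam))
    return x, history


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = rng.standard_normal((50, 12))
    b = rng.standard_normal(50)
    x, history = randomized_block_prox_descent(A, b, lam=1.0)
    print("initial objective:", history[0], "final objective:", history[-1])
```

With the global step 1/L, each round is a proximal-gradient step restricted to the chosen coordinates, so the objective in this sketch does not increase from round to round; per-block step sizes would typically allow larger steps.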