arxiv:2601.11044

AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts

Published on Jan 16 · Submitted by Rajkumar rawal on Jan 19
Abstract

AgencyBench presents a comprehensive benchmark for evaluating autonomous agents across real-world scenarios, enabling automated evaluation through user simulation and sandbox environments while revealing performance gaps between closed-source and open-source models.

AI-generated summary

Large Language Model (LLM)-based autonomous agents demonstrate multifaceted capabilities that can contribute substantially to economic production. However, existing benchmarks remain focused on a single agentic capability, failing to capture long-horizon, real-world scenarios. Moreover, the reliance on human-in-the-loop feedback for realistic tasks creates a scalability bottleneck, hindering automated rollout collection and evaluation. To bridge this gap, we introduce AgencyBench, a comprehensive benchmark derived from daily AI usage that evaluates 6 core agentic capabilities across 32 real-world scenarios, comprising 138 tasks with specific queries, deliverables, and rubrics. These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve. To enable automated evaluation, we employ a user-simulation agent to provide iterative feedback, and a Docker sandbox to conduct visual and functional rubric-based assessment. Experiments reveal that closed-source models significantly outperform open-source models (48.4% vs. 32.1%). Further analysis reveals significant disparities across models in resource efficiency, feedback-driven self-correction, and specific tool-use preferences. Finally, we investigate the impact of agentic scaffolds, observing that proprietary models demonstrate superior performance within their native ecosystems (e.g., Claude-4.5-Opus via Claude-Agent-SDK), while open-source models exhibit distinct performance peaks, suggesting potential optimization for specific execution frameworks. AgencyBench serves as a critical testbed for next-generation agents, highlighting the necessity of co-optimizing model architecture with agentic frameworks. We believe this work sheds light on the future direction of autonomous agents, and we release the full benchmark and evaluation toolkit at https://github.com/GAIR-NLP/AgencyBench.
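
As a rough illustration of the evaluation pipeline the abstract describes (an agent rollout, a user-simulation agent that provides iterative feedback, and rubric-based checks run in a Docker sandbox), here is a minimal Python sketch. Every name in it (run_agent_step, simulate_user_feedback, run_rubric_checks, the turn cap) is a hypothetical placeholder, not the released AgencyBench toolkit API.

```python
# Minimal sketch of a rollout/evaluation loop like the one described in the abstract.
# All function names and the turn cap are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import Optional

MAX_TURNS = 5  # assumed cap on user-feedback rounds


@dataclass
class Rollout:
    transcript: list = field(default_factory=list)
    passed_rubrics: int = 0
    total_rubrics: int = 0


def run_agent_step(task: str, feedback: Optional[str]) -> str:
    """Placeholder: one agent attempt (tool calls, long-context reasoning)."""
    return f"deliverable for {task!r} (feedback={feedback!r})"


def simulate_user_feedback(deliverable: str) -> Optional[str]:
    """Placeholder: the user-simulation agent critiques the deliverable.
    Returns None once the simulated user is satisfied."""
    return None


def run_rubric_checks(deliverable: str, rubrics: list) -> int:
    """Placeholder: visual/functional rubric checks run inside a Docker sandbox."""
    return sum(1 for rubric in rubrics if rubric)  # stub: every rubric "passes"


def evaluate_task(task: str, rubrics: list) -> Rollout:
    rollout = Rollout(total_rubrics=len(rubrics))
    feedback = None
    deliverable = ""
    for _ in range(MAX_TURNS):
        deliverable = run_agent_step(task, feedback)
        rollout.transcript.append(deliverable)
        feedback = simulate_user_feedback(deliverable)
        if feedback is None:  # simulated user accepts the deliverable
            break
    rollout.passed_rubrics = run_rubric_checks(deliverable, rubrics)
    return rollout


if __name__ == "__main__":
    result = evaluate_task("build a landing page", ["page renders", "contact form works"])
    print(f"{result.passed_rubrics}/{result.total_rubrics} rubrics passed")
```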

Community

Some of the key observations are:

-- Long-horizon tasks remain challenging:
Even frontier models struggle with sustained reasoning over real-world tasks that require around 1M tokens and 90 tool calls, indicating limits in long-context autonomy.

-- Proprietary models outperform open-source models:
Closed-source models achieve a higher average score (48.4%) than open-source counterparts (32.1%), revealing a persistent performance gap on complex agentic tasks.

-- Feedback-driven self-correction varies widely:
Models like GPT 5.2 and Claude show strong gains from iterative feedback, while others (e.g., DeepSeek V3.2) exhibit minimal or no improvement after feedback.

-- Efficiency trade-offs are significant:
High-performing models often consume far more tokens and time, while some models (e.g., Grok 4.1 Fast) are more token-efficient despite lower absolute scores; a toy sketch of these metrics follows this list.

-- Agentic scaffolds strongly influence performance:
Models tend to perform best within their native or optimized ecosystems, highlighting that agent performance depends on tight coupling between the model and its scaffold, not on the model alone.
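
The efficiency and self-correction observations above suggest two simple derived metrics: feedback gain (score after iterative feedback minus the first-attempt score) and token efficiency (final score per million tokens consumed). The toy Python sketch below shows how such metrics can be computed; the model labels and numbers are purely illustrative and are not results from the paper.

```python
# Toy illustration of feedback gain and token efficiency.
# Model labels and numbers are made up for demonstration only.
runs = {
    # label: (score_first_attempt, score_after_feedback, total_tokens_used)
    "model_a": (0.40, 0.52, 1_200_000),
    "model_b": (0.30, 0.31, 800_000),
}

for label, (first, final, tokens) in runs.items():
    feedback_gain = final - first              # improvement from iterative feedback
    efficiency = final / (tokens / 1_000_000)  # score per million tokens
    print(f"{label}: gain={feedback_gain:+.2f}, efficiency={efficiency:.2f} score/1M tokens")
```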

"The boundary of evaluation determines the upper limit of intelligence."
Introducing AgencyBench: The first benchmark designed for the 1 Million Token era. 🚀
138 Tasks. 32 Scenarios. 90+ Tool Calls. This is where the real agents play.

Paper: https://arxiv.org/abs/2601.11044
GitHub: https://github.com/GAIR-NLP/AgencyBench
Hugging Face: https://huggingface.co/datasets/GAIR/AgencyBench
Project: https://agencybench.opensii.ai
