- CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery — arXiv:2406.08587, published Jun 12, 2024
- Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning — arXiv:2406.09170, published Jun 13, 2024
- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents — arXiv:2407.18901, published Jul 26, 2024
- OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations — arXiv:2412.07626, published Dec 10, 2024
- EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents — arXiv:2502.09560, published Feb 13, 2025
- SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering? — arXiv:2502.12115, published Feb 17, 2025
- MMTEB: Massive Multilingual Text Embedding Benchmark — arXiv:2502.13595, published Feb 19, 2025
- MultiAgentBench: Evaluating the Collaboration and Competition of LLM Agents — arXiv:2503.01935, published Mar 3, 2025
- YourBench: Easy Custom Evaluation Sets for Everyone — arXiv:2504.01833, published Apr 2, 2025
- AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories — arXiv:2504.08942, published Apr 11, 2025
- FullFront: Benchmarking MLLMs Across the Full Front-End Engineering Workflow — arXiv:2505.17399, published May 23, 2025
- MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation — arXiv:2506.14028, published Jun 16, 2025