- SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
  Paper • 2602.12783 • Published • 147
- MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
  Paper • 2602.22638 • Published • 102
- CAR-bench: Evaluating the Consistency and Limit-Awareness of LLM Agents under Real-World Uncertainty
  Paper • 2601.22027 • Published • 83
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development
  Paper • 2601.11077 • Published • 65
ShinerYang
TroyeML
AI & ML interests
Machine Learning
Recent Activity
upvoted a collection 24 minutes ago
🔥Hot Benchmarks
updated a collection 24 minutes ago
🔥Hot Benchmarks
updated a collection 24 minutes ago
🔥Hot Benchmarks