SciEvalKit: An Open-source Evaluation Toolkit for Scientific General Intelligence • Paper • 2512.22334 • Published 13 days ago • 32
NitroGen: An Open Foundation Model for Generalist Gaming Agents • Paper • 2601.02427 • Published 4 days ago • 27
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models • Paper • 2512.22238 • Published 16 days ago • 18
UniPercept: Towards Unified Perceptual-Level Image Understanding across Aesthetics, Quality, Structure, and Texture • Paper • 2512.21675 • Published 14 days ago • 24
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios • Paper • 2512.18470 • Published 19 days ago • 10
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows • Paper • 2512.16969 • Published 21 days ago • 111
MobileWorld: Benchmarking Autonomous Mobile Agents in Agent-User Interactive, and MCP-Augmented Environments • Paper • 2512.19432 • Published 17 days ago • 12
QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models • Paper • 2512.19526 • Published 17 days ago • 11
Reinforcement Learning for Self-Improving Agent with Skill Library • Paper • 2512.17102 • Published 21 days ago • 32
Nemotron-Math: Efficient Long-Context Distillation of Mathematical Reasoning from Multi-Mode Supervision • Paper • 2512.15489 • Published 22 days ago • 6
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows • Paper • 2512.13168 • Published 24 days ago • 49
The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality • Paper • 2512.10791 • Published 28 days ago • 7
Evaluating Gemini Robotics Policies in a Veo World Simulator • Paper • 2512.10675 • Published 28 days ago • 17