FIRE-Bench: Evaluating Agents on the Rediscovery of Scientific Insights • Paper • 2602.02905 • Published 4 days ago
Amuro & Char: Analyzing the Relationship between Pre-Training and Fine-Tuning of Large Language Models • Paper • 2408.06663 • Published Aug 13, 2024
The Validity of Evaluation Results: Assessing Concurrence Across Compositionality Benchmarks • Paper • 2310.17514 • Published Oct 26, 2023