Models deployed on Hugging Face or RunPod.
AI & ML interests
LLM Evaluation
Papers
Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis
MEMTRACK: Evaluating Long-Term Memory and State Tracking in Multi-Platform Dynamic Agent Environments
A benchmark for tip-of-the-tongue search and reasoning.
- PatronusAI/lynx-70b-instruct-covidqa-generations
  Viewer • Updated • 1k • 6
- PatronusAI/lynx-70b-instruct-drop-generations
  Viewer • Updated • 1k • 6
- PatronusAI/lynx-70b-instruct-financebench-generations
  Viewer • Updated • 1k • 7
- PatronusAI/lynx-70b-instruct-halueval-generations
  Viewer • Updated • 10k • 5