HalluHard: A Hard Multi-Turn Hallucination Benchmark
Abstract
Large language models continue to generate plausible but ungrounded factual claims in multi-turn dialogue; hallucinations remain significant across high-stakes domains even when web search is used for verification.
Large language models (LLMs) still produce plausible-sounding but ungrounded factual claims, a problem that worsens in multi-turn dialogue as context grows and early errors cascade. We introduce HalluHard, a challenging multi-turn hallucination benchmark with 950 seed questions spanning four high-stakes domains: legal cases, research questions, medical guidelines, and coding. We operationalize groundedness by requiring inline citations for factual assertions. To support reliable evaluation in open-ended settings, we propose a judging pipeline that iteratively retrieves evidence via web search. It can fetch, filter, and parse full-text sources (including PDFs) to assess whether cited material actually supports the generated content. Across a diverse set of frontier proprietary and open-weight models, hallucinations remain substantial even with web search (approximately 30% for the strongest configuration, Opus-4.5 with search enabled), with content-grounding errors persisting at high rates. Finally, we show that hallucination behavior is shaped by model capacity, turn position, effective reasoning, and the type of knowledge required.
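To make the judging pipeline concrete, the sketch below shows one way such an iterative retrieve-and-verify judge could be organized. It is a minimal illustration under stated assumptions, not the paper's implementation: the names `Claim`, `Verdict`, `web_search`, `fetch_and_parse`, `supports`, and `judge_claim`, as well as the query budget, are hypothetical placeholders.

```python
"""Illustrative sketch of an iterative retrieve-and-verify judging loop.

Everything below is an assumption for exposition: `web_search`,
`fetch_and_parse`, and `supports` are placeholders, not the paper's code.
A real pipeline would call a search API, download HTML/PDF sources,
extract their full text, and ask an LLM judge whether the evidence
supports each cited claim.
"""

from dataclasses import dataclass, field


@dataclass
class Claim:
    text: str                    # factual assertion extracted from a model response
    citation: str | None = None  # inline citation (URL or reference string), if any


@dataclass
class Verdict:
    claim: Claim
    label: str                                         # "supported", "unsupported", or "no-citation"
    evidence: list[str] = field(default_factory=list)  # URLs that were inspected


def web_search(query: str, k: int = 5) -> list[str]:
    """Placeholder: return up to k candidate source URLs for a query."""
    return []  # a real judge would call a web-search API here


def fetch_and_parse(url: str) -> str:
    """Placeholder: download a source (HTML or PDF) and return its plain text."""
    return ""  # a real judge would fetch and extract the full text here


def supports(evidence_text: str, claim_text: str) -> bool:
    """Placeholder: decide whether the evidence backs the claim (an LLM call in practice)."""
    return claim_text.lower() in evidence_text.lower()  # naive stand-in


def judge_claim(claim: Claim, max_queries: int = 3) -> Verdict:
    """Iteratively gather evidence until the claim is supported or the query budget is spent."""
    if claim.citation is None:
        return Verdict(claim, "no-citation")

    inspected: list[str] = []
    # Start from the cited source itself, then fall back to searching the claim text.
    for query in [claim.citation, claim.text][:max_queries]:
        for url in web_search(query):
            text = fetch_and_parse(url)
            if not text:
                continue  # skip sources that could not be retrieved or parsed
            inspected.append(url)
            if supports(text, claim.text):
                return Verdict(claim, "supported", inspected)
    return Verdict(claim, "unsupported", inspected)
```

In a real setup the placeholder functions would wrap a search API, an HTML/PDF extractor, and an LLM judge prompt; the surrounding loop structure, which escalates from the cited source to broader search until evidence is found or the budget is exhausted, is the part the sketch is meant to convey.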
Community
LLM hallucinations are far from solved!
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- AHA: Aligning Large Audio-Language Models for Reasoning Hallucinations via Counterfactual Hard Negatives (2025)
- AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents (2026)
- From RAG to Agentic RAG for Faithful Islamic Question Answering (2026)
- FFE-Hallu: Hallucinations in Fixed Figurative Expressions: Benchmark of Idioms and Proverbs in the Persian Language (2026)
- DSC2025 -- ViHallu Challenge: Detecting Hallucination in Vietnamese LLMs (2026)
- Do I Really Know? Learning Factual Self-Verification for Hallucination Reduction (2026)
- Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems (2026)