YouZum

Detecting Hallucinations in Authentic LLM-Human Interactions

arXiv:2510.10539v1 Announce Type: new
Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed–either through deliberate hallucination induction or simulated interactions–rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.

We use cookies to improve your experience and performance on our website. You can learn more at นโยบายความเป็นส่วนตัว and manage your privacy settings by clicking Settings.

ตั้งค่าความเป็นส่วนตัว

You can choose your cookie settings by turning on/off each type of cookie as you wish, except for essential cookies.

ยอมรับทั้งหมด
จัดการความเป็นส่วนตัว
  • เปิดใช้งานตลอด

บันทึกการตั้งค่า
th