10+ Tools for Hallucination Detection and Evaluation in Large Language Models

In this short article, we share benchmarks and tools you can use to detect and evaluate hallucinations in your large language models.

What are hallucinations?

In large language models (LLMs), "hallucinations" are cases when a model produces text with details, facts, or claims that are fictional, misleading, or completely made up, instead of giving reliable and truthful information.

Read our article to learn more about hallucinations, including their causes, how to identify them, and why they can sometimes be beneficial.

Now, to the list of benchmarks β†’

Truthfulness and Factuality Evaluation

  • TruthfulQA (2022): The TruthfulQA benchmark measures the truthfulness of language models by evaluating their answers to 817 questions across 38 categories designed to elicit false answers based on common misconceptions. The benchmark reveals that even the best-performing models are truthful on only 58% of questions, compared to 94% for human performance, suggesting larger models often generate more false answers that mimic popular misconceptions. β†’ Original paper, website

  • FACTOR (2024): FACTOR (Factual Assessment via Corpus TransfORmation) is a novel approach to evaluating language model factuality. It automatically transforms a factual corpus into a benchmark that tests whether a model prefers generating true facts over similar but incorrect statements. The method was applied to create three domain-specific benchmarks: Wiki-FACTOR, News-FACTOR, and Expert-FACTOR. → Original paper, website

  • FacTool (2023): FacTool tackles two challenges in fact-checking LLM output: long generations lack a clear granularity of individual facts, and explicit evidence for verification is often scarce. It leverages tools such as Google Search, Google Scholar, code interpreters, and Python, alongside the reasoning capabilities of language models. Experiments across tasks like knowledge-based QA, code generation, mathematical reasoning, and scientific literature review demonstrate its effectiveness. → Original paper, website

  • FreshQA (2023): FreshQA is a dynamic QA benchmark for evaluating the factuality of LLMs against current world knowledge. The authors found that all models, regardless of size, struggled with questions requiring up-to-date knowledge or resting on false premises. To address this, they developed FRESHPROMPT, a prompting method that improves LLM performance by incorporating information retrieved from a search engine into the prompt (see the sketch after this list). → Original paper, website
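
To make the FRESHPROMPT idea concrete, here is a minimal sketch of retrieval-augmented prompting: retrieved snippets are sorted so the freshest evidence sits closest to the question, then prepended to the prompt. The `search_snippets` placeholder and the prompt wording are our own illustrative assumptions, not the paper's exact recipe.

```python
from datetime import date

def search_snippets(question: str) -> list[dict]:
    """Placeholder for a search-engine call; each snippet should carry its
    source, publication date, and text. Plug in your own retrieval backend."""
    raise NotImplementedError

def build_fresh_prompt(question: str, snippets: list[dict]) -> str:
    """Assemble a FRESHPROMPT-style prompt: retrieved evidence first,
    ordered so the freshest snippets sit closest to the question."""
    ordered = sorted(snippets, key=lambda s: s["date"])  # oldest -> newest
    evidence = "\n".join(
        f"[{s['date']}] {s['source']}: {s['text']}" for s in ordered
    )
    return (
        f"Today is {date.today().isoformat()}.\n"
        "Answer the question using the evidence below. Prefer the most recent "
        "evidence and point out if the question rests on a false premise.\n\n"
        f"{evidence}\n\nQuestion: {question}\nAnswer:"
    )
```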

Hallucination Detection and Mitigation

  • Med-HALT (2023): The Med-HALT benchmark is a pioneering evaluation framework to test and mitigate hallucinations in LLMs within the medical domain. It introduces a multinational and multi-subject dataset, incorporating innovative reasoning and memory-based tests to comprehensively assess LLMs' capabilities in generating accurate and reliable medical information. β†’ Original paper, website

  • SelfCheckGPT (2023): SelfCheckGPT proposes a novel method for detecting hallucinations without needing external databases or access to the model's internals. By sampling multiple responses to the same prompt and analyzing their consistency, SelfCheckGPT can effectively identify factual inaccuracies, showing significant improvements over existing methods at detecting non-factual information (a minimal sketch of the idea follows this list). → Original paper, website

  • HalluQA (2023): The HalluQA benchmark is an adversarial question-answering dataset to evaluate hallucination phenomena in Chinese large language models, encompassing imitative falsehoods and factual errors across diverse domains, including Chinese historical culture and social phenomena. β†’ Original paper, website

  • HaluEval (2023): This benchmark consists of 35,000 samples, including generated responses to user queries and task-specific examples in areas like question answering and text summarization. Its sampling-then-filtering method automatically generates samples that LLMs often hallucinate on. → Original paper, website

  • HalOmi (2023): HalOmi is an annotated dataset focusing on hallucinations and omissions in machine translation across 18 language pairs of varied resource levels. HalOmi features fine-grained annotations at both sentence and word levels, allowing for the examination of partial and full hallucinations and omissions. β†’ Original paper, website
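
As a rough illustration of the sampling-and-consistency idea behind SelfCheckGPT, the sketch below scores a single sentence by how strongly independently re-sampled responses contradict it. The `generate` and `contradiction` callables are stand-ins for whatever LLM and consistency scorer you use; the paper itself offers several scoring variants.

```python
from typing import Callable

def selfcheck_score(
    sentence: str,
    prompt: str,
    generate: Callable[[str], str],              # your LLM, sampled at temperature > 0
    contradiction: Callable[[str, str], float],  # e.g. an NLI model: P(sample contradicts sentence)
    n_samples: int = 5,
) -> float:
    """Re-sample the model on the same prompt and measure how strongly the
    samples contradict the sentence being checked. A high score means the
    claim is not reproduced consistently and is likely hallucinated."""
    samples = [generate(prompt) for _ in range(n_samples)]
    return sum(contradiction(s, sentence) for s in samples) / n_samples
```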

Dynamic and Real-time Information Handling

  • REALTIME QA (2024): REALTIME QA is a dynamic question-answering (QA) platform that evaluates systems on a weekly basis, focusing on current world events and novel information. It challenges the static assumptions of conventional open-domain QA datasets and pushes toward instantaneous, real-time applications. → Original paper, website

  • SAC3 (2024): SAC3 introduces a novel method for detecting hallucinations in language models by assessing semantic-aware cross-check consistency. The approach combines semantically equivalent question perturbation with cross-model response consistency, offering a more robust assessment of model trustworthiness without relying on external knowledge sources or internal model states (see the sketch below). → Original paper, website
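
The cross-check idea can be sketched as follows: rephrase the question into semantically equivalent variants, collect answers from the target model and one or more verifier models, and flag the original answer if agreement across variants and models is low. The `paraphrase`, `models`, and `agree` callables are placeholders you would supply; this illustrates the principle rather than SAC3's exact scoring.

```python
from itertools import product
from typing import Callable

def cross_check_consistency(
    question: str,
    target_answer: str,
    paraphrase: Callable[[str], list[str]],  # semantically equivalent rewrites of the question
    models: list[Callable[[str], str]],      # target model plus one or more verifier models
    agree: Callable[[str, str], bool],       # do these two answers say the same thing?
) -> float:
    """Fraction of (question variant, model) pairs whose answer agrees with the
    original answer; low consistency suggests the answer is hallucinated."""
    variants = [question] + paraphrase(question)
    checks = [agree(model(q), target_answer) for q, model in product(variants, models)]
    return sum(checks) / len(checks)
```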

Comprehensive and Multifaceted Evaluation

  • BAMBOO (2024): BAMBOO is a multifaceted benchmark designed to assess the long-text modeling capabilities of LLMs. BAMBOO comprises 10 datasets across 5 tasks: question answering, hallucination detection, text sorting, language modeling, and code completion, aiming to cover a broad spectrum of domains and core capacities of LLMs. It includes diverse length levels to simulate real-world scenarios accurately. β†’ Original paper, website

  • FELM (2023): FELM focuses on errors beyond world knowledge alone, covering math and reasoning as well. It provides fine-grained, segment-level factuality annotations, aiming to pinpoint specific factual errors and guide the development of more reliable LLMs. → Original paper, website

We post helpful lists and bite-sized explanations daily on our X (Twitter). Let’s connect!
