ASHE Leaderboard

Aggregated Span-level Hallucination Evaluation

ASHE is a unified benchmark for span-level hallucination detection across six generation tasks: open_qa, context_qa (ctx_qa), data-to-text (data2txt), open_biography (open_bio), summarization (summ), and machine translation (mt). It contains 6,461 examples in total.

Views: Span-level | Sentence-level
Method | Reason | Opt | Overall (std) | open_qa | ctx_qa | data2txt | open_bio | summ | mt
F1 scores across six generation tasks. Higher is better.

Reason = chain-of-thought reasoning enabled  |  Opt = prompt optimization (MiPROv2) applied
Paradigms: Disc = discriminatively fine-tuned  |  Gen = generatively fine-tuned  |  Prompt = prompt-based
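
For readers unfamiliar with span-level scoring, the following is a minimal sketch of one common way to compute an F1 score over predicted and gold hallucination spans, using character-level overlap. This page does not state the exact metric definition ASHE uses, so the overlap rule, the handling of empty span sets, and the function names span_chars, span_f1, and overall are all assumptions made for illustration. Likewise, whether the leaderboard's "(std)" is taken over tasks, random seeds, or something else is not stated here; the sketch simply aggregates over per-task scores.

```python
from statistics import mean, pstdev


def span_chars(spans):
    """Expand half-open (start, end) character spans into a set of offsets."""
    return {i for start, end in spans for i in range(start, end)}


def span_f1(pred_spans, gold_spans):
    """Character-overlap F1 between predicted and gold spans (assumed metric)."""
    pred, gold = span_chars(pred_spans), span_chars(gold_spans)
    if not pred and not gold:
        # Assumption: predicting nothing on a hallucination-free text is perfect.
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def overall(task_scores):
    """Mean and population std over per-task F1 scores (assumed aggregation)."""
    values = list(task_scores.values())
    return mean(values), pstdev(values)


# Hypothetical example: one predicted span partially overlapping one gold span.
example_f1 = span_f1([(0, 10)], [(5, 15)])
```

In this sketch a prediction covering characters 0 to 9 against a gold span covering 5 to 14 shares five characters, so precision and recall are both one half. Token-level or exact-match variants would follow the same shape with a different unit of overlap.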