ASHE Leaderboard

Aggregated Span-level Hallucination Evaluation

ASHE is a unified benchmark for span-level hallucination detection across six tasks: open_qa, context_qa, data-to-text, open_biography, summarization, and machine translation. It contains 6,461 examples in total.

Views: Span-level | Sentence-level

Method	Reason	Opt	Overall (std)	open_qa	ctx_qa	data2txt	open_bio	summ	mt

F₁ scores across six generation tasks. Higher is better.

Reason = chain-of-thought reasoning enabled | Opt = prompt optimization (MIPROv2) applied
Paradigms: Disc = discriminatively fine-tuned | Gen = generatively fine-tuned | Prompt = prompt-based
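
The page does not spell out how the span-level F₁ is computed, so the following is only a minimal sketch of one common convention: character-overlap F₁ between predicted and gold hallucination spans. The function names (`span_chars`, `span_f1`, `overall`), the half-open `(start, end)` span format, the handling of empty predictions, and the reading of "Overall (std)" as the mean and standard deviation across per-task scores are all assumptions for illustration, not ASHE's official scorer.

```python
from statistics import mean, pstdev


def span_chars(spans):
    """Expand half-open (start, end) character spans into a set of indices."""
    chars = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def span_f1(pred_spans, gold_spans):
    """Character-overlap F1 between predicted and gold spans (assumed metric)."""
    pred, gold = span_chars(pred_spans), span_chars(gold_spans)
    # Assumption: an example with no gold spans and no predictions scores 1.0.
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def overall(task_scores):
    """Mean and population std over per-task F1 values (assumed aggregation)."""
    values = list(task_scores.values())
    return mean(values), pstdev(values)
```

For example, `span_f1([(0, 4)], [(2, 6)])` compares characters 0-3 against characters 2-5: two of four predicted characters are correct and two of four gold characters are recovered, so precision, recall, and F₁ are all 0.5.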