New Framework Detects Hallucinations in Video Question-Answering Models
Background
Researchers at Simula have introduced VideoHEDGE, a modular framework for detecting hallucinations (confident but incorrect outputs) in video-capable vision-language models (Video‑VLMs) during question-answering tasks. Hallucinations remain a frequent challenge for these models, and existing uncertainty metrics often fail to correlate with factual correctness.
Methodology
VideoHEDGE operates on a given video‑question pair by first generating a baseline answer. It then produces multiple high‑temperature generations from both unaltered video clips and variants that are photometrically and spatiotemporally perturbed. The resulting textual outputs are clustered into semantic hypotheses using either Natural Language Inference (NLI)‑based or embedding‑based techniques.
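The perturbation step described above can be illustrated with a minimal sketch. The specific transforms below (a random brightness shift and random frame dropping) are assumptions chosen to stand in for "photometric" and "spatiotemporal" perturbations; the paper's actual distortion set is not specified in this summary.

```python
import numpy as np

def photometric_perturb(frames, brightness=0.1, rng=None):
    # Illustrative photometric perturbation: shift all pixel
    # intensities by a random brightness offset, then clip to [0, 1].
    rng = rng or np.random.default_rng(0)
    offset = rng.uniform(-brightness, brightness)
    return np.clip(frames + offset, 0.0, 1.0)

def spatiotemporal_perturb(frames, drop_rate=0.2, rng=None):
    # Illustrative spatiotemporal perturbation: randomly drop a
    # fraction of frames to disturb the temporal structure of the clip.
    rng = rng or np.random.default_rng(0)
    keep = rng.random(len(frames)) >= drop_rate
    keep[0] = True  # always retain at least the first frame
    return frames[keep]

# Toy clip: 8 grayscale frames of 4x4 pixels, values in [0, 1].
clip = np.random.default_rng(1).random((8, 4, 4))
variants = [clip, photometric_perturb(clip), spatiotemporal_perturb(clip)]
```

Each variant would then be fed back to the model with the same question at high sampling temperature to collect the answer set that gets clustered.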
Reliability Scores
The framework derives three reliability scores from the cluster‑level probability masses: Semantic Entropy (SE), RadFlag, and Vision‑Amplified Semantic Entropy (VASE). These scores are intended to quantify the confidence and consistency of the model’s responses across the perturbed inputs.
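Of the three scores, Semantic Entropy has a standard formulation: the Shannon entropy of the probability mass assigned to each semantic cluster, so that unanimous answers score zero and widely scattered answers score high. A minimal sketch (using empirical cluster frequencies as the masses; the exact definitions of RadFlag and VASE are not given in this summary, so they are omitted):

```python
import math
from collections import Counter

def semantic_entropy(cluster_ids):
    # Shannon entropy over the empirical distribution of semantic
    # clusters; cluster_ids holds one cluster label per sampled answer.
    counts = Counter(cluster_ids)
    n = len(cluster_ids)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

# All sampled answers land in one cluster -> entropy 0 (consistent model).
consistent = semantic_entropy(["A", "A", "A", "A"])
# Answers scatter across clusters -> higher entropy (possible hallucination).
scattered = semantic_entropy(["A", "B", "A", "C"])
```

Higher entropy indicates that the model's answers are semantically unstable across resampling and perturbation, which is the signal these reliability scores exploit.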
Benchmark Evaluation
Evaluation on the SoccerChat benchmark employed three 7B Video‑VLMs (Qwen2‑VL, Qwen2.5‑VL, and a SoccerChat‑finetuned model), using an LLM‑as‑a‑judge to produce binary hallucination labels. Across all three models, VASE consistently achieved the highest ROC‑AUC, particularly when larger distortion budgets were applied, whereas SE and RadFlag often performed near chance level.
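ROC-AUC here measures how well a reliability score ranks hallucinated answers above faithful ones, given the judge's binary labels. A dependency-free sketch using the rank-based (Mann-Whitney) formulation, with hypothetical labels and scores:

```python
def roc_auc(labels, scores):
    # Rank-based ROC-AUC: the probability that a randomly chosen
    # positive (hallucinated) example receives a higher score than a
    # randomly chosen negative (faithful) one; ties count as half.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 0, 1, 0, 0]            # hypothetical LLM-judge verdicts
scores = [0.9, 0.2, 0.7, 0.4, 0.1]  # hypothetical reliability-derived scores
auc = roc_auc(labels, scores)       # 1.0: perfect separation in this toy case
```

An AUC of 0.5 corresponds to the "near chance" behavior the study reports for SE and RadFlag on some models.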
Efficiency and Fine‑Tuning
The study found that embedding‑based clustering matched the detection performance of NLI‑based clustering while requiring substantially lower computational resources. Additionally, domain fine‑tuning was shown to reduce the overall frequency of hallucinations, though it offered only modest gains in calibration accuracy.
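The efficiency gain is plausible because embedding-based clustering replaces pairwise NLI inference with a single embedding pass plus cheap cosine comparisons. A minimal greedy sketch of this idea, operating on toy 2‑D vectors (a real pipeline would embed each answer with a sentence encoder; the threshold value is an assumption):

```python
import numpy as np

def cluster_by_cosine(embeddings, threshold=0.85):
    # Greedy single-pass clustering: assign each answer embedding to
    # the existing cluster whose centroid is most cosine-similar, if
    # that similarity clears the threshold; otherwise start a cluster.
    centroids, assignment = [], []
    for e in embeddings:
        e = e / np.linalg.norm(e)
        best, best_sim = None, threshold
        for i, c in enumerate(centroids):
            sim = float(e @ (c / np.linalg.norm(c)))
            if sim >= best_sim:
                best, best_sim = i, sim
        if best is None:
            centroids.append(e.copy())
            assignment.append(len(centroids) - 1)
        else:
            centroids[best] += e
            assignment.append(best)
    return assignment

# Toy embeddings: two near-duplicate answers and one semantic outlier.
emb = np.array([[1.0, 0.0], [0.98, 0.05], [0.0, 1.0]])
groups = cluster_by_cosine(emb)  # [0, 0, 1]
```

The resulting cluster sizes feed directly into the cluster-level probability masses from which the reliability scores are computed.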
Open‑Source Release
The authors have released the hedge‑bench library on PyPI, enabling reproducible and extensible benchmarking. Full code and experimental resources are publicly available at https://github.com/Simula/HEDGE#videohedge.
This report is based on the abstract of an open-access arXiv preprint; the full text is available via arXiv.