[memo] Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models



This content originally appeared on DEV Community and was authored by Takara Taniguchi

Prior work has built a number of hallucination benchmarks.

This paper constructs a framework (benchmark) for evaluating the hallucination benchmarks themselves.

Introduction

LVLMs tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs.

Hallucination benchmark quality measurement framework

Contribution

  • Propose a hallucination benchmark quality measurement framework for LVLMs
  • Construct a new high-quality hallucination benchmark

Related works

POPE constructs yes/no and multiple-choice questions that probe for non-existent objects.
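As a rough illustration (not the paper's or POPE's actual code), POPE-style probing builds binary questions of the form "Is there a {object} in the image?" from the objects annotated in each image, plus sampled absent objects as negatives. The function name and uniform negative sampling below are my own assumptions for the sketch; POPE also defines "popular" and "adversarial" sampling strategies.

```python
import random

def build_pope_style_questions(present_objects, vocabulary, num_negatives=3, seed=0):
    """Build POPE-style yes/no probing questions for one image (illustrative sketch).

    present_objects: objects annotated as present in the image.
    vocabulary: object vocabulary to sample negative (absent) objects from.
    """
    rng = random.Random(seed)
    questions = []
    # Positive questions: objects that actually appear (expected answer "yes").
    for obj in present_objects:
        questions.append((f"Is there a {obj} in the image?", "yes"))
    # Negative questions: objects absent from the image (expected answer "no").
    absent = [o for o in vocabulary if o not in present_objects]
    for obj in rng.sample(absent, min(num_negatives, len(absent))):
        questions.append((f"Is there a {obj} in the image?", "no"))
    return questions

print(build_pope_style_questions(["dog", "frisbee"],
                                 ["dog", "frisbee", "car", "cat", "surfboard"]))
```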

AMBER extended yes-no questions to other types of hallucinations.

HallusionBench uses paired yes/no questions.

Evaluation metrics

CHAIR

OpenCHAIR
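CHAIR measures object hallucination in generated captions: the fraction of mentioned objects that are not actually in the image (CHAIR_i) and the fraction of captions containing at least one such object (CHAIR_s); OpenCHAIR extends this idea to an open vocabulary. Below is a minimal sketch of the computation under the assumption that object mentions have already been extracted and normalized; it is not the original CHAIR implementation.

```python
def chair_scores(captions_objects, ground_truth_objects):
    """Compute CHAIR-style hallucination rates (illustrative sketch).

    captions_objects: list of lists, objects mentioned in each generated caption
                      (assumed already extracted and mapped to a fixed vocabulary).
    ground_truth_objects: list of sets, objects actually present in each image.
    Returns (CHAIR_i, CHAIR_s):
      CHAIR_i = hallucinated object mentions / all object mentions
      CHAIR_s = captions with at least one hallucinated object / all captions
    """
    mentioned = hallucinated = hallucinated_captions = 0
    for objs, gt in zip(captions_objects, ground_truth_objects):
        bad = [o for o in objs if o not in gt]
        mentioned += len(objs)
        hallucinated += len(bad)
        hallucinated_captions += bool(bad)
    chair_i = hallucinated / mentioned if mentioned else 0.0
    chair_s = hallucinated_captions / len(captions_objects) if captions_objects else 0.0
    return chair_i, chair_s

# Example: the second caption hallucinates a "frisbee".
print(chair_scores([["dog", "person"], ["dog", "frisbee"]],
                   [{"dog", "person"}, {"dog", "person"}]))
```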

Hallucination benchmark quality measurement framework

Six representative, publicly available hallucination benchmarks are selected, including MMHal and GAVIE.

The framework follows the methodology of psychological testing, assessing benchmarks in terms of reliability and validity.

Scores for the same models vary considerably across the different benchmarks.

From the perspective of test-retest reliability, closed-ended benchmarks reveal obvious shortcomings.
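Test-retest reliability asks whether a benchmark gives consistent scores when the evaluation is repeated. The paper's exact formulation may differ, but one common way to quantify it is the correlation of model scores across two independent runs, as in this hedged sketch (the example numbers are made up):

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# Benchmark scores of several models from two independent evaluation runs
# (e.g., re-sampled questions or repeated decoding). A high correlation
# indicates good test-retest reliability; benchmarks whose model rankings
# shuffle between runs would score poorly here.
run_1 = [62.3, 71.0, 55.4, 80.2]
run_2 = [61.8, 70.1, 58.9, 79.5]
print(f"test-retest reliability (Pearson r): {pearson(run_1, run_2):.3f}")
```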

Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.

Conclusion

Introduced a quality measurement framework for hallucination benchmarks

Thoughts
I suppose this amounts to bringing insights from psychological reliability testing into AI evaluation.

