This content originally appeared on DEV Community and was authored by Takara Taniguchi
Prior work has built multiple hallucination benchmarks.
This paper constructs a framework that evaluates the quality of hallucination benchmarks themselves.
Introduction
LVLMs tend to generate hallucinations: responses that are inconsistent with the corresponding visual inputs.
Hallucination benchmark quality measurement framework
Contribution
- Propose a quality measurement framework for hallucination benchmarks for VLMs
- Construct a new, higher-quality hallucination benchmark
Related works
POPE constructs yes/no questions that probe for non-existent objects in the image.
AMBER extends yes/no questions to other types of hallucinations.
HallusionBench uses paired yes/no questions.
Evaluation metrics
CHAIR: measures the rate of hallucinated objects in generated captions against a fixed object vocabulary.
OpenCHAIR: an open-vocabulary extension of CHAIR.
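As a minimal sketch of the idea behind CHAIR-style scoring (the function name and interface here are illustrative, not the benchmark's actual implementation): the instance-level score is the fraction of objects mentioned in a caption that do not actually appear in the image.

```python
def chair_i(mentioned_objects, ground_truth_objects):
    """Instance-level CHAIR-style score: the fraction of mentioned
    objects that are hallucinated (absent from the image)."""
    mentioned = set(mentioned_objects)
    hallucinated = mentioned - set(ground_truth_objects)
    if not mentioned:
        return 0.0
    return len(hallucinated) / len(mentioned)

# Caption mentions "dog", "frisbee", "car"; the image contains only
# "dog", "frisbee", "person" -> 1 of 3 mentions is hallucinated.
score = chair_i(["dog", "frisbee", "car"], ["dog", "frisbee", "person"])
```

The real metric additionally aggregates over a caption set and relies on synonym matching against annotated objects; this sketch omits those details.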
Hallucination benchmark quality measurement framework
We select six representative, publicly available hallucination benchmarks (e.g., MMHal-Bench, GAVIE).
The framework follows practices from psychological (psychometric) testing.
The same model receives markedly different scores across different benchmarks.
From the perspective of test-retest reliability, closed-ended benchmarks show obvious shortcomings.
Existing free-form VQA benchmarks exhibit limitations in both reliability and validity.
Conclusion
Introduced a quality measurement framework for hallucination benchmarks
Thoughts
It seems like they brought insights from psychological reliability testing into AI evaluation.