A Note on the Reliability of Japanese Question Answering Evaluation
Abstract
This paper compares several existing QA evaluation metrics in terms of reliability and usefulness, using the NTCIR-4 QAC2 Japanese QA tasks and our adaptations of the Buckley/Voorhees and Voorhees/Buckley reliability measurement methods. Our main findings are: (1) The fraction of questions with a correct answer within the top 5 (NQcorrect5) and the fraction with a correct answer at Rank 1 (NQcorrect1) are not as stable as Reciprocal Rank based on ranked lists containing up to five answers. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is as reliable and useful as Reciprocal Rank, provided that a mild gain value assignment is used. Emphasising answer correctness levels tends to hurt stability, while handling multiple correct answers improves it.
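As a rough illustration of the rank-based metrics named in the abstract, the following minimal sketch computes Reciprocal Rank and the NQcorrect fractions from per-question judgment lists. The function names, the boolean-list input format, and the sample judgments are assumptions made for this example; they are not taken from the paper, and the paper's exact handling of ties or unanswered questions may differ.

```python
from typing import Sequence


def reciprocal_rank(correct: Sequence[bool], max_rank: int = 5) -> float:
    """Return 1/rank of the first correct answer within the top max_rank, else 0."""
    for rank, is_correct in enumerate(correct[:max_rank], start=1):
        if is_correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[bool]], max_rank: int = 5) -> float:
    """Average the reciprocal rank over all questions."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, max_rank) for r in runs) / len(runs)


def nq_correct(runs: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of questions with at least one correct answer in the top k."""
    if not runs:
        return 0.0
    hits = sum(1 for r in runs if any(r[:k]))
    return hits / len(runs)


# Hypothetical judgments: one list per question, True marks a correct answer.
judgments = [
    [False, True, False],                  # first correct answer at rank 2
    [True],                                # correct at rank 1
    [False, False, False, False, False],   # no correct answer
    [False, False, False, False, True],    # first correct answer at rank 5
]

print("NQcorrect5:", nq_correct(judgments, 5))
print("NQcorrect1:", nq_correct(judgments, 1))
print("MRR:", mean_reciprocal_rank(judgments))
```

In this sketch NQcorrect5 and NQcorrect1 only record whether a correct answer appears within the cutoff, whereas Reciprocal Rank also reflects where in the top five it appears.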
- Paper published by the Information Processing Society of Japan (IPSJ)
- 2004-11-26
Authors
Related papers
- High-Precision Search via Question Abstraction for Japanese Question Answering
- A Note on the Reliability of Japanese Question Answering Evaluation
- A Further Note on Evaluation Metrics for the Task of Finding One Highly Relevant Document (Information Retrieval and Classification; Theme: "Utilization (Application) of Digital Archives" and General)
- Controlling the Penalty on Late Arrival of Relevant Documents in Information Retrieval Evaluation with Graded Relevance