A Note on the Reliability of Japanese Question Answering Evaluation

概要

論文の詳細を見る
This paper compares some existing QA evaluation metrics from the viewpoint of reliability and usefulness, using the NTCIR-4 QAC2 Japanese QA tasks and our adaptations of Buckley/Voorhees and Voorhees/Buckley reliability measurement methods. Our main findings are : (1) The fraction of questions with a correct answer within Top 5 (NQcorrect5) and that with a correct answer at Rank 1 (NQcorrectl) are not as stable as Reciprocal Rank based on ranked lists containing up to five answers. (2) Q-measure, which can handle multiple correct answers and answer correctness levels, is as reliable and useful as Reciprocal Rank, provided that a mild gain value assignment is used. Emphasising answer correctness levels tends to hurt stability, while handling multiple correct answers improves it.
一般社団法人情報処理学会の論文
2004-11-26