On the Robustness of Information Retrieval Metrics to Biased Relevance Assessments
Abstract
Information Retrieval (IR) test collections are growing larger, and relevance data constructed through *pooling* are suspected of becoming more and more *incomplete* and *biased*. Several studies have used IR evaluation metrics specifically designed to handle this problem, but most of them have only examined the metrics under *incomplete but unbiased* conditions, using random samples of the original relevance data. This paper examines nine metrics in more realistic settings, by reducing the number of pooled systems and the number of pooled documents. Even though previous studies have shown that metrics based on a *condensed list*, obtained by removing all unjudged documents from the original ranked list, are effective for handling very incomplete but unbiased relevance data, we show that these results do not hold when the relevance data are biased towards particular systems or towards the top of the pools. More specifically, we show that the condensed-list versions of *Average Precision*, *Q-measure* and *normalised Discounted Cumulative Gain*, which we denote as AP', Q' and nDCG', are not necessarily superior to the original metrics for handling biases. Nevertheless, AP' and Q' *are* generally superior to *bpref*, *Rank-Biased Precision* and its condensed-list version even in the presence of biases.
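To make the condensed-list idea concrete, here is a minimal sketch (not from the paper) of how a metric such as AP' is obtained: unjudged documents are removed from the ranked list before scoring, and ordinary Average Precision is then computed on the result. The helper names `condense` and `average_precision` are hypothetical, introduced purely for illustration.

```python
def condense(ranking, judged):
    """Remove all unjudged documents, keeping the original ranking order."""
    return [doc for doc in ranking if doc in judged]

def average_precision(ranking, relevant):
    """Standard binary Average Precision over a ranked list."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

# Toy example: d3 and d5 were never judged (e.g. they fell outside the pool).
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d2", "d4"}          # judged relevant
nonrelevant = {"d1"}             # judged nonrelevant
judged = relevant | nonrelevant  # d3 and d5 are unjudged

ap = average_precision(ranking, relevant)                          # original AP
ap_prime = average_precision(condense(ranking, judged), relevant)  # AP'
print(f"AP = {ap:.3f}, AP' = {ap_prime:.3f}")  # AP = 0.500, AP' = 0.583
```

As the example shows, condensing pulls judged-relevant documents up the list, so AP' rewards a system whenever its unjudged documents are skipped; this is harmless under unbiased incompleteness but, as the paper argues, can mislead when the judgments are biased towards particular systems or pool depths.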
- A paper of the Information Processing Society of Japan (IPSJ)
- 2009-04-15