The Reusability of a Diversified Search Test Collection
Abstract
Traditional ad hoc IR test collections were built using a relatively large pool depth (e.g. 100), and are usually assumed to be reusable. Moreover, when they are reused to compare a new system with another or with systems that contributed to the pools ("contributors"), an even larger measurement depth (e.g. 1,000) is often used for computing evaluation measures. In contrast, the web diversity test collections that have been created in the past few years at TREC and NTCIR use a much smaller pool depth (e.g. 20). The measurement depth is also small (e.g. 10-30), as search result diversification is primarily intended for the first result page. In this study, we examine the reusability of a typical web diversity test collection, namely, one from the NTCIR-9 INTENT-1 Chinese Document Ranking task, which used a pool depth of 20 and official measurement depths of 10, 20 and 30. First, we conduct additional relevance assessments to expand the official INTENT-1 collection, achieving a pool depth of 40. Using the expanded relevance assessments, we show that run rankings at the measurement depth of 30 are too unreliable, given the pool depth of 20. Second, we conduct a leave-one-out experiment for every participating team of the INTENT-1 Chinese task, to examine how (un)fairly new runs are evaluated with the INTENT-1 collection. We show that, for the purpose of comparing new systems with the contributors of the test collection being used, condensed-list versions of existing diversity evaluation measures are more reliable than the raw measures. However, even the condensed-list measures may be unreliable if the new systems are not competitive with the contributors.
- Published: 2012-07-25
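The abstract refers to condensed-list versions of evaluation measures and to leave-one-out experiments. As a rough, hypothetical sketch (not the paper's code, and using plain precision@k in place of a diversity measure such as D#-nDCG), the two ideas are: a condensed list drops unjudged documents from a run before scoring, and a leave-one-out test re-scores a team's run after the judgments that exist only because of that team's pool contributions are removed. All function and variable names below are invented for illustration.

```python
# Hypothetical sketch of two ideas from the abstract (not the paper's code):
#  1. condensed-list scoring: drop unjudged documents before computing a measure;
#  2. leave-one-out: treat one team's unique pool contributions as unjudged,
#     then compare raw vs. condensed scores for that team's run.
# Plain precision@k stands in for a diversity measure here.

def condense(run, judged):
    """Keep only judged documents, preserving rank order."""
    return [d for d in run if d in judged]

def precision_at_k(run, relevant, k):
    """Fraction of the top-k ranks occupied by judged-relevant documents."""
    return sum(1 for d in run[:k] if d in relevant) / k

# Toy data: d4 is relevant but was contributed to the pool only by team A.
judged = {"d1", "d2", "d3", "d4", "d9"}
relevant = {"d1", "d3", "d4"}
team_a_unique = {"d4"}                    # documents only team A contributed

run_a = ["d1", "d4", "d2", "d3", "d9"]    # team A's run

# Leave team A out: its unique contributions become unjudged.
judged_loo = judged - team_a_unique
relevant_loo = relevant - team_a_unique

k = 3
raw = precision_at_k(run_a, relevant_loo, k)                       # d4 counts as non-relevant
condensed = precision_at_k(condense(run_a, judged_loo), relevant_loo, k)

print(f"raw P@{k} = {raw:.2f}, condensed P@{k} = {condensed:.2f}")
# raw P@3 = 0.33 (the now-unjudged d4 hurts the run),
# condensed P@3 = 0.67 (d4 is skipped rather than penalized)
```

On this toy example, condensing stops the left-out team's run from being penalized for a document that no longer has a judgment, which mirrors the abstract's point that condensed-list measures are more reliable for evaluating non-contributing runs, while its caveat about uncompetitive new systems still applies.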
Authors
Related Papers
- Japanese Hyponymy Extraction based on a Term Similarity Graph
- Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
- The Reusability of a Diversified Search Test Collection
- One Click One Revisited: Enhancing Evaluation based on Information Units
- Web Search Evaluation with Informational and Navigational Intents (Preprint)
- A Preview of the NTCIR-10 INTENT-2 Results
- How Intuitive Are Diversified Search Metrics? Concordance Test Results for the Diversity U-measures