Japanese-English Cross Language Information Retrieval based on Comparable Corpora and Bilingual Dictionary

概要

論文の詳細を見る
This paper proposes a method to translate query terms for cross-language information retrieval (CLIR). CLIR is generally performed by query translation and information retrieval (IR). CLIR is less precise than IR because of query term translation ambiguities, especially in Japanese and English CLIR. We developed Double MAXimize criteria based on comparable corpora (DMAX), which is an equivalent translation selection method for machine translation (MT), by using term co-occurrence frequency in comparable corpora. Though a term should be translated into one word for MT, a query term should be translated into several appropriate terms for CLIR. This paper describes a generalized query term selection model, the GDMAX for CLIR. In this model, a source query is represented in the vector form of the term co-occurrence frequency in source corpora. Translation queries are searched by vector similarity calculation between a source query and a target query represented by the co-occurrence frequency in comparable target corpora. GDMAX was evaluated by using TREC6 (Text Retrieval Conference) English data and 15 Japanese queries. GDMAX queries had approximately 62% accuracy of human queries, and 6% higher accuracy than machine translation queries and 12% higher accuracy than bilingual dictionary-based aueries.
言語処理学会の論文