An EM-Based Approach for Mining Word Senses from Corpora (Natural Language Processing)
Abstract
Manually collecting contexts of a target word and grouping them by meaning yields a set of word senses, but the task is quite tedious. Towards automated lexicography, this paper proposes a word-sense discrimination method based on two modern techniques: the EM algorithm and principal component analysis (PCA). A spherical Gaussian EM algorithm, enhanced with PCA for robust initialization, is proposed to cluster the senses of a target word automatically. Three variants of the algorithm, namely PCA, sGEM, and PCA-sGEM, are investigated using a gold-standard dataset of two polysemous words. The clustering results are evaluated using purity and entropy as well as a more recent measure, normalized mutual information (NMI). The experimental results indicate that the proposed algorithms perform promisingly in discriminating word senses and that PCA-sGEM outperforms the other two methods to some extent.
- 2007-04-01
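The three evaluation measures named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulas, so the definitions here are assumptions: entropy is normalized by the log of the number of gold senses, and NMI is normalized by the geometric mean of the two label entropies. The function names `purity`, `entropy`, and `nmi` are hypothetical, chosen only for this illustration.

```python
from collections import Counter
from math import log, sqrt


def purity(clusters, senses):
    """Fraction of items that belong to the majority sense of their cluster."""
    joint = Counter(zip(clusters, senses))
    best = {}
    for (c, _s), count in joint.items():
        best[c] = max(best.get(c, 0), count)
    return sum(best.values()) / len(clusters)


def entropy(clusters, senses):
    """Size-weighted average of per-cluster sense entropy.

    Assumption: each cluster entropy is divided by log(q), where q is the
    number of gold senses, so the score lies in [0, 1] (lower is better).
    """
    n = len(clusters)
    q = len(set(senses))
    if q < 2:
        return 0.0
    joint = Counter(zip(clusters, senses))
    sizes = Counter(clusters)
    total = 0.0
    for (c, _s), count in joint.items():
        p = count / sizes[c]
        total -= (sizes[c] / n) * p * log(p) / log(q)
    return total


def nmi(clusters, senses):
    """Mutual information divided by sqrt(H(clusters) * H(senses)).

    Assumption: geometric-mean normalization; other variants exist.
    """
    n = len(clusters)
    joint = Counter(zip(clusters, senses))
    pc, ps = Counter(clusters), Counter(senses)
    mi = sum((count / n) * log(count * n / (pc[c] * ps[s]))
             for (c, s), count in joint.items())
    hc = -sum((v / n) * log(v / n) for v in pc.values())
    hs = -sum((v / n) * log(v / n) for v in ps.values())
    if hc == 0.0 or hs == 0.0:
        return 0.0
    return mi / sqrt(hc * hs)
```

Under these definitions, a clustering that matches the gold senses exactly scores purity 1, entropy 0, and NMI 1, while a clustering that is independent of the gold senses scores NMI 0.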
Authors
- Sornlertlamvanich Virach (NICT Asia Research Center)
- Charoenporn Thatsanee (Sirindhorn International Institute of Technology, Thammasat University)
- Kruengkrai Canasai (NICT Asia Research Center)
- Theeramunkong Thanaruk (Sirindhorn International Institute of Technology, Thammasat University)
- Kruengkrai Canasai (National Inst. Information And Communications Technol. Kyoto‐fu Jpn)
- Theeramunkong Thanaruk (Thammasat Univ. Tha)
Related papers
- An EM-Based Approach for Mining Word Senses from Corpora (Natural Language Processing)
- Statistical-Based Approach to Non-segmented Language Processing (Knowledge, Information and Creativity Support System)
- Improving Search Performance: A Lesson Learned from Evaluating Search Engines Using Thai Queries (Knowledge, Information and Creativity Support System)
- Construction of Thai Lexicon from Existing Dictionaries and Texts on the Web (Natural Language Processing)
- Extracting Chemical Reactions from Thai Text for Semantics-Based Information Retrieval
- Extracting Semantic Frames from Thai Medical-Symptom Unstructured Text with Unknown Target-Phrase Boundaries
- Fast Algorithms for Mining Generalized Frequent Patterns of Generalized Association Rules (Databases)