An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
Abstract
This paper proposes a fast and simple unsupervised word segmentation algorithm that utilizes the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy as a means of constraining the hypothesis space, in order to efficiently obtain a solution that minimizes the length of a two-part MDL code. An evaluation with corpora in Japanese, Thai, English, and the "CHILDES" corpus for research in language development reveals that the algorithm achieves an F-score comparable to that of state-of-the-art methods in unsupervised word segmentation, in significantly reduced computational time. In view of its capability to induce the vocabulary of large-scale corpora of domain-specific text, the method has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers. A semi-supervised word segmentation approach is also proposed, in which the word boundaries obtained through the unsupervised model are used as features for a state-of-the-art word segmentation method.
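The two quantities named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names (`branching_entropy`, `code_length`, `description_length`), the fixed context length `n`, and the choice to cost both the lexicon and the corpus with a simple unigram code are assumptions made here for illustration only.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, n):
    """Entropy (in bits) of the character that follows each n-character context.

    A high value means the next character is hard to predict, which is the
    usual cue for a possible word boundary after that context.
    """
    followers = defaultdict(Counter)
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1

    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def code_length(symbols):
    """Bits needed to encode a sequence under its own unigram distribution."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


def description_length(tokens, separator="\n"):
    """Two-part code length: cost of the lexicon plus cost of the corpus.

    Assumption of this sketch: the lexicon is spelled out character by
    character (with a separator after each word), and the corpus is encoded
    as a sequence of word tokens.
    """
    lexicon_chars = [ch for word in sorted(set(tokens)) for ch in word + separator]
    return code_length(lexicon_chars) + code_length(tokens)


if __name__ == "__main__":
    # After "a" the next character is either "b" or "c": one bit of uncertainty.
    print(branching_entropy("abac", 1))
    # A segmentation that reuses one longer word is cheaper than one that
    # splits it into single characters.
    print(description_length(["ab", "ab"]))
    print(description_length(["a", "b", "a", "b"]))
```

Under these assumptions, a candidate segmentation with a lower `description_length` would be preferred, and positions where `branching_entropy` is high would be the ones considered as boundary candidates.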
Authors
-
Okumura Manabu
Precision and Intelligence Laboratory, Tokyo Institute of Technology
-
Takamura Hiroya
Precision and Intelligence Laboratory, Tokyo Institute of Technology
-
Zhikov Valentin
Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology
Related papers
- Active Learning with Partially Annotated Sequence
- Semi-Supervised Learning to Classify Evaluative Expressions from Labeled and Unlabeled Texts(Knowledge, Information and Creativity Support System)
- Collecting Object-attribute Noun Pairs and Constructing Concept Graphs for the Argument of Adjectives from Japanese N1-Adj-N2 Constructions
- On SemEval-2010 Japanese WSD task ([Japanese Word Sense Disambiguation, Centered on the SemEval-2 Japanese Task])
- Active Learning with Subsequence Sampling Strategy for Sequence Labeling Tasks
- Query Snowball: A Co-occurrence-based Approach to Multi-document Summarization for Question Answering
- An Efficient Algorithm for Unsupervised Word Segmentation with Branching Entropy and MDL
- On SemEval-2010 Japanese WSD Task