Pruning-Based Unsupervised Segmentation for Korean (Natural Language Processing)
Abstract
Compound noun segmentation is a key component of Korean language processing. Supervised approaches require some form of human intervention, such as maintaining lexicons, manually segmenting corpora, or devising heuristic rules. Thus, they suffer from the unknown word problem and cannot distinguish domain-oriented or corpus-directed segmentation results from the others. These problems can be overcome by unsupervised approaches that employ segmentation clues obtained purely from a raw corpus. However, most unsupervised approaches require tuning of empirical parameters or learning of a statistical dictionary. To develop a tuning-free, learning-free unsupervised segmentation algorithm, this study proposes a pruning-based unsupervised technique that eliminates unhelpful segmentation candidates. In addition, unlike previous unsupervised methods, which have relied on purely character-based segmentation clues, this study utilizes word-based segmentation clues. Experimental evaluations show that the pruning scheme is very effective for unsupervised segmentation of Korean compound nouns, and that the use of word-based prior knowledge yields better segmentation accuracy. This study also shows that the proposed algorithm performs competitively with, or better than, other unsupervised methods.
- Paper published by the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2006-10-01
Authors
-
Kang In-su
Department of Computer Science and Engineering, Electrical and Computer Engineering Division, POSTECH
-
Lee Jong-hyeok
Knowledge and Language Engineering Lab., POSTECH
-
Na Seung-hoon
Knowledge and Language Engineering Lab., POSTECH