Reconstructing the Language Family Tree from Multilingual Corpus Based on Probabilistic Language Modeling

Abstract

This paper proposes a new method for automatically clustering languages. The basic idea is to build a probabilistic model for each language from the given linguistic data, and then to compute the distances between languages using a distance measure defined on the language models. Clustering is performed based on this distance measure. The paper instantiates this idea for the N-gram language model. The effectiveness of the proposed method was confirmed by evaluation experiments using multilingual texts in nineteen different languages from the ECI Corpus (European Corpus Initiative Multilingual Corpus). The results were very encouraging: the resulting clusters were very close to the family tree of languages established in linguistics.
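The pipeline described in the abstract (one model per language, a distance between models, then clustering) can be illustrated with a minimal sketch. The abstract does not specify the paper's distance measure, smoothing, or clustering algorithm, so everything below is an assumption made for illustration: character bigrams with unsmoothed relative frequencies, Jensen-Shannon divergence as a stand-in distance, and average-linkage agglomerative clustering. The function names `ngram_model`, `js_distance`, and `cluster` are hypothetical, and the input strings are toy data, not ECI Corpus text.

```python
import math
from collections import Counter
from itertools import combinations


def ngram_model(text, n=2):
    """Relative-frequency distribution over the character n-grams of text."""
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


def js_distance(p, q):
    """Jensen-Shannon divergence (base 2) between two n-gram distributions.

    Assumption: this is a stand-in; the paper defines its own measure.
    The value is 0 for identical models and 1 for models with no shared n-gram.
    """
    div = 0.0
    for gram in set(p) | set(q):
        pg, qg = p.get(gram, 0.0), q.get(gram, 0.0)
        m = (pg + qg) / 2
        if pg > 0:
            div += 0.5 * pg * math.log2(pg / m)
        if qg > 0:
            div += 0.5 * qg * math.log2(qg / m)
    return div


def cluster(models):
    """Average-linkage agglomerative clustering over named models.

    Takes a non-empty dict mapping a name to its model and returns
    a nested tuple of names (a binary tree).
    """
    dist = {
        frozenset((a, b)): js_distance(models[a], models[b])
        for a, b in combinations(models, 2)
    }
    # Each cluster is a pair: (subtree, list of leaf names under it).
    clusters = [(name, [name]) for name in models]

    def linkage(pair):
        (_, leaves_a), (_, leaves_b) = pair
        total = sum(dist[frozenset((x, y))] for x in leaves_a for y in leaves_b)
        return total / (len(leaves_a) * len(leaves_b))

    while len(clusters) > 1:
        a, b = min(combinations(clusters, 2), key=linkage)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


# Toy usage: "a" and "b" share most bigrams, "c" shares none with either.
toy_texts = {"a": "abababab", "b": "abababac", "c": "xyzxyzxyz"}
toy_models = {name: ngram_model(text) for name, text in toy_texts.items()}
tree = cluster(toy_models)
```

In this toy run the two texts with overlapping bigrams are merged first, and the unrelated text joins last. With real corpora, some form of smoothing and a larger N would likely be needed; those choices are left open here because the abstract does not state them.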
Paper of the Association for Natural Language Processing (言語処理学会)