Identifying the Coding System and Language of On-line Documents Using Statistical Language Models

Abstract

This paper proposes a new algorithm that simultaneously identifies the coding system and the language of a code string retrieved from the Internet, especially the World-Wide Web. The algorithm uses statistical language models both to select the correctly decoded string and to determine its language. It covers 43 combinations of 15 languages and 11 coding systems used in East Asia and Western Europe. Experimental results show that the algorithm achieves an accuracy of over 95% on 929 on-line documents.
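
The idea described in the abstract can be illustrated with a minimal sketch: decode the byte string under each candidate coding system, score each decoded string with the language model of the paired language, and keep the best-scoring pair. This is not the paper's implementation. The smoothed character unigram model, the toy training sentences, the three candidate pairs, and the names `TRAINING_TEXT`, `CANDIDATES`, `VOCAB_SIZE`, `avg_log_prob`, and `identify` are all assumptions made for illustration; the abstract does not specify the form of the statistical models.

```python
import math
from collections import Counter

# Toy training text per language (hypothetical; a real system needs large corpora).
TRAINING_TEXT = {
    "en": "this is a document written in english. "
          "the quick brown fox jumps over the lazy dog.",
    "ja": "これは日本語で書かれた文書です。"
          "日本語の文字コードにはいくつかの種類があります。",
}

# Candidate (coding system, language) pairs, using Python codec names.
# The paper covers 43 such combinations; three are enough for a sketch.
CANDIDATES = [("shift_jis", "ja"), ("euc_jp", "ja"), ("latin-1", "en")]

VOCAB_SIZE = 65536  # assumed alphabet size for add-one smoothing


def train_unigram(text):
    """Return (character counts, total characters) for a training text."""
    return Counter(text), len(text)


MODELS = {lang: train_unigram(text) for lang, text in TRAINING_TEXT.items()}


def avg_log_prob(model, text):
    """Average per-character log-probability under an add-one smoothed model.

    Averaging makes decodings of different lengths comparable.
    """
    counts, total = model
    log_sum = sum(
        math.log((counts[ch] + 1) / (total + VOCAB_SIZE)) for ch in text
    )
    return log_sum / len(text)


def identify(data):
    """Return (encoding, language, decoded text) for the best candidate pair,
    or None if no candidate decodes the bytes to a non-empty string."""
    best = None
    for encoding, lang in CANDIDATES:
        try:
            text = data.decode(encoding)  # strict: invalid bytes reject the pair
        except UnicodeDecodeError:
            continue
        if not text:
            continue
        score = avg_log_prob(MODELS[lang], text)
        if best is None or score > best[0]:
            best = (score, encoding, lang, text)
    return None if best is None else best[1:]
```

For example, `identify("日本語の文書です。".encode("euc_jp"))` selects the `euc_jp` / `ja` pair: decoding the same bytes as Latin-1 or Shift_JIS either fails or yields characters the corresponding model has never seen, so those candidates score lower. A wrongly decoded string looks improbable to every language model, which is why one scoring step can settle both the coding system and the language.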
Paper of the Information Processing Society of Japan (一般社団法人情報処理学会)
1997-12-15