An Accurate Morphological Analysis and Proper Name Identification for Japanese Text Processing

Abstract

This paper describes a Japanese preprocessor used for syntactic and semantic parsing. It consists of three major components: (1) a morphological analyzer called MAJESTY (Morphological Analyzer for Japanese Text Analysis), (2) a proper name identification and grouping program, and (3) a format conversion program that produces input for Tomita's generalized LR parser. To let the parser run efficiently, the original morphological analyzer was modified to disambiguate its output when it found multiple possible segmentations and parts of speech, and to pack the remaining ambiguous segments locally in the output. The grouping program identifies sequences of segments that together form one concept, as is often the case with proper names, and combines them so that the parser receives a meaningful set of segments. The grouped segments are finally converted into a Lisp-readable format and fed into the parser. Tested on financial news articles, the preprocessor segmented text and tagged parts of speech with greater than 98% accuracy. Company names were identified with over 80% recall and over 80% precision, and person and place names were recognized with over 90% accuracy. The preprocessor has been integrated into the SHOGUN and TEXTRACT information extraction systems, which process texts in the TIPSTER domains of corporate joint ventures and microelectronics.
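
The grouping and format conversion steps described above can be illustrated with a minimal sketch. This is not MAJESTY's or the paper's actual algorithm: the `Segment` type, the tag names (`proper-noun`, `company-name`), the suffix list, and the output layout are all assumptions made for illustration. It merges a run of adjacent proper-noun segments, optionally followed by a company suffix, into one segment, then prints the result as a Lisp-readable list.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    surface: str  # the text of the segment
    pos: str      # part-of-speech tag (hypothetical tag set)


# Hypothetical suffixes that mark a company name; not taken from the paper.
COMPANY_SUFFIXES = {"株式会社", "銀行"}


def group_proper_names(segments):
    """Merge adjacent proper-noun segments (plus an optional company
    suffix) into a single segment; leave all other segments unchanged."""
    grouped = []
    i = 0
    while i < len(segments):
        if segments[i].pos != "proper-noun":
            grouped.append(segments[i])
            i += 1
            continue
        # Collect the full run of consecutive proper nouns.
        j = i
        while j < len(segments) and segments[j].pos == "proper-noun":
            j += 1
        parts = segments[i:j]
        pos = "proper-noun"
        # Absorb a trailing company suffix, if present.
        if j < len(segments) and segments[j].surface in COMPANY_SUFFIXES:
            parts.append(segments[j])
            j += 1
            pos = "company-name"
        grouped.append(Segment("".join(p.surface for p in parts), pos))
        i = j
    return grouped


def to_sexpr(segments):
    """Render segments as one Lisp-readable list of (surface tag) pairs."""
    def quote(text):
        # Escape backslashes and double quotes for a Lisp string literal.
        return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

    return "(" + " ".join(f"({quote(s.surface)} {s.pos})" for s in segments) + ")"


if __name__ == "__main__":
    tokens = [
        Segment("東京", "proper-noun"),
        Segment("銀行", "noun"),
        Segment("は", "particle"),
    ]
    print(to_sexpr(group_proper_names(tokens)))
```

Under these assumptions, the three input segments collapse to two, with 東京銀行 treated as a single company-name unit before it reaches the parser. A real system would need dictionary lookups and contextual rules rather than a fixed suffix list.
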
Paper published by the Information Processing Society of Japan (一般社団法人情報処理学会)
1994-03-15