Automatic Extraction of Japanese Probabilistic Context Free Grammar From a Bracketed Corpus

Abstract

In this paper, we describe a method for extracting a probabilistic context-free grammar (PCFG) of Japanese from a bracketed corpus. To extract grammar rules, we assign appropriate non-terminal symbols to the intermediate nodes of the bracketed trees, taking the heads of phrases into account. We estimate the rule probabilities from their frequencies of occurrence. We also propose several improvements to the extracted grammar. The size of the grammar is reduced by removing redundant rules. The number of parse trees is reduced (1) by allowing only a right-linear binary-branching tree for a constituent that consists of items of the same POS, (2) by subcategorizing the POSs "symbol" ("KIGOU") and "postposition" ("JOSI"), and (3) by assigning a consistent structure to constructs representing clausal modality. Finally, we conducted an experiment to evaluate the proposed methods. From about 180,000 sentences, 2,219 grammar rules were extracted. When we analyzed 20,000 test sentences with the extracted grammar, 92% of them were accepted, showing that the grammar has broad coverage. For the 30 most probable parse trees, we obtained 62% bracket recall, 74% bracket precision, and 29% sentence accuracy.
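As an illustration of the frequency-based estimation mentioned in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the trees already carry non-terminal labels (the head-based label assignment is not reproduced here), and it uses the common relative-frequency estimate: the count of a rule divided by the total count of rules with the same left-hand side. The labels `S`, `PP`, `VP`, `N`, `P`, `V`, the toy trees, and the function names are all hypothetical.

```python
from collections import Counter

# A tree node is (label, children); a leaf is (pos_tag, word) with word a str.
# Labels and sentences below are hypothetical toy data, not from the corpus.
TOY_TREES = [
    ("S", [("PP", [("N", "Taro"), ("P", "ga")]),
           ("VP", [("V", "hashiru")])]),
    ("S", [("PP", [("N", "Hanako"), ("P", "ga")]),
           ("VP", [("PP", [("N", "hon"), ("P", "wo")]),
                   ("V", "yomu")])]),
]


def count_rules(tree, counts):
    """Count one rule (lhs, rhs) per internal node, recursing into children."""
    label, children = tree
    if isinstance(children, str):
        # Leaf (POS tag over a word): lexical rules are skipped in this
        # sketch, which is an assumption rather than the paper's choice.
        return
    rhs = tuple(child[0] for child in children)
    counts[(label, rhs)] += 1
    for child in children:
        count_rules(child, counts)


def estimate_pcfg(trees):
    """Return {(lhs, rhs): probability} using relative frequencies."""
    counts = Counter()
    for tree in trees:
        count_rules(tree, counts)
    lhs_totals = Counter()
    for (lhs, _rhs), c in counts.items():
        lhs_totals[lhs] += c
    return {rule: c / lhs_totals[rule[0]] for rule, c in counts.items()}


if __name__ == "__main__":
    for (lhs, rhs), p in sorted(estimate_pcfg(TOY_TREES).items()):
        print(f"{lhs} -> {' '.join(rhs)}  {p:.2f}")
```

With this toy data, the two `VP` expansions each occur once and so each receives probability 0.5, while the probabilities of all rules sharing a left-hand side sum to 1.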
Papers of the Association for Natural Language Processing (言語処理学会)