Training Set Selection for Building Compact and Efficient Language Models
Abstract
Training a statistical language model requires corpora that match the target domain. In practice, however, training corpora often contain both sentences that match the target domain and sentences that do not. In such cases, training set selection can both reduce model size and improve model performance. This paper describes a training set selection method for statistical language model training. The method offers two advantages: it improves language model performance, and it reduces the computational load of the language model. It has four steps: 1) sentence clustering is applied to all available corpora; 2) a language model is trained on each cluster; 3) the perplexity of each language model on the development set is calculated; 4) the clusters whose language models yield low perplexities are used to train the final language model. Experimental results indicate that a language model trained on the data selected by this method gives lower perplexity on an open test set than a language model trained on all available corpora.
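The four steps can be sketched in a few lines of Python. This is only an illustration under several assumptions that do not come from the abstract: the clustering is a simple stand-in (each sentence is assigned to the most similar of the first `k` sentences by Jaccard word overlap), the language models are add-one-smoothed unigram models, and "low perplexity" is read as "the `keep` best-ranked clusters". All function names and parameters (`cluster_sentences`, `train_unigram`, `perplexity`, `select_training_set`, `k`, `keep`) are hypothetical.

```python
import math
from collections import Counter


def cluster_sentences(sentences, k):
    """Step 1 (stand-in clustering): assign each sentence to the most
    similar of the first k sentences, measured by Jaccard word overlap."""
    seeds = [set(s.split()) for s in sentences[:k]]
    clusters = [[] for _ in seeds]
    for s in sentences:
        words = set(s.split())
        # `or 1` avoids division by zero when both word sets are empty.
        sims = [len(words & seed) / (len(words | seed) or 1) for seed in seeds]
        # Ties go to the lowest cluster index.
        clusters[sims.index(max(sims))].append(s)
    return clusters


def train_unigram(sentences):
    """Step 2: a unigram model is just a table of word counts."""
    return Counter(w for s in sentences for w in s.split())


def perplexity(counts, vocab_size, sentences):
    """Step 3: perplexity of an add-one-smoothed unigram model.
    Assumes `sentences` contains at least one word."""
    total = sum(counts.values())
    log_prob, n_words = 0.0, 0
    for s in sentences:
        for w in s.split():
            # Unseen words have count 0 and share the single unknown-word slot.
            log_prob += math.log((counts[w] + 1) / (total + vocab_size))
            n_words += 1
    return math.exp(-log_prob / n_words)


def select_training_set(sentences, dev, k=2, keep=1):
    """Step 4: keep the sentences of the `keep` clusters whose models
    give the lowest perplexity on the development set."""
    # Vocabulary of all training data, plus one slot for unknown words.
    vocab_size = len({w for s in sentences for w in s.split()}) + 1
    clusters = [c for c in cluster_sentences(sentences, k) if c]
    ranked = sorted(
        clusters,
        key=lambda c: perplexity(train_unigram(c), vocab_size, dev),
    )
    return [s for c in ranked[:keep] for s in c]


corpus = [
    "book a hotel room",
    "buy stock shares now",
    "book a room tonight",
    "sell stock shares today",
]
dev_set = ["book a room"]
print(select_training_set(corpus, dev_set, k=2, keep=1))
```

With this toy corpus, the travel-related cluster has the lower development-set perplexity, so only its two sentences are selected. A real system would use higher-order n-gram models and a proper clustering algorithm, and the number of retained clusters would be tuned rather than fixed.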
- Paper of the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2009-03-01
Authors
- SUMITA Eiichiro, ATR Spoken Language Translation Research Laboratories
- Yasuda Keiji, ATR Spoken Language Translation Research Laboratories
- Yamamoto Hirofumi, ATR Spoken Language Translation Res. Lab., Kyoto-fu, Jpn
- Sumita Eiichiro, ATR Spoken Language Communication Research Laboratories
Related papers
- A Reordering Model Using a Source-Side Parse-Tree for Statistical Machine Translation
- Splitting Input for Machine Translation Using N-gram Language Model Together with Utterance Similarity (Natural Language Processing)
- An Objective Method for Evaluating Speech Translation System: Using a Second Language Learner's Corpus (Speech Corpora and Related Topics, Corpus-Based Speech Technologies)
- Imposing Constraints from the Source Tree on ITG Constraints for SMT
- Introducing a Translation Dictionary into Phrase-Based SMT
- Proposal of an Evaluation Set Selection Method for a Corpus-Based Speech Translation Technology
- Training Set Selection for Building Compact and Efficient Language Models
- Bilingual Cluster Based Models for Statistical Machine Translation
- Statistical Language Model Adaptation with Additional Text Generated by Machine Translation
- A trainable method for pronominal anaphora resolution using shallow information
- Multiple Translation-Engine-based Hypotheses and Edit-Distance-based Rescoring for a Greedy Decoder for Statistical Machine Translation (Natural-Language Processing)
- Multiple Translation-Engine-based Hypotheses and Edit-Distance-based Rescoring for a Greedy Decoder for Statistical Machine Translation