A Bayesian Model of Transliteration and Its Human Evaluation When Integrated into a Machine Translation System
Abstract
The contribution of this paper is two-fold. First, we conduct a large-scale real-world evaluation of the effectiveness of integrating an automatic transliteration system with a machine translation system. A human evaluation is usually preferable to an automatic evaluation, and especially so here, since the common machine translation evaluation methods are affected by the length of the translations they are evaluating and are often biased by a translation's length rather than by the information it conveys. We evaluate our transliteration system on data collected in field experiments conducted all over Japan. Our results conclusively show that using a transliteration system can improve machine translation quality when translating unknown words. Our second contribution is to propose a novel Bayesian model for unsupervised bilingual character sequence segmentation of corpora for transliteration. The system is based on a Dirichlet process model trained using Bayesian inference through blocked Gibbs sampling, implemented using an efficient forward filtering/backward sampling dynamic programming algorithm. The Bayesian approach is able to overcome the overfitting problem inherent in maximum likelihood training. We demonstrate the effectiveness of our Bayesian segmentation by using it to build a translation model for a phrase-based statistical machine translation (SMT) system trained to perform transliteration by monotonic transduction from character sequence to character sequence. The Bayesian segmentation was used to construct a phrase-table, and we compared the quality of this phrase-table to one generated in the usual manner by the state-of-the-art GIZA++ word alignment process used in combination with phrase extraction heuristics from the MOSES statistical machine translation system, by using both to perform transliteration generation within an identical framework.
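To make the forward filtering/backward sampling step more concrete, here is a minimal sketch that samples one monotonic co-segmentation of a single source/target pair. It is not the authors' implementation: `toy_seg_prob` is a hypothetical stand-in for the Dirichlet process model's predictive probability, the `max_len` limit and the restriction to non-empty segments on both sides are assumptions, and the surrounding blocked Gibbs loop (removing a pair's counts, resampling, adding the counts back) is omitted.

```python
import random


def toy_seg_prob(src_seg, tgt_seg):
    """Hypothetical segment-pair probability; favours short segments.

    A real model would return the predictive probability of the pair
    under the current Dirichlet process state.
    """
    return 0.1 ** (len(src_seg) + len(tgt_seg))


def forward_filter(src, tgt, seg_prob, max_len=3):
    """alpha[i][j] = total probability of all co-segmentations of
    src[:i] and tgt[:j] into non-empty segment pairs."""
    n, m = len(src), len(tgt)
    alpha = [[0.0] * (m + 1) for _ in range(n + 1)]
    alpha[0][0] = 1.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            total = 0.0
            for k in range(1, min(max_len, i) + 1):
                for l in range(1, min(max_len, j) + 1):
                    prev = alpha[i - k][j - l]
                    if prev > 0.0:
                        total += prev * seg_prob(src[i - k:i], tgt[j - l:j])
            alpha[i][j] = total
    return alpha


def backward_sample(src, tgt, alpha, seg_prob, rng, max_len=3):
    """Walk back from the end of both strings, sampling each final
    segment pair in proportion to its share of alpha."""
    i, j = len(src), len(tgt)
    if alpha[i][j] == 0.0:
        raise ValueError("no segmentation exists within max_len")
    segments = []
    while i > 0 and j > 0:
        choices, weights = [], []
        for k in range(1, min(max_len, i) + 1):
            for l in range(1, min(max_len, j) + 1):
                w = alpha[i - k][j - l] * seg_prob(src[i - k:i], tgt[j - l:j])
                if w > 0.0:
                    choices.append((k, l))
                    weights.append(w)
        k, l = rng.choices(choices, weights=weights)[0]
        segments.append((src[i - k:i], tgt[j - l:j]))
        i, j = i - k, j - l
    segments.reverse()
    return segments


if __name__ == "__main__":
    src, tgt = "kyoto", "キョウト"
    alpha = forward_filter(src, tgt, toy_seg_prob)
    print(backward_sample(src, tgt, alpha, toy_seg_prob, random.Random(0)))
```

Whatever sample is drawn, concatenating the source sides of the segments gives back the source string, and likewise for the target, which is what makes the sampled segmentation usable as a source of phrase-table entries.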
In our experiments on English-Japanese data from the NEWS2010 transliteration generation shared task, we used our technique to bilingually co-segment the training corpus. We then derived a phrase-table from the segmentation given by the sample at the final iteration of the training procedure, and used it as a direct substitute for the phrase-table extracted with GIZA++/MOSES. The phrase-table resulting from our Bayesian segmentation model was approximately 30% smaller than that produced by the SMT system's training procedure, and gave an increase in transliteration quality measured in terms of both word accuracy and F-score.
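The abstract does not define its two metrics, so the following sketch rests on assumptions: word accuracy is taken as the fraction of top-1 outputs that exactly match the reference, and F-score as the mean per-word harmonic mean of precision and recall computed from the longest common subsequence (LCS) between output and reference, a definition commonly used in transliteration evaluation. The function names are illustrative, and only a single reference per word is handled.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def word_accuracy(candidates, references):
    """Fraction of candidates that match their reference exactly."""
    correct = sum(c == r for c, r in zip(candidates, references))
    return correct / len(references)


def mean_f_score(candidates, references):
    """Mean LCS-based F-score (assumed definition, single reference)."""
    total = 0.0
    for cand, ref in zip(candidates, references):
        lcs = lcs_length(cand, ref)
        if lcs == 0:
            continue  # no overlap contributes an F-score of zero
        precision = lcs / len(cand)
        recall = lcs / len(ref)
        total += 2 * precision * recall / (precision + recall)
    return total / len(references)


if __name__ == "__main__":
    cands = ["abc", "abd"]
    refs = ["abc", "abc"]
    print(word_accuracy(cands, refs), mean_f_score(cands, refs))
```

Under these definitions F-score gives partial credit to a near-miss that word accuracy counts as wrong, which is why the two measures are usually reported together.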
- Paper of the Institute of Electronics, Information and Communication Engineers (IEICE)
- 2011-10-01
Authors
-
FINCH Andrew
NICT
-
SUMITA Eiichiro
NICT
-
NAKAMURA Satoshi
NICT
-
Nakamura Satoshi
National Institute of Information and Communications Technology (NICT), Kyoto, Japan
-
Yasuda Keiji
NICT
-
OKUMA Hideo
NICT
Related papers
- Class-Dependent Modeling for Dialog Translation
- CENSREC-1-C : An evaluation framework for voice activity detection under noisy environments
- Using Mutual Information Criterion to Design an Efficient Phoneme Set for Chinese Speech Recognition
- A Non-stationary Noise Suppression Method Based on Particle Filtering and Polyak Averaging(Speech Recognition, Statistical Modeling for Speech Processing)
- Translation of Untranslatable Words-Integration of Lexical Approximation and Phrase-Table Extension Techniques into Statistical Machine Translation
- Learning, Generation and Recognition of Motions by Reference-Point-Dependent Probabilistic Models
- Prosody reconstruction by rescaling fundamental frequency contours in order to synthesize communicative speech (Speech) -- (International Workshop "Asian workshop on speech science and technology")
- Multiple Translation-Engine-based Hypotheses and Edit-Distance-based Rescoring for a Greedy Decoder for Statistical Machine Translation(Natural-Language Processing)
- Ambient Browser: Web Browser for Daily Use (1st Korea-Japan Joint Workshop on Ubiquitous Computing and Networking Systems (ubiCNS 2005))
- A Bayesian Model of Transliteration and Its Human Evaluation When Integrated into a Machine Translation System
- CENSREC-4: An evaluation framework for distant-talking speech recognition in reverberant environments
- Situated Spoken Dialogue with Robots Using Active Learning