Paraphrasing Training Data for Statistical Machine Translation
Abstract
Large amounts of data are essential for training statistical machine translation systems. In this paper we show how training data can be expanded by paraphrasing one side of a parallel corpus. The new data is produced by parsing and then generating with an open-source, precise HPSG-based grammar. This yields sentences with the same meaning but with minor variations in lexical choice and word order. In experiments paraphrasing the English side of the Tanaka Corpus, a freely available Japanese-English parallel corpus, we show consistent, statistically significant gains on training sets ranging from 10,000 to 147,000 sentence pairs, as evaluated by the BLEU and METEOR automatic evaluation metrics.
Authors
-
Nichols Eric
Nara Institute of Science and Technology
-
Bond Francis
Nanyang Technological University
-
Appling D.
Georgia Institute of Technology
-
Matsumoto Yuji
Nara Institute of Science and Technology
Related papers
- Paraphrasing Training Data for Statistical Machine Translation
- Opinion mining from web documents: extraction and structurization (Special Issue of Papers: Data Mining and Statistical Mathematics)
- Document Clustering: Before and After the Singular Value Decomposition
- A Method for Syntactic Behavior Analysis
- Effects of Structural Matching and Paraphrasing in Question Answering (Special Issue on Text Processing for Information Access)
- Information Extraction from MEDLINE abstracts of clinical trials (Medical Data Mining)
- Information Extraction from MEDLINE abstracts of clinical trials (Medical Data Mining) (Joint Workshop of Vietnamese Society of AI, SIGKBS-JSAI, ICS-IPSJ, and IEICE-SIGAI on Active Mining)
- A Generalization of Forward-backward Algorithm
- Opinion Mining from Web Documents: Extraction and Structurization