High Quality Speech Synthesis System Based on Waveform Concatenation of Phoneme Segment (Special Section on Speech Synthesis: Current Technologies and Equipment)
Abstract
A new system for speech synthesis by concatenating waveforms selected from a dictionary is described. The dictionary is constructed from two hours of speech, comprising isolated words and sentences uttered by one male speaker, and contains over 45,000 entries, which are identified by their average pitch, a dynamic pitch parameter that represents the micro pitch structure within a segment, duration, and average amplitude. Phoneme duration is set according to phoneme environment, and phoneme power is controlled by both pitch frequency and phoneme environment. Tests show that the average errors in vowel duration and consonant duration are 28.8 ms and 16.8 ms, respectively, and that the average error in vowel power is 2.9 dB. The pitch frequency patterns are calculated according to a conventional model in which the accent component is added to a gross phrase component. Given a phoneme string and prosody information, the optimum waveforms are selected from the dictionary by matching their attributes with the given phonetic and prosodic information. A waveform selection function is proposed; it has two terms, corresponding to the prosodic and phonological coincidence between the rule-set values and the waveform values from the dictionary. The weight coefficients used in the selection function are determined through subjective hearing tests. The selected waveform segments are then modified in the waveform domain to further adjust for the desired prosody. A pitch frequency modification method based on the pitch synchronous overlap-add technique is introduced into the system. Lastly, the waveforms are interpolated between voiced waveforms to avoid abrupt changes in voice spectrum and waveform shape. An absolute evaluation test on a five-grade scale is performed on the synthesized voice; the mean score is 3.1, which is above "good," and the original speaker quality is retained.
- Paper of the Institute of Electronics, Information and Communication Engineers (IEICE)
- 1993-11-25
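The abstract describes a waveform selection function with two weighted terms, one for prosodic coincidence and one for phonological coincidence. The following is a minimal sketch of that idea only, not the authors' formula: the `Segment` fields mirror the attributes named in the abstract, but the distance definitions, the normalization constants, the context representation (left and right neighboring phoneme), and the default weights are all assumptions made for illustration. In the paper the weights are determined through subjective hearing tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """A dictionary entry or a rule-set target (hypothetical layout)."""
    phoneme: str
    left_context: str      # assumed: preceding phoneme
    right_context: str     # assumed: following phoneme
    avg_pitch_hz: float
    dynamic_pitch: float   # stands in for the micro pitch structure parameter
    duration_ms: float
    avg_amplitude_db: float


def prosody_cost(seg: Segment, tgt: Segment) -> float:
    """Assumed prosodic mismatch: sum of scaled absolute differences."""
    return (
        abs(seg.avg_pitch_hz - tgt.avg_pitch_hz) / tgt.avg_pitch_hz
        + abs(seg.dynamic_pitch - tgt.dynamic_pitch)
        + abs(seg.duration_ms - tgt.duration_ms) / tgt.duration_ms
        + abs(seg.avg_amplitude_db - tgt.avg_amplitude_db) / 10.0  # arbitrary scale
    )


def context_cost(seg: Segment, tgt: Segment) -> float:
    """Assumed phonological mismatch: number of differing neighbors (0 to 2)."""
    return float(seg.left_context != tgt.left_context) + float(
        seg.right_context != tgt.right_context
    )


def select_segment(dictionary, tgt, w_prosody=1.0, w_context=1.0):
    """Return the entry for tgt.phoneme that minimizes the weighted two-term cost."""
    candidates = [s for s in dictionary if s.phoneme == tgt.phoneme]
    if not candidates:
        raise LookupError(f"no dictionary entry for phoneme {tgt.phoneme!r}")
    return min(
        candidates,
        key=lambda s: w_prosody * prosody_cost(s, tgt)
        + w_context * context_cost(s, tgt),
    )
```

Changing the ratio of `w_prosody` to `w_context` shifts the choice between an entry whose context matches and an entry whose prosody is closer to the target, which is the trade-off the weight coefficients control.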
Authors
-
Sato Hirokazu
Speech and Acoustics Laboratory, NTT Human Interface Laboratories
-
Hirokawa Tomohisa
NTT Human Interface Laboratories
-
Itoh Kenzo
NTT Human Interface Laboratories
-
Sato Hirokazu
NTT Intelligent Technology Corporation
-
Hirokawa Tomohisa
Speech and Acoustics Laboratory, NTT Human Interface Laboratories
-
Itoh Kenzo
Speech and Acoustics Laboratory, NTT Human Interface Laboratories
Related papers
- Phoneme Power Control for Speech Synthesis (Special Section on Speech Synthesis: Current Technologies and Equipment)