Formulation of Coarticulation in the Formant Frequency Domain and Its Application to Automatic Speech Recognition

Abstract

Reliable automatic recognition of phonemes in connected speech requires effective means of coping with the variations in their acoustic characteristics caused by speaker idiosyncrasy and coarticulation. This paper describes a new scheme for the segmentation and recognition of connected vowels and semivowels, based on a speaker-adaptive model of the coarticulatory process. Coarticulation between adjoining phonemes in connected vowels can be modeled in the domain of formant frequencies by a smoothing system that converts the stepwise-varying target values corresponding to the successive vowels into the actual formant trajectory (Fig. 1). The characteristics of a critically damped second-order linear system are generally valid for this smoothing system, as shown by the example of the word /ie/ (Fig. 2), but further elaboration, taking the continuity and coupling of resonance modes into consideration, is required for combinations of front and back vowels, as shown by the example of the word /ai/ (Fig. 3). As input, the proposed scheme (Fig. 4) uses the trajectories of the first three formant frequencies, extracted pitch-synchronously from the short-term frequency spectra of speech and then converted by interpolation to sample values at uniform intervals. Since initial vowels can be recognized with high accuracy by established techniques for the recognition of sustained vowels, their formant frequencies can be used to estimate the target values of the other vowels of the same speaker. The estimation is based on the average relationships found among the formant frequencies of all five vowels across many speakers, and through this estimation the coarticulatory model can be adapted to an arbitrary speaker.
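The smoothing model described above (stepwise targets passed through a critically damped second-order linear system) can be illustrated with a minimal sketch. This is not the paper's implementation. It assumes the standard unit step response of the transfer function ω²/(s + ω)², namely 1 − (1 + ωt)e^(−ωt), and superposes one such response per target change. The function names, the rate constant `omega`, and the formant values in the usage note are hypothetical and chosen only for illustration.

```python
import math


def step_response(t, omega):
    """Unit step response of a critically damped second-order system.

    Assumed form: 1 - (1 + omega*t) * exp(-omega*t) for t > 0, else 0.
    """
    if t <= 0.0:
        return 0.0
    return 1.0 - (1.0 + omega * t) * math.exp(-omega * t)


def formant_trajectory(targets, onsets, omega, times):
    """Smooth a stepwise target sequence into a formant trajectory.

    targets: target formant values in Hz, one per vowel (hypothetical input)
    onsets:  command onset times in seconds for the 2nd, 3rd, ... targets
    omega:   rate constant of the smoothing system in rad/s (assumed)
    times:   sample times in seconds
    """
    if len(onsets) != len(targets) - 1:
        raise ValueError("need exactly one onset per target change")
    trajectory = []
    for t in times:
        value = targets[0]
        # The system is linear, so each target change contributes its own
        # scaled and delayed step response.
        for prev, nxt, onset in zip(targets, targets[1:], onsets):
            value += (nxt - prev) * step_response(t - onset, omega)
        trajectory.append(value)
    return trajectory
```

For example, `formant_trajectory([2200.0, 1900.0], [0.2], 40.0, times)` produces a second-formant contour that stays at 2200 Hz until 0.2 s and then glides smoothly toward 1900 Hz. These numbers are placeholders, not measurements from the paper.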
The model can then be used to determine the underlying targets from observed formant trajectories by the method of analysis-by-synthesis, thereby accomplishing successive segmentation and recognition of each phoneme in connected vowels. The validity of the scheme was demonstrated by an overall correct recognition rate of 98.7% (Table 1) for a total of 445 utterances consisting of vowel dyads, triads, and quadruplets by three male speakers. The scheme can be extended to the recognition of semivowels. The formant targets of the semivowels /j/ and /w/ were found to be quite close to those of the vowels /i/ and /u/, respectively, but their command durations are significantly different (Fig. 7). Speech rate information, represented by the command duration of the immediately following vowel, is necessary for the accurate separation of /j/, /i/, and /ij/ when the speech rate varies over a wide range (Fig. 8). When the speech rate information is given, the correct recognition rate for these categories is 97.5% for a total of 270 utterances of 15 words containing semivowels, vowels, and vowel-semivowel combinations in the same context.
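The analysis-by-synthesis step mentioned above can be sketched in a simplified form: synthesize a model trajectory for each candidate vowel and command onset, compare it with the observed trajectory, and keep the candidate with the smallest squared error. The sketch below is an assumption-laden illustration, not the paper's procedure. It handles only a vowel dyad whose first vowel is already known, uses two formants rather than three, applies one shared rate constant to both formants, and relies on a hypothetical target table `TARGETS` with made-up values.

```python
import math

# Hypothetical (F1, F2) targets in Hz for one speaker; illustrative values only.
TARGETS = {
    "i": (300.0, 2300.0),
    "e": (500.0, 1900.0),
    "a": (750.0, 1200.0),
    "o": (500.0, 850.0),
    "u": (350.0, 1300.0),
}


def step_response(t, omega):
    """Unit step response of a critically damped second-order system."""
    if t <= 0.0:
        return 0.0
    return 1.0 - (1.0 + omega * t) * math.exp(-omega * t)


def synthesize(first, second, onset, omega, times):
    """Model (F1, F2) frames for a dyad with one target step at `onset`."""
    start, end = TARGETS[first], TARGETS[second]
    frames = []
    for t in times:
        s = step_response(t - onset, omega)
        frames.append(tuple(a + (b - a) * s for a, b in zip(start, end)))
    return frames


def squared_error(model, observed):
    """Sum of squared formant differences over all frames."""
    return sum(
        (m - o) ** 2
        for model_frame, obs_frame in zip(model, observed)
        for m, o in zip(model_frame, obs_frame)
    )


def recognize_second_vowel(first, observed, times, omega, onset_grid):
    """Exhaustive search over the second vowel and its command onset."""
    best = None
    for vowel in TARGETS:
        for onset in onset_grid:
            model = synthesize(first, vowel, onset, omega, times)
            err = squared_error(model, observed)
            if best is None or err < best[0]:
                best = (err, vowel, onset)
    return best[1], best[2]
```

Calling `recognize_second_vowel("a", observed, times, 40.0, onset_grid)` returns the best-matching vowel label together with the onset of its command, which serves as the segment boundary. A grid search is used here only because it is easy to follow; the original work may use a different search strategy and error measure.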
1978-03-01
