Estimation of Vocal Tract Shape by Deconvolution and an Adaptive Speech Analysis System

Abstract

Disregarding the nasal tract, the vocal organ in speech production is regarded as a tube running from the lungs to the lips (see Fig. 1). On the assumption that the most significant loss occurs at the glottis, the total vocal tract loss is represented by a lossless, infinitely long, uniform acoustic tube below the glottis. The speech production process in the vocal tract can then be simulated by Kelly's ladder-form circuit, shown in Fig. 2 (a). Itakura (1971) and Wakita (1972) showed that the partial autocorrelation coefficients k are extracted by the self-control system shown in Fig. 3 (a). Neglecting the loss near the lips (r_0→-1), Fig. 2 (a) can be transformed into the equivalent circuit shown in Fig. 3 (b). Comparing Fig. 3 (a) with Fig. 3 (b), it is clear that the k-parameter extraction process corresponds formally to tracing the speech production process in reverse. To verify this relation, synthesized speech generated from a given vocal tract shape, with an impulse train as the vocal source, was analyzed. By matching the partial autocorrelation coefficients to the reflection coefficients r_i (i=1, 2, …) counted from the lip side, the reflection coefficients are converted to area functions, as shown in Fig. 4. The experiments showed that this method estimates the vocal tract shape exactly, except when the vocal tract resonance is much sharper than in actual speech (that is, when the loss at the glottis is extremely small).

The next problem is how to separate the vocal tract impulse response from the speech wave. Two hypotheses were adopted for the separation. The first is that, since the gross frequency transmission characteristics of the vocal tract are flat, the overall slope and bending of the speech spectrum are due to the glottal wave and the radiation characteristics. The second is that the power spectrum of the glottal wave, including the radiation characteristics, is smooth and has no sharp resonances. Figure 6 shows the proposed inverse model of the vocal cord wave (including radiation characteristics), with unknown parameters ε_i (i=1, …, 5) (Nakajima and Suzuki, 1976). The unknown parameters of this model are estimated from the speech wave as follows. For example, the parameter of the second-order critically damped system corresponding to the inverse of the first stage in Fig. 6 is calculated from the autocorrelation coefficients of the speech wave at lags 1 and 2 (see Eqs. 1–4). When the power spectrum of the sound source and radiation characteristics is expressed by this model, the vocal tract impulse response is extracted by inverse filtering with the estimated vocal cord wave model, and the gross power spectrum is guaranteed to be flat. The pole frequencies and bandwidths are not affected by this operation. The principle of the method is illustrated in Fig. 5. Experimental results for natural speech from an adult man and a child are shown in Figs. 7 and 8, respectively.

Section 5 describes an adaptive speech analysis system that automatically selects a suitable analysis method based on a voiced/unvoiced/plosive decision made on the input speech wave. For voiced sounds, the vocal tract shape is estimated. For unvoiced sounds, the acoustic tube shape equivalent to the power spectrum obtained by LPC analysis is computed. For plosive sounds, a shorter analysis window and frame interval than usual are used. Finally, examples of analysis results are presented (see Fig. 10). The system is shown to be useful for observing speech in both the power spectrum and articulatory domains, and the resulting patterns are useful for automatic speech recognition.
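
The two numerical steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and every function name in it is hypothetical. The first part mirrors the synthetic-speech check: an impulse-excited lossless tube is built from given reflection coefficients, the PARCOR coefficients are recovered with the standard Levinson-Durbin recursion, and they are mapped to relative areas under an assumed sign convention (A[i+1] = A[i](1 - k_i)/(1 + k_i), counted from the lips; other conventions invert the ratio). The second part fits a single critically damped source stage 1/(1 - ε z^-1)^2 from the autocorrelation at lags 0 to 2. Because Eqs. 1–4 of the paper are not reproduced here, the fit uses an assumed criterion (minimum output energy of the inverse filter) rather than the paper's own formula. The tube values and ε = 0.9 are illustrative only.

```python
def step_up(ks):
    """Build A(z) = 1 + a1 z^-1 + ... from reflection coefficients."""
    a = [1.0]
    for k in ks:
        m = len(a)
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
    return a


def impulse_response(a, n_samples):
    """Impulse response of the all-pole filter 1/A(z)."""
    h = []
    for n in range(n_samples):
        acc = 1.0 if n == 0 else 0.0
        for i in range(1, len(a)):
            if n - i >= 0:
                acc -= a[i] * h[n - i]
        h.append(acc)
    return h


def autocorrelation(x, max_lag):
    """Unnormalized autocorrelation of x for lags 0..max_lag."""
    return [sum(x[n] * x[n + lag] for n in range(len(x) - lag))
            for lag in range(max_lag + 1)]


def parcor(r, order):
    """Levinson-Durbin recursion; returns the PARCOR coefficients."""
    a, err, ks = [1.0], r[0], []
    for m in range(1, order + 1):
        k = -sum(a[i] * r[m - i] for i in range(m)) / err
        a = [1.0] + [a[i] + k * a[m - i] for i in range(1, m)] + [k]
        ks.append(k)
        err *= 1.0 - k * k
    return ks


def areas_from_reflection(ks, lip_area=1.0):
    """Assumed convention: A[i+1] = A[i] * (1 - k_i) / (1 + k_i)."""
    areas = [lip_area]
    for k in ks:
        areas.append(areas[-1] * (1.0 - k) / (1.0 + k))
    return areas


def source_inverse_energy(r, eps):
    """Output energy of (1 - eps z^-1)^2 for a signal with autocorrelation r."""
    return (r[0] * (1 + 4 * eps**2 + eps**4)
            - 4 * r[1] * eps * (1 + eps**2)
            + 2 * r[2] * eps**2)


def estimate_eps(r):
    """Assumed criterion (not the paper's Eqs. 1-4): minimize output energy."""
    step = 0.001
    grid = [i * step for i in range(-999, 1000)]  # eps in (-1, 1)
    best = min(grid, key=lambda e: source_inverse_energy(r, e))
    lo, hi = best - step, best + step
    for _ in range(100):  # ternary search inside the best grid cell
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if source_inverse_energy(r, m1) < source_inverse_energy(r, m2):
            hi = m2
        else:
            lo = m1
    return 0.5 * (lo + hi)


# Part 1: hypothetical three-section tube, impulse excitation.
true_k = [0.5, -0.3, 0.2]
h = impulse_response(step_up(true_k), 512)
est_k = parcor(autocorrelation(h, 3), 3)
print("recovered k:", est_k)
print("relative areas from the lips:", areas_from_reflection(est_k))

# Part 2: a single source stage 1/(1 - 0.9 z^-1)^2, excited by an impulse.
g = impulse_response([1.0, -1.8, 0.81], 512)
print("estimated eps:", estimate_eps(autocorrelation(g, 2)))
```

In this idealized setting the tube is excited by a pure impulse, so the recovered coefficients match the ones used for synthesis. With natural speech, the source and radiation characteristics would first have to be removed by inverse filtering, which is the role of the model in Fig. 6.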
Paper published by the Acoustical Society of Japan
1978-03-01
