Nonlinear interactive source-filter model for voiced speech

Koç, Turgay
The linear source-filter model (LSFM) has been the primary model for speech processing since 1960, when G. Fant presented his acoustic theory of speech production. It assumes that the source of voiced speech sounds, the glottal flow, is independent of the filter, the vocal tract. However, acoustic simulations based on physical speech production models show that the filter has significant effects on the source due to the nonlinear coupling between them, especially when the fundamental frequency (F0) of the source harmonics approaches the first formant frequency (F1) of the vocal tract filter. In this thesis, nonlinear interactive source-filter models (ISFMs) are proposed for voiced speech as an alternative to the linear source-filter model. The thesis has two parts. In the first part, a framework for the coupling of the source and the filter is presented. Two interactive system models are then proposed under the assumptions that the glottal flow is a quasi-steady Bernoulli flow and that the acoustics in the vocal tract are linear. In these models, the glottal area, rather than the glottal flow, is used as the source for voiced speech; the relation between the glottal flow, the glottal area, and the vocal tract is determined by the quasi-steady Bernoulli flow equation. It is shown theoretically that the linear source-filter model is an approximation of the nonlinear models. Estimating the ISFM parameters from the speech signal alone is a nonlinear blind deconvolution problem, which is solved by a robust method developed from the acoustical interpretation of the systems. Experimental results show that the ISFMs reproduce the source-filter coupling effects seen in the physical simulations, and that the parameter estimation method always produces stable models that perform better than the LSFM. In addition, a framework for incorporating the source-filter interaction into the classical source-filter model is presented.
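The quasi-steady Bernoulli relation mentioned above can be sketched as follows. This is a minimal illustration rather than the thesis's exact formulation: it assumes a constant transglottal pressure drop ΔP driving a flow u_g(t) ≈ a_g(t)·√(2ΔP/ρ) through an orifice of area a_g(t), with ρ the air density; all numeric values are illustrative assumptions.

```python
import math

RHO = 1.14  # approximate air density in the vocal tract, kg/m^3

def bernoulli_flow(glottal_area, delta_p, rho=RHO):
    """Quasi-steady Bernoulli flow through the glottis.

    glottal_area : sequence of glottal areas a_g(t) in m^2
    delta_p      : transglottal pressure drop in Pa (held constant here)
    Returns the glottal volume velocity u_g(t) in m^3/s.
    """
    # Particle velocity from Bernoulli's equation: v = sqrt(2*dP/rho)
    speed = math.sqrt(2.0 * delta_p / rho)
    # Volume velocity is area times particle velocity
    return [a * speed for a in glottal_area]

# Example: a crude triangular glottal-area pulse (hypothetical values)
area = [0.0, 5e-6, 1e-5, 5e-6, 0.0]          # m^2 over one open phase
flow = bernoulli_flow(area, delta_p=800.0)   # ~800 Pa lung pressure
```

In the full interactive models, ΔP itself depends on the vocal tract pressure, which is what introduces the nonlinear source-filter coupling; the constant-pressure case here corresponds to the noninteractive approximation.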
The Rosenberg source model is extended to an interactive source for voiced speech, and its performance is evaluated on a large speech database. Experiments conducted on the vowels in the database show that the interactive Rosenberg model consistently outperforms its noninteractive version. In the second part of the thesis, the LSFM and the ISFMs are compared in a system identification approach using not only the speech signal but also high-speed endoscopic video (HSV) of the vocal folds. In this case, HSV and speech serve as reference input-output data for the analysis and comparison of the models. First, a new robust HSV processing algorithm is developed and applied to the HSV images to extract the glottal area. Then, the system parameters are estimated using a modified version of the method proposed in the first part. The experimental results show that the speech signal can contain harmonics of the fundamental frequency of the glottal area beyond those present in the glottal area signal itself. The proposed nonlinear interactive source-filter models can generate such harmonic components in speech and produce more realistic speech sounds than the LSFM.
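For reference, the noninteractive Rosenberg source that serves as the baseline is commonly given as a trigonometric pulse: a rising half-cosine during glottal opening, a falling quarter-cosine during closing, and zero during the closed phase. The sketch below illustrates this shape; the parameter values (open quotient, rise/fall split) are illustrative assumptions, not the ones used in the thesis.

```python
import math

def rosenberg_pulse(n_samples, open_quotient=0.6, rise_ratio=0.67):
    """One period of a Rosenberg-style trigonometric glottal pulse.

    n_samples     : samples per pitch period
    open_quotient : fraction of the period the glottis is open (assumed)
    rise_ratio    : fraction of the open phase in the rising segment
    Returns a list of n_samples amplitudes in [0, 1].
    """
    n_open = int(open_quotient * n_samples)
    n_rise = int(rise_ratio * n_open)   # opening (rising) segment length
    n_fall = n_open - n_rise            # closing (falling) segment length
    pulse = []
    for n in range(n_samples):
        if n < n_rise:                  # rising half-cosine, 0 -> 1
            pulse.append(0.5 * (1.0 - math.cos(math.pi * n / n_rise)))
        elif n < n_open:                # falling quarter-cosine, 1 -> 0
            pulse.append(math.cos(math.pi * (n - n_rise) / (2.0 * n_fall)))
        else:                           # closed phase
            pulse.append(0.0)
    return pulse

# e.g. one period at F0 = 100 Hz sampled at 16 kHz
period = rosenberg_pulse(160)
```

In the interactive extension described above, a pulse like this no longer drives the vocal tract directly; instead the source and the tract pressures are coupled through the Bernoulli flow equation, so the radiated waveform deviates from the fixed pulse shape.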