Estimation of Articulatory Trajectories Based on Gaussian Mixture Model (GMM) With Audio-Visual Information Fusion and Dynamic Kalman Smoothing

2011-07-01
ÖZBEK, İbrahim Yücel
Hasegawa-Johnson, Mark
Demirekler, Mübeccel
This paper presents a detailed framework for Gaussian mixture model (GMM)-based articulatory inversion equipped with special postprocessing smoothers and with the capability to perform audio-visual information fusion. The effects of different acoustic features on GMM inversion performance are investigated, and it is shown that integrating various types of acoustic (and visual) features improves the performance of the articulatory inversion process. Dynamic Kalman smoothers are proposed to adapt the cutoff frequency of the smoother to the data and noise characteristics; Kalman smoothers also enable the incorporation of auxiliary information, such as phonetic transcriptions, to improve articulatory estimation. Two types of dynamic Kalman smoothers are introduced: global Kalman (GK) and phoneme-based Kalman (PBK). The GK smoother uses the same dynamic model for all phonemes; it is shown that GK improves articulatory inversion performance more than the conventional low-pass (LP) smoother does. However, the PBK smoother, which uses one dynamic model per phoneme, gives significantly better results than the GK smoother. Different methodologies for fusing the audio and visual information are examined. A novel modified late fusion algorithm, designed to account for the observability degree of the articulators, is shown to give better results than either the early or the late fusion methods. Extensive experimental studies are conducted with the MOCHA database to illustrate the performance gains obtained by the proposed algorithms. The average RMS error and correlation coefficient between the true (measured) and estimated articulatory trajectories are 1.227 mm and 0.868 using audio-visual information fusion and GK smoothing, and 1.199 mm and 0.876 using audio-visual information fusion together with PBK smoothing based on a phonetic transcription of the utterance.
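The GK smoother described in the abstract amounts to a single linear state-space model applied to every phoneme, and such a smoother is conventionally realized as a Kalman forward filtering pass followed by a Rauch-Tung-Striebel (RTS) backward pass. The sketch below is illustrative only: the constant-velocity dynamic model, the noise covariances, and the synthetic sine-wave "articulatory channel" are assumptions for demonstration, not the paper's trained models or MOCHA data.

```python
import numpy as np

def kalman_rts_smooth(y, A, C, Q, R, x0, P0):
    """Kalman forward filter followed by an RTS backward smoothing pass.

    y: (T, m) observations; A, C, Q, R: state-space matrices;
    x0, P0: initial state mean and covariance.
    Returns the smoothed state means, shape (T, n).
    """
    T, n = len(y), A.shape[0]
    xf = np.zeros((T, n)); Pf = np.zeros((T, n, n))   # filtered
    xp = np.zeros((T, n)); Pp = np.zeros((T, n, n))   # predicted
    x, P = x0, P0
    for t in range(T):
        # Predict one step ahead with the dynamic model.
        xp[t] = A @ x
        Pp[t] = A @ P @ A.T + Q
        # Update with the current observation.
        S = C @ Pp[t] @ C.T + R
        K = Pp[t] @ C.T @ np.linalg.inv(S)
        x = xp[t] + K @ (y[t] - C @ xp[t])
        P = (np.eye(n) - K @ C) @ Pp[t]
        xf[t], Pf[t] = x, P
    # Backward (RTS) pass: refine each estimate using future data.
    xs = np.zeros((T, n)); xs[-1] = xf[-1]
    for t in range(T - 2, -1, -1):
        G = Pf[t] @ A.T @ np.linalg.inv(Pp[t + 1])
        xs[t] = xf[t] + G @ (xs[t + 1] - xp[t + 1])
    return xs

# Synthetic stand-in for one noisy estimated articulatory trajectory.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0 * np.pi, 200)
true_traj = np.sin(t)
noisy = true_traj + 0.3 * rng.standard_normal(len(t))

dt = t[1] - t[0]
A = np.array([[1.0, dt], [0.0, 1.0]])   # position + velocity state
C = np.array([[1.0, 0.0]])              # only position is observed
Q = 1e-2 * np.array([[dt**3 / 3, dt**2 / 2],
                     [dt**2 / 2, dt]])  # process noise (assumed scale)
R = np.array([[0.3 ** 2]])              # observation noise variance

xs = kalman_rts_smooth(noisy[:, None], A, C, Q, R,
                       x0=np.zeros(2), P0=np.eye(2))
rms_noisy = np.sqrt(np.mean((noisy - true_traj) ** 2))
rms_smooth = np.sqrt(np.mean((xs[:, 0] - true_traj) ** 2))
```

A PBK smoother would follow the same recursions but switch the `(A, Q)` pair per phoneme segment, as indicated by the phonetic transcription.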
IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING

Suggestions

Dynamic Speech Spectrum Representation and Tracking Variable Number of Vocal Tract Resonance Frequencies With Time-Varying Dirichlet Process Mixture Models
Özkan, Emre; Demirekler, Mübeccel (Institute of Electrical and Electronics Engineers (IEEE), 2009-11-01)
In this paper, we propose a new approach for dynamic speech spectrum representation and tracking vocal tract resonance (VTR) frequencies. The method involves representing the spectral density of the speech signals as a mixture of Gaussians with an unknown number of components, for which a time-varying Dirichlet process mixture (DPM) model is utilized. In the resulting representation, the number of formants is allowed to vary in time. The paper first presents an analysis on the continuity of the formants in the sp...
Prediction of ducted diaphragm noise using a stochastic approach with adapted temporal filters
Karban, Ugur; Schram, Christophe; Sovardi, Carlo; Polifke, Wolfgang (SAGE Publications, 2019-01-01)
The noise production by ducted single- and double-diaphragm configurations is simulated using a stochastic noise generation and radiation numerical method. The importance of correctly modeling the anisotropy and temporal de-correlation is discussed, based on numerical results obtained by large eddy simulation. A new temporal filter is proposed, designed to provide the targeted spectral decay of energy in an Eulerian reference frame. An anisotropy correction is implemented using a non-linear model. The acous...
On Improving Dynamic State Space Approaches to Articulatory Inversion With MAP-Based Parameter Estimation
Özbek Arslan, Işıl; Hasegawa-Johnson, Mark; Demirekler, Mübeccel (Institute of Electrical and Electronics Engineers (IEEE), 2012-01-01)
This paper presents a complete framework for articulatory inversion based on jump Markov linear systems (JMLS). In the model, the acoustic measurements and the position of each articulator are considered the observable measurement and the continuous-valued hidden state of the system, respectively, and the discrete regimes of the system are represented by a discrete-valued hidden modal state. Articulatory inversion based on JMLS involves learning the model parameter set of the system and making inference a...
Ultrathick and high-aspect-ratio nickel microgyroscope using EFAB multilayer additive electroforming
Alper, Said Emre; Ocak, Ilker Ender; Akın, Tayfun (Institute of Electrical and Electronics Engineers (IEEE), 2007-10-01)
This paper presents a new approach for the development of a microgyroscope that has a 240-μm-thick multilayer electroformed-nickel structural mass and a lateral aspect ratio greater than 100. The gyroscope is fabricated using the commercial multilayer additive electroforming process EFAB of Microfabrica, Inc., which allows defining the thickness of different structural regions, such as suspensions, proof mass, and capacitive electrodes, unlike many classical surface-micromachining technologies that require a...
Bilateral CMUT Cells and Arrays: Equivalent Circuits, Diffraction Constants, and Substrate Impedance
KÖYMEN, Hayrettin; ATALAR, ABDULLAH; Tasdelen, A. Sinan (Institute of Electrical and Electronics Engineers (IEEE), 2017-02-01)
We introduce the large-signal and small-signal equivalent circuit models for a capacitive micromachined ultrasonic transducer (CMUT) cell, which has radiating plates on both sides. We present the diffraction coefficient of baffled and unbaffled CMUT cells. We show that the substrate can be modeled as a very thick radiating plate on one side, which can be readily incorporated in the introduced model. In the limiting case, the reactance of this backing impedance is entirely compliant for substrate materials w...
Citation Formats
İ. Y. ÖZBEK, M. Hasegawa-Johnson, and M. Demirekler, “Estimation of Articulatory Trajectories Based on Gaussian Mixture Model (GMM) With Audio-Visual Information Fusion and Dynamic Kalman Smoothing,” IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, pp. 1180–1195, 2011, Accessed: 00, 2020. [Online]. Available: https://hdl.handle.net/11511/58050.