Bimodal automatic speech segmentation based on audio and visual information fusion

Akdemir, Eren
Çiloğlu, Tolga
Bimodal automatic speech segmentation, which uses visual information together with audio data, is introduced. The accuracy of automatic segmentation directly affects the quality of speech processing systems that use the segmented database. Combining audio and visual data yields a lower average absolute boundary error between manual and automatic segmentation results. The information from the two modalities is fused at the feature level and used in an HMM-based speech segmentation system. A Turkish audiovisual speech database was prepared and used in the experiments. The average absolute boundary error decreases by up to 18% with different audiovisual feature vectors. The benefits of incorporating visual information are discussed for different phoneme boundary types; each audiovisual feature vector performs differently at different types of phoneme boundaries. Selecting audiovisual feature vectors according to boundary class decreases the average absolute boundary error by approximately 25%. Visual data were collected with an ordinary webcam, which makes the proposed method convenient to use in practice.
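The two ideas in the abstract, feature-level fusion and the average absolute boundary error metric, can be illustrated with a minimal sketch. The feature dimensions, frame alignment, and example boundary times below are assumptions for illustration, not details from the paper:

```python
import numpy as np

def fuse_features(audio_feats, visual_feats):
    """Feature-level fusion: concatenate per-frame audio and visual vectors.

    audio_feats: (n_frames, d_audio), e.g. MFCCs
    visual_feats: (n_frames, d_visual), e.g. lip-region parameters
    Assumes both streams are already aligned to a common frame rate.
    """
    return np.hstack([audio_feats, visual_feats])

def average_absolute_boundary_error(manual, automatic):
    """Mean absolute difference between matched phoneme boundary times."""
    manual = np.asarray(manual, dtype=float)
    automatic = np.asarray(automatic, dtype=float)
    return np.mean(np.abs(manual - automatic))

# Toy example: 3 frames of 13-dim audio features fused with 4-dim visual features
audio = np.zeros((3, 13))
visual = np.ones((3, 4))
fused = fuse_features(audio, visual)
print(fused.shape)  # (3, 17)

# Hypothetical boundary times in milliseconds
print(average_absolute_boundary_error([100, 250, 400], [110, 240, 405]))
```

The fused vectors would then feed a standard HMM front end in place of audio-only features; segmentation boundaries fall out of forced alignment.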


Wireless speech recognition using fixed point mixed excitation linear prediction (MELP) vocoder
Acar, D; Karci, MH; Ilk, HG; Demirekler, Mübeccel (2002-07-19)
A bit-stream-based front end for a wireless speech recognition system that operates on a fixed point mixed excitation linear prediction (MELP) vocoder is presented in this paper. Speaker-dependent, isolated-word recognition accuracies are obtained from conventional and bit-stream-based front-end systems, and their statistical significance is discussed. Feature parameters are extracted from original (wireline) and decoded speech (conventional) and from the quantized spectral information (bit stream) of t...
Multilingual Video Indexing and Retrieval Employing an Information Extraction Tool for Turkish News Texts: A Case Study
Kucuk, Dilek; Yazıcı, Adnan (2011-10-28)
In this paper, a multilingual video indexing and retrieval system is proposed which relies on an information extraction tool, a hybrid named entity recognizer, for Turkish to determine the semantic annotations for the considered videos. The system is executed on a set of news videos in English and encompasses several other components including an automatic speech recognition system for English, an English-to-Turkish machine translation system, a news video database, and a semantic video retrieval interface....
Multimodal query-level fusion for efficient multimedia information retrieval
Sattari, Saeid; Yazıcı, Adnan (2018-10-01)
Managing a large volume of multimedia data containing various modalities such as visual, audio, and text reveals the necessity for efficient methods for modeling, processing, storing, and retrieving complex data. In this paper, we propose a fusion-based approach at the query level to improve query retrieval performance of multimedia data. We discuss various flexible query types including the combination of content as well as concept-based queries that provide users with the ability to efficiently perform mu...
MAGiC: A Multimodal Framework for Analysing Gaze in Dyadic Communication
Aydin, Ulku Arslan; Kalkan, Sinan; Acartürk, Cengiz (2018-01-01)
The analysis of dynamic scenes has been a challenging domain in eye tracking research. This study presents a framework, named MAGiC, for analyzing gaze contact and gaze aversion in face-to-face communication. MAGiC provides an environment that is able to detect and track the conversation partner's face automatically, overlay gaze data on top of the face video, and incorporate speech by means of speech-act annotation. Specifically, MAGiC integrates eye tracking data for gaze, audio data for speech segmentati...
On lexicon creation for turkish LVCSR
Hacıoğlu, Kadri; Pellom, Bryan; Çiloğlu, Tolga; Öztürk, Özlem; Kurimo, Mikko; Creutz, Mathias (2003-09-14)
In this paper, we address the lexicon design problem in Turkish large vocabulary speech recognition. Although we focus only on Turkish, the methods described here are general enough that they can be considered for other agglutinative languages such as Finnish and Korean. In an agglutinative language, several words can be created from a single root word using a rich collection of morphological rules, so a virtually infinite lexicon is required to cover the language if words are used as the basic units. T...
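The lexicon-growth problem described above is why subword (morph) units help: morphs are shared across many surface forms, so the unit inventory grows far more slowly than the word inventory. A toy sketch with hand-segmented Turkish forms of "ev" (house) illustrates this; the actual paper derives such units automatically (e.g. with unsupervised morphological segmentation), and the segmentations below are illustrative assumptions:

```python
# Hand-picked Turkish word forms built from the root "ev" (house)
morph_segmentations = {
    "ev":          ["ev"],
    "evde":        ["ev", "de"],
    "evim":        ["ev", "im"],
    "evler":       ["ev", "ler"],
    "evlerde":     ["ev", "ler", "de"],
    "evlerim":     ["ev", "ler", "im"],
    "evlerimde":   ["ev", "ler", "im", "de"],
    "evlerimizde": ["ev", "ler", "imiz", "de"],
}

word_lexicon = set(morph_segmentations)  # one entry per surface form
morph_lexicon = {m for segs in morph_segmentations.values() for m in segs}

print(len(word_lexicon))   # 8 surface forms ...
print(len(morph_lexicon))  # ... covered by only 5 morphs
```

With full paradigms the gap widens dramatically: a handful of morphs per root covers hundreds of inflected forms, which is what makes morph-based lexicons tractable for agglutinative LVCSR.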
Citation Formats
E. Akdemir and T. Çiloğlu, “Bimodal automatic speech segmentation based on audio and visual information fusion,” SPEECH COMMUNICATION, pp. 889–902, 2011.